Show HN: Running BitNet b1.58 in DRAM by breaking DDR4 timing rules
I have been working on running BitNet b1.58 inside DRAM by intentionally breaking DDR4 timing rules. I also made a visual explainer: <a href="https://pcdeni.github.io/CaSA/explainer/" rel="nofollow">https://pcdeni.github.io/CaSA/explainer/</a>
This is tested and works in commercial off-the-shelf memory, driven by a custom memory controller on an FPGA. The underlying effect is well characterized in academic papers (CMU SAFARI, SiMRA, DRAM Bender, etc.). In the process of getting this to work I also made a previously undocumented discovery about DDR behaviour: <a href="https://pcdeni.github.io/CaSA/explainer/xor-spread.html" rel="nofollow">https://pcdeni.github.io/CaSA/explainer/xor-spread.html</a>
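For readers new to this literature, here is a rough software model of the kind of effect those papers describe: when several rows are activated almost simultaneously, the shared bitlines tend to settle to the bitwise majority of the stored values. Whether this project relies on exactly this three-row variant is an assumption, and the names `maj3`, `in_dram_and`, `in_dram_or` and the 16-bit `ROW_BITS` are hypothetical, chosen only for illustration. It is a sketch of the logic, not of the FPGA controller.

```python
ROW_BITS = 16                      # toy row width; real DRAM rows are thousands of bits wide
ALL_ONES = (1 << ROW_BITS) - 1     # a row pre-filled with 1s
ALL_ZEROS = 0                      # a row pre-filled with 0s


def maj3(a, b, c):
    """Bitwise majority of three rows, each stored as a Python int."""
    return (a & b) | (b & c) | (a & c)


def in_dram_and(a, b):
    """AND falls out of majority when the third row is all zeros."""
    return maj3(a, b, ALL_ZEROS)


def in_dram_or(a, b):
    """OR falls out of majority when the third row is all ones."""
    return maj3(a, b, ALL_ONES)


if __name__ == "__main__":
    row_a = 0b1100
    row_b = 0b1010
    print(bin(in_dram_and(row_a, row_b)))
    print(bin(in_dram_or(row_a, row_b)))
```

The point of the model is that one control row turns a single analog majority into either AND or OR across every column of the row at once.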
Overall it is a bit slow, since full rows of data need to be moved even when all that is actually needed is the count of '1' bits (popcount). To make it competitive, changes to the memory die would be needed, though nothing as drastic as merging compute and memory into one piece of silicon. This would then avoid the memory wall issue the industry is currently facing.
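To make the popcount remark concrete, here is a minimal sketch of how a ternary (BitNet b1.58 style) dot product can be reduced to bitwise AND plus popcount. It assumes weights in {-1, 0, +1} stored as two bitmasks and unsigned integer activations split into bit-planes; this encoding and the names `pack_bits`, `popcount` and `ternary_dot` are illustrative assumptions, not necessarily the layout used in the project.

```python
def pack_bits(bits):
    """Pack an iterable of truthy/falsy values into one int (bit i = bits[i])."""
    value = 0
    for i, b in enumerate(bits):
        if b:
            value |= 1 << i
    return value


def popcount(x):
    """Number of '1' bits in a non-negative int."""
    return bin(x).count("1")


def ternary_dot(weights, activations, act_bits=8):
    """Dot product of ternary weights and unsigned activations via AND + popcount."""
    pos = pack_bits(w == 1 for w in weights)    # mask of +1 weights
    neg = pack_bits(w == -1 for w in weights)   # mask of -1 weights
    total = 0
    for k in range(act_bits):
        # Bit-plane k holds bit k of every activation.
        plane = pack_bits((a >> k) & 1 for a in activations)
        # The AND is the part that could happen in the DRAM row;
        # only the two counts are actually needed afterwards.
        partial = popcount(plane & pos) - popcount(plane & neg)
        total += partial * (1 << k)
    return total


if __name__ == "__main__":
    w = [1, -1, 0, 1, -1]
    x = [10, 200, 3, 255, 0]
    print(ternary_dot(w, x))
    print(sum(wi * xi for wi, xi in zip(w, x)))  # plain reference for comparison
```

In this model each `plane & pos` stands in for a full row that has to be read out before it can be counted, even though the only result used is a small integer. That gap is what a popcount on the memory die would close.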