Show HN: Hekate – A Zero-Copy ZK Engine Overcoming the Memory Wall

1 point by y00zzeek, 20 days ago
Most ZK proving systems are optimized for server-grade hardware with massive RAM. When scaling to industrial-sized traces (2^20+ rows), they often hit a "Memory Wall" where allocation and data movement become a larger bottleneck than the actual computation.

I have been developing Hekate, a ZK engine written in Rust that uses a zero-copy streaming model and a hybrid tiled evaluator. To test its limits, I ran a head-to-head benchmark against Binius64 on an Apple M3 Max laptop, using Keccak-256 as the workload.

The results highlight a significant architectural divergence:

At 2^15 rows: Binius64 is faster (147ms vs 202ms), but Hekate is already roughly 10x more memory efficient (44MB vs ~400MB).

At 2^20 rows: Binius64 hits 72GB of RAM usage, entering swap hell on a laptop. Hekate processes the same workload in 4.74s using just 1.4GB of RAM.

At 2^24 rows (16.7M steps): Hekate finishes in 88s with a peak RAM of 21.5GB. Binius64 is unable to complete the task due to OOM/swap on this hardware.

The core difference is "Materialization vs. Streaming". While many engines materialize and copy massive polynomials in RAM during Sumcheck and PCS operations, Hekate streams them through the CPU cache in tiles. This shifts the unit economics of ZK proving from $2.00/hour high-memory cloud instances to $0.10/hour commodity hardware or local edge devices.

I am looking for feedback from the community, especially those working on binary fields, GKR, and memory-constrained SNARK/STARK implementations.
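To make the "Materialization vs. Streaming" distinction concrete, here is a minimal Rust sketch. It is not Hekate's or Binius64's code: `col_a`, `col_b`, `sum_materialized`, `sum_streamed`, and the `TILE` size are all hypothetical names and values chosen for illustration. It assumes a toy reduction (the sum of the row-wise product of two columns), with XOR and AND standing in for addition and multiplication on 64 packed GF(2) elements. The point is only the memory shape: one version allocates full-length vectors, the other reuses two fixed-size tile buffers.

```rust
// Illustrative sketch only; all names here are hypothetical, not a real API.

/// Toy column generators: map a row index to a 64-bit word of packed GF(2) values.
fn col_a(i: u64) -> u64 {
    i.wrapping_mul(0x9E37_79B9_7F4A_7C15)
}

fn col_b(i: u64) -> u64 {
    i.rotate_left(17) ^ 0xA5A5_A5A5_A5A5_A5A5
}

/// Materializing version: builds both columns and the product column in full
/// before reducing, so peak memory grows linearly with the number of rows.
fn sum_materialized(n: u64) -> u64 {
    let a: Vec<u64> = (0..n).map(col_a).collect();
    let b: Vec<u64> = (0..n).map(col_b).collect();
    // AND = packed GF(2) multiplication.
    let prod: Vec<u64> = a.iter().zip(&b).map(|(x, y)| x & y).collect();
    // XOR = packed GF(2) addition.
    prod.iter().fold(0, |acc, v| acc ^ v)
}

/// Assumed tile size: 4096 words = 32 KiB per buffer, small enough for cache.
const TILE: usize = 4096;

/// Streaming version: fills two fixed-size tile buffers, reduces them, and
/// reuses the same buffers for the next tile. Peak memory is independent of n.
fn sum_streamed(n: u64) -> u64 {
    let mut a = [0u64; TILE];
    let mut b = [0u64; TILE];
    let mut acc = 0u64;
    let mut start = 0u64;
    while start < n {
        let len = (n - start).min(TILE as u64) as usize;
        for j in 0..len {
            a[j] = col_a(start + j as u64);
            b[j] = col_b(start + j as u64);
        }
        for j in 0..len {
            acc ^= a[j] & b[j];
        }
        start += len as u64;
    }
    acc
}

fn main() {
    let n = 1u64 << 20;
    let m = sum_materialized(n);
    let s = sum_streamed(n);
    // Both strategies compute the same value; only the memory footprint differs.
    assert_eq!(m, s);
    println!("materialized = {:#x}, streamed = {:#x}", m, s);
}
```

In this toy, the materialized version holds three vectors of 2^20 words (about 24 MiB) at its peak, while the streamed version holds 64 KiB of buffers regardless of row count. A real prover has far more columns and intermediate polynomials, so this shows only the general shape of the trade-off, not the measured numbers above.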