KernelEvolve: Agentic Kernel Coding for Heterogeneous AI Accelerators (Meta)

Author: gangliao · about 1 month ago (original post)
We're sharing KernelEvolve, an agentic system we built at Meta to automatically generate and evolve high-performance kernels across heterogeneous AI accelerators.

The core motivation is that modern AI stacks increasingly depend on hand-optimized kernels (GEMM, attention, reductions, fused ops), but writing and tuning them for each hardware target (NVIDIA GPUs, AMD GPUs, custom accelerators like MTIA) does not scale.

KernelEvolve treats kernel programming as a search + evolution problem:

- An LLM generates candidate kernels (e.g., Triton-like code)
- Kernels are compiled, benchmarked, and validated on real hardware
- Performance feedback is used to evolve better variants over many iterations
- The system scales evaluation across large fleets and multiple accelerator types

Unlike one-shot code generation, KernelEvolve continuously improves kernels using closed-loop, hardware-in-the-loop feedback, and can discover non-obvious optimizations that rival or exceed expert-written code.

In the paper we describe:

- The agent architecture and search space design
- How we scale kernel evaluation efficiently across heterogeneous accelerators
- Case studies showing performance gains over hand-tuned baselines
- Practical lessons from deploying this system in production ML workloads

Paper (arXiv): https://arxiv.org/abs/2512.23236 (66 pages)

LinkedIn: https://www.linkedin.com/posts/gangliao_excited-to-share-our-recent-work-on-kernelevolve-activity-7411781675740897280-AQth?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAzsrfsBRed-BvPAGqq9FgvVZ-v6F-sG4SM

We'd love feedback from folks working on compilers, kernels, ML systems, or agentic approaches to code generation.
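The closed-loop structure described above (propose → compile/validate → benchmark → evolve) can be sketched roughly as follows. This is a minimal conceptual sketch, not the paper's implementation: `propose_variant`, `benchmark`, and `validate` are hypothetical stand-ins (the real system uses an LLM to generate kernel code and real accelerators for hardware-in-the-loop measurement), and the tuning knobs shown are assumed examples.

```python
import random

def propose_variant(parent):
    """Stand-in for the LLM proposer: mutate one tuning knob of a
    kernel configuration (the real system rewrites kernel code)."""
    child = dict(parent)
    knob = random.choice(list(child))
    child[knob] = max(1, child[knob] * random.choice([1, 2]) // random.choice([1, 2]))
    return child

def benchmark(config):
    """Stand-in for hardware-in-the-loop evaluation: returns a latency-like
    cost (lower is better). Toy model preferring block_size=128, num_warps=4."""
    return abs(config["block_size"] - 128) + 10 * abs(config["num_warps"] - 4)

def validate(config):
    """Stand-in for numerical-correctness checks against a reference
    implementation; here we just reject degenerate tilings."""
    return config["block_size"] >= 16

def evolve(seed, iterations=200, rng_seed=0):
    """Closed-loop search: propose, validate, benchmark, keep improvements."""
    random.seed(rng_seed)
    best, best_cost = seed, benchmark(seed)
    for _ in range(iterations):
        cand = propose_variant(best)
        if not validate(cand):       # discard incorrect kernels
            continue
        cost = benchmark(cand)       # performance feedback from "hardware"
        if cost < best_cost:         # evolve: keep only improving variants
            best, best_cost = cand, cost
    return best, best_cost

best, cost = evolve({"block_size": 32, "num_warps": 1})
```

The key property the sketch preserves is that candidates are only kept after passing validation and measuring faster than the incumbent, so the loop can never regress below its starting point; everything else (population size, mutation operators, multi-fleet scheduling) is simplified away.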