KernelEvolve: Agentic Kernel Coding for Heterogeneous AI Accelerators (Meta)
We’re sharing KernelEvolve, an agentic system we built at Meta to automatically generate and evolve high-performance kernels across heterogeneous AI accelerators.

The core motivation is that modern AI stacks increasingly depend on hand-optimized kernels (GEMM, attention, reductions, fused ops), but writing and tuning them for each hardware target (NVIDIA GPUs, AMD GPUs, custom accelerators like MTIA) does not scale.

KernelEvolve treats kernel programming as a search + evolution problem:

• An LLM generates candidate kernels (e.g., Triton-like code)
• Kernels are compiled, benchmarked, and validated on real hardware
• Performance feedback is used to evolve better variants over many iterations
• The system scales evaluation across large fleets and multiple accelerator types

Unlike one-shot code generation, KernelEvolve continuously improves kernels using closed-loop, hardware-in-the-loop feedback, and can discover non-obvious optimizations that rival or exceed expert-written code.

In the paper we describe:

• The agent architecture and search space design
• How we scale kernel evaluation efficiently across heterogeneous accelerators
• Case studies showing performance gains over hand-tuned baselines
• Practical lessons from deploying this system in production ML workloads

Paper (arXiv): https://arxiv.org/abs/2512.23236 (66 pages)

LinkedIn: https://www.linkedin.com/posts/gangliao_excited-to-share-our-recent-work-on-kernelevolve-activity-7411781675740897280-AQth?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAzsrfsBRed-BvPAGqq9FgvVZ-v6F-sG4SM

We’d love feedback from folks working on compilers, kernels, ML systems, or agentic approaches to code generation.
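To make the search + evolution framing concrete, here is a minimal toy sketch of the generate → benchmark → evolve loop described above. Everything here is hypothetical and not Meta's actual API: the real system has an LLM propose Triton-like kernel source and times it on real hardware, whereas this stub "mutates" a single tile-size parameter and uses a stand-in cost function in place of a hardware benchmark.

```python
import random

def benchmark(tile_size: int) -> float:
    """Stand-in for compiling and timing a kernel variant on hardware.

    Lower is better; this toy cost model pretends the device favors
    tile sizes near 128.
    """
    return abs(tile_size - 128) + 1.0

def mutate(tile_size: int, rng: random.Random) -> int:
    """Stand-in for an LLM proposing a new kernel variant from the current best."""
    return max(1, tile_size + rng.choice([-32, -16, 16, 32]))

def evolve(generations: int = 50, population: int = 8, seed: int = 0) -> int:
    """Closed-loop search: generate candidates, measure them, keep the fastest."""
    rng = random.Random(seed)
    best = 16  # initial candidate kernel configuration
    best_latency = benchmark(best)
    for _ in range(generations):
        for cand in (mutate(best, rng) for _ in range(population)):
            latency = benchmark(cand)
            if latency < best_latency:  # performance feedback drives selection
                best, best_latency = cand, latency
    return best
```

The structure, not the stubs, is the point: candidates are always validated against measured performance before being kept, which is what lets the loop find non-obvious configurations rather than trusting a single one-shot generation.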