Launch HN: Expanse (YC P26) – Unlock idle GPU capacity

7 points by ismaeel_bashir, 28 days ago
Hey HN, we're Ismaeel, Eren, Yafet and Nikodem. We built Expanse (https://expanse.sh/) to increase the effective capacity of HPC/GPU clusters running schedulers/orchestrators like Kubernetes and SLURM. We read the source code, the job submission script, and the hardware a workload is about to run on to predict what the job actually needs before the cluster sees it. We also flag failures we think are about to happen and surface line-level optimisations the researcher can apply themselves.

The problem: Datacenters run at roughly 30% to 40% effective utilisation. Users request more resources than they actually need because the risk is asymmetric: over-requesting is expensive and wastes capacity that someone else could have used, but under-requesting kills your job mid-run and you lose days of work. So everyone over-requests by two to three times.

We measured one national-scale HPC cluster for a month: across 122k jobs, 59% of the compute was wasted. At on-demand cloud rates for the same hardware, that's roughly $8.5M of compute wasted in one month on one cluster. The pattern is similar in other large-scale compute industries, such as quant funds, AI labs, and manufacturing.

The four of us ran HPC and GPU training workloads at the largest quant funds and HPC facilities. Ismaeel did research at EPCC (Edinburgh's Parallel Computing Centre, the UK's national HPC site) under Adrian Jackson, where he built the first multimodal HPC resource predictor: a model that ingests job source code, submission scripts, hardware telemetry and cluster metadata to work out how much compute a job will actually need. On a dataset of real workloads from EPCC's own clusters it scored 34% better than any other baseline, and outperformed frontier general-purpose LLMs prompted on the same prediction task by roughly 8x.
These results convinced us the problem was solvable with software.

Expanse installs on every node and hooks into SLURM (or the K8s scheduler). It ingests live hardware telemetry from your cluster (DCGM, CUPTI, cgroups, network/IO monitoring) and builds a custom embedding of how your hardware performs. We scan any workload about to be submitted through SLURM/K8s (plugging into the job life cycle so you don't have to change how you submit things) and feed this into our deep learning models to give researchers accurate resource recommendations, failure detections, and optimisation suggestions at submission time. We fine-tune cluster-specific models that get sharper over time as you run more workloads. Our models are trained to over-provision rather than under-provision, because the outcomes are asymmetric when a job crashes. We also provide uncertainty estimates and p90 values so users can choose their own risk tolerance.

We surface three capabilities to users of the cluster:

(1) Resource prediction at submit time. We predict the GPU VRAM, utilisation, memory, CPUs and walltime the job actually needs, with a confidence interval. From these predictions we also surface failure predictions for OOMs and other memory-related issues, and line-level code optimisations to increase the job's utilisation of the hardware.

(2) Live observability. While the job runs, a dashboard shows the telemetry we are collecting, giving an intuitive view of what's going on in the hardware and where your workload is in terms of code stack profiling. We profile workloads dynamically to keep the overhead in the low single digits while staying informative.

(3) Failure diagnosis. If a workload fails, we take all the data we collected and correlate the stack profiling with the hardware telemetry to surface solution-oriented logs.
These are one- or two-line logs telling you not only what happened when the job failed, but why, and how to fix it with line-level code suggestions.

What's different about our approach: The state of the art for most clusters is one of three things: per-user historical averages from sacct (the SLURM accounting DB), hand-written rules/heuristics, or frontier LLM coding agents. With per-user historical averages from sacct, the estimates become wildly inaccurate once a new type of workload is submitted to the cluster or code-level changes are made. For the LLM baseline, we gave each model the submission script and source code of the workload being run, along with the full capabilities of its coding harness in the cluster, and it performed quite poorly. We benchmarked Expanse against the state of the art at the time (Gemini 3.5 pro, Claude Opus 4.8, GPT 5.5, Codex 5.3) and outperformed them by 8x.

You might be thinking that as these models scale and get better, they could beat us on this task. However, we saw no correlation between model size or iteration and accuracy. Claude Haiku actually performed better than Opus on a lot of workloads, and previous iterations of models had the same, if not slightly better, accuracy. Even coding-specific models such as Codex 5.3 performed poorly (matching the accuracy of GPT 5.5). These models reason in a vacuum: without native support for modal inputs such as source code (to understand the underlying data flow and computational patterns) and hardware telemetry and topology (to understand the performance patterns of the cluster), they cannot accurately predict the resources a workload needs. Additionally, Expanse continuously updates its internal models so that our predictions get more accurate as more workloads run on your cluster, making it well suited to new hardware or changing workload patterns. LLMs are very good at writing code and hyperparameter sweeps, but they need Expanse to complete the full agentic loop for auto research.
It's super easy to plug our tools into these agents; we have made our CLI tools LLM-friendly. For more details on our LLM eval, check out: https://x.com/ismaeel_bashir_/status/2059683849404383283

We're currently onboarding customers as paid pilots. Pricing is determined per cluster. We offer a two-week measurement window where we install, ingest, and report recoverable capacity to datacenter operators, followed by a paid pilot deployment in one department at a fixed monthly fee, renewing at the same rate unless the scope expands.

If you run an HPC/GPU cluster (SLURM or K8s, 100+ GPUs), we'd love to talk. We'll install on a section of your cluster for a week, send a written report of what's recoverable, and you decide whether to keep going. If you've tried something like this and it didn't work, we'd really like to hear why. And if there's a failure mode you'd want predicted that the post doesn't mention, drop it in this thread and we'll write back with whether the model already catches it or what it would take to add.

I never thought I'd be on the other side of Launch HN :). Even if you don't run a cluster, we'd still love to hear from you. Any thoughts on our approach, your experiences running workloads on clusters, or anywhere you think we're wrong: we'd love to hear it.

Tally Ho!
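
A side note on the over-provisioning bias and p90 values mentioned above. One common way to get that behaviour is a quantile (pinball) loss, which penalises under-prediction more heavily than over-prediction. The sketch below is only an illustration under that assumption: the post does not say which objective Expanse actually trains with, and the function names and the sample memory figures are made up for the example. It also shows why a plain per-user historical average tends to under-provision.

```python
import math
from statistics import mean
from typing import Sequence


def pinball_loss(actual: float, requested: float, quantile: float = 0.9) -> float:
    """Quantile (pinball) loss for a single job.

    Under-requesting (actual > requested) costs `quantile` per unit,
    over-requesting costs `1 - quantile` per unit, so at quantile=0.9
    a shortfall is penalised nine times more than the same surplus.
    """
    error = actual - requested
    if error >= 0:
        return quantile * error
    return (quantile - 1.0) * error


def total_loss(history: Sequence[float], requested: float, quantile: float = 0.9) -> float:
    """Sum of pinball losses if every past job had been given `requested`."""
    return sum(pinball_loss(actual, requested, quantile) for actual in history)


def nearest_rank_quantile(history: Sequence[float], quantile: float = 0.9) -> float:
    """Smallest observed value with at least `quantile` of the sample at or below it."""
    if not history:
        raise ValueError("history must not be empty")
    ordered = sorted(history)
    rank = min(len(ordered), max(1, math.ceil(quantile * len(ordered))))
    return ordered[rank - 1]


def loss_minimising_request(history: Sequence[float], quantile: float = 0.9) -> float:
    """Observed value that minimises the total pinball loss over the history."""
    candidates = sorted(set(history))
    return min(candidates, key=lambda c: total_loss(history, c, quantile))


# Hypothetical peak memory (GB) of one user's past jobs; not real cluster data.
PEAK_MEMORY_GB = [10, 11, 11, 12, 12, 13, 13, 14, 15, 18, 30]

if __name__ == "__main__":
    print("mean request:", round(mean(PEAK_MEMORY_GB), 2))
    print("p90 request:", nearest_rank_quantile(PEAK_MEMORY_GB, 0.9))
    print("loss-minimising request:", loss_minimising_request(PEAK_MEMORY_GB, 0.9))
```

With these hypothetical samples, the mean sits below three of the eleven past runs, while the p90 request is exceeded only by the single outlier. That is the trade-off between wasted capacity and crashed jobs described in the post, and choosing a different quantile is one way to express a different risk tolerance.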