A necessary tool? Async LoRA for distributed systems
I've been building something I call Async LoRA to scratch an itch I kept running into: training long jobs on cheap GPUs (Salad, runpod, spot instances, etc.) is a nightmare. One random node dies and hours of training are suddenly gone. Most schedulers just restart the whole container, which doesn't really help. What I've put together so far:
- Aggregator/worker setup where the aggregator hands out small "leases" of work, sized in tokens rather than time slices (rough sketch after this list)
- Async checkpointing, so progress gets saved continuously without pausing training (sketch below)
- Preemption handling: when a worker dies, whatever it already finished still counts, and the remaining work gets reassigned
- Training-aware logic (steps, tokens, loss) instead of treating jobs as black-box containers
- Out-of-the-box hooks for PyTorch/DeepSpeed, so you don't have to glue everything together yourself (sketch below)

My goal is to make sketchy clusters behave more like reliable ones.
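To make the lease idea concrete, here is a minimal, hypothetical sketch of an aggregator that hands out token-sized leases and requeues whatever an expired or partially finished lease didn't cover. The class and method names (`Aggregator`, `Lease`, `grant`, `complete`, `reap_expired`) are illustrative, not Async LoRA's actual API.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class Lease:
    lease_id: str
    start_token: int   # first token offset covered by this lease
    num_tokens: int    # lease size is measured in tokens, not wall-clock time
    deadline: float    # missed heartbeat deadline => assume preemption


class Aggregator:
    def __init__(self, total_tokens: int, tokens_per_lease: int, ttl_s: float = 120.0):
        self.ttl_s = ttl_s
        # Work queue of (start_token, num_tokens) chunks still to be processed.
        self.pending = [(s, min(tokens_per_lease, total_tokens - s))
                        for s in range(0, total_tokens, tokens_per_lease)]
        self.active: dict[str, Lease] = {}
        self.done_tokens = 0

    def grant(self) -> Lease | None:
        """Hand the next small chunk of work to a worker that asks for one."""
        if not self.pending:
            return None
        start, num = self.pending.pop(0)
        lease = Lease(uuid.uuid4().hex, start, num, time.time() + self.ttl_s)
        self.active[lease.lease_id] = lease
        return lease

    def complete(self, lease_id: str, tokens_processed: int) -> None:
        """Whatever the worker finished still counts; requeue only the remainder."""
        lease = self.active.pop(lease_id)
        self.done_tokens += tokens_processed
        remaining = lease.num_tokens - tokens_processed
        if remaining > 0:
            self.pending.insert(0, (lease.start_token + tokens_processed, remaining))

    def reap_expired(self) -> None:
        """Treat a missed deadline as a dead/preempted worker and reassign its lease."""
        now = time.time()
        expired = [lid for lid, lease in self.active.items() if lease.deadline < now]
        for lid in expired:
            lease = self.active.pop(lid)
            self.pending.insert(0, (lease.start_token, lease.num_tokens))
```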
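And a hedged sketch of the async-checkpointing idea: snapshot the (small) LoRA adapter state on the training thread, then write it in a background thread so the optimizer step never waits on disk I/O. `async_checkpoint` is a hypothetical helper, not the tool's real interface.

```python
import os
import threading

import torch


def async_checkpoint(model: torch.nn.Module, step: int, path: str) -> threading.Thread:
    # Snapshot on the training thread: clone tensors to CPU so later steps
    # can't mutate what is about to be written.
    snapshot = {k: v.detach().to("cpu", copy=True) for k, v in model.state_dict().items()}

    def _write() -> None:
        tmp = f"{path}.tmp"
        torch.save({"step": step, "state_dict": snapshot}, tmp)
        os.replace(tmp, path)  # atomic rename: a crash never leaves a half-written file

    t = threading.Thread(target=_write, daemon=True)
    t.start()
    return t  # training continues immediately; join() only at shutdown if needed
```

In a training loop you would call something like this every N steps and keep only the most recent thread to join on exit.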
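Finally, an illustrative worker-side loop showing what "training-aware" could mean in practice: the worker reports steps, tokens, and loss back to the aggregator instead of being an opaque container. `report_progress` is a hypothetical callback, not an existing Async LoRA or DeepSpeed API.

```python
def train_on_lease(model, optimizer, loss_fn, batches, lease, report_progress):
    tokens_done = 0
    for step, (inputs, labels) in enumerate(batches):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()
        tokens_done += inputs.numel()
        # The aggregator sees real training metrics, so it can size the next
        # lease and detect stalls instead of just watching a container.
        report_progress(lease.lease_id, step=step, tokens=tokens_done, loss=loss.item())
        if tokens_done >= lease.num_tokens:  # the lease is measured in tokens
            break
    return tokens_done
```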
I'd love feedback from people here:
- If you train on spot/preemptible GPUs, how do you usually handle checkpoints and failures?
- What would make this easier to drop into an existing pipeline (Airflow, K8s, Ray, etc.)?
- For monitoring, would you rather see native training metrics (loss, tokens, staleness), or just surfaced logs/events so you can plug in your own stack?