Hacker News (Chinese edition)



I’m building a system to run small LLMs on-device (mobile, IoT, on-prem servers) and would love to hear how others have tackled the challenges.

Context:
- Use cases: offline chatbots, smart cameras, local data privacy
- Models: 7–13B parameter quantized models (e.g. Llama 2, Vicuna)
- Constraints: limited RAM/flash, CPU-only or tiny GPU, intermittent connectivity

Questions:
- What runtimes or frameworks are you using (ONNX Runtime, TVM, custom C++)?
- How do you handle model loading, eviction, and batching under tight memory?
- Any clever tricks for quantization, pruning, or kernel fusions that boost perf?
- How do you monitor and update models securely in the field?

Looking forward to your benchmarks, war stories, and code pointers!
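
For a rough sense of scale on the memory constraint, here is a back-of-envelope sketch of how much space the weights alone take at different bit widths. The function name and the bit widths chosen are illustrative assumptions, not tied to any particular runtime. It counts weights only and ignores the KV cache, activations, and per-block quantization metadata (scales, zero points), so real footprints will be somewhat higher.

```python
def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width and
    quantization metadata (scales, zero points) is ignored.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30


# Compare the 7B and 13B sizes from the post at 16-, 8-, and 4-bit precision.
for n_params in (7e9, 13e9):
    for bits in (16, 8, 4):
        size = weight_footprint_gib(n_params, bits)
        print(f"{n_params / 1e9:.0f}B params @ {bits}-bit: ~{size:.1f} GiB")
```

By this estimate a 7B model at 4-bit needs roughly 3.3 GiB for weights and a 13B model about 6 GiB, which is why the choice of quantization level largely decides whether a model fits on a given device at all.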

Ask HN: How do you manage LLM inference on edge devices?