The pitfalls of open-weight LLMs
Some startups are fine-tuning open LLMs instead of using GPT or Gemini. Sometimes it's for a specific language, sometimes for narrow tasks. But I found they're all making the same mistake.

With a simple prompt (which I'm not sharing here), I got several "custom" LLM services to spill their internal system prompts, including things like security breach playbooks and product action lists.

For example, SKT A.X 4.0 (based on Qwen 2.5) returned internal guidelines related to the recent SKT data breach and instructions about its compensation policies. Vercel's v0 model leaked examples of the actions its system can generate.

The point: if the base model leaks, every service built on it is vulnerable, no matter how much you fine-tune. We need to think not only about system prompt hardening at the service level, but also about upstream improvements and more robust defenses in open-weight LLMs themselves.
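To make the service-level half of that concrete, here is a minimal sketch of one common guard: scan the model's reply for verbatim overlap with the system prompt before returning it to the user. The names (guarded_reply, call_model, SYSTEM_PROMPT) and the 0.6 threshold are illustrative assumptions, not code from any of the services mentioned above.

    # Sketch of a service-level output guard: refuse replies that reproduce a
    # long verbatim chunk of the system prompt. Names and threshold are
    # illustrative assumptions, not taken from SKT A.X or Vercel v0.
    from difflib import SequenceMatcher

    SYSTEM_PROMPT = "You are the support assistant. Internal playbook: ..."  # placeholder

    def leaks_system_prompt(reply: str, system_prompt: str, threshold: float = 0.6) -> bool:
        # Flag the reply if its longest common substring with the system prompt
        # covers most of the shorter of the two strings.
        if not reply or not system_prompt:
            return False
        a, b = reply.lower(), system_prompt.lower()
        match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        return match.size >= threshold * min(len(a), len(b))

    def guarded_reply(user_message: str, call_model) -> str:
        # call_model is any function taking (system_prompt, user_message) -> str.
        reply = call_model(SYSTEM_PROMPT, user_message)
        if leaks_system_prompt(reply, SYSTEM_PROMPT):
            return "Sorry, I can't share internal configuration details."
        return reply

Note that a check like this only catches verbatim leaks; a paraphrased dump of the prompt sails straight through, which is exactly why the upstream, open-weight-level defenses argued for above still matter.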