Ask HN: Should LLMs have a "Candor" slider for "No, that's a bad idea"?

Posted by mikebiglan about 1 month ago
I don't want a "nice" AI. I want one that says: "Nope, that's a bad idea."

That is, I want a "Candor" control, like temperature but for willingness to push back.

When candor is high, the model should prioritize frank, corrective feedback over polite cooperation. When candor is low, it can stay supportive, but with guardrails that flag empty flattery and warn about mediocre ideas.

Why this matters
• Today's defaults optimize for "no bad ideas." That is fine for brainstorming, but it amplifies poor premises and rewards confident junk.
• Sycophancy is a known failure mode. The model learns to agree, which earns positive user signals, which in turn reinforce the agreement.
• In reviews, product decisions, risk checks, and the like, the right answer is often a simple "do not do that."

Concrete proposal
• candor (0.0 – 1.0): the probability that the model will disagree or decline when evidence is weak or risk is high. Or maybe it doesn't have to be a literal probability.
• disagree_first: start responses with a plain verdict (for example "Short answer: do not ship this") followed by the rationale.
• risk_sensitivity: boost candor if the topic hits serious domains such as security/finance/health/safety.
• self_audit tag: append a note the user can see, like "Pushed back due to weak evidence and downstream risk."
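To make the shape of this concrete, here is a minimal sketch of the knobs as a config object in Python. Every name is hypothetical, nothing here corresponds to a real API parameter, and the risk boost is an arbitrary placeholder:

```python
# Hypothetical knobs from the proposal above -- a sketch, not a real API.
from dataclasses import dataclass

@dataclass
class CandorConfig:
    candor: float = 0.5            # 0.0 = fully agreeable, 1.0 = maximally blunt
    disagree_first: bool = False   # lead with a plain verdict before the rationale
    risk_sensitivity: bool = True  # auto-boost candor in serious domains
    self_audit: bool = True        # append a visible note explaining any pushback

def effective_candor(cfg: CandorConfig, topic_is_high_risk: bool) -> float:
    """Push candor toward 1.0 when the topic hits security/finance/health/safety."""
    if cfg.risk_sensitivity and topic_is_high_risk:
        return min(1.0, cfg.candor + 0.3)  # arbitrary boost size; would need tuning
    return cfg.candor

# A risky topic clamps an already-blunt setting to the maximum.
print(effective_candor(CandorConfig(candor=0.8), topic_is_high_risk=True))  # 1.0
```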
Examples
• candor=0.2 - "We could explore that. A few considerations first…" (a gentle nudge, still collaborative)
• candor=0.8 + disagree_first=true - "No. This is likely to fail for X and introduces Y risk. If you must proceed, the safer alternative is A with guardrails B and C. Here is a minimal test to falsify the core assumption."

What I would ship tomorrow
• A simple UI slider with labels from Gentle to Direct
• A toggle: "Prefer blunt truth over agreeable help"
• A warning chip when the model detects flattery without substance: "This reads like praise with low evidence."
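One way to fake the slider today is plain prompt injection: map its position to a system-prompt preamble. A rough sketch below, with all wording illustrative and untested, and the 0.33/0.66 thresholds chosen arbitrarily:

```python
# Sketch: translate the UI slider and toggle into a system-prompt preamble.
def candor_preamble(candor: float, disagree_first: bool = False) -> str:
    if candor < 0.33:    # Gentle end of the slider
        style = ("Be supportive, but flag flattery that lacks evidence "
                 "and gently note when an idea seems mediocre.")
    elif candor < 0.66:  # Middle of the slider
        style = ("Give balanced feedback. State concerns plainly before "
                 "offering help.")
    else:                # Direct end of the slider
        style = ("Prioritize frank, corrective feedback over polite "
                 "cooperation. If the idea is bad, say so directly.")
    verdict = ("Begin with a one-line verdict (e.g. 'Short answer: do not "
               "ship this') before the rationale. " if disagree_first else "")
    return verdict + style

# Example: build the system message for a Direct setting with the toggle on.
print(candor_preamble(candor=0.8, disagree_first=True))
```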
Some open questions
• How do you avoid needless rudeness while preserving clarity (separating tone from content)?
• What is the right metric for earned praise (citation density, novelty, constraints)?
• When should risk sensitivity kick in automatically, and when should it be user controlled?

If anyone has prototyped this, whether via prompt injection or an RL signal, I'd love to see it.