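Security gaps during partial outages: design notes for distributed systems
TL;DR: Many security mechanisms fail not during attacks but during partial outages. This post documents early design notes for a failure-aware security framework for distributed systems.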
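The problem
In production distributed systems, security often breaks when things are half working:
- auth services degrade → retries explode
- fallback paths widen access
- recovery logic becomes the attack surface
Nothing is "exploited", yet the system becomes unsafe.
Most security models assume stable components and clean failures.
Real systems don't behave that way.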
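Design assumptions
We assume:
- failures are correlated
- retries are adversarial
- timeouts are unsafe defaults
- recovery paths matter as much as steady-state logic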
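A minimal sketch of two of these assumptions ("timeouts are unsafe defaults" and "retries are adversarial"), not a definitive implementation. The names `authorize_fail_closed`, `check`, and `max_attempts` are hypothetical, and `check` is assumed to raise `TimeoutError` or `ConnectionError` when the auth backend is slow or unreachable:

```python
from typing import Callable


def authorize_fail_closed(check: Callable[[], bool], max_attempts: int = 2) -> bool:
    """Grant access only if `check` explicitly returns True.

    Assumption: `check` is a hypothetical call to an auth backend that
    raises TimeoutError or ConnectionError when the backend is degraded.
    """
    for _ in range(max_attempts):  # bounded retry budget, never unbounded
        try:
            # Any answer other than the literal True is a denial, and an
            # explicit denial is final: we do not retry hoping for a
            # different answer.
            return check() is True
        except (TimeoutError, ConnectionError):
            continue  # no decision was made yet; use the next attempt
    # Budget exhausted without an answer: deny instead of falling back.
    return False
```

The design choice here is that a timeout is treated as "no answer", and "no answer" maps to deny. A retry may complete a decision that was never made, but it cannot turn a denial into a grant.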
We don't assume:
- global consistency
- perfect identity
- reliable clocks
- centralized enforcement
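Framework ideas (high level)
This work explores four ideas:
1. Failure-aware trust
- Trust degrades under failure, not just under compromise
- Access narrows automatically during partial outages
2. Security invariants at runtime
- Invariants are enforced continuously
- Violations trigger containment, not alerts
3. Retry-safe security primitives
- Idempotent, monotonic, with bounded side effects
- Retries can't escalate privilege
4. Security as observable state
- Trust level, degradation, and containment are visible
- If you can't observe it, you can't secure it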
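To make the four ideas concrete, here is a rough sketch of how they might fit together in one small state object. Everything in it is an assumption made for illustration: the three trust levels, the `ALLOWED` action mapping, and the names `TrustState`, `lower_to`, `report_health`, and `check_invariant` are hypothetical and not part of any existing framework.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Trust(IntEnum):
    CONTAINED = 0  # an invariant was violated: no access
    DEGRADED = 1   # partial outage: narrowed access
    FULL = 2       # healthy steady state


# Hypothetical mapping from trust level to permitted actions.
ALLOWED = {
    Trust.CONTAINED: frozenset(),
    Trust.DEGRADED: frozenset({"read"}),
    Trust.FULL: frozenset({"read", "write"}),
}


@dataclass
class TrustState:
    level: Trust = Trust.FULL
    # Idea 4: every transition is recorded so the state is observable.
    events: list = field(default_factory=list)

    def lower_to(self, target: Trust, reason: str) -> Trust:
        """Idea 3: idempotent and monotonic, the level only moves down."""
        new_level = min(self.level, target)
        if new_level != self.level:
            self.events.append((self.level.name, new_level.name, reason))
            self.level = new_level
        return self.level

    def report_health(self, healthy: int, total: int) -> Trust:
        """Idea 1: an unhealthy dependency narrows access automatically."""
        if healthy < total:
            return self.lower_to(Trust.DEGRADED, f"{healthy}/{total} dependencies healthy")
        return self.level  # a healthy report never raises trust by itself

    def check_invariant(self, holds: bool, name: str) -> Trust:
        """Idea 2: a violated invariant triggers containment, not an alert."""
        if not holds:
            return self.lower_to(Trust.CONTAINED, f"invariant violated: {name}")
        return self.level

    def allows(self, action: str) -> bool:
        return action in ALLOWED[self.level]


state = TrustState()
state.report_health(healthy=2, total=3)  # partial outage
state.report_health(healthy=2, total=3)  # retried report: no further change
print(state.level.name, state.allows("write"), state.events)
```

Because `lower_to` takes the minimum of the current and target levels, replaying or reordering reports can never widen access. Raising trust again is deliberately left out of the sketch; under these assumptions it would be a separate, explicit, audited step.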
What this is not
- Not zero trust marketing
- Not compliance
- Not a finished system
It's an attempt to treat failure as the normal case, not an exception.
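Why publish this early?
Because many real failures:
- don't fit neatly into research papers
- happen during incidents, not attacks
- are invisible outside production systems
We're sharing design notes to get feedback before formalizing or evaluating further.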
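Feedback welcome
If you've seen security regressions during outages, or retries causing unsafe behavior, I'd like to hear about it.
This is ongoing work. No claims of novelty or completeness.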