Hacker News (Chinese edition)



TL;DR: Many security mechanisms fail not during attacks, but during partial outages. This post documents early design notes for a failure-aware security framework for distributed systems.

The problem

In production distributed systems, security often breaks when things are half working:

- auth services degrade → retries explode
- fallback paths widen access
- recovery logic becomes the attack surface

Nothing is “exploited”, yet the system becomes unsafe.

Most security models assume stable components and clean failures. Real systems don’t behave that way.

Design assumptions

We assume:

- correlated failures
- retries are adversarial
- timeouts are unsafe defaults
- recovery paths matter as much as steady-state logic

We don’t assume:

- global consistency
- perfect identity
- reliable clocks
- centralized enforcement

Framework ideas (high level)

This work explores four ideas:

1. Failure-aware trust
- Trust degrades under failure, not just compromise
- Access narrows automatically during partial outages

2. Security invariants at runtime
- Invariants are continuously enforced
- Violations trigger containment, not alerts

3. Retry-safe security primitives
- Idempotent, monotonic, side-effect bounded
- Retries can’t escalate privilege

4. Security as observable state
- Trust level, degradation, and containment are visible
- If you can’t observe it, you can’t secure it

What this is not

- Not zero trust marketing
- Not compliance
- Not a finished system

It’s an attempt to treat failure as the normal case, not an exception.

Why publish this early?

Because many real failures:

- don’t fit clean research papers
- happen during incidents, not attacks
- are invisible outside production systems

We’re sharing design notes to get feedback before formalizing or evaluating further.

Feedback welcome

If you’ve seen security regressions during outages or retries causing unsafe behavior, I’d like to hear about it.

This is ongoing work. No claims of novelty or completeness.
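
To make the four framework ideas a little more concrete, here is a minimal Python sketch of how they might fit together. It is not the framework described in the post, which gives no implementation details. The names `Trust`, `ALLOWED`, `Session`, `observe_failure`, `authorize`, and `snapshot` are all hypothetical, as are the three trust levels and the action sets. The sketch also assumes that trust never rises within a session, so recovery would need a fresh session; that is one possible reading of "monotonic", not something the post states.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; a higher value means wider access."""
    CONTAINED = 0
    DEGRADED = 1
    FULL = 2


# Hypothetical, nested permission sets: each lower level is a subset
# of the level above it, so lowering trust can only narrow access.
ALLOWED = {
    Trust.FULL: {"read", "write", "admin"},
    Trust.DEGRADED: {"read"},
    Trust.CONTAINED: set(),
}


@dataclass
class Session:
    trust: Trust = Trust.FULL
    # First decision recorded per (request_id, action) pair.
    decisions: dict = field(default_factory=dict)

    def observe_failure(self, level: Trust) -> None:
        """Lower trust to `level`. Monotonic: a call can never raise trust."""
        self.trust = min(self.trust, level)

    def authorize(self, request_id: str, action: str) -> bool:
        """Decide a request. A retry can be downgraded but never upgraded."""
        allowed_now = action in ALLOWED[self.trust]
        first = self.decisions.setdefault((request_id, action), allowed_now)
        return first and allowed_now

    def snapshot(self) -> dict:
        """Expose the security state so it can be logged or exported."""
        return {"trust": self.trust.name, "allowed": sorted(ALLOWED[self.trust])}


if __name__ == "__main__":
    s = Session()
    print(s.authorize("req-1", "write"))  # allowed at full trust
    s.observe_failure(Trust.DEGRADED)     # e.g. the auth service timed out
    print(s.authorize("req-1", "write"))  # the retry is now denied
    s.observe_failure(Trust.CONTAINED)    # e.g. an invariant was violated
    print(s.snapshot())
```

In this sketch a timeout is treated as a reason to narrow access rather than to fall back to a wider path, and an invariant violation moves the session straight to containment instead of only raising an alert. Repeating either call has no further effect, so a retry storm cannot change the outcome.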

Security failures during partial outages: design notes for distributed systems