Ask HN: How many nines did AWS lose in the recent outage (us-east-1 incident)?
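AWS recently published a postmortem on the October 2025 us-east-1 outage [1]. A DNS race condition in DynamoDB cascaded across EC2, Lambda, Redshift, and NLB, causing roughly 14 hours of degraded new-instance launches and knock-on effects on multiple services.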
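As a rough back-of-the-envelope for the "how many nines" question, here is a minimal sketch that converts a downtime figure into an availability percentage and a count of nines. It assumes the whole ~14 hours counts as unavailability over a one-year window, which overstates the impact, since the postmortem describes degraded launches rather than a full outage. The function names are mine, not anything from AWS.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(downtime_hours: float, window_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the window during which the service was up."""
    return 1.0 - downtime_hours / window_hours


def nines(avail: float) -> float:
    """Number of nines, e.g. 0.999 -> 3.0 (fractional values allowed)."""
    return -math.log10(1.0 - avail)


# Assumption: all ~14 hours are treated as downtime within a single year.
a = availability(14.0)
print(f"availability: {a:.4%}, nines: {nines(a):.2f}")
```

Under that assumption the year comes out at roughly 99.84%, or about 2.8 nines. Passing a shorter `window_hours` (for example a single month) makes the same incident look considerably worse.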
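Has anyone quantitatively modelled AWS's effective availability once you account for inter-service dependencies inside their control plane and data plane?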
In other words: if EC2 depends on DynamoDB, and Lambda depends on EC2 + NLB, what's the composite availability in practice?
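To make the question concrete, here is a minimal sketch of the simplest model I can think of: every dependency is a hard one (a service is down whenever anything it transitively depends on is down) and failures are independent. The 99.99% figures are placeholders, not published numbers, and the graph contains only the edges named above. Real dependencies are partly soft and failures are correlated, so treat this as a naive baseline rather than an answer.

```python
from math import prod

# Assumption: hypothetical intrinsic availability per service (placeholders).
INTRINSIC = {"DynamoDB": 0.9999, "EC2": 0.9999, "NLB": 0.9999, "Lambda": 0.9999}

# Hard dependencies as described in the question; nothing else is modelled.
DEPENDS_ON = {
    "DynamoDB": [],
    "EC2": ["DynamoDB"],
    "NLB": [],
    "Lambda": ["EC2", "NLB"],
}


def closure(service: str) -> set[str]:
    """The service plus everything it transitively depends on."""
    seen: set[str] = set()
    stack = [service]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(DEPENDS_ON[current])
    return seen


def effective_availability(service: str) -> float:
    """Product of intrinsic availabilities over the dependency closure.

    Each service is counted once, so a shared dependency is not double-counted.
    """
    return prod(INTRINSIC[s] for s in closure(service))


for name in DEPENDS_ON:
    print(f"{name}: {effective_availability(name):.6f}")
```

With these placeholder inputs, Lambda's closure contains all four services, so its effective availability is 0.9999 to the fourth power, which is noticeably below any single service's figure. The interesting part of the question is how far real-world correlation and soft dependencies move the result away from this baseline.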
[1] - https://aws.amazon.com/message/101925/