Tell HN: DigitalOcean's managed services broke each other after an update.
Yesterday my production app went down. The cause? DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes.

Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.

DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.

I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free; it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.

Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.
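For context on what that workaround amounts to: roughly, each node loops forever, finds kernel neighbor (ARP) entries stuck in STALE/FAILED, and pings them so the kernel re-resolves. This is a minimal sketch based on that description, not the actual DaemonSet from the GitHub user; the function name and details are assumptions.

```shell
# Sketch of the "ping stale ARP entries" workaround (assumed behavior,
# not the actual manifest DO support pointed to).
refresh_stale_arp() {
  # List neighbor entries whose state is STALE or FAILED; the state is
  # the last field of `ip neigh show` output.
  ip neigh show 2>/dev/null |
    awk '$NF == "STALE" || $NF == "FAILED" {print $1}' |
    while read -r addr; do
      # A single quick ping forces fresh ARP resolution for the address.
      ping -c 1 -W 1 "$addr" >/dev/null 2>&1 || true
    done
}

# A DaemonSet container would effectively run:
#   while true; do refresh_stale_arp; sleep 10; done
```

That a vendor-suggested fix boils down to a 10-second cron-style ping loop in a privileged pod is the whole point: it's the kind of duct tape "managed" is supposed to hide from you.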