Tell HN: DigitalOcean's managed services broke each other after an update.
Yesterday my production app went down. The cause? DigitalOcean's managed PostgreSQL update broke private VPC connectivity to their managed Kubernetes.

Public endpoint worked. Private endpoint timed out. Root cause: a Cilium bug (#34503) where ARP entries go stale after infrastructure changes.

DO support responded relatively quickly (<12hrs). Their fix? Deploy a DaemonSet from a random GitHub user to ping stale ARP entries every 10 seconds. The upstream Cilium fix is merged but not yet deployed to DOKS. No ETA.

I chose managed services specifically to avoid ops emergencies. We're a tiny startup paying the premium so someone else handles this. Instead, I spent late night hours debugging VPC routing issues in a networking layer I don't control.

HN's usual advice is "just use managed services, focus on the business." Generally good advice. But managed doesn't mean worry-free; it means trading your failure modes for the vendor's failure modes. You're not choosing between problems and no problems. You're choosing between problems you control and (fewer?) problems you don't.

Still using DO. Still using managed services. Just with fewer illusions about what "managed" means.
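For context on what that workaround amounts to: roughly, each node loops forever, finds kernel neighbor (ARP) entries stuck in STALE/FAILED, and pings them so the kernel re-resolves. This is a minimal sketch based on that description, not the actual DaemonSet from the GitHub user; the function name and details are assumptions.

```shell
# Sketch of the "ping stale ARP entries" workaround (assumed behavior,
# not the actual manifest DO support pointed to).
refresh_stale_arp() {
  # List neighbor entries whose state is STALE or FAILED; the state is
  # the last field of `ip neigh show` output.
  ip neigh show 2>/dev/null |
    awk '$NF == "STALE" || $NF == "FAILED" {print $1}' |
    while read -r addr; do
      # A single quick ping forces fresh ARP resolution for the address.
      ping -c 1 -W 1 "$addr" >/dev/null 2>&1 || true
    done
}

# A DaemonSet container would effectively run:
#   while true; do refresh_stale_arp; sleep 10; done
```

That a vendor-suggested fix boils down to a 10-second cron-style ping loop in a privileged pod is the whole point: it's the kind of duct tape "managed" is supposed to hide from you.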