


Hey everyone,

I’ve been thinking a lot about webhook delivery reliability lately. In many projects I’ve worked on, building robust webhook infra turned out to be deceptively complex:

- Retry logic (exponential backoff, timeouts)
- Handling non-2xx responses
- Delivery monitoring and alerting
- Back-pressure or queueing to avoid overwhelming receivers
- Secure signing and validation flows

In one project, a failed webhook caused a payment processing delay for hours because the retry logic was buggy. Another time, burst traffic took down the receiver endpoint with no DLQ strategy in place.

I’ve been researching different approaches teams here use:

- Do you build your own custom webhook delivery queue and monitoring system?
- Use cloud solutions like AWS EventBridge or Step Functions to orchestrate?
- Or integrate third-party tools that handle delivery, retries, and observability for you?

I’m curious about how you ensure production-grade reliability at scale without burning dev hours on plumbing. Recently, I’ve been working on a tool in this space to handle these issues automatically, but would love to hear:

- What architecture have you found most reliable?
- What are the edge cases you’ve encountered (e.g. signature mismatches, downstream outages)?
- Any horror stories or lessons learned from webhook failures in production?

Looking forward to learning from your experiences and best practices around webhook infra!
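To make the retry, dead-letter, and signing points concrete, here is a minimal Python sketch of a sender-side delivery loop. It is an illustration under stated assumptions, not a description of any particular tool: the names `sign`, `verify`, `backoff_delay`, and `deliver` are hypothetical, the transport is an injected `send` callable that returns an HTTP status code, the dead-letter queue is just an in-memory list, and HMAC-SHA256 over the raw body is assumed as the signing scheme.

```python
import hashlib
import hmac
import random
import time
from typing import Callable, List


def sign(secret: bytes, body: bytes) -> str:
    """Return a hex HMAC-SHA256 signature of the raw payload (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret: bytes, body: bytes, signature: str) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign(secret, body), signature)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def deliver(
    send: Callable[[bytes, str], int],   # hypothetical transport: (body, signature) -> status
    secret: bytes,
    body: bytes,
    dead_letters: List[bytes],           # stand-in for a real DLQ
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one webhook; park it in dead_letters if every attempt fails."""
    signature = sign(secret, body)
    for attempt in range(max_attempts):
        try:
            status = send(body, signature)
        except Exception:
            status = None  # timeout or connection error: treated as retryable here
        if status is not None and 200 <= status < 300:
            return True
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))  # wait before the next attempt
    dead_letters.append(body)  # retries exhausted: keep the payload for later replay
    return False
```

This sketch retries every non-2xx response the same way; a real system would likely stop early on most 4xx responses and persist the dead letters somewhere durable. Injecting `send` and `sleep` keeps the loop testable without a network or real delays.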

How do you handle webhook delivery reliability in production in your applications?