How do you handle webhook delivery reliability in production in your applications?

1 | Author: Tanjim | 7 months ago
Hey everyone,

I’ve been thinking a lot about webhook delivery reliability lately. In many projects I’ve worked on, building robust webhook infra turned out to be deceptively complex:

- Retry logic (exponential backoff, timeouts)
- Handling non-2xx responses
- Delivery monitoring and alerting
- Back-pressure or queueing to avoid overwhelming receivers
- Secure signing and validation flows

In one project, a failed webhook delayed payment processing for hours because the retry logic was buggy. Another time, burst traffic took down the receiver endpoint, and there was no DLQ strategy in place.

I’ve been researching the different approaches teams here use:

- Do you build your own custom webhook delivery queue and monitoring system?
- Do you use cloud solutions like AWS EventBridge or Step Functions to orchestrate?
- Or do you integrate third-party tools that handle delivery, retries, and observability for you?

I’m curious how you ensure production-grade reliability at scale without burning dev hours on plumbing. Recently, I’ve been working on a tool in this space to handle these issues automatically, but would love to hear:

- What architecture have you found most reliable?
- What edge cases have you encountered (e.g. signature mismatches, downstream outages)?
- Any horror stories or lessons learned from webhook failures in production?

Looking forward to learning from your experiences and best practices around webhook infra!
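
To make a few of the items above concrete (retries with exponential backoff, non-2xx handling, a dead-letter fallback, and signing), here is a minimal Python sketch. It is only an illustration under assumptions, not a reference implementation: the names `sign`, `verify`, `backoff_delay`, and `deliver` are hypothetical, as are the `"<timestamp>.<body>"` signing format, the 300-second replay tolerance, the full-jitter backoff, and the in-memory `dead_letters` list standing in for a real DLQ.

```python
import hashlib
import hmac
import random
import time
from typing import Callable, List, Optional


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """HMAC-SHA256 over "<timestamp>.<body>" (assumed format), hex-encoded."""
    message = str(timestamp).encode() + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None, tolerance_s: int = 300) -> bool:
    """Receiver side: reject stale timestamps, then compare in constant time."""
    now = time.time() if now is None else now
    if abs(now - timestamp) > tolerance_s:
        return False  # too old or too far in the future: possible replay
    expected = sign(secret, timestamp, body)
    return hmac.compare_digest(expected.encode(), signature.encode())


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def deliver(send: Callable[[bytes], int], body: bytes, dead_letters: List[bytes],
            max_attempts: int = 5,
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call send(body) until it returns a 2xx status or attempts run out.

    `send` is a hypothetical transport callable that returns an HTTP status
    code and enforces its own request timeout. This sketch retries every
    non-2xx response; a real system may treat most 4xx codes as permanent.
    """
    for attempt in range(max_attempts):
        try:
            status = send(body)
        except Exception:
            status = None  # network error or timeout: treat as retryable
        if status is not None and 200 <= status < 300:
            return True
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))  # no sleep after the final attempt
    dead_letters.append(body)  # keep the payload for inspection or replay
    return False
```

Passing `sleep` and `send` in as parameters keeps the retry loop testable without real delays or network calls. In a real deployment the loop would more likely live in a queue worker than block inline, and `dead_letters` would be a durable store.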