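Ask HN: How do you handle duplicate side effects when jobs and workflows retry?
Quick context: I'm building background job automation and keep hitting this pattern:
1. Job calls an external API (Stripe, SendGrid, AWS)
2. API call succeeds
3. Job crashes before recording success
4. Job retries → calls the API again → duplicate
Example: a job processes a refund, sends an email notification, then crashes. The retry does both again, so the customer gets a duplicate refund email (or worse, a duplicate refund).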
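A minimal sketch that reproduces the pattern above in memory. `FakeEmailApi`, `run_job`, and the `completed` set are hypothetical stand-ins for the external API, the job body, and the job store; none of them come from a real library.

```python
class FakeEmailApi:
    """Hypothetical stand-in for an external API that does no deduplication."""

    def __init__(self):
        self.sent = []

    def send(self, to, subject):
        self.sent.append((to, subject))


def run_job(api, completed, job_id, crash_before_record=False):
    """Run one attempt of the job; `completed` plays the role of the job store."""
    if job_id in completed:
        return  # already recorded as done, nothing to do
    # Steps 1-2: the external call succeeds.
    api.send("customer@example.com", "Your refund was processed")
    if crash_before_record:
        # Step 3: the worker dies before success is recorded.
        raise RuntimeError("worker died before recording success")
    completed.add(job_id)


def demo():
    """Return how many emails were sent after one crash and one retry."""
    api, completed = FakeEmailApi(), set()
    try:
        run_job(api, completed, "refund-123", crash_before_record=True)
    except RuntimeError:
        pass  # a real queue would schedule a retry here
    run_job(api, completed, "refund-123")  # step 4: the retry repeats the call
    return len(api.sent)
```

Calling `demo()` shows the retry sending the same email a second time, because the only record of success lives on the far side of the crash.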
I see a few approaches:
Option A: Store processed IDs in a database
Problem: A race between "check DB" and "call API" can still produce duplicates.
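One way to narrow the check-then-call race is to make the check itself the write: insert the ID under a unique constraint and let the database decide who wins. This is only a sketch using `sqlite3` from the standard library; the `processed` table and the `try_claim` / `run_once` helpers are hypothetical names.

```python
import sqlite3


def make_store():
    """Create an in-memory store; op_id is the primary key, so it is unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE processed (op_id TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    return conn


def try_claim(conn, op_id):
    """Return True only for the first caller that claims op_id."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed (op_id, status) VALUES (?, 'started')",
                (op_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # someone else already claimed this operation


def run_once(conn, op_id, side_effect):
    """Run side_effect only if this call wins the claim for op_id."""
    if not try_claim(conn, op_id):
        return False
    side_effect()
    with conn:
        conn.execute(
            "UPDATE processed SET status = 'done' WHERE op_id = ?", (op_id,)
        )
    return True
```

Note the trade-off: this stops two attempts from both passing the check, but a crash between the claim and the API call leaves a `started` row and no side effect. That turns "maybe twice" into "maybe never", so rows stuck in `started` still need some form of reconciliation.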
Option B: Use API idempotency keys (Stripe supports this)
Problem: Not all APIs support them (legacy systems, third parties).
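For idempotency keys to help, every retry has to send the same key, so the key should be derived from something stable such as the job ID rather than generated fresh on each attempt. A sketch of that idea follows; `idempotency_key` and `FakeRefundApi` are hypothetical, and the fake only imitates the server-side behaviour (Stripe, for instance, takes the key in an `Idempotency-Key` request header).

```python
import hashlib


def idempotency_key(job_id, step):
    """Derive a key that is identical across retries of the same job step."""
    return hashlib.sha256(f"{job_id}:{step}".encode()).hexdigest()


class FakeRefundApi:
    """Hypothetical server that replays the stored response for a repeated key."""

    def __init__(self):
        self._by_key = {}
        self.refunds_created = 0

    def create_refund(self, charge_id, amount_cents, key):
        if key in self._by_key:
            return self._by_key[key]  # repeated key: no new refund is created
        self.refunds_created += 1
        result = {
            "id": f"re_{self.refunds_created}",
            "charge": charge_id,
            "amount": amount_cents,
        }
        self._by_key[key] = result
        return result
```

Including the step name in the key keeps the refund and the email of the same job from colliding with each other.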
Option C: Build a deduplication layer that checks the external system first
Problem: Extra latency, extra complexity.
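A rough sketch of what such a layer could look like, assuming the external system lets you attach your own reference to an object and look it up later. `FakeLegacyApi`, `find_refund_by_reference`, and `refund_once` are hypothetical names for illustration, not any real vendor's API.

```python
class FakeLegacyApi:
    """Hypothetical API with no idempotency keys, but with a reference field."""

    def __init__(self):
        self.refunds = []

    def find_refund_by_reference(self, reference):
        """Return the refund tagged with this reference, or None."""
        return next(
            (r for r in self.refunds if r["reference"] == reference), None
        )

    def create_refund(self, amount_cents, reference):
        refund = {"amount": amount_cents, "reference": reference}
        self.refunds.append(refund)
        return refund


def refund_once(api, op_id, amount_cents):
    """Ask the external system whether op_id already ran before calling it."""
    existing = api.find_refund_by_reference(op_id)  # the extra round trip
    if existing is not None:
        return existing  # a previous attempt got through; reuse its result
    return api.create_refund(amount_cents, reference=op_id)
```

This covers sequential retries after a crash. Two workers running the same job at the same moment can still both see "not found", so it probably needs to be paired with a lock or a claim on the job itself.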
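What do you do in production? Accept some duplicates? Only use APIs with idempotency support? Something else?
(I built something for Option C, but I'm trying to understand whether this is a common enough problem or whether I'm over-engineering.)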