Show HN: A usage circuit breaker for Cloudflare Workers

8 points | by ethan_zhao | about 1 month ago | original post
I run 3mins.news (https://3mins.news), an AI news aggregator built entirely on Cloudflare Workers. The backend has 10+ cron triggers running every few minutes: RSS fetching, article clustering, LLM calls, email delivery.

The problem: the Workers Paid plan has hard monthly limits (10M requests, 1M KV writes, 1M queue ops, etc.). There's no built-in "pause when you hit the limit"; Cloudflare just starts billing overages. KV writes cost $5/M over the cap, so a retry-loop bug can get expensive fast.

AWS has Budget Alerts, but those are passive notifications: by the time you read the email, the damage is done. I wanted active, application-level self-protection.

So I built a circuit breaker that faces inward. Instead of protecting against downstream failures (the Hystrix pattern), it monitors my own resource consumption and gracefully degrades before hitting the ceiling.

Key design decisions:

- Per-resource thresholds: Workers requests ($0.30/M overage) only warn at 80%. KV writes ($5/M overage) can trip the breaker at 90%. Not all resources are equally dangerous, so some are configured as warn-only (trip=null).

- Hysteresis: trips at 90%, recovers at 85%. The 5% gap prevents oscillation; without it, the system flaps between tripped and recovered every check cycle.

- Fail-safe on monitoring failure: if the Cloudflare usage API is down, maintain the last known state rather than assuming everything is fine. A monitoring outage shouldn't mask a usage spike.

- Alert dedup: per resource, per month. Without it, you'd get ~8,600 identical emails for the rest of the month once a resource hits 80%.

Implementation: every 5 minutes, it queries Cloudflare's GraphQL API (requests, CPU, KV, queues) and the Observability Telemetry API (logs/traces) in parallel, evaluates 8 resource dimensions, and caches the state to KV. Between checks it's a single KV read, which is essentially free.

When tripped, all scheduled tasks are skipped.
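The per-resource thresholds and hysteresis described above can be sketched roughly like this (illustrative names and numbers, not the actual implementation):

```typescript
// Sketch of per-resource threshold + hysteresis evaluation.
// Field names and the ResourceLimit shape are assumptions for illustration.
interface ResourceLimit {
  name: string;
  monthlyCap: number;
  warnAt: number;        // fraction of cap that triggers a warning email
  tripAt: number | null; // null = warn-only, never trips the breaker
  recoverAt: number;     // hysteresis: must fall below this to close again
}

type BreakerState = "closed" | "tripped";

function evaluate(
  limit: ResourceLimit,
  used: number,
  prev: BreakerState
): { state: BreakerState; warn: boolean } {
  const ratio = used / limit.monthlyCap;
  const warn = ratio >= limit.warnAt;
  if (limit.tripAt === null) return { state: "closed", warn };
  if (prev === "tripped") {
    // Hysteresis: stay tripped until usage drops below recoverAt.
    return { state: ratio >= limit.recoverAt ? "tripped" : "closed", warn };
  }
  return { state: ratio >= limit.tripAt ? "tripped" : "closed", warn };
}

// Example: KV writes warn at 80%, trip at 90%, recover at 85% of the 1M cap.
const kvWrites: ResourceLimit = {
  name: "kv_writes",
  monthlyCap: 1_000_000,
  warnAt: 0.8,
  tripAt: 0.9,
  recoverAt: 0.85,
};
```

With this shape, a warn-only resource like Workers requests just sets `tripAt: null`, and the breaker can never open on it.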
The cron triggers still fire (you can't stop that), but the first thing each one does is check the breaker and bail out if tripped.

It's been running in production for two weeks. It caught a KV-reads spike at 82% early in the month: I got one warning email, investigated, fixed the root cause, and never hit the trip threshold.

The pattern should apply to any metered serverless platform (Lambda, Vercel, Supabase) or any API with a budget ceiling (OpenAI, Twilio). The core idea: treat your own resource budget as a health signal, just as you'd treat a downstream service's error rate.

Happy to share code details if there's interest.

Full writeup with implementation code and tests: https://yingjiezhao.com/en/articles/Usage-Circuit-Breaker-for-Cloudflare-Workers