Show HN: Detecting API degradation before thresholds are breached

2 points | by AnchorFlow | 5 days ago
While building a payment orchestration system, I ran into a problem: most monitoring tools alert only once a threshold has already been breached (e.g. P95 > 1000ms). In practice, systems often start degrading before they hit those limits, especially under bursty traffic.

So I experimented with detecting degradation before thresholds are crossed, directly inside a FastAPI app. I built a small middleware that:

- Tracks P95 latency per route template (e.g. /users/{id})
- Learns a baseline dynamically from recent traffic
- Detects spikes using rate-of-change (not just static thresholds)
- Computes a 0-100 health score with trend direction (improving / stable / degrading)
- Stores events in Redis Streams for replay and debugging

One interesting result: in synthetic load tests (gradual latency ramp from ~200ms to ~1200ms over 60 seconds, with a P95 warning threshold at 1000ms), rate-of-change detection consistently surfaced degradation slightly before static threshold alerts. The window is small, but it was often enough to notice system stress before crossing alert thresholds.

Design constraints:

- Near-zero overhead on the request path (async, fire-and-forget writes)
- Must fail silently if Redis is unavailable
- No external monitoring stack required (runs in-app)

Example usage:

    app.add_middleware(RequestMetricsMiddleware, alert_engine=engine)

Context: this is part of a larger system I'm building that integrates cloud services with mobile money APIs (EcoCash, etc.), where partial failures and latency spikes are common. Still early: it hasn't been tested under real production traffic yet.

Curious how others are handling early degradation detection in FastAPI or similar systems.

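To make the rate-of-change idea concrete, here is a minimal sketch comparing a static P95 threshold with a relative rate-of-change check on the same kind of ramp described above. It is not the library's implementation: `DegradationDetector`, `simulate_ramp`, the window and lookback sizes, and the 25% rise limit are all assumptions chosen for illustration.

```python
import math
from collections import deque


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


class DegradationDetector:
    """Hypothetical detector: static P95 threshold vs. relative rate of change."""

    def __init__(self, window=10, lookback=5, max_rise=0.25, threshold_ms=1000.0):
        self.samples = deque(maxlen=window)        # most recent latencies
        self.history = deque(maxlen=lookback + 1)  # most recent P95 readings
        self.max_rise = max_rise                   # allowed relative rise over the lookback
        self.threshold_ms = threshold_ms           # static warning threshold

    def observe(self, latency_ms):
        """Record one latency and return (p95, static_alert, rate_alert)."""
        self.samples.append(latency_ms)
        current = p95(self.samples)
        self.history.append(current)
        static_alert = current > self.threshold_ms
        rate_alert = False
        # Only compare once a full lookback of P95 readings is available.
        if len(self.history) == self.history.maxlen and self.history[0] > 0:
            oldest = self.history[0]
            rate_alert = (current - oldest) / oldest > self.max_rise
        return current, static_alert, rate_alert


def simulate_ramp():
    """Linear ramp from 200ms to 1200ms over 60s, one sample per second.

    Returns the first second at which each kind of alert fires.
    """
    detector = DegradationDetector()
    first_rate = first_static = None
    for t in range(61):
        latency = 200 + 1000 * t / 60
        _, static_alert, rate_alert = detector.observe(latency)
        if rate_alert and first_rate is None:
            first_rate = t
        if static_alert and first_static is None:
            first_static = t
    return first_rate, first_static
```

On this noise-free toy ramp, `simulate_ramp()` returns `(5, 49)`: the rate check fires as soon as it has a full lookback, while the static check waits until P95 exceeds 1000ms. Those numbers depend entirely on the assumed parameters and should not be read as the lead time observed in the load tests above. A per-route version could keep one detector per route template in a dict.
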
Repo: https://github.com/Tandem-Media/fastapi-alertengine
PyPI: https://pypi.org/project/fastapi-alertengine/
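
For the "fire-and-forget" and "fail silently" design constraints listed in the post, one possible shape is sketched below using only `asyncio`. This is an illustration, not code from the repo: `emit_event` and the sink functions are hypothetical names, and the sink stands in for whatever actually writes to Redis Streams (for example an `xadd` call on a redis-py asyncio client).

```python
import asyncio

# Strong references to in-flight tasks so they are not garbage collected early.
_pending = set()


def emit_event(sink, event):
    """Schedule sink(event) in the background without awaiting it.

    `sink` is any async callable taking one dict. Must be called from
    inside a running event loop (as a request handler would be).
    """

    async def _write():
        try:
            await sink(event)
        except Exception:
            # Deliberately swallowed: a metrics failure must never break a request.
            pass

    task = asyncio.create_task(_write())
    _pending.add(task)
    task.add_done_callback(_pending.discard)


async def demo():
    """Emit one event to a working sink and one to a failing sink."""
    stored = []

    async def good_sink(event):
        stored.append(event)

    async def broken_sink(event):
        raise ConnectionError("Redis unavailable")  # simulated outage

    emit_event(good_sink, {"route": "/users/{id}", "p95_ms": 240})
    emit_event(broken_sink, {"route": "/users/{id}", "p95_ms": 260})

    # Only the demo waits for the background tasks; a request handler would not.
    await asyncio.gather(*list(_pending))
    return stored
```

Running `asyncio.run(demo())` returns only the event that reached the working sink, and the simulated outage raises nothing to the caller. The trade-off is that dropped events are invisible, so a real deployment would probably want at least a counter or a debug log in the `except` branch.
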