Hacker News (Chinese edition)

While building a payment orchestration system, I ran into a problem: most monitoring tools alert when a threshold is already breached (e.g. P95 > 1000ms). But in practice, systems often degrade before hitting those limits, especially under bursty traffic.

So I experimented with detecting degradation before thresholds are crossed, directly inside a FastAPI app. I built a small middleware that:

- Tracks P95 latency per route template (e.g. /users/{id})
- Learns a baseline dynamically from recent traffic
- Detects spikes using rate-of-change (not just static thresholds)
- Computes a 0–100 health score with trend direction (improving / stable / degrading)
- Stores events in Redis Streams for replay and debugging

One interesting result: in synthetic load tests (gradual latency ramp from ~200ms to ~1200ms over 60 seconds, with a P95 warning threshold at 1000ms), rate-of-change detection consistently surfaced degradation slightly before static threshold alerts. The window is small, but it was often enough to notice system stress before crossing alert thresholds.

Design constraints:

- Near-zero overhead on the request path (async, fire-and-forget writes)
- Must fail silently if Redis is unavailable
- No external monitoring stack required (runs in-app)

Example usage:

    app.add_middleware(RequestMetricsMiddleware, alert_engine=engine)

Context: This is part of a larger system I'm building that integrates cloud services with mobile money APIs (EcoCash, etc.), where partial failures and latency spikes are common. Still early: it hasn't been tested under real production traffic yet.

Curious how others are handling early degradation detection in FastAPI or similar systems.

Repo: https://github.com/Tandem-Media/fastapi-alertengine
PyPI: https://pypi.org/project/fastapi-alertengine/
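
For readers who want a concrete picture of the rate-of-change idea, here is a minimal sketch. It is not the code from fastapi-alertengine: the names `p95` and `RouteTrend`, the window sizes, the 1.25 spike ratio, and the 5% trend tolerance are all assumptions made for illustration. It computes a nearest-rank P95 per fixed-size window, compares it with a baseline averaged from the previous few windows, and reports both a ratio-based alert and the usual static-threshold alert.

```python
from collections import deque


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty sequence."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


class RouteTrend:
    """Hypothetical per-route tracker (illustrative only, not the library's API)."""

    def __init__(self, window=20, baseline_windows=5,
                 spike_ratio=1.25, static_threshold_ms=1000.0):
        self.window = window                    # requests per P95 window
        self.spike_ratio = spike_ratio          # current/baseline ratio that counts as a spike
        self.static_threshold_ms = static_threshold_ms
        self.current = []                       # latencies in the window being filled
        self.history = deque(maxlen=baseline_windows)  # P95 of recent completed windows

    def observe(self, latency_ms):
        """Record one latency; return a status dict when a window completes, else None."""
        self.current.append(latency_ms)
        if len(self.current) < self.window:
            return None

        value = p95(self.current)
        self.current = []

        # Baseline is learned from earlier windows; the first window has none,
        # so it is compared against itself (ratio 1.0).
        baseline = sum(self.history) / len(self.history) if self.history else value
        ratio = value / baseline if baseline > 0 else 1.0
        self.history.append(value)

        if ratio > 1.05:
            trend = "degrading"
        elif ratio < 0.95:
            trend = "improving"
        else:
            trend = "stable"

        return {
            "p95_ms": value,
            "baseline_ms": baseline,
            "trend": trend,
            "rate_alert": ratio >= self.spike_ratio,
            "static_alert": value > self.static_threshold_ms,
        }
```

In a middleware, one such tracker would be kept per route template and fed the measured latency of each request. With these assumed settings, a route whose P95 moves from 200ms to 400ms sets `rate_alert` while `static_alert` stays false, because 400ms is still far below the 1000ms threshold.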

Show HN: Detecting API degradation before thresholds are breached