Ask HN: As a developer, am I wrong to think monitoring alerts are mostly noise?
I'm a solo developer working on a new tool, and I need a reality check from the ops and infrastructure experts here.
My background is in software development, not SRE. From my perspective, the monitoring alerts that bubble up from our infrastructure have always felt like a massive distraction. I'll get paged for "High CPU" on a service and spend an hour digging through logs and dashboards, only to find it was a temporary traffic spike, not a real issue. It feels like a huge waste of developer time.
My hypothesis is that the tools we use are too focused on static thresholds (e.g., "CPU > 80%") and lack the context to tell us what's actually an anomaly. I've been exploring a different approach based on peer-group comparisons (e.g., is api-server-5 behaving differently from its peers api-server-1 through 4?).
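To make the hypothesis concrete, here's a minimal sketch of the peer-group idea in Python. The host names, CPU numbers, and the 3.5 cutoff are all made up for illustration; a real version would pull metrics from whatever backend you already run:

    import statistics

    # Hypothetical 1-minute CPU averages (%). In practice these would come
    # from a metrics backend (Prometheus, Datadog, etc.); names and numbers
    # here are invented.

    # Scenario A: one host diverging from its peers (a real anomaly).
    one_bad_host = {
        "api-server-1": 62.0,
        "api-server-2": 58.5,
        "api-server-3": 64.2,
        "api-server-4": 61.1,
        "api-server-5": 93.7,
    }

    # Scenario B: a uniform traffic spike -- everyone is busy, nothing is broken.
    traffic_spike = {f"api-server-{i}": 88.0 + i for i in range(1, 6)}

    STATIC_THRESHOLD = 80.0  # the classic "CPU > 80%" rule
    PEER_CUTOFF = 3.5        # modified z-score cutoff, a common outlier heuristic

    def static_alerts(samples):
        """Naive rule: page on every host above a fixed threshold."""
        return [h for h, v in samples.items() if v > STATIC_THRESHOLD]

    def peer_alerts(samples):
        """Page only on hosts that deviate strongly from their peer group.
        Uses the median and MAD (median absolute deviation) so a single
        outlier can't drag the baseline along with it."""
        values = list(samples.values())
        med = statistics.median(values)
        mad = statistics.median(abs(v - med) for v in values)
        if mad == 0:  # all peers identical -> nothing stands out
            return []
        # 0.6745 scales MAD so the score is comparable to a standard z-score
        return [h for h, v in samples.items()
                if abs(0.6745 * (v - med) / mad) > PEER_CUTOFF]

    for label, samples in [("one bad host", one_bad_host),
                           ("traffic spike", traffic_spike)]:
        print(label, "| static:", static_alerts(samples),
              "| peer:", peer_alerts(samples))

In the uniform-spike case the static rule pages on all five hosts while the peer comparison stays quiet, which is exactly the noise I'd like to cut; in the one-bad-host case both fire, but only the peer rule tells you the host is an outlier rather than merely busy.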
But I'm coming at this from a dev perspective and I'm very aware that I might be missing the bigger picture. I'd love to learn from the people who live and breathe this stuff.
How much developer time is lost at your company to investigating "false positive" infrastructure alerts?
Do you think the current tools (Datadog, Prometheus, etc.) create a significant burden for dev teams?
Is the idea of "peer-group context" a sensible direction, or are there better ways to solve this that I'm not seeing?
I haven't built much yet because I want to be sure I'm solving a real problem first. Any brutal feedback or insights would be incredibly valuable.