Ask HN: Is there software to track errors and bin them to root causes?

1 point by theamk 6 days ago
I work on a team that maintains an internal batch processing system. To keep service quality high, we centrally record all failures/errors, look at every one of them, and assign them to root cause tickets. A frequent failure gets fixed ASAP; one of those once-per-week sporadic failures gets prioritized and put in the next sprint. Sometimes a service breaks and there are dozens of failures (usually binned to one root cause ticket), but most of the time it is less than one failure per day.

Unfortunately, we have no good way to manage the failures: we are currently using custom scripts + JIRA and it does not work very well. We are happy to pay for an external service, but I simply cannot find anything!

Things like Datadog or Sentry deal in statistics and error groups... but we want to look at every failure to make sure nothing slips through the cracks. JIRA is too slow and limited. We even tried Google Sheets, but they do not scale.

Does anyone have a similar problem: tracking each individual failure, not just an aggregate/counter? What do you use?
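To make the requirement concrete, here is a minimal Python sketch of the workflow described above: every failure is stored individually, each one is binned to a root cause ticket, and anything not yet binned stays visible. This is only an illustration under assumptions, not the team's actual scripts; the names `Failure`, `FailureLog`, `untriaged`, and `by_root_cause`, and the ticket key `OPS-101`, are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failure:
    """One individual failure record (hypothetical schema)."""
    failure_id: str
    job: str
    message: str
    root_cause: Optional[str] = None  # ticket key; None until someone triages it


class FailureLog:
    """Keeps every failure, never just a count or an aggregate."""

    def __init__(self):
        self._failures = {}

    def record(self, failure):
        self._failures[failure.failure_id] = failure

    def assign(self, failure_id, ticket):
        # Bin one failure to a root cause ticket.
        self._failures[failure_id].root_cause = ticket

    def untriaged(self):
        # Failures nobody has looked at yet; this list should trend to empty.
        return [f for f in self._failures.values() if f.root_cause is None]

    def by_root_cause(self):
        # Group triaged failures by ticket to see which causes are frequent.
        groups = defaultdict(list)
        for f in self._failures.values():
            if f.root_cause is not None:
                groups[f.root_cause].append(f)
        return dict(groups)


if __name__ == "__main__":
    log = FailureLog()
    log.record(Failure("f1", "nightly-export", "timeout"))
    log.record(Failure("f2", "nightly-export", "timeout"))
    log.record(Failure("f3", "report-build", "disk full"))
    log.assign("f1", "OPS-101")  # hypothetical ticket key
    log.assign("f2", "OPS-101")
    print([f.failure_id for f in log.untriaged()])
    print({t: len(fs) for t, fs in log.by_root_cause().items()})
```

The part that grouping tools tend not to provide is the `untriaged` view: a list of individual failures that nobody has assigned yet, which is what keeps anything from slipping through the cracks.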