Production data access for L3 incidents is broken.
It's 2 AM, your biggest customer is down, and support knows exactly what database query will solve it.
But first: write a sanitization script, get legal approval, provision replica access, then wait for queries on a 20-minute-lagged replica that takes 8 minutes per query.
Three hours later, support finally starts debugging. The actual fix takes 30 minutes. Your SLA is already blown.
You're either the engineer frantically writing data masking scripts while a customer bleeds money, or you're the support person who knows the exact query to run but can't touch production.
We spend more time getting access to the data than actually fixing the problem. The whole system is backwards.
Anyone else dealing with this madness, or have you found a way that doesn't suck?