Paul Zaich from Checkr tells us about a critical outage that occurred, what caused it and how they tracked down and fixed the issue. The conversation ranges through troubleshooting complex systems, building team culture, blameless post-mortems, and monitoring the right things to make sure your applications don't fail or alert you when they do.
📆 2020-11-27 23:00 / ⌛ 00:47:22
📆 2020-11-26 17:00 / ⌛ 01:07:32
📆 2020-11-17 12:00 / ⌛ 00:53:10
📆 2020-11-10 12:00 / ⌛ 01:16:09
📆 2020-11-03 11:00 / ⌛ 01:03:13