Post-mortems are conducted in a blameless format to identify learnings and action items from the incident.
Time (UTC) | Event |
---|---|
11:24:24 PM | Oncall receives a low-urgency page due to automated E2E tests failing |
1:35:39 AM | Internal users raise the first reports of Dashboard issues |
1:55:56 AM | Oncall investigation notices 500 errors on /onboard and begins the incident process |
2:00:00 AM | Additional engineers join the incident room |
2:01:37 AM | The incident is classified as SEV2 |
2:13:35 AM | The Dashboard issue appears to be related to the API service. Oncall engineers use force deploy to roll out a previous API version that predates a suspicious commit |
2:19:00 AM | Further independent investigation confirms the suspicious API commit likely caused the issue |
2:23:00 AM | Oncall engineer notices that the deploy did not apply to the production cluster |
2:26:00 AM | Oncall engineer opens a PR to revert the problematic commit |
2:33:12 AM | Engineer identifies an issue with the force deploy action |
2:37:00 AM | Engineer lands a change to fix the force deploy action |
2:37:00 AM | Engineer reverts the problematic commit to kick off another deploy |
2:45:00 AM | The staging environment is reported to be working again |
2:46:00 AM | The reverted version is still not showing up in the production environment |
2:52:00 AM | Customers raise issues with the Dashboard in public Slack; the oncall engineer misinterprets the severity as the API being hard down |
2:55:00 AM | Oncall infrastructure engineer deletes and recreates the Kubernetes Deployment |
2:55:00 AM | The Deployment launches pods successfully |
2:57:00 AM | Incident severity is upgraded to SEV0. API is actually hard down due to an unknown issue. Investigation ensues. Public incident created on Status page. |
3:10:00 AM | Pods are not registering with their target groups. An engineer restarts the AWS Load Balancer Controller pod to force the pods to register with the target groups (see the sketch after the timeline) |
3:11:00 AM | Everything works again |
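For readers less familiar with the final remediation step, the sketch below shows one way the 3:10 AM "restart the AWS Load Balancer Controller pod" step can be performed with the Kubernetes Python client. It is illustrative only: the `kube-system` namespace and the `app.kubernetes.io/name=aws-load-balancer-controller` label selector are assumptions based on a default Helm installation of the controller, not details of our production cluster.

```python
# Illustrative sketch: "restart" the AWS Load Balancer Controller by deleting
# its pods so the controller's own Deployment schedules fresh replicas, which
# then re-reconcile the target-group registrations.
#
# Assumptions (not from our cluster config): default Helm install, i.e. the
# controller runs in kube-system with the label
# app.kubernetes.io/name=aws-load-balancer-controller.
from kubernetes import client, config

NAMESPACE = "kube-system"
LABEL_SELECTOR = "app.kubernetes.io/name=aws-load-balancer-controller"


def restart_lb_controller_pods() -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()

    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        # Deleting the pod is the "restart": the controller Deployment
        # recreates it, and the new pod re-syncs the API pods into the target groups.
        core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
        print(f"deleted {pod.metadata.name}; replacement should re-register targets")


if __name__ == "__main__":
    restart_lb_controller_pods()
```

A `kubectl rollout restart` of the controller Deployment would achieve the same effect; either way, the point is that a fresh controller pod re-registers the API pods with their target groups.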
The Stytch new account signup flow was down for just under four hours. The production Live API was unavailable or degraded for a total of 14 minutes. The Dashboard and SDK were unavailable for 16 minutes. The Test API was down for 25 minutes. Incident recovery cost the engineering team more than 35 hours of productivity, and code deploys were blocked for a full day.