Post-mortems are conducted in a blameless format to identify learnings and action items from the incident.

Timeline

| Time (UTC) | Event |
| --- | --- |
| 11:24:24 PM | Oncall receives a low-urgency page because automated E2E tests are failing |
| 1:35:39 AM | First internal indication of Dashboard issues, raised by internal users |
| 1:55:56 AM | Oncall investigation notices 500 errors on /onboard and begins the incident process |
| 2:00:00 AM | Additional engineers join the incident room |
| 2:01:37 AM | Incident is classified as a SEV2 incident |
| 2:13:35 AM | The Dashboard issue appears to be related to the API service. Oncall engineers use force deploy to roll out a previous version of the API that predates a suspicious commit |
| 2:19:00 AM | Further independent investigation confirms the API commit likely caused the issue |
| 2:23:00 AM | Oncall engineer notices that the deploy did not apply to the actual prod cluster |
| 2:26:00 AM | Oncall engineer opens a PR to revert the problematic commit |
| 2:33:12 AM | Engineer identifies an issue with the force deploy action |
| 2:37:00 AM | Engineer lands a change to fix the force deploy action |
| 2:37:00 AM | Engineer reverts the problem commit to kick off another deploy |
| 2:45:00 AM | Staging environment is reported to be working again |
| 2:46:00 AM | The reverted version is still not showing up in the Production environment (see the verification sketch after this timeline) |
| 2:52:00 AM | Customers raise issues with the Dashboard in public Slack; the oncall engineer misinterprets the severity as the API being hard down |
| 2:55:00 AM | Oncall infrastructure engineer deletes and recreates the Kubernetes Deployment |
| 2:55:00 AM | The Deployment launches pods successfully |
| 2:57:00 AM | Incident severity is upgraded to SEV0: the API is actually hard down due to an unknown issue. Investigation ensues. An incident is created on the status page: https://status.stytch.com/clj54mmoi309744b7oifn7k9sdr |
| 3:10:00 AM | Pods are not registering to target groups. Engineer restarts the AWS Load Balancer Controller pod to force the pods to register to the target group (see the restart sketch after this timeline) |
| 3:11:00 AM | All services are working again |
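
A recurring problem in the timeline was uncertainty over whether the rolled-back version had actually reached the production cluster (2:23 AM and 2:46 AM). As a purely illustrative sketch, the check below uses the Kubernetes Python client to compare the image the production Deployment is targeting against the expected revision, rather than trusting the deploy pipeline's exit status. The `prod` context, `api` Deployment name, `default` namespace, and commit SHA are hypothetical placeholders, not values from the incident.

```python
from kubernetes import client, config

# Point at the cluster that actually serves traffic. The "prod" context,
# "api" Deployment name, "default" namespace, and SHA below are hypothetical
# placeholders, not values taken from the incident.
config.load_kube_config(context="prod")
apps = client.AppsV1Api()

EXPECTED_SHA = "abc1234"  # short SHA of the revision you expect to be running

dep = apps.read_namespaced_deployment(name="api", namespace="default")
image = dep.spec.template.spec.containers[0].image

print(f"Desired image: {image}")
print(f"Replicas updated/ready: {dep.status.updated_replicas}/{dep.status.ready_replicas}")

if EXPECTED_SHA not in image:
    raise SystemExit("Cluster is not targeting the expected revision; the deploy never landed here.")
if (dep.status.updated_replicas or 0) < (dep.spec.replicas or 0):
    raise SystemExit("Rollout is still in progress; old pods may still be serving traffic.")
print("Production is running the expected revision.")
```

Checking `status.updated_replicas` alongside the image catches the case where the spec was updated but old pods are still serving traffic.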
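
The final remediation (3:10 AM) was restarting the AWS Load Balancer Controller so that the recreated pods would register with their target group. A minimal sketch of that kind of restart, assuming the controller runs as a Deployment named `aws-load-balancer-controller` in `kube-system` (its conventional install location) and using the Kubernetes Python client instead of `kubectl rollout restart`, might look like this; it is not the exact command run during the incident.

```python
from datetime import datetime, timezone
from kubernetes import client, config

config.load_kube_config()  # assumes kubeconfig access to the affected cluster
apps = client.AppsV1Api()

# Equivalent of `kubectl rollout restart`: patch the pod template with a
# restartedAt annotation so the Deployment replaces its pods.
patch = {
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    "kubectl.kubernetes.io/restartedAt": datetime.now(timezone.utc).isoformat()
                }
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="aws-load-balancer-controller",  # conventional name; verify in your cluster
    namespace="kube-system",              # conventional namespace; verify in your cluster
    body=patch,
)
print("Restart requested; watch the target group for newly registered targets.")
```

Patching the pod template's restartedAt annotation is the same mechanism `kubectl rollout restart` uses, so the controller's pods are replaced and its reconciliation loop starts fresh, which is presumably why target registration recovered.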

Impact

The Stytch new account signup flow was down for just under four hours. The production Live API was unavailable or degraded for a total of 14 minutes. The Dashboard and SDK were unavailable for 16 minutes. The Test API was down for 25 minutes. Incident recovery cost the engineering team more than 35 hours of productivity, and code deploys were blocked for a full day.

Causes

Whys

Action items (→ Linear)