Post-mortems are conducted in a blameless format to identify learnings and action items from the incident.
Time (UTC) | Event |
---|---|
11:24:24 PM | Oncall receives a low-urgency page due to automated E2E tests failing |
1:35:39 AM | Internal users raise the first reports of Dashboard issues |
1:55:56 AM | Oncall investigation notices 500 errors on /onboard and begins the incident process |
2:00:00 AM | Additional engineers join the incident room |
2:01:37 AM | The incident is classified as SEV2 |
2:13:35 AM | The Dashboard issue appears to be related to the API service. Oncall engineers use force deploy to roll out a previous API version that predates a suspicious commit |
2:19:00 AM | Further independent investigation confirms the suspicious API commit likely caused the issue |
2:23:00 AM | Oncall engineer notices that the deploy did not apply to the production cluster |
2:26:00 AM | Oncall engineer opens a PR to revert the problematic commit |
2:33:12 AM | Engineer identifies an issue with the force deploy action |
2:37:00 AM | Engineer lands a change to fix the force deploy action |
2:37:00 AM | Engineer reverts the problematic commit to kick off another deploy |
2:45:00 AM | The staging environment is reported to be working again |
2:46:00 AM | The reverted version is still not showing up in the production environment |
2:52:00 AM | Customers raise issues with the Dashboard in public Slack; the oncall engineer misinterprets the severity as the API being hard down |
2:55:00 AM | Oncall infrastructure engineer deletes and recreates the Kubernetes Deployment |
2:55:00 AM | The Deployment launches pods successfully |
2:57:00 AM | Incident severity is upgraded to SEV0. API is actually hard down due to an unknown issue. Investigation ensues. Public incident created on Status page. |
3:10:00 AM | Pods are not registering with their target groups. An engineer restarts the AWS Load Balancer Controller pod to force the pods to register with the target groups (see the sketch after the timeline) |
3:11:00 AM | Everything works again |
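For readers less familiar with the final remediation step, the sketch below shows one way the 3:10 AM "restart the AWS Load Balancer Controller pod" step can be performed with the Kubernetes Python client. It is illustrative only: the `kube-system` namespace and the `app.kubernetes.io/name=aws-load-balancer-controller` label selector are assumptions based on a default Helm installation of the controller, not details of our production cluster.

```python
# Illustrative sketch: "restart" the AWS Load Balancer Controller by deleting
# its pods so the controller's own Deployment schedules fresh replicas, which
# then re-reconcile the target-group registrations.
#
# Assumptions (not from our cluster config): default Helm install, i.e. the
# controller runs in kube-system with the label
# app.kubernetes.io/name=aws-load-balancer-controller.
from kubernetes import client, config

NAMESPACE = "kube-system"
LABEL_SELECTOR = "app.kubernetes.io/name=aws-load-balancer-controller"


def restart_lb_controller_pods() -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()

    pods = core.list_namespaced_pod(NAMESPACE, label_selector=LABEL_SELECTOR)
    for pod in pods.items:
        # Deleting the pod is the "restart": the controller Deployment
        # recreates it, and the new pod re-syncs the API pods into the target groups.
        core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
        print(f"deleted {pod.metadata.name}; replacement should re-register targets")


if __name__ == "__main__":
    restart_lb_controller_pods()
```

A `kubectl rollout restart` of the controller Deployment would achieve the same effect; either way, the point is that a fresh controller pod re-registers the API pods with their target groups.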
The Stytch new account signup flow was down for just under four hours. The production Live API was unavailable or degraded for a total of 14 minutes. The Dashboard and SDK were unavailable for 16 minutes. The Test API was down for 25 minutes. Incident recovery cost the engineering team more than 35 hours of productivity, and code deploys were blocked for a full day.