Post-mortems are conducted in a blameless format to identify learnings and action items from the incident.

Timeline

Time (UTC) Event
1:21 PM Infrastructure as Code change lands and begins deploying the removal of node groups in staging and production.
1:58 PM Infrastructure as Code change completes the land-and-deploy process to remove node groups.
3:06 PM Low Urgency Alerts fire for staging Karpenter error logs. On-call begins investigating; the issue is thought to be isolated to staging and low urgency because existing deployments were healthy and new pods were being scheduled.
4:48 PM Low Urgency Alerts fire for the production Karpenter error rate. On-call investigates, but the issue is thought to be low urgency because existing deployments were healthy, deploys were working, and new pods were being scheduled.
6:40 PM On-call engineer resumes investigation into root cause of earlier Karpenter alerts.
6:54 PM First existing staging node is marked as unhealthy by the Kubernetes Control Plane.
7:13 PM First existing prod node is marked as unhealthy by the Kubernetes Control Plane.
7:20 PM Deployment goes out to the API service.
7:26 PM Karpenter error rate spikes, and both existing and new pods run into scheduling issues in production. Incident response protocol is triggered. Deploys are frozen across all environments because deploys appear to have triggered the new errors.
7:29 PM Engineers attempt a hotfix in staging, which fails to resolve the issue.
7:44 PM Deployment Replicas Unavailable Alerts begin firing in production. Engineers refocus on mitigating production impact.
8:00 PM Engineers identify the node group change as the likely root cause and revert the change to spin node groups back up.
8:10 PM Support flags to on-call that the production Live and Test APIs are down.
8:15 PM New node groups are launched, but pods are not scheduled due to resource limits.
8:17 PM Dashboard and SDK go down.
8:18 PM Engineers remove the Crossplane application to allow manual intervention on the node groups. Taints are removed from the node group and nodes are resized to fast-track recovery (a sketch of these manual steps follows the timeline).
8:24 PM Production Live API service is restored.
8:28 PM Kubernetes Deployments are edited to remove the node group selector so applications can schedule on the modified node group.
8:34 PM Dashboard and SDK service is restored.
8:36 PM Production Test API service is restored. End-user impact is mitigated.
9:10 PM Engineers begin work on disaster recovery, including restoring the Karpenter service with a new IAM profile and removing the modifications made to infrastructure and Kubernetes deployments.
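
The manual mitigation at 8:18 PM and 8:28 PM amounted to removing the node group taints and dropping the node group selector from the affected Deployments. Below is a minimal sketch of those steps using the Kubernetes Python client; the taint key, node name, Deployment name, and namespace are hypothetical stand-ins, not the values used during the incident.

```python
# Hypothetical sketch only: taint key, node/Deployment names, and namespace are
# illustrative stand-ins, not the values used during the incident.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

TAINT_KEY = "dedicated-node-group"   # hypothetical taint applied to the node group
NAMESPACE = "production"             # hypothetical namespace

def remove_node_group_taint(node_name: str) -> None:
    """Drop the node group taint so workloads can land on the node (8:18 PM step)."""
    node = core.read_node(node_name)
    remaining = [t for t in (node.spec.taints or []) if t.key != TAINT_KEY]
    # Node taints have no strategic-merge key, so this patch replaces the whole list.
    core.patch_node(node_name, {"spec": {"taints": remaining}})

def drop_node_group_selector(deployment_name: str) -> None:
    """Remove the nodeSelector so the Deployment schedules on the modified node group (8:28 PM step)."""
    # A strategic merge patch with null deletes the nodeSelector field.
    body = {"spec": {"template": {"spec": {"nodeSelector": None}}}}
    apps.patch_namespaced_deployment(deployment_name, NAMESPACE, body)

if __name__ == "__main__":
    remove_node_group_taint("ip-10-0-1-23.ec2.internal")  # hypothetical node name
    drop_node_group_selector("live-api")                   # hypothetical Deployment name
```

The same intervention could be performed with kubectl taint and kubectl patch; the sketch only illustrates the shape of the manual changes described in the timeline.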

Impact

The production Live API was unavailable for 14 minutes, the dashboard/SDK was unavailable for 17 minutes, and the Test API was down for 25 minutes. The incident response and disaster recovery cost the engineering team 35+ hours of productivity. Additionally, engineers were not able to deploy code for a full day.

Causes

Whys

Action items (→ Linear)