Post-mortems are conducted in a blameless format to identify learnings and action items from the incident.
Time (UTC) | Event |
---|---|
1:21 PM | Infrastructure as Code change lands and begins deploying, removing node groups in staging and production. |
1:58 PM | Infrastructure as Code change to remove node groups finishes landing and deploying. |
3:06 PM | Low Urgency Alerts fire related to Staging Karpenter error logs. On-call begins investigation into low urgency alert for staging Karpenter issues. Issue is thought to be isolated to staging and low urgency because existing deployments were healthy and new pods were being scheduled. |
4:48 PM | Low Urgency Alerts fire related to prod Karpenter error rate. On-call investigates, but the issue is thought to be low urgency because existing deployments were healthy, deploys were working, and new pods were being scheduled. |
6:40 PM | On-call engineer resumes investigation into root cause of earlier Karpenter alerts. |
6:54 PM | First existing staging node is marked as unhealthy by the Kubernetes Control Plane. |
7:13 PM | First existing prod node is marked as unhealthy by the Kubernetes Control Plane. |
7:20 PM | Deployment goes out to the API service. |
7:26 PM | Karpenter error rate spikes and existing and new pods run into scheduling issues in production. Incident response protocol is triggered. Deploys are frozen across all environments because deploys appear to have triggered the new errors. |
7:29 PM | Engineers attempt a hotfix in staging, which fails to resolve the issue. |
7:44 PM | Deployment Replicas Unavailable Alerts begin firing in production. Engineers refocus on mitigating production impact. |
8:00 PM | Engineers identify node group change as likely root cause and revert change to spin node groups back up. |
8:10 PM | Support flags to on-call that the production Live and Test APIs are down. |
8:15 PM | New node groups are launched, but pods are not scheduled due to resource limits. |
8:17 PM | Dashboard and SDK go down. |
8:18 PM | Engineers remove the Crossplane application to allow manual intervention on the node groups. Taints are removed from the node group and nodes are resized to fast-track recovery. |
8:24 PM | Production Live API service is restored |
8:28 PM | Kubernetes Deployments are edited to remove the node group selector so applications can schedule on the modified node group. |
8:34 PM | Dashboard and SDK services are restored. |
8:36 PM | Production Test API service is restored. End-user impact is mitigated. |
9:10 PM | Engineers begin disaster recovery work, including restoring the Karpenter service with a new IAM instance profile and removing the modifications made to infrastructure and Kubernetes Deployments. |
The production Live API was unavailable for 14 minutes, the Dashboard/SDK were unavailable for 17 minutes, and the Test API was down for 25 minutes. The incident response and disaster recovery cost the engineering team 35+ hours of productivity. Additionally, engineers were not able to deploy code for a full day.
Why did the API/Dashboard/SDK go down?
The existing pods were terminated.
Why were the pods terminated?
The existing nodes were marked as “Not Ready” by the Kubernetes Control Plane.
Why were the nodes marked as “Not Ready”?
The IAM instance profile was removed, and removing it breaks any applications running on the instance, since the instance can no longer obtain AWS credentials.
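For context: applications and the kubelet on an EC2 node obtain their AWS credentials from the instance metadata service, which only serves them while an instance profile is attached. The sketch below (illustrative only, not something that was run during the incident) shows how to check from a node, via IMDSv2, whether an instance profile is still attached; credentials an application has already cached keep working until they expire, which is why the breakage is not immediate.

```python
import urllib.error
import urllib.request

IMDS = "http://169.254.169.254/latest"

# IMDSv2 requires a short-lived session token before any metadata reads.
token_request = urllib.request.Request(
    f"{IMDS}/api/token",
    method="PUT",
    headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
)
token = urllib.request.urlopen(token_request, timeout=2).read().decode()

# List the role(s) exposed through the attached instance profile.
# A 404 here means the instance no longer has an instance profile attached.
credentials_request = urllib.request.Request(
    f"{IMDS}/meta-data/iam/security-credentials/",
    headers={"X-aws-ec2-metadata-token": token},
)
try:
    roles = urllib.request.urlopen(credentials_request, timeout=2).read().decode()
    print("Instance profile role(s):", roles)
except urllib.error.HTTPError as err:
    print(f"No instance profile attached (HTTP {err.code})")
```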
Why was the instance profile removed?
When the “Delete Node Group” change was made in AWS, it triggered a cascade of changes, including draining the worker nodes (expected) and evicting their pods. Additionally, the service role deletes any resources associated with the node group (roles, policies, security groups, etc.), which in our case included the shared instance profile.
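One takeaway is that the IAM dependencies of a node group can be enumerated before it is deleted so that shared resources stand out. A minimal sketch with boto3, assuming placeholder cluster and node group names, covering only the role/instance-profile side of the cascade:

```python
import boto3

eks = boto3.client("eks")
iam = boto3.client("iam")

CLUSTER = "prod-cluster"      # placeholder name
NODEGROUP = "fallback-nodes"  # placeholder name

# Look up the IAM role attached to the managed node group.
nodegroup = eks.describe_nodegroup(
    clusterName=CLUSTER, nodegroupName=NODEGROUP
)["nodegroup"]
role_name = nodegroup["nodeRole"].split("/")[-1]
print(f"Node group {NODEGROUP} uses role {role_name}")

# Any instance profile wrapping that role is shared blast radius: other
# workloads launched with the same profile (e.g. self-managed Karpenter
# nodes) break if the profile is deleted along with the node group.
profiles = iam.list_instance_profiles_for_role(RoleName=role_name)
for profile in profiles["InstanceProfiles"]:
    print("Instance profile wrapping this role:", profile["InstanceProfileName"])
```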
Why didn’t we realize that deleting the node group would affect the instance profile or have other unintended effects?
The documentation notes that the aws-auth ConfigMap would be changed, but we mitigated this by overriding the aws-auth ConfigMap with Crossplane.
There is no in-depth documentation of all the subsequent API calls invoked when a DeleteNodegroup API call is initiated. This was confirmed when we reached out to AWS support, who filed a ticket to improve the documentation.
Why did we originally use the same instance profile for the node groups and the self-managed Karpenter nodes?
We took a (now regrettable) shortcut when setting up Karpenter and reused the same IAM instance profile as the node groups. We had planned to switch the instance profile over, but we never revisited that ticket.
How did we not realize that we had reused an IAM instance profile across resources managed by Crossplane and Karpenter?
We reference AWS resources managed by different IaC tools (Crossplane, Terraform, Karpenter), and there is no easy way to find the mapping of where a given IAM instance profile is used.
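A partial workaround is to scan for live usage of a given profile; this only covers running EC2 instances, not references inside Terraform or Crossplane code. A hedged sketch with boto3, using a placeholder profile ARN:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder ARN for the instance profile being traced.
PROFILE_ARN = "arn:aws:iam::123456789012:instance-profile/shared-node-profile"

# DescribeInstances can filter on the attached instance profile ARN, so both
# node-group-launched and Karpenter-launched instances using the profile show up.
paginator = ec2.get_paginator("describe_instances")
pages = paginator.paginate(
    Filters=[{"Name": "iam-instance-profile.arn", "Values": [PROFILE_ARN]}]
)

for page in pages:
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {tag["Key"]: tag["Value"] for tag in instance.get("Tags", [])}
            print(instance["InstanceId"], tags.get("Name", "<unnamed>"))
```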
Why did it take hours after the instance profile was removed for the nodes to be marked as unhealthy?
Most likely because the IAM credentials the instances had already retrieved from the instance metadata service remained valid for several hours after the profile was deleted; once those credentials expired, the nodes could no longer authenticate and were marked “Not Ready”.
Why did the Pod Disruption Budget (PDB) not help?
The nodes were being marked as not ready, and the pods were terminated because of that. A PDB only protects against voluntary disruptions (e.g. drains that go through the eviction API); it is not consulted when pods are removed due to a node failure.
Why were we not able to mitigate the impact if we were in an incident response motion before we went down?
Why were the node groups we recreated not correctly sized?
We had been using the node groups as a backup compute resource, so it's likely they were no longer sized to fit our workloads, which have changed over time.
Why did we focus on staging?
We had initially thought that the incident was only blocking deploys and not impacting the existing pods. We jumped into fully fixing the root cause instead of assessing the impact of the incident and mitigating it first.
Why did we wait so long to revert the node group change?
We hadn't identified the change as a possible cause, and we did not realize that all of our nodes were being marked as unhealthy and that we would run into a full service interruption. We had a large number of alerts firing but did not understand how severe the replica count issue had become.
Why did it take so long to modify the node groups?
We had to manually remove the Crossplane resources that manage the node groups so that Crossplane would not override changes we made in the UI.
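For future incidents, recent Crossplane versions also allow pausing reconciliation of a managed resource with the crossplane.io/paused annotation instead of removing the Crossplane resources outright. A minimal sketch using the Kubernetes Python dynamic client; the API group, kind, and resource name below are assumptions that depend on the provider in use:

```python
from kubernetes import config, dynamic
from kubernetes.client import api_client

config.load_kube_config()
client = dynamic.DynamicClient(api_client.ApiClient())

# Hypothetical group/version/kind and name; adjust to the actual provider CRD.
nodegroups = client.resources.get(
    api_version="eks.aws.upbound.io/v1beta1", kind="NodeGroup"
)

# The crossplane.io/paused annotation stops the provider from reconciling the
# resource, so manual changes made in the AWS console are not overwritten.
nodegroups.patch(
    name="prod-fallback-nodegroup",  # placeholder
    body={"metadata": {"annotations": {"crossplane.io/paused": "true"}}},
    content_type="application/merge-patch+json",
)
```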
Why did it take us 10 minutes to notify customers?
We update the status page manually, and we prioritized resolving the issue before assigning out that task. Additionally, it takes a couple of minutes to actually update the status page.
Did we test out the change in our staging environment?
We let the change bake for a bit, but we had expected that any issues would surface immediately rather than hours later.
Why did we not notice the impact when we landed the change?
There was no immediate impact, and because the change was made following best practices (IaC, code review, announcements in channels, pairing), we thought we were in the clear until the incident popped up.