On 2024-11-26 starting at 19:22 UTC, the Stytch API began to experience an increase in 500 errors. All authentication and fingerprinting routes were affected, and the downtime lasted 10 minutes.
This document outlines what caused the downtime and what we’re doing to prevent similar incidents going forward.
You can find this incident on our Status page.
Time to detection: 1 min
Time to identify cause: 4 min
Time to open incident: 6 min
Time to resolve: 10 min
| Time (UTC) | Event |
|---|---|
| 19:22 | First instance of failures appears in our API; all routes are affected and failures begin to rise rapidly. |
| 19:23 | Oncall team is paged. |
| 19:26 | Problem is identified; API deploys are locked and a force deploy back to the last known good commit begins. |
| 19:28 | Incident declared; see below for more discussion. |
| 19:29 | Force deploy lands; traffic begins to recover. |
| 19:32 | Last instance of failures. |
While working on a new feature, our team landed a change that updated the Stytch `projects` table and added a new field. The new column in the database was `DEFAULT NULL`, which meant that the value was initialized to `NULL` for all existing rows in the database.
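
For illustration, here is a minimal sketch of the kind of schema change involved. The column name (`example_field`), table column type, and connection details are placeholders; the actual field is not named in this postmortem.

```go
// A minimal sketch of the kind of schema change described above. The column
// name (example_field) and connection details are hypothetical placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/stytch")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// DEFAULT NULL means every existing row in projects carries NULL for the
	// new column until application code writes a value to it.
	_, err = db.Exec(`ALTER TABLE projects ADD COLUMN example_field VARCHAR(255) DEFAULT NULL`)
	if err != nil {
		log.Fatal(err)
	}
}
```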
In a subsequent change, application code was written to read and write to this column as a Go `string` instead of a `*string`. The Go MySQL driver we use was unable to interpret the existing null values as strings.
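
To make the failure mode concrete, here is a hedged reproduction of the read path, again using the hypothetical `example_field` column, a placeholder project ID, and the commonly used `github.com/go-sql-driver/mysql` driver; the real query and error handling in our API are not shown here.

```go
// Minimal reproduction of the failure mode: scanning a NULL column value into
// a Go string fails, while a *string (or sql.NullString) handles it cleanly.
// Column name, project ID, and connection details are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(localhost:3306)/stytch")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Scanning NULL into a plain string is rejected by database/sql:
	// "converting NULL to string is unsupported".
	var asString string
	err = db.QueryRow(`SELECT example_field FROM projects WHERE project_id = ?`, "project-test-123").Scan(&asString)
	fmt.Println("scan into string:", err)

	// Scanning the same NULL into a *string succeeds; the pointer is simply nil.
	var asPointer *string
	err = db.QueryRow(`SELECT example_field FROM projects WHERE project_id = ?`, "project-test-123").Scan(&asPointer)
	fmt.Println("scan into *string:", err, "| nil value:", asPointer == nil)
}
```

Scanning into a `*string` (or `sql.NullString`) represents the `NULL` case explicitly, which is why the `string` version of the code could not tolerate the freshly added column.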
The result was that any call in our API that relied upon reading from the `projects` table encountered a panic and failed. All authentication routes rely on this check.
Stytch services were completely down for ~10 minutes.
Our Oncall team identified the offending change within a minute of being paged, then locked deploys and initiated a forced deploy to the last known good commit.
The force deploy landed, and traffic began to recover within ~3 minutes.