Overview

On 2024-11-26, starting at 19:22 UTC, the Stytch API began returning an elevated rate of 500 errors. All authentication and fingerprinting routes were affected, and the downtime lasted 10 minutes.

This document outlines what caused the downtime and what we’re doing to prevent similar incidents going forward.

You can find this incident on our Status page.

Timeline

Time to detection: 1 min

Time to identify cause: 4 min

Time to open incident: 6 min

Time to resolve: 10 min

Time (UTC) Event
19:22 First instance of failures appears in our API; all routes are affected and failures begin to rise rapidly.
19:23 Oncall team is paged.
19:26 Problem is identified; API deploys are locked and a force deploy back to the last known good commit begins.
19:28 Incident declared; see below for more discussion.
19:29 Force deploy lands, traffic begins to recover.
19:32 Last instance of failures.

Cause

While working on a new feature, our team landed a change that updated the Stytch projects table and added a new field. The new column in the database was DEFAULT NULL, which meant that the value was initialized to NULL for all existing rows in the database.

In a subsequent change, application code was written to read from and write to this column as a Go string instead of a *string. A Go string cannot represent NULL, so the Go MySQL driver we use was unable to interpret the existing NULL values as strings.

The result was that any call in our API that relied upon reading from the projects table encountered a panic and failed. All authentication routes rely on this check.

Impact

Stytch services were completely down for ~10 minutes.

Recovery

Our Oncall team identified the offending change within minutes of being paged, then locked deploys and initiated a forced deploy to the last known good commit.

The force deploy landed about 3 minutes after it was initiated, at which point traffic began to recover.

Discussion