Overview

On 2024-08-12 starting at 20:10 UTC, the Stytch API began to experience elevated latency that caused API requests to time out, resulting in 503 and 499 errors. At 22:08 UTC traffic returned to normal.

This document outlines what caused the downtime and what we’re doing to prevent similar incidents going forward.

You can find this incident on our Status page.

Timeline

Time (UTC) Event
20:10 An expensive trigram creation transaction creates a backlog in the User creation/update flow in our Consumer API.
20:15 Database load starts to rise and requests start to time out.
20:25 Our 5XX Service Level Objective (SLO) alert fires and our on-call team is paged.
20:35 5XX error rates remain higher than normal but below the threshold for declaring an incident.
20:48 With 5XX errors growing and our database not catching up, an incident is declared.
21:23 A first mitigation is put in place to slow down CreateUser traffic and allow the database to recover. This mitigation does not resolve the issue.
21:26 The cause of the issue is identified: trigram creation and insertion.
21:59 A second mitigation is merged, removing the instigating trigger from our database.
22:08 The database recovers and traffic returns to normal.

Causes

The root cause of this incident was a slowdown in trigram creation and insertion during Stytch User creation and update flows.

When a User’s name fields are created or updated, our database will create trigrams of those name values. These trigrams are used to power our Search endpoint’s full_name_fuzzy feature.
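For illustration, the sketch below shows how trigrams might be generated from a name value and how two names might be compared by trigram overlap. This is a hypothetical Python sketch, not Stytch's implementation; the padding and scoring loosely mirror common trigram approaches such as Postgres's pg_trgm.

```python
def trigrams(value: str) -> set[str]:
    """Break a lowercased, space-padded string into 3-character chunks."""
    # Padding with spaces so the start and end of the word produce trigrams;
    # this is an assumption about the scheme, not Stytch's actual logic.
    padded = f"  {value.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def similarity(a: str, b: str) -> float:
    """Score fuzzy similarity as the overlap between two trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if ta or tb else 0.0

print(trigrams("Jane"))             # {'  j', ' ja', 'jan', 'ane', 'ne '}
print(similarity("Jane", "Jayne"))  # partial trigram overlap -> fuzzy match score
```

Because these trigrams are computed and stored at write time, a slow trigram step directly slows down every User create or update that touches a name field.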

During the incident, a particularly expensive transaction introduced a slowdown into this flow at 20:10 UTC. Once this slowdown was introduced, several cascading effects, primarily row lock contention, caused database load to climb rapidly as it attempted to “catch up” with the growing backlog.

[Figure: database load, expressed in average active sessions]

As database load grew, requests started to time out or were closed by clients. Retries of these flows by end users who were unable to log in further increased the backlog.

Once the backlog of transactions was drained, the database returned to normal.

Impact

During this incident we saw an increase in 5XX and 499 errors on any Consumer API route that touched User creation or User update. We also saw an increase in rate limit (429) errors as users retried throughout the incident.

We saw no downtime as a result of this incident in our B2B or Fraud APIs.

Examples of User creation or User update routes include: