Summary
On February 23, 2025 at approximately 9:12 PM UTC we started receiving reports of Assistant threads disappearing, and were quickly able to replicate the problem ourselves. A short investigation showed that the Read Replicas in multiple regions had fallen out of sync with the Primary database and were under moderate load. As a result, writes were acknowledged by the Primary, but due to replication lag the data was not yet present on the Read Replicas.
By that point several Read Replicas had already worked through their replication backlog and were back in sync with the Primary. Within 7 additional minutes all Read Replicas had caught up. Recovery required no additional work and happened without intervention.
Timeline
7:00 PM UTC | An engineer starts debugging a slow query that had previously been identified in PostgreSQL.
7:10 PM UTC | The nightly process that cleans up the table referenced by the slow query is discovered to have been failing for at least the last month; the table still holds data from several years ago.
7:28 PM UTC | The engineer decides to manually clean up the table, to avoid a situation where the nightly cleanup process (once fixed) causes an outage by deleting too many rows in a single statement. Deletes are run as separate statements of only 50 rows at a time, so the process can be aborted if it takes too many DB resources.
8:07 PM UTC | An index is added to a secondary table that holds a foreign key referencing the original table. This speeds up the deletes by avoiding a scan of the whole secondary table on every statement.
8:20 PM UTC | Read Replicas start to fall behind by a few seconds. Replication lag slowly increases.
9:12 PM UTC | First ticket comes in about Assistant threads disappearing.
9:17 PM UTC | Second ticket comes in about Assistant threads disappearing.
9:22 PM UTC | The issue is replicated by a Kagi team member.
9:23 PM UTC | The team member starts pulling engineers into a combined channel to investigate the issue with the Assistant threads.
9:24 PM UTC | The manual cleanup finishes.
9:34 PM UTC | The status.kagi.com page is updated.
9:37 PM UTC | The issue is identified: the Read Replicas are out of sync, and the source of the lag is the manual cleanup.
9:44 PM UTC | Read Replicas are back in sync with the Primary.
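The batched-deletion approach described in the timeline can be sketched as follows. This is a minimal illustration using SQLite for self-containment (the production system is PostgreSQL), and the table name `stale_events` is hypothetical:

```python
import sqlite3

def delete_in_batches(conn, batch_size=50):
    """Delete stale rows in small batches so each statement stays cheap
    and the job can be aborted between batches if DB load spikes."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM stale_events WHERE id IN "
            "(SELECT id FROM stale_events LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break
        total += cur.rowcount
        # In production: pause / check replica health here before continuing.
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stale_events (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO stale_events (id) VALUES (?)",
                 [(i,) for i in range(175)])
print(delete_in_batches(conn))  # → 175 (deleted 50 + 50 + 50 + 25)
```

Small batches keep each statement's lock time and WAL volume bounded on the Primary, but as this incident shows, many small batches still add up to a large replication stream for the replicas to replay.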
Root Cause
Why were Assistant threads disappearing and other queries failing?
We run a replicated PostgreSQL infrastructure for low latency access to user account data across multiple regions. We have one Primary that accepts writes, and Read Replicas in each region that handle (most) reads. This allows us to scale reads by adding more replicas, and more regions, as needed.
The catch with this style of scaling is that the Read Replicas MUST stay up to date with the Primary. If a Read Replica falls more than a few seconds behind, common database access patterns break down: a write is acknowledged by the Primary but is not yet reflected in the Read Replica by the time follow-up queries arrive.
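The read-after-write failure mode described above can be demonstrated with a toy model (this is an illustration, not our actual infrastructure): writes are acknowledged by the primary immediately, but the replica only sees them once replication has caught up.

```python
from collections import deque

class ReplicatedStore:
    """Toy primary/replica model: writes are acknowledged by the primary
    at once, but reach the replica only when replicate() runs. While
    writes are pending, replica reads return stale (or missing) data."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.pending = deque()  # replication stream, oldest first

    def write(self, key, value):
        """Write goes to the primary and is acknowledged immediately."""
        self.primary[key] = value
        self.pending.append((key, value))

    def read(self, key):
        """Reads are served by the replica."""
        return self.replica.get(key)

    def replicate(self, n=1):
        """Apply up to n pending writes to the replica."""
        for _ in range(min(n, len(self.pending))):
            key, value = self.pending.popleft()
            self.replica[key] = value

store = ReplicatedStore()
store.write("thread:42", "hello")
print(store.read("thread:42"))  # → None: acknowledged, but not yet replicated
store.replicate()
print(store.read("thread:42"))  # → hello: replica has caught up
```

During this incident, Assistant threads "disappeared" for exactly this reason: the thread data existed on the Primary, but reads served by a lagging replica could not see it yet.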
Why were the Read Replicas lagging behind on replication?
They were replaying a large volume of deletes that had already been run on the Primary database. These deletes required more processing than normal because foreign key references also needed to be updated.
Why were so many deletes happening in the Primary?
A table was being cleaned up manually because the automated cleanup had been failing for a long time. The work was split into small batches of 50 rows at a time to avoid a single large delete statement causing issues.
Why didn’t the cleanup get paused at the first sign of trouble?
The Read Replicas did not have any replication issues for almost an hour during the cleanup. This changed near the end of the cleanup, when all the Read Replicas started falling behind.
Why didn’t the engineer doing the cleanup get notified?
There isn’t any monitoring set up specifically for replication lag, and the other alerts we use to monitor the PostgreSQL Primary instance never got near their configured alerting thresholds.
Resolution and Recovery
As the lag was caused by a manual process that had already finished, the only thing to do was wait. The cluster recovered by itself.
Corrective and Preventative Measures
The main thing we are doing to prevent this problem in the future is adding an alert for replication lag. Had such an alert existed, the engineer would have been notified and could have spaced out the cleanup queries, giving the Read Replicas time to catch up.
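A minimal sketch of what such a check could look like, assuming SQL access to each standby; the threshold value and alert wiring are placeholders, not our actual configuration. On a PostgreSQL standby, `now() - pg_last_xact_replay_timestamp()` approximates the replay lag:

```python
# Query to run on each PostgreSQL standby (assumed deployment detail);
# returns the approximate replay lag in seconds.
REPLICA_LAG_SQL = """
    SELECT COALESCE(
        EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp()),
        0)
"""

def check_replication_lag(lag_seconds, threshold_seconds=10.0):
    """Return an alert message when a replica's replay lag exceeds the
    threshold, otherwise None. The 10s default is a placeholder."""
    if lag_seconds > threshold_seconds:
        return (f"replication lag {lag_seconds:.1f}s exceeds "
                f"threshold {threshold_seconds:.1f}s")
    return None

print(check_replication_lag(3.2))   # None: within threshold
print(check_replication_lag(84.0))  # alert string: lag well over threshold
```

In practice the lag value would be fetched per replica on a schedule and the returned message routed to the on-call channel.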
The second thing we will be doing is watching both the Read Replicas and the Primary instance during this kind of manual work, to ensure that actions taken on the Primary do not adversely affect the Read Replicas.