Hi,
I'm Daniel, the SRE (Site Reliability Engineer) at Kagi. Working with me on this incident were Zac and Seth, both Senior Engineers.
Summary
Search, Assistant, and the API were completely offline, returning 500 or “stream timeout” errors to everyone who tried to access them. This was caused by a database migration holding a transaction lock for the duration of the outage. Once the transaction was aborted, every affected service fully recovered.
Timeline
3:09 AM UTC - The process to deploy new code to production was started
3:09 AM UTC - The DB migration was started in a transaction
3:13 AM UTC - The us-east Assistant alert fired, followed shortly by alerts for every region, as well as for Search and the API
3:15 AM UTC - Investigation began
3:18 AM UTC - The engineers involved joined a shared channel to diagnose the issue
3:18 AM UTC - First user reports of Kagi.com being down came in (thanks to Pasithea (Pas) from Discord!)
3:24 AM UTC - Transaction locking was identified as a probable cause of the failed requests
3:30 AM UTC - The transaction was stopped by an automated process and Kagi.com recovered
3:32 AM UTC - The hung migration was identified as the source of the transaction lock
Root Cause
Why was Kagi.com offline?
Any request that needed to fetch information from the database was timing out, taking Search, Assistant, and the API offline.
Here are the 5 whys:
Q: Why could the information not be fetched from the database?
A: The connection pool was full: it had slowly filled up with connections waiting on a specific transaction to finish. Because we cap the number of connections each VM is allowed to open to the database, no connections were left to handle user queries.
Q: Why were the connections blocked by a transaction?
A: A running migration had locked an entire table inside a long-running transaction. Any query against this table had to wait for that transaction to finish before it could fetch data (a sketch of how such blocked sessions can be spotted follows the five whys).
Q: Why was the migration taking so long?
A: The migration was building an index to let users run full-text search over their Assistant interactions. This involved reading the messages from disk, indexing them, then writing the index back to disk (a sketch of this kind of migration also follows the five whys).
Q: Why did we not expect it to take long?
A: We followed the recommended and documented way to set this index up, and when the migration ran in our staging environment it finished without timing out or blocking connections.
Q: Why was running it in staging different from production?
A: Our staging database is not a good representation of production, as it stores far less data.
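To make the index-building answer above more concrete, here is a minimal sketch of the kind of migration involved. It assumes a PostgreSQL database, the psycopg2 driver, and an illustrative assistant_messages(body) table; none of these names, nor our actual migration tooling, appear in this write-up.

```python
# Hypothetical sketch: a full-text-search index built inside one transaction.
# Assumptions: PostgreSQL, psycopg2, and an illustrative assistant_messages(body) table.
import psycopg2

CREATE_FTS_INDEX = """
CREATE INDEX idx_assistant_messages_fts
    ON assistant_messages
    USING gin (to_tsvector('english', body));
"""

def run_migration(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # one transaction wrapping the whole migration
            with conn.cursor() as cur:
                # Building the index reads every row, so on a large production
                # table this statement can run for a long time, and the lock it
                # takes on the table is held until the transaction commits.
                cur.execute(CREATE_FTS_INDEX)
    finally:
        conn.close()
```

On a small staging table a statement like this finishes almost instantly, which is exactly why the problem never surfaced before production.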
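And here is a minimal sketch of how the resulting pile-up can be spotted, again assuming PostgreSQL and psycopg2: pg_stat_activity combined with pg_blocking_pids() lists which sessions are waiting and which transaction they are waiting on. This is a general diagnostic pattern, not necessarily the tooling we used during the incident.

```python
# Hypothetical diagnostic sketch: list sessions blocked behind another backend,
# together with the age of the blocking transaction. Assumes PostgreSQL and psycopg2.
import psycopg2

BLOCKED_SESSIONS = """
SELECT
    blocked.pid                  AS blocked_pid,
    blocked.query                AS blocked_query,
    blocking.pid                 AS blocking_pid,
    blocking.query               AS blocking_query,
    now() - blocking.xact_start  AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocking_xact_age DESC;
"""

def list_blocked_sessions(dsn: str):
    """Return rows describing which queries are stuck behind which transactions."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(BLOCKED_SESSIONS)
            return cur.fetchall()

if __name__ == "__main__":
    # "dbname=kagi" is a placeholder DSN, not a real connection string.
    for row in list_blocked_sessions("dbname=kagi"):
        print(row)
```

A single long transaction shows up here as one blocking_pid repeated across many rows, which is the same picture as the connection pool slowly filling with waiting connections described above.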
Resolution and Recovery
Once the migration was stopped, the blocked connections in the connection pool were able to finish their work. This freed those connections to serve searches, the API, and the Assistant again.
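The timeline only says the transaction was stopped by an automated process; the exact mechanism is not described here, so the following is purely a hypothetical sketch of one common shape such a safeguard can take (assuming PostgreSQL and psycopg2): terminate any backend whose transaction has been open longer than a limit.

```python
# Hypothetical sketch of a long-transaction watchdog (not our actual tooling).
# Assumes PostgreSQL and psycopg2; pg_terminate_backend() needs sufficient privileges.
import psycopg2

TERMINATE_LONG_TRANSACTIONS = """
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > %s::interval
  AND pid <> pg_backend_pid();
"""

def kill_long_transactions(dsn: str, max_age: str = "15 minutes"):
    """Terminate backends whose transaction has been open longer than max_age (illustrative policy)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TERMINATE_LONG_TRANSACTIONS, (max_age,))
            return cur.fetchall()
```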
After verifying that everything was working as expected and nothing else needed to be done, we flipped a switch in our CI/CD system to block any further production deployments until the migration is removed.
Once we have verified that the bad migration has been removed, we will turn the switch back off, and allow production deployments to continue.
Corrective and Preventative Measures
The root cause of the issue was that our staging environment stores far less data than production. Because of that difference, staging cannot surface certain kinds of issues before they reach production.
We will fill our staging database with enough data to make it comparable to production, so that future migrations and changes to our SQL queries are exercised against a realistic amount of data.
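As a rough illustration of what filling staging up can look like, here is a hypothetical sketch that pads an illustrative table with synthetic rows, assuming PostgreSQL and psycopg2. In practice this might instead be done by loading anonymized production data or with dedicated data-generation tooling; the table name and approach here are assumptions, not our actual plan.

```python
# Hypothetical sketch: bulk-load synthetic rows into a staging table so that
# migrations run against production-like volumes. Assumes PostgreSQL, psycopg2,
# and an illustrative assistant_messages(body) table.
import psycopg2

FILL_STAGING = """
INSERT INTO assistant_messages (body)
SELECT 'synthetic message ' || g
FROM generate_series(1, %s) AS g;
"""

def fill_staging(dsn: str, rows: int = 10_000_000) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(FILL_STAGING, (rows,))
```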
As the transaction had already been cancelled and rolled back, we did not need to make any changes in production to clean up a partially changed table.
Thanks for being patient with us during this outage. As always, we are blown away by the positive community response when we have these kinds of issues.
Thank you!
Daniel