Hi,
I'm Daniel, the SRE (Site Reliability Engineer) at Kagi. Working with me on this incident were Zac and Seth, both Senior Engineers.
Summary
Search, Assistant, and the API were completely offline, returning 500 or “stream timeout” errors to everyone who tried to access them. This was caused by a database migration holding a transaction lock for the duration of the outage. Once the transaction was aborted, every affected service fully recovered.
Timeline
3:09 AM UTC - The process to deploy new code to production was started
3:09 AM UTC - The DB migration was started in a transaction
3:13 AM UTC - The us-east Assistant alert fired, followed shortly by alerts for every region, as well as for Search and the API
3:15 AM UTC - Investigation began
3:18 AM UTC - The engineers involved joined a shared channel to diagnose the issue
3:18 AM UTC - First user reports of Kagi.com being down came in (thanks to Pasithea (Pas) from Discord!)
3:24 AM UTC - Transaction locking was identified as a probable cause of the failed requests
3:30 AM UTC - The transaction was stopped by an automated process and Kagi.com recovered
3:32 AM UTC - The hung migration was identified as the source of the transaction lock
Root Cause
Why was Kagi.com offline?
Any request that needed to fetch information from the database was timing out, taking Search, Assistant, and the API offline.
Here are the 5 whys:
Q: Why could the information not be fetched from the database?
A: The connection pool was full: it had slowly filled up with connections waiting on a specific transaction to finish. Because we cap the number of connections each VM is allowed to open to the database, no connections were left to handle user queries.
Q: Why were the connections blocked by a transaction?
A: A running migration had locked an entire table inside a long-running transaction. Any query against this table had to wait for that transaction to finish before it could fetch data (a sketch of how such blocked sessions can be spotted follows the five whys).
Q: Why was the migration taking so long?
A: The migration was building an index to let users run full-text search over their Assistant interactions. This involved reading the messages from disk, indexing them, then writing the index back to disk (a sketch of this kind of migration also follows the five whys).
Q: Why did we not expect it to take long?
A: We followed the recommended and documented way to set this index up, and when the migration ran in our staging environment it finished without timing out or blocking connections.
Q: Why was running it in staging different from production?
A: Our staging database is not a good representation of production, as it stores far less data.
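To make the index-building answer above more concrete, here is a minimal sketch of the kind of migration involved. It assumes a PostgreSQL database, the psycopg2 driver, and an illustrative assistant_messages(body) table; none of these names, nor our actual migration tooling, appear in this write-up.

```python
# Hypothetical sketch: a full-text-search index built inside one transaction.
# Assumptions: PostgreSQL, psycopg2, and an illustrative assistant_messages(body) table.
import psycopg2

CREATE_FTS_INDEX = """
CREATE INDEX idx_assistant_messages_fts
    ON assistant_messages
    USING gin (to_tsvector('english', body));
"""

def run_migration(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    try:
        with conn:  # one transaction wrapping the whole migration
            with conn.cursor() as cur:
                # Building the index reads every row, so on a large production
                # table this statement can run for a long time, and the lock it
                # takes on the table is held until the transaction commits.
                cur.execute(CREATE_FTS_INDEX)
    finally:
        conn.close()
```

On a small staging table a statement like this finishes almost instantly, which is exactly why the problem never surfaced before production.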
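And here is a minimal sketch of how the resulting pile-up can be spotted, again assuming PostgreSQL and psycopg2: pg_stat_activity combined with pg_blocking_pids() lists which sessions are waiting and which transaction they are waiting on. This is a general diagnostic pattern, not necessarily the tooling we used during the incident.

```python
# Hypothetical diagnostic sketch: list sessions blocked behind another backend,
# together with the age of the blocking transaction. Assumes PostgreSQL and psycopg2.
import psycopg2

BLOCKED_SESSIONS = """
SELECT
    blocked.pid                  AS blocked_pid,
    blocked.query                AS blocked_query,
    blocking.pid                 AS blocking_pid,
    blocking.query               AS blocking_query,
    now() - blocking.xact_start  AS blocking_xact_age
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid))
ORDER BY blocking_xact_age DESC;
"""

def list_blocked_sessions(dsn: str):
    """Return rows describing which queries are stuck behind which transactions."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(BLOCKED_SESSIONS)
            return cur.fetchall()

if __name__ == "__main__":
    # "dbname=kagi" is a placeholder DSN, not a real connection string.
    for row in list_blocked_sessions("dbname=kagi"):
        print(row)
```

A single long transaction shows up here as one blocking_pid repeated across many rows, which is the same picture as the connection pool slowly filling with waiting connections described above.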
Resolution and Recovery
Once the migration was stopped, the blocked connections in the connection pool were able to finish their work. This freed those connections to serve searches, the API, and the Assistant again.
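The timeline only says the transaction was stopped by an automated process; the exact mechanism is not described here, so the following is purely a hypothetical sketch of one common shape such a safeguard can take (assuming PostgreSQL and psycopg2): terminate any backend whose transaction has been open longer than a limit.

```python
# Hypothetical sketch of a long-transaction watchdog (not our actual tooling).
# Assumes PostgreSQL and psycopg2; pg_terminate_backend() needs sufficient privileges.
import psycopg2

TERMINATE_LONG_TRANSACTIONS = """
SELECT pid, pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > %s::interval
  AND pid <> pg_backend_pid();
"""

def kill_long_transactions(dsn: str, max_age: str = "15 minutes"):
    """Terminate backends whose transaction has been open longer than max_age (illustrative policy)."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(TERMINATE_LONG_TRANSACTIONS, (max_age,))
            return cur.fetchall()
```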
After verifying that everything was working as expected and nothing else needed to be done, we flipped a switch in our CI/CD system to block any further production deployments until the migration is removed.
Once we have verified that the bad migration has been removed, we will turn the switch back off, and allow production deployments to continue.
Corrective and Preventative Measures
The root cause of the issue was that our staging environment stores far less data than production. Because of that difference, staging cannot surface certain kinds of issues before they reach production.
We will fill our staging database with enough data to make it comparable to production, so that future migrations and changes to our SQL queries are exercised against a realistic amount of data.
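As a rough illustration of what filling staging up can look like, here is a hypothetical sketch that pads an illustrative table with synthetic rows, assuming PostgreSQL and psycopg2. In practice this might instead be done by loading anonymized production data or with dedicated data-generation tooling; the table name and approach here are assumptions, not our actual plan.

```python
# Hypothetical sketch: bulk-load synthetic rows into a staging table so that
# migrations run against production-like volumes. Assumes PostgreSQL, psycopg2,
# and an illustrative assistant_messages(body) table.
import psycopg2

FILL_STAGING = """
INSERT INTO assistant_messages (body)
SELECT 'synthetic message ' || g
FROM generate_series(1, %s) AS g;
"""

def fill_staging(dsn: str, rows: int = 10_000_000) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(FILL_STAGING, (rows,))
```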
As the transaction had already been cancelled and rolled back, we did not need to make any changes in production to clean up a partially changed table.
Thanks for being patient with us during this outage. As always, we are blown away by the positive community response when we have these kinds of issues.
Thank you!
Daniel