Resolved: Kagi.com is unstable for all regions
This is Zac, the Tech Lead at Kagi. I’m going to be sharing below a more in-depth post-mortem of our service interruption last week. Assisting me in responding to this incident was Seth, one of our senior engineers, and Luan, our DevOps engineer.
This will be fairly technical; you can skip to “Next Steps” for takeaways. The summary is that we were the target of some actors misusing the service and exploiting a pathological case in our infrastructure. We immediately released mitigations and are working on improvements in several areas of our code and communications.
On January 12th, at approximately 5:30 PM UTC, the team became aware of an infrastructure issue by way of our internal monitoring and user reports. The issue was causing slow page loads or complete page timeouts for users in various regions.
This incident unfortunately took us quite some time to resolve - we deeply thank our users for their patience, and the opportunity to give some background as to what was going on, and how we plan to move forward.
At first, by what turned out to be a complete coincidence, the incident occurred at precisely the same time that we were performing an infrastructure upgrade to our VMs with additional RAM resources. Our monitoring was reporting both high latency and issues with our application’s database connection pool. While no code changes were part of this upgrade, and “old” instances were reporting some issues as well, we decided the first course of action was to revert this change.
This revert completed at around 6:50 PM UTC, but as we continued to monitor, the issue persisted. Meanwhile, we had been inspecting the behavior of our application’s database connection pools, which were saturated with connections to our primary DB instance. The exact cause was not yet clear, but what was clear was that the total number of connections being established globally to our primary exceeded its maximum configured connection limit.
Our next move was to evaluate whether we had somehow caught ourselves in a “spiral” of exhausting our maximum connections, wherein any instance that we replaced would simply get its connection pool exhausted again, queuing for access to the primary. In several steps we tried replacing a few instances to see what effect reducing the congestion would have. We were also making progress on evaluating various parts of the database’s internal health and query performance.
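To illustrate the connection-limit problem described above, here is a minimal sketch of the arithmetic. The region names, pool sizes, and limit are made-up numbers for illustration only, not Kagi's actual configuration; the function names are hypothetical.

```python
def total_pool_demand(pool_sizes):
    """Sum the maximum connections every application instance may open."""
    return sum(pool_sizes.values())


def exceeds_primary_limit(pool_sizes, max_connections, reserved=0):
    """Return True if fully saturated pools would exceed the primary's limit.

    `reserved` models connections held back for admin or replication use
    (an assumption for this sketch).
    """
    return total_pool_demand(pool_sizes) > max_connections - reserved


# Hypothetical figures: three regions, each instance with a pool of 40.
example_pools = {"us": 40, "eu": 40, "asia": 40}
print(exceeds_primary_limit(example_pools, max_connections=100))
```

When every pool is saturated at once, the combined demand (120 in this made-up case) is what the primary sees, regardless of how healthy each individual instance looks in isolation.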
With mild signs that cycling some instances was helping, at 9:30PM UTC we decided to pause all user traffic by redirecting users to our status page to give the database a break and completely reset all connection pools in one shot. We installed the redirect and issued a restart of all nodes. Once all appeared stable, we started letting traffic back in again. Unfortunately, the issue persisted.
While looking at the database state, it became clear to our engineers that the root cause was in fact high contention on rows in the users table. This contention caused a steep increase in write latency, which in turn put backpressure on our application’s connection pool, causing it to eventually exhaust all available connections as writes were taking too long to complete. The writes were all stemming from one instance that would eventually starve the rest of our global instances of access to our primary, thus causing disruption in all other regions.
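The chain from write latency to pool exhaustion can be approximated with Little's law (average in-flight requests = arrival rate × time each request takes). The sketch below uses invented numbers purely to show the shape of the effect; it is not a model of Kagi's real traffic or pool settings.

```python
def connections_needed(writes_per_second, write_latency_seconds):
    """Little's law: average number of writes in flight at any moment."""
    return writes_per_second * write_latency_seconds


def is_pool_exhausted(writes_per_second, write_latency_seconds, pool_size):
    """A pool is exhausted once in-flight writes exceed its size."""
    return connections_needed(writes_per_second, write_latency_seconds) > pool_size


# Hypothetical: 50 writes/s against a pool of 20 connections.
healthy = is_pool_exhausted(50, 0.010, pool_size=20)    # 10 ms per write
contended = is_pool_exhausted(50, 2.0, pool_size=20)    # 2 s per write under row contention
print(healthy, contended)
```

With the request rate held constant, a jump in per-write latency alone is enough to push the number of held connections past the pool size, which matches the backpressure described above.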
This didn’t exactly come as a surprise to us, as for the entirety of Kagi’s life so far we have actually used the cheapest, single-core database available to us on GCP! To this day we’ve yet to exceed 50% load capacity on it, which we’ve worked hard to achieve. This has always carried the risk of the database being relatively easy to knock over, but we have so far kept load and latency under control with redundancy, distributed read-replicas, and high scrutiny over the SQL we write.
At the same time, we issued a hotfix that disabled the particular writes that were causing high contention, in addition to some upgrades of our database driver, which included several relevant bug fixes to connection pool health. This would help ensure that the same pathological case could not immediately be exploited again.
By midnight, the issue was fully resolved. The team continued to closely monitor for any signals that the actors were returning, but they did not.
We were later in contact with the owner of an account that we blocked, who claimed they were using the account to perform automated scraping of our results, which is not something our terms allow for.
There’s a lot we took away from this incident, and we have immediate plans already in motion to make our system more robust to this type of abuse, as well as our communication processes around incidents.
First, regretfully, we were not very prompt in updating our status page. We owe it to our customers to be quick and transparent about any issues going on that might affect access to the product that they pay for. To address this, we are moving our status page to a platform that will more easily allow us to expose some of our automated internal monitoring to users. This way users have an idea of the platform’s health in real time, even if our small team of engineers has their hands too full to immediately post an update (which was not a very fast process to begin with, even if we were on top of it). We should have this available by next week.
Secondly, we have directly mitigated the queries causing issues under load. With this issue in mind, we are also running load tests to learn about any other similar deficiencies that may still exist, and what to avoid in the future. We are also installing some additional monitoring to more quickly point us to the right place in our infra, and hopefully not waste as much time chasing a false-flag as we did this time.
Lastly, we are tightening the systems we have set up to detect this kind of abuse of our terms, which were clearly too lax. Besides any potential performance impact, this vector also directly costs us money, as we pay for each search. To protect our finances, and all of our subscribers’ continued access to unlimited (organic) searches, we need to set some automated limits to help us enforce this. From analyzing our users’ usage, we have picked some limits that no good-faith user of Kagi should reasonably hit.
These new limits should already be in place by the time of this post, and we will monitor their impact and continue to tune them as needed. If you believe you find yourself wrongly unable to access Kagi, please reach out to email@example.com.
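As a rough illustration of what an automated per-account limit can look like, here is a minimal sliding-window limiter. This is a generic sketch, not Kagi's implementation: the class name, the thresholds, and the choice of a sliding window are all assumptions made for the example.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `max_requests` per account within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # account_id -> timestamps of accepted requests

    def allow(self, account_id, now):
        """Return True and record the request if the account is under its limit.

        `now` is a timestamp in seconds, passed in explicitly so the
        behaviour is deterministic and easy to test.
        """
        hits = self._hits[account_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical limit: 3 searches per 60 seconds per account.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
results = [limiter.allow("acct-1", t) for t in (0, 1, 2, 3)]
print(results)
```

A limit set well above what ordinary interactive use produces leaves good-faith users unaffected while capping the cost of automated scraping.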
Thank you so much for bearing with us through this incident! Please look forward to a more robust service as we implement these things, and as usual, more features & improvements are on their way.
All services are now operating as normal. Thank you for your patience while we resolved this issue.
Traffic has been restored and we are continuing to monitor the service as it comes back to full health.
In order to fully restore stability we will need to pause traffic momentarily. We will be redirecting users to this page while we restore load to the service in a controlled manner. We will follow up with further details as the situation progresses.
We are reverting a configuration change that we believe to be the culprit, and are continuing to monitor as service is coming back to full health.
We are experiencing issues following a deployment. The team is working on resolving this.