We experienced a prolonged surge of incoming connections to our HTTP API from 00:50 UTC on March 30th. This resulted in a severe increase in connections to our frontend load balancers which at 01:00 reached the maximum connection limit configured and up to 30% of connection attempts to our HTTP API began timing out. Our render API was also affected, with up to 25% of connection attempts to our graph render endpoints timing out. We stabilised the HTTP ingestion and Render API at 03:40 after we expanded the load balancing layer to deal with the excess connections.
We use a set of load balancers to direct traffic to the various backends in our ingestion layer. These load balancers use HAProxy to balance traffic to our HTTP ingestion endpoint on a round robin basis. Different types of connections are routed to different backends, so anything HTTP-like will be routed to our HTTP API (including Render API requests) so that different types of requests are isolated and issues with one endpoint should not affect the others.
Our services are not hosted in AWS, but our DNS services and some of the health checking of our frontend HTTP load balancers is.
At 01:01 UTC we were first alerted that some of our external canaries for our HTTP ingestion API were reporting less than 100% success rate. Specifically, up to 30% of connection attempts to our HTTP ingestion API were timing out. At this time we see no indication that any spike in ingested data is to blame. We focus our efforts into stabilising our ingestion API, failing to notice that there was also a reduction in the rate of processed render requests.
While investigating the source of the timeouts, we discover that our load balancers started reporting a large increase in CPU usage. While following our internal runbook for the affected service we end up focusing on one particular backend service that isn’t currently in use (the runbook had been updated in preparation of this new service going live, but that hadn’t happened yet). This delayed our investigation until the misunderstanding was cleared.
At 02:15 UTC we observed that the total concurrent HTTP connections to the load balancers had drastically increased reaching the maximum threshold set by our HAProxy limits for that backend. This led us to believe that we were experiencing long running HTTP connections on the load balancers that were causing new requests to be rejected. In reality however, it was the significant increase in the rate of connections, rather than connections not terminating in a reasonable timeframe, that was causing the build up of connections leading to excess connection timeouts. This is due to experience we’ve had with a previous incident which looked similar on the surface, and our assumptions that we were dealing with the same type of incident lead us to investigate some dead ends. The time outs were seen for the HTTP API and the Render API as they both share the same HAProxy frontend config on our load balancers and in turn the same connection limit.
At 02:35 UTC we decided to lower the http-request timeout in HAProxy in an attempt to get the slow connections that were building up to timeout faster, we hoped this would stop the build up and release connections quickly enough to lower the concurrent connection count below the maximum threshold. After making this change, we received an alert at 02:40 UTC that the website was completely down, we quickly realised that the lowering of the request timeout was causing our Route53 health-checks to timeout prematurely marking our frontend load balancers as down. The site was restored at 02:54 UTC after rollback of the change had completed, we then started looking into our other options for dealing with the connection build up on the load balancers.
At this point we noticed that the spikes in connections to HAProxy were all in the PT state, this state is when the connection is being purposely delayed (tarpitted) by the server before returning a 5XX to the client, this mechanism is intended to protect the server from abuse. The longer the server held on to each connection the faster the concurrent connection count would grow as HAProxy was accepting more new connections than it was terminating. We then decided to reinforce the load balancing layer to hopefully help with the buffering of the excess connections reducing the number that were being tarpitted. We also took this opportunity to try to raise the maximum concurrent connections on this new host.
At 03:30 UTC the HTTP API and render layer begin to recover, we observed the concurrent connections to the load balancers drop as the additional host began to receive traffic but it was still quite close to the limit. We attempted to raise the limit on all of the load balancers to give the layer some headroom but an error in editing the config, which placed the change in the wrong section, meant that this wasn’t actually applied. Once we were confirmed that the load balancers would remain stable, at 04:12 UTC we resolved the incident and planned to investigate further in the morning.
On the following day, we continued our investigation and our logging showed that a single IP address was responsible for a large part of the connection attempts we received during the incident. Our firewall ratelimited these connection attempts to make sure this single client would not monopolise the resources in our load balancers but, in doing so, ended up tarpitting the offending connections, holding up a large percentage of the maximum number of connections our load balancers are configured to accept.
Tarpitting the offending connections is a good strategy when we want to protect our backend from an influx of requests, but in this particular scenario it was the connection attempts that caused an issue for us, and a complementary ratelimit mechanism would have been needed to protect our services against it.
Our frontend load balancers did not get taken down by the traffic surge due to the pre-existing limits we had set and the main site was still accessible even with degraded ingestion. As well as this, only one specific endpoint was affected, while the other backends served by the load balancers were unaffected.
Our monitoring immediately gave us feedback that reducing the timeout values caused our ssl handshakes to fail, and bring the site down briefly, which allowed for a quick rollback and resolution.
While we have rate-limiting enabled in our load balancers to protect our ingestion layer, due to the particular characteristics of this traffic surge it failed to protect our services this time. Because our rate-limiting did not account for the case where a single user made a significant amount of requests to our HTTP API, as well as limits for concurrent connections to the HTTP API being lower than they should be, a single user making a large number of requests was enough to max out the connection limit and cause the load balancers to reject new connections.
We made several changes that actually had no practical effect, due to the changes being made in the wrong section of our HAProxy config for the frontend load balancers. This was a tricky manual change, especially at the late stage of this incident.
It was unclear what units our HAProxy config values were in, so in the case where we expected seconds we instead were setting a timeout value of milliseconds.
One of our alert runbooks was out of date, and refers to a new version of the HTTP API that has not yet been rolled out, which lead to confusion during initial investigations as Oncall needed to confirm why the infrastructure being used to test that HTTP API was not affected.
Our log filtering did not take into account termination states from HAProxy that contained tarpitted terminated connections, and did not set request time field used in our monitoring correctly, which lead to us not knowing how long the connections were running for.
A previous incident that on a surface level looked similar to this incident lead to us pursuing the same path of resolution/mitigation which resulted in decisions that had no impact in resolving the incident and instead extended the time taken to resolve the actual issue.
We have implemented network level rate-limiting changes which allow us to detect increases in connection attempts to our load balancers and ratelimit them accordingly. [SRE-891]
Have a clear convention/use consistently labelled units in our HAProxy configs to avoid issues where we think a value is one time unit and it is in fact a much lower one. [SRE-892]
Add support to update important config values from our HAProxy configs that occasionally require tweaking during incidents to our ChatOps, which reduces human error. [SRE-893]
Review our timeout values and connection counts to all backends served by HAProxy and evaluate if any of these need to change [SRE-897]
Update the runbook for our HTTP Canary Alerts so it reflects the current state of reality (this is also part of a larger documentation improvement effort that is ongoing) [SRE-898]
Make open connection duration more visible in our monitoring to avoid the long running connections vs lots of short lived connections [SRE-913]
Sat 30th March
00:50 UTC Our external canary for the HTTP ingestion protocol starts to register a drop.
01:01 UTC We receive an alert for our HTTP API canary, specifically a small drop in the canary value over a longer period of time. We begin investigating.
01:08 UTC We notice that this affects our external canaries located in several different regions.
01:23 UTC Initial update to our status page.
01:30 UTC We notice that a canary node is frequently timing out trying to connect to the HTTP API. CPU usage on the frontend load balancers suggests they’re working harder than usual but the request rate and ingestion rate graphs suggest no extra traffic was hitting the backend servers.
01:55 UTC As the incident continues to drag on without a resolution in sight, the incident is escalated to the rest of the SRE team.
02:10 UTC More engineers are now online and get up to speed with this incident.
02:12 UTC Our load balancers are registering a connection buildup, and this leads us to believe there are some similarities between this incident and a previous AWS related incident we had in the past. We also assumed that the connection buildup was due to slowness to establish a connection on the client side. Both assumptions proved to be incorrect.
02:14 UTC We believe we have established that there's no timeouts or other signs of impact for anything else http related that goes through the load balancers. This is due to the rate of renders generally being lower on the weekends, and we misread this as normal behaviour. The reality was the spike in connections was slowing down the normal rate of render requests.
02:15 UTC We notice the total concurrent connections for the http API on the frontend load balancers are maxed out.
02:18 UTC We continue to believe this is a similar incident to a previous AWS incident, and take similar measures to that incident. This involved attempting to reduce timeouts for client http requests, to eliminate long running connections.
02:22 UTC We decide to reduce the request timeout in HAProxy as an attempt at reducing the pile up of connections on the load balancers.
02:35 UTC We attempt to lower the request timeout from 3s to 1s and client timeout from 10s to 3s (This actually set it to 1ms and 3ms respectively due to an ambiguity on our default units for HAProxy configs)
02:40 UTC We start receiving alerts indicating that external checks are unable to establish a SSL handshake to our site.
02:49 UTC The previous timeout change is rolled back and the load balancers come back up, restoring access to the site. HTTP ingestion API remains affected.
02:50 UTC We note that the spikes in connections to haproxy are all in the PT state. This means that they are being ratelimited from our backend, hinting at a spike of requests coming from a single source. This is significant but, at the moment we fail to identify it as such.
03:00 UTC We consider turning off external health checks for our load balancers, based on a previous incident involving AWS - however we are not confident this is the correct approach as it would have left our load balancing layer without any form of health checking except for manual inspection.
03:25 UTC We decide to expand our load balancer layer to help with the excess connections.
03:28 UTC The maximum connection value is raised on the new load balancer and it starts receiving traffic.
03:30 UTC The canaries begin to recover as the frontend connections on the existing load balancers start to drop below their maximum as the new host takes on its share of the connections.
03:34 UTC As we still believed the issue to be related to a slow client somehow, we still expect connections to start piling up again immediately. This doesn’t happen, and our HTTP ingestion API is fully available again as the new host handles the additional connections.
03:41 UTC We observe the number of connections reach a stable point on the load balancers, the connection count is holding but it’s still too close to the limit, we decide to raise the maximum connection value across all load balancers. We would later find out that this change didn’t actually do anything as it was performed in the wrong section of the HAProxy configuration.
04:12 UTC We confirm full recovery and resolve the incident.