On June 24, between 13:25 and 14:35 UTC and again between 16:40 and 17:15 UTC, our aggregation layer experienced instability that delayed the processing of newly ingested data by our services. This could result in gaps or incorrect data in graphs, and in some alerts triggering false positives.
Over the past few months we have been working to improve the performance of our ingestion pipeline, with a strong focus on our aggregation layer. Part of this effort involves rewriting parts of our aggregation service in Cython. We have been incrementally rewriting and verifying parts of this service, with the last such change landing in early June.
That last change accidentally removed some exception handling around the socket operations used to receive data from other components of our ingestion pipeline. In particular, the socket `recv` operation can fail and raise an exception. The previous version of the code caught the exception and closed the connection, leaving it up to the client to reconnect. The new version had no such handling.
The different components of our aggregation layer rely on a health checking mechanism for service discovery purposes. This health check is responsible for considering a given server as “healthy”, which means that other components of our ingestion pipeline will route traffic to it. If a server stops being “healthy” it will stop receiving data.
By default, the health check we rely on requires a server to have been processing data successfully for the last five minutes before it can start accepting production traffic. This is a rather strict check but ensures that servers with intermittent issues won’t be flapping in and out of the active pool, and it’s a good default choice for having a robust and stable pipeline.
On some occasions we do need to switch to a less strict check to ensure we have enough available capacity to handle incoming traffic. For instance, during network connectivity events, when a large enough percentage of the servers in our aggregation layer might be flapping, we can't afford to mark as inactive every server that hasn't been successfully processing data for five uninterrupted minutes. In those situations we rely on a health check that only ensures the responsible service is up and running, which increases the number of available servers and helps us weather the storm. This might leave some processes unable to forward data, but our data replay service will pick up any datapoints that haven't been forwarded and replay them once the service is in a healthier state.
Our ingestion pipeline is composed of a variety of different services that process data as they receive it and forward it to the next service in the pipeline. There are times when one of the components of this pipeline is overloaded or temporarily unavailable. In order to prioritise the availability of leading edge data, our services are configured not to retry forwarding a given datapoint indefinitely, but to move on to the next one. This allows us to avoid building up a large backlog that would affect all incoming metrics.
This data is only temporarily dropped: background services will automatically pick it up and attempt to replay it when the affected services recover. These replay services usually operate under a tight ratelimit so that data replays don't affect ingestion of leading edge data, but during periods of high load they can also be temporarily disabled, letting us prioritise stabilising the ingestion pipeline for leading edge data before resuming replays.
A change to our aggregation service introduced in early June (June 4) accidentally removed some critical exception handling around socket `recv` operations. As a result, a number of our aggregation services partially crashed during a period of network instability that started at around 13:25 UTC. During this period some connections to our aggregation servers were reset; the new code couldn't handle this, instead crashing the process responsible for accepting incoming data. A large enough percentage of servers in our aggregation layer was affected that, when connectivity was restored, the service was unable to handle all incoming traffic normally.
These crashes reduced the capacity of our aggregation layer, and the remaining healthy servers were unable to absorb the load. The non-crashed servers became overloaded to the point that they started intermittently failing health checks, compounding the issue. Our ingestion service was now unable to forward all ingested datapoints to the aggregation layer; these were automatically dropped and left for a background replay service to forward once the aggregation service was healthy again.
At this point (13:35 UTC) we switched the servers in our aggregation layer to a less strict mode of health checking to ensure we wouldn't run out of capacity while we attempted to mitigate the impact. We also temporarily paused our data replay service to avoid introducing extra load into the system and to prioritise recovery for leading edge data, and started provisioning extra servers to increase capacity and help with the increased load.
We investigated the affected servers and noted that on most of them the process responsible for accepting forwarded data had crashed. Usually if the service crashes it is restarted automatically, but in this case only one of the processes that make up the service had crashed, which left the service in an unhealthy state without triggering a restart. While we investigated the source of these crashes we manually restarted the service on all affected hosts.
At 15:17 UTC, after we had fixed all the crashed processes, our service started to recover and our internal queues began trending back down. Once we were confident the service was stable we started enabling data replays to backfill the data that had been dropped earlier. Our data replay service has an internal ratelimit to ensure we don't overwhelm our aggregation layer with replayed data, but out of caution we enabled the replays in small batches so we could better monitor the impact on the rest of the service.
We identified the original bug that caused the aggregation processes to crash and prepared a fix, which was merged at 16:27 UTC. We decided to roll out this change immediately, before another period of network instability could cause more crashes. Unfortunately the rollout didn't happen quickly enough, and at 16:40 UTC another period of network instability impacted some of our aggregation servers that hadn't yet received the fix.
This new period of network instability caused some of our aggregation processes to crash. Our deployment tools ensure that a large percentage of the aggregation services are healthy before continuing a deploy and restarting more nodes. Because some processes were already crashing, there were not enough healthy services for the deployment to continue, and the rollout was blocked until we could manually fix the affected processes.
Once we had fixed all the crashed processes and completed the rollout of the fix, our aggregation layer started processing data normally, restoring the leading edge. This allowed us to re-enable data replays as well, which finished at around 19:30 UTC.
13:25 UTC - Beginning of the initial period of connectivity issues. Some of our aggregation servers have processes crashing. We receive alerts and begin investigating.
13:35 UTC - We switch to a less strict health checking for our aggregation service to ensure we have enough capacity to deal with the incoming traffic.
14:01 UTC - We trace the issue to the crashed processes on some of our aggregation nodes and begin working on a way to detect and repair all crashed processes. In parallel, we begin investigating the source of the bug.
14:20 UTC - We start provisioning new hosts to help deal with the extra load. We temporarily disable data replays to prioritise the processing of leading edge data and make sure we’re not introducing extra load into the system that would end up making things worse.
14:35 UTC - After removing one particular server with unusually high queues from the active pool, we start seeing a recovery on our internal queues.
15:17 UTC - Our internal queues return to normal levels.
15:20 UTC - We enable data replays on a small subset of hosts to avoid overloading our aggregation layer with the extra load.
16:27 UTC - We merge a fix for the crashed processes and begin rolling it out.
16:40 UTC - Another period of network instability. We stop data replays again.
17:15 UTC - We detect a single server with high internal queues and remove it from the active pool, which immediately helps.
17:40 UTC - Because some hosts had crashed processes, our deployment tools refused to continue rolling out the patched version of the service, slowing the deployment until we could manually clear the affected servers. Our internal queues clear quickly after this and we re-enable data replays.
19:30 UTC - Our data replay service finishes working through its backlog and all data is replayed.
Our monitoring quickly pointed at the affected hosts the moment they failed, so we didn't have to waste any time figuring out which servers were the source of the issue.
We were quickly able to switch to a less strict health checking method and rely on our data replay services to deal with any data that couldn’t be forwarded immediately. This allowed us to prioritise the recovery of ingestion for leading edge data and ensure we wouldn’t run out of capacity, which greatly reduced the potential impact of an incident like this one.
While we managed to identify the bug and prepare and test a fix very quickly, the rollout of the fix wasn’t started quickly enough to prevent the second batch of crashes. Additionally, the rollout itself was interrupted by the crashes, as our deployment tools refused to restart any more services while less than 80% of our aggregation layer was in a healthy state.
Because just one of the multiple processes that form the aggregation service crashed, the service wasn’t automatically restarted and just left the server in an unhealthy state until we addressed it manually.
Our tools and monitoring helped us quickly identify the servers that were failing health checks, but didn't differentiate between servers failing because they were overloaded and servers with a crashed process. This caused some confusion and complicated the work of detecting and cleaning up the affected processes.
Our aggregation layer is subdivided into two layers, and our tooling only allows us to change the health check method for one layer at a time. This only added a few extra seconds to the process, but introduced friction to an operation that should otherwise be as simple as possible.
Some of the engineers working on this incident weren't immediately aware that our data replay service had ratelimiting configured and that its limits could easily be lowered to avoid incurring extra load on our aggregation layer. Knowing this would have saved us from having to stop replays and re-enable them in batches, letting us focus on impact mitigation instead.
Update our automation to enable less strict health checking across our entire aggregation layer at once [SRE-1064]
One of our longer term projects includes splitting the different processes that make up our aggregation service so they can individually benefit from health checking and service discovery when an individual component fails. We will consider reprioritising this project in the near future. [SRE-792]
We are going to review our recent changes to our aggregation layer and ensure there are no other instances where we’ve inadvertently changed how error handling works.
Although the rate limits for our data replay service are internally documented, we need to make sure these limits are well understood, and we will improve our tooling for modifying them during incidents, making the process as simple and fast as possible.