Delayed Processing of Metrics
Incident Report for Hosted Graphite
Postmortem

Summary

On June 24, between 13:25 UTC and 14:35 UTC, and between 16:40 UTC and 17:15 UTC our aggregation layer experienced instability that resulted in delays of newly ingested data to be processed by our services. This could result in gaps/incorrect data in graphs and some alerts potentially triggering false positives.

Background

Recent changes to our aggregation layer

Over the past few months we have been working to improve the performance of our ingestion pipeline, with a strong focus on our aggregation layer. Part of this effort involves rewriting parts of our aggregation service to use cython. We have been incrementally rewriting and verifying parts of this service, with the last such change happening in early June.

That last change accidentally removed some exception handling around socket operations while receiving data from other components on our ingestion pipeline. In particular, the socket `recv` operation can fail, resulting in an exception. The previous version of the code handled the exception and closed the connection, leaving it up to the client to reconnect. The new version of the code had no such exception handling.

Different kinds of health checking mechanisms for our aggregation layer

The different components of our aggregation layer rely on a health checking mechanism for service discovery purposes. This health check is responsible for considering a given server as “healthy”, which means that other components of our ingestion pipeline will route traffic to it. If a server stops being “healthy” it will stop receiving data.

By default, the health check we rely on requires a server to have been processing data successfully for the last five minutes before it can start accepting production traffic. This is a rather strict check but ensures that servers with intermittent issues won’t be flapping in and out of the active pool, and it’s a good default choice for having a robust and stable pipeline.

On some occasions we do need to switch to a less strict check to ensure we have enough available capacity to handle incoming traffic. For instance, during network connectivity events when a large enough percentage of the servers on our aggregation layer might be flapping we can’t afford to mark as inactive every server that hasn’t been successfully processing data for five uninterrupted minutes. In those situations we rely on a health check that just ensures that the responsible service is up and running, which will increase the number of available servers and help us weather the storm. This might result in some processes not being able to forward data, but we can rely on our data replay service to pick up any datapoints that haven’t been forwarded and replay them back when the service is in a healthier state.

What are data replays and how do they work?

Our ingestion pipeline is composed of a variety of different services that process the data as they receive it and forward it to the next service on the pipeline. There are times when one of the components of this pipeline is overloaded or temporarily unavailable. In order to prioritise the availability of leading edge data our services are configured to not retry indefinitely to forward a given datapoint and move on to the next one. This allows us to avoid building up a large backlog that will affect all incoming metrics.

This data is only temporarily dropped, and there are background services that will automatically pick it up and attempt to replay it when the affected services recover. These replay services usually operate under a tight ratelimit to ensure that data replays don’t affect ingestion of leading edge, but during periods of high load they can also be temporarily disabled so we can prioritise stabilising our ingestion pipeline for leading edge data before continuing with data replays.

What happened?

A change to our aggregation service introduced in early June (June 04) accidentally removed some critical exception handling code around socket `recv` operations. This resulted in a number of our aggregation services partially crashing during a period of network instability, which started at around 13:25 UTC. During this period some connections to our aggregation servers were reset, which the new code wasn’t able to handle, instead crashing the process responsible for accepting incoming data. A large enough percentage of servers in our aggregation layer were affected by this, which means that when connectivity was restored the service was unable to deal normally with all incoming traffic.

These processes crashing reduced the capacity of our aggregation layer, with the remaining healthy servers being unable to deal with the load. This resulted in the non-crashed servers becoming overloaded to the point they started intermittently failing health checks, compounding the issue. Our ingestion service is now unable to forward all ingested datapoints to our aggregation layer, and they are automatically dropped and left for a background replay service to forward them when the aggregation service is healthy.

At this point (13:35 UTC) we switch the servers on our aggregation layer to a less strict mode of health checking to ensure that we don’t run out of capacity while we attempt to mitigate the impact. We also temporarily pause our data replay service to ensure we’re not introducing any extra load into the system and prioritise recovery for leading edge data. We also start provisioning some extra servers to increase capacity and help with the increase in load.

We investigated the affected servers, and noted that in most of them the process responsible for accepting forwarded data had crashed. Usually if the service crashes it would be restarted automatically, but in this case only one of the processes that form the service crashed, which left the service in an unhealthy state. While we started investigating the source of these crashes we manually restarted the service in all affected hosts.

At 15:17 UTC After we fixed all the crashed processes, our service starts to recover and we see our internal queues start trending down towards recovery. Once we are confident that the service is now stable we start enabling data replays to allow for the data that had been dropped earlier on to be backfilled. Our data replay service has an internal ratelimit to ensure we don’t overwhelm our aggregation layer with replayed data, but out of caution we decided to enable the replays in small batches so we could better monitor the impact to the rest of our service.

We identify the original bug that resulted in the aggregation processes crashing, and prepare a fix, that is merged at 16:27 UTC. We decide to roll out this change immediately before another period of network instability results in more process crashes. Unfortunately this didn’t happen quickly enough, and at 16:40 UTC there’s another period of network instability that will impact some of our aggregation servers that hadn’t received the fix yet.

The new period of network instability caused some of our aggregation processes to crash. Our deployment tools will ensure that a large percentage of the aggregation services are healthy before continuing the deploy and restarting more nodes. Unfortunately, because some processes were already crashing this meant there were not enough healthy services around for the deployment process to continue, and the rollout was interrupted until we could manually fix all the affected processes.

Once we had been able to fix all the crashed processes and completed the rollout of the fix our aggregation layer started processing data normally, restoring the leading edge. This allowed us to re-enable data replays as well, which finished at around 19:30 UTC.

Timeline

13:25 UTC - Beginning of the initial period of connectivity issues. Some of our aggregation servers have processes crashing. We receive alerts and begin investigating.

13:35 UTC - We switch to a less strict health checking for our aggregation service to ensure we have enough capacity to deal with the incoming traffic.

14:01 UTC - We track the issue so the crashed processes on some of our aggregation nodes and begin working on a way to detect all crashed processes and repair them. In parallel to this, we begin investigating the source of this bug.

14:20 UTC - We start provisioning new hosts to help deal with the extra load. We temporarily disable data replays to prioritise the processing of leading edge data and make sure we’re not introducing extra load into the system that would end up making things worse.

14:35 UTC - After removing one particular server with unusually high queues from the active pool, we start seeing a recovery on our internal queues.

15:17 UTC - Our internal queues return to normal levels.

15:20 UTC - We enable data replays on a small subset of hosts to avoid overloading our aggregation layer with the extra load.

16:27 UTC - We merge a fix for the crashed processes issue and begin roll it out.

16:40 UTC - Another period of network instability. We stop data replays again.

17:15 UTC - We detect a single server with high internal queues and remove it from the active pool, which immediately helps.

17:40 UTC - Because some of the hosts had crashed processes our deployment tools refused to continue rolling out the patched version of the service, slowing down the deployment until we could manually clear the affected servers. Our internal queues clear quickly after this and we can enable data replays back.

19:30 UTC - Our data replay service finishes working through its backlog and all data is replayed.

What went well?

Our monitoring quickly pointed at the affected hosts the moment they failed, so we didn’t have to waste any time trying to figure out what servers were the source of the issue.

We were quickly able to switch to a less strict health checking method and rely on our data replay services to deal with any data that couldn’t be forwarded immediately. This allowed us to prioritise the recovery of ingestion for leading edge data and ensure we wouldn’t run out of capacity, which greatly reduced the potential impact of an incident like this one.

What went badly?

While we managed to identify the bug and prepare and test a fix very quickly, the rollout of the fix wasn’t started quickly enough to prevent the second batch of crashes. Additionally, the rollout itself was interrupted by the crashes, as our deployment tools refused to restart any more services while less than 80% of our aggregation layer was in a healthy state.

Because just one of the multiple processes that form the aggregation service crashed, the service wasn’t automatically restarted and just left the server in an unhealthy state until we addressed it manually.

Our tools and monitoring helped us identify quickly the servers that were not passing health checks, but didn’t quite differentiate between servers that couldn’t pass a health check because they were overloaded and ones that had a crashed process. This caused some confusion and complicated the process of detecting and cleaning up the affected processes.

Our aggregation layer is subdivided into two layers, and our tooling only allows us to change the health check method for one layer at a time. This only added a few extra seconds to the process, but introduced friction to an operation that should otherwise be as simple as possible.

Some of the engineers working on this incident weren’t immediately aware that our data replay service had ratelimiting configured and that its limits could be easily lowered to avoid incurring extra load on our aggregation layer. This would have prevented us from needing to stop replays and re-enable them in batches, allowing us to focus on impact mitigation instead.

What are we going to do in the future?

Update our automation to enable less strict health checking across our entire aggregation layer at once [SRE-1064]

One of our longer term projects includes splitting the different processes that make up our aggregation service so they can individually benefit from health checking and service discovery when an individual component fails. We will consider reprioritising this project in the near future. [SRE-792]

We are going to review our recent changes to our aggregation layer and ensure there are no other instances where we’ve inadvertently changed how error handling works.

Despite that the rate limits for our data replay service are internally documented, we need to make sure that these limits are well understood and that we improve our tooling around modifying these limits during incidents, making it as simple and as fast as possible.

Posted 25 days ago. Jun 27, 2019 - 15:25 UTC

Resolved
As of 19:30 UTC all backlogs have been processed and all data is once again available for query. This incident is now resolved.
Posted 28 days ago. Jun 24, 2019 - 19:43 UTC
Monitoring
We continue to process the backlogs from between 13:25 UTC and 14:35 UTC, and between 16:40 UTC and 17:15 UTC. While these are processing, you will experience gaps in graphs in those time periods. All data ingested after 17:15 UTC is queryable as normal. We will provide an update in 1 hour or if more information is available.
Posted 28 days ago. Jun 24, 2019 - 18:45 UTC
Update
As of 17:40 UTC, we have identified an issue with handling connectivity problems in our aggregation layer. We are in the process of rolling out a fix which should prevent this in the future.

We are now going to resume replays, and we will provide another update in one hour or when more information is available.

Some data from between 13:25 UTC and 14:35 UTC, and between 16:40 UTC and 17:15 UTC will be unavailable until these replays have finished.
Posted 28 days ago. Jun 24, 2019 - 17:46 UTC
Update
At 17:15 UTC, we identified a node in our aggregation layer which was not processing data correctly. This has been removed and we are verifying the stability of our aggregation layer before resuming replays of data. Some data ingested between 16:40 UTC and 17:15 UTC will be unavailable until these replays have finished.
Posted 28 days ago. Jun 24, 2019 - 17:29 UTC
Update
As of 16:40 UTC we are experiencing connectivity issues within our aggregation layer. We are temporarily suspending the replay of data while we work to resolve this issue. This will result in partial graphs and the misfiring of alerts from 16:40 UTC onwards.
Posted 28 days ago. Jun 24, 2019 - 17:11 UTC
Update
We are continuing to process the backlog of data as replays continue, until these are finished you may see gaps in graphs during the period of 13:25 UTC to 14:35 UTC. We will provide another update in one hour or when more information is available. All data ingested after 14:35 UTC has been processed as normal and all services are currently stable.
Posted 28 days ago. Jun 24, 2019 - 16:53 UTC
Update
As of 15:17 UTC, there are no more backlogs for processing leading edge data. There will no longer be any gaps in the leading edge of graphs.

As of 15:20 UTC, we have increased capacity of our aggregation layer and started the replays of affected data ingested between 13:25 UTC and 14:35 UTC.

We will provide another update in one hour or when more information is available.
Posted 28 days ago. Jun 24, 2019 - 15:55 UTC
Identified
As of 14:35 UTC, we are seeing improvements in our aggregation layer after removing nodes experiencing severe connectivity issues and clearing defunct processes on other nodes which were affected by these issues to a lesser extent. We will soon be enabling replays for any affected ingested data, and we continue to increase the capacity of our aggregation layer to handle replays and to replace any nodes which were removed during this incident.
Posted 28 days ago. Jun 24, 2019 - 14:50 UTC
Update
As of 14:20 UTC, we are working to stabilise the aggregation layer as we provision new hosts to increase the capacity of the layer. We are still seeing a delay in processing datapoints, which will lead to gaps in graphs and alerts firing incorrectly.
Posted 28 days ago. Jun 24, 2019 - 14:24 UTC
Update
At 13:35 UTC, we changed to a less strict healthcheck in our aggregation layer. We are also provisioning extra hosts to increase capacity.
Posted 28 days ago. Jun 24, 2019 - 13:59 UTC
Investigating
Since 13:25 UTC, an issue with our aggregation layer has resulted in a delay processing metrics. This will result in gaps in graphs and missing metrics alerts erroneously firing.
Posted 28 days ago. Jun 24, 2019 - 13:39 UTC
This incident affected: Graph rendering and Ingestion.