A configuration change resulted in a massive increase in traffic that overloaded our aggregation layer between 17:56 UTC and 18:48 UTC, delaying the processing of ingested data. Some of the affected datapoints were not fully available for querying until 20:40 UTC, when our internal replay process finished working through its backlog.
We pushed a configuration change that introduced a new service designed to forward data internally across our ingestion pipeline. This service reads from a queue of ingested data and works through it, forwarding each item to the next stage of the pipeline.
This change was not expected to have any impact: no data was being ingested through this service, so it would have nothing to forward. Unfortunately, the new service was misconfigured and pointed at a queue that was actively ingesting data. As a result, it started up with a large backlog of data to forward to the next layer of our ingestion pipeline.
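To make the failure mode concrete, here is a minimal sketch of the kind of mistake involved. The queue names, sizes, and code are hypothetical illustrations, not our actual configuration or internals; the point is that a one-line difference in the configured source queue decides whether the forwarder starts with nothing to do or with a huge backlog to push downstream all at once.

```python
from collections import deque

# Hypothetical queues -- names and sizes are purely illustrative.
QUEUES = {
    "ingest-idle": deque(),   # the empty queue the change was meant to target
    "ingest-live": deque(f"datapoint-{i}" for i in range(1_000_000)),  # actively ingesting
}

# The misconfiguration in one line: the forwarder was pointed at the live queue.
SOURCE_QUEUE = "ingest-live"  # should have been "ingest-idle"

def run_forwarder(source: str, forward) -> int:
    """Drain the configured queue, forwarding every item downstream."""
    queue, forwarded = QUEUES[source], 0
    while queue:
        forward(queue.popleft())
        forwarded += 1
    return forwarded

if __name__ == "__main__":
    sent = run_forwarder(SOURCE_QUEUE, forward=lambda item: None)
    print(f"forwarded {sent} items immediately on startup")  # a sudden flood downstream
```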
The sudden and massive increase in traffic overwhelmed our aggregation layer, which delayed the processing of newly ingested datapoints. These datapoints would automatically be replayed later, once our aggregation layer had recovered.
Once this service was identified and disabled, our aggregation layer recovered and the affected datapoints were automatically replayed in a safe manner. Once that replay process completed, all ingested datapoints were fully available for querying, and there was no resulting data loss.
17:56 UTC: The configuration change is merged and starts to propagate across our ingestion layer.
18:16 UTC: We start receiving alerts and begin investigating.
18:20 UTC: We identify the source of the issue as a massive increase in traffic coming from this service.
18:22 UTC: We disable the offending service.
18:48 UTC: Our aggregation layer fully stabilises and returns to processing ingested datapoints normally. Datapoints affected by this issue are being replayed in the background.
20:40 UTC: Our replay service finishes processing its backlog. At this point all ingested datapoints are fully available for querying and there's no data loss.
After the initial alerts, the issue was quickly identified and addressed. Our internal monitoring pointed us exactly where we needed to look and our tooling allowed us to quickly disable the offending service.
The sudden spike in received data didn't affect our storage backends: the service in charge of persisting data made sure not to write any faster than the storage backends could handle.
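As a rough illustration of what that pacing looks like, here is a minimal sketch of a writer that never sends data faster than a fixed budget per second. The class, method names, and fixed budget are assumptions for the example, not a description of our actual persistence service.

```python
import time

class PacedWriter:
    """Write batches of datapoints no faster than a fixed budget per second.

    A minimal sketch: in reality the budget would track what each storage
    backend can absorb; the fixed rate here is an assumption for illustration.
    """

    def __init__(self, storage, max_points_per_sec: int):
        self.storage = storage
        self.rate = float(max_points_per_sec)
        self.earliest_next_write = time.monotonic()

    def write(self, batch):
        delay = self.earliest_next_write - time.monotonic()
        if delay > 0:
            time.sleep(delay)              # hold back until we're within budget
        self.storage.write(batch)          # storage is assumed to expose write()
        # Writing len(batch) points "spends" that much of the per-second budget,
        # so the next write may not start before that time has elapsed.
        self.earliest_next_write = time.monotonic() + len(batch) / self.rate
```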
Our internal replay service ensured that any datapoint that couldn't be immediately ingested by our aggregation layer would be replayed safely later, rather than disrupting the flow of incoming data and delaying processing for everything else.
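At a very high level, that behaviour looks like the sketch below: datapoints the aggregation layer can't take right now are parked and replayed later in controlled batches. The class and exception names are hypothetical, and the real service keeps its backlog durable rather than in memory.

```python
from collections import deque

class AggregatorOverloaded(Exception):
    """Raised (hypothetically) when the aggregation layer can't accept more data."""

class ReplayBuffer:
    """Park datapoints the aggregation layer rejects and replay them later."""

    def __init__(self):
        self.backlog = deque()

    def ingest_or_park(self, aggregator, datapoint):
        try:
            aggregator.ingest(datapoint)       # normal path: hand off immediately
        except AggregatorOverloaded:
            self.backlog.append(datapoint)     # saturated: keep the datapoint for later

    def replay(self, aggregator, batch_size: int = 1000):
        """Drain the backlog in small batches once the aggregation layer has recovered."""
        while self.backlog:
            for _ in range(min(batch_size, len(self.backlog))):
                aggregator.ingest(self.backlog.popleft())
```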
The initial configuration change was considered a no-op, and therefore didn't get the thorough testing it deserved. This also meant that internal communication around the change wasn't as good as we're used to, which slightly delayed the identification and resolution of this incident.
After this incident we have identified many areas in which we can do better:
* We're going to keep pushing for higher standards around testing configuration changes, and for better internal communication before they're rolled out. We've improved in this area and the number of incidents traced back to bad configuration changes has been decreasing, but we can do better.
* Our internal forwarding tools should automatically rate-limit themselves. With many services forwarding data through our aggregation layer, we can't allow a single service to overwhelm it and degrade service for everyone else.
* The new service defaults to processing items from the beginning of the queue and catching up, instead of starting from the most recent item. This makes sense in some cases, but it's not a good default mode of operation, and in this incident it certainly made things worse, as the service suddenly found itself needing to forward hours' worth of data. We're going to make sure it has a safer default behaviour going forward. A minimal sketch covering both this and the rate-limiting item above follows this list.
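To illustrate both of those follow-ups together, here is a minimal sketch of a forwarder that starts from the most recent item by default and limits how fast it pushes data downstream. The queue client, parameter names, and limits are assumptions for the example, not our actual tooling.

```python
import time

def forward_from_queue(queue_client, downstream, source_queue: str,
                       start_position: str = "latest",    # safer default: don't replay history
                       max_items_per_sec: int = 5000):
    """Forward queue items downstream without exceeding a fixed rate.

    Illustrative only: queue_client.consume() and downstream.send() are assumed
    interfaces, and the default limits are placeholders.
    """
    min_interval = 1.0 / max_items_per_sec
    for item in queue_client.consume(source_queue, start=start_position):
        downstream.send(item)
        time.sleep(min_interval)   # crude self-imposed rate limit on forwarding
```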