A configuration change resulted in a massive increase in traffic that overloaded our aggregation layer between 17:56 UTC and 18:48 UTC, delaying the processing of ingested data. Some of the affected datapoints were not fully available for querying until 20:40 UTC, when our internal replay process finished working through its backlog.
We pushed a configuration change that introduced a new service designed to forward data internally across our ingestion pipeline. This service reads from a queue of ingested data and works through it, forwarding each item to the next stage of the pipeline.
This change was not expected to have any impact: no data was being ingested through this service, so it would have nothing to forward. Unfortunately, the new service was misconfigured and pointed at a queue that was actively ingesting data. As a result, it started up with a large backlog of data to forward to the next layer of our ingestion pipeline.
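To make the failure mode concrete, here is a minimal sketch of the kind of mistake involved. The queue names, sizes, and code are hypothetical illustrations, not our actual configuration or internals; the point is that a one-line difference in the configured source queue decides whether the forwarder starts with nothing to do or with a huge backlog to push downstream all at once.

```python
from collections import deque

# Hypothetical queues -- names and sizes are purely illustrative.
QUEUES = {
    "ingest-idle": deque(),   # the empty queue the change was meant to target
    "ingest-live": deque(f"datapoint-{i}" for i in range(1_000_000)),  # actively ingesting
}

# The misconfiguration in one line: the forwarder was pointed at the live queue.
SOURCE_QUEUE = "ingest-live"  # should have been "ingest-idle"

def run_forwarder(source: str, forward) -> int:
    """Drain the configured queue, forwarding every item downstream."""
    queue, forwarded = QUEUES[source], 0
    while queue:
        forward(queue.popleft())
        forwarded += 1
    return forwarded

if __name__ == "__main__":
    sent = run_forwarder(SOURCE_QUEUE, forward=lambda item: None)
    print(f"forwarded {sent} items immediately on startup")  # a sudden flood downstream
```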
The sudden and massive increase in traffic overwhelmed our aggregation layer, which delayed the processing of newly ingested datapoints. These datapoints would automatically be replayed later, once our aggregation layer had recovered.
Once this service was identified and disabled, our aggregation layer recovered and the affected datapoints were automatically replayed in a safe manner. Once that replay process completed, all ingested datapoints were fully available for querying, and there was no resulting data loss.
17:56 UTC: The configuration change is merged and starts to propagate across our ingestion layer.
18:16 UTC: We start receiving alerts and begin investigating.
18:20 UTC: We identify the source of the issue as a massive increase in traffic coming from this service.
18:22 UTC: We disable the offending service.
18:48 UTC: Our aggregation layer fully stabilises and returns to processing ingested datapoints normally. Datapoints affected by this issue are being replayed in the background.
20:40 UTC: Our replay service finishes processing its backlog. At this point all ingested datapoints are fully available for querying and there's no data loss.
After the initial alerts, the issue was quickly identified and addressed. Our internal monitoring pointed us exactly where we needed to look and our tooling allowed us to quickly disable the offending service.
The sudden spike in received data didn't affect our storage backends: the service in charge of persisting data made sure not to write any faster than the storage backends could handle.
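As a rough illustration of what that pacing looks like, here is a minimal sketch of a writer that never sends data faster than a fixed budget per second. The class, method names, and fixed budget are assumptions for the example, not a description of our actual persistence service.

```python
import time

class PacedWriter:
    """Write batches of datapoints no faster than a fixed budget per second.

    A minimal sketch: in reality the budget would track what each storage
    backend can absorb; the fixed rate here is an assumption for illustration.
    """

    def __init__(self, storage, max_points_per_sec: int):
        self.storage = storage
        self.rate = float(max_points_per_sec)
        self.earliest_next_write = time.monotonic()

    def write(self, batch):
        delay = self.earliest_next_write - time.monotonic()
        if delay > 0:
            time.sleep(delay)              # hold back until we're within budget
        self.storage.write(batch)          # storage is assumed to expose write()
        # Writing len(batch) points "spends" that much of the per-second budget,
        # so the next write may not start before that time has elapsed.
        self.earliest_next_write = time.monotonic() + len(batch) / self.rate
```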
Our internal replay service ensured that any datapoint that couldn't be immediately ingested by our aggregation layer would be replayed safely later, rather than disrupting the flow of incoming data and delaying processing for everything else.
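At a very high level, that behaviour looks like the sketch below: datapoints the aggregation layer can't take right now are parked and replayed later in controlled batches. The class and exception names are hypothetical, and the real service keeps its backlog durable rather than in memory.

```python
from collections import deque

class AggregatorOverloaded(Exception):
    """Raised (hypothetically) when the aggregation layer can't accept more data."""

class ReplayBuffer:
    """Park datapoints the aggregation layer rejects and replay them later."""

    def __init__(self):
        self.backlog = deque()

    def ingest_or_park(self, aggregator, datapoint):
        try:
            aggregator.ingest(datapoint)       # normal path: hand off immediately
        except AggregatorOverloaded:
            self.backlog.append(datapoint)     # saturated: keep the datapoint for later

    def replay(self, aggregator, batch_size: int = 1000):
        """Drain the backlog in small batches once the aggregation layer has recovered."""
        while self.backlog:
            for _ in range(min(batch_size, len(self.backlog))):
                aggregator.ingest(self.backlog.popleft())
```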
The initial configuration change was considered a no-op, and therefore didn't get the thorough testing it deserved. This also meant that internal communication around the change wasn't as good as we're used to, which slightly delayed the identification and resolution of this incident.
After this incident we have identified many areas in which we can do better:
* We're going to keep pushing for higher standards around testing configuration changes, and for better internal communication before they're rolled out. We've improved in this area and the number of incidents traced back to bad configuration changes has been decreasing, but we can do better.
* Our internal forwarding tools should automatically rate-limit themselves. With many services forwarding data through our aggregation layer, we can't allow a single service to overwhelm it and degrade service for everyone else.
* The new service defaults to processing items from the beginning of the queue and catching up, instead of starting from the most recent item. This makes sense in some cases, but it's not a good default mode of operation, and in this incident it certainly made things worse, as the service suddenly found itself needing to forward hours' worth of data. We're going to make sure it has a safer default behaviour going forward. A minimal sketch covering both this and the rate-limiting item above follows this list.
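To illustrate both of those follow-ups together, here is a minimal sketch of a forwarder that starts from the most recent item by default and limits how fast it pushes data downstream. The queue client, parameter names, and limits are assumptions for the example, not our actual tooling.

```python
import time

def forward_from_queue(queue_client, downstream, source_queue: str,
                       start_position: str = "latest",    # safer default: don't replay history
                       max_items_per_sec: int = 5000):
    """Forward queue items downstream without exceeding a fixed rate.

    Illustrative only: queue_client.consume() and downstream.send() are assumed
    interfaces, and the default limits are placeholders.
    """
    min_interval = 1.0 / max_items_per_sec
    for item in queue_client.consume(source_queue, start=start_position):
        downstream.send(item)
        time.sleep(min_interval)   # crude self-imposed rate limit on forwarding
```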