Degraded Ingestion
Incident Report for Hosted Graphite
Postmortem

Summary

Between 15:55 UTC and 16:05 UTC on Monday 18 ingestion was degraded across all protocols, resulting in failures to connect and successfully send data to our ingestion endpoints.

Background

Before reaching our load balancers, traffic to our ingestion endpoints is routed through Virtual IP addresses (VIPs). This allows us to pull load balancers outs for maintenance or to automatically remove them out of our active pool if they're experiencing any issues.

Our main automation tool (which we refer to as Dougal) is constantly checking the health of all our load balancer nodes, and will react if it finds that one of the nodes is failing health checks. If the failing node has an active VIP pointing to it then it will find another node to take that traffic and re-route that VIP towards it.

Dougal finds the list of nodes that are candidates to be active load balancers from our ENC (External Node Classifier). A cluster can run under several different configurations (which are also defined on our ENC), and a different set of nodes can be used as load balancer nodes depending on how the cluster is configured. This doesn't mean that these different node configurations are interchangeable, as only the right node configuration will be able to process traffic for the appropriate cluster.

What happened?

Dougal loaded a corrupted copy of the ENC before checking the health status of our load balancer nodes. This copy of the ENC was corrupted in such a way that:
* None of the currently active load balancer nodes were listed as active nodes or potential candidates to become active nodes.
* The cluster was configured to use an incorrect node configuration as load balancer nodes.

As a consequence of this corruption, Dougal decided that none of the load balancer nodes currently receiving traffic were supposed to be active, and proceeded to failover traffic to a different set of nodes. Unfortunately, the new set of nodes didn't have the right configuration for this cluster (in particular they had the wrong firewall settings), and were unable to process traffic once the ingestion VIPs were routing traffic to them, causing a dip in traffic across all our ingestion protocols.

The entire failover process takes a few minutes to complete, and once all the affected VIPs had been failed over, Dougal began a new check with a fresh (uncorrupted) copy of the ENC. It then proceeded to undo all the previous failovers, routing all our ingestion VIPs back to the right set of nodes. The entire process, (from the first to the last failover) took approximately 10 minutes.

Why was Dougal's view of the ENC corrupted?

As we've been growing as a company, so has Dougal. At this point many of our interactions with our systems go through it, and this results in several operations being executed concurrently, most of them needing to either read from or modify our ENC. There is locking in place around most ENC-related operations to avoid concurrent changes resulting in unexpected consequences, but there was one small part of the ENC interactions that wasn't covered by a lock (particularly disk IO when initially loading it). This triggered an edge case where the ENC was modified while it was being read from disk, resulting in a corrupted copy being loaded.

Why did it take so long for Dougal to detect the issue and correct it?

Every failover operation is slow to process, as it depends on our provider's API and requires an acknowledgement of the involved network hardware before it can continue to move forward to the next VIP on the batch. This means that, on some circumstances, our total throughput for failover operations is around one per minute. It took Dougal around 5-6 minutes to complete the first batch of failovers until it was ready to perform a new check. This tool was never designed to need to failover our entire ingestion frontend in one go, but to deal with failure on a smaller percentage of the load balancer nodes.

Additionally, Dougal always notifies our SRE team when unexpected failovers happen, but this wasn't the case. The notification only happens when a full batch of failovers has been completed, but unfortunately we were making unrelated changes to Dougal at the time and it was restarted before the batch was completed and it could notify us. Normally Dougal refuses to restart while it has jobs running but the VIP failover process is considered a background job, which is a type of job that doesn't block restart (as those are happening all the time and can usually be safely interrupted).

Timeline

15:54 UTC: Dougal begins a check of our load balancer nodes (based on a corrupted copy of the ENC), and begins the process of failing over all our VIPs to a different set of hosts.

16:00 UTC: Dougal is restarted before it gets a chance to complete the VIP failovers or to notify us.

16:02 UTC: Dougal begins a new check of our load balancer nodes and begins again failing over VIPs to the right nodes.

16:08 UTC: Dougal informs us that it has failed over a bunch of VIPs to healthy load balancer nodes, which prompts the beginning of our investigation.

16:09 UTC: We start to receive alerts indicating that there was a dip in received traffic and that our own internal canaries were registering issues, but they're already recovering.

What went well?

* Despite the initial issue, all that was required was for our automation to take over for the incident to be resolved. No human intervention was required.

What went badly?

* The failover operations took us by surprise and took us a while to put all the pieces together and figure out what happened. Thankfully, by the time humans were involved in this incident it was already in its way to be resolved, but if things had gone differently it would have taken us longer than we'd like to get up to speed.
* The individual failover operations were slower than we'd like. This is normally not an issue since the system is designed to only need to failover a small percentage of nodes at any given time, but we can still do better.

What are we going to do in the future?

After this incident we have identified many areas in which we can do better:

* We have addressed the immediate locking bug and ensured that all ENC interactions are properly locked. We will also be taking the opportunity to review the whole process to ensure that no similar edge cases are lurking in there.
* We have added better logging to Dougal so we can better track any decisions made by our automation and what information it had at the time.
* The current failover component of our automation tools have been working in their present form for two years now, and they've worked solidly during all this time, but incidents like this one helps us identify areas where we can improve. Mainly:
- Better feedback as failovers are happening.
- Better failover speed for bigger VIP batches.

Posted 24 days ago. Feb 25, 2019 - 19:06 UTC

Resolved
Ingestion remains stable after the dip in traffic between 15:55 UTC and 16:05 UTC. Our automatic failover process corrected an issue in which traffic was being routed to incorrect nodes in our ingestion layer, this incorrect routing resulted in the degraded ingestion during the above period.
Posted about 1 month ago. Feb 18, 2019 - 16:44 UTC
Investigating
Between 15:55 UTC and 16:05 UTC we experienced degraded ingestion across all protocols at our ingestion layer. Ingestion is currently stable and we are investigating the cause of this drop in traffic.
Posted about 1 month ago. Feb 18, 2019 - 16:20 UTC