Internal Network Disruption
Incident Report for Hosted Graphite
Postmortem

Background

At Hosted Graphite, we rely on encryption to give us a private overlay network. We implement this using IPsec. Individual packets traverse the public network between our hosts, but each is encrypted & encapsulated by IPsec. One of the key components of an IPsec implementation is the Internet Key Exchange (IKE) daemon, which in our case is racoon. For more background on our usage of IPsec, and of racoon in particular, you can refer to one of our recent talks on the subject.
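As a rough illustration of the moving parts (a minimal sketch, assuming the ipsec-tools "setkey" utility that typically accompanies racoon, and not our production tooling): the kernel holds the security policies that force traffic onto the encrypted overlay, as well as the security associations that racoon negotiates over IKE, and both can be inspected with setkey.

#!/usr/bin/env python3
"""Minimal sketch: inspect the IPsec state that the keying daemon maintains.

Assumes the ipsec-tools "setkey" utility is installed and that the script
runs with enough privileges to read kernel IPsec state. Illustration only.
"""
import subprocess


def setkey(*flags):
    """Run setkey with the given flags and return its textual output."""
    result = subprocess.run(["setkey", *flags], capture_output=True,
                            text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # -D dumps the security association database (SAD): the keys and SPIs
    # that the IKE daemon has negotiated with each peer.
    associations = setkey("-D")
    print("security associations established:", associations.count("spi="))

    # -DP dumps the security policy database (SPD): which traffic the kernel
    # requires to be encrypted and encapsulated with ESP.
    print(setkey("-DP"))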

In an effort to address some of the long-standing issues with our network setup, and motivated by some recent instability that can be directly attributed to bugs in our keying daemon, we decided to write and release a patch to address some of these issues. In particular, the aim of this patch was to allow racoon to automatically recover when a relationship between two nodes is abruptly lost (this can happen after events such as a hard reboot, a crash of the keying daemon itself, or persistent network issues). We believe this patch, along with other similar changes we have planned, will help us increase the reliability of our network.

What Happened?

The rollout of the new version of our keying daemon had begun at approximately 13:00 UTC at a small scale, with no impact observed up to that point.

At 16:13 UTC we begin the full-scale rollout. We start to notice connectivity issues between some of our hosts. After some investigation we conclude that the restarts associated with the rollout are more impactful than anticipated. We decide to let the process continue to completion, as rolling it back would only make things worse by triggering another round of restarts.

At 16:31 UTC we receive the first reports from our internal monitoring that the new keying daemon is crashing on some hosts. A crash of the keying daemon prevents a host from forming and renewing any relationships, severely impacting connectivity.

At 16:40 UTC, after confirming that the crashes are not isolated incidents, we decide to start the rollback process. This results in another set of restarts across our fleet. After this, ingestion starts to recover.

At this point, the fact that our keying daemon was crashing on a percentage of servers across our fleet made both the rollback process and the task of identifying the exact impact harder than they should have been. Some of the affected servers lost connectivity to both our configuration management and centralised logging services, which meant we weren't able to use our standard tools to make the changes needed to restore connectivity. The partial outage in our centralised logging infrastructure also prevented us from realising that the impact was more severe than we originally thought: some of the nodes that had been rolled back were still having issues forming new associations.

At approximately 17:30 UTC the outer edge of our ingestion layer has stabilised and we're ingesting data normally, but there are still some unresolved connectivity issues between components of our ingestion layer, which results in a small percentage of datapoints not being able to advance through our processing pipeline and be made available for rendering. Our ingestion layer is designed to deal with this situation, and a background job starts picking up the affected datapoints and attempting to replay them.

We conclude that some datapoints are not being fully processed by part of our pipeline because of a capacity issue, as some of our servers are showing symptoms of being overloaded, so we proceed to provision extra capacity to ease the load. While load is indeed higher than usual on these servers, this assumption isn't quite correct.

Adding extra capacity to the cluster at around 18:10 UTC reduces (but doesn't completely resolve) the issue preventing some datapoints from being processed. This reinforces our belief that we're experiencing a capacity issue caused by the amount of data that needs to be replayed after the initial impact of our rollout attempt.

At around 18:36 UTC, helped by the partial recovery of our logging infrastructure, we start to realise that we still have some underlying connectivity issues and decide to investigate further. We find that the processing issues are limited to a subset of the servers in our aggregation layer and decide to force a renegotiation of their associations with the servers in the rest of the fleet (it's worth noting that the change we were originally trying to roll out would have fixed this automatically). This fixes the remaining connectivity issues, and no further processing issues are experienced after 19:00 UTC.
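For reference, the one-off fix described here boils down to removing the stale associations for the affected peers so that the keying daemons negotiate fresh ones. The sketch below illustrates the idea under a couple of assumptions: the peer addresses are invented, and it uses setkey's deleteall command rather than our exact procedure.

#!/usr/bin/env python3
"""Illustrative sketch of forcing renegotiation with a set of peers.

Deleting the existing ESP associations for a peer means traffic once again
hits the "require" policy, prompting the keying daemon to negotiate fresh
associations. Addresses below are invented for the example.
"""
import subprocess

LOCAL_ADDR = "203.0.113.10"                        # hypothetical local address
AFFECTED_PEERS = ["203.0.113.21", "203.0.113.22"]  # hypothetical peers


def flush_peer(local, peer):
    """Remove ESP associations in both directions for a single peer."""
    commands = (f"deleteall {local} {peer} esp;\n"
                f"deleteall {peer} {local} esp;\n")
    subprocess.run(["setkey", "-c"], input=commands, text=True, check=True)


if __name__ == "__main__":
    for peer in AFFECTED_PEERS:
        flush_peer(LOCAL_ADDR, peer)
        print(f"flushed associations with {peer}; they will be renegotiated")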

Unfortunately, at this point the backlog for our background data replay service had grown rather large, and the replay didn't complete until 00:00 UTC. We added capacity during this period to speed up the process, but our replay service is designed with the explicit tradeoff of not overwhelming the rest of our ingestion layer during an incident (at the cost of speed), so progress was slow.

What are we going to do in the future?

We are going to revisit the rollout process for changes to our keying daemon. The current process causes too much disruption when a change is deployed across our whole fleet. We believe the order in which some of these tools are restarted makes the issue worse. If we hadn't experienced the earlier impact caused by the restarts, we might have been able to identify the connectivity issues sooner.

We are going to investigate why our keying daemon crashed. In the process of patching and building our new version, we pulled a slightly more recent upstream version of the code than the one we were running in production. This means the delta of changes introduced was bigger than just our own patch, which makes it harder to confidently assert the root cause of the crashes. We're performing longer-term testing to identify and fix the problem.

We are going to continue to work on our tooling. One of the things that went well during this incident is that our tooling allowed us to inspect the status of relationships across nodes and resolve some of those issues. During the incident we realised that we can automatically detect and work around the original failure scenario we set out to resolve, with only a small delay compared to a fix in the keying daemon itself, so we have already implemented this workaround and deployed it to production.
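Conceptually, that workaround looks something like the sketch below (the peer list, interval, and commands are assumptions for illustration, not our production code): periodically probe each peer over the overlay, and if a peer is unreachable while the kernel still holds associations for it, flush those associations so they are renegotiated.

#!/usr/bin/env python3
"""Conceptual sketch of the automated workaround: find peers that are
unreachable despite existing associations and flush those associations so
the keying daemon renegotiates them. Names and values are illustrative.
"""
import subprocess
import time

LOCAL_ADDR = "203.0.113.10"                # hypothetical local overlay address
PEERS = ["203.0.113.21", "203.0.113.22"]   # hypothetical overlay peers
CHECK_INTERVAL = 60                        # seconds between sweeps


def reachable(peer):
    """A single ICMP probe; good enough for a sketch."""
    return subprocess.run(["ping", "-c", "1", "-W", "2", peer],
                          capture_output=True).returncode == 0


def has_associations(peer):
    """Check whether the kernel still holds any association for this peer."""
    dump = subprocess.run(["setkey", "-D"], capture_output=True,
                          text=True, check=True).stdout
    return peer in dump


def flush(peer):
    """Delete ESP associations in both directions so they get renegotiated."""
    commands = (f"deleteall {LOCAL_ADDR} {peer} esp;\n"
                f"deleteall {peer} {LOCAL_ADDR} esp;\n")
    subprocess.run(["setkey", "-c"], input=commands, text=True, check=True)


if __name__ == "__main__":
    while True:
        for peer in PEERS:
            if not reachable(peer) and has_associations(peer):
                # A stale association: the peer has probably lost its state
                # (hard reboot, daemon crash) and can no longer talk to us.
                flush(peer)
                print(f"flushed stale associations with {peer}")
        time.sleep(CHECK_INTERVAL)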

We are going to revisit some of the design decisions behind our data replay service. This service was explicitly designed so that, in the event of an incident, it wouldn't make matters worse by trying to replay everything at once and overloading our servers. While we still believe that's a good tradeoff, in incidents like this one we could have benefited from a faster mode of operation, particularly after we had added extra capacity and were confident we could handle the extra load.
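To make the tradeoff concrete, the replay loop conceptually looks something like the sketch below: an explicit rate cap keeps replays from overwhelming the ingestion layer, and the change we're considering amounts to making that cap easy to raise once we know there is headroom. The names and numbers here are invented for illustration.

#!/usr/bin/env python3
"""Conceptual sketch of a throttled replay loop. The queue, the send
callback, and the rates are invented; the point is the explicit cap on how
fast replayed datapoints are pushed back into the ingestion pipeline.
"""
import time
from collections import deque

DEFAULT_RATE = 5_000    # datapoints/second: safe even during an incident
BOOSTED_RATE = 50_000   # opt-in once we know extra capacity is available


def replay(backlog, send, rate=DEFAULT_RATE, batch_size=500):
    """Drain the backlog without exceeding `rate` datapoints per second."""
    interval = batch_size / rate
    while backlog:
        batch = [backlog.popleft() for _ in range(min(batch_size, len(backlog)))]
        send(batch)
        time.sleep(interval)  # the throttle that trades speed for safety


if __name__ == "__main__":
    # Toy usage: "send" just collects datapoints; a real replay would write
    # them back into the processing pipeline.
    backlog = deque(range(20_000))
    replayed = []
    replay(backlog, send=replayed.extend, rate=BOOSTED_RATE)
    print(f"replayed {len(replayed)} datapoints")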

Posted Oct 13, 2017 - 18:43 UTC

Resolved
Our replay service has finished working through its backlog and all delayed data has now been processed and is fully available.

We will be publishing a post-mortem of this incident in the following days.
Posted Oct 12, 2017 - 23:57 UTC
Update
Ingestion has returned to normal levels. Data is being replayed and gaps will appear in graphs until these replays have completed. Some alerts may have been delayed.

Because of the quantity of data being replayed, we've expanded capacity to help expedite backlog clearance.
Posted Oct 12, 2017 - 18:32 UTC
Identified
We have identified a disruption of connectivity within our internal network that has affected all components. While deploying an update to our IPsec management tools we encountered an error and needed to roll back. Our internal network was disrupted by this deploy and connectivity is recovering slowly. We are monitoring the situation and will update you as soon as we know more.
Posted Oct 12, 2017 - 17:12 UTC