Missing leading edge data
Incident Report for Hosted Graphite
Postmortem

Background

At Hosted Graphite, we rely on encryption to give us a private overlay network. We implement this using IPsec. Individual packets traverse the public network between our hosts, but each is encrypted & encapsulated by IPsec.

We have many services and tools for managing IPsec on each host. One of these services (hereafter referred to as "IPsec-manager") tests connectivity between hosts, removing security associations (SAs) for connections that are timing out. This allows IPsec to re-negotiate the SAs, fixing connectivity between hosts. For more information on IPsec and how it is used at Hosted Graphite, check out: https://www.usenix.org/node/197468
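The core loop can be sketched roughly as below. This is an illustrative sketch only, not Hosted Graphite's actual implementation: `probe` and `flush_sa` are hypothetical stand-ins for a real connectivity check and for removing one peer's SAs (e.g. via `ip xfrm state`).

```python
# Sketch of the IPsec-manager behaviour described above.
# probe(peer) -> bool is a hypothetical connectivity check;
# flush_sa(peer) is a hypothetical stand-in for removing that
# peer's security associations.

def peers_to_reset(peers, probe):
    """Return the peers whose SAs should be removed because
    connectivity checks to them are timing out."""
    return [peer for peer in peers if not probe(peer)]

def run_once(peers, probe, flush_sa):
    # Removing a stale SA lets IPsec re-negotiate a fresh one,
    # restoring connectivity between the two hosts.
    for peer in peers_to_reset(peers, probe):
        flush_sa(peer)
```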

We also rely heavily on service discovery, using a dedicated set of hosts running etcd. Our extensive health checking uses these hosts to notify each layer of our system which hosts are considered healthy and ready to receive traffic/requests at any given time.
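The usual shape of this pattern: health checks publish per-host keys with a TTL, and consumers treat only unexpired keys as healthy destinations. The sketch below is a toy model under that assumption; a real deployment would use the etcd API (e.g. leases/TTLs), and the dict here merely stands in for the key space.

```python
import time

# Toy model of TTL-based service discovery: health checks publish
# per-host keys that expire, and consumers only route to hosts
# whose keys are still live. The dict stands in for etcd.

def report_healthy(store, host, ttl, now=None):
    """Record that `host` passed its health check; the entry
    expires `ttl` seconds from `now`."""
    now = time.time() if now is None else now
    store[host] = now + ttl

def healthy_hosts(store, now=None):
    """Return the hosts whose health-check entries have not expired."""
    now = time.time() if now is None else now
    return sorted(h for h, expires in store.items() if expires > now)
```

A host that stops passing health checks simply stops refreshing its key, and it drops out of the healthy set once the TTL lapses.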

What happened?

We merged and rolled out a config change to our IPsec-manager service to allow it to remove IPsec SAs for both inbound and outbound connections between hosts. Due to a race condition in our configuration management definitions, IPsec-manager was unexpectedly enabled on our service discovery cluster.

Our IPsec configuration for the service discovery cluster differs a little from our other hosts in that we only expose the etcd port over IPsec and nothing else. IPsec-manager was not set up to account for this and was checking connectivity on a blocked port, resulting in the removal of all SAs on the service discovery hosts. Soon after, we saw widespread health check failures as our services could not establish connectivity to service discovery.
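The failure mode can be illustrated as follows. All port numbers and helpers here are hypothetical (2379 is etcd's conventional client port, and 9999 is an arbitrary stand-in for the blocked check port); the point is only that a probe to a port outside the IPsec policy always times out, so every SA looks dead.

```python
# Illustrative failure mode: service discovery hosts only allow the
# etcd port over IPsec, but the checker probed a different, blocked
# port. Every probe then "times out", so every SA gets removed.

ETCD_PORT = 2379  # conventional etcd client port; the only one exposed

def probe(peer, port, allowed_ports):
    # Stand-in for a real TCP connectivity check: traffic to a port
    # outside the IPsec policy is dropped, i.e. the check times out.
    return port in allowed_ports

def peers_considered_dead(peers, check_port, allowed_ports):
    """Peers the checker would (possibly wrongly) flush SAs for."""
    return [p for p in peers if not probe(p, check_port, allowed_ports)]
```

Probing the blocked port marks every service discovery host dead, while probing the etcd port would have marked none.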

What was the user impact of this? Our webservers could no longer serve the leading edge cache stored on our aggregation hosts, which led to gaps for some metrics at the leading edge of graph renders, as explained in our initial StatusPage update.

To fix the issue we stopped the IPsec-manager service running on the service discovery hosts. We were then able to restore connectivity by manually clearing all SAs between our aggregation hosts and service discovery hosts.

Due to a high volume of alerts, our SRE team missed that the StatsD integration had also been affected. Our StatsD processes also use service discovery, so during the loss of connectivity they had no load balancing destinations to forward to. This resulted in a 50% reduction in traffic processed over StatsD between 14:30 and 15:15 UTC.

What are we going to do in the future?

Our main takeaway from this incident is to ensure that our tools for managing IPsec work equally well across our many different clusters. The root cause lay in the incompatibility between the IPsec-manager service and the IPsec configuration on our service discovery cluster. We plan to remove this special case and avoid similar mismatches in the future.

Posted Oct 12, 2017 - 14:18 UTC

Resolved
We are happy to resolve this incident now as everything is stable. We will be publishing the post-mortem soon.
Posted Oct 10, 2017 - 16:05 UTC
Update
We have successfully restored connectivity to our leading edge cache and graph rendering is fully operational again.

We have identified that the impact of this incident was broader than first thought. Our statsd ingestion service was impacted from 14:30 UTC to 15:15 UTC resulting in a 50% reduction in traffic received at our statsd endpoint. We have fixed the issue and the traffic rates have returned to normal.

We will be publishing a full post mortem for this incident to outline what went wrong and what we plan to do to avoid this happening in the future.
Posted Oct 10, 2017 - 15:41 UTC
Monitoring
We have identified the issue to be a recent config change which prevented our leading edge cache from connecting to our health checking service. We have deployed a fix for this and are now monitoring the situation.

No data has been lost, and all data will be available once the affected servers have recovered.
Posted Oct 10, 2017 - 15:04 UTC
Investigating
We are currently investigating an issue with our leading edge cache that is affecting graph renders. You can expect to see gaps for some metrics at the leading edge of graph renders.
Posted Oct 10, 2017 - 14:39 UTC