At Hosted Graphite, we rely on encryption to give us a private overlay network. We implement this using IPsec. Individual packets traverse the public network between our hosts, but each is encrypted & encapsulated by IPsec.
We have many services and tools that manage IPsec on each host. One of these (hereafter "IPsec-manager") tests connectivity between hosts and removes the security associations (SAs) for any peer that is timing out. This allows IPsec to re-negotiate SAs, restoring connectivity between hosts. For more information on IPsec and how we use it at Hosted Graphite, see: https://www.usenix.org/node/197468
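The check-and-remove loop can be sketched roughly as follows. This is an illustrative sketch only, not our actual implementation: the function names, IP addresses, and the use of `ip xfrm state deleteall` to drop SAs are assumptions about how such a tool could work.

```python
def sa_delete_commands(local_ip, peer_ip):
    """Commands to remove both the inbound and outbound SAs for a peer,
    so that IPsec re-negotiates fresh ones (iproute2 `ip xfrm` syntax)."""
    return [
        f"ip xfrm state deleteall src {local_ip} dst {peer_ip} proto esp",
        f"ip xfrm state deleteall src {peer_ip} dst {local_ip} proto esp",
    ]

def cleanup_plan(local_ip, peers, is_reachable):
    """For every peer whose connectivity probe fails, emit SA removals."""
    cmds = []
    for peer in peers:
        if not is_reachable(peer):
            cmds.extend(sa_delete_commands(local_ip, peer))
    return cmds

# Example: one of two peers is timing out, so only its SAs are removed.
plan = cleanup_plan("10.0.0.1", ["10.0.0.2", "10.0.0.3"],
                    lambda peer: peer != "10.0.0.3")
```

A real tool would execute these commands (or use a netlink library) rather than return strings, but separating "decide" from "act" like this makes the removal logic easy to test.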
We also rely heavily on service discovery, using a dedicated set of hosts running ETCD. These are used by our extensive health checking to notify each layer of our system which hosts are considered healthy and ready to receive traffic at any given time.
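The pattern here is TTL-based registration: health checkers keep refreshing an entry for each passing host, and consumers only route to entries that haven't expired. A minimal sketch of that idea, with illustrative names and TTLs (not our actual schema or the ETCD API):

```python
class Registry:
    """Toy TTL-based health registry, the pattern etcd-style stores enable."""

    def __init__(self, ttl=30):
        self.ttl = ttl
        self.entries = {}  # host -> expiry timestamp

    def report_healthy(self, host, now):
        # Health checkers refresh a host's entry each time it passes a check.
        self.entries[host] = now + self.ttl

    def healthy_hosts(self, now):
        # Consumers only see hosts whose entries have not yet expired.
        return sorted(h for h, exp in self.entries.items() if exp > now)

# Example: a host that stops refreshing drops out after its TTL lapses.
reg = Registry(ttl=30)
reg.report_healthy("host-a", now=0)
reg.report_healthy("host-b", now=5)
alive = reg.healthy_hosts(now=10)
```

The key property for this incident: if connectivity to the registry is lost, entries expire and consumers are left with an empty destination list.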
We merged and rolled out a config change to our IPsec-manager service to allow it to remove IPsec SAs for both inbound and outbound connections between hosts. Due to a race condition in our configuration management definitions, IPsec-manager was unexpectedly enabled on our service discovery cluster.
Our IPsec configuration for the service discovery cluster is a little different from that of our other hosts: we expose only the ETCD port over IPsec and nothing else. IPsec-manager was not set up to account for this and was checking connectivity on a blocked port, resulting in the removal of all SAs on the service discovery hosts. Soon after, we saw widespread health check failures as our services could not establish connectivity to service discovery.
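The failure mode is worth spelling out: a port-blind probe against a cluster whose IPsec policy carries only one port fails for every peer, so every peer looks timed out. A sketch under assumed ports and hostnames (2379 is etcd's conventional client port; the probe port and names are hypothetical):

```python
def probe(peer_allowed_ports, probe_port):
    """A connectivity probe only succeeds if the IPsec policy on that
    peer actually carries traffic for the probed port."""
    return probe_port in peer_allowed_ports

ETCD_PORT = 2379   # assumed: the only port exposed over IPsec on these hosts
PROBE_PORT = 4321  # hypothetical port the connectivity check targeted

peers = {"sd-1": {ETCD_PORT}, "sd-2": {ETCD_PORT}}

# Every probe fails, so a port-blind IPsec-manager treats every peer as
# timed out and removes all of their SAs:
timed_out = [p for p, ports in peers.items() if not probe(ports, PROBE_PORT)]
```

Had the probe targeted the one port the policy carries, no peer would have been flagged.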
What was the user impact? Our webservers could no longer serve the leading-edge cache stored on our aggregation hosts, which led to gaps for some metrics at the leading edge of graph renders, as explained in our initial StatusPage update.
To fix the issue, we stopped the IPsec-manager running on the service discovery hosts. We were then able to restore connectivity by manually cleaning all SAs between our aggregation hosts and service discovery hosts.
Due to a high volume of alerts, our SRE team missed that the StatsD integration had also been affected. Our StatsD processes also use service discovery, so during the loss of connectivity they had no load balancing destinations to forward to. This resulted in a 50% reduction in traffic processed over StatsD between 14:30 and 15:15 UTC.
Our main takeaway from this incident is to ensure that our tools for managing IPsec work equally well across our many different clusters. The root cause lay in the incompatibility between the IPsec-manager service and the IPsec configuration on our service discovery cluster. We plan to remove this special case and avoid similar issues in the future.