We rolled out an incorrect change to our DNS caching config at 11:20 UTC, which resulted in a loss of network connectivity across our machines until 15:30 UTC. Ingestion, alerting, and graph rendering were all affected during this period.
At Hosted Graphite we run Bind9 on each host to provide a local caching DNS server. Config management is handled by Puppet: a set of puppet master hosts, plus a puppet agent that runs on every machine. Each puppet master also runs a simple webhook receiver service that keeps its local configuration files in sync with our git repositories. Until this change, Bind9 ran on our hosts as an unsupervised daemon; as part of improving the reliability of our service, we decided to move it to a service run under supervisord.
We deployed a change to disable the existing bind service and run it under supervisord at 11:20 UTC. However, a bug in the supervisord config meant that supervisord still reported the service as running even though it had crashed. Failure of the bind service meant that hosts were unable to resolve hostnames, leading to network instability, failed puppet runs, and failure to report internal metrics. At 11:24 UTC our monitoring started triggering alerts indicating ingestion and rendering failures, and we opened a status page.
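To illustrate the failure mode (our actual config is not reproduced here, and the program name and paths below are assumptions): supervisord can only track a process that stays in the foreground. If bind is started with its default daemonizing behaviour, the command supervisord launched exits immediately while named forks into the background, so supervisord's view of the service no longer reflects whether the daemon is actually alive. A minimal sketch of a config that avoids this would run named in the foreground:

```ini
; Hypothetical /etc/supervisor/conf.d/bind9.conf -- program name and
; paths are assumptions for illustration, not our actual config.
[program:bind9]
; -f keeps named in the foreground so supervisord tracks the real process;
; without it, named daemonizes and supervisord loses sight of it.
command=/usr/sbin/named -f -u bind
autorestart=true
```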
At 11:36 UTC we decided to revert the config changes. However, DNS failures on the puppet master hosts meant that the git fetches needed to update their local repos failed. The revert was further delayed by bugs introduced while making it. At 13:26 UTC, we fixed DNS on the puppet masters by temporarily adding an external DNS server to resolv.conf.
At 13:46 UTC we rolled the same temporary nameserver change out to the rest of our hosts and performed a puppet run across them, allowing puppet runs to succeed again. We began to observe recovery of our services and updated the status page at 14:14 UTC.
At 14:24 UTC we noticed that a subset of machines had again started failing to resolve hosts. We temporarily stopped Puppet across our hosts and identified the problem: bind failed to start on these hosts because a named process was already running. We weren't initially sure why this had happened, so we decided to permanently add the external DNS server to resolv.conf, kill any named process that was still running, and re-enable Puppet across our hosts. The addition of the external DNS server was completed at 15:20 UTC, resulting in full recovery, and we updated the status page.
Later investigation revealed that the still-running named process was due to some of the puppet masters having an older version of the repo. Multiple commits were made during the revert of the original bind change, and webhook delivery failures to certain puppet masters meant they hadn't performed a git fetch, leaving them with an out-of-date repo. Because we had no monitoring of these webhook failures, this went undetected until after the incident. At 17:37 UTC puppet finished running across our entire cluster, and we resolved the incident at 18:15 UTC.
Currently we lack monitoring for our git webhook services, which led to out-of-sync puppet masters going undetected until after the incident. We are going to add monitoring for this service so that we are notified of failures to deliver webhooks.
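One way to surface this, sketched below under the assumption that each puppet master can report the HEAD of its local repo (the hostnames and SHAs are invented for illustration and are not our actual tooling), is to compare every master's HEAD against the upstream repo and alert on any mismatch:

```python
# Hypothetical out-of-sync check for puppet masters. Host names, SHAs,
# and the idea of polling each master's HEAD are assumptions made for
# illustration; they are not our actual monitoring.

def out_of_sync_masters(master_heads, upstream_head):
    """Return the masters whose local repo HEAD differs from upstream."""
    return sorted(host for host, sha in master_heads.items()
                  if sha != upstream_head)

heads = {
    "puppetmaster1": "a1b2c3d",
    "puppetmaster2": "a1b2c3d",
    "puppetmaster3": "0ff1ce0",  # missed a webhook, stuck on an old commit
}
print(out_of_sync_masters(heads, "a1b2c3d"))  # → ['puppetmaster3']
```

Running a check like this periodically (and alerting on a non-empty result) would have flagged the stale masters during the incident rather than after it.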
We will investigate using a dedicated DNS server to serve as a fallback when the local server is unavailable due to bind issues. Because each host used only its local server, a bind failure meant it could not resolve any hostnames at all, rather than falling back to a slower remote server.
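A sketch of what such a fallback could look like in resolv.conf (the external address is a documentation placeholder, not a recommendation; the resolver only moves to the second nameserver after the first times out, which is why the fallback path is slower):

```
# Hypothetical /etc/resolv.conf with a fallback resolver.
# 192.0.2.53 is a placeholder address for illustration only.
nameserver 127.0.0.1    # local bind cache, tried first
nameserver 192.0.2.53   # remote fallback, independent of local bind
options timeout:1 attempts:2
```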
We will continue to work on our tooling to make both testing and reverting our config changes easier. Delays in performing a clean revert of the original change slowed our initial recovery from the incident.