Data Ingestion issues
Incident Report for Hosted Graphite
Postmortem

Summary

We rolled out an incorrect change to our DNS caching config at 11:20 UTC on December 12th, which resulted in a loss of network connectivity across our machines until 15:30 UTC. Ingestion, alerting, and graph rendering were all affected during this period.

Background

At Hosted Graphite we run Bind9 on our hosts to provide a local caching DNS server. We use Puppet across all our hosts for config management: a set of puppet master hosts, plus a puppet agent that runs on every machine. Each puppet master runs a simple webhook receiver service that keeps the local configuration files in sync with our git repositories. Bind9 currently runs on our hosts as an unsupervised daemon. As part of improving the reliability of our service, we decided to move it to run as a service under supervisord.
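
To make the moving parts concrete, here is a minimal sketch of a webhook receiver of the kind described above, assuming it does nothing more than update a local git checkout when notified of a push. The port, repo path, and the absence of any authentication are placeholders for illustration, not a description of our production service.

```python
#!/usr/bin/env python3
"""Toy webhook receiver, for illustration only: on a push notification it
updates a local checkout. The port, repo path, and lack of authentication
are simplifications, not our production service."""
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer

REPO_PATH = "/etc/puppet/config-repo"  # hypothetical checkout location

class HookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Drain the request body (the payload itself isn't needed for a plain update).
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        # Bring the local checkout up to date with the remote.
        result = subprocess.run(
            ["git", "-C", REPO_PATH, "pull", "--ff-only"],
            capture_output=True, text=True,
        )
        self.send_response(200 if result.returncode == 0 else 500)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HookHandler).serve_forever()
```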

What happened?

At 11:20 UTC we deployed a change to disable the existing bind service and run it under supervisord instead. However, a bug in the supervisord config meant that the service was still reported as running even though it had crashed. With bind down, hosts were unable to resolve hostnames, leading to network instability, failed puppet runs, and failures to report internal metrics. At 11:24 UTC our monitoring started triggering alerts for ingestion and rendering failures, and we opened a status page incident.
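
It is worth spelling out the gap this exposed: a supervisor can report a program as RUNNING while the resolver it manages is not actually answering queries. Below is a minimal sketch of a check that tests both conditions, assuming supervisord's XML-RPC interface is reachable on localhost port 9001 and the program is named bind9; these names, and the dig-based probe, are illustrative assumptions rather than our actual setup.

```python
#!/usr/bin/env python3
"""Illustrative health check: don't trust the supervisor's process state alone,
confirm the local resolver actually answers. The XML-RPC port and program name
are assumptions, not our real configuration."""
import subprocess
import xmlrpc.client

SUPERVISOR_URL = "http://127.0.0.1:9001/RPC2"  # assumes supervisord's inet_http_server is enabled
PROGRAM = "bind9"             # hypothetical supervisord program name
CHECK_NAME = "example.com"    # any name the local cache should be able to resolve

def supervisor_says_running():
    proxy = xmlrpc.client.ServerProxy(SUPERVISOR_URL)
    return proxy.supervisor.getProcessInfo(PROGRAM)["statename"] == "RUNNING"

def resolver_actually_answers():
    # Query the local cache directly; a dead named shows up as a timeout or error here.
    result = subprocess.run(
        ["dig", "+time=2", "+tries=1", "+short", "@127.0.0.1", CHECK_NAME, "A"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and result.stdout.strip() != ""

if __name__ == "__main__":
    running = supervisor_says_running()
    answering = resolver_actually_answers()
    if running and not answering:
        print("ALERT: supervisord reports bind as RUNNING but the local resolver is not answering")
    elif not running:
        print("ALERT: bind is not running under supervisord")
    else:
        print("OK: bind is supervised and answering queries")
```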

At 11:36 UTC we decided to revert the config changes. However, DNS failures on the puppet master hosts meant that the git fetches needed to update their local repos failed, and the revert was further delayed by bugs introduced while making it. At 13:26 UTC we fixed DNS on the puppet master hosts by temporarily adding an external DNS server to resolv.conf.

At 13:46 UTC we temporarily added the external DNS server across all our hosts and performed a puppet run; this nameserver change was a stopgap to allow puppet runs to succeed again. We began to observe recovery of our services and updated the status page at 14:14 UTC.

At 14:24 UTC we noticed that a subset of machines had again started failing to resolve hostnames. We temporarily stopped Puppet across our hosts and identified the problem: bind was failing to start on these machines because a named process was already running. We weren't initially sure why this had happened, so we decided to permanently add the external DNS server to resolv.conf, kill any named processes that were still running, and re-enable Puppet across our hosts. The addition of the external DNS server was completed at 15:20 UTC, resulting in full recovery, and we updated the status page.

Later investigation revealed that the still-running named processes were due to some of the puppet masters having an older version of the repo. Multiple commits were made during the revert of the original Bind change, and webhook delivery failures to certain puppet masters meant that they never performed a git fetch, leaving them with an out-of-date repo. Because we had no monitoring of these webhook failures, this went undetected until after the incident. At 17:37 UTC Puppet finished running across our entire cluster, and we resolved the incident at 18:15 UTC.

What are we going to do in the future?

Currently we lack monitoring for our git webhook services, which led to out-of-sync puppet masters going undetected until after the incident. We are going to add monitoring for this service so that we are notified when webhook deliveries fail.
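
As a rough sketch of what such a check could look like, one complementary approach is to compare each puppet master's local checkout against the remote branch tip, which catches drift even when a webhook was silently dropped. The repo path and branch name below are hypothetical.

```python
#!/usr/bin/env python3
"""Illustrative sync check for a puppet master's config checkout: compare the
local HEAD with the remote branch tip. The repo path and branch are placeholders."""
import subprocess

REPO_PATH = "/etc/puppet/config-repo"  # hypothetical checkout location
BRANCH = "master"                      # hypothetical branch name

def git(*args):
    return subprocess.run(
        ["git", "-C", REPO_PATH, *args],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

if __name__ == "__main__":
    local = git("rev-parse", "HEAD")
    # ls-remote asks the remote directly, so a missed webhook cannot hide drift.
    remote = git("ls-remote", "origin", f"refs/heads/{BRANCH}").split()[0]
    if local != remote:
        print(f"ALERT: {REPO_PATH} is out of sync (local {local[:8]}, remote {remote[:8]})")
    else:
        print(f"OK: {REPO_PATH} matches the remote {BRANCH} branch")
```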

We will investigate using a dedicated DNS server as a fallback for when the local server is unavailable due to bind issues. Because we relied solely on the local server, bind failures left us unable to resolve any hostnames at all, rather than falling back to a slower remote server.
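
To make the desired behaviour concrete, the logic we want from the resolver stack is roughly "ask the local cache first, and only then fall back to a remote resolver". The toy sketch below expresses that with dig; the fallback address is a documentation-range placeholder, and in practice this could equally be an extra nameserver line in resolv.conf or the dedicated fallback server mentioned above.

```python
#!/usr/bin/env python3
"""Toy illustration of resolver fallback: prefer the local cache, fall back to
a remote resolver only if the local one fails to answer. The fallback address
is a documentation-range placeholder."""
import subprocess

RESOLVERS = ["127.0.0.1", "192.0.2.53"]  # local cache first, then a hypothetical remote fallback

def resolve(name):
    for server in RESOLVERS:
        result = subprocess.run(
            ["dig", "+time=2", "+tries=1", "+short", f"@{server}", name, "A"],
            capture_output=True, text=True,
        )
        answers = result.stdout.split()
        if result.returncode == 0 and answers:
            return server, answers
    raise RuntimeError(f"no configured resolver answered for {name}")

if __name__ == "__main__":
    server, answers = resolve("example.com")
    print(f"resolved via {server}: {answers}")
```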

We will continue to work on our tooling to make config changes easier to both test and revert. Delays in performing a clean revert of the original change slowed our initial recovery from the incident.

Posted Dec 15, 2017 - 14:37 UTC

Resolved
This incident has been resolved.
Posted Dec 12, 2017 - 18:15 UTC
Update
We are continuing to see improvements and will keep monitoring the situation while services return to normal levels.

We will publish a post mortem for this incident in the coming days to outline what went wrong and what we plan to do to avoid this happening in the future.
Posted Dec 12, 2017 - 15:20 UTC
Monitoring
After rolling back the configuration changes, we are seeing recovery in ingestion, graph rendering, and alerting. We will continue to monitor the situation and provide another update in one hour.
Posted Dec 12, 2017 - 14:15 UTC
Update
We are still in the process of rolling back the configuration changes. We will provide another update when this is completed.
Posted Dec 12, 2017 - 13:34 UTC
Identified
We have identified the issue as being related to a DNS configuration change that was made earlier today. As a result of this, approximately 70% of ingestion traffic is failing, which will result in partial graphs. This may cause alerts to incorrectly fire. We are rolling back the changes and will provide updates as more information is available.
Posted Dec 12, 2017 - 12:33 UTC
Investigating
We're currently investigating an issue at our load-balancing layer which is affecting ingestion - we will post updates as more information becomes available.
Posted Dec 12, 2017 - 11:42 UTC