Elevated website error rates
Incident Report for Hosted Graphite
Postmortem

https://status.hostedgraphite.com/incidents/14j3jj23zrd9

Summary

Between 15:10 UTC and 15:17 UTC on August 17, our internal database health checking webservice was unavailable, leaving our webservers unable to find a healthy database server to connect to and making both our website and our render API unavailable.

What happened?

We rely on an internal health checking service so that our webservers can connect to a healthy database server. This allows us to fail over quickly to a different node if there are issues with our database, and ensures that requests don't reach the wrong host. This webservice runs under a local nginx instance.
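For illustration only, here is a minimal sketch of how a webserver might ask such a health checking webservice for a healthy database node. The endpoint path, port, and response format below are placeholders, not our actual internal API.

    import json
    import urllib.request

    # Hypothetical endpoint exposed by the local health checking webservice.
    # The real path, port and response format are assumptions for illustration.
    HEALTHCHECK_URL = "http://127.0.0.1:8080/healthy-database"

    def get_healthy_database_node(timeout=2.0):
        """Ask the local health checking webservice which database node is healthy.

        Raises if the webservice is unreachable, which is effectively what
        happened to our webservers during this incident.
        """
        with urllib.request.urlopen(HEALTHCHECK_URL, timeout=timeout) as resp:
            payload = json.load(resp)
        return payload["healthy_node"]  # e.g. "db-01.internal:5432"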

A global change to our nginx configuration templates was pushed at 15:02 UTC to address an issue with a different service that also relies on nginx. The change only set a configuration value to its nginx default, so it wasn't expected to have any effect on other services.

Unfortunately, this change also introduced a syntax error into our nginx configuration. When our configuration management system picked it up, it applied the invalid configuration to our health checking webservices and attempted to restart them, which failed. At that point, our health checking webservices became unavailable and our webservers could no longer find a database server to connect to, so any request to our website or render API resulted in a 500 HTTP status code.

What went well?

  • The problem was quickly identified as a health checking issue, which made it trivial to find and revert the responsible change.

What went badly?

  • We should have been able to detect that the change was invalid before merging it, and even after merging, our configuration management system should have refused to apply an invalid change.

What are we going to do in the future?

  • We have already implemented a change that makes our configuration management system refuse to apply invalid changes to our webserver configuration (a sketch of such a pre-apply check follows this list). We already do this for most of our services and will continue rolling it out to all of them.
  • We're going to improve our own validation process so that we can catch similar issues before they are merged.
  • Our webservers should be able to fall back to a "safe mode" of operation if the health checking service itself is unavailable, and find their active database node from a static list (see the second sketch below).
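As a sketch of the pre-apply check, the configuration management step could run nginx's own validator against the candidate configuration and refuse to proceed if it fails. The file paths and helper names below are hypothetical; only the `nginx -t` and `nginx -s reload` invocations are standard nginx behaviour.

    import shutil
    import subprocess

    def nginx_config_is_valid(config_path):
        """Run `nginx -t` against a candidate configuration file.

        `nginx -t -c <file>` only tests the configuration; it does not
        start or reload the server.
        """
        result = subprocess.run(
            ["nginx", "-t", "-c", config_path],
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:
            # nginx reports syntax errors on stderr; surface them for the operator.
            print(result.stderr)
        return result.returncode == 0

    def apply_config(candidate_path, live_path):
        """Refuse to install and reload a configuration that fails validation."""
        if not nginx_config_is_valid(candidate_path):
            raise RuntimeError("refusing to apply invalid nginx configuration")
        shutil.copy(candidate_path, live_path)
        subprocess.run(["nginx", "-s", "reload"], check=True)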
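And here is a minimal sketch of the "safe mode" fallback, reusing the hypothetical get_healthy_database_node helper from the earlier sketch. The node names and the simple TCP reachability check are assumptions; a real implementation would probe the database itself.

    import socket

    # Static fallback list of database nodes (hypothetical names), consulted only
    # when the health checking webservice itself cannot be reached.
    STATIC_DATABASE_NODES = ["db-01.internal:5432", "db-02.internal:5432"]

    def node_is_reachable(node, timeout=1.0):
        """Cheap TCP connectivity check, standing in for a real database-level probe."""
        host, port = node.rsplit(":", 1)
        try:
            with socket.create_connection((host, int(port)), timeout=timeout):
                return True
        except OSError:
            return False

    def pick_database_node():
        """Prefer the health checking webservice; if it is unreachable, fall back
        to the static list rather than failing every request with HTTP 500."""
        try:
            return get_healthy_database_node()
        except OSError:
            for node in STATIC_DATABASE_NODES:
                if node_is_reachable(node):
                    return node
            raise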

Incident timeline

August 17

15:02 UTC: A configuration change that inadvertently introduces invalid syntax is pushed to our nginx webservers. The change is rolled out automatically over the next few minutes, causing some of our internal health checking webservices to fail to start up properly.

15:10 UTC: We receive the first alerts indicating that our website and render API are down. We start investigating.

15:13 UTC: The issue is identified as a failure of our webservers to connect to our database, so we direct our attention to our database health checking webservice.

15:15 UTC: We identify the previous configuration change as the cause of the incident and proceed to revert it.

15:17 UTC: With the configuration change fully reverted, both the website and the render API are fully operational once again.

Posted Aug 20, 2018 - 15:06 UTC

Resolved
At 15:17 UTC, we identified an incorrect configuration change affecting our database health checks, which left our site unavailable for lack of a healthy database. The change has been rolled back, the website and API are fully operational again, and the incident has been resolved.
Posted Aug 17, 2018 - 15:25 UTC
Investigating
At 15:10 UTC, the website started experiencing database issues. We're working to resolve the issue.
Posted Aug 17, 2018 - 15:18 UTC
This incident affected: Website.