https://status.hostedgraphite.com/incidents/14j3jj23zrd9
Between 15:10 UTC and 15:17 UTC on August 17, the internal health checking webservice for our database was unavailable. As a result, our webservers could not find a healthy database server to connect to, and both our website and render API became unavailable.
We rely on an internal health checking service in order for our webservers to be able to connect to a healthy database server. This allows us to quickly fail over to a different node in case there are issues with our database, and to ensure the requests don't reach the wrong host. This webservice runs under a local nginx instance.
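As a rough illustration of this dependency (not our actual implementation), the selection logic can be sketched as below. The function name `pick_healthy_host`, the `is_healthy` callback and the host names in the usage note are hypothetical. The sketch shows why an unreachable health checker looks the same to a webserver as having no healthy database at all.

```python
from typing import Callable, Iterable, Optional


def pick_healthy_host(
    hosts: Iterable[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first host whose health check passes, or None if none do.

    `is_healthy` stands in for a call to the health checking webservice
    (hypothetical interface, assumed for illustration only).
    """
    for host in hosts:
        try:
            if is_healthy(host):
                return host
        except Exception:
            # If the health check itself cannot be reached, the host is
            # treated as unhealthy and the next candidate is tried.
            continue
    # No host passed its check: the caller has nothing safe to connect to.
    return None
```

With placeholder hosts `["db1", "db2"]` and a check that passes only for `db2`, the function returns `"db2"`. If every check raises because the health checking webservice is down, it returns `None` even when the databases themselves are fine.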
A global change to our nginx configuration templates was pushed at 15:02 UTC to address an issue on a different service that also relies on nginx. The change only set a configuration value explicitly to nginx's default, so it wasn't expected to have any effect on other services.
Unfortunately, this change also introduced a syntax error in our nginx configuration. When our configuration management system picked it up, it applied the invalid configuration to our health checking webservices and attempted to restart them, which failed. At that point, our health checking webservices became unavailable, leaving our webservers unable to connect to a database server. This meant that any request to our website or render API would result in an HTTP 500 status code.
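One common guard against this failure mode is to validate a rendered configuration before restarting the service that uses it. The following is a minimal sketch of that idea, not a description of our configuration management system. The helper `restart_if_valid` and its parameters are hypothetical. The only nginx behaviour assumed is that `nginx -t` tests a configuration and exits with a non-zero status when it is invalid, and that `-c` selects the file to test.

```python
import subprocess
from typing import Callable, Sequence


def restart_if_valid(
    validate_cmd: Sequence[str],
    restart: Callable[[], None],
) -> bool:
    """Run a validation command and restart only if it exits with status 0.

    Returns True if the restart was performed and False if validation
    failed, in which case the running service is left untouched.
    """
    result = subprocess.run(validate_cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Invalid configuration: skip the restart so that the currently
        # running instance keeps serving with its last good configuration.
        return False
    restart()
    return True
```

For nginx, the validation command could be `["nginx", "-t", "-c", candidate_path]`, where `candidate_path` is a placeholder for wherever the rendered configuration is written.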
15:02 UTC: A configuration change that inadvertently introduces invalid syntax is pushed to our nginx webservers. The change is rolled out automatically and gradually over the next few minutes, causing some of our internal health checking webservices to fail to start up properly.
15:10 UTC: We receive the first alerts indicating that our website and render API are down. We start investigating.
15:13 UTC: The issue is identified as a failure of our webservers to connect to our database, so we direct our attention to our database health checking webservice.
15:15 UTC: We identify the previous configuration change as the cause of the incident and proceed to revert it.
15:17 UTC: After the configuration change has been fully reverted, both our website and render API are fully operational once again.