Website unavailable
Incident Report for Hosted Graphite
Postmortem

Summary

On May 23, between 11:03 UTC and 11:18 UTC, and between 14:03 UTC and 14:44 UTC, requests to both the Hosted Graphite website and the Render API returned HTTP 503 errors.

Background

We’re currently in the process of migrating all our MySQL instances from MySQL 5.7 to the latest GA version (8.0). As part of the upgrade process, we build a new replica running 8.0, use mysqldump to copy all existing data from our primary database, and enable replication. At this point we can decide to fail over to this new host after verifying it.
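
For illustration, the build-out of a new replica looks roughly like the sketch below. Hostnames, users and binlog coordinates are placeholders, not our real configuration:

```
# 1. Dump all data from the current primary, recording its binlog position
#    (written into the dump by --master-data=2) so the new host can start
#    replicating from the right point.
mysqldump --host=primary.db.internal --user=repl_admin -p \
  --single-transaction --master-data=2 --routines --triggers \
  --all-databases > primary_dump.sql

# 2. Load the dump into the freshly installed MySQL 8.0 replica.
mysql --host=replica8.db.internal --user=repl_admin -p < primary_dump.sql

# 3. Point the replica at the primary and start replicating, using the
#    coordinates recorded in the dump.
mysql --host=replica8.db.internal --user=repl_admin -p -e "
  CHANGE MASTER TO
    MASTER_HOST='primary.db.internal',
    MASTER_USER='repl',
    MASTER_PASSWORD='...',
    MASTER_LOG_FILE='mysql-bin.000123',
    MASTER_LOG_POS=4;
  START SLAVE;"
```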

For this failover process we rely on our health checking mechanism: a small web service running alongside each database that checks several health indicators before allowing that instance to receive traffic. This allows us to fail over traffic from one instance to another by setting the former primary node as read-only, which causes it to start failing health checks, so traffic is directed to the next healthy node (the new primary node).
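
The read-only switch at the heart of this failover is plain MySQL. A minimal sketch follows; the hostname is a placeholder and the health-check query is only an illustration of the kind of indicator the service looks at, not our actual implementation:

```
# Mark the old primary read-only. From this point it fails its health
# check and traffic moves to the next healthy node.
mysql --host=old-primary.db.internal -e "
  SET GLOBAL super_read_only = ON;
  SET GLOBAL read_only = ON;"

# One indicator a health check can use: a node reporting read_only
# should not be receiving write traffic as a primary.
mysql --host=old-primary.db.internal -e "
  SELECT @@global.read_only, @@global.super_read_only;"
```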

Normally, when upgrading to a newer version of MySQL, you are required to run the mysql_upgrade command to ensure that the new version can apply any changes to its internal tables that are required to maintain compatibility. Starting with MySQL 8, this command has been phased out, and the upgrade process (if required) happens as part of the MySQL server startup itself.
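
For reference, on 5.7 this is a separate client program, while recent 8.0 releases handle it in the server at startup. The commands below are generic examples, not our exact invocation:

```
# MySQL 5.7: run the upgrade step as a separate client after switching binaries.
mysql_upgrade --user=root -p

# Recent MySQL 8.0: the server upgrades its internal tables itself at startup;
# the check can also be forced explicitly, e.g. by starting the server with:
mysqld --upgrade=FORCE
```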

The MySQL optimizer relies on up-to-date internal table statistics to make informed decisions when planning the execution path for a query. By default, MySQL will automatically attempt to recalculate the internal statistics for a table after roughly 10% of it has changed. This process can also be triggered manually by running the ANALYZE TABLE command.
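
Assuming InnoDB tables with persistent statistics (the default), the automatic recalculation is governed by the innodb_stats_auto_recalc setting, and the manual path is ANALYZE TABLE. Database and table names below are placeholders:

```
# The automatic ~10% recalculation is controlled by this setting (ON by default).
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_stats_auto_recalc';"

# A recalculation can also be requested manually for specific tables.
mysql -e "ANALYZE TABLE hg.users, hg.accounts;"
```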

What happened?

At 10:24 UTC, we failed over from a MySQL 5.7 database to a MySQL 8 instance we had previously prepared. When originally copying the data over to the new instance we ran into errors, and after some investigation decided to run mysqldump with the --column-statistics=0 flag. The newer version of mysqldump attempts to also copy over the internal table statistics from the master and fails, as the format has changed across versions. Setting this flag to 0 meant that no table statistics would be copied over and that the tables’ internal statistics would not be recalculated after the data restore (we didn’t know this at the time).
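
For reference, the invocation was along these lines (the hostname is a placeholder):

```
# The mysqldump shipped with MySQL 8.0 tries to dump column statistics
# (histograms) by default, which failed when dumping from the 5.7 primary,
# so the behaviour is turned off here:
mysqldump --host=primary.db.internal --single-transaction \
  --column-statistics=0 --all-databases > primary_dump.sql
```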

At 11:03 UTC, we start receiving alerts that our site is unavailable and returning 503s. We begin to investigate and immediately turn our attention to the database. While troubleshooting we notice that commands such as “SHOW SCHEMAS” return an error, making us suspect that the issue was caused by incompatibility between MySQL versions. At no point did our health checks for this MySQL instance fail.

At 11:13 UTC we restart this MySQL instance and force it to upgrade its internal tables. Both our site and our render API become fully available as of 11:17 UTC.

At this time, we thought that it was the forced upgrade and not the restart which resolved the incident. We would later find out that it was the restart that resolved it, since it cleared all the long-running queries from the database.

At 14:03 UTC we again start receiving alerts that our website and Render API are unavailable (and returning HTTP 503 errors). Several things were happening at the same time, which made our troubleshooting of this issue more difficult:

Our scheduled database backups were running on the new primary node.

We were working on bringing up an upgraded replica node, which was copying data from the new primary node at the time.

We initially suspect the latter to be contributing to this issue, so we stop our replication script and proceed to investigate.

Between 14:03 UTC and 14:40 UTC, we investigated different parts of our infrastructure, looking for an explanation for this outage. We suspected a database issue because of the obvious similarities with the earlier incident that day, but our database health checks showed no issues, and we could verify that connectivity between our webservers and database instances was fine. Our webservers didn’t report any exceptions that would make us suspect an issue with our database instances either. At this point we started looking for an explanation somewhere other than our database instances.

The issue was causing all of the worker threads on our webservers to be tied up serving requests that would never finish, because their database queries never completed (we didn’t know this at the time). This resulted in our front-end load balancers failing to forward any requests to our webservers. We restarted our webservers in an attempt to clear the worker threads, but this did not restore website connectivity, and they returned to a similar state almost instantly.

At 14:40 UTC, we saw webservers begin to return errors connecting to the database, as the maximum connection limit on our database was reached. When further investigating the process table on our database instance, we quickly see a large number of processes in “waiting for table flush” state on a few tables. At this point we restart MySQL, which quickly clears up the issue.
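
This is roughly the kind of output we were looking at; the filtered query below is illustrative rather than our exact tooling:

```
# Full process list, including long-running statements.
mysql -e "SHOW FULL PROCESSLIST;"

# Sessions stuck behind the statistics rebuild, plus the slow queries
# blocking it.
mysql -e "
  SELECT id, time, state, LEFT(info, 80) AS query
  FROM information_schema.processlist
  WHERE state IN ('Waiting for table flush', 'Sending data')
  ORDER BY time DESC;"
```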

At 15:13 UTC we consider the incident resolved, but we continue investigating. At this point we suspect that the combination of the scheduled backup running at the same time as the replication script (which was forcefully aborted) resulted in locking. MySQL’s client timeouts are rather high, so we suspected that the replication script could have acquired a lock and that MySQL had not yet realised the client had disconnected.

While we continue investigating this theory, it is brought to our attention that the search option on our internal admin interface is not working: some searches are taking an unusually long time, and some never return at all. Given that this search operation is performed against the same database that was experiencing issues, we decide to take a closer look.

We then find out that the reason this search query was taking so long is that the table statistics in MySQL were never populated (because of our earlier use of --column-statistics=0 when copying the data over), which meant that the MySQL optimizer could not make good decisions when planning the query, resulting in a suboptimal execution plan. At this point we manually forced MySQL to recalculate the internal statistics for each affected table with ANALYZE TABLE, and verified that the affected queries run in a few seconds again.
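
In rough terms the check and the fix look like this; the database and table names are placeholders for the tables behind the admin search:

```
# Confirm the persistent statistics for the affected tables are missing
# or stale.
mysql -e "
  SELECT table_name, n_rows, last_update
  FROM mysql.innodb_table_stats
  WHERE database_name = 'hg';"

# Rebuild them, then re-check the query plan for the slow search.
# (The EXPLAIN below uses an illustrative query; the real admin search
# is more complex.)
mysql -e "ANALYZE TABLE hg.users, hg.accounts;"
mysql -e "EXPLAIN SELECT id FROM hg.users WHERE username LIKE 'foo%';"
```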

This allowed us to put all the pieces together and gain a better understanding of what had been happening. Going back to the process list output we had gathered earlier during the incident, we noticed that not all of the running processes were waiting for a table flush: a number of queries were in the “Sending data” state, and appeared to have been there for a long time. Those queries were, in fact, the admin search queries that had been abnormally slow since the database failover.

Most of the queries that are executed against this particular database instance are not complex enough to be impacted by the lack of table statistics, but the admin search query is very heavily affected. Normally, this single query being slow wouldn’t affect the other queries running in the same database instance, and because these search queries are executed relatively infrequently we would expect there to be no impact aside from a slow user search in the admin interface.

Unfortunately, while these queries were running, MySQL attempted to recalculate the internal statistics for the affected tables. MySQL will attempt to recalculate the internal statistics for a table when around 10% of its rows have changed, or when asked to do so manually (by running ANALYZE TABLE). However, this mechanism isn’t fully documented, and we suspect there may be other triggers for this recalculation.

Now the chain of events that led to both incidents becomes clear:

After the failover, the new primary database has no internal table statistics, but this is not enough to impact normal operation of the database as most queries are simple enough that they can run without the optimizer having data to make an informed decision on a query plan.

At some point a more expensive query is performed, one that will take a long time to finish without updated table statistics.

While that query is still running, MySQL attempts to rebuild the internal table statistics for some of the tables involved in that query. This operation can’t be completed while the query is still running.

Any new queries against the affected tables are also put on hold until the stats recalculation finishes (they go into the “waiting for table flush” state). Because the recalculation can’t start until the original slow queries complete, new queries keep piling up. At this point any query touching the affected tables simply hangs.

The number of active threads in our MySQL instance keeps increasing until it hits the connection limit, at which point MySQL starts rejecting new connection attempts, which results in our webservers throwing errors (the counters that make this state visible are sketched after this list).

Once the internal stats for the affected tables had been recalculated, we could verify that the issue was no longer reproducible.
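
The counters referenced above are all visible from the server itself; a rough monitoring sketch:

```
# How close the server is to its connection limit.
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"
mysql -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';"

# A breakdown of what current sessions are doing; a pile-up in a single
# state ("Waiting for table flush") is the signature of this failure.
mysql -e "
  SELECT state, COUNT(*) AS sessions
  FROM information_schema.processlist
  GROUP BY state
  ORDER BY sessions DESC;"
```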

Timeline

At 10:24 UTC, we perform the failover from our MySQL 5.7 database to the replica running MySQL 8.

At 11:03 UTC, we receive alerts that connections to our website are timing out, and our Render API is returning HTTP 503 errors.

At 11:08 UTC, we update our status page.

At 11:13 UTC, we run the MySQL binary manually on the host to force it to run an upgrade.

At 11:17 UTC, we verify that we can connect to our website and Render API and we resolve the incident.

At 14:00 UTC, our scheduled backup begins to run.

At 14:02 UTC, we run a script which makes our secondary database a replica of our primary database.

At 14:03 UTC, we receive alerts that requests to our website and Render API were returning HTTP 500 errors and we begin to investigate. At this time, the replication script is stopped and we turn MySQL off on our secondary database.

At 14:04 UTC, we verify that our primary database is passing health checks.

At 14:10 UTC, we publish our first status page for this incident.

At 14:12 UTC, we notice high FCGI load on our webservers, and restart those processes. FCGI load reduces after this, but our website and render API remain unavailable.

At 14:18 UTC, we verify that we can connect to the primary database from our webservers.

At 14:23 UTC, we verify that our front-end load balancers are still routing requests to our webservers.

At 14:30 UTC, we verify that we can connect to our Grafana backend from our webservers.

At 14:40 UTC, we see connections start to fail from our webservers to our primary database.

At 14:42 UTC, upon investigating the failing connections, we start to see error messages saying there are too many connections to the database. We now note that all open connections are waiting for a table flush.

At 14:44 UTC, we restart the database. This restores connectivity to our website and render API. We set the status page to Monitoring, and begin to look for the cause of this incident.

At 15:13 UTC, we mark the second incident as resolved as we have what we feel is an acceptable explanation for what happened.

At 16:09 UTC, we run `analyze table` on the tables used in our admin search query and test the query again. It now completes in a few seconds.

What went badly?

MySQL health checks never failed, even when no queries were finishing. This resulted in the second incident running much longer than it otherwise would have, as we didn’t realise the database was the problem until its connection limit was hit.

We don’t currently have a timeout for queries from our webapp to our database, so the long-running admin search queries would effectively have run forever (one possible mitigation is sketched at the end of this section).

We had all the necessary information we required to properly diagnose this issue, but due to all the (seemingly contradictory) information we were receiving, we didn’t look in the right place.
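
On the missing query timeouts: one possible mitigation, sketched purely as an illustration and not a decision on what we will actually deploy, is MySQL’s server-side cap on SELECT execution time (available since 5.7.8). The values and table name below are placeholders:

```
# Cap SELECT execution time server-side (value in milliseconds).
mysql -e "SET GLOBAL max_execution_time = 30000;"

# Or per statement, via an optimizer hint.
mysql -e "SELECT /*+ MAX_EXECUTION_TIME(5000) */ * FROM hg.users LIMIT 10;"
```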

What are we going to do in the future?

Make sure that, as part of our failover process, we perform a forced upgrade and a recalculation of internal table statistics. [SRE-997]

Add timeouts to database queries from our webapp. [DEV-400]

Improve health checking for our databases, by identifying any signs of issues that should prevent us from failing over to a new node. [SRE-998]

Improve our own monitoring for our databases, such as reporting when we have an abnormal number of threads running or a high number of processes locked in a single operation. [SRE-999]

Posted May 31, 2019 - 15:50 UTC

Resolved
We have confirmed that there is no remaining impact and the website and graph rendering are now fully available. We believe we have uncovered the cause of the database connectivity issue that led to the outage and will be posting a postmortem explaining our findings soon.
Posted May 23, 2019 - 15:13 UTC
Update
A suspected issue with our database host was causing critical load on our webservers as they attempted to connect to it. We have restored healthy connectivity to the database and the website is now functional. We're still investigating and will continue to monitor the situation.
Posted May 23, 2019 - 14:51 UTC
Update
We are still experiencing all requests hitting www.hostedgraphite.com returning 5XX responses and the site is still unavailable. We continue to investigate the cause.
Posted May 23, 2019 - 14:30 UTC
Investigating
Since 14:03 UTC we are experiencing timeouts to our website and render API. We are investigating the issue.
Posted May 23, 2019 - 14:10 UTC
This incident affected: Website.