Intermittent 500 Responses From Grafana
Incident Report for Hosted Graphite
Postmortem

Summary

While rolling out a website change in preparation for supporting Grafana 5, our deployment procedure failed to fully roll out the change to all our web servers. This partial deployment resulted in approximately 30% of Grafana requests returning HTTP 500 responses when opening new dashboards.

Background

We run Docker in Swarm Mode for our Grafana infrastructure. When a user makes a request to Grafana, it is routed to one of our Docker Swarm nodes, which then routes it to one of the user's replicated containers in the cluster.

To determine where to route Grafana requests within the Swarm cluster, we rely on our own service discovery mechanism, which is responsible for marking Swarm nodes as healthy and therefore available to receive requests. Health checking for the Docker Swarm layer involves checking whether Grafana is able to serve requests for a canary user.
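As a rough illustration of this kind of check, here is a minimal sketch in Python. The names (CANARY_USER, HEALTH_PATH, node_is_healthy) and the routing header are hypothetical placeholders, not our actual service discovery implementation.

    import requests

    # Hypothetical values for illustration only; not the real canary user or endpoint.
    CANARY_USER = "canary"
    HEALTH_PATH = "/api/health"
    TIMEOUT_SECONDS = 2


    def node_is_healthy(node_address: str) -> bool:
        """Mark a Swarm node healthy if it can serve a Grafana request for the canary user."""
        url = f"http://{node_address}{HEALTH_PATH}"
        try:
            response = requests.get(
                url,
                headers={"X-Canary-User": CANARY_USER},  # illustrative routing header
                timeout=TIMEOUT_SECONDS,
            )
        except requests.RequestException:
            return False
        return response.status_code == 200

The important property, and the one that caused trouble here, is that the health of an entire node is inferred from the availability of a single user's Grafana instance.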

The usual process for deployments to our website involves rolling out the update to a canary group, where it is tested before being rolled out to the rest of the fleet. A canary group is a small subset of servers that receive a portion of traffic, helping to minimise impact if there are any issues with the change. Once the changes are deemed safe to roll out, the deployment continues to the rest of our fleet, and during this process our tooling ensures that the canary group and the rest of our web servers are running the same version of our site.

What happened?

To prepare our website for Grafana 5.0 and to maintain backwards compatibility between Grafana versions, an update to our website was rolled out at 12:56 UTC. This change had been tested in one of our dedicated environments before rollout, which was considered sufficient testing and led us to bypass our usual canary procedure (described above). Because of this, the canary group continued to run the older, incompatible version of the website.

At 14:35 UTC we started rolling out Grafana 5.0 to our users. Upgrading the version of Grafana for a given user involves stopping their instances and spinning up new ones, which, coupled with the database migrations required for some upgrades, can make Grafana unavailable for a brief period.

Because our internal health checking relies on the Grafana instance being available for our canary user, while Grafana 5 was being rolled out for that canary user our health checks (incorrectly) assumed that the whole Swarm cluster was unavailable. When there are no healthy Swarm nodes available, Grafana requests are routed to a legacy instance running Grafana 4.0. This fallback behaviour is itself a bug in our service discovery code: when there are no known healthy nodes available, the service is supposed to fall back to a static list of nodes, but instead it routed requests locally (to the legacy Grafana instance).
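A simplified sketch of the intended versus actual fallback behaviour, using hypothetical names (pick_node, STATIC_FALLBACK_NODES, LOCAL_LEGACY_NODE) rather than our real routing code:

    import random

    # Hypothetical names for illustration; not our actual service discovery code.
    STATIC_FALLBACK_NODES = ["swarm-node-a:3000", "swarm-node-b:3000"]
    LOCAL_LEGACY_NODE = "localhost:3000"  # legacy Grafana 4.0 instance


    def pick_node(healthy_nodes: list[str]) -> str:
        """Choose a node to route a Grafana request to."""
        if healthy_nodes:
            return random.choice(healthy_nodes)
        # Intended behaviour: fall back to a static list of known nodes.
        #     return random.choice(STATIC_FALLBACK_NODES)
        # Actual (buggy) behaviour: route the request to the local legacy instance.
        return LOCAL_LEGACY_NODE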

At this point, users who opened dashboards would see their requests routed to this legacy Grafana instance, in many cases leading to HTTP 404 responses, as the legacy instance runs against an older copy of the Grafana database in which many of those dashboards do not exist.

This caused an increase in HTTP 404 responses that wasn’t picked up by our monitoring, and it wasn’t until 15:02 UTC that we were made aware of this through a support request. At this point we decided to stop the rollout and begin rolling back to Grafana 4.6. This process was completed at 15:20 UTC.

The issue described in the support requests initially seemed to have been solved, and the incident was considered resolved. At 18:10 UTC we realised the issue wasn't fully solved and there were still intermittent errors when opening dashboards. Not only were some dashboards returning 404 responses, but users were also getting intermittent 500 errors.

We continued investigating and realised that the 500 errors were isolated to the canary group that was still running the old version of the website. After we upgraded this canary group to the latest version, we verified that the HTTP 500 errors had stopped.

Why were requests being routed to the legacy Grafana instance for so long?

So far we have identified two contributing factors for requests being routed to the legacy Grafana instance:

The bug in our service discovery code that falls back to a local Grafana instance if there are no healthy nodes available.

The fact that the health check depends on a Grafana instance that went down during the rollout (and while database migrations took place).

But the health checks only failed for a brief period of time (while the new instances were spinning up and the database migrations were running), so our service discovery mechanism should only have been routing requests to the wrong instance for less than two minutes. This was not the case.

A second bug in the service discovery code meant that once the list of available nodes was empty, subsequent attempts to refresh its internal view of that list would fail, even when healthy nodes had become available again. Requests then continued to be routed to the fallback option until the service was restarted or the workers were recycled (which happens after a number of requests). This also made the issue harder to debug, because its impact became less and less frequent over time as workers were recycled.
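A minimal sketch of this failure mode, under the assumption that the node list is cached per worker; the names (NodeCache, fetch_healthy_nodes) are illustrative and not taken from our actual code:

    # Hypothetical sketch: once the cached node list is empty, refreshes never
    # repopulate it, so the worker keeps using the fallback until it is recycled.

    def fetch_healthy_nodes() -> list[str]:
        """Stand-in for querying service discovery for currently healthy nodes."""
        return ["swarm-node-a:3000", "swarm-node-b:3000"]


    class NodeCache:
        def __init__(self) -> None:
            self.nodes: list[str] = fetch_healthy_nodes()

        def refresh(self) -> None:
            # Buggy guard: if the cache is already empty, the refresh is skipped,
            # so the list never recovers even when healthy nodes exist again.
            if not self.nodes:
                return
            self.nodes = fetch_healthy_nodes()

Because each worker holds its own cache, the problem disappears worker by worker as they are recycled, which is why the errors became intermittent rather than constant.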

What went well?

Once we were fully aware of what the issue was, tracking down the exact source of the 500 responses was quick, as we already had the right tools and dashboards, which immediately pointed us to the right subset of hosts.

What went badly?

While we had monitoring in place to observe the increase in HTTP 500 and 404 responses from Grafana, we weren't aware of the issue until a support request was received. The information was there, but we failed to act on it.

The breakdown in communication between handling of the support ticket and incident management meant that the different teams were unsure of the incident status. There were three different teams working on this incident and the lines of communication and separation of responsibilities were not as clear as they could have been.

Because of the above, we incorrectly assumed that rolling back Grafana was sufficient to resolve the errors and failed to verify this until 18:10 UTC.

The canary rollout process wasn't fully followed, leading to a subset of the hosts running the wrong version of the website. The lack of feedback from our automation meant that we incorrectly assumed all the hosts had been updated.

Our health checking for Grafana relies on a single service being available, which leads to false positives during rollouts.

Our service discovery mechanism wrongly decided to fall back to routing requests through a legacy Grafana instance instead of a static list of nodes.

What are we going to do in the future?

We will improve our health checking for our Grafana infrastructure so that a roll-out of Grafana doesn’t cause any interruptions to the service. [SRE-861]

Our deployment tracking for Grafana instances will be improved to include HTTP 5XX responses. [SRE-869]

We will improve the automation of our website deployment process to detect discrepancies between versions of our website across our hosts. [SRE-870]

We are going to be making improvements to our tooling around Grafana rollouts to increase the visibility that our dev team has on the whole process, and in particular to more accurately assess progress and status of a rollout. [SRE-873]

The fallback process has already been fixed to stop routing Grafana requests to a legacy version of Grafana when the swarm hosts are unavailable.

Posted Apr 01, 2019 - 13:02 UTC

Resolved
We have confirmed we are no longer seeing these 500s being returned by our Grafana servers. A full post mortem will be forthcoming in the next few days.
Posted Mar 25, 2019 - 19:39 UTC
Monitoring
From 15:02 UTC until 18:15 UTC ~30% of Grafana requests were returning 500 responses from our servers which caused dashboards to fail to load in these cases.

The issue has been identified and a rollback took place.
Posted Mar 25, 2019 - 18:34 UTC