From 14:12 UTC on March 27th to 12:14 UTC on March 28th, Grafana dashboards created or updated after 15:17 UTC on March 25th were unavailable.
We rolled out a new version of Grafana (Grafana 5) on March 25th, but it had to be rolled back shortly afterwards. A later attempt to roll out Grafana 5 to our users resulted in necessary database migrations not being performed, leaving any dashboards created in the intervening period inaccessible.
We run Docker in Swarm Mode for our Grafana infrastructure. When a user makes a request to Grafana, it is routed to one of our Docker Swarm nodes which then routes it to one of the user’s replicated containers in the cluster.
To upgrade the version of Grafana for our users, we use a periodic job-running tool that, over time, updates the image for each user one at a time. Because Grafana upgrades usually perform rather expensive database migrations, this upgrade job works in batches and operates sequentially to make sure we don't overload our database when performing upgrades.
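The batching behaviour described above can be sketched roughly as follows (a minimal illustration with made-up names, not our actual tooling):

```python
# Sketch of a batched, sequential rollout. Users are upgraded in small
# batches, one batch at a time, so expensive migrations never overlap.
from typing import Callable, Iterable, List


def rollout(users: Iterable[str], upgrade: Callable[[str], None],
            batch_size: int = 5) -> List[List[str]]:
    """Upgrade users in sequential batches; returns the batches processed."""
    batches: List[List[str]] = []
    batch: List[str] = []
    for user in users:
        batch.append(user)
        if len(batch) == batch_size:
            for u in batch:  # one user at a time within the batch
                upgrade(u)
            batches.append(batch)
            batch = []
    if batch:  # flush any final partial batch
        for u in batch:
            upgrade(u)
        batches.append(batch)
    return batches
```

Because the batches run strictly one after another, the database only ever sees one batch's worth of migration load at a time.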
Each new version of Grafana comes with a set of database migrations that must be performed to upgrade from one version to the next. Migrations are only performed for upgrade operations, never when downgrading to a previous version.
On March 25th, we rolled out Grafana 5 to all of our users. The database schema was different for this version, so database migrations were performed during the upgrade. In particular, a uid column was added to the table storing dashboards, and the migration process ensured that existing dashboards would have that column populated with the right value after the migration finished. Shortly after the deploy, we discovered an issue which caused us to roll back the upgrade as a precaution. Unlike the upgrade, the downgrade doesn't perform database migrations, which left the new column intact. This wasn't a problem at the time because the new column would simply be ignored by the older version of Grafana.
On March 27th at 14:12 UTC, we attempted to roll out Grafana 5 to our users again. This time, the deployment process didn't perform any database migrations, since the new column already existed. Unfortunately, this also meant that any dashboards created since the previous rollback wouldn't have the uid column populated, and the lack of a migration meant that they wouldn't be automatically fixed during the deployment. This missing field caused the dashboard URLs to be incomplete, leading to HTTP 404 responses when trying to open them. This increase in 404 responses was recorded by our monitoring, but went unnoticed by us until a support request was received at 21:00 UTC.
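A minimal sqlite3 reconstruction (hypothetical schema and guard logic, not Grafana's actual migration code) shows how a column-existence check can cause the backfill to be skipped on the second rollout:

```python
# Reconstruction of how a "column already exists" guard can skip the
# backfill step, leaving dashboards created after a rollback with a
# NULL uid. Schema and guard are illustrative assumptions.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE dashboard (id INTEGER PRIMARY KEY, title TEXT)")
db.execute("INSERT INTO dashboard (title) VALUES ('pre-upgrade')")


def migrate(db):
    cols = [row[1] for row in db.execute("PRAGMA table_info(dashboard)")]
    if "uid" in cols:
        return False  # guard: migration considered already applied
    db.execute("ALTER TABLE dashboard ADD COLUMN uid TEXT")
    db.execute("UPDATE dashboard SET uid = 'uid-' || id WHERE uid IS NULL")
    return True


migrate(db)  # first rollout: adds the uid column and backfills it
# ...rollback happens; the uid column stays. A new dashboard is created:
db.execute("INSERT INTO dashboard (title) VALUES ('post-rollback')")
migrate(db)  # second rollout: guard fires, so no backfill runs
rows = list(db.execute("SELECT title, uid FROM dashboard ORDER BY id"))
# 'post-rollback' now has uid = NULL, so its URL is incomplete
```

In this sketch the guard only checks whether the column exists, not whether every row has been backfilled, which is exactly the gap that left the newer dashboards with NULL uids.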
At 21:40 UTC we identified the cause of the 404 responses to be NULL fields in the database. The potential impact of rolling back wasn't clear, so the downgrade was first tested on a staging server. After confirming that the rollback fixed the issue, we decided at 23:00 UTC to downgrade production to our previous version of Grafana (4.6) to minimize further impact.
While we were able to verify that rolling back fixed the dashboards with missing UIDs, we didn't test for (and therefore didn't notice) that dashboards saved under Grafana 5 would be inaccessible after the rollback. The rollback finished at 07:51 UTC on March 28th.
On March 28th at 11:00 UTC, we realised that dashboards originally saved under Grafana 5 were inaccessible under Grafana 4. This is because the dashboard JSON generated by Grafana 5 is not backwards compatible with Grafana 4. Given that only a small number of dashboards were affected, our support team fixed them by reverting each to a previous good version and notified the affected users directly. This process was completed by 12:50 UTC.
A combination of factors contributed to this rollback process being slower than we think it should have been:
Because we want to make sure that migrations don't negatively affect our database during a deployment, our rollout process tends to favour safety over speed in terms of batch sizes. Our rollback process relies on the same code, so it made the same assumptions, which is incorrect since database migrations do not happen during a version downgrade.
Due to an unrelated issue, the service responsible for scheduling container updates was temporarily running with a single worker as opposed to multiple workers. This has now been addressed.
Once we had been alerted that there was an issue, we had the right tools to easily diagnose it and quickly find the source of the problem, including the scope of the impact and which users had been affected. The incident would have taken much longer to deal with if we didn't have all this information readily available to us.
Our testing infrastructure easily allowed us to reproduce the incident and test rollout/rollback scenarios, rather than committing to a full-scale rollback without knowing whether it would address the issue.
The affected dashboards had been inaccessible for hours before we communicated this to our users through our status page and that’s not good enough. This is because of a combination of two factors:
It wasn’t immediately obvious to us during/after the deployment that the dashboards were inaccessible, and we need our monitoring to alert us when there’s an increase in 404 responses.
The issue was reported to us through our support team rather than through our standard incident channels. With several teams working on the issue, this resulted in miscommunication and confusion about how it should be handled and communicated to users.
Related to the point above, we didn't follow our usual channels for communicating during incidents, so when more people joined the investigation the following day, much of the needed context was spread across different communication channels, and a lot of extra time was required just to catch up.
The potential impact of rolling back wasn't originally clear and had to be tested first, which delayed the process. The testing process also missed that dashboards saved under Grafana 5 would be inaccessible in Grafana 4.
We are unable to downgrade Grafana on a per-user level, which meant we needed to downgrade everyone at once instead of just those affected. This delayed resolution for the affected users and led to the final issue with backwards-incompatible dashboard JSON.
One recurring issue during deployments is the lack of visibility our different teams have into the rollout/rollback process for new Grafana versions. This includes things like easily checking the version each user is running, replication status for their containers, availability of their service… All this information is available, but not centralised. We’re going to make sure all the relevant information during deployments can be easily found in a central place. [SRE-872 | SRE-874]
Our alerting system needs to let us know when there's an anomalous increase in 404 responses across our service, particularly during/after a deployment. [SRE-886]
The deploy and rollback process for Grafana versions will be improved to allow faster and safer version changes. For instance, rollbacks can be performed more quickly, since they don't involve database migrations and therefore won't cause extra load on the database. [SRE-881]
Our automation around Grafana deploys will be improved to provide better feedback in terms of progress, expected duration and failures. [SRE-881 | SRE-873 | SRE-875 | SRE-876]
Our deployment process for Grafana will be expanded to allow subsets of users to run different Grafana versions temporarily. [SRE-880]
We are going to work on improving our internal processes to make sure we clearly identify potential impact and a rollback procedure as a requirement before rolling out Grafana changes. We're also going to clarify what the incident management process should look like for incidents that involve multiple teams or are initiated through non-standard channels (like support requests). [SRE-882]
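As an illustration of the 404 alerting called for in [SRE-886], a check over a window of response counts could be as simple as the following (the threshold here is a made-up example, not a value from our actual monitoring):

```python
# Hedged sketch of a 404-rate check for an alerting pipeline.
# `status_counts` maps HTTP status code -> number of responses seen
# in the current window; the 5% threshold is an illustrative default.
def should_alert(status_counts: dict, threshold: float = 0.05) -> bool:
    """Alert when 404s exceed `threshold` as a fraction of all responses."""
    total = sum(status_counts.values())
    if total == 0:
        return False  # nothing observed in this window
    return status_counts.get(404, 0) / total > threshold
```

A real implementation would compare against a historical baseline rather than a fixed threshold, so that a deploy-time spike stands out against normal traffic.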