Issue creating and saving alerts through Grafana
Incident Report for Hosted Graphite
Postmortem

Summary

From Thursday 11 April 2019 14:07 UTC until Friday 19 April 18:16 UTC, alerts created using Grafana’s alerting interface were not mapped correctly to our alerting service. Additionally, until Friday 19 April 10:45 UTC, any existing alerts on Grafana dashboards were deleted on dashboard save. We have directly contacted users with alerts that were deleted in this way and ensured these alerts were restored. As of Friday 19 April 18:16 UTC, Grafana alerts could be created as normal.

Background

We provide a modified version of Grafana as a service for our customers. In order to provide Grafana alerting, we intercept the alert information as it’s stored in a Grafana dashboard and pass it to our alerting service to be stored as a Hosted Graphite alert. Users can always create alerts in the Hosted Graphite panel, which is completely separate from Grafana, but many find it more convenient to use Grafana’s interface instead.

Grafana 5 included many changes to Grafana’s look-and-feel as well as some additional features. The dashboard JSON model also changed in order to support some of its new features, which has the potential to affect automation for Grafana users, including us.
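To illustrate the kind of model change involved: as of Grafana 5.0, panels are stored in a top-level "panels" list rather than nested under "rows". The Python sketch below is illustrative only and is not the code we run. The function names iter_panels and extract_alerts are hypothetical, and the sketch assumes only that a panel's alert definition lives under the panel's "alert" key. It shows how code that reads alerts out of dashboard JSON might handle both layouts.

```python
from typing import Any, Dict, Iterator, List

Panel = Dict[str, Any]


def iter_panels(dashboard: Dict[str, Any]) -> Iterator[Panel]:
    """Yield every panel, whichever layout the dashboard JSON uses."""
    # Pre-5.0 layout: panels are nested inside dashboard["rows"][i]["panels"].
    for row in dashboard.get("rows", []):
        yield from row.get("panels", [])
    # 5.0+ layout: panels are a top-level list. A collapsed row is itself a
    # panel of type "row" that carries its children in a nested "panels" list.
    for panel in dashboard.get("panels", []):
        yield panel
        if panel.get("type") == "row":
            yield from panel.get("panels", [])


def extract_alerts(dashboard: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect a minimal record for each panel that carries an alert."""
    return [
        {"panel_id": panel.get("id"), "name": panel["alert"].get("name")}
        for panel in iter_panels(dashboard)
        if "alert" in panel
    ]
```

Code that only walks the older "rows" layout would find no alerts at all in a dashboard saved in the newer layout, which is why a model change like this can silently affect automation.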

What happened?

On Thursday 11 April at 14:07 UTC we started our upgrade from Grafana 4.6 to 5.0, which completed that afternoon. From this point, any new alerts created through a Grafana dashboard were not saved to our alerting service, though they did appear on the dashboard. In addition, saving a dashboard with existing Grafana alerts resulted in deletion of the existing alerts in our alerting service.

A week later, on Thursday 18 April at 20:39 UTC, a support ticket arrived indicating that new Grafana alerts were not being saved correctly. The dev team was informed the following morning and began investigating immediately, and a stopgap was deployed at 10:36 UTC on Friday 19 April. From this point Grafana dashboard alerts were no longer being deleted, but new alerts remained unsaved. The support team directly contacted users whose alerts had been inadvertently deleted, while the dev team worked on the bug itself. The fix was deployed starting at 15:21 UTC, and by 18:16 UTC that afternoon new Grafana dashboard alerts could be created as expected.
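The idea behind a stopgap of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the change we deployed: plan_sync and its allow_delete flag are hypothetical, and alerts are assumed to be keyed by panel id. The point is that a missing alert in the incoming dashboard is treated as "unknown" rather than "deleted" unless deletion is explicitly enabled.

```python
from typing import Dict, List, Tuple


def plan_sync(
    existing: Dict[int, dict],
    incoming: Dict[int, dict],
    allow_delete: bool = False,
) -> Tuple[List[int], List[int], List[int]]:
    """Return (to_create, to_update, to_delete) panel ids for a dashboard save.

    existing: alerts currently held by the alerting service, keyed by panel id.
    incoming: alerts parsed from the dashboard being saved, keyed by panel id.
    """
    to_create = sorted(pid for pid in incoming if pid not in existing)
    to_update = sorted(
        pid for pid in incoming if pid in existing and incoming[pid] != existing[pid]
    )
    missing = sorted(pid for pid in existing if pid not in incoming)
    # With deletion disabled, alerts absent from the saved dashboard are kept.
    to_delete = missing if allow_delete else []
    return to_create, to_update, to_delete
```

With deletion disabled, a save that parses no alerts out of the dashboard plans no changes at all, so a parsing failure cannot remove alerts that already exist.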

Timeline

On Thursday 11 April at 14:07 UTC we started our upgrade from Grafana 4.6 to 5.0, which completed that afternoon. From this point, new alerts created through a Grafana dashboard were not saved to our alerting service, and saving a dashboard with existing Grafana alerts deleted those alerts from our alerting service.

On Thursday 18 April at 20:39 UTC we received a support ticket mentioning that dashboard alerts weren’t working. The next morning, Friday 19 April, at 08:51 UTC, the dev team was notified and began to investigate. By 09:37 UTC we had determined that the bug’s impact was significant enough to notify users through our status page, and started to collect data on who was affected and when the impact started.

At 10:36 UTC Friday morning we rolled out a change that prevented Grafana alerts from being deleted when dashboards were saved. This deploy was completed at 10:46 UTC, and we turned our focus to fixing the original bug as well as directly contacting customers who had been affected. By 14:09 UTC the support team had contacted all users who’d had alerts inadvertently deleted.

At 15:21 UTC Friday afternoon we started to deploy a fix for the original bug, which required two steps. We deployed the second step at 17:58 UTC, and by 18:16 UTC it had completed for all users.

What went well?

Once our engineers arrived Friday morning they read the support ticket and immediately started working on the problem. We quickly identified and rolled out a way to mitigate the impact, in parallel with developing and testing a fix for the bug itself. While the dev team worked on the fix, the SRE and support teams coordinated communication with users, which was key to limiting the impact.

What went badly?

We should have spotted this bug earlier in our testing of Grafana 5; the interaction between Grafana alerts and our alerting service appears on our testing checklist for upgrades. This particular case wasn’t enumerated, however, so it wasn’t detected before rollout and remained in production for nearly a week until reported by a customer. When it was reported as a support ticket after hours, the on-call engineer wasn’t notified until the following morning.

What are we going to do in the future?

We are adding unit tests for the interaction between Grafana and the rest of our service, and will enforce unit test requirements for new features more stringently in the future.
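As a rough illustration of the kind of test this refers to, the sketch below checks that an alert defined on a dashboard panel reaches the alerting service on save, and that re-saving does not remove it. Everything in it is hypothetical: FakeAlertingService is an in-memory stand-in, and on_dashboard_save is a simplified save hook written for this example, not our actual integration.

```python
import unittest


class FakeAlertingService:
    """In-memory stand-in for the alerting service (hypothetical)."""

    def __init__(self):
        self.alerts = {}

    def upsert(self, panel_id, alert):
        self.alerts[panel_id] = alert


def on_dashboard_save(dashboard, service):
    """Hypothetical save hook: forward every panel alert to the service."""
    for panel in dashboard.get("panels", []):
        if "alert" in panel:
            service.upsert(panel["id"], panel["alert"])


class DashboardSaveTest(unittest.TestCase):
    def setUp(self):
        self.service = FakeAlertingService()
        self.dashboard = {
            "panels": [
                {"id": 1, "alert": {"name": "cpu"}},
                {"id": 2},  # panel without an alert
            ]
        }

    def test_new_alert_reaches_service(self):
        on_dashboard_save(self.dashboard, self.service)
        self.assertEqual(self.service.alerts, {1: {"name": "cpu"}})

    def test_resave_keeps_existing_alert(self):
        on_dashboard_save(self.dashboard, self.service)
        on_dashboard_save(self.dashboard, self.service)
        self.assertIn(1, self.service.alerts)

    def test_panel_without_alert_is_ignored(self):
        on_dashboard_save(self.dashboard, self.service)
        self.assertNotIn(2, self.service.alerts)
```

Saved to a file, tests like these can be run with Python's standard runner (python -m unittest), and would fail on both symptoms described in this report: alerts not being saved, and existing alerts disappearing on dashboard save.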

Prior to this incident we’d already started work to move towards a less-modified version of Grafana. We began that work to reduce the developer resources required for upgrades and maintenance, but it will also reduce the potential for this kind of incident.

Posted Apr 29, 2019 - 11:59 UTC

Resolved
At 18:15 UTC we deployed a change to our web-app which fixes the creation and editing of alerts associated with Grafana panels.

This incident is now resolved.
Posted Apr 19, 2019 - 18:15 UTC
Update
Our developers have finished testing the web-app fix for this issue. We are now waiting for the Grafana deploy to finish, at which time we will deploy the new version of the web-app.

We will provide another update when this has been done.
Posted Apr 19, 2019 - 17:31 UTC
Update
At 15:21 UTC, we began the rollout of a new Grafana version to all users which contains part of the fix for the issues we've been seeing with creating alerts through Grafana panels.

We are currently testing a change to our web-app which will fix this issue and we will be deploying this soon.

We will update in one hour or when more information is available.
Posted Apr 19, 2019 - 16:03 UTC
Update
As of 13:15 UTC, all users who had alerts incorrectly deleted have been contacted by our support team with further details.

Our developer team is currently working on fixing the bug that resulted in this incident. We will provide another update in two hours or when we have more information.
Posted Apr 19, 2019 - 14:09 UTC
Update
As of 12:00 UTC, our developers are working on fixing the issues with creating or updating alerts associated with Grafana panels.

Our support team has identified a small percentage of users who were affected by the bug, and will be contacting them with further information over the next hour.

We will provide further updates in two hours or when more information becomes available.
Posted Apr 19, 2019 - 12:03 UTC
Identified
Since April 11th, alerts created through our Grafana UI have not been saved to our alerting service. Our developers are working on fixing this, but until we deploy a patched version of Grafana, any new alerts created through Grafana will not be evaluated or updated in our alerting service.

Additionally, from April 11th until 10:40 UTC today, editing and saving any dashboard that had an alert attached to a panel resulted in that alert being deleted in our alerting service.

While our developers work on fixing this issue, we have deployed a change to prevent any more alerts from being erroneously deleted when dashboards are saved.

We have also identified a subset of users who had alerts deleted in this period, and we will be contacting them directly.

We will provide another update in one hour or when we have more information.
Posted Apr 19, 2019 - 11:01 UTC
This incident affected: Alerting.