From Thursday 11 April 2019 14:07 UTC until Friday 19 April 10:45 UTC alerts created using Grafana’s alerting interface were not mapped correctly to our alerting service. Additionally, any existing alerts on Grafana dashboards were deleted on dashboard save. We have directly contacted users with alerts that were deleted in this way and ensured these alerts were restored. As of Friday 19 April 18:16 UTC, Grafana alerts could be created as normal.
We provide a modified version of Grafana as a service for our customers. In order to provide Grafana alerting, we intercept the alert information as it’s stored in a Grafana dashboard and pass it to our alerting service to be stored as a Hosted Graphite alert. Users can always create alerts in the Hosted Graphite panel, which is completely separate from Grafana, but many find it more convenient to use Grafana’s interface instead.
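The interception step can be sketched roughly like this (a simplified illustration, not our actual code; `extract_alerts` and `sync_alerts` are hypothetical names). On dashboard save we walk the dashboard JSON, collect each panel's alert definition, and diff the result against the alerts already stored for that dashboard:

```python
def extract_alerts(dashboard):
    """Collect alert definitions from a Grafana 4-style dashboard JSON.

    Sketch only: Grafana 4 nests panels inside rows, so we walk
    dashboard["rows"] -> row["panels"] looking for an "alert" field.
    """
    alerts = {}
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            if "alert" in panel:
                alerts[panel["id"]] = panel["alert"]
    return alerts


def sync_alerts(dashboard, stored_alerts):
    """Diff the dashboard's alerts against the alerts already stored.

    Returns (to_create, to_delete).  Note the deletion semantics: any
    stored alert the extractor can no longer see on the dashboard is
    scheduled for deletion on save.
    """
    current = extract_alerts(dashboard)
    to_create = {pid: a for pid, a in current.items() if pid not in stored_alerts}
    to_delete = [pid for pid in stored_alerts if pid not in current]
    return to_create, to_delete
```

The delete-on-absence semantics are what make this kind of sync sensitive to extraction failures: if the extractor silently finds nothing, every stored alert is removed.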
Grafana 5 introduced many changes to the look-and-feel as well as some additional features. The dashboard JSON model also changed to support those features, which can break automation built against the old model, and that includes our own.
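To illustrate how a JSON model change like this can bite (a simplified sketch, not Grafana's full schema): Grafana 4 nested panels inside rows, while Grafana 5 moved them to a top-level panels array with explicit grid positions. A walker written against the old model sees an empty dashboard under the new one:

```python
# Simplified shapes of the two dashboard models (not the full schema).
v4_dashboard = {
    "rows": [
        {"panels": [{"id": 1, "alert": {"name": "High load"}}]},
    ],
}

v5_dashboard = {
    # Grafana 5 flattens panels to the top level; layout moves to gridPos.
    "panels": [
        {"id": 1, "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
         "alert": {"name": "High load"}},
    ],
}


def alerts_v4_walker(dashboard):
    """A walker written against the Grafana 4 model only."""
    found = []
    for row in dashboard.get("rows", []):
        for panel in row.get("panels", []):
            if "alert" in panel:
                found.append(panel["alert"]["name"])
    return found
```

Against the v4 dashboard the walker finds the alert; against the v5 dashboard it finds nothing, and a sync step that deletes whatever it can no longer see would then delete every existing alert on save.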
On Thursday 11 April at 14:07 UTC we started our upgrade from Grafana 4.6 to 5.0, which completed that afternoon. From this point, any new alerts created through a Grafana dashboard were not saved to our alerting service, though they did appear on the dashboard. In addition, saving a dashboard with existing Grafana alerts resulted in deletion of the existing alerts in our alerting service.
A week later, on Thursday 18 April at 20:39 UTC, a support ticket arrived indicating that new Grafana alerts were not being saved correctly. The dev team began investigating the following morning immediately after being informed, and a stopgap was deployed at 10:36 UTC on Friday 19 April. From this point Grafana dashboard alerts were no longer being deleted, but new alerts remained unsaved. The support team directly contacted users whose alerts had been inadvertently deleted, and the dev team worked on the bug itself. The fix was deployed starting at 15:21 UTC, and by 18:16 UTC on Friday afternoon new Grafana dashboard alerts could be created as expected.
On Thursday 18 April at 20:39 UTC we received a support ticket mentioning that dashboard alerts weren’t working. The next morning, Friday 19 April at 08:51 UTC, the dev team was notified and began to investigate. By 09:37 UTC we had determined the bug’s impact was significant enough to alert users through our status page, and we started collecting data on who was affected and when the impact began.
At 10:36 UTC Friday morning we rolled out a change that prevented Grafana alerts from being deleted when dashboards were saved. This deploy was completed at 10:46 UTC, and we turned our focus to fixing the original bug as well as directly contacting customers who had been affected. By 14:09 UTC the support team had contacted all users who’d had alerts inadvertently deleted.
At 15:21 UTC Friday afternoon we started to deploy a fix for the original bug, which required two steps. We deployed the second step at 17:58 UTC, and by 18:16 UTC it had completed for all users.
Once our engineers arrived on Friday morning, they read the support ticket and immediately started working on the problem. We quickly identified and rolled out a way to mitigate the impact in parallel with fixing and testing the bug itself. While the dev team worked on a fix, the SRE and support teams coordinated communication with users, which was key to mitigating the impact.
We should have spotted this bug earlier in our testing of Grafana 5; the interaction between Grafana alerts and our alerting service appears on our testing checklist for upgrades. This particular case wasn’t enumerated, however, so it wasn’t detected before rollout and remained in production for nearly a week until reported by a customer. When it was reported as a support ticket after hours, the on-call engineer wasn’t notified until the following morning.
We are adding additional unit tests for the interaction between Grafana and the rest of our service, and will be enforcing unit test requirements for new features more stringently in the future.
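One shape such a test might take (a pytest-style sketch with hypothetical names; `extract_alerts` here is a toy extractor, not our real code) is to assert that the same alert is found under both dashboard models:

```python
def extract_alerts(dashboard):
    """Collect alert names from either the Grafana 4 (rows) or the
    Grafana 5 (top-level panels) dashboard model."""
    panels = list(dashboard.get("panels", []))
    for row in dashboard.get("rows", []):
        panels.extend(row.get("panels", []))
    return sorted(p["alert"]["name"] for p in panels if "alert" in p)


def test_alert_extraction_survives_schema_change():
    """The same alert must be found under both dashboard JSON models."""
    v4 = {"rows": [{"panels": [{"id": 1, "alert": {"name": "cpu"}}]}]}
    v5 = {"panels": [{"id": 1, "gridPos": {}, "alert": {"name": "cpu"}}]}
    assert extract_alerts(v4) == extract_alerts(v5) == ["cpu"]
```

A test along these lines, run against sample dashboards from each supported Grafana version, would have flagged the regression before the upgrade reached production.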
Prior to this incident we had already started moving towards a less-modified version of Grafana. We began that work to reduce the developer resources required for upgrades and maintenance, but it will also reduce the potential for this kind of incident.