Grafana Connectivity Issues
Incident Report for Hosted Graphite
Postmortem

Summary

From 10:54 UTC on 19 November to 18:00 UTC on 21 November, an estimated 4% of customers were unable to access Grafana as a result of a destabilising deploy and a breaking change to the API we use to determine container health during orchestration. Due to the nature of the instability we were unable to simply roll back the change, which prolonged the incident. Core ingestion services maintained 100% availability; aggregation, alerting and API access were also unaffected.

Background

We run a cluster of Grafana instances for each account on our service, orchestrated by docker swarm. Several services interact with the swarm: one that creates and deletes services based on user status, one that handles user creation and login flow, and our chatops bot, which automates some swarm operations.
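
For context, each account's Grafana runs as a swarm service, so most of the state we reason about during an incident is visible through the standard Docker CLI. The commands below are a minimal sketch of that kind of inspection; the service name is a placeholder, not one of our real service names.

    # List all services in the swarm, with running vs. desired replica counts.
    docker service ls

    # Show the tasks (containers) for a single service, including which node each
    # runs on and any error from failed starts. "grafana_example" is a placeholder.
    docker service ps --no-trunc grafana_example

    # List swarm nodes with their availability and manager status.
    docker node ls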

What happened?

We initiated a deploy to upgrade Grafana for our shared tenancy cluster, having successfully upgraded Grafana in dedicated clusters two weeks prior. The parameters for the deploy resulted in overloading the swarm with new container requests, and users started to lose access to their dashboards.

First, we focused on understanding the impact and improving visibility into the swarm itself and services interacting with it. Attempts to restore the swarm included failing over to a new database and restarting services on the swarm.

Once we'd restarted what we could, we discovered a recent change to one of the services interacting with Grafana that prevented containers from being accessed correctly by the main app. We implemented a fix, which then revealed an incompatibility between the new version of Grafana and our authentication proxy.
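
We won't reproduce our proxy configuration here, but for illustration: Grafana's auth-proxy behaviour is driven by the [auth.proxy] section of grafana.ini, and each of those settings can also be overridden with a GF_-prefixed environment variable, which is the usual way to adjust it for a containerised deployment. The sketch below is an example only; the header name and values are illustrative, not our production settings.

    # Example only: start a Grafana container with auth-proxy enabled via
    # environment-variable overrides of the [auth.proxy] section of grafana.ini.
    docker run -d --name=grafana-example -p 3000:3000 \
      -e GF_AUTH_PROXY_ENABLED=true \
      -e GF_AUTH_PROXY_HEADER_NAME=X-WEBAUTH-USER \
      -e GF_AUTH_PROXY_HEADER_PROPERTY=username \
      -e GF_AUTH_PROXY_AUTO_SIGN_UP=true \
      grafana/grafana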

By then the swarm still hadn't stabilised, so we built a new swarm and migrated Grafana services over to it. With a healthy swarm, we were able to roll back the Grafana upgrade, restoring service for all users.

What went well?

Everyone pitched in, from support to dev to the SREs responsible for the services. Many of the people involved were relatively new to supporting these services, but that didn’t stop them from diving in, finding what was wrong, and assisting with fixes.

Although visibility started off poor, we were able to improve it over the course of the incident and get the insights we needed. This has resulted in immediate improvements to the logging in these services, which will aid future work.

During a lengthy incident the team prioritised the many issues correctly: restoring swarm health remained most important, even as we discovered bugs and other problems that needed attention. While we did address some issues in a “fix-forward” way, we were able to rebuild the swarm and roll back the upgrade in order to restore access for customers.

What went badly?

Grafana 6 was a relatively recent migration that went into production over the summer. As part of that large update, we had intermittent connection problems that were reported and still under investigation in our support queue. As a result, after the initial deploy we treated the first support tickets as part of that known issue rather than as a consequence of the deploy. This slowed our first responses and allowed the already unstable swarm to continue to fail. In addition, we have not had strong success tuning alerting and logging for portions of our docker swarm infrastructure; the output is sometimes so verbose that it obscures the nature of a problem.

Although the upgraded version of Grafana had been successfully deployed to our smaller clusters, the process for the larger shared cluster is different and had not been sufficiently documented. Moreover, we relied on our chatops bot which, although extremely convenient for infrastructure management, made it harder to see immediately what the source of the issue was. What we experienced was a seamless deploy as usual, followed by no indication of failure until the swarm was critically overloaded. Additionally, the shared cluster has a greater variety of user and team configurations, not all of which were reflected in the dedicated clusters, so there were many possible error states to investigate.

Several factors complicated troubleshooting in addition to poor visibility:

  • User configurations have changed over time, and the permutations hadn’t been documented as they changed. This resulted in insufficient testing of different configurations.
  • Poor feedback from some of our tools meant we missed the results of different attempts to change or mitigate individual issues, confounding troubleshooting.
  • Documentation for some of our chatops-automated processes lagged behind changes in those processes, leading to confusion. We also question whether chatops will remain a best practice for us, as it deliberately abstracts away the underlying activities and so reduces comprehension of them.
  • As we worked towards solutions, we found that one of the builds for a critical service had been failing silently, which needed to be fixed before we could implement any incident-related changes.

Finally, fatigue wasn’t managed well during a lengthy incident. It was difficult to know when to step back and when to push a little harder, and suitable handovers weren’t possible. Believing we were close to a solution, we delayed making status updates. This was unacceptable, as it left you, our customers, without the information needed to decide whether to continue relying on our services or to make other arrangements, and it compounded the other issues mentioned.

What will we do differently in the future?

  • We’re revisiting our deploy documentation and tooling to emphasise the difference in deploy parameters for different clusters.
  • Visibility into the related services has already been improved.
  • We’re adding people to the team and revisiting our incident processes to ensure no one person is overstretched.
  • We intend to upgrade our docker services for better load handling, and we’re now running the database on a more powerful machine.
  • We’re investigating migration of our swarm services to a managed solution.

Customer Next Steps

If you were affected by this disruption, please reach out to our Director of Ops, Yuga (yuga@metricfire.com), and we’ll talk through your specific needs and concerns.

Incident Timeline

November 19

* 2019-11-19 10:54 UTC: Initial deploy of Grafana 6.4.3

* 2019-11-19 12:01 UTC: First support ticket related to the incident arrives; investigation begins for possible database load issues.

* 2019-11-19 12:38 UTC: We determine not all users are affected, and continue with the database-related investigation.

* 2019-11-19 12:47 UTC: Finding no immediate database issues, we restart the user’s Grafana service, a solution to a known issue. No improvement observed.

* 2019-11-19 12:57 UTC: CPU usage on the DB is observed to have been declining steadily since ~11:00 UTC, an early indication of swarm issues: the rollout has stalled.

* 2019-11-19 13:01 UTC: The second related support request arrives. We notice more accounts are affected after doing spot checks and suspect a more widespread problem.

* 2019-11-19 13:58 UTC: Containers appear to be created correctly, but the "is it alive?" check (grafana_ping) is failing, which makes the rollout appear stalled. Investigation into this error begins.
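
(Aside: grafana_ping is our internal liveness check; its implementation isn't shown here. As a rough, hypothetical stand-in, a check of this kind can be as simple as an HTTP request to Grafana's /api/health endpoint, treating a timeout or non-200 response as "not alive". The hostname below is a placeholder.)

    # Illustrative liveness probe only, not our actual grafana_ping implementation.
    # Fail if Grafana doesn't answer its health endpoint within 5 seconds.
    curl -fsS --max-time 5 http://grafana-container:3000/api/health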

* 2019-11-19 14:29 UTC: We attempt to rebalance containers, as timeouts could be due to a concentration of containers on one node over the others. No improvement is observed.

* 2019-11-19 14:59 UTC: We initiate a rolling reboot of swarm nodes, as the earlier rolling restart of docker services hadn’t improved access. This does ease certain errors, but full access to Grafana hasn’t been restored.

* 2019-11-19 15:00 UTC: Rollback to previous version considered. We determine this would compound database load issues rather than mitigate them, so work continues on understanding what's happened to the swarm.

* 2019-11-19 15:45 UTC: Database failover performed. The upgrade appears to be proceeding at this point, and spot checks of individual users' Grafana instances are working.

* 2019-11-19 17:36 UTC: Grafana rollout paused.

* 2019-11-19 17:50 UTC: Grafana rollout unpaused, but the tooling doesn't respond or give an indication that it hasn't received the command, so we continue to monitor the rollout.

* 2019-11-19 18:15 UTC: As the rollout is monitored, we see that it's still stalled. The interval between container starts is increased, to see whether the service responsible for spinning up containers is the bottleneck.

* 2019-11-19 23:30 UTC: Users' containers are still not being created correctly, and investigation into the service responsible for it continues.

November 20

* 2019-11-20 03:15 UTC: A workaround involving running Grafana locally is tested and documented so that users can access their data while troubleshooting continues.

* 2019-11-20 07:36 UTC: We determine that the grafana_ping (“is it alive?”) check run by the managing service against Grafana containers is failing, pointing us towards a software fix rather than an infrastructure one.

* 2019-11-20 08:10 UTC: Rolling reboot of swarm nodes started.

* 2019-11-20 09:05 UTC: We decide to move the database to a larger machine in order to handle the number of connections required.

* 2019-11-20 10:00 UTC: The rolling reboot of swarm nodes has finished, but the rebooted nodes aren't connecting to the database.

* 2019-11-20 11:14 UTC: Fix for grafana_ping deployed; verification begins, but service creation is much slower than expected.

* 2019-11-20 11:31 UTC: We determine the check is deployed correctly, but a different issue is causing Grafana services to remain unavailable.

* 2019-11-20 13:00 UTC: Investigation begins into service discovery issues with the swarm restart.

* 2019-11-20 14:15 UTC: Logging is added to the main app to understand whether containers are having new authentication issues.

* 2019-11-20 15:17 UTC: Swarm instances and DBs appear to be mostly idle, but the service responsible for creating and pruning containers is overloaded; investigation shifts to understanding why.

* 2019-11-20 15:45 UTC: Container replicas still appear “stuck,” and we consider recreating the docker network.

* 2019-11-20 16:45 UTC: We discover an issue with the new Grafana version that may necessitate a rollback once the swarm is healthy, unrelated to previous software issues we investigated.

* 2019-11-20 17:17 UTC: We discover that the Grafana rollout is still paused, and restart it.

* 2019-11-20 18:48 UTC: Another change to logging is deployed to improve visibility into why containers aren't being created as expected.

* 2019-11-20 23:27 UTC: We discover many orphan processes causing problems on the swarm and start closing them manually. In the meantime, we identify a source of errors: a lack of admin privileges in individual Grafana instances is generating the "this container isn't working" signals.

November 21

* 2019-11-21 00:01 UTC: More logging changes are deployed. As a result, we're able to identify the change in the new version of Grafana that is causing containers to not register as working: the check now requires elevated privileges.

* 2019-11-21 00:15 UTC: Swarm rollout paused, based on the knowledge that containers will never be registered as successful.

* 2019-11-21 00:30 UTC: Rollback to previous version of Grafana initiated.

* 2019-11-21 00:56 UTC: During rollback, we discover that the API change won't result in failing containers for all users. We pause the rollback to spare the swarm and DB, and instead manually change the Grafana authentication config for all affected users. This doesn't solve the problem for everyone, however.

* 2019-11-21 07:54 UTC: We reboot the swarm again, and start building a new one rather than continuing to repair the current one.

* 2019-11-21 11:49 UTC: Swarm build is finished and migration to the new swarm begins.

* 2019-11-21 15:14 UTC: Grafana 6.3.2 rolled out to the new swarm.

* 2019-11-21 16:00 UTC: Grafana 6.3.2 rollout appears successful, and verification begins. Some users need to have their configurations manually changed and their services restarted.

* 2019-11-21 18:00 UTC: The swarm is verified healthy and all users are able to access Grafana.

Posted Nov 29, 2019 - 10:59 UTC

Resolved
We have rebuilt the infrastructure that hosts our grafana services.

Full service has been restored to all customers.

A postmortem for this incident will be published in the coming days.
Posted Nov 21, 2019 - 17:52 UTC
Update
We are continuing to monitor for any further issues.
Posted Nov 21, 2019 - 17:52 UTC
Update
We are continuing to restore access to Grafana, and remaining affected users should see improvement shortly.
Posted Nov 21, 2019 - 15:35 UTC
Update
Some connectivity issues to Grafana persist; however, we're seeing access being restored for some more users.
We know this has had a significant impact to those affected and will be publishing a postmortem analysis of the incident once resolved with more details of what happened.
Posted Nov 21, 2019 - 10:30 UTC
Update
We're continuing to channel all resources to investigate the issue connecting to Grafana for 4% of our users.
Some users have reported success in connecting to Grafana.
We're also looking at the Grafana version upgrade to 6.4.3 that was rolled out, to investigate any correlation between the upgrade and the connectivity issue.
Posted Nov 21, 2019 - 03:12 UTC
Update
We continue to see connectivity issues for 4% of customers.
We have narrowed down the issue to infrastructure and are continuing to investigate.
Customers who need their dashboards urgently can set up a local Grafana with the instructions in our previous update.
We can also export your dashboards in JSON format for you to import into your local Grafana, so you can seamlessly regain access to your original dashboards on Hosted Graphite.
We are also in the process of spinning up independent cloud-based grafana instances as another alternative workaround.
Posted Nov 20, 2019 - 09:19 UTC
Update
We continue to see connectivity issues for 4% of customers.

We have prepared steps for a workaround - you can use a local Grafana to view your data on hostedgraphite.com (a consolidated version of these steps follows the list):

0. Install docker (https://docs.docker.com/install/)
1. Install Grafana. From your terminal, run 'docker run -d --name=grafana -p 3000:3000 grafana/grafana'
2. Point your browser at http://localhost:3000
3. The username and password are both 'admin'. You will be prompted to change the password.
4. Configure your local grafana to point to hostedgraphite.com as datasource (https://www.hostedgraphite.com/docs/dashboards/local-grafana.html)
5. Plot the metrics that you want. To see the list of metrics you have, see https://www.hostedgraphite.com/docs/advanced/delete_metrics.html
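
For convenience, here are the same steps as a single sketch you can paste into a terminal. It assumes Docker is already installed; the data source configuration in step 4 still happens in the Grafana UI, following the linked docs.

    # Start a local Grafana container (same command as step 1).
    docker run -d --name=grafana -p 3000:3000 grafana/grafana

    # Confirm Grafana is up before opening http://localhost:3000 in your browser.
    curl -fsS http://localhost:3000/api/health

    # Then log in as admin/admin (you'll be prompted to change the password) and add
    # hostedgraphite.com as a Graphite data source, following:
    # https://www.hostedgraphite.com/docs/dashboards/local-grafana.html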

We are investigating both the issue and the possibility of exporting your dashboards in a format that you can import into your local Grafana.
Posted Nov 20, 2019 - 03:59 UTC
Monitoring
Our grafana services are recovering.
We continue to see connectivity issues for 4% of customers.
These will continue to recover over the coming hours.
Posted Nov 19, 2019 - 18:48 UTC
Investigating
We're currently investigating issues connecting to Grafana.

Ingestion continues as normal and no data has been lost.
Posted Nov 19, 2019 - 14:43 UTC
This incident affected: Website.