From 10:54 UTC on 19 November to 18:00 UTC on 21 November, an estimated 4% of customers were unable to access Grafana as a result of a destabilising deploy and a breaking change in the API we use to determine container health during orchestration. Due to the nature of the instability we were unable to simply roll back the change, which prolonged the incident. Core ingestion services maintained 100% availability; aggregation, alerting and API access were also unaffected.
We run a cluster of Grafana instances for each account on our service, orchestrated by Docker Swarm. Several services interact with the swarm: one that creates and deletes services based on user status, one that handles user creation and login flow, and our chatops bot, which automates some swarm operations.
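For illustration, the core of the service that maps user status onto swarm services can be thought of as a reconciliation loop. This is a hypothetical sketch with made-up names (`reconcile`, the `grafana-<account>` naming scheme), not our actual service code:

```python
# Hypothetical sketch of reconciling active accounts against per-account
# Grafana services on the swarm. All names here are illustrative only.

def reconcile(active_accounts, running_services):
    """Return (to_create, to_delete) so the swarm matches the set of
    active accounts, assuming one Grafana service per account."""
    desired = {f"grafana-{account}" for account in active_accounts}
    current = set(running_services)
    to_create = sorted(desired - current)   # accounts with no service yet
    to_delete = sorted(current - desired)   # services for deactivated accounts
    return to_create, to_delete

# Example: one new account ("globex"), one deactivated account ("initech").
create, delete = reconcile(
    active_accounts=["acme", "globex"],
    running_services=["grafana-acme", "grafana-initech"],
)
```

A loop like this is convenient, but it also means a bad signal (such as a failing health check) can cause continual churn: the reconciler keeps requesting work the swarm can never complete.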
We initiated a deploy to upgrade Grafana for our shared tenancy cluster, having successfully upgraded Grafana in dedicated clusters two weeks prior. The parameters for the deploy resulted in overloading the swarm with new container requests, and users started to lose access to their dashboards.
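In hindsight the failure mode is easy to model: if the deploy requests container replacements faster than the swarm can start them, the backlog grows until the swarm is overloaded. A minimal sketch with illustrative numbers (not our real deploy parameters):

```python
# Illustrative model of rollout pressure: when replacements are requested
# faster than the swarm can start containers, pending work accumulates.

def pending_backlog(total, request_rate, start_rate, minutes):
    """Containers requested but not yet started after `minutes` minutes."""
    requested = min(total, request_rate * minutes)
    started = min(requested, start_rate * minutes)
    return requested - started

# e.g. 1000 containers, 50 requested/min, but the swarm starts only 10/min:
# after 20 minutes, 800 containers are queued and the swarm is saturated.
backlog = pending_backlog(total=1000, request_rate=50, start_rate=10, minutes=20)
```

Capping the request rate at or below the swarm's sustainable start rate keeps the backlog at zero, which is the tuning the deploy parameters missed.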
First, we focused on understanding the impact and improving visibility into the swarm itself and services interacting with it. Attempts to restore the swarm included failing over to a new database and restarting services on the swarm.
Once we'd restarted what we could, we discovered a recent change to one of the services interacting with Grafana that prevented containers from being accessed correctly by the main app. We implemented a fix, which then revealed an incompatibility between the new version of Grafana and our authentication proxy.
By then the swarm still hadn't stabilised, so we built a new swarm and migrated Grafana services over to it. With a healthy swarm, we were able to roll back the Grafana upgrade, restoring service for all users.
What went well?
Everyone pitched in, from support to dev to the SREs responsible for the services. Many of the people involved were relatively new to supporting these services, but that didn’t stop them from diving in, finding what was wrong, and assisting with fixes.
Although visibility started off poor, we were able to add more over the course of the incident to get the right insights. This has resulted in immediate improvements to the logging in these services, which will aid future work.
During a lengthy incident the team prioritised the many issues correctly: restoring swarm health remained most important, even as we discovered bugs and other problems that needed attention. While we did address some issues in a “fix-forward” way, we were able to rebuild the swarm and roll back the upgrade in order to restore access for customers.
What went badly?
Grafana 6 went into production over the summer. As part of that large update, we had intermittent connection problems that were still under investigation in our support queue. After the initial deploy, we therefore treated the first support tickets as instances of that known issue rather than as a result of the deploy. This slowed our first responses and allowed the already unstable swarm to continue to fail. In addition, we have not had strong success tuning alerting and logging for portions of our Docker Swarm infrastructure; the output is sometimes so verbose that it obscures the nature of a problem.
Although the upgraded version of Grafana had been successfully deployed to our smaller clusters, the process for the larger shared cluster is different and had not been sufficiently documented. Moreover, we relied heavily on our chatops bot, which, although extremely convenient for infrastructure management, made it harder to see immediately what the source of the issue was. What we experienced was a seamless deploy as usual, followed by no indication of failure until the swarm was critically overloaded. Additionally, the shared cluster has a greater variety of user and team configurations, not all of which were reflected in the dedicated clusters, so there were many possible error states to investigate.
Several factors beyond poor visibility complicated troubleshooting.
Finally, fatigue wasn’t managed well during a lengthy incident. It was difficult to know when to step back and when to push a little harder, and it wasn’t possible to perform suitable handovers. In the belief that we were close to a solution, we delayed making status updates. This was unacceptable, as it left you, our customers, without the ability to decide whether to continue to invest in, or vacate, our services. This compounded the other issues mentioned.
What will we do differently in the future?
Customer Next Steps
If you were affected by this disruption, please reach out to our Director of Ops, Yuga (firstname.lastname@example.org), and we’ll discuss your specific needs and concerns.
* 2019-11-19 10:54 UTC: Initial deploy of Grafana 6.4.3
* 2019-11-19 12:01 UTC: First support ticket related to the incident arrives; investigation begins for possible database load issues.
* 2019-11-19 12:38 UTC: We determine not all users are affected, and continue with the database-related investigation.
* 2019-11-19 12:47 UTC: Finding no immediate database issues, we restart the user’s Grafana service, a solution to a known issue. No improvement observed.
* 2019-11-19 12:57 UTC: CPU usage on the DB is observed to have been declining steadily since ~11:00 UTC, an early indication of swarm issues: the rollout has stalled.
* 2019-11-19 13:01 UTC: The second related support request arrives. We notice more accounts are affected after doing spot checks and suspect a more widespread problem.
* 2019-11-19 13:58 UTC: Containers appear to be created correctly but the check for "is it alive?" (grafana_ping) is failing, resulting in the rollout appearing to be stalled. Investigation into this error begins.
* 2019-11-19 14:29 UTC: We attempt to rebalance containers, as timeouts could be due to a concentration of containers on one node over the others. No improvement is observed.
* 2019-11-19 14:59 UTC: We initiate a rolling reboot of swarm nodes, as the earlier rolling restart of docker services hadn’t improved access. This does ease certain errors, but full access to Grafana hasn’t been restored.
* 2019-11-19 15:00 UTC: Rollback to previous version considered. We determine this would compound database load issues rather than mitigate them, so work continues on understanding what's happened to the swarm.
* 2019-11-19 15:45 UTC: Database failover performed. The upgrade appears to be proceeding at this point, and spot checks of individual users' Grafana instances are working.
* 2019-11-19 17:36 UTC: Grafana rollout paused.
* 2019-11-19 17:50 UTC: Grafana rollout unpaused, but the tooling neither responds nor indicates whether it received the command, so we continue to monitor the rollout.
* 2019-11-19 18:15 UTC: As rollout is monitored, we see that it's still stalled. The interval for container starting is increased, to see if the service responsible for spinning up containers is responsible.
* 2019-11-19 23:30 UTC: Users' containers are still not being created correctly, and investigation into the service responsible for it continues.
* 2019-11-20 03:15 UTC: A workaround for users involving running grafana locally is tested and documented so that users are able to access their data while troubleshooting continues.
* 2019-11-20 07:36 UTC: We determine that the grafana_ping (“is it alive?”) check performed by the managing service against Grafana containers is failing, pointing us towards a software cause rather than an infrastructure one.
* 2019-11-20 08:10 UTC: Rolling reboot of swarm nodes started.
* 2019-11-20 09:05 UTC: We decide to move the database to a larger machine to handle the number of connections required.
* 2019-11-20 10:00 UTC: The rolling reboot of swarm nodes has finished, but the rebooted nodes aren't connecting to the database.
* 2019-11-20 11:14 UTC: Fix for grafana_ping deployed, verification begins but service creation is much slower than expected.
* 2019-11-20 11:31 UTC: We determine the check is deployed correctly, but a different issue is causing Grafana services to remain unavailable.
* 2019-11-20 13:00 UTC: Investigation begins into service discovery issues with the swarm restart.
* 2019-11-20 14:15 UTC: Logging is added to the main app to understand whether containers are having new authentication issues.
* 2019-11-20 15:17 UTC: Swarm instances and DBs appear to be mostly idle, but the service responsible for creating and pruning containers is overloaded; investigation steers to try and understand why.
* 2019-11-20 15:45 UTC: Container replicas still appear “stuck,” and we consider recreating the docker network.
* 2019-11-20 16:45 UTC: We discover an issue with the new Grafana version that may necessitate a rollback once the swarm is healthy, unrelated to previous software issues we investigated.
* 2019-11-20 17:17 UTC: We discover that the Grafana rollout is still paused, and restart it.
* 2019-11-20 18:48 UTC: Another change to logging is deployed to improve visibility into why containers aren't being created as expected.
* 2019-11-20 23:27 UTC: We discover many orphan processes causing problems on the swarm and start closing them manually. In the meantime, we discover a source of errors: a lack of admin privileges in individual Grafana instances is producing the "this container isn't working" signals.
* 2019-11-21 00:01 UTC: More logging changes are deployed. As a result, we're able to identify the change in the new version of Grafana that is causing containers to not register as working: the check now requires elevated privileges.
* 2019-11-21 00:15 UTC: Swarm rollout paused, based on the knowledge that containers will never be registered as successful.
* 2019-11-21 00:30 UTC: Rollback to previous version of Grafana initiated.
* 2019-11-21 00:56 UTC: During rollback, we discover that the API change doesn't cause failing containers for all users. We pause the rollback to spare the swarm and DB, and instead manually change the Grafana authentication config for all affected users. This doesn't solve the problem for everyone, however.
* 2019-11-21 07:54 UTC: We reboot the swarm again, and start building a new one rather than continuing to repair the current one.
* 2019-11-21 11:49 UTC: Swarm build is finished and migration to the new swarm begins.
* 2019-11-21 15:14 UTC: Grafana 6.3.2 rolled out to the new swarm.
* 2019-11-21 16:00 UTC: Grafana 6.3.2 rollout appears successful, and verification begins. Some users need to have their configurations manually changed and their services restarted.
* 2019-11-21 18:00 UTC: The swarm is verified healthy and all users are able to access Grafana.
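The root cause identified at 00:01 UTC on 21 November can be sketched in a few lines. This is a hypothetical model of the grafana_ping failure mode, not the actual check: the upgraded Grafana responds to the health check in a way that requires elevated privileges, so a healthy container is registered as dead and the rollout can never make progress.

```python
# Hypothetical model of the grafana_ping failure mode. The new Grafana
# version gates the checked endpoint behind admin privileges, so the
# orchestrator's unprivileged check fails even for healthy containers.

def grafana_ping(endpoint_status, caller_is_admin):
    """Return True if the container should be registered as healthy."""
    if endpoint_status == "requires_admin" and not caller_is_admin:
        return False  # the container is fine, but the check itself is denied
    return endpoint_status in ("ok", "requires_admin")

# Old version: the endpoint is open to the unprivileged check.
old = grafana_ping("ok", caller_is_admin=False)              # True
# New version: same healthy container, but the check lacks privileges.
new = grafana_ping("requires_admin", caller_is_admin=False)  # False
```

With every new container reported as dead, the reconciling service keeps retrying forever, which matches the stalled rollout and the overloaded container-management service seen throughout the timeline.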