Our ingestion layer relies on health checking to route traffic to our aggregation servers. Concurrent limiting of a user's traffic is applied per server, based on the number of active servers, since traffic is distributed evenly between them.
We perform health checking at each layer of our ingestion using external canaries ( https://blog.hostedgraphite.com/2017/07/06/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second/ ), which we use to identify healthy hosts. The health checking service has different modes of operation that vary in the strictness of the check performed, from requiring 5 minutes of healthy canary data to simply ensuring that the service is up and listening on a port.
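The two ends of that strictness spectrum can be sketched roughly as follows. This is a minimal illustration, not our actual implementation; the function names, the one-canary-per-second assumption, and the tolerance threshold are all hypothetical:

```python
# Hypothetical sketch of health-check modes at two levels of strictness.
# Names and thresholds are illustrative only.
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Loosest check: the service is up and listening on its port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def canaries_healthy(canary_timestamps: list[float], window: float = 300.0) -> bool:
    """Strictest check: healthy canary data for the last 5 minutes.

    Assumes roughly one canary per second and tolerates a small number
    of gaps (the 0.99 factor is an illustrative tolerance).
    """
    now = time.time()
    recent = [t for t in canary_timestamps if now - t <= window]
    return len(recent) >= window * 0.99
```

A host passing `port_open` but failing `canaries_healthy` is exactly the kind of case where the choice of mode changes the healthy server count.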
A change to our health checking service caused a drop in the number of healthy servers while the active server count was slower to update. As a result, the limits applied per server were incorrect, causing servers to drop datapoints due to concurrent limiting.
We rolled out a new version of our health checking service at 13:30 UTC to monitor the health of a sub-layer on our aggregation servers. This was intended to be a no-op; however, a config error caused the health checks to fail, leading to a drop in the number of healthy servers. We noticed this immediately and rolled back, which prevented a further drop in healthy servers. The healthy server count failed to fully recover, however, and suspecting capacity issues we switched to a less strict mode of health checking at 14:22 UTC and increased the number of servers.
Our ingestion service is built to handle the loss of aggregation servers with an automatic replay service, which worked as intended to replay any dropped data. At 15:15 UTC we received support tickets suggesting that we might have dropped some data. A Status Page incident was opened at this point and we started to reassess the scope of impact.
We noticed that our external canaries hadn't recovered, indicating that the datapoints weren't fully replayed, and updated the status page with the impact. The incident was resolved at 16:35 UTC after we observed that the canaries and the count of healthy servers had been stable since 15:32 UTC. We also started investigating the disparity between replayed datapoints and the health of the canaries.
We identified that the drop in healthy server count, combined with the slower-updating active server count, meant that the effective limit per machine was lower than intended, which caused servers to incorrectly drop datapoints due to concurrent limiting. Our replay service was also actively trying to replay data during this period, which would have further exacerbated the issue.
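The arithmetic behind that failure mode can be sketched as follows. This is a minimal illustration under the assumption that each server enforces an equal share of a user's global limit; the names and numbers are hypothetical, not our actual configuration:

```python
# Illustrative sketch of how a stale active-server count skews per-server
# limits. All names and numbers are hypothetical.

def per_server_limit(global_limit: int, active_servers: int) -> int:
    """Each server enforces an equal share of the user's global limit."""
    return global_limit // active_servers


GLOBAL_LIMIT = 1000   # hypothetical per-user concurrent limit
active_count = 10     # slow-to-update view used to compute the limits
healthy_count = 6     # servers actually left to receive all the traffic

limit = per_server_limit(GLOBAL_LIMIT, active_count)   # 100 per server
effective_capacity = limit * healthy_count             # 600, not 1000

# Traffic is spread over only the healthy servers, so each one sees
# ~GLOBAL_LIMIT / healthy_count concurrent streams but allows only `limit`:
per_healthy_load = GLOBAL_LIMIT / healthy_count        # ~166 per server
assert per_healthy_load > limit  # excess datapoints get dropped
```

The replay traffic arriving on top of `per_healthy_load` during recovery pushed each healthy server even further over its share.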
We will ensure that the active server count used to determine the limits is obtained from our service discovery, similar to our ingestion service, which will prevent such disparities from causing the wrong limits to be applied.
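In outline, the fix means the limiter derives its count from the same service-discovery view the router uses, rather than from a separately-maintained number that can lag. The sketch below is purely illustrative; the registry shape and function names are hypothetical:

```python
# Hypothetical sketch of the fix: the per-server limit is derived from the
# service-discovery view of active servers, so it cannot disagree with the
# set of servers actually receiving traffic. Names are illustrative.

def discover_active_servers(registry: dict[str, dict]) -> list[str]:
    """Return the servers that service discovery currently reports active."""
    return [name for name, record in registry.items() if record.get("active")]


def per_server_limit(global_limit: int, registry: dict[str, dict]) -> int:
    """Split the global limit across the servers discovery says are active."""
    active = discover_active_servers(registry)
    return global_limit // max(len(active), 1)


registry = {
    "agg-1": {"active": True},
    "agg-2": {"active": True},
    "agg-3": {"active": False},  # just removed by health checking
}
# The limit reflects the current two-server view: 1000 // 2 == 500.
```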
We will also improve our monitoring to make it easier to detect impact of this kind in the future.