Our ingestion layer relies on health checking to route traffic to our aggregation servers. Concurrent limiting of a user's traffic is applied per server, based on the number of active servers, since traffic is distributed evenly between them.
We perform health checking at each layer of our ingestion using external canaries ( https://blog.hostedgraphite.com/2017/07/06/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second/ ), which we use to identify healthy hosts. The health checking service has different modes of operation that vary in the strictness of the check performed, from requiring 5 minutes of healthy canary data to simply ensuring that the service is up and listening on a port.
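The two ends of that strictness spectrum can be sketched roughly as follows. This is a minimal illustration, not our actual implementation; the function names, the one-canary-per-second assumption, and the tolerance threshold are all hypothetical:

```python
# Hypothetical sketch of health-check modes at two levels of strictness.
# Names and thresholds are illustrative only.
import socket
import time


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Loosest check: the service is up and listening on its port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def canaries_healthy(canary_timestamps: list[float], window: float = 300.0) -> bool:
    """Strictest check: healthy canary data for the last 5 minutes.

    Assumes roughly one canary per second and tolerates a small number
    of gaps (the 0.99 factor is an illustrative tolerance).
    """
    now = time.time()
    recent = [t for t in canary_timestamps if now - t <= window]
    return len(recent) >= window * 0.99
```

A host passing `port_open` but failing `canaries_healthy` is exactly the kind of case where the choice of mode changes the healthy server count.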
A change to our health checking service caused a drop in the number of healthy servers while the active server count was slower to update. As a result, the limits applied per server were incorrect, causing servers to drop datapoints due to concurrent limiting.
We rolled out a new version of our health checking service at 13:30 UTC to monitor the health of a sub-layer on our aggregation servers. This was intended to be a no-op; however, a config error caused the health checks to fail, leading to a drop in the number of healthy servers. We noticed this immediately and rolled back, which prevented a further drop in healthy servers. The healthy server count failed to fully recover, however, and suspecting capacity issues we switched to a less strict mode of health checking at 14:22 UTC and increased the number of servers.
Our ingestion service is built to handle the loss of aggregation servers with an automatic replay service, which worked as intended to replay any dropped data. At 15:15 UTC we received support tickets suggesting that we might have dropped some data. A Status Page incident was opened at this point and we started to reassess the scope of impact.
We noticed that our external canaries hadn't recovered, indicating that the datapoints weren't fully replayed, and updated the status page with the impact. The incident was resolved at 16:35 UTC after we observed that the canaries and the count of healthy servers had been stable since 15:32 UTC. We also started investigating the disparity between replayed datapoints and the health of the canaries.
We identified that the drop in healthy server count, combined with the slower-updating active server count, meant that the effective limit per machine was lower than intended, which caused servers to incorrectly drop datapoints due to concurrent limiting. Our replay service was also actively trying to replay data during this period, which would have further exacerbated the issue.
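The arithmetic behind that failure mode can be sketched as follows. This is a minimal illustration under the assumption that each server enforces an equal share of a user's global limit; the names and numbers are hypothetical, not our actual configuration:

```python
# Illustrative sketch of how a stale active-server count skews per-server
# limits. All names and numbers are hypothetical.

def per_server_limit(global_limit: int, active_servers: int) -> int:
    """Each server enforces an equal share of the user's global limit."""
    return global_limit // active_servers


GLOBAL_LIMIT = 1000   # hypothetical per-user concurrent limit
active_count = 10     # slow-to-update view used to compute the limits
healthy_count = 6     # servers actually left to receive all the traffic

limit = per_server_limit(GLOBAL_LIMIT, active_count)   # 100 per server
effective_capacity = limit * healthy_count             # 600, not 1000

# Traffic is spread over only the healthy servers, so each one sees
# ~GLOBAL_LIMIT / healthy_count concurrent streams but allows only `limit`:
per_healthy_load = GLOBAL_LIMIT / healthy_count        # ~166 per server
assert per_healthy_load > limit  # excess datapoints get dropped
```

The replay traffic arriving on top of `per_healthy_load` during recovery pushed each healthy server even further over its share.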
We will ensure that the active server count used to determine the limits is obtained from our service discovery, similar to our ingestion service, which will prevent such disparities from causing the wrong limits to be applied.
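In outline, the fix means the limiter derives its count from the same service-discovery view the router uses, rather than from a separately-maintained number that can lag. The sketch below is purely illustrative; the registry shape and function names are hypothetical:

```python
# Hypothetical sketch of the fix: the per-server limit is derived from the
# service-discovery view of active servers, so it cannot disagree with the
# set of servers actually receiving traffic. Names are illustrative.

def discover_active_servers(registry: dict[str, dict]) -> list[str]:
    """Return the servers that service discovery currently reports active."""
    return [name for name, record in registry.items() if record.get("active")]


def per_server_limit(global_limit: int, registry: dict[str, dict]) -> int:
    """Split the global limit across the servers discovery says are active."""
    active = discover_active_servers(registry)
    return global_limit // max(len(active), 1)


registry = {
    "agg-1": {"active": True},
    "agg-2": {"active": True},
    "agg-3": {"active": False},  # just removed by health checking
}
# The limit reflects the current two-server view: 1000 // 2 == 500.
```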
We will also improve our monitoring to make it easier to detect impact of this kind in the future.