Between 17:09 UTC and 18:03 UTC, a misconfigured node in our HTTP ingestion layer caused approximately 16% of requests sent to our HTTP endpoint to go unprocessed.
Any HTTP request routed by our load balancers to the misconfigured node was not processed.
We use a set of frontend load balancers to direct traffic to the various nodes in our ingestion layer. These load balancers operate on a round robin basis for our HTTP ingestion endpoint.
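To illustrate why a single bad node affects a fixed share of traffic under round robin, here is a minimal simulation. The pool size of six and the node names are assumptions for illustration only; the report does not state how many nodes are in the pool, and six is chosen simply because one node in six is close to the 16% figure above.

```python
from itertools import cycle


def simulate_round_robin(nodes, broken, num_requests):
    """Return the fraction of requests that land on a broken node."""
    rotation = cycle(nodes)  # round robin: visit each node in turn, forever
    failed = sum(1 for _ in range(num_requests) if next(rotation) in broken)
    return failed / num_requests


# Hypothetical pool: six nodes, one of them misconfigured.
nodes = [f"ingest-{i}" for i in range(1, 7)]
fraction = simulate_round_robin(nodes, {"ingest-4"}, 6000)
print(f"{fraction:.1%} of requests failed")
```

With an even rotation, the failure rate is simply the number of broken nodes divided by the pool size, regardless of request volume.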
We recently re-architected our HTTP ingestion layer. The deploy that enabled the cutover to the new set of services in this layer was rolled out at 17:09 UTC. The deploy itself completed successfully, but a single node with an incorrect configuration was not updated as part of it. Part of this change-over was enabling a new service to handle ingestion of HTTP datapoints, as well as a service which takes this data and processes it.
One of the nodes running this data-processing service had a previous version of the service running on it, and configuration management tooling was disabled on this host. As a result, the service on this node was not updated along with the other nodes.
The configuration management tooling had been temporarily disabled before the service was deployed, in order to test a change to the service. This change circumvented our normal change management process. Normally we would be notified and alerted when our configuration management tooling is disabled. In this case, however, the notification was not prominent enough, and the alert threshold was set to too long a time period, which allowed this to pass undetected until we deployed. At the time this change was made, the new HTTP ingestion layer was not yet taking production traffic.
The implication of the incorrect configuration was that this node was unable to successfully process HTTP requests, so any requests load-balanced to this node failed, with the data not being processed. Due to the nature of the load balancing, our canary for HTTP ingestion did not catch this immediately, only dropping when it was load balanced to the affected host. The HTTP canary itself sends datapoints at regular intervals through our HTTP ingestion endpoint to verify that data is being processed correctly. Usually we also have all of our services running local canaries to catch single node failures, but in the case of this incident, the service in question did not yet have local canaries in place.
The health checking in place for this ingestion layer architecture was not robust enough to handle the case where a node was up and accessible but was not processing data correctly. This resulted in a much longer time to alert (and, as a consequence, time to resolution) than this service should have had.
We identified the issue when our HTTP canary (https://blog.hostedgraphite.com/2017/07/06/continuous-self-testing-at-hosted-graphite-why-we-send-external-canaries-every-second/) began dipping after being load balanced to the affected host. The change was then immediately rolled back, switching us back over to the original ingestion layer.
A change that did not follow our change management and review process was manually applied to one host. Our configuration management was disabled on the affected node, and we missed the notification for this. In addition, our alerting threshold was long enough that the changeover happened well before the alert would have fired, so the disabled tooling went undetected. Normally we alert when our configuration management tooling has been disabled for five hours, but in this case the changeover happened before that five-hour period was reached.
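The timing problem with the five-hour threshold can be sketched as follows. The timestamps and the 30-minute comparison threshold are hypothetical: the report does not say when the tooling was disabled, and it does not state what the new threshold will be.

```python
from datetime import datetime, timedelta


def alert_fires_before(disabled_at, threshold, deploy_at):
    """True if the 'config management disabled' alert fires before the deploy."""
    return disabled_at + threshold <= deploy_at


# Hypothetical times, for illustration only.
disabled_at = datetime(2024, 1, 1, 14, 0)
deploy_at = datetime(2024, 1, 1, 17, 9)

# Five-hour threshold: the alert would fire at 19:00, after the deploy.
print(alert_fires_before(disabled_at, timedelta(hours=5), deploy_at))     # False
# An assumed shorter threshold: the alert would fire at 14:30, before the deploy.
print(alert_fires_before(disabled_at, timedelta(minutes=30), deploy_at))  # True
```

Any threshold longer than the gap between disabling the tooling and the next deploy leaves a window in which a stale host can go unnoticed.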
Our health checks (both from the load balancer side, and locally on the nodes themselves) did not account for the case where the service was marked as running, but had issues processing input.
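The gap between "the service is running" and "the service is processing input" can be shown with a small sketch. The `NodeStatus` fields and both check functions are hypothetical names used for illustration, not our actual health check implementation.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    process_running: bool  # the only thing a liveness-style check looks at
    probe_processed: bool  # whether a synthetic datapoint was processed end to end


def shallow_health(status):
    """Liveness only: healthy as long as the service process is up."""
    return status.process_running


def deep_health(status):
    """Healthy only if the service is up and a test datapoint was processed."""
    return status.process_running and status.probe_processed


# A node like the one in this incident: up and reachable, but not processing data.
misconfigured = NodeStatus(process_running=True, probe_processed=False)
print(shallow_health(misconfigured))  # True
print(deep_health(misconfigured))     # False
```

A load balancer relying on the shallow check keeps the node in rotation, while one relying on the deeper check would remove it.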
We will be adding local canary data for the nodes in the HTTP ingestion layer, which would have prevented this incident entirely - we would have been alerted that a node was failing on local canary data before we enabled the changeover to the re-architected HTTP ingestion layer. We will also be adding better health checks to the services themselves, which would not have prevented the incident, but would have minimised the impact of a single misconfigured node by removing it as soon as it was marked as unhealthy.
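As a rough sketch of how local canary results could gate a cutover, the example below checks every node individually instead of relying on whichever node the load balancer happens to pick. The node names and the `stub_local_canary` function are placeholders; a real local canary would send a datapoint directly to the node and confirm it was processed.

```python
def failing_nodes(nodes, run_local_canary):
    """Return the nodes whose local canary did not report success."""
    return [node for node in nodes if not run_local_canary(node)]


def safe_to_cut_over(nodes, run_local_canary):
    """Allow the cutover only if every node passes its local canary."""
    return len(failing_nodes(nodes, run_local_canary)) == 0


# Placeholder canary: pretends one node is running a stale service version.
STALE_NODES = {"processor-3"}


def stub_local_canary(node):
    return node not in STALE_NODES


nodes = ["processor-1", "processor-2", "processor-3"]
print(failing_nodes(nodes, stub_local_canary))     # ['processor-3']
print(safe_to_cut_over(nodes, stub_local_canary))  # False
```

Because each node is tested directly, a single failing node blocks the cutover before it receives any production traffic.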
We are improving our alerting and notifications around cases where our configuration management tooling has been disabled, so that we are alerted to this fact and can respond in a more timely fashion.
Our PRR process will also need to be adjusted to place a greater emphasis on how health checking and alerting should work in cases other than services becoming unavailable/nodes being inaccessible, especially for services that do forms of data processing.
For the moment, our new HTTP Ingestion Layer is not live - during the incident we failed back over to the original implementation, and we intend to implement our future steps and ensure everything required is in place before we enable it again.
17:09 UTC - the HTTP API change is deployed, and we switch over to our new ingestion method. Our canaries are stable and the nodes we deployed to are validated by our configuration management tooling.
17:52 UTC - our HTTP canary is load balanced to the misconfigured ingestion node which begins attempting to process the canary data. The canary data begins to be dropped by the affected node, and this triggers our alerting.
18:03 UTC - we revert the HTTP API change and switch back to the old method of ingestion, thus bypassing the misconfigured node.