Intermittent timeouts on our website and HTTP ingestion endpoints
Incident Report for Hosted Graphite
Postmortem

Background

Some AWS regions (N. Virginia in particular) experienced connectivity issues on Friday night, resulting in higher latencies for some of our users hosted in AWS. Overlapping with this, we experienced higher latency and timeouts on our own HTTP APIs and website.

Our services are not hosted in AWS, but our DNS services and some of the health checking of our frontend HTTP load balancers are.

What happened?

At 23:28 UTC we get the first report (via our monitoring tools) that both our render API and our website are intermittently timing out. We also observe reduced ingestion levels, and our initial investigation suggests this might be an internal network issue.

After further investigation, and after checking both our internal and external canaries, we notice that ingestion is affected from certain AWS regions, but that ingestion through our HTTP API is affected from every location we check. Some of our own internal (non-production-facing) services are starting to time out, and we initially take this as a sign that there is a network connectivity issue on our end (these services were hosted in AWS and were affected by the outage, which we were not yet aware of).

At around 00:40 UTC we give up on the theory that there's a connectivity issue on our side, and realise that the internal services that are having issues have dependencies on AWS. We confirm that AWS are reporting connectivity issues in N. Virginia. This explains the reduced ingestion rates, as some of our users in AWS would have been unable to send us data during the outage, but it does not explain the website timeouts, so we continue investigating.

We turn our attention to some of our frontend health checking, which is based in AWS. We observe that the checks are failing at random, but given the ongoing incident with AWS we're unsure whether the problem lies with the checks themselves or with our load balancers. As a precaution, we disable the checks at 01:00 UTC.

We observe our traffic rates after this change, but fail to see more than a marginal improvement. We continue investigating and find in our logs that our load balancers are complaining about SSL handshake failures for IP addresses outside of the affected AWS regions. We test locally and confirm that the SSL handshake takes unacceptably long regardless of which location we try it from. We focus our efforts on troubleshooting the slow SSL handshakes on our load balancers.

At around 01:47 UTC we see our traffic levels go back to normal and our website is no longer timing out.

After confirming that everything continues to be stable, we re-enable our health checks at 02:20 UTC.

What actually happened?

After the incident was resolved, we put together all the available information and managed to paint a clearer picture of what actually happened.

At approximately 23:20 UTC, connectivity issues start in certain AWS regions, and one of the observed symptoms is increased latency when making HTTP requests.

This meant that, for clients connecting from AWS, the average time to send the initial HTTP request to our load balancers went from 200 ms before the incident to 5 seconds (with a maximum reported time measured in minutes). Clients from other providers (except, interestingly enough, Comcast) didn't see the time to make a full request change at all during the incident.

So if only AWS IP addresses were reporting higher times, why was the website intermittently unavailable for everybody? Because the slow AWS requests acted as a DoS (Denial of Service) of sorts against our load balancers: the slow requests piled up and hogged all the available connections, so new connections had to wait for a slot to free up, suffering higher latency (and even timing out) as a consequence of a connectivity issue that affected neither those clients nor our systems directly. The sketch below illustrates the mechanism.
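
To make this concrete, here is a minimal sketch in Python (not our load balancer, and with purely illustrative numbers) that simulates a small pool of connection slots: a burst of slow clients occupies every slot, and a perfectly healthy client ends up waiting roughly the full duration of a slow request before it is served.

    import concurrent.futures
    import time

    CONNECTION_SLOTS = 10        # stand-in for the load balancer's connection limit
    FAST_REQUEST_SECONDS = 0.2   # typical pre-incident request time
    SLOW_REQUEST_SECONDS = 5.0   # requests from the affected AWS regions

    def handle_request(duration):
        """Pretend to serve a request by holding a connection slot for `duration` seconds."""
        time.sleep(duration)

    pool = concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTION_SLOTS)

    # A burst of slow requests arrives first and occupies every slot...
    for _ in range(CONNECTION_SLOTS):
        pool.submit(handle_request, SLOW_REQUEST_SECONDS)

    # ...so a fast, unaffected client still has to queue for a free slot.
    start = time.monotonic()
    pool.submit(handle_request, FAST_REQUEST_SECONDS).result()
    print(f"fast request took {time.monotonic() - start:.1f}s")  # ~5.2s: queueing + service

    pool.shutdown()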

What are we going to do in the future?

This incident has been a lesson in what can happen when enough of the Internet is slow, and in how there are hidden, indirect dependencies everywhere: issues in a single AWS region (N. Virginia) can end up affecting our users in Europe, for example.

One of the things we discovered is that we were missing some key parts of our load balancer configuration that would have helped protect us from this particular scenario (and others like it). As such, we have already introduced changes to our load balancer configuration. Among other changes, we're increasing the number of active connections our load balancers will accept and enforcing stricter timeouts for receiving a full HTTP request from a client; a sketch of how these two settings work together follows below. These settings can easily be tweaked by an operator during an incident to help us recover faster.
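
As an illustration of how those two settings interact, here is a minimal sketch of a toy HTTP frontend in Python (our load balancers are existing software configured through their own settings, not code like this, and the values below are illustrative): a semaphore caps the number of connections held at once, and a per-connection timeout drops any client that fails to deliver a complete request in time.

    import asyncio

    MAX_CONNECTIONS = 1000   # illustrative cap on concurrently held connections
    REQUEST_TIMEOUT = 10.0   # illustrative limit (seconds) to receive the full request headers

    async def main():
        connection_slots = asyncio.Semaphore(MAX_CONNECTIONS)

        async def handle_client(reader, writer):
            async with connection_slots:   # never hold more than MAX_CONNECTIONS at once
                try:
                    # A client that trickles its request in (like the slow AWS
                    # clients during the incident) is cut off after REQUEST_TIMEOUT
                    # rather than hogging a slot indefinitely.
                    await asyncio.wait_for(reader.readuntil(b"\r\n\r\n"),
                                           timeout=REQUEST_TIMEOUT)
                    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
                except (asyncio.TimeoutError, asyncio.IncompleteReadError):
                    writer.write(b"HTTP/1.1 408 Request Timeout\r\n\r\n")
                finally:
                    await writer.drain()
                    writer.close()

        server = await asyncio.start_server(handle_client, "0.0.0.0", 8080)
        async with server:
            await server.serve_forever()

    if __name__ == "__main__":
        asyncio.run(main())

The important property is that a slow client's slot is reclaimed once the timeout fires, so a flood of slow requests can no longer starve everyone else indefinitely.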

We are happy that we had all the information necessary to correctly diagnose the cause of this incident (even to the point where we can study how latency evolved across providers), but it wasn't until the incident was over that we were able to put all the pieces together, and we believe we can do better than that. We are making sure that we have increased visibility into our load balancers, particularly into any latency increases affecting one particular provider or region. We believe that having easier access to this information would have allowed us to correctly diagnose and mitigate this incident sooner.

Posted Oct 15, 2017 - 09:49 UTC

Resolved
AWS has declared this incident resolved, and we can confirm that, as of 01:46 UTC, traffic rates are returning to previous levels and latency when accessing our website and render APIs has also decreased back to normal. SSL handshakes to our website were affected from 23:20 UTC to 01:46 UTC.

Despite the fact that this connectivity issue originated within AWS, we can confirm that traffic originating from outside of AWS also experienced slow SSL handshakes for the duration of this incident. We will be investigating the reason for this, but we did observe an increase in established connections to our load balancers during the incident, which we believe was due to slower connections from the affected AWS regions. This increase in established connections might have been big enough to fill the internal queues of our load balancers, resulting in elevated latency for all SSL operations.
Posted Oct 14, 2017 - 02:13 UTC
Identified
AWS is reporting connectivity issues in one of their regions (N. Virginia): https://status.aws.amazon.com/

Some of our services rely on Route 53 health checking to ensure availability in case of a node or service failure, but the health checking itself seems to have been affected by this outage, intermittently marking some of our load balancers as down and resulting in timeouts.

As a temporary workaround, we have disabled the Route 53 health checking and will manually monitor the state of our load balancers to see if the traffic rates recover.
Posted Oct 14, 2017 - 01:08 UTC
Investigating
We're currently investigating intermittent timeouts accessing our website and our HTTP ingestion endpoints.
Posted Oct 14, 2017 - 00:41 UTC