Some AWS regions (N. Virginia in particular) experienced connectivity issues on Friday night, resulting in higher latencies for some of our users hosted in AWS. Overlapping with this, we experienced higher latency and timeouts to our HTTP APIs and website.
Our services are not hosted in AWS, but our DNS services and some of the health checking of our frontend HTTP load balancers are.
At 23:28 UTC we get the first report (via our monitoring tools) that both our render API and our website are intermittently timing out. We also observe reduced ingestion levels, and our initial investigation suggests this might be an internal network issue.
After further investigation, and after checking both our internal and external canaries, we notice that ingestion is affected in certain AWS regions, but that ingestion through our HTTP API is affected from every location we check. Some of our own internal services (non-production-facing) are starting to time out, and we initially take this as a sign that this is a network connectivity issue on our end (these services were hosted in AWS and were affected by their outage, of which we were not yet aware).
At around 00:40 UTC we give up on the theory that there's a connectivity issue on our side, and realise that the internal services having issues have dependencies on AWS. We confirm that AWS are reporting connectivity issues in N. Virginia. This explains the reduced ingestion rates, as some of our users in AWS would have been unable to send us data during the outage, but it does not explain the website timeouts, so we continue investigating.
We turn our attention to some of our frontend health checking, which is based in AWS. We observe that the checks are randomly failing, but given the ongoing incident with AWS we're unsure if the problem lies with the check or with our load balancers. As a precaution, we proceed to disable the checks at 01:00 UTC.
We observe our traffic rates after this change, but fail to see more than a marginal improvement. We continue investigating and find in our logs that our load balancers are complaining about SSL handshake failures for IP addresses outside of the affected AWS regions. We test locally and confirm that the SSL handshake takes unacceptably long regardless of the location we try it from. We focus our efforts on troubleshooting the slow SSL handshakes on our load balancers.
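For illustration, here is a minimal sketch of the kind of check we ran locally, timing the TCP connect and the TLS handshake separately. It is not our actual tooling, and the hostname is a placeholder.

```python
# Minimal sketch: time the TCP connect and the TLS handshake separately.
# "example.com" is a placeholder, not our real endpoint.
import socket
import ssl
import time

def time_tls_handshake(host: str, port: int = 443, timeout: float = 30.0):
    context = ssl.create_default_context()

    start = time.monotonic()
    sock = socket.create_connection((host, port), timeout=timeout)
    connected = time.monotonic()

    try:
        # wrap_socket performs the TLS handshake on an already-connected socket
        tls_sock = context.wrap_socket(sock, server_hostname=host)
        handshake_done = time.monotonic()
        tls_sock.close()
    finally:
        sock.close()

    return {
        "tcp_connect_s": connected - start,
        "tls_handshake_s": handshake_done - connected,
    }

if __name__ == "__main__":
    print(time_tls_handshake("example.com"))
```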
At around 01:47 UTC we see our traffic levels go back to normal and our website is no longer timing out.
After confirming that everything continues to be stable, we re-enable our health checks at 02:20 UTC.
After the incident was resolved, we put together all the available information and managed to paint a clearer picture of what actually happened.
At approximately 23:20 UTC, connectivity issues start in certain AWS regions, and one of the observed symptoms is increased latency when making HTTP requests.
This meant that, for clients coming from AWS, the average time to send the initial HTTP request to our load balancers went from 200 ms before the incident, to 5 seconds (with a maximum reported time of minutes). Other providers (except, interestingly enough, Comcast) didn't see the time to make a full request change at all during the incident.
So if only AWS IP addresses were reporting higher times, why was the website intermittently unavailable for everybody? Because the slow AWS requests acted as a DoS (Denial of Service) of sorts against our load balancers: the slow requests started to pile up and hog all the available connections, so new connections had to wait for a slot to open up, suffering higher latency (and even timing out) as a consequence of a connectivity issue that affected neither those clients nor our systems.
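To make that mechanism concrete, here is a small, simplified simulation of a load balancer with a fixed number of connection slots serving a mix of fast and slow clients. All of the numbers are made up for illustration; they are not our real traffic figures or configuration.

```python
# Simplified simulation: a load balancer with a fixed number of connection
# slots serving a mix of fast and slow clients. All numbers are invented
# for illustration only.
import heapq
import random

SLOTS = 100              # concurrent connections the load balancer accepts
ARRIVALS = 5000          # total requests to simulate
ARRIVAL_INTERVAL = 0.01  # one new request every 10 ms on average
FAST_S = 0.2             # a healthy request occupies a slot for ~200 ms
SLOW_S = 5.0             # a request from the affected region takes ~5 s
SLOW_FRACTION = 0.2      # share of traffic coming from the affected region

def simulate(slow_fraction: float) -> float:
    free_at = [0.0] * SLOTS   # time at which each slot becomes free
    heapq.heapify(free_at)
    total_wait = 0.0
    t = 0.0
    for _ in range(ARRIVALS):
        t += random.expovariate(1.0 / ARRIVAL_INTERVAL)
        service = SLOW_S if random.random() < slow_fraction else FAST_S
        slot_free = heapq.heappop(free_at)
        start = max(t, slot_free)        # wait if every slot is still busy
        total_wait += start - t
        heapq.heappush(free_at, start + service)
    return total_wait / ARRIVALS

if __name__ == "__main__":
    random.seed(42)
    print(f"avg wait, healthy traffic: {simulate(0.0) * 1000:.1f} ms")
    print(f"avg wait, 20% slow clients: {simulate(SLOW_FRACTION) * 1000:.1f} ms")
```

With only fast requests the slots are mostly idle and nobody waits; once a minority of requests start holding their slot for seconds, every client, regardless of where it comes from, ends up queueing behind them.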
This incident has been a lesson in what can happen when enough of the Internet is slow, and in how there are indirect hidden dependencies everywhere, such that issues in a single AWS region like N. Virginia can end up affecting our users in Europe, for example.
One of the things we discovered is that we were missing some key parts in our load balancer configuration that would have helped protect us from this particular scenario (and others like it). As such, we have already introduced changes to our load balancer configuration. Among other changes, we're increasing the number of active connections our load balancers will accept and enforcing stricter timeouts to receive a full HTTP request from a client. These settings can easily be tweaked by an operator during an incident to help us recover faster.
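Our load balancers are existing software configured through settings rather than code, so the following is only a rough sketch of the principle behind the stricter request timeout: refuse to wait indefinitely for a client to finish sending its request. The timeout value is an assumption, not our production setting.

```python
# Illustration only: bound how long a client may take to send its full
# HTTP request headers, so a slow client cannot hog a connection slot.
import asyncio

HEADER_TIMEOUT_S = 5.0   # assumed value, not our production setting

async def handle_client(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    try:
        # A well-behaved client sends its headers quickly; a client stuck
        # behind a degraded network may trickle them in for minutes.
        await asyncio.wait_for(reader.readuntil(b"\r\n\r\n"),
                               timeout=HEADER_TIMEOUT_S)
        writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    except (asyncio.TimeoutError, asyncio.IncompleteReadError):
        # Give up on the slow client instead of letting it occupy a slot.
        writer.write(b"HTTP/1.1 408 Request Timeout\r\n\r\n")
    finally:
        await writer.drain()
        writer.close()

async def main() -> None:
    server = await asyncio.start_server(handle_client, "127.0.0.1", 8080)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```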
We are happy that we had all the information necessary to correctly diagnose the cause of this incident (even to the point where we can study the evolution of latency across providers), but it wasn't until the incident was over that we were able to put all the pieces together, and we believe we can do better than that. We are making sure that we have increased visibility into our load balancers, particularly into any latency increases that might be affecting one particular provider or region. We believe that having easier access to this information would have allowed us to correctly diagnose and mitigate this incident sooner.
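As a hypothetical sketch of the kind of visibility we mean, request times taken from load balancer logs could be bucketed by client network, so that a latency increase confined to one provider or region stands out immediately. The log fields and the provider lookup below are assumptions, not our actual pipeline.

```python
# Hypothetical sketch: group request times by client network so that a
# latency increase in a single provider/region is easy to spot.
# The provider lookup and input format are assumptions for illustration.
import statistics
from collections import defaultdict

def provider_of(ip: str) -> str:
    # Placeholder: in practice this would be an ASN/GeoIP lookup.
    return "aws" if ip.startswith("54.") else "other"

def latency_by_provider(records):
    """records: iterable of (client_ip, request_time_seconds) tuples."""
    buckets = defaultdict(list)
    for ip, request_time in records:
        buckets[provider_of(ip)].append(request_time)
    return {
        provider: {
            "count": len(times),
            "p50": statistics.median(times),
            "p99": statistics.quantiles(times, n=100)[98],
        }
        for provider, times in buckets.items()
    }

if __name__ == "__main__":
    sample = [("54.1.2.3", 5.2), ("54.4.5.6", 4.8), ("54.7.8.9", 6.1),
              ("8.8.4.4", 0.20), ("1.2.3.4", 0.21), ("9.9.9.9", 0.19)]
    print(latency_by_provider(sample))
```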