We’ve had several incidents where our render API would become unavailable and return 503 responses. This was caused by all our webservers responsible for render queries being marked as down by our load balancers, and the issues were usually resolved by restarting our webservers.
This postmortem references the following incidents:
At Hosted Graphite, we have several load-balancers running HAProxy which sit in front of our various backend layers. Our webservers run lighttpd, which currently has two applications running, which we’ll call “hg” and “g”:
g: Our modified version of graphite-web, with changes to be able to communicate with our backend and other improvements such as data views. This is the application that will serve render requests.
hg: Used for almost everything else, including user authentication.
The “hg” application doesn’t actually handle render requests, but it’s still needed when a render request comes in so it can perform authentication and then proxy the request to the “g” app, which completes the request. This means that if either of the two applications is unavailable, render requests won’t work.
The process for rendering PNGs (used for sharing grafana panels and on our alerting product) goes through our “hg” app though, and spins a phantomjs process to do the rendering and serve it to the user. This process runs with a small timeout to avoid PNG renders affecting our backend.
All of the incidents listed above falls into one of two categories, which are described here:
Over time we will see a single webserver steadily increasing the amount of connections it has in CLOSE_WAIT state. When the number of connections in CLOSE_WAIT hits a certain level lighttpd will refuse to take any new connections, resulting in the load balancer marking the webserver as down.
Once the first webserver is down, this behaviour repeats for the rest of webservers under the load balancer until there are no available servers left, at which point the render API becomes fully unavailable.
Restarting lighttpd in the affected servers restores the service.
We focused our investigation for this incident in the steady increase of connections in CLOSE_WAIT state. A connection in CLOSE_WAIT state is a connection that has already been closed on the other end, and is just waiting for us to close to socket on our end. Our logs indicate that this behaviour matches with an increase of clients making a request and then timing out, which would imply that we weren’t handling the case of all those clients timing out while waiting for a response properly.
After some digging we found that the version of lighttpd that we were running was affected by a bug that prevented it from properly closing connections in some scenarios, which can lead to a leak of unclosed connections. At some point you’ll hit the maximum number of connections that your lighttpd instance is configured to handle, and it will refuse to handle new connections. This would usually result in the server being marked as down temporarily as the old connections die, but since lighttpd is unable to close this connections, once it hits the connection limit it stays in that state until it’s restarted.
We upgraded our webservers to a patched version of lighttpd that doesn’t exhibit this behaviour. We still didn’t know what was causing the increase in timeouts, but we expected that this patched version of lighttpd would be more resilient and would not result in our webservers being marked as down one by one if the same issue was to reappear.
Shortly after upgrading our lighttpd instances, we were hit with a very similar incident. Once again our render API was unavailable returning 503 errors, but this time the steady increase of connections in CLOSE_WAIT state wasn’t there. Once again, restarting lighttpd resolved the issue.
When investigating this issue, we could see no sign of load on the webservers, but the lighttpd logs were telling us a different story. We could see signs of lighttpd complaining that some of its backends were overloaded, but we couldn’t see any load signs on any of our metrics, which was confusing.
After further investigation, we realised that all the load indicators we were looking at were for the “g” app, but the overloaded backends belonged to the “hg” app. Because this set of webservers are exclusively dedicated to render requests, all our alerts and monitoring dashboards were focused on the application directly responsible for serving render requests, despite the fact that all requests were still proxied through the “hg” app.
When checking our metrics for the “hg” app, we could see how the number of active workers started quickly increasing until it would reach a worker limit and stop processing new requests.
It still wasn’t clear what was causing this load increase, and why it was affecting the “hg” app. When inspecting our webservers during an incident we notice a large number of phantomjs processes running that were at least a few minutes old. These processes are used for PNG rendering from grafana dashboards or from our alerting service, and were not respecting the timeouts we set for them. Because every process is tied to a “hg” worker, as these processes continue to build up the number of available “hg” workers is reduced, eventually overloading the backend.
After making changes to the “hg” app to enforce a strict timeout on any phantomjs processes we haven’t observed any more instances of this issue.
Our monitoring was able to give us an insight into what the state of the connections was during our investigation, which was invaluable to find the issue with lighttpd.
The lighttpd upgrade wasn’t enough to make the issue go away completely, but it greatly reduced its impact, as servers would still have a chance to recover, and would flap during the incident, instead of going permanently down until they were restarted.
This has been a string of related incidents, which is something that we work hard to try to avoid. We believe that we should be able to get to the main contributing factors of every incident so we can avoid repeat instances, and in this case we haven’t been able to do that quickly enough to avoid repeated instances.
We have upgraded our version of lighttpd to prevent a buildup of connections in CLOSE_WAIT state from affecting our services in the future.
We have ensured that all our phantomjs processes have a strictly enforced timeout going forward.
We are going to adjust our max connection limit for our render layer after verifying that the connection count for a given node on that layer is close to our current max value. [SRE-923]
We are making changes to our monitoring to make sure that we can track the load of both our apps across our renderservers.
We are going to look into improving our monitoring and alerting for our web servers being marked as down so that we can catch incidents like this before they have customer-facing impact. [SRE-925]