Network Incident
Incident Report for Hosted Graphite
Resolved
The backlog of ingested data has been processed. No data was lost during the incident.
Posted Feb 25, 2019 - 22:20 UTC
Update
We're continuing to work through the backlog of ingested data and will update again in two hours or when the incident is resolved.
Posted Feb 25, 2019 - 20:30 UTC
Update
We're continuing to work through the backlog of ingested data and will update again in an hour or when the incident is resolved.
Posted Feb 25, 2019 - 19:27 UTC
Update
Our service provider has confirmed that the network incident is resolved as of 17:54 UTC. We are no longer experiencing elevated latency across our internal network.

We are working through our backlog of previously ingested data, and current data is being ingested at normal levels. We will continue to monitor the situation and will provide another update in one hour or when the incident has been resolved.
Posted Feb 25, 2019 - 18:23 UTC
Monitoring
As of 17:40 UTC, we have seen improvements in ingestion, rendering, and alerts. We were previously seeing between 60 and 70 times the usual latency on our internal network; this has since returned to normal levels.

We are working with our hosting provider to identify the cause of this increased latency, and will continue to monitor the situation.
Posted Feb 25, 2019 - 17:49 UTC
Update
At 17:10 UTC, we changed our aggregation layer's health checking to a more fault-tolerant method, which resulted in some improvement in the backlog of metrics to be processed.

However, as of 17:25 UTC, we are still seeing instability in our ingestion layer, an increase in render response times, and delayed alerts.
Posted Feb 25, 2019 - 17:34 UTC
Investigating
Since 16:48 UTC, we have been experiencing a network incident that is causing severe service disruption. Users will experience increased ingestion times, connectivity issues with our site and render API, and delayed processing of alerts.
Posted Feb 25, 2019 - 17:08 UTC
This incident affected: Website, Graph rendering, Ingestion, and Alerting.