Delayed Processing of Metrics

Incident Report for Hosted Graphite

Resolved

Between 07:28 and 08:00 UTC, we saw delays in the processing of metrics caused by networking issues.

We replayed the datapoints dropped during this period, and these replays completed at 09:22 UTC.

Ingestion levels are normal and no data has been lost.

This incident is resolved.

Posted Apr 24, 2019 - 09:50 UTC

Update

We are continuing to replay dropped datapoints and are expanding capacity in our aggregation layer to expedite this.

Ingestion levels remain normal.

Posted Apr 24, 2019 - 08:59 UTC

Monitoring

At 07:55 UTC, we switched to a less aggressive form of health checking for our aggregation layer, which is more fault tolerant during network issues.

As of 08:10 UTC, we have seen ingestion levels return to normal. Data which was dropped at our ingestion layer is being replayed, with backlogs of up to 30 minutes. Additionally, we are seeing delays of up to 6 minutes in processing alerts.

Posted Apr 24, 2019 - 08:20 UTC

Investigating

Since 07:35 UTC, a network issue has resulted in delayed processing of metrics, as well as some issues with the website. We are investigating the full impact on our services.

Posted Apr 24, 2019 - 07:49 UTC

This incident affected: Website, Graph rendering, Ingestion, and Alerting.