Our aggregation layer has suffered a further decrease in capacity, leading to backlogs of up to 5 minutes.
We have expanded capacity in our aggregation layer to help work through the backlogs.
Posted Oct 17, 2019 - 12:20 UTC
A fix has been implemented and we are monitoring the results.
Posted Oct 17, 2019 - 10:55 UTC
As of 10:52 UTC our aggregation layer has returned to full health and all backlogs have been replayed.
We continue to monitor the situation.
Posted Oct 17, 2019 - 10:53 UTC
As of 10:12 UTC network connectivity issues have caused datapoints to be dropped in our aggregation layer. We have switched to a less strict healthcheck mechanism and are seeing recovery. Backlogs of up to 7 minutes are currently being replayed.
This will have caused delays in processing datapoints, leading to gaps in graphs and causing alerts to trigger in error.
No data has been lost.
Posted Oct 17, 2019 - 10:35 UTC
This incident affected: Graph rendering, Ingestion, and Alerting.