Ingestion issues

Incident Report for Hosted Graphite

Resolved

All replays have been completed. If you see any data discrepancies, please contact us.

Posted Jan 02, 2020 - 10:52 UTC

Update

Data storage backlogs have been completely replayed. Ingestion layer 2 backlogs are still being replayed.

Posted Dec 25, 2019 - 06:48 UTC

Update

We have tested various ways to replay ingestion layer 2 backlogs, and have settled on the one that offers the best combination of speed and stability. Data storage backlogs are still replaying.

Posted Dec 22, 2019 - 05:26 UTC

Update

We have stabilized the ingestion and aggregation layers. You may notice some missing datapoints if they were buffered in the ingestion layer 2 backlogs, which are waiting to be replayed.

Posted Dec 19, 2019 - 08:58 UTC

Identified

While tuning the system to clear the backlogs faster, we destabilized the aggregation layer, and we are now working to stabilize it. Datapoints may be buffered in the backlogs while we work on this.

Posted Dec 19, 2019 - 06:27 UTC

Update

The replay of datapoints in the ingestion layer 2 and aggregation layer backlogs is more than 90% complete.
There were some datacenter network wobbles earlier, but the system remained stable and rode out the storm.

Posted Dec 18, 2019 - 06:49 UTC

Update

We have completed stabilizing ingestion layer 2 as well.
Replay of datapoints buffered while the ingestion layer 2 and aggregation layer were being stabilized is ongoing.
For graphs viewed over 1-10 hour periods and those longer than 5 days, there may be gaps until the replay is completed.
Alerts will function normally except in cases where the gaps are large.
We will continue to work on speeding up the replays by tuning the system.

Posted Dec 16, 2019 - 13:27 UTC

Update

We have restored stability to our aggregation layer, and ingestion layer 2 (upstream of the aggregation layer) is close to being stabilized as well.

The likely cause of the incident is network connectivity issues at our datacenter, later compounded by a large customer sending a surge of datapoints before the ingestion and aggregation layers could be fully stabilized.

Throughout the incident, ingestion layer 1 remained stable. We had to allocate and shift additional resources carefully to stabilize ingestion layer 2 and the aggregation layer in order to keep the storage layer stable.

Posted Dec 16, 2019 - 12:24 UTC

Update

Ingestion is being restored. UDP, pickle, Heroku, and TCP are back to 100%. We are still investigating the root cause.

Posted Dec 16, 2019 - 06:07 UTC

Update

There are backlogs that need to be replayed, but we are allocating resources to prioritize real-time traffic. Some gaps may appear in graphs for older traffic from the last few hours.

Posted Dec 16, 2019 - 02:51 UTC

Update

We are continuing to investigate the cause of this issue. Ingestion is still degraded across TCP, HTTP, UDP, StatsD, pickle, and Heroku. You may also see delays in graph rendering. As a temporary measure we have restarted the affected servers and are shifting resources.

Posted Dec 16, 2019 - 02:09 UTC

Investigating

We're currently investigating issues with ingestion. This impacts TCP and the HTTP API.

Posted Dec 15, 2019 - 22:49 UTC

This incident affected: Ingestion.