We have replayed and verified all affected data.
This incident is resolved.
Apr 5, 09:11 UTC
We have replayed 80% of the affected data, which is once again available for querying.
We expect the remaining replays to complete in the next couple of hours.
We'll resolve this issue tomorrow once we have verified the replayed data.
Apr 4, 22:12 UTC
The replay is underway and some of the data from the affected period is now available. More data will become available as the replay continues. We expect this process to take several more hours, and we will provide further updates when we have new information.
Apr 4, 17:54 UTC
As of 16:48 UTC, we have started replaying data from the affected period. Because this is a significant amount of data, we are expanding our aggregation layer and limiting the replay process to a subset of hosts, so that the persistent storage backend and the aggregation servers responsible for replaying the data remain stable. We will provide additional updates on the status of this replay in an hour or when we have further information.
Apr 4, 16:51 UTC
Up to 3% of metrics ingested between 04:11 UTC on April 3 and 11:51 UTC on April 4 may not have been persisted to our backend storage layer for 5-minute resolution data. All other resolutions are unaffected. Our leading edge cache is protecting up to 16 hours of this data, which is still available for querying.
We've identified an issue that affected several nodes across our backend storage layer for 5-minute resolution data during this time period.
We are currently working on replaying the affected data from this time period. No data has been lost, and we will provide an update on the status of the data replay in an hour or when we have further information.
Apr 4, 15:48 UTC