Delay In UDP/Heroku Ingestion
Incident Report for Hosted Graphite
Postmortem

Overview

Our UDP traffic ingestion service failed to process incoming metrics, causing a loss of metrics for users sending data over this protocol.
Users of our Heroku and AWS add-ons were also likely affected by this.

Root cause

The root cause was a previously known issue that we had caught and fixed during testing while updating our services. However, our services for ingesting metric traffic across all protocols share a similar method for processing metrics, and we regrettably missed applying the fix to the UDP-specific ingestion service. The issue caused the service to fail whenever incoming packets could not be decoded, which usually happens when metrics are improperly formatted or corrupted. An illustrative sketch of this failure mode follows below.
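To illustrate the failure mode, here is a minimal sketch (not our actual service code) of a UDP ingestion loop that decodes each datagram without guarding against malformed input; the plaintext "name value timestamp" format and the port are assumptions for the example:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2003))  # Graphite-style plaintext port, assumed for this sketch

    while True:
        data, _addr = sock.recvfrom(65535)
        # A corrupted or improperly formatted packet raises UnicodeDecodeError
        # or ValueError here; left unhandled, the exception stops the whole loop
        # and ingestion fails for all subsequent packets.
        name, value, timestamp = data.decode("utf-8").split()
        print(name, float(value), int(timestamp))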

Resolution

Resolving the issue was straightforward once we identified the root cause, as we already had a solution from our earlier testing. However, it proved difficult to recognise the same issue in a live production environment, and we had mistakenly believed it to be fixed for all protocols after our original testing.

The solution came down to adding checks and validation to the ingestion and processing of packets so that malformed or corrupted packets cannot raise unhandled exceptions during decoding.
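A minimal sketch of that defensive approach, under the same assumptions as the example above (the metric format, port, and handler are hypothetical, not our production code): each packet is decoded and validated inside its own try/except, so a bad packet is logged and dropped rather than taking down the service.

    import logging
    import socket

    log = logging.getLogger("udp-ingest")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2003))

    while True:
        data, addr = sock.recvfrom(65535)
        try:
            # Validate and decode each packet in isolation.
            name, value, timestamp = data.decode("utf-8").split()
            print(name, float(value), int(timestamp))  # placeholder for the real pipeline
        except (UnicodeDecodeError, ValueError) as exc:
            # A malformed packet is logged and dropped instead of raising an
            # unhandled exception that would halt ingestion for everyone.
            log.warning("dropping undecodable packet from %s: %s", addr, exc)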

Follow-up actions

Following up on this incident, we are again reviewing the fixes from our previous testing to ensure other issues were not overlooked as this one was, and we are adding steps to our code review process to help track issues that affect multiple services and confirm that fixes are applied to each of them.

Posted Apr 03, 2023 - 04:39 UTC

Resolved
This incident has been resolved.
Posted Apr 03, 2023 - 02:32 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 03, 2023 - 01:25 UTC
Identified
We have located the root cause of the issue and are implementing a fix.
Posted Apr 02, 2023 - 18:31 UTC
Investigating
We are currently investigating this issue.
Posted Apr 02, 2023 - 16:14 UTC
This incident affected: Ingestion.