Delay In UDP/Heroku Ingestion
Incident Report for Hosted Graphite
Postmortem

Overview

Our UDP traffic ingestion service failed to process incoming metrics, causing a loss of metrics for users sending data over this protocol.
Users of our Heroku and AWS add-ons were also likely affected by this.

Root cause

The root cause was a previously known issue that we had caught and fixed during testing while updating our services. However, our services for ingesting metric traffic across all protocols share a similar method for processing metrics, and we regrettably missed applying the fix to the UDP-specific ingestion service. The issue caused the service to fail whenever incoming packets could not be decoded, which usually happens when metrics are improperly formatted or corrupted. An illustrative sketch of this failure mode follows below.
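To illustrate the failure mode, here is a minimal sketch (not our actual service code) of a UDP ingestion loop that decodes each datagram without guarding against malformed input; the plaintext "name value timestamp" format and the port are assumptions for the example:

    import socket

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2003))  # Graphite-style plaintext port, assumed for this sketch

    while True:
        data, _addr = sock.recvfrom(65535)
        # A corrupted or improperly formatted packet raises UnicodeDecodeError
        # or ValueError here; left unhandled, the exception stops the whole loop
        # and ingestion fails for all subsequent packets.
        name, value, timestamp = data.decode("utf-8").split()
        print(name, float(value), int(timestamp))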

Resolution

Resolving the issue was straightforward once we identified the root cause, as we already had a solution from our earlier testing. However, it proved difficult to recognise the same issue in a live production environment, and we had mistakenly believed it to be fixed for all protocols after our original testing.

The solution came down to adding checks and validation to the ingestion and processing of packets so that malformed or corrupted packets cannot raise unhandled exceptions during decoding.
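A minimal sketch of that defensive approach, under the same assumptions as the example above (the metric format, port, and handler are hypothetical, not our production code): each packet is decoded and validated inside its own try/except, so a bad packet is logged and dropped rather than taking down the service.

    import logging
    import socket

    log = logging.getLogger("udp-ingest")

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 2003))

    while True:
        data, addr = sock.recvfrom(65535)
        try:
            # Validate and decode each packet in isolation.
            name, value, timestamp = data.decode("utf-8").split()
            print(name, float(value), int(timestamp))  # placeholder for the real pipeline
        except (UnicodeDecodeError, ValueError) as exc:
            # A malformed packet is logged and dropped instead of raising an
            # unhandled exception that would halt ingestion for everyone.
            log.warning("dropping undecodable packet from %s: %s", addr, exc)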

Follow-up actions

Following up on this incident, we are again reviewing the fixes from our previous testing to ensure other issues were not overlooked as this one was, and we are adding steps to our code review process to help track issues that affect multiple services and confirm that fixes are applied to each of them.

Posted Apr 03, 2023 - 04:39 UTC

Resolved
This incident has been resolved.
Posted Apr 03, 2023 - 02:32 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 03, 2023 - 01:25 UTC
Identified
We have located the root cause of the issue and are implementing a fix.
Posted Apr 02, 2023 - 18:31 UTC
Investigating
We are currently investigating this issue.
Posted Apr 02, 2023 - 16:14 UTC
This incident affected: Ingestion.