Our UDP traffic ingestion service failed to process incoming metrics, causing a loss of metrics for users who send them over this protocol.
Users of our Heroku and AWS add-ons were likely affected as well.
The root cause was an issue we had previously caught and fixed during testing while updating our services. However, the services that ingest metric traffic across all protocols share a similar method for processing metrics, and we regrettably missed applying that fix to the UDP-specific service. The issue caused the service to fail whenever an incoming packet could not be decoded, which usually happens when metrics are improperly formatted or corrupted.
Resolving the issue was straightforward once we identified the root cause, since we already had a solution from our earlier testing. However, it proved difficult to identify the same issue in a live production environment, and we mistakenly believed it had been fixed for all protocols after that original testing.
The solution came down to adding further checks and validation to packet processing and ingestion so that malformed data cannot raise unhandled exceptions during decoding.
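As an illustration of this kind of fix, the sketch below shows a decode path that drops and records malformed packets instead of letting the exception take down the ingestion loop. This is a minimal, hypothetical example: the report does not describe the actual service code, metric wire format, or language, so the function names and the `name:value` format here are assumptions.

```python
from typing import Optional

def decode_metric(packet: bytes) -> dict:
    """Decode a hypothetical 'name:value' metric; raise ValueError if malformed."""
    text = packet.decode("utf-8")  # may raise UnicodeDecodeError on corrupt bytes
    name, sep, value = text.partition(":")
    if not sep or not name:
        raise ValueError(f"malformed metric: {text!r}")
    return {"name": name, "value": float(value)}  # float() may also raise ValueError

def process_packet(packet: bytes, dropped: list) -> Optional[dict]:
    """Validate and decode one packet; record bad input instead of crashing.

    Wrapping the decode in try/except is the key change: a corrupted or
    improperly formatted packet is counted and skipped, and the service
    keeps processing the rest of the stream.
    """
    try:
        return decode_metric(packet)
    except (UnicodeDecodeError, ValueError) as exc:
        dropped.append(str(exc))  # surface for logging/metrics, keep service alive
        return None
```

Tracking the dropped-packet count (rather than silently discarding) also gives an operational signal that malformed traffic is arriving, which helps catch this class of failure before it recurs.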
Following up on this issue, we are re-reviewing our previous testing fixes to ensure no other issues were overlooked as this one was, and we are adding steps to our code review process to ensure that fixes for issues in shared code paths are applied across every service that uses them.