We have reverted a previous configuration change and things are now back to normal.
Our ingestion pipeline was dropping approximately 16% of incoming traffic from 18:13 to 18:28 UTC. The elevated render times proved to be a false alarm, as the 99th percentile remained unchanged throughout the incident.
We have traced the root cause of the issue back to a configuration change made while testing improvements to our DNS automation.
Our DNS automation worked as expected, but the configuration change had unexpected side effects that caused traffic from one of our load balancers to be rejected by our ingestion pipeline.
We now understand the side effects of this kind of change, and we will add extra safeguards around it in the future, along the lines of the sketch below.
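As an illustration of the kind of safeguard we have in mind, a pre-deploy check can verify that a proposed ingestion allowlist still accepts traffic from every known load balancer before the change is applied. This is a minimal hypothetical sketch, not our actual tooling; the field names and addresses are placeholders.

```python
"""Hypothetical pre-deploy safeguard: refuse to apply a config change
that would cause the ingestion pipeline to reject load balancer traffic.
All names and addresses below are placeholders, not production values."""

import ipaddress
import sys

# Placeholder: addresses of the load balancers that feed the pipeline.
KNOWN_LOAD_BALANCERS = ["10.0.1.10", "10.0.1.11", "10.0.2.10"]


def rejected_balancers(allowed_cidrs: list[str]) -> list[str]:
    """Return the load balancers the proposed allowlist would reject."""
    networks = [ipaddress.ip_network(c) for c in allowed_cidrs]
    rejected = []
    for lb in KNOWN_LOAD_BALANCERS:
        addr = ipaddress.ip_address(lb)
        if not any(addr in net for net in networks):
            rejected.append(lb)
    return rejected


if __name__ == "__main__":
    # Placeholder proposed config: the 10.0.2.0/24 range is missing,
    # mirroring the failure mode described above (one load balancer's
    # traffic rejected by the ingestion pipeline).
    proposed_allowlist = ["10.0.1.0/24"]
    bad = rejected_balancers(proposed_allowlist)
    if bad:
        sys.exit(f"refusing to apply: would reject traffic from {bad}")
    print("safe to apply")
```

Run against the failing example above, the check exits with an error naming 10.0.2.10, so the change never ships; gating config changes on a check like this is what would have caught this incident before rollout.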
Dec 2, 18:44 UTC
We're currently experiencing elevated response times and reduced traffic levels across our ingestion pipeline.
Dec 2, 18:30 UTC