Upgrade of internal VPN

Scheduled Maintenance Report for Hosted Graphite

Completed

Last night we completed the replay of long term data from last week's failed maintenance work. The replay took an unusually long time because some recent architectural changes meant we had a lot more data to replay than we expected. We're already part way through work to mitigate this weakness, so replays shouldn't take this long in future.

We'd like to offer our sincere apologies to our customers for the disruption caused by the failed maintenance and the unusually long time to replay data stashed during the incident. We're undertaking several weeks of focused maintenance and reliability work to help us address all the things that went wrong here.

Once again, sorry for the disruption. We're working hard on making the service better.

Posted Aug 20, 2015 - 12:05 UTC

Update

Our replays of one hour data are proceeding. We currently have approximately a third of the data replayed. We expect to have half replayed by midday Wednesday, and the large majority replayed by the end of that day. We'll provide an update once we have more information.

Posted Aug 19, 2015 - 00:08 UTC

Update

A few hours ago we started replaying the stashed data from the last incident, and it'll be starting to show up on your long term graphs soon. The replay is proceeding metric-by-metric, so you may see some metrics filled in before others. We'll provide another update tomorrow with progress on how the replay is going.

Posted Aug 17, 2015 - 16:58 UTC

Update

There was a short disruption to the webservers, which has been resolved by the on-call engineer.

Data from the last 48 hours is available as normal and we'll be replaying buffered data on Monday morning.

Posted Aug 15, 2015 - 20:46 UTC

Update

Webservers have been restored to stability and we're continuing to monitor. Data from the last 24 hours is available as normal and we'll be replaying buffered data on Monday morning.

Posted Aug 14, 2015 - 19:16 UTC

Update

The webservers are overloaded. We're working to fix it.

Posted Aug 14, 2015 - 17:49 UTC

Update

We have reverted the system to its previous state and it is now using the original VPN. The system as a whole is fully operational but we're still addressing some lingering graph responsiveness problems.

We've decided to defer replaying the previously stashed data until Monday in order to not risk further instability on a Friday evening. Long term graphs (>10 hours) will have a gap covering the maintenance/incident period this weekend. No significant volume of data loss occurred during this maintenance/incident and we’ll have the stashed data restored before most of you are back at your desks on Monday. Data from the last 24 hours is available as normal.

Sorry for any disruption caused. We obviously didn't expect this impact and we really regret it. Please be assured that we'll be taking a close look at what went wrong here and we'll learn all we can from it.

Posted Aug 14, 2015 - 17:33 UTC

Update

We are still working on this. Unfortunately it has become clear that the new VPN system is not stable when handling all the network traffic for our backend systems. We have begun slowly migrating back to the old system in order to keep things stable and fully functional for the weekend. We will reassess our options next week.

More updates to follow as the migration completes.

Posted Aug 14, 2015 - 13:47 UTC

Update

We have stabilised the webservers, and graphs are loading as normal.
Historical reads are still disabled. We were also forced to stash some datapoints for the last few hours which we will restore to the system tomorrow.

We are very sorry that this happened. The VPN change that we were trying to make touched almost every component of our backend and unfortunately caused trouble in places we had not anticipated, despite careful planning, testing and dry runs.

Posted Aug 13, 2015 - 19:17 UTC

Update

The system is now overloaded. In order to restore service we need to temporarily disable all reads. Your graphs will be temporarily empty until we can stabilise the system.

Posted Aug 13, 2015 - 17:54 UTC

Update

Some webservers have become unresponsive. We're working to return them to service now.

Posted Aug 13, 2015 - 17:22 UTC

Update

This maintenance continues. All of our Riak clusters are now transitioning to the new VPN. We estimate this will take an hour or so.

During the migration of our edge loadbalancers to the new network some datapoints were lost, but not in significant numbers.

Reading historical data is still disabled, but the system as a whole appears to be functioning as normal.

Posted Aug 13, 2015 - 16:01 UTC

Update

This maintenance is still ongoing, and we need to extend the maintenance window in order to complete it.

Right now the only customer-facing effect of the maintenance is that reading historical data is disabled, but there may be intermittent spikes in response time.

Please get in touch (help@hostedgraphite.com) if you have any questions or concerns. Apologies again for any disruption this is causing.

Posted Aug 13, 2015 - 13:10 UTC

Update

Our webservers are reporting elevated response times as a result of this maintenance. Sorry for the disruption.

Posted Aug 13, 2015 - 11:10 UTC

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Aug 13, 2015 - 09:00 UTC

Scheduled

This is an upgrade to our internal VPN which will prevent a recurrence
of the issues that caused our recent outage, and give us the ability
to add more scale to our storage clusters.

No data loss is anticipated.
Reading of some historical data may be unavailable for short periods
as we migrate storage clusters to the new network.

Posted Aug 06, 2015 - 15:28 UTC