All Systems Operational
Website   Operational (99.97 % uptime over the past 90 days)
Graph rendering   Operational (99.97 % uptime over the past 90 days)
Ingestion   Operational (99.97 % uptime over the past 90 days)
Alerting   Operational (99.97 % uptime over the past 90 days)
Scheduled Maintenance
Emergency Network Maintenance Oct 25, 03:00-05:30 UTC
Between 03:00 UTC and 05:30 UTC our provider will be performing emergency maintenance on their network infrastructure, which will affect many of our services, including:

- Graph Rendering - gaps at the leading edge may be observed
- Alerting - alerts may be delayed

Historical data at different resolutions might be intermittently unavailable or present gaps. The following ranges are affected (restated in a short sketch after this list):
- Data older than 16 hours for 300s resolution.
- Data older than 8 days for 3600s resolution.
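
The affected ranges above can be restated as a small check. The Python sketch below uses only the resolutions and age thresholds from this notice; every name in it is illustrative rather than part of our tooling:

    # Illustrative helper restating the affected ranges from this notice.
    # Returns True if data of the given age, at the given resolution,
    # falls in the window that may be intermittently unavailable.
    AFFECTED_OLDER_THAN_SECONDS = {
        300: 16 * 3600,    # 300s resolution: data older than 16 hours
        3600: 8 * 86400,   # 3600s resolution: data older than 8 days
    }

    def may_be_affected(resolution_seconds, age_seconds):
        threshold = AFFECTED_OLDER_THAN_SECONDS.get(resolution_seconds)
        return threshold is not None and age_seconds > threshold

    print(may_be_affected(300, 2 * 86400))   # True: two-day-old data at 300s resolution
    print(may_be_affected(3600, 86400))      # False: one-day-old data at 3600s resolution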


Once this maintenance window is over, we will monitor the situation and provide updates on the status of the service.
Posted on Oct 19, 15:47 UTC
System Metrics
- www.hostedgraphite.com uptime
- Interface health: TCP
- Interface health: UDP
- Interface health: StatsD
- Interface health: HTTP API
- Interface health: carbon relay (pickle)
- Graph render time (95th percentile)
- Interface health: Heroku integration
- AWS connectivity (US-East-1)
- AWS connectivity (US-West-1)
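
For reference, the interface health checks above exercise our ingestion protocols. A minimal Python sketch of sending a single datapoint over a Graphite-style plaintext TCP interface is shown below; the API key, hostname and port are placeholders rather than guaranteed endpoints, so check your account settings for the real values:

    # Minimal sketch: send one datapoint over a Graphite-style plaintext TCP interface.
    # "YOUR-API-KEY", the hostname and the port are placeholders.
    import socket
    import time

    def send_datapoint(name, value, host="carbon.example.com", port=2003):
        # Plaintext protocol: "<metric path> <value> <unix timestamp>\n"
        line = f"{name} {value} {int(time.time())}\n"
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    send_datapoint("YOUR-API-KEY.deploys.count", 1)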
Past Incidents
Oct 21, 2017
Resolved - Our Heroku and AWS CloudWatch integrations have recovered and are backfilling any missed data.

We have determined the cause of the incident to be network connectivity issues between AWS and our provider from 02:00 UTC to 03:16 UTC. Datapoints received through the Heroku and CloudWatch integrations during this period were delayed and have now been replayed.
Oct 21, 03:40 UTC
Identified - We have identified the cause to be a drop in connectivity to AWS.

This will also cause AWS CloudWatch metrics to be delayed. We are investigating the source of this connectivity drop.
Oct 21, 03:15 UTC
Investigating - Our Heroku integration isn't receiving all log-based metric data and we are currently investigating.
Oct 21, 02:34 UTC
Oct 20, 2017

No incidents reported.

Oct 19, 2017

No incidents reported.

Oct 18, 2017
Completed - The scheduled maintenance has been completed.
Oct 18, 05:30 UTC
In progress - Scheduled maintenance is currently in progress. We will provide updates as necessary.
Oct 18, 03:00 UTC
Scheduled - Between 03:00 UTC and 05:30 UTC our provider will be performing emergency maintenance on their network infrastructure, which will affect many of our services, including:

- Both creating and querying annotations will be partially unavailable
- Creation time for new metrics may be affected
- Graphs may show gaps in leading-edge data
- Alerts may be delayed

Ingestion will continue to process incoming data as normal.

Once this maintenance window is over, we will monitor the situation and provide updates on the status of the service.
Oct 16, 17:52 UTC
Oct 17, 2017
Resolved - We have successfully failed over the database. We will be reviewing our failover process to ensure that the impact from this is reduced in future.
Oct 17, 17:57 UTC
Update - From 17:00 UTC until 17:02 UTC we performed another failover attempt, which may have resulted in alerts being triggered by malformed queries. Alerts triggered in this fashion will have recovered immediately, and normal operation has resumed.
Oct 17, 17:24 UTC
Monitoring - While we were performing a database failover as part of essential maintenance, alerts triggered between 13:50 UTC and 13:55 UTC may have been false positives; this is because alert queries are interpreted as a 'bad query' while the database is unavailable. Alerts triggered in this fashion will have recovered immediately afterwards, and normal operation has resumed.
Oct 17, 14:14 UTC
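
To make the failure mode above concrete, here is a small illustrative Python sketch (not our actual alerting code) of the difference between a query that fails during a failover and a query whose result breaches a threshold:

    # Illustrative only, not our alerting implementation: treat "query failed"
    # differently from "threshold breached" so a brief database failover does
    # not show up as a false-positive alert.
    class QueryError(Exception):
        """Raised when the alert query cannot be evaluated (e.g. database unavailable)."""

    def evaluate_alert(run_query, threshold):
        try:
            value = run_query()
        except QueryError:
            # During a failover the query looks like a 'bad query'; report the
            # state as unknown instead of firing the alert.
            return "unknown"
        return "alerting" if value > threshold else "ok"

    print(evaluate_alert(lambda: 5, threshold=10))   # "ok"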
Oct 16, 2017

No incidents reported.

Oct 15, 2017

No incidents reported.

Oct 14, 2017
Postmortem - Read details
Oct 15, 09:53 UTC
Resolved - This incident has been resolved.
Oct 14, 23:52 UTC
Monitoring - Ingestion levels have returned to normal and we are monitoring the situation.
Oct 14, 23:08 UTC
Investigating - We're currently investigating intermittent timeouts accessing our website and our HTTP ingestion endpoints.
Oct 14, 22:27 UTC
Postmortem - Read details
Oct 15, 09:49 UTC
Resolved - AWS has declared this incident resolved, and we can confirm that, as of 01:46 UTC, traffic rates are returning to previous levels and latency when accessing our website and render APIs has decreased back to normal levels. SSL handshakes to our website were affected from 23:20 UTC to 01:46 UTC.

Although this connectivity issue originated within AWS, we can confirm that traffic originating from outside AWS also experienced slow SSL handshakes for the duration of this incident. We will be investigating why, but we did see an increase in established connections to our load balancers during the incident, which we believe was caused by slower connections from the affected AWS regions. This increase may have been large enough to fill the internal queues of our load balancers, resulting in elevated latency for all SSL operations.
Oct 14, 02:13 UTC
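
For readers who want to observe handshake latency like this themselves, the following is an illustrative Python sketch (not part of our monitoring stack; the hostname is a placeholder) that times the TCP connect and the TLS handshake separately:

    # Illustrative sketch: time TCP connect and TLS handshake separately, so a
    # slowdown in the SSL layer is distinguishable from basic connectivity issues.
    import socket
    import ssl
    import time

    def handshake_timings(host, port=443):
        ctx = ssl.create_default_context()
        t0 = time.monotonic()
        sock = socket.create_connection((host, port), timeout=10)
        t_connect = time.monotonic() - t0
        t1 = time.monotonic()
        tls = ctx.wrap_socket(sock, server_hostname=host)  # performs the TLS handshake
        t_handshake = time.monotonic() - t1
        tls.close()
        return t_connect, t_handshake

    print(handshake_timings("www.example.com"))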
Identified - AWS is reporting connectivity issues in one of their regions (N. Virginia): https://status.aws.amazon.com/

Some of our services rely on Route 53 health checking to ensure availability in case of a node or service failure, but the health checking itself seems to have been affected by this outage, intermittently marking some of our load balancers as down and resulting in timeouts.

As a temporary workaround, we have disabled Route 53 health checking and will manually monitor the state of our load balancers to see if the traffic rates recover.
Oct 14, 01:08 UTC
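
As a rough illustration of what that manual monitoring can look like (a sketch only, using boto3 against the public Route 53 API; the health check ID is a placeholder and this is not our tooling):

    # Illustrative sketch: poll Route 53 health check status via boto3 while
    # automated health checking is disabled. The health check ID is a placeholder.
    import boto3

    route53 = boto3.client("route53")

    def checker_view(health_check_id):
        # Most recent status reported by each Route 53 checker region.
        resp = route53.get_health_check_status(HealthCheckId=health_check_id)
        return [
            (obs["Region"], obs["StatusReport"]["Status"])
            for obs in resp["HealthCheckObservations"]
        ]

    for hc_id in ["xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"]:
        print(hc_id, checker_view(hc_id))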
Investigating - We're currently investigating intermittent timeouts accessing our website and our HTTP ingestion endpoints.
Oct 14, 00:41 UTC
Oct 13, 2017

No incidents reported.

Oct 12, 2017
Postmortem - Read details
Oct 13, 18:43 UTC
Resolved - Our replay service has finished working through its backlog and all delayed data has now been processed and is fully available.

We will be publishing a post-mortem of this incident in the following days.
Oct 12, 23:57 UTC
Update - Ingestion has returned to normal levels. Data is being replayed and gaps will appear in graphs until these replays have completed. Some alerts may have been delayed.

Because of the quantity of data being replayed, we've expanded capacity to help expedite backlog clearance.
Oct 12, 18:32 UTC
Identified - We have identified a disruption of connectivity within our internal network that has affected all components. While deploying an update to our IPsec management tools we encountered an error and needed to roll back. Our internal network was disrupted by this deploy and connectivity is recovering slowly. We are monitoring the situation and will update you as soon as we know more.
Oct 12, 17:12 UTC
Oct 11, 2017
Postmortem - Read details
Oct 12, 15:28 UTC
Resolved - We have identified that approximately 50% of data received across all protocols during the period 13:40 UTC to 14:30 UTC was dropped.

We have been stable since the last update at 15:32 UTC and this incident is now resolved. We will be publishing the post-mortem in 24 hours.
Oct 11, 16:35 UTC
Update - Our leading edge cache has recovered and graph rendering is fully operational again.

We have identified that ingestion was affected between 13:40 UTC and 14:30 UTC and we are currently investigating the full scope of the impact.
Oct 11, 15:32 UTC
Investigating - We are currently investigating an issue with our leading edge cache that is affecting graph renders. You can expect to see gaps for some metrics at the leading edge of graph renders.
Oct 11, 14:53 UTC
Oct 10, 2017
Postmortem - Read details
Oct 12, 14:18 UTC
Resolved - We are happy to resolve this incident now as everything is stable. We will be publishing the post-mortem soon.
Oct 10, 16:05 UTC
Update - We have successfully restored connectivity to our leading edge cache and graph rendering is fully operational again.

We have identified that the impact of this incident was broader than first thought. Our StatsD ingestion service was impacted from 14:30 UTC to 15:15 UTC, resulting in a 50% reduction in traffic received at our StatsD endpoint. We have fixed the issue and the traffic rates have returned to normal.

We will be publishing a full post-mortem for this incident to outline what went wrong and what we plan to do to avoid this happening in the future.
Oct 10, 15:41 UTC
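
For reference, the StatsD endpoint mentioned above accepts the standard StatsD datagram format; a minimal illustrative Python sketch follows (hostname, port and metric prefix are placeholders, not guaranteed endpoints):

    # Illustrative sketch: emit one counter over the StatsD UDP protocol.
    # The hostname, port and "YOUR-API-KEY" prefix are placeholders.
    import socket

    def statsd_increment(bucket, host="statsd.example.com", port=8125):
        # StatsD plaintext datagram: "<bucket>:<value>|<type>"  (c = counter)
        payload = f"{bucket}:1|c".encode("ascii")
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, (host, port))

    statsd_increment("YOUR-API-KEY.requests.count")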
Monitoring - We have identified the issue to be a recent config change that prevented our leading edge cache from connecting to our health checking service. We have deployed a fix and are now monitoring the situation.

No data has been lost, and all data will be available once the affected servers have recovered.
Oct 10, 15:04 UTC
Investigating - We are currently investigating an issue with our leading edge cache that is affecting graph renders. You can expect to see gaps for some metrics at the leading edge of graph renders.
Oct 10, 14:39 UTC
Oct 9, 2017

No incidents reported.

Oct 8, 2017
Resolved - We have successfully replayed all data-points that were stored on the affected aggregation server at the time of failure (18:33 UTC).

We are happy to resolve this incident now.
Oct 8, 19:26 UTC
Monitoring - We are replaying the missing data for all resolutions and will update when this process has completed.
Oct 8, 19:11 UTC
Investigating - We have identified a failure in one of our aggregation servers, resulting in leading edge data being unavailable for approximately 1% of all metrics at all resolutions. No data has been lost, and all data will be available again once the affected server has recovered and we have replayed all data-points stored on it at the time of failure.
Oct 8, 18:55 UTC
Oct 7, 2017

No incidents reported.