Outage US-West - May 29th

Jason CEO Checkfront
edited May 2015 in News and Updates
At approximately 6:30 PM PST on May 29th, our upstream provider experienced a power failure affecting our US-West network. Unfortunately, one of the facility's eight generators also suffered an 'electromechanical failure', resulting in an extended outage.

We moved all customers on our US-West network to our backup network within the hour. Although service was restored at the failover location, performance was degraded, booking images were unavailable, and some customers may have experienced intermittent timeouts while creating a booking.

At approximately 11:45 PM PST, power was restored, and traffic was re-routed back to the home servers.

Customers on our US-East, US-Central, UK & Asia networks were NOT impacted. You can find out what network you are on by going to Manage / System in your account. Our main website (www.checkfront.com) was unavailable during the outage.

We are still investigating the full series of events with our provider. We apologize to all customers impacted and will review our failover procedures to ensure any future outages are as minimally disruptive as possible.

-Jason

Comments

  • Jason CEO Checkfront
    edited June 2015
    Just a follow-up on this. Our upstream provider has confirmed the cause of the May 29th power outage on our US-West network. The full detail of events follows below.

    Although the entire outage lasted over 8 hours, we were able to re-route the impacted traffic from our US-West node to US-Central within the hour, so new bookings on the consumer side continued, as did access to the admin interface. Once service was restored, around a dozen accounts were missing bookings that had been made on the failover location, and it took us some time to get those back into the system.

    This outage impacted service for around 26% of our customers, as well as our main website (www.checkfront.com).

    The postmortem:

    It took us longer than planned to move traffic to our failover location (the switch should take less than 10 minutes; this time it took about 50).

    For a handful of accounts, it took longer than acceptable to restore bookings made at the failover location (most were restored once the switch back occurred, but some took a few days to untangle). This was due to a complication in our database structure and replication.

    The takeaway:

    We are moving to a new hosting provider for our US-West network. Although we had a backup plan and enacted it, the extended outage is not acceptable to us, or to you.

    We are in the process of reviewing our failover procedures to ensure that any outage (they happen) causes the least disruption possible.

    To cover off:

    Your data is backed up every hour to a remote location that we can fail over to in case of an outage. We also roll the hourly data up into a daily backup, replicate it again elsewhere, and retain it for 7 days and then 30. That means every customer has 672 snapshot backups of their data over 30 days, stored somewhere other than the network their account lives on.
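    As a rough illustration of how a snapshot count like that can fall out of an hourly schedule, here is a minimal Python sketch. The 28-day hourly window is an assumption chosen for the example; the post above does not spell out the exact retention tiers.

        # Minimal sketch of counting snapshots under a tiered retention policy.
        # ASSUMPTION: the 28-day hourly window below is illustrative only; it is
        # one way a figure like 672 (24 x 28) could arise, not a statement of
        # the actual backup schedule.
        HOURS_PER_DAY = 24

        def hourly_snapshots(days_retained: int) -> int:
            """Hourly snapshots kept over `days_retained` days."""
            return HOURS_PER_DAY * days_retained

        def tiered_snapshots(hourly_days: int, daily_days: int) -> int:
            """Hourly snapshots for `hourly_days`, plus one daily copy for each
            remaining day up to `daily_days`."""
            return hourly_snapshots(hourly_days) + max(daily_days - hourly_days, 0)

        print(hourly_snapshots(28))      # 672 hourly snapshots over 28 days
        print(tiered_snapshots(28, 30))  # 674 including two extra daily copies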

    We have the means to restore and replicate data captured during an outage once a network is restored. We'll work to make sure this is a cleaner, more timely switchover in the future.
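    For the curious, getting those failover bookings back into the system amounts to merging rows written on the failover database into the primary once it returns. Below is a hypothetical Python sketch of that idea; the table and column names, and the use of SQLite, are assumptions for illustration and not our actual schema or tooling.

        import sqlite3

        # Hypothetical sketch: copy bookings created on the failover database
        # while the primary was down back into the primary, keyed on a unique
        # booking code. Schema and names are illustrative assumptions only.
        def reconcile_bookings(primary: sqlite3.Connection,
                               failover: sqlite3.Connection) -> int:
            existing = {row[0] for row in primary.execute("SELECT code FROM bookings")}
            merged = 0
            for code, created, data in failover.execute(
                    "SELECT code, created, data FROM bookings"):
                if code not in existing:
                    primary.execute(
                        "INSERT INTO bookings (code, created, data) VALUES (?, ?, ?)",
                        (code, created, data))
                    merged += 1
            primary.commit()
            return merged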

    Thanks,
    Jason

    --- Upstream outage postmortem ---

    At approximately 01:30 UTC, on May 30, 2015, the power utility (PG&E) experienced an outage affecting our Fremont datacenter. Seven of the facility’s eight generators started correctly and provided uninterrupted power. Unfortunately, one generator experienced an electromechanical failure and failed to start. This caused an outage which affected our entire deployment in Fremont.

    PG&E was in contact and gave an initial estimated time of restoration (ETR) for utility power of 04:30 UTC. This was later revised to 05:00 UTC and then 06:30 UTC. Utility power was actually restored at 06:05 UTC.

    The maintenance vendor for the generator dispatched a technician to the datacenter and it was determined that a battery used for starting the generator failed under load. The batteries were subsequently replaced by the technician. The generators are tested monthly, and the failed generator passed all of its checks two weeks prior to the outage. It was also tested under load earlier in the month.

    The UPS system and its batteries did not suffer a failure.

    As soon as the outage occurred, engineers verified it was indeed power related and remained on standby for over four hours waiting for power to be restored. Critical infrastructure was made operational immediately after power was restored and then customer nodes were booted.

    Several servers did not survive the sudden loss of power and needed individual attention. Engineers continued working well after power was restored to repair these systems and bring them back online, which involved both hot and cold spare components. We were able to recover every system.

    We apologize for this power interruption and any inconvenience it has caused you. We sincerely appreciate your business and are committed to providing the best service possible. Our colocation provider is in the process of reevaluating their maintenance procedures and adding additional tests for this battery condition.
This discussion has been closed.