Power Outage – Recap #fisherplazafire

As many of you are aware, we had a day-long outage on Friday. I wanted to give everyone a sense of what happened and what we are doing about it.

On Friday (July 3rd) there was a fire at Fisher Plaza. Fisher Plaza is a “Communication Hub” in the Northwest – it hosts a number of data-centers as well as TV and radio stations. The fire caused the automatic sprinkler system to kick in, which essentially shut down power to one of the buildings.

3:00 AM

We learned about this around 3AM Friday morning, and all QuestionPro technical staff were online assessing the situation by about 5AM. Since this was a system-wide outage (as opposed to a group of servers failing), we simply had to assess the situation as it developed.

6:00 AM

We get a preliminary indication that the root cause of the power failure is the fire, and that no one is allowed to enter or leave the building until the Seattle Fire Department does a sweep. At this point we are all online, waiting for Seattle Fire to clear the building and give the thumbs up. We redirected QuestionPro.com traffic to a temporary set of servers with a downtime notice, and asked users to check our Twitter feed for updates as well as our status page (status.surveyanalytics.com).

9:00 AM

We get an indication that Fisher’s electrical contractor is trying to get power back online – drying out the equipment to make sure it’s safe to operate. We start posting updates on the QuestionPro and IdeaScale Twitter accounts (twitter.com/questionpro and twitter.com/ideascale) – both QP and IS are hosted in the same set of cabinets. The entire building is out of power, and it’s a challenge even to get in and out of it (no elevators, electronic key cards don’t work, etc.).

We also make the determination that we should wait until about 5PM to see if the power comes back online before moving all the data to another data-center. We do have space in a backup data-center.

12:00 PM
Fisher and Internap communicate that they are bringing in mobile generators on flatbed trucks. The plan is to get the generators fired up and bypass the electrical room (where the water damage was) altogether.


5:00 PM
Engineers are still working on bypassing the electrical room. We decide to wait a couple more hours. There are a lot of other issues with moving all the data to the backup data-center: re-configuring the systems would take us longer, and we run the risk of not having enough servers to handle the load. Our backup systems are meant to store backups (not run the entire application load).

10:00 PM
Power is restored to the HVAC (heating and cooling) equipment. Power is then slowly turned back on to all the customer equipment, including ours.

3:00 AM
Power comes back online and our servers start humming. All QuestionPro technical staff are online by 5AM, working to make sure all services come back up properly. By about 6:45AM we are back to normal.

Twitter – Works like a charm:
We tried to keep everyone abreast of issues as they developed on Twitter. We were issuing updates and following others’ updates using the hashtags #fisherfire and #fisherplazafire. If anyone ever doubted the usefulness of Twitter in an emergency, this proved to me first hand that Twitter is amazingly useful for communicating in the face of one.

Through Twitter we found out that we were not the only ones affected by this fire. Some of the other sites that went offline were:

  • authorize.net
  • bing.com/travel (farecast)
  • bigfishgames
  • Bartell Drugs
  • allrecipes.com

Needless to say, this is a pretty big disruption of our services. Both Fisher Plaza and Internap have promised us that they’ll come up with a detailed explanation of the issues and steps to prevent such outages in the future. Meanwhile, this also exposed a couple of vulnerabilities in our own preparedness. In the spirit of openness I’ll talk about them, and not only will we talk about them, we’ll also do something about them and keep you posted on progress.

We will be posting a series of blog posts with the hashtag #fisherplazafire to communicate the steps we are taking to make sure this kind of disruption does not happen in the future. Like with any system, we cannot make things 100% foolproof, but we sure as hell can try.

Short-Term Issues:

Communication:
One of the shortcomings we noticed was that our blog (which is our primary medium of communication) was also hosted within our data-center. This has to change: we’ll be moving our blog (blog.questionpro.com) to hosted WordPress, and Rob Hoehn will oversee that move. We’ll also take this opportunity to segment our blogging and set up three separate blogs (one each for QuestionPro, IdeaScale, and MicroPoll).

Automated Phone Message:
We should be able to deliver the same information (like the Twitter updates) when people call in. We use Angel.com for our hosted PBX system, and we’ll set it up so we can give out updates when users call in during emergencies like this.

Pre-Planned Error Page:
We should have a system in place to switch our sites to an error page when all hell has broken loose. This time we had to scramble at the last minute to set up a separate system (in our backup data-center) to host the error page itself.
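To give a sense of what “pre-planned” means here, below is a minimal sketch of the kind of always-on maintenance page we could keep running in the backup data-center, so that repointing DNS or a load balancer is the only step needed during a crisis. The port and the page text are illustrative assumptions, not our actual setup.

```python
# maintenance_server.py -- minimal sketch of a pre-provisioned outage page.
# The port and message below are illustrative assumptions, not our actual
# configuration. Keeping this running at all times in the backup data-center
# means switching traffic to it is the only step left during an outage.
from http.server import BaseHTTPRequestHandler, HTTPServer

STATUS_PAGE = b"""\
<html>
  <body>
    <h1>QuestionPro is temporarily unavailable</h1>
    <p>We are working on restoring service. Follow
       <a href="http://twitter.com/questionpro">twitter.com/questionpro</a> or
       <a href="http://status.surveyanalytics.com">status.surveyanalytics.com</a>
       for live updates.</p>
  </body>
</html>
"""

class MaintenanceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # 503 tells browsers and search engines this is a temporary outage.
        self.send_response(503)
        self.send_header("Content-Type", "text/html")
        self.send_header("Retry-After", "3600")
        self.end_headers()
        self.wfile.write(STATUS_PAGE)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), MaintenanceHandler).serve_forever()
```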

Long-Term Issues:

Real-Time Data-Center Redundancy:
We have full redundancy _within_ the data-center, so if any one of our servers dies (hard drive failure, etc.), other servers pick up the slack automatically. If one of our database servers crashes, we have replicated servers that come online automatically within seconds. However, if the entire data-center goes offline, our current plan does not have a way to move to another data-center within minutes. We have full copies of the data stored offsite, but that is only the data.
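For readers curious what that in-data-center failover looks like in spirit, here is a rough sketch of a health-check loop that promotes a warm replica when the primary database stops responding. The hostnames, port, and the promote_replica helper are hypothetical stand-ins, not our actual tooling.

```python
# failover_monitor.py -- rough sketch of an in-data-center failover check.
# The hostnames, port, and promote_replica() are hypothetical; the real
# mechanism depends on the database and load balancer in use.
import socket
import time

PRIMARY = ("db-primary.internal", 3306)   # assumed primary database host
REPLICA = ("db-replica.internal", 3306)   # assumed warm replica

def is_alive(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def promote_replica():
    # Placeholder: in practice this would promote the replica to primary and
    # repoint the application tier (e.g. via a config push or a virtual IP).
    print("Primary unreachable -- promoting replica and repointing app servers")

if __name__ == "__main__":
    failures = 0
    while True:
        if is_alive(*PRIMARY):
            failures = 0
        else:
            failures += 1
            # Require several consecutive misses to avoid flapping on a blip.
            if failures >= 3 and is_alive(*REPLICA):
                promote_replica()
                break
        time.sleep(5)
```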

What we need to get to is the ability to _operate_ out of a different data-center in the case of a massive emergency like this. This will undoubtedly double our operating expenses, but given the business we are in, we simply need to do it. Over the next three months, we’ll be figuring out a solution so that we could turn off power to our primary data-center entirely and have everything move to our backup data-center.
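One direction we are evaluating is DNS-based failover with a short TTL: a watchdog running outside both data-centers checks the primary site and, if it stays unreachable, points the public record at the backup data-center. The sketch below only illustrates the decision logic; the health-check URL and IP address are made up, and update_dns_record stands in for whichever DNS provider API we end up using.

```python
# dc_failover.py -- sketch of DNS-based failover between data-centers.
# The URL and IP are examples, and update_dns_record() is a placeholder for
# the DNS provider's API; this only shows the decision logic.
import time
import urllib.request

PRIMARY_URL = "http://www.questionpro.com/health"   # assumed health-check URL
BACKUP_DC_IP = "203.0.113.10"                       # example IP, not real

def primary_is_healthy(timeout=5):
    """Return True if the primary data-center answers the health check."""
    try:
        with urllib.request.urlopen(PRIMARY_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

def update_dns_record(name, ip):
    # Placeholder for the DNS provider's API call; with a short TTL (say 60s)
    # clients would follow the change within a minute or two.
    print(f"Pointing {name} at {ip}")

if __name__ == "__main__":
    misses = 0
    while True:
        if primary_is_healthy():
            misses = 0
        else:
            misses += 1
            # Wait for several consecutive failures before failing over.
            if misses >= 3:
                update_dns_record("www.questionpro.com", BACKUP_DC_IP)
                break
        time.sleep(30)
```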

Finally, I want to acknowledge the patience and understanding so many of our customers have shown in the face of this emergency. As the CEO and an owner of this business, I do not take this lightly.

If there is something I can do for you, please feel free to ping me directly – vivek[dot]bhaskaran[at]surveyanalytics[dot]com
