What the H happened! (site down, major hardware failure)

If you had been trying to get on this site and were having problems, I can assure you it was not a problem with your computer. It was all on the server side.

In spite of having three layers of backup and redundancy, sometimes it really does hit the fan, and last night starting at about 10 PM PST things started to get downright nasty. At 3 AM the site was completely dead. Not only were the sites dead, but also the email system, the Private Message system and more.

What happened was that we suffered a catastrophic hardware failure. The entire contents of the server we are on had to be reinstalled from backups. And that is the good news.

The bad news is that the server had been feeding corrupted files to the multiple layers of backup, and the newest clean backup was from December 29th. At first it looked like we were going to lose everything that had been posted from that point forward. Fortunately, the host we rent the server space from was able to provide Jay with selected files, including our own backup file from 3 AM on the morning of the 3rd. So we did lose a few things, but a lot less than originally thought. The thread that I think was hurt the worst was the Are You Coming To Barrett Jackson thread. If you posted there, you might want to replace your post.

Due to some very, very good work on Jay’s part we are back and almost completely intact. Jay suggested that I need to stock up on Chapstick, as I have some serious butt kissing to do at the host’s location, since they really did come through for us in a big way. I am not much of a kisser, so this should not be much fun for them either.

Anyway, I am sorry about the interruption of service and the general confusion as we fired the board up, took it down and fired it up again, and I know several of your posts disappeared in the process. I’d like to say that it will never happen again (this is only the 2nd time in almost twenty years we have seen this failure), but sometimes the bear does eat you.

Thanks for everyone’s patience and understanding!

Bill, I’ve pulled a few all-nighters getting IT systems back online, so I know from experience that some failures don’t care how many layers of protection you have. We appreciate the speed with which you and Jay got the site back up and running. It was quick enough that I didn’t even realize there had been an outage! Great work to all those involved in the recovery effort.

Bill, I thought you and Leon would both be smiling knowing what we were up against. Probably more IT folks on here than I know about, too. Anyway, we seem to be running smoothly now. Jay really deserves all the credit.

Bravo gentlemen, very good recovery time and work! :beerchug:
Thanks for all you do for us.

Bill.

I didn’t respond because I remember too well the lost holiday weekends and long nights from failures of software vendors and service providers and secure 24x7 environments that can never go down.

While I might have to go back to the grind soon, I like the fact that my current challenge is how to replace the exhaust gasket on the 428 (and yes, I know that 428s shouldn’t have exhaust gaskets).

Failures like the one you just had reinforce my view that service level agreements aren’t worth the electrons they agitated in their preparation, and failures of 428 exhaust gaskets just reinforce my desire to go back to working on VWs and Porsches.

Thanks for overcoming

Leon: Amen brother!

In over 30 years in the IT industry, I don’t think I’ve ever encountered a vendor who was willing to be held financially accountable for their SLAs. To me, they are nothing more than a marketing document until there is some vendor skin attached to meeting them.

I’ve been out of the IT organization for 7 years now. It’s a lot more fun to train bright, energetic 20-somethings on how to sell to IT than it ever was to do IT. And it probably beats dealing with 428 gaskets as well, if for no other reason than that half of my trainees are recent co-eds entering Sales. It’s like an eye candy factory around here!

But I do feel for you, Bill and Jay. And Leon, I’d also prefer working on VWs and Porsches to going back to 24-hour pages and system recovery events.

I worked on a couple of accounts with EDS where we would provide customers with compensation for SLAs we didn’t meet. The Texas-based integrity of EDS was far more satisfying than the west coast ethics of the current owners.

Leon, did their financial accountability extend beyond crediting back the prorated cost of the SLA for the period of the outage? That’s what most vendors offered when pressed and it typically amounted to peanuts compared to the actual cost of the outage.

I only ever dealt with EDS on projects and SLAs in APJ and EMEA, since we didn’t have IT resources in those regions at the time. But I can see where Ross Perot would have operated a bit differently than most.

In some cases the penalties were related to business impact, but as a minimum they were equivalent to the cost of providing the service during the outage.