28 July 2008

100 hours

I've been through the most distressing work week of my life...well, at least since the days in the oilpatch when I would occasionally end up staying awake for 30 to 40 hours at a time.

The primary data server for all web applications crashed and burned a week ago Friday, followed closely by its backup on Sunday. The servers were over eight years old, the software components even older, and of course there were no rebuild instructions or disaster recovery plan. Let's just say that it has taken several attempts to get things going down the right path, and it was only this weekend -- seven days after the initial incident -- that things started to make remarkable progress.

Today, the servers are back up, as are the customer portal and 90% of the applications. We are working with partially complete source code bases to try to get the remaining 10% back online.

Needless to say, since last Sunday I have logged over 110 hours on this issue. No cycling, no socializing, nothing but sitting in a room and figuring things out. My brain is tired in addition to my eyes and body.

I'm not the only one. There has essentially been a team of 20-40 people working on this issue at any given time over the past week, including direct involvement by my company's CEO and our customer company's CEO and CIO. That makes for interesting decisions and political posturing that we probably could have done without, but no matter. That crap follows you wherever you go. It's the same reason why this problem occurred in the first place.

There is a group of people closer to the level of the technologies and software who have been saying for four years now that this infrastructure needed to be upgraded; the risks of not doing so were high, but the customer seemed okay to accept that risk profile. You would think that because of that, a proper disaster recovery plan would have been put in place, but it wasn't. Now that we fully know what the issues and problems were, I think that even with a proper recovery plan the outage might have lasted four days instead of seven, but even so, millions of dollars have been lost every day while this has been going on.

I'm very sick and tired and sore and stressed and depressed. It is particularly stuff like this that makes me hate my job; I think this may be the straw that breaks the camel's back. I don't think I can continue in this position for much longer. I did like having the leverage of experience in an environment where everyone else has left, but after days like Wednesday, which was 16 hours straight of being asked a question every 30 seconds, I'm not sure I want to assume the responsibility anymore. Is that wrong of me?

2 comments:

Cyrus said...

Dude, get out now. Why put yourself through this?

Acquiel said...

Ouch, no fun.

Give me a shout if you want something different ;)

I am sure I can make the budget happen to get a new resource in...

But wait... you'll have to work with Devon :S