Where the Devoted Followers of Uptime and the Cult of 9’s go to Pray.
Uptime or Death!
It’s 4 am, on a Saturday. You got in at 2 am after sleeping for a lovely 1.5 hours, monitor glow on your face, your jaw tightening up for a nice big yawn as you are poring over logs.
What changed? When was the last reboot? What versions are running? Does anyone in engineering have root access? What patches have been applied? What happened right before the downtime? Have we seen this before…..?
<…Meanwhile outside of your head, in the really, real world….>
There’s a dull buzzing on the Polycom as you listen to 22 people on the line breathing, typing and yawning. Someone, somewhere is calculating how much money you are losing the company for each microsecond the service is still down.
You hear footsteps, feel someone behind you and hear the dreaded “is it back up yet?”
Something snaps, your vision goes red, then black…
…as your vision returns, you realize there is a broken keyboard in your hand and an unconscious “Suit” lying on the floor…
(This story is fictional; any likeness to actual people alive or dead is a coincidence or a very unfortunate fact of your life. No actual “Suits” were harmed in the writing of this blog.)
I always wanted to say “Oh, damn! YES! It’s all been back up for 20 minutes, and I just FORGOT to tell anyone. Thank the gods you came, otherwise no one would have EVER known that it was all back online! You’re my hero.”
Ever been on-call? Do you know what it’s like to carry a laptop with two forms of network access with you everywhere you go — knowing that even if nothing breaks in the next two hours, you won’t be able to relax because any second the service could come crashing down and you would need to stop whatever you were doing and “save the world”…again?
I was on-call for the better part of 10 years. Sometimes I had shifts where I was the first escalation point for anything that went wrong. Sometimes I was the architect that didn’t have shifts but was ALWAYS on-call because, if the NOC couldn’t figure it out, I was the “Break Glass in Case of Emergency” guy.
Still to this day, if I hear hard plastic vibrate on a table, my pupils dilate. If I see a Motorola pager, my palms start to sweat.
You have one job — just one job. Keep the bloody service up all the time no matter what it takes. Thinking about taking in a movie? Enjoying a nice Christmas dinner? Seeking enlightenment at Church? Perhaps you’re at your best friend’s bachelor party? Enjoying some quiet time in jail after said bachelor party? It doesn’t matter. You have an “expected response time” or SLA (service level agreement). Whether it’s hell or high water, nukes are raining from the sky or you’re watching the Extended version of The Lord of the Rings Trilogy, it doesn’t matter, and no one cares. You WILL respond and either fix the problem or find someone who can. End of story.
So now it’s Monday morning. A vendor has come to meet with you, and you agree to see them only because they offered coffee. They said they have the coolest new product: it makes toast, is bigger on the inside, blah, blah, blah…
And all you can think about is that voice in your head asking “in what creative way will this new magical device bring down my network?”
Production is not for the weak of heart. As a result:
Uptime is better than “faster.”
Uptime is better than “easier.”
Uptime ranks right up there with breathing and Skyrim.
“What do you think?” comes a voice. “Ah yes,” I say, “You are still here. So phenomenal cosmic powers, you say? Practically prints money, huh?”
Tell me Mr. Sales Representative…
Can it be deployed in a cluster?
A/A or A/P?
How many power supplies?
Can it do a hot-code load?
How many 9’s can you guarantee?
Are you willing to put money behind that claim?
Who else is running your stuff?
Who was running your stuff before and now refuses to?
What about electromagnetic pulses?
Is any of your stuff running on an aircraft carrier or perhaps on a nuclear attack sub?
What would happen if I physically ripped out the management modules (you have two, right?) of a live system?
Would data keep flowing?
How would you compare your quality to the Mars Rover Opportunity?
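(For anyone who has never had to do the math on “how many 9’s,” here is a rough sketch of what each extra 9 buys you. The function name and the 365.25-day year are my own assumptions for illustration; a real SLA will define its own measurement window and its own exclusions.)

```python
# Rough sketch: allowed downtime per year for a given number of nines.
# Assumes a 365.25-day year; real SLAs define their own measurement window.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for N nines of availability.

    For example, nines=3 means 99.9% availability.
    """
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(2, 7):
        budget = downtime_minutes_per_year(n)
        print(f"{n} nines: about {budget:.2f} minutes of downtime per year")
```

Five 9’s works out to a little over five minutes a year, which is roughly one slow reboot.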
Yes, Mr. Sales Representative, I understand that Virtualization, Clouds, SDN/OpenFlow and <insert cool new thing here> are the best things since sliced bread. What I need to know is whether YOUR implementation of them is ready for production.
Think about how long Linux was running in Dev and QA environments before someone like a bank would run it in production. How many years were we all secretly running VMware on every system we could so we could prove it was production ready?
It’s a rite of passage. I want to see not months but YEARS of data. I want to know that your stuff, the version you are trying to sell me, has been running large, heavily used services for days and days and days. Don’t bring me something you invented 3 weeks ago and ask me to put it into production.
You don’t get to cut in line! You don’t get to skip the QA or Dev years. You don’t get to bypass running a bunch of less critical IT services before you can earn a chance to run in the production email environment.
UNIX had to prove itself over time. Linux crawled out of the ocean and floundered on the beach before becoming a heavy production player. VMware spent years and years prostrating themselves in QA and Dev environments before being invited to the Grand Cathedral.
Anyone trying to circumvent that process is a non-believer blurting blasphemies. Production is Sacred. Production Data Centers are Hallowed Ground. The technologies and products that are already here bled and sacrificed to get where they are. They earned their right to be here.
One must show just a wee bit of deference and humility if they wish to kneel at the Altar of Uptime.
This is Production.
switch_5:admin> uptime
Up for: 3653 days, Powered for: 3655 days