
Site Reliability Engineers. They’re Here, Happy, Hurting.

by plaporte, 03-18-2017 02:39 PM

Last week, I attended my first SREcon in San Francisco. In case you’re unfamiliar, SRE stands for Site Reliability Engineer. It’s the hot new job title and role as more and more organizations embrace digital transformation and cloud operations models. For two days, the StackStorm team met with a steady stream of attendees and fielded their in-depth questions about ways to deploy services to users and customers faster and more efficiently.


Side Note: If you question just how hot this job function/title is, half the staffed booths at this event belonged to recruiters! They didn’t have technology or services to sell. They were there to recruit!
Just sayin’.


During the sessions, attendees heard success stories and advice from SRE experts from Google, Facebook, LinkedIn, and many more. If you’re in Ops or Dev or both, there was some pretty darn cool stuff. Take Facebook describing how their Pub-Sub service needed to handle over 1 million messages per minute. Or LinkedIn’s “Nurse” automation, which reduces mean time to resolution (MTTR) by taking inputs from a custom event correlation utility to home in on and remediate service failures without human intervention. Pretty cool.




But it wasn’t all tech talk. Comcast shared the keys to their successful digital transformation, opening with the statement that “tech was the easy part.” Spoiler alert! Their three keys to success are:


  1. A cultural shift that requires a new mindset, a new way of thinking and, most importantly, support from management at ALL levels.
  2. A vision that’s inspired by cloud models and a core belief that infrastructure should be cheap (“throwaway”), reusable, and created, managed, and destroyed by automation.
  3. People who are supported and empowered to learn new skills and adapt at their own pace or, at worst, are humanely managed out.

All this great information and these demonstrations of success had the attendees in the standing-room-only meeting rooms eager to apply what they had learned to their own journeys. Then reality would sink in, sometimes as soon as they exited the meeting room. It’s like they suddenly realized they lacked the training, time, and resources that these mega-scale cloud providers have. Sad!


What they needed was a boost: technology that they could wrap their heads around quickly, that was built by Ops and Dev engineers for Ops and Dev (note I didn’t say DevOps), and that would give them a solid foundation for achieving the same success as the presenters.


Enter StackStorm / Brocade Workflow Composer and a message of “event-driven automated remediation”. Talk about right place, right time. Throw in that StackStorm is open source, free, cross-domain, AND includes nearly 2000 pre-built integrations, and attendees were hooked.
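If “event-driven automated remediation” sounds abstract, here is a rough sketch of the general pattern in Python: an event arrives, a rule checks it against some criteria, and a matching rule fires an action. To be clear, this is NOT StackStorm’s API. The class names, the `service.down` trigger, and the `restart_service` action are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    """A monitoring event, e.g. emitted by an alerting system."""
    trigger: str              # hypothetical event type name
    payload: Dict[str, str]   # free-form details about the failure


@dataclass
class Rule:
    """Maps a trigger plus a criteria check to a remediation action."""
    trigger: str
    criteria: Callable[[Event], bool]
    action: Callable[[Event], str]


@dataclass
class Engine:
    rules: List[Rule] = field(default_factory=list)

    def handle(self, event: Event) -> List[str]:
        """Run the action of every rule that matches the event."""
        return [
            rule.action(event)
            for rule in self.rules
            if rule.trigger == event.trigger and rule.criteria(event)
        ]


def restart_service(event: Event) -> str:
    # Placeholder remediation: a real action would call an orchestration
    # or configuration tool instead of just returning a string.
    return f"restarted {event.payload['service']} on {event.payload['host']}"


# One hypothetical rule: restart the service only for critical outages.
engine = Engine(rules=[
    Rule(
        trigger="service.down",
        criteria=lambda e: e.payload.get("severity") == "critical",
        action=restart_service,
    ),
])

critical = Event("service.down",
                 {"service": "nginx", "host": "web01", "severity": "critical"})
minor = Event("service.down",
              {"service": "nginx", "host": "web02", "severity": "warning"})

print(engine.handle(critical))  # the rule matches, so the action runs
print(engine.handle(minor))     # criteria not met, so nothing happens
```

The point of the pattern is that the remediation logic lives in small, reviewable rules rather than in someone’s head at 3 AM.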


Automation is key. SREs realize it but are unsure where or how to get started. The answer could be as easy as 1-2-3.


  1. StackStorm provides a great starting point. It’s free, with a vibrant community that includes active participation from StackStorm engineering. It also includes ~2000 pre-built integrations. Talk about a low barrier to entry!
  2. Brocade Workflow Composer takes StackStorm into production, backed by global technical support and enterprise-grade security (LDAP/RBAC).
  3. Automation Suites bring in turnkey network lifecycle automation to achieve immediate business value with confidence.


Conclusion: This is an awesome time for operations, with so many new opportunities and so much cool technology. Don’t stress about where to get started and how to proceed. The community will provide... and it wouldn’t hurt to look at StackStorm or Brocade Workflow Composer.