The Needle in the Haystack (The Visibility Challenge of IT Operations)
by jrizo, 05-01-2017 11:19 AM - edited 05-01-2017 11:57 AM
My intent in putting this post together is not to bore you with mundane talk about how we can make networks great again. Rather, I’m going to quickly review the challenges I’ve personally faced operating networks over the past 20+ years and offer a revolutionary solution for networkers. I’ll show you how this works in a real world scenario that solves “The Needle in the Haystack” challenge.
Three Operational Challenges
The Network is Slow: Operationally, simple problems like user access issues, links that have gone down, or switches or routers that have gone belly up can be quickly identified and remediated by network engineers targeting specific points in the infrastructure.
What happens with my favorite support call? When the complaint is that the network is slow, the real fun begins. Engineers sift through the latest batch of Syslogs and check NMS for error messages or alerts that may have been logged. Nothing found? Then it’s time to go element by element to see if anything along the path is behaving badly. Chances are that nothing is going to immediately jump out, leading to a painstaking search through the entire infrastructure to identify and resolve the cause of the slowdown.
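To make that first step concrete, here is a minimal Python sketch of sifting a batch of Syslog lines for anything at warning severity or worse. It assumes the messages still carry the standard `<PRI>` header (priority = facility x 8 + severity, where a lower severity number is more serious). The sample lines, host names and function names are made up for illustration and do not come from any particular NMS.

```python
import re

# Matches the "<PRI>" header at the start of a syslog message.
PRI_PATTERN = re.compile(r"^<(\d{1,3})>")

# Standard syslog severity names, indexed by severity value 0-7.
SEVERITY_NAMES = ["emerg", "alert", "crit", "err",
                  "warning", "notice", "info", "debug"]


def parse_severity(line):
    """Return the severity (0-7) encoded in the PRI header, or None."""
    match = PRI_PATTERN.match(line)
    if match is None:
        return None
    pri = int(match.group(1))
    if pri > 191:  # highest valid PRI is facility 23 * 8 + severity 7
        return None
    return pri % 8


def filter_by_severity(lines, max_severity=4):
    """Keep lines at 'warning' (4) or more severe; lower number = worse."""
    kept = []
    for line in lines:
        severity = parse_severity(line)
        if severity is not None and severity <= max_severity:
            kept.append((SEVERITY_NAMES[severity], line))
    return kept


# Hypothetical sample messages (facility local7 = 23).
sample_lines = [
    "<189>Jan  5 02:14:07 edge-rtr-1 LINK: port 1/3 up",
    "<187>Jan  5 02:14:09 edge-rtr-1 BGP: neighbor 192.0.2.1 down",
    "<190>Jan  5 02:14:10 edge-rtr-1 SYS: config saved",
]

for name, line in filter_by_severity(sample_lines):
    print(name, "|", line)
```

Of course, the painful part of “the network is slow” is that a filter like this often comes back empty, which is exactly when the element-by-element search begins.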
Border Gateway Protocol (BGP) Peer Goes Down: Another one of my favorite challenges is when a long-established BGP peer goes down overnight. Depending on peering relationships and network failover, this could be a big deal. For instance, if a customer uses BGP peer A for their primary network traffic and has a contract with BGP peer B as a backup connection, failing over to peer B comes at a hefty usage cost. The longer they are on BGP peer B, the more money they are paying for this service. Service providers typically staff BGP experts 7x24, but most federal agencies do not and cannot afford to do so. This means costs are mounting as the issue goes either unnoticed or unresolved until appropriate resources arrive. Worst case scenario…you’re in Texas and the outage is in Minot, North Dakota. Come on, tell me some of you haven’t been in this situation before.
Command Line Interface (CLI) is Cumbersome: If you’ve been following the Federal Insights Tech Corner, then you may have read through the most recent Tech Corner post on how the CLI is dead. The piece explains why using CLI and point management tools is inefficient and operationally cumbersome. I’m not going to belabor the fact that this is correct, but I want to present the current widespread use of CLI as another operational challenge agencies face.
Do any of these three challenges sound familiar? They do to me, painfully so. What I’ve been lacking in my career are really good tools that help mitigate the challenges of finding and fixing difficult network issues, like the examples above. I also need a way around the CLI, because it’s just not an efficient means of looking at many network elements at one time.
So, now let’s look at options to simplify your life as an operator, enhance the customer experience and reduce mean time to resolution. The challenges above are all ultimately caused by a lack of network visibility.
A New and Improved Toolbag
The Brocade SLX Insight Architecture offers a resolution to these three challenges by providing the visibility necessary to quickly and accurately address the real issues on the network. This solution delivers unparalleled real-time visibility into network traffic with no impact on the performance or reliability of the network data plane or control plane. How? It is a platform within a platform. The underlying routing and switching platform, with 230 Tbps of fabric capacity, handles basic connectivity requirements for 10, 40 and 100 Gbps Ethernet.
However, the hardware is not the cool part (although our hardware developers would certainly disagree).
What makes the SLX platform stand out is its Insight Architecture. Experts who have run a network or two built the architecture from the ground up with DevOps-style automation in mind (think NETCONF, OpenFlow, REST).
Inside the system reside two virtual machines running on KVM hypervisors. The first one, the System Virtual Machine, is for the system and only accessible from the CLI. The second virtual machine, the Guest Virtual Machine (Guest VM) is for the network operator. The system comes with pre-installed tools on the Guest VM, including Wireshark, Google Chrome and TCPDump (for you security types reading, yes, they are disabled by default).
Here’s what this means in action: Remember the previously mentioned outage in North Dakota? You can go ahead and cancel your flight as built-in tools enable remote troubleshooting. Insight Architecture offers a 10 GbE dedicated analytics path from the interface modules to the Guest VM and (not or) a 10 GbE dedicated services port to stream data back to the network operations center, providing comprehensive, automated visibility.
That’s the tool bag, but there’s still more. Let’s finish up with how we can make these tools do the work for us.
The Old School Approach
The BGP link failure scenario offers a real world example. While this use case is common, the same approach supports everything from provisioning a new tenant to adding a switch to the network. Typically these issues require a lot of manual effort. Here’s the old school way of dealing with a BGP outage:
That’s a pretty simplistic set of steps, but each one takes a significant amount of time before the actual remediation begins.
The Advanced Degree Approach: The Power of Insight
If you like to tinker around with home automation like I do, you’ve probably played with If This Then That (IFTTT). If so, you know that a simple tool with a user-friendly interface can be powerful. Tools similar to IFTTT can drastically reduce operational expenses and the mean time to repair network infrastructure. Think of them as IFTTT on steroids.
The Insight Architecture offers a powerful new approach. It provides unprecedented visibility into what is happening under the covers of the infrastructure. Insight enables automation of operational functions: the information it streams can trigger actions and automate almost any function.
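For anyone who hasn’t played with IFTTT, the “if this then that” idea boils down to pairing a condition with an action and running streamed events through the pairs. The Python sketch below shows only that general pattern. It is not how Insight or any Brocade product is implemented, and the event fields, rule name and function names are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Event = Dict[str, str]  # hypothetical event shape: flat string fields


@dataclass
class Rule:
    name: str
    condition: Callable[[Event], bool]  # the "if this" part
    action: Callable[[Event], str]      # the "then that" part


def bgp_peer_is_down(event: Event) -> bool:
    """True for a BGP state change that left the Established state."""
    return (event.get("type") == "bgp_state_change"
            and event.get("state") != "Established")


def start_diagnostics(event: Event) -> str:
    """Stand-in action; a real one would kick off data collection."""
    return f"collect diagnostics for peer {event.get('peer')}"


def run_rules(events: List[Event], rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply every rule to every event, returning (rule name, result)."""
    results = []
    for event in events:
        for rule in rules:
            if rule.condition(event):
                results.append((rule.name, rule.action(event)))
    return results


rules = [Rule("bgp-peer-down", bgp_peer_is_down, start_diagnostics)]

# Hypothetical streamed events.
events = [
    {"type": "bgp_state_change", "peer": "192.0.2.1", "state": "Idle"},
    {"type": "bgp_state_change", "peer": "192.0.2.2", "state": "Established"},
    {"type": "link_state_change", "port": "1/3", "state": "up"},
]

triggered = run_rules(events, rules)
print(triggered)
```

Only the first event fires the rule here. The healthy peer and the link event pass through untouched, which is the point: the operator only hears about what matters.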
Brocade Workflow Composer (BWC) provides the IFTTT front end, integrating and leveraging existing tools behind it, underneath it and on top of it. Take a quick look at the Splunk reference architecture below to see how:
The following example illustrates how the Insight Architecture identifies, attempts to resolve and alerts the operator of the BGP issue and outcome of the troubleshooting it performed. The Insight Architecture enables usage of Commercial Off the Shelf (COTS) tools including BWC, SDN Controller, Splunk, Slack, Zendesk and PagerDuty, which are featured in this scenario. Now let’s take a look at the graphic and see what visibility and automation can do working together:
The automated workflows do all the heavy lifting: identifying the issue, collecting the required data, analyzing it, taking action based on the analysis, posting messages, creating a ticket and notifying the on-call operator.
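As a rough illustration of that sequence, the Python sketch below strings the same steps together for a BGP peer outage. Every integration is a stub: the collector, the remediation step and the three notifiers are hypothetical stand-ins for whatever controller, chat, ticketing and paging tools you actually run, and none of this reflects the real BWC, Splunk, Slack, Zendesk or PagerDuty APIs.

```python
def run_bgp_workflow(event, collect, remediate, notifiers):
    """Run identify -> collect -> analyze -> act -> notify; return a step log."""
    log = [f"identified: peer {event['peer']} is {event['state']}"]

    data = collect(event["peer"])                 # collect required data
    log.append("collected diagnostics")

    interface_down = data.get("interface_state") == "down"   # analyze
    cause = "interface down" if interface_down else "unknown"
    log.append(f"analysis: {cause}")

    fixed = remediate(event["peer"]) if interface_down else False  # act
    outcome = "resolved" if fixed else "needs operator attention"
    log.append(f"outcome: {outcome}")

    summary = f"BGP peer {event['peer']}: cause={cause}, outcome={outcome}"
    for notify in notifiers:      # post message, create ticket, page on-call
        notify(summary)
    return log


# --- Hypothetical stubs standing in for real integrations ---
sent = []  # records what each stub "sent", so the run can be inspected


def collect_stub(peer):
    return {"interface_state": "down"}


def remediate_stub(peer):
    return True  # pretend that bouncing the interface restored the session


def post_message(text):
    sent.append(("chat", text))


def create_ticket(text):
    sent.append(("ticket", text))


def page_on_call(text):
    sent.append(("page", text))


steps = run_bgp_workflow(
    {"peer": "192.0.2.1", "state": "Idle"},
    collect_stub,
    remediate_stub,
    [post_message, create_ticket, page_on_call],
)
for step in steps:
    print(step)
```

Passing the collector, remediation step and notifiers in as plain functions keeps the workflow itself independent of any one vendor’s tooling, so swapping a chat or ticketing system means swapping one function.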
The Insight Architecture and Brocade Workflow Composer are all about providing an open network automation platform that enables cross-domain workflows to improve mission readiness.
This is one of many use cases that can help improve efficiency and IT reliability for government. The many other ways these technologies could be leveraged range from enabling end-to-end IT workflow automation to capitalizing on the power of DevOps methodologies and much more. The Power of Insight is in your hands now…so how will you use it?