In this blog, we’ll walk through a specific example of how to use this combination of platform and automation features to determine the root cause of an application performance problem, whether it lies in the network, the application itself, or the physical or virtual compute resources.
Let’s assume that your software team has deployed a distributed application on a scale-out leaf-spine IP fabric network (Figure 1). The fabric uses BGP-EVPN and application isolation to provide Layer 2 services across the fabric.
Figure 1: Distributed Application across Scale-Out IP Fabric
Selected users have been reporting intermittent, inconsistent performance problems. The software team suspects a network problem and has passed it to the network team to investigate further.
How do we go about troubleshooting the problem? We start at a high level, then work down to deeper detail until we isolate the issue:
Check overall traffic – look for any link congestion problems
Drill into specific per-application server traffic levels to identify abnormalities
Capture traffic from anomalous servers to drill down into specific packets
Streaming Data: The Big Picture
Brocade SLX switches support streaming interface counters. Using Brocade Workflow Composer, we can run a workflow to configure the streaming settings on our switches. This pushes out a profile that defines the statistics we want to stream and where to send the data. No need to log in to each individual switch. Our profile needs to include interface counters.
This data can be collected and displayed by tools such as Splunk, InfluxDB, Grafana, or the Elastic Stack.
Our starting point is to log in to a dashboard showing interface utilization graphs (Figure 2). This will tell us if there is any congestion occurring on links within the fabric, or at the edge ports.
But these graphs don’t show anything unusual. Traffic levels are normal, and no interfaces are showing congestion. We need to go deeper.
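To make the congestion check concrete, here is a minimal sketch of how utilization can be derived from two samples of a streamed octet counter. The function names, the sample layout, and the 80% threshold are assumptions for illustration; they are not part of any Brocade tool or dashboard.

```python
def utilization_percent(prev_octets, curr_octets, interval_s, speed_bps,
                        counter_bits=64):
    """Return link utilization in percent between two octet-counter samples."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval_s and speed_bps must be positive")
    delta = curr_octets - prev_octets
    if delta < 0:
        # The counter wrapped around between the two samples.
        delta += 2 ** counter_bits
    return 100.0 * (delta * 8) / (interval_s * speed_bps)


def congested_interfaces(samples, threshold_pct=80.0):
    """Flag interfaces at or above the threshold.

    samples maps an interface name to a tuple of
    (prev_octets, curr_octets, interval_s, speed_bps). Hypothetical layout.
    """
    flagged = {}
    for name, (prev, curr, interval_s, speed_bps) in samples.items():
        pct = utilization_percent(prev, curr, interval_s, speed_bps)
        if pct >= threshold_pct:
            flagged[name] = round(pct, 1)
    return flagged
```

In our scenario this kind of check comes back empty, which is exactly why aggregated interface counters are not enough here.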
SLX Visibility Services gives us multilayer classification capabilities including network parameter filters such as IP and MAC addresses, port numbers, VNIs, and workload matching. We can then take action on matching packets, such as count, drop or mirror.
We want to get traffic counters for each of our application servers, at every leaf switch that the application currently uses.
We need to:
Identify the IP addresses used by our application, which compute nodes they currently run on, and which switch ports they are connected to
Figure out which VNIs are used for that traffic
Create rules to match that traffic, and install on all relevant leaf switches
Monitor the results
The first three steps are tedious, repetitive work: a perfect case for automation. So we run a workflow to gather the IP addresses from our compute system, identify the VNIs used, and pass the details through to a workflow that sets up the matching rules, with a “count” action.
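As a rough sketch of what the rule-building step might look like, the function below turns a workload inventory into per-leaf match rules with a “count” action. The inventory fields (`ip`, `leaf`, `port`, `vni`) and the rule layout are hypothetical; a real workflow would pull this data from the compute system and push the rules through the switch's own interface.

```python
from collections import defaultdict


def build_count_rules(workloads, action="count"):
    """Group workloads by leaf switch and emit one match rule per workload.

    workloads is an iterable of dicts with the hypothetical keys
    ip, leaf, port and vni. Returns {leaf_name: [rule, ...]}.
    """
    rules_by_leaf = defaultdict(list)
    # Sort so the generated rules are stable from run to run.
    for w in sorted(workloads, key=lambda w: (w["leaf"], w["ip"])):
        rules_by_leaf[w["leaf"]].append({
            "name": "app-" + w["ip"],
            "match": {"ip": w["ip"], "vni": w["vni"], "port": w["port"]},
            "action": action,
        })
    return dict(rules_by_leaf)
```

Passing `action="mirror"` to the same function would produce the rule set for a later capture step, so one inventory lookup can serve both stages.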
Watching the results, we can then see traffic on a per-IP basis, rather than the aggregated interface stats we had earlier. This reveals something unusual: one of the servers has lower traffic volumes than the others. It’s not zero, but it is noticeably lower. What’s going on with that server?
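Spotting the odd server by eye works for a handful of hosts, but the same comparison is easy to script. This is a minimal sketch that flags servers whose byte count falls well below the median of their peers; the 50% ratio and the input format are assumptions, not values taken from the product.

```python
from statistics import median


def low_traffic_servers(byte_counts, ratio=0.5):
    """Return the servers whose byte count is below ratio * median.

    byte_counts maps a server IP to the bytes counted for it over the
    same interval. Hypothetical input format.
    """
    if not byte_counts:
        return []
    typical = median(byte_counts.values())
    return sorted(ip for ip, count in byte_counts.items()
                  if count < ratio * typical)
```

The median is used rather than the mean so that one quiet server does not drag down the baseline it is being compared against.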
SLX Insight Architecture
So now we want to dig deeper into that traffic. We run a new workflow that applies a “mirror” action to the interesting traffic, and sets up a packet capture on our Guest VM in the SLX Insight Architecture. No dedicated taps or hardware needed.
Now we have a pcap file that we can analyze in Wireshark. Looking at the packets in more detail, we see something a little unusual: one of the application components isn’t loading. Clients are timing out with that component, and failing over to another server.
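If you would rather script this check than scroll through Wireshark, one possible approach is to export the packet summaries and look for connection attempts that never get a reply. The sketch below assumes the timeouts show up as TCP SYNs with no SYN-ACK, and that the capture has already been decoded into simple records; the record fields are hypothetical, and this is not necessarily how the problem looked in the original capture.

```python
from collections import Counter


def unanswered_syns(packets):
    """Count TCP SYNs per (server, port) that never received a SYN-ACK.

    packets is an iterable of dicts in capture order with the hypothetical
    keys src, dst, sport, dport and flags ("SYN", "SYN-ACK", ...).
    """
    pending = Counter()
    for p in packets:
        if p["flags"] == "SYN":
            # Retransmitted SYNs for the same connection are counted too.
            pending[(p["src"], p["sport"], p["dst"], p["dport"])] += 1
        elif p["flags"] == "SYN-ACK":
            # A reply travels in the reverse direction, so flip the tuple.
            pending.pop((p["dst"], p["dport"], p["src"], p["sport"]), None)
    per_server = Counter()
    for (_src, _sport, server, dport), count in pending.items():
        per_server[(server, dport)] += count
    return dict(per_server)
```

A server and port that show up in the result while their peers do not are good candidates to hand over to the software team.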
Armed with this information, we can go back to the software team, who resolve the issue. Traffic is now balanced properly across all systems, all are working as expected, and users are happy.
Finally, we run a “cleanup” workflow that removes our packet capturing rules, and we’re done!
on 11-01-2016 08:55 AM - last edited on 11-01-2016 04:23 PM by jason_cmgr
Two weeks ago, I mentioned that distributed denial-of-service (DDoS) attacks, and the potential of BGP Flowspec (BGP-FS) as a way to mitigate them, were a very hot topic at NANOG 68 in Dallas. In general, this topic is top of mind for everyone, especially with the very recent widespread cyberattack. Let’s take a look at the continued evolution of methods to handle these attacks.
Traffic filtering policies have traditionally been very static in service provider networks. But in this age of traffic-based, and more widespread, DDoS attacks, operators need to create dynamic filters
by Jeni Lloyd 10-18-2016 09:13 AM - edited 10-18-2016 11:08 AM
At VMworld Barcelona today, Fujitsu announced their PRIMEFLEX for VMware Cloud Foundation integrated system. Brocade® VDX switches are the connective tissue for this new hyperconverged infrastructure option. In this post we’ll take a look at the drivers and benefits of this latest collaboration between Brocade, Fujitsu and VMware.
on 10-12-2016 11:00 AM - last edited on 11-01-2016 04:25 PM by jason_cmgr
For my twenty or so years in computer networking, the promise of a network that is sensitive to the applications running on it has been put forth almost continuously. And to some extent, constructs such as Quality of Service (QoS), policy-based routing, and even simple logical domain separation through subnetworks and VLANs have provided the support to improve application performance.
As part of the digital transformation, however, applications are becoming much more complex
on 10-10-2016 03:06 PM - last edited on 10-20-2016 02:31 PM by jason_cmgr
The many approaches to data center design strategy can be confusing. Essentially, though, there are two approaches: a vertically integrated complete-stack solution, or building your own stack. There is a parallel here with bring-your-own-device and bring-your-own-app.
on 10-04-2016 09:42 AM - last edited on 11-01-2016 04:30 PM by jason_cmgr
The Product Management team in Brocade’s Switching, Routing and Analytics Business Unit is shipping the 230Tbps-capable SLX 9850 router to NANOG 68 in Dallas (October 17-19). It will be presented at NANOG’s “Beer and Gear” event on Tuesday afternoon, October 18 at 6:00PM.
This high-end router is especially suited to IP fabric and core-aggregation
Have you ever thought about what types of cars are passing through an intersection, whether there are any accidents, or whether cars are ignoring traffic lights? Operators use cameras and sensors to get visibility into traffic intersections. Do you have the same visibility into your network? ...
by Jason.Nolet 09-13-2016 05:00 AM - edited 09-15-2016 11:22 AM
Looking back on the changes of the last two decades, we can see that networks were at the center of the disruption that swept through every industry twenty years ago with the introduction of the Internet...
by David Gorman 08-17-2016 03:22 PM - edited 04-13-2017 11:27 AM
It’s August and that means it’s time for VMworld 2016 in Las Vegas! This is THE destination to find all the sexy things you’ve been peeking at online and just never get the chance to sample in real life. And there’s also Vegas entertainment if that’s your thing instead…
by Jeni Lloyd 07-18-2016 01:52 PM - edited 03-24-2017 09:44 AM
The latest release of VMware NSX-V and the accompanying networking HW certifications for vendors including Brocade is an exciting new step towards driving agility across IT and delivering increased value from the network to the business. At Brocade, we are delighted to be the first HW VTEP to be listed in the VMware Compatibility Guide.
Network automation can’t be addressed with “protocols” alone! The missing piece in the network automation puzzle is a software framework that addresses the inefficiencies in today’s model with the following characteristics: