In this blog, we’ll walk through a specific example of how to use this combination of platform and automation features to determine the root cause of an application performance problem — whether the cause lies in the network, the application itself, or the physical or virtual compute resources.
Let’s assume that your software team has deployed a distributed application on a scale-out leaf-spine IP fabric network (Figure 1). The fabric uses BGP-EVPN to provide Layer 2 services and application isolation across the fabric.
Figure 1: Distributed Application across Scale-Out IP Fabric
Selected users have been reporting intermittent, inconsistent performance problems. The software team suspects a network problem and has passed it to the network team to investigate further.
How do we go about troubleshooting the problem? We start at a high level, then work down to deeper detail until we isolate the issue:
Check overall traffic – look for any link congestion problems
Drill into specific per-application server traffic levels to identify abnormalities
Capture traffic from anomalous servers to drill down into specific packets
Streaming Data: The Big Picture
Brocade SLX switches support streaming interface counters. Using Brocade Workflow Composer, we can run a workflow to configure the streaming settings on our switches. This pushes out a profile that defines the statistics we want to stream and where to send the data. No need to log in to each individual switch. Our profile needs to include interface counters.
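As a rough sketch, the streaming profile a workflow would push might look like the following. The field names and `build_telemetry_profile` helper are illustrative assumptions, not the actual SLX telemetry schema or Workflow Composer API:

```python
import json

# Illustrative sketch only: the field names below are assumptions,
# not the actual SLX streaming-telemetry profile schema.
def build_telemetry_profile(collector_ip, collector_port, interval_secs=30):
    """Build a streaming profile that exports interface counters."""
    return {
        "profile-name": "interface-counters",
        "metrics": ["in-octets", "out-octets", "in-errors", "out-errors"],
        "interval": interval_secs,
        "collector": {
            "ip": collector_ip,
            "port": collector_port,
            "encoding": "json",
        },
    }

profile = build_telemetry_profile("10.0.0.50", 50051)
payload = json.dumps(profile)
# A workflow would then push this payload to every switch's management
# API, so no one has to log in to each switch by hand.
```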
This data can be collected and displayed by tools such as Splunk, InfluxDB, Grafana, or the Elastic Stack.
Our starting point is to log in to a dashboard showing interface utilization graphs (Figure 2). This will tell us whether any congestion is occurring on links within the fabric or at the edge ports:
But these graphs don’t show anything unusual. Traffic levels are normal, and no interfaces are showing congestion. We need to go deeper.
SLX Visibility Services gives us multilayer classification capabilities, including network parameter filters such as IP and MAC addresses, port numbers, VNIs, and workload matching. We can then take action on matching packets, such as count, drop, or mirror.
We want to get traffic counters for each of our application servers, at every leaf switch that the application currently uses.
We need to:
Identify the IP addresses used by our application, which compute nodes they currently run on, and which switch ports they are connected to
Figure out which VNIs are used for that traffic
Create rules to match that traffic, and install on all relevant leaf switches
Monitor the results
The first three steps are tedious, repetitive work: a perfect case for automation. So we run a workflow to gather the IP addresses from our compute system, identify the VNIs used, and pass the details through to a workflow that sets up the matching rules, with a “count” action.
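The automated steps above can be sketched roughly as follows. The endpoint inventory and `build_count_rules` helper are hypothetical stand-ins for the real compute-system and fabric lookups a workflow would perform:

```python
# Hypothetical sketch: the endpoint/VNI data would really come from the
# compute system and fabric, and the rule shape is an assumption, not
# the actual SLX Visibility Services syntax.

def build_count_rules(endpoints, vni_lookup):
    """Generate one 'count' rule per application server IP.

    endpoints: list of dicts like {"ip": ..., "leaf": ..., "port": ...}
    vni_lookup: maps a server IP to the VNI carrying its traffic
    """
    rules = []
    for ep in endpoints:
        rules.append({
            "switch": ep["leaf"],
            "match": {"ip": ep["ip"], "vni": vni_lookup[ep["ip"]]},
            "action": "count",
        })
    return rules

endpoints = [
    {"ip": "10.1.1.10", "leaf": "leaf1", "port": "0/1"},
    {"ip": "10.1.1.11", "leaf": "leaf2", "port": "0/3"},
]
vnis = {"10.1.1.10": 5001, "10.1.1.11": 5001}
rules = build_count_rules(endpoints, vnis)
# A second workflow would install each rule on its leaf switch.
```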
Watching the results, we can then see traffic on a per-IP basis, rather than the aggregated interface stats we had earlier. This reveals something unusual: one of the servers is carrying noticeably less traffic than the others. It’s not zero, but it is well below the rest. What’s going on with that server?
SLX Insight Architecture
So now we want to dig deeper into that traffic. We run a new workflow that applies a “mirror” action to the interesting traffic, and sets up a packet capture on our Guest VM in the SLX Insight Architecture. No dedicated taps or hardware needed.
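Conceptually, this is a small change to the rule we already installed: same match, different action, plus a destination for the mirrored packets. A hedged sketch (the rule shape and the analyzer destination name are assumptions, not actual SLX syntax):

```python
# Illustrative only: the rule fields and "analyzer" destination below
# are assumptions, not the actual SLX Visibility Services syntax.

def to_mirror_rule(count_rule, analyzer_port):
    """Turn a 'count' rule into a 'mirror' rule aimed at the Insight VM."""
    rule = dict(count_rule)  # copy so the original rule is untouched
    rule["action"] = "mirror"
    rule["mirror-destination"] = analyzer_port
    return rule

count_rule = {"switch": "leaf2", "match": {"ip": "10.1.1.12"}, "action": "count"}
mirror_rule = to_mirror_rule(count_rule, analyzer_port="insight-vm-eth0")
# The workflow would install this rule, then start a packet capture on
# the Guest VM to write the mirrored traffic to a pcap file.
```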
Now we have a pcap file that we can analyze in Wireshark. Looking at the packets in more detail, we see something a little unusual: one of the application components isn’t loading. Clients are timing out with that component, and failing over to another server.
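The failure pattern in the capture — connection attempts to one component that never complete — can be illustrated in miniature. Here packets are modeled as simple tuples rather than parsed from the pcap, and all addresses are made up:

```python
# Simplified illustration of the symptom in the capture: SYNs to one
# component that never get a SYN-ACK. Packets are modeled as
# (src, dst, flags) tuples, not parsed from a real pcap.

def unanswered_syns(packets):
    """Return destinations that received a SYN but never replied with SYN-ACK."""
    syns, synacks = set(), set()
    for src, dst, flags in packets:
        if flags == "SYN":
            syns.add((src, dst))
        elif flags == "SYN-ACK":
            synacks.add((dst, src))  # the reply flows the other way
    return sorted({dst for (_, dst) in syns - synacks})

packets = [
    ("client1", "10.1.1.12:8080", "SYN"),      # component never answers
    ("client1", "10.1.1.10:8080", "SYN"),
    ("10.1.1.10:8080", "client1", "SYN-ACK"),  # healthy failover target
]
print(unanswered_syns(packets))  # -> ['10.1.1.12:8080']
```

In Wireshark itself, the same pattern shows up as retransmitted SYNs with no response from the failing component.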
Armed with this information, we can go back to the software team, who resolve the issue. Traffic is now balanced properly across all systems, all are working as expected, and users are happy.
Finally, we run a “cleanup” workflow that removes our packet capturing rules, and we’re done!