by asardell, 03-07-2017 01:20 PM - edited 03-08-2017 10:55 AM
Auto-remediation can mean slightly different things depending on the scope of a problem or how “automatic” the solution is intended to be:
Do you want to remediate a problem (say, latency or congestion) based on all possible causes? 80% of the most likely ones? Just the top 1-2?
How much of the solution do you want handled by automatic responses to monitored events, and how much should be left to responsible engineers armed with rapid, sufficient diagnostic information?
The first point speaks to the open-endedness of possible network states and their complexity. The second may sound like a cop-out until you consider the potential risk of accepting an automated configuration change.
How Does Brocade View This?
The StackStorm project took a very ambitious view of auto-remediation, and in this August 2015 post provided a canonical definition that included identifying, diagnosing, reporting, and taking the necessary steps to automatically correct issues where possible, or to provide full problem analysis where it’s imprudent to leave humans out of the equation.
Brocade Workflow Composer (BWC) continued with this approach, taking it to heart as it applies to solving networking problems and ensuring that resultant updates are propagated throughout other domains, such as compute, storage, and applications. What sets BWC apart, and makes it uniquely suited to auto-remediation, is its event-driven nature: the ability to see what comes up via sensors and to take immediate action.
Network Auto-Remediation Example
The following use case ensures fabric reliability by dealing with any interface that is not forwarding packets. Among other benefits, this can prevent application timeouts in mission critical networks.
This example uses a simple spine-leaf topology built with SLX and VDX switches. Brocade Workflow Composer (BWC) is installed, and is interacting with the network while leveraging community integrations in place with Splunk and Jira for automatically analyzing network events (reported in logs) and (if needed) automatically creating trouble tickets.
A multistage workflow (Figure 1) is set to ensure all links are passing traffic: BWC is notified when this is not the case. It is not initially known why the interface is malfunctioning, and the problem can show itself in the form of a slow application, latency accessing a server, or a related symptom.
Figure 1: Auto-remediation on an Interface that is Not Passing Traffic
1. Link 1 on Switch 1 (highlighted in red) is not passing traffic. At this point, we don't know the reason.
2. Switch 1, as part of normal record keeping, produces an alert and sends it to a Syslog server; this could be a separate server or part of the Splunk installation.
3. Knowing that BWC is looking for records of issues with network elements, a Splunk filter passes a trigger built from the switch's Syslog entries to BWC.
4. BWC gets the webhook trigger from Splunk with the impacted device ID and interface information, then runs its auto-remediation workflow to recover from the situation.
5. BWC first creates a Jira ticket.
6. BWC troubleshoots Switch 1: it detects that the link is down, figures out why, and remedies the problem.
7. Link 1 comes back up, and Switch 1 notifies Splunk, which notifies BWC of success.
8. BWC updates the Jira ticket and closes it.
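The steps above can be sketched in code. This is a hypothetical illustration of the workflow's shape, not the real BWC/StackStorm API: the Splunk, Jira, and switch interactions are stand-in stubs, and all function and ticket names are invented for the example.

```python
# Hypothetical sketch of the multistage auto-remediation workflow.
# The integrations (Jira, diagnostics, remediation) are stubbed out.

def create_ticket(device, interface):
    """Stub for the Jira integration: open a trouble ticket."""
    return f"NET-101: {device}/{interface} not passing traffic"

def troubleshoot(device, interface):
    """Stub for the diagnostic stage: return a suspected cause."""
    return "link down"

def remediate(device, interface, cause):
    """Stub for the fix stage, e.g. bounce the interface."""
    return True  # pretend the interface came back up

def handle_webhook(payload):
    """Entry point: triggered by the webhook from Splunk (step 4)."""
    device, interface = payload["device"], payload["interface"]
    actions = []
    ticket = create_ticket(device, interface)        # step 5
    actions.append(("ticket_opened", ticket))
    cause = troubleshoot(device, interface)          # step 6
    actions.append(("diagnosed", cause))
    if remediate(device, interface, cause):          # steps 6-7
        actions.append(("ticket_closed", ticket))    # step 8
    else:
        actions.append(("escalated", ticket))        # leave it to an engineer
    return actions

log = handle_webhook({"device": "Switch 1", "interface": "Link 1"})
```

Note the fallback branch: if remediation fails, the ticket is escalated rather than closed, which mirrors the "provide full problem analysis where it's imprudent to leave humans out" half of the definition above.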
How Could We Take This Further?
Notice that in Step 6 above, we omitted the question of what exactly BWC found out and what it did to fix the problem. This raises the broader issue of auto-remediation's role in reconciling what a configuration or process should be with what it really is.
Assuming that there are autonomous systems between the spine and leaf, perhaps a BGP peer has gone down. So a logical first step might be to take some data from both "affected" devices (those on either side of the connection) and see what is causing the issue.
The problem may be a flap due to broken peering, convergence time mismatches, or inconsistent route selection. Or, unrelated to BGP at all, it may be a latent configuration error such as an MTU mismatch or an ACL entry that only just now affected the reachability of a server. Alternatively, it could be that congestion caused packet drops and a resultant application timeout.
Any or all of these possibilities can be explored and remediated through BWC, depending on the engineering expertise that can be captured in future workflows, making the network even more self-healing.
And how aggressive do you want to be in terms of correcting the error? If you are concerned with having to explain that an unexpected side-effect came from an automated workflow, perhaps it’s best to be cautious. In this example, you could try something benign like administratively shutting down the interface and bringing it back up. If that doesn’t fix the problem, address it manually.
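The cautious policy just described, bounce the interface a bounded number of times, verify, and otherwise escalate to a human, can be sketched as follows. The switch operations here are hypothetical stubs for illustration, not a real device API.

```python
# Sketch of the cautious remediation policy: a benign bounce with a
# retry cap and a conservative fallback to manual handling.

def bounce_interface(state):
    """Stub: admin-down then admin-up. The simulated interface starts
    forwarding again after `fix_after` bounces."""
    state["bounces"] += 1
    state["forwarding"] = state["bounces"] >= state["fix_after"]

def cautious_remediate(state, max_attempts=2):
    for _ in range(max_attempts):
        bounce_interface(state)
        if state["forwarding"]:       # verify before declaring success
            return "recovered"
    return "escalate_to_engineer"     # don't keep flapping the link

# One bounce fixes this simulated interface, so the workflow recovers.
result = cautious_remediate({"bounces": 0, "forwarding": False,
                             "fix_after": 1})
```

Capping the attempts is the key design choice: it keeps the automated action from itself becoming the "unexpected side-effect" you would later have to explain.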
A related question: how do these considerations tie into the larger issue of intent-based networking? We'll deal with this also-evolving concept in a future post.
Call to Action
Auto-remediation is invaluable, and Workflow Composer’s event-driven nature provides the basis to examine and provide corrective action for all aspects of a network issue. Find out more from these related links: