
Custom event handling in Stingray Traffic Manager

by aknox on 02-26-2013 04:06 AM

Overview

This article illustrates how you can create a custom 'program' action that is triggered when a selection of events is raised.  The action will seek to take the appropriate debugging or remedial action to address the problem associated with each event.

Candidate actions

  • If a node fails, we will capture network traffic to and from that node
  • If a node begins to underperform, we will obtain process information from that node
  • If the number of file descriptors gets too low, we will generate a report of file descriptor usage
  • If the Stingray software encounters a FATAL problem, we will generate a technical support report


Creating the Action


To help us create and debug the event handler, we'll first create a very simple debugging action.  Go to the System -> Alerting -> Manage Actions page. Create a Program Action named 'Debug Problem', and configure it to call /bin/echo:

[Screenshot: the 'Debug Problem' program action configured to call /bin/echo]


The program (/bin/echo) is passed two parameters by default: the name of the event type that triggered the action and information about the specific event reported within that event type.

This will suffice for now - we will add more arguments later when we have finished writing the program.

Creating an event type

Next, we create a set of events (an 'event type') that will trigger the action:

Go to System -> Alerting -> Manage Event Types and create a new event type called 'Problems to Debug'. You will be presented with a list of all the events that Stingray can catch in a tree structure. Select the following events:

  • Nodes -> General Events -> Serious Errors -> Node has failed
  • SLM Classes -> Information Messages -> Node information when SLM is non-conforming
  • General -> Warnings -> Running out of free file descriptors
  • General -> ZXTM Software -> Internal software error

[Screenshot: the 'Problems to Debug' event type with the four events selected]

Save the event type by clicking 'Update'.

Linking the event type to the action

The next step is to configure Stingray to trigger the action when one of the events in our event type occurs.

Go to the System -> Alerting page and select the 'Problems to Debug' event type from the drop-down box at the bottom of the page. The event type will appear in the list of mappings alongside a drop-down box containing a list of all the actions that have been configured. Select the 'Debug Problem' action from the list.

[Screenshot: the 'Problems to Debug' event type mapped to the 'Debug Problem' action]

It would also be useful to receive a notification that some debug output has been produced, so select 'E-Mail' from the list of actions as well. Click 'Update' to save the changes and then, if you haven't already done so, configure the E-Mail action to use your mail server and e-mail address.

Writing the Program

Currently the 'Debug Problem' action will not do anything useful when it is triggered, so we need to write a program for it to run.  The code for this program is attached to this article.

The program examines the event information it receives and, for certain events, performs some debugging actions. The program determines which event it is handling by matching the primary tag (as presented in the 'Event Type' configuration list).
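The tag matching described above can be sketched as a small dispatch table. This is a minimal sketch rather than the attached program: it assumes the event information is a tab-separated string in which the primary tag appears as a field of its own, and the handler names and the sample message are hypothetical. Only the three tags named in this article are included.

```perl
use strict;
use warnings;

# Hypothetical handler names keyed by primary tag. Only tags named in
# this article are listed; the attached program handles more events.
my %handler_for = (
    nodefail    => 'capture_traffic',
    slmnodeinfo => 'inspect_node_processes',
    testaction  => 'write_test_file',
);

# Return the handler name for the first known tag found as a complete
# tab-delimited field, or undef if the event is not one we handle.
sub handler_for_message {
    my( $message ) = @_;
    foreach my $field ( split /\t/, $message ) {
        return $handler_for{$field} if exists $handler_for{$field};
    }
    return undef;
}

# ASSUMPTION: sample message layout; the real layout may differ.
my $sample = "SERIOUS\tnodefail\tnodes/10.0.0.5:80\tNode has failed";
print handler_for_message( $sample ), "\n";
```

Matching whole fields rather than substrings keeps a tag from being triggered by text that merely contains it.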

When a node fails...

The Perl program looks for the 'nodefail' tag, then extracts the name of the node and its port from the message.


if( $message =~ /\tnodefail\t/ ) {
  # Capture the node's host and port from the "nodes/host:port" field
  my( $node, $port ) = $message =~ /\tnodes\/(\S+):(\d+)\t/;



It then starts capturing traffic going between Stingray and that node to see if there are any clues as to what is causing the failure. The node might, for example, be ignoring invalid requests from a particular client, thus causing the passive monitoring feature of Stingray to mark it as failed.


`tcpdump -c 1000 -n -s 0 -i any -w $diagnostic_file host $node`;
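Because $node is taken from the event message, one option is to build the command as a list and run it with system, so that no shell interprets the value and a failed capture is reported. This is a sketch under that assumption, not the attached program: the function names and sample values are hypothetical, and the flags are the same ones used above.

```perl
use strict;
use warnings;

# Build the tcpdump command as a list so that no shell is involved.
# Same flags as the backtick version: stop after 1000 packets, no name
# resolution, full packets, all interfaces, write to $diagnostic_file.
sub tcpdump_command {
    my( $diagnostic_file, $node ) = @_;
    return ( 'tcpdump', '-c', '1000', '-n', '-s', '0', '-i', 'any',
             '-w', $diagnostic_file, 'host', $node );
}

# Run the capture and report a failure instead of ignoring it.
sub capture_traffic {
    my( $diagnostic_file, $node ) = @_;
    my @cmd = tcpdump_command( $diagnostic_file, $node );
    system( @cmd ) == 0
        or warn "tcpdump failed (status $?): @cmd\n";
    return $? == 0;
}

# Hypothetical values, printed rather than executed:
print join( ' ', tcpdump_command( '/tmp/nodefail.pcap', '10.0.0.5' ) ), "\n";
```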



The captured traffic is then sent to a different machine so it can be analysed.


`scp $diagnostic_file $scpuser\@$scpdest`;



The program uses scp to send the information, which normally prompts for a password to access the remote machine. Because scp is invoked by the program, there is no opportunity to enter one. To get around this, configure password-less access to the remote machine, for example with SSH public-key authentication. Alternatively, if no location is passed to the program, it simply writes the files to a fixed location on the Stingray machine so you can retrieve them manually.
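The choice between copying the file and leaving it on the Stingray machine might look like the following. This is an illustrative sketch, not the attached program: the function name is hypothetical, and treating an undefined or empty $scpdest as "no location passed" is an assumption.

```perl
use strict;
use warnings;

# Decide what to do with a diagnostic file. Returns the scp command as a
# list when a destination is configured, or an empty list when the file
# should simply stay on the Stingray machine.
sub delivery_command {
    my( $diagnostic_file, $scpuser, $scpdest ) = @_;
    return () unless defined $scpdest && length $scpdest;
    # Prefix the user name only when one was supplied.
    my $target = ( defined $scpuser && length $scpuser )
               ? "$scpuser\@$scpdest"
               : $scpdest;
    # BatchMode makes scp fail immediately instead of prompting for a
    # password, which suits a program that runs without a terminal.
    return ( 'scp', '-o', 'BatchMode=yes', $diagnostic_file, $target );
}

# Hypothetical values:
my @cmd = delivery_command( '/tmp/nodefail.pcap', 'debug', 'loghost:/var/tmp/' );
print @cmd ? "would run: @cmd\n" : "keeping file locally\n";
```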

When the Stingray software encounters a problem...

If there is a problem with the Stingray software, the program will create a technical support report that you can send to the Riverbed support team should you need further assistance with the problem. Information about the specific problem that occurred in the software will be sent in the notification e-mail that we configured earlier.


`$ENV{ZEUSHOME}/zxtm/bin/support-report $diagnostic_file`;



When the number of free file descriptors is running low...

If Stingray detects that it is running low on free file descriptors, the program will obtain information about current memory usage, disk usage, active connections and file descriptor settings.


`ulimit -a >> $diagnostic_file`;


`vmstat -s >> $diagnostic_file`;


`df -h >> $diagnostic_file`;


`netstat -an >> $diagnostic_file`;



By examining this information, you should be able to determine why the system is running low on file descriptors. Often it is because the maximum number of file descriptors (as reported by ulimit) is too low, though it could also be caused by the system running out of memory or disk space or there simply being an abnormally high number of active connections.
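All four commands append to the same file, so the report is easier to read if each section is labelled. The following is a minimal sketch of one way to do that; the function name and header format are assumptions, and the demonstration call uses a harmless echo command and a temporary file rather than the real diagnostics.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Append the output of each shell command to $diagnostic_file under a
# header naming the command. Commands run through the shell so that
# builtins such as ulimit work.
sub append_diagnostics {
    my( $diagnostic_file, @commands ) = @_;
    open( my $out, '>>', $diagnostic_file )
        or die "Cannot append to $diagnostic_file: $!";
    foreach my $command ( @commands ) {
        my $output = `$command 2>&1`;
        $output = "(command could not be run)\n" unless defined $output;
        print {$out} "==== $command ====\n", $output, "\n";
    }
    close( $out ) or die "Cannot close $diagnostic_file: $!";
    return scalar @commands;
}

# In the event handler this would be called with:
#   'ulimit -a', 'vmstat -s', 'df -h', 'netstat -an'
my( $fh, $demo_file ) = tempfile( UNLINK => 1 );
close $fh;
append_diagnostics( $demo_file, 'echo demo' );
```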

When a Service Level Monitoring class fails...

Finally, if a Service Level Monitoring class fails, the program is triggered with the 'slmnodeinfo' event, which identifies the nodes that contributed to the failure. In this case, the program logs on to each of those nodes and obtains information about the running processes to see what is going wrong. To do this it uses rsh, so the '.rhosts' file on each node must allow the machine running Stingray to connect without a password.


`rsh -l $rshuser $node "ps -eo pid,ppid,rss,vsize,pcpu,pmem,cmd -ww --sort=-pcpu" >> $diagnostic_file`;


`rsh -l $rshuser $node "vmstat -s" >> $diagnostic_file`;



Testing the program

The program also looks out for a 'testaction' event, which is reported when you use the 'Update and Test' button on the action page. We will use this later to make sure the program is working correctly and copies the debug output to the correct location.

Adding the Program to Stingray

We can now configure the 'Debug Problem' action to use the correct program.  Upload the program to Stingray's Action Programs catalog (in the 'Extra Files' section).

[Screenshot: uploading the program to the Action Programs catalog]

Go to System -> Alerting -> Manage Actions, and edit the Debug Problem action; change the program from 'Custom...' to the program you just uploaded.

You will have noticed that the program takes several arguments beyond just the event information. These arguments include the location to which files should be sent and the scp and rsh usernames to use when connecting to remote machines. You can use the 'Argument Descriptions' section of the page to configure the action to supply these arguments. After expanding the Argument Descriptions section, enter 'rshuser' into the name box and 'Username used to log on to failing nodes' in the description box. Click 'Update' and then add the remaining arguments, scpuser and scpdest, in the same way.

The arguments will appear in the 'Additional Settings' section where you can configure them with the appropriate values for your system. Click 'Update' to save the configuration and scroll down to the Additional Settings section again. The command that will be executed when the action is triggered is shown at the bottom of this section:

[Screenshot: the Additional Settings section showing the configured arguments and the resulting command]
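Inside the program, the configured arguments have to be separated from the two event parameters. The sketch below shows one way to do that with Getopt::Long. It assumes the settings arrive as --name=value options placed before the event type and event information; check the command shown in the Additional Settings section to confirm the form used on your system. The function name and the sample values are hypothetical.

```perl
use strict;
use warnings;
# require_order stops option parsing at the first non-option argument,
# so the event type and event information are never treated as options.
use Getopt::Long qw( GetOptionsFromArray :config require_order );

# Split the program's arguments into configured settings and the two
# positional event parameters. ASSUMPTION: settings are passed as
# --name=value options ahead of the event type and event information.
sub parse_action_args {
    my @args = @_;
    my %setting = ( rshuser => '', scpuser => '', scpdest => '' );
    GetOptionsFromArray( \@args, \%setting,
                         'rshuser=s', 'scpuser=s', 'scpdest=s' )
        or die "Unrecognised arguments\n";
    my( $event_type, $message ) = @args;
    return ( \%setting, $event_type, $message );
}

# Hypothetical invocation:
my( $setting, $event_type, $message ) = parse_action_args(
    '--rshuser=debug', '--scpuser=debug', '--scpdest=loghost:/var/tmp/',
    'Problems to Debug',
    "SERIOUS\tnodefail\tnodes/10.0.0.5:80\tNode has failed" );
print "$event_type: copy to $setting->{scpuser}\@$setting->{scpdest}\n";
```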

It would also be helpful to enable 'Verbose' mode on the action at this point so any problems that occur are reported in the Event Log.

If you want to test the program, click 'Update and Test' on the Debug Problem action's page. You should find a file called 'test-event.txt' in the location you entered for the 'scpdest' parameter. If not, double-check that you can use scp to copy files from the Stingray machine to that location without any user interaction.

If you did get the file, then whenever any of the events in the 'Problems to Debug' event type occurs you will receive some additional debugging information!