
Feature Brief: Clustering and Fault Tolerance in Stingray Traffic Manager

Posted on 02-28-2013 02:39 AM

This technical brief discusses Stingray's Clustering and Fault Tolerance mechanisms ('TrafficCluster').

Clustering

Stingray Traffic Managers are routinely deployed in clusters of two or more devices, for fault-tolerance and scalability reasons:

cluster3.png

A cluster is a set of traffic managers that share the same basic configuration (for variations, see 'Multi-Site Management' below).  These traffic managers act as a single unit with respect to the web-based administration and monitoring interface: configuration updates are automatically propagated across the cluster, and diagnostics, logs and statistics are automatically gathered and merged by the admin interface.

Architecture - fully distributed, no 'master'

There is no explicit or implicit 'master' in a Stingray cluster - like the Knights of the Round Table, all Stingrays have equal status.  This design improves reliability of the cluster as there is no need to nominate and track the status of a single master device.

Administrators can use the admin interface on any Stingray device to manage the cluster.  Intra-cluster communication is secured using SSL and fingerprinting to protect against the interception of configuration updates and against impersonation (man-in-the-middle) of a cluster member.

Note: You can remove administration privileges from selected traffic managers by disabling the control!canupdate configuration setting for those traffic managers.  Once a traffic manager is restricted in that way, its peers will refuse to accept configuration updates from it, and the administration interface is disabled on that traffic manager.  If a restricted traffic manager is in some way compromised, it cannot be used to further compromise the other traffic managers in your cluster.

If you cannot access any of the unrestricted traffic managers and you need to promote a restricted traffic manager to regain control, please refer to the technical note 'What to do if you need to access a restricted Stingray Traffic Manager'.

Traffic Distribution across a Cluster

Incoming network traffic is distributed across a cluster using a concept named 'Traffic IP Groups'.

A Traffic IP Group contains a set of floating (virtual) IP addresses (known as 'Traffic IPs') and it spans some or all of the traffic managers in a cluster:

cluster4.png

The Stingray Cluster contains traffic managers 1, 2, 3 and 4.

Traffic IP Group A contains traffic IP addresses 'x' and 'y' and is managed by traffic managers 1, 2 and 3.

Traffic IP Group B contains traffic IP address 'z' and is managed by traffic managers 3 and 4.

The traffic managers handle traffic that is destined to the traffic IP addresses using one of two methods:

  • Single-hosted traffic IP groups:  If a group is configured to operate in a 'single-hosted' fashion, each IP address is raised on a single traffic manager.  If there are multiple IP addresses in the group, the IP addresses will be shared between the traffic managers in an even fashion.
  • Multi-hosted traffic IP groups:  If a group is configured to operate in a 'multi-hosted' fashion, each IP address is raised on all of the traffic managers.  The traffic managers publish the IP address using a multicast MAC address and employ the zcluster kernel module (see Stingray Kernel Modules for Linux Software) to filter the incoming traffic so that each connection is processed by one traffic manager, and the workload is shared evenly.

Single-hosted is typically easier to manage and debug in the event of problems because all of the traffic to a traffic IP address is targeted to the same traffic manager.  In high-traffic environments, it's common to assign multiple IP addresses to a single-hosted traffic IP group and let the traffic managers distribute those IP addresses evenly.  If you then publish all of the IP addresses using round-robin DNS, traffic is distributed approximately evenly across these IP addresses.
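
As a rough illustration of the single-hosted idea, the following Python sketch assigns each traffic IP address to exactly one traffic manager.  The round-robin rule, the function names and the addresses are assumptions made for this example; this brief does not describe the allocation algorithm that Stingray actually uses.

```python
def distribute_single_hosted(traffic_ips, managers):
    """Assign each traffic IP to exactly one manager (round-robin, an assumption)."""
    assignment = {}
    for index, ip in enumerate(traffic_ips):
        assignment[ip] = managers[index % len(managers)]
    return assignment


def load_per_manager(assignment, managers):
    """Count how many traffic IPs each manager is hosting."""
    counts = {manager: 0 for manager in managers}
    for owner in assignment.values():
        counts[owner] += 1
    return counts


# Hypothetical group: six traffic IPs shared between three traffic managers.
ips = ["192.0.2.10", "192.0.2.11", "192.0.2.12",
       "192.0.2.13", "192.0.2.14", "192.0.2.15"]
managers = ["stm1", "stm2", "stm3"]
assignment = distribute_single_hosted(ips, managers)
print(load_per_manager(assignment, managers))
```

With six addresses and three traffic managers, each traffic manager hosts two addresses, so round-robin DNS across the six addresses spreads clients roughly evenly over the machines.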

Multi-hosted traffic IP groups are more challenging to manage, but they have the advantage that all traffic is evenly distributed across the machines that manage the traffic IP group.

For more information, refer to the article 'Feature Brief: Deep-dive on Multi-Hosted IP addresses in Stingray Traffic Manager'.

If possible, you should use single-hosted traffic IP groups in very high traffic environments.  Although multi-hosted gives even traffic distribution, this comes at a cost:

  • Incoming packets are sprayed to all of the traffic managers in the multi-hosted traffic IP group, resulting in an increase in network traffic
  • Each traffic manager must run the zcluster kernel module to filter incoming traffic; this module will increase the CPU utilization of the kernel on that traffic manager
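
The filtering step can be pictured with a small Python sketch.  Every traffic manager sees every packet, computes the same deterministic function of the connection's source address and port, and keeps the packet only if the result matches its own index.  The use of CRC32 and the function names here are assumptions for illustration; the real zcluster module may select the owner quite differently.

```python
import zlib


def owner_index(src_ip, src_port, num_managers):
    """Map a connection to one manager index using a deterministic hash."""
    key = f"{src_ip}:{src_port}".encode()
    return zlib.crc32(key) % num_managers


def should_accept(my_index, src_ip, src_port, num_managers):
    """Each manager runs this check on every incoming packet it sees."""
    return owner_index(src_ip, src_port, num_managers) == my_index


# Hypothetical client connection arriving at a three-machine group.
for manager in range(3):
    print(manager, should_accept(manager, "198.51.100.7", 40123, 3))
```

Because all machines evaluate the same function on the same input, exactly one of them accepts each connection, but all of them pay the cost of receiving and inspecting the packet.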

Fault Tolerance

The traffic managers in a cluster each perform frequent self-tests that verify network connectivity and correct operation.  They broadcast health messages periodically (every 500 ms by default; see flipper!monitor_interval) and listen for the health messages from their peers.

If a traffic manager fails, it either broadcasts a health message indicating the problem, or (in the event of a catastrophic situation) it stops broadcasting health messages completely.  Either way, its peers in the Stingray cluster will rapidly identify that it has failed.
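
The two detection paths (an explicit 'unhealthy' message, or silence) can be sketched in Python as follows.  The 500 ms interval comes from the default mentioned above; the timeout of five missed intervals, the class name and the method names are assumptions for this example, not Stingray's actual settings or implementation.

```python
MONITOR_INTERVAL = 0.5                  # seconds, the documented default
FAILURE_TIMEOUT = 5 * MONITOR_INTERVAL  # assumed threshold, for illustration only


class PeerMonitor:
    """Tracks the health messages received from each peer."""

    def __init__(self, peers, now):
        self.last_seen = {peer: now for peer in peers}
        self.reported_unhealthy = set()

    def on_health_message(self, peer, healthy, now):
        """Record a health broadcast from a peer."""
        self.last_seen[peer] = now
        if healthy:
            self.reported_unhealthy.discard(peer)
        else:
            self.reported_unhealthy.add(peer)

    def failed_peers(self, now):
        """Peers that reported a problem or have been silent for too long."""
        silent = {peer for peer, seen in self.last_seen.items()
                  if now - seen > FAILURE_TIMEOUT}
        return silent | self.reported_unhealthy


monitor = PeerMonitor(["stm2", "stm3", "stm4"], now=0.0)
monitor.on_health_message("stm2", healthy=True, now=2.9)
monitor.on_health_message("stm4", healthy=False, now=2.9)
print(sorted(monitor.failed_peers(now=3.0)))
```

In this run, stm4 is treated as failed because it said so, and stm3 is treated as failed because nothing has been heard from it within the assumed timeout.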

In this situation, two actions are taken:

  • An event is raised to notify the Event System that a failure has occurred.  This will typically raise an alert in the event log and UI, and may send an email or trigger other actions if they have been configured
  • Any traffic IP addresses that the failed traffic manager was responsible for are redistributed appropriately across the remaining traffic managers in each traffic IP group

Note that if a traffic manager fails, it will voluntarily drop any traffic IP addresses that it is responsible for.

Failover

If a traffic manager fails, the traffic IP addresses that it is responsible for are redistributed.  The goal of the redistribution method is to share the orphaned IP responsibilities as evenly as possible with the remaining traffic managers in the group, without reassigning any other IP allocations.  This minimizes disruption and seeks to ensure that traffic is as evenly shared as possible across the remaining cluster members.
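
A minimal Python sketch of that goal is shown below: only the addresses owned by the failed machine move, and each one goes to whichever surviving machine currently hosts the fewest addresses.  The greedy least-loaded rule and all names are assumptions for illustration; the actual Stingray method is not specified in this brief.

```python
def redistribute(assignment, failed, managers):
    """Move only the failed manager's IPs, each to the least-loaded survivor."""
    survivors = [m for m in managers if m != failed]
    if not survivors:
        raise ValueError("no surviving traffic managers in the group")

    new_assignment = dict(assignment)
    load = {m: 0 for m in survivors}
    for owner in assignment.values():
        if owner in load:
            load[owner] += 1

    orphaned = sorted(ip for ip, owner in assignment.items() if owner == failed)
    for ip in orphaned:
        # Ties are broken by manager name so the result is deterministic.
        target = min(survivors, key=lambda m: (load[m], m))
        new_assignment[ip] = target
        load[target] += 1
    return new_assignment


# Hypothetical group: stm1 hosts two addresses and then fails.
before = {"w": "stm1", "x": "stm1", "y": "stm2", "z": "stm3"}
after = redistribute(before, "stm1", ["stm1", "stm2", "stm3"])
print(after)
```

Addresses 'y' and 'z' stay where they were, and the two orphaned addresses are split between stm2 and stm3, so each survivor ends up hosting two addresses.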

The single-hosted method is granular to the level of individual traffic IP addresses.  The failover method is described in the article How are single-hosted traffic IP addresses distributed in a Stingray cluster (TODO).

The multi-hosted method is granular to the level of an individual TCP connection.  Its failover method is described in the article How are multi-hosted traffic IP addresses distributed in a Stingray cluster (TODO).

State sharing within a cluster

Stingray machines within a cluster will share some state information:

  • Configuration: Configuration is automatically replicated across the cluster and all traffic managers will hold an identical copy of the entire configuration at all points
  • Health Broadcasts: Stingray machines periodically broadcast their health to the rest of the cluster
  • Session Persistence data: Some session persistence methods depend on Stingray's internal store (see Session Persistence - implementing timeouts).  Local updates to that store are automatically replicated across the cluster on a sub-second granularity
  • Bandwidth Data: Bandwidth classes that share a bandwidth allocation across a cluster (see Feature Brief: Bandwidth and Rate Shaping in Stingray Traffic Manager) will periodically exchange state so that each traffic manager can dynamically negotiate its share of the bandwidth class based on current demand
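
To give a feel for what demand-based sharing of a bandwidth class could look like, the following Python sketch splits a cluster-wide limit in proportion to each traffic manager's current demand.  The proportional rule, the function name and the figures are assumptions for this example; the negotiation that Stingray actually performs is not described here.

```python
def negotiate_shares(limit_kbps, demand_kbps):
    """Split a shared bandwidth limit in proportion to current demand."""
    total = sum(demand_kbps.values())
    if total <= limit_kbps:
        # The class is not saturated, so nobody needs to be throttled.
        return dict(demand_kbps)
    return {manager: limit_kbps * demand / total
            for manager, demand in demand_kbps.items()}


# Hypothetical class limited to 1000 kbit/s across two traffic managers.
print(negotiate_shares(1000, {"stm1": 1500, "stm2": 500}))
```

As demand shifts between machines, repeating the exchange lets the busier machine claim a larger part of the shared allocation while the total stays within the class limit.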

Stingray does not share detailed connection information (SSL state, rules state, etc.) across a cluster, so if a Stingray Traffic Manager fails, any TCP connections it is managing at that time are dropped.  To ensure that no connections are dropped, you can use a technique like VMware Fault Tolerance to run a shadow traffic manager that completely tracks the state of the active traffic manager.  This solution is supported by Riverbed and is in use in a number of deployments where five or six nines of uptime is not sufficient:

vmwareft.png

VMware Fault Tolerance is used to ensure that no connections are dropped in the event of a Stingray failure

Multi-Site Management

Recall that all of the traffic managers in a Stingray cluster have identical copies of the configuration and will therefore operate identically.

Stingray Traffic Manager clusters may span multiple locations, and in some situations, you may need to run slightly different configurations in each location.  For example, you may wish to use a different pool of web servers when your service is running in your New Jersey datacenter compared to your Docklands datacenter.

In simple situations, this can be achieved with judicious use of TrafficScript to apply slightly different traffic management actions based on the identity of the traffic manager that is processing the request (sys.hostname()), or the IP address that the request was received on:

$ip = request.getLocalIP();

# Traffic IPs in the range 31.44.1.* are hosted in Docklands
if( string.ipmaskMatch( $ip, "31.44.1.0/24" ) ) {
   pool.select( "Docklands Webservers" );
}

# Traffic IPs in the range 154.76.87.* are hosted in New Jersey
if( string.ipmaskMatch( $ip, "154.76.87.0/24" ) ) {
   pool.select( "New Jersey Webservers" );
}

In more complex situations, you can enable the Multi-Site Management option for the Stingray configuration.  This option allows you to apply a layer of templating to your configuration - you define a set of locations, assign each traffic manager to one of these locations, and then you can template individual configuration keys so that they take different values depending on the location in which the configuration is read.
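
Conceptually, a templated key behaves like a lookup that prefers a location-specific value and falls back to the shared value.  The Python sketch below models this with plain dictionaries; the data layout, the key names and the function name are hypothetical and do not reflect how Stingray stores or resolves its configuration.

```python
def resolve_key(key, defaults, overrides, location):
    """Return the location-specific value for a key, else the shared default."""
    return overrides.get(key, {}).get(location, defaults[key])


# Hypothetical shared configuration and per-location template values.
defaults = {"pool": "Default Webservers", "max_connections": 100}
overrides = {
    "pool": {
        "Docklands": "Docklands Webservers",
        "New Jersey": "New Jersey Webservers",
    },
}

print(resolve_key("pool", defaults, overrides, "Docklands"))
print(resolve_key("max_connections", defaults, overrides, "Docklands"))
```

Every traffic manager still holds the same configuration data; only the value that each one reads differs, according to the location it has been assigned to.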

There are limitations to the scope of Multi-Site Manager: it currently does not interoperate with [link target missing in source], and the REST API is not able to manage configuration that is templated using Multi-Site Manager.  Please refer to the 'What is Stingray Multi-Site Manager?' feature brief for more information, and to the relevant chapter in the Stingray Product Documentation for details of limitations and caveats.
