
Feature Brief: Health Monitoring in Stingray Traffic Manager

Posted on 02-22-2013 07:39 AM

Stingray operates as a TCP/UDP proxy.  It receives traffic from remote clients, processes it internally, and then hands it off to a pool.  The pool performs a load balancing decision to pick a back-end server, and then forwards the traffic to that server node.

Stingray has two main methods to verify the correct operation of a back-end server. If it detects that a server has failed, it stops sending traffic to that server until the server has recovered and can be reintroduced into the pool.

  • Passive Health Monitoring is a basic technique that checks each response from a back-end server. It uses live traffic to detect a limited range of errors (it is not application aware), so end users may experience slow or broken responses while Stingray confirms that the node has failed.
  • Active Health Monitoring uses synthetic transactions to probe each node at regular intervals and verify correct operation. It can detect a very wide range of application-level failures, and it detects node failures independently of live traffic.

In most cases, both health monitoring techniques are used together to assure the availability of the service.

Passive Health Monitoring

Every time Stingray forwards a request to a back-end server, it verifies that a response is received. A number of tests are performed:

    1. If the connection is refused, or is not established within the max_connect_time setting (default 4 seconds), the request is considered to have failed;
    2. If the connection is closed prematurely, or if the beginning of a response is not received within max_reply_time seconds (default 30 seconds), the request is considered to have failed;
    3. SSL only: if the SSL handshake to the node fails, the request is considered to have failed;
    4. HTTP only: if an invalid HTTP response or a 503 status code is received, the request is considered to have failed.

    The max_connect_time and max_reply_time settings are part of the Connection Management settings of a Pool. If any of these tests fails, the request may be retried against another node in the pool, depending on whether the request is idempotent. The request may be tried against every working node in the pool before the traffic manager gives up. If the traffic manager does give up, it either drops the connection or returns a custom error message (Virtual Server -> Connection Management -> Connection Error Settings).
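
The retry behaviour described above can be modelled roughly as follows. This is an illustrative Python sketch, not Stingray's implementation: the names forward_with_retry, send, RequestFailed and flaky_send are hypothetical, and the rule that a non-idempotent request is never replayed is a simplifying assumption.

```python
from typing import Callable, Iterable, Optional


class RequestFailed(Exception):
    """Raised by `send` when one of the passive checks fails."""


def forward_with_retry(
    nodes: Iterable[str],
    send: Callable[[str], str],
    idempotent: bool,
) -> Optional[str]:
    """Try each working node in turn; return the first response, or None."""
    for node in nodes:
        try:
            return send(node)
        except RequestFailed:
            if not idempotent:
                # Assumption: a non-idempotent request is never replayed.
                return None
    # Every working node failed: the caller would now drop the
    # connection or return the configured error message.
    return None


def flaky_send(node: str) -> str:
    # Hypothetical back end in which only "node-c" answers.
    if node != "node-c":
        raise RequestFailed(node)
    return f"200 OK from {node}"


nodes = ["node-a", "node-b", "node-c"]
result_get = forward_with_retry(nodes, flaky_send, idempotent=True)
result_post = forward_with_retry(nodes, flaky_send, idempotent=False)
```

In this model the idempotent request is eventually served by node-c, while the non-idempotent request gives up after the first failure.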

    If requests to a particular back-end server fail node_connection_attempts times in a row (default 3), Stingray assumes that the back-end server has failed. Stingray sends the server no new requests for a period of time (node_fail_time, default 60 seconds), and then occasionally sends it a speculative request to determine whether it has recovered.
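
The bookkeeping behind this rule can be sketched as a small state tracker. The class below is a hypothetical model, not Stingray code: only the setting names node_connection_attempts and node_fail_time come from the text above, and treating a single successful probe as full recovery is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PassiveNodeState:
    node_connection_attempts: int = 3   # consecutive failures before the node is marked failed
    node_fail_time: float = 60.0        # seconds to wait before probing a failed node
    consecutive_failures: int = 0
    failed_at: Optional[float] = None   # time of the failure that last marked the node failed

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of one live request sent to the node."""
        if ok:
            # Assumption: one success clears the failed state.
            self.consecutive_failures = 0
            self.failed_at = None
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.node_connection_attempts:
            # A failed probe restarts the hold-off period.
            self.failed_at = now

    def may_send(self, now: float) -> bool:
        """True if the node is healthy or is due for a speculative probe."""
        if self.failed_at is None:
            return True
        return now - self.failed_at >= self.node_fail_time


state = PassiveNodeState()
for t in (0.0, 1.0, 2.0):
    state.record(ok=False, now=t)          # three failures in a row
blocked = state.may_send(now=30.0)         # still inside node_fail_time
probe_allowed = state.may_send(now=62.0)   # hold-off has elapsed
state.record(ok=True, now=62.0)            # the probe succeeded
recovered = state.may_send(now=63.0)
```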

    When a node recovers, the perceptive load balancing algorithm gradually ramps up traffic to the recovered node (see Tech Tip: Perceptive Load Balancing). The 'Fastest Response Time' algorithm behaves similarly; other load balancing algorithms immediately load up the node with new requests.

    Summary

    Passive Health Monitoring is very easy to configure. However, it performs only basic tests of server health, so it cannot detect application-level failures other than '503' HTTP responses. Furthermore, Passive Health Monitoring detects failures only when Stingray attempts to forward a request to a back-end node, so some clients may experience a slow or broken response while Stingray verifies whether the node has failed.

    Active Health Monitoring

    Stingray may be configured with explicit health monitors. These health monitors perform periodic tests and verify the correct operation of each node. By using health monitors, the traffic manager can detect failures even when no traffic is being handled.

    Each health monitor runs its test against every node in the pool at a configured interval (the interval is defined per health monitor). If a node fails the test sufficiently often, it is marked as unavailable and the traffic manager stops sending traffic to it.
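
As a rough illustration of "fails sufficiently often", the sketch below replays a sequence of probe results and reports whether the node would be in rotation after each one. It assumes that the threshold counts consecutive failed probes and that one successful probe restores the node; both are assumptions made for the example, and monitor_states is a hypothetical name.

```python
from typing import Iterable, List


def monitor_states(results: Iterable[bool], failures: int = 3) -> List[bool]:
    """Return node availability after each probe result."""
    states = []
    run = 0  # length of the current run of failed probes
    for ok in results:
        run = 0 if ok else run + 1
        states.append(run < failures)
    return states


probe_results = [True, False, False, False, False, True]
availability = monitor_states(probe_results, failures=3)
```

With these inputs the node stays in rotation through the first two failed probes, is withdrawn on the third, and returns once a probe succeeds.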

    The monitors are held in the Monitors Catalog and can be applied to any pool. Each performs a specific test, ranging from simple operations, such as pinging each node, to more sophisticated tests which check that the appropriate port is open and the node is able to serve specified data, such as your home page. You may also create custom health monitors to perform sophisticated tests against your servers.
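
To make the two kinds of test concrete, here is a minimal sketch of a port check and a page-content check written with the Python standard library. It is not the traffic manager's monitor API; tcp_port_open and http_check are hypothetical helpers, and the small local web server exists only so that the example can run on its own.

```python
import http.client
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Simple test: can a TCP connection be established?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_check(host: str, port: int, path: str = "/",
               expect: str = "", timeout: float = 2.0) -> bool:
    """Application-level test: 200 response whose body contains `expect`."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read().decode("utf-8", errors="replace")
        return resp.status == 200 and expect in body
    except (OSError, http.client.HTTPException):
        return False
    finally:
        conn.close()


class _Home(BaseHTTPRequestHandler):
    """Stand-in back-end node that serves a fixed home page."""

    def do_GET(self):
        body = b"<h1>Welcome home</h1>"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


server = HTTPServer(("127.0.0.1", 0), _Home)  # port 0: let the OS pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

port_ok = tcp_port_open("127.0.0.1", port)
page_ok = http_check("127.0.0.1", port, "/", expect="Welcome home")
wrong_text = http_check("127.0.0.1", port, "/", expect="no such text")

server.shutdown()
server.server_close()
```

The content check fails when the expected text is missing even though the port is open, which is the kind of application-level failure a simple connection test cannot see.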

    Summary

    Active Health Monitors are very effective at determining the correct operation of each back-end server and are able to detect a wide range of network and application failures.

    Logging and debugging health monitoring

    Whenever a node fails or recovers, a log message is written to Stingray’s event log, and an event is raised.

    The Virtual Server setting log!server_connection_failures may be used to log request failures against back-end servers and will help to track how passive health monitoring is functioning.

    Each active Health Monitor has a verbose configuration setting that may be used to log the result of each test that is conducted.

    Note that when a node fails, it may take several seconds for a health monitor to detect that it has failed (a combination of the delay, timeout and failures settings). Passive Health Monitoring will not detect the failure until Stingray has attempted to send sufficient live traffic to the node.
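
A back-of-the-envelope estimate of the active detection time can be written down under a simple timing model. The model is an assumption, not documented behaviour: probes start every delay seconds, a failed probe takes at most timeout seconds to report, and the node is marked failed after failures consecutive failed probes. The values used are illustrative, not product defaults.

```python
from typing import Tuple


def detection_window(delay: float, timeout: float, failures: int) -> Tuple[float, float]:
    """Return (best, worst) seconds between a node failing and being marked failed."""
    # Best case: the node dies just before a probe and every probe fails instantly.
    best = (failures - 1) * delay
    # Worst case: the node dies just after a probe started, and the last
    # failed probe takes the full timeout to report.
    worst = failures * delay + timeout
    return best, worst


best, worst = detection_window(delay=5.0, timeout=3.0, failures=3)
```

Under these assumptions, a monitor with a 5-second delay, a 3-second timeout and a threshold of 3 failures would notice a dead node somewhere between 10 and 18 seconds after it failed.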

    Find out more

    Please refer to the Health Monitoring chapter in the User Manual (Stingray Product Documentation) for a more detailed description of the various methods used to measure server health.