Connection mirroring and failover with Stingray Traffic Manager

Posted on 03-27-2013 12:46 AM

Stingray Traffic Manager does not provide a ‘connection mirroring’ or ‘transparent failover’ capability.  This article describes contemporary connection mirroring techniques and their strengths and limitations, and explains how Stingray Traffic Manager may be used with VMware Fault Tolerance to create an effective solution that preserves all connections in the event of a hardware failure, while processing them fully at layer 7.

What is connection mirroring?

A fault tolerant load balancer cluster eliminates single points of failure:  When load balancers are deployed in a fault tolerant cluster, they present a reliable endpoint for the services they manage.  If one load balancer device fails, its peers are able to step in and accept traffic so the service can continue to operate.

…but a failover event will drop all established TCP connections: However, if a load balancer fails, any TCP connections that are established to that load balancer will be dropped.  Clients will either receive a RST or FIN close message, or they may just experience a timeout.  The clients will need to re-establish the TCP connection.  This is an inconvenience for long-lived protocols that do not support automatic reconnects, such as FTP.

Connection Mirroring offers a solution: If the load balancers are operating in a basic layer-4 packet forwarding mode, the only actions they perform are to NAT the packets to the correct target node and to apply sequence number translation.  They can share this connection information with their peer.  If a load balancer fails, the TCP client will retransmit its packets after an appropriate timeout.  The packets will be received by the peer, which can then apply the correct NAT and sequence number operations.
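The idea can be illustrated with a toy model. The sketch below is not any vendor's implementation: the class names, the addresses and the single `seq_offset` field are assumptions made for illustration, and the flow entry is copied to the peer synchronously, whereas real devices share state asynchronously.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (ip address, port)


@dataclass(frozen=True)
class FlowState:
    """What a layer-4 balancer must remember about one TCP connection."""
    client: Endpoint
    node: Endpoint       # back-end node chosen for this connection
    seq_offset: int      # added to client sequence numbers before forwarding


class L4Balancer:
    """Hypothetical layer-4 balancer holding a table of mirrored flows."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.flows: Dict[Endpoint, FlowState] = {}
        self.peer: Optional["L4Balancer"] = None

    def open_flow(self, client: Endpoint, node: Endpoint, seq_offset: int) -> FlowState:
        state = FlowState(client, node, seq_offset)
        self.flows[client] = state
        if self.peer is not None:
            # Mirroring: copy the flow entry into the peer's table.
            self.peer.flows[client] = state
        return state

    def forward(self, client: Endpoint, seq: int):
        """Return (node, translated sequence number), or None for an unknown flow."""
        state = self.flows.get(client)
        if state is None:
            return None  # no mirrored state: the connection cannot be resumed
        return state.node, (seq + state.seq_offset) % 2**32


primary, standby = L4Balancer("primary"), L4Balancer("standby")
primary.peer = standby
primary.open_flow(("198.51.100.7", 40000), ("10.0.0.11", 80), seq_offset=1000)

# The primary fails; the client's retransmission reaches the standby instead.
print(standby.forward(("198.51.100.7", 40000), seq=5))  # (('10.0.0.11', 80), 1005)
```

Because the standby holds the same flow entry, it can apply the same NAT target and sequence number offset to the retransmitted packet. A flow that was never mirrored returns `None`, which corresponds to a dropped connection.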

When is it appropriate to use connection mirroring?

Connection mirroring is best used when only very basic packet-based load balancing is in use.  For example, F5 recommend that you "enable connection mirroring on Performance (Layer 4) virtual servers only" and comment "mirroring short-term connections such as HTTP and UDP is not recommended, as this will cause a decrease in system performance... is typically not necessary, as those protocols allow for failure of individual requests without loss of the entire sessi....

Cisco also support layer 4 connection mirroring (referring to it as ‘Stateful Failover’) and note that it is only possible for layer 4 connections.  When using a Cisco ACE device, it is not possible to fail over connections that are proxied, including connections that employ SSL decryption or HTTP compression.

Layer 7 connection mirroring imposes a significant network and CPU overhead

Layer 7 connection mirroring puts a very high load on the dedicated heartbeat link (all incoming packets are replicated to the standby peer) and is CPU intensive (both traffic managers must process the same transactions at layer 7). It may add latency or interfere with normal system operation, and not all ADC features are supported in a mirrored configuration.  Because of these limitations, F5 advise "the overhead incurred by mirroring HTTP connections is excessive given the minimal advantages".

Does connection mirroring guarantee seamless failover?

Due to timing and implementation details, connection mirroring does not guarantee seamless failover.  State data must be shared to the peer once the TCP connection is established, and this must be done asynchronously to avoid delaying every TCP connection.  If a load balancer fails before it has shared the state information, the TCP session cannot be resumed.

Typical duration of a TCP transaction (not including lingering keepalives): 500 ms
Typical window before which state information is synchronized (implementation dependent): 200 ms (state exchanged 5 times per second)
On failure, percentage of connections that cannot be re-established: 200/500 = 40%

Connection mirroring does not guarantee seamless failover because connections must proceed while state is being shared
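The 40% figure follows from a simple ratio of the synchronization window to the transaction duration. The sketch below reproduces that arithmetic under the same simplified model; the function name is hypothetical, and the 50 ms and 500 ms windows are extra illustrative inputs rather than figures from any vendor.

```python
def unrecoverable_fraction(sync_window_ms: float, transaction_ms: float) -> float:
    """Share of in-flight transactions whose state has not yet reached the peer.

    Simplified model: a transaction is lost if the failure falls inside the
    window before its state has been synchronized.
    """
    if transaction_ms <= 0:
        raise ValueError("transaction_ms must be positive")
    return min(1.0, sync_window_ms / transaction_ms)


# 200 ms is the window used in the text; 50 ms and 500 ms are hypothetical.
for window_ms in (50, 200, 500):
    share = unrecoverable_fraction(window_ms, 500)
    print(f"sync window {window_ms} ms -> {share:.0%} not recoverable")
```

Under this model, shortening the synchronization window reduces the share of lost connections, but at the price of more frequent state exchange between the peers.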

What is the effect of connection mirroring on uptime?

Connection mirroring carries a cost: increased internal traffic for state sharing, and severe limitations on the functionality that may be used at the load balancing tier.  What effect does it have on a service’s uptime?

Typical duration of a TCP transaction (not including lingering keepalives): 500 ms
Typical number of individual load balancer failures in a 12 month period: 5
Percentage of transactions that would be dropped if a load balancer failed: 50% (assuming an active-active pair of load balancers)
Percentage of transactions that would be recovered on a failure: 60% (analysis above: 40% would not be recovered)
What is the probability that an individual connection would be impacted by a load balancer failure? 500/(365.5*24*3600*1000) * 50% * 5 = 0.000000040
What is the probability that a connection could be ‘rescued’ with connection mirroring? 60% = 0.6
What proportion of transactions would be impacted by a failure, and then recovered by connection mirroring? 0.000000040 * 0.6 = 0.000000024 (i.e. 0.0000024%)

Connection mirroring improves uptime by an infinitesimal amount
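The estimate can be checked with a few lines of arithmetic. The sketch below uses the figures from the analysis and assumes a 365-day year, which gives the same result to two significant figures; the function and variable names are illustrative only.

```python
MS_PER_YEAR = 365 * 24 * 3600 * 1000  # assumption: 365-day year


def impacted_probability(transaction_ms: float,
                         failures_per_year: float,
                         share_on_failed_device: float) -> float:
    """Chance that a given transaction is in flight on a device when it fails."""
    return transaction_ms / MS_PER_YEAR * share_on_failed_device * failures_per_year


p_impacted = impacted_probability(500, 5, 0.5)  # 500 ms, 5 failures, 50% share
p_rescued = p_impacted * 0.6                    # 60% recoverable by mirroring

print(f"impacted by a failure:  {p_impacted:.1e}")  # 4.0e-08
print(f"rescued by mirroring:   {p_rescued:.1e}")   # 2.4e-08
print(f"as a percentage:        {p_rescued * 100:.7f}%")
```

The result is dominated by the ratio of transaction duration to the length of a year, so even a much higher failure rate leaves the gain very small for short-lived transactions.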

General advice

Consider using connection mirroring when:

  • Operating in L2-4 NAT load balancing modes
  • Performing NAT load balancing with no content inspection (no delayed binding)
  • No content processing e.g. SSL, compression, caching, cookie injection is required
  • Base protocol does not support automatic reconnects – e.g. FTP
  • Connections are long-lived and a dropped connection would inconvenience the user, e.g. SSH
  • Your load balancer is unreliable and failures are sufficiently frequent that the overhead of mirroring is worthwhile
  • You are running a fault-tolerant pair of load balancers

Don’t use connection mirroring when:

  • Operating in full proxy modes
  • Performing NAT or full proxy load balancing with content inspection
  • Compressing content, SSL decrypting, caching, session persistence methods that inject cookies, application firewall
  • Base protocol supports reconnects – e.g. RDP
  • Connections are short-lived and easily re-established, e.g. HTTP
  • Your load balancers are reliable and you can accommodate instantaneous loss of connections in the event that one does fail
  • You plan to run a cluster of three or more load balancers (this configuration is not supported by the major vendors who offer connection mirroring)

Benefits of using Connection Mirroring

  • Improves uptime by 0.0000024% (typical) (2.4 millionths of a percent)

Costs of using Connection Mirroring

  • Limits traffic inspection or manipulation in load balancer
  • Increases internal traffic and increases load on load balancer

Balance the benefits of connection mirroring against the additional risk and complexity of enabling it and the potential loss in performance and functionality that will result.  Be aware that, based on the preceding analysis, unless your goal is to achieve more than seven nines of uptime (99.99999%), connection mirroring will not measurably contribute to the reliability of your service.
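To see why the threshold sits at seven nines, compare the gain from mirroring with the failure budget that each uptime target allows. This is a rough sketch: the helper name is hypothetical and the gain is the 0.000000024 figure from the analysis above.

```python
def allowed_failure_fraction(nines: int) -> float:
    """Failure budget for an uptime target, e.g. 7 nines = 99.99999% -> 1e-7."""
    return 10.0 ** -nines


mirroring_gain = 2.4e-8  # proportion of transactions rescued, from the analysis

for nines in (5, 7, 8):
    budget = allowed_failure_fraction(nines)
    share = mirroring_gain / budget
    print(f"{nines} nines: gain is {share:.2%} of the failure budget")
```

At five nines the gain is a fraction of a percent of the failure budget. It only becomes comparable to the budget once the target exceeds seven nines.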

When connections are too valuable to lose…

Stingray customers include emergency and first-response services around the world, NGO services publishing disaster-response information and even major political fund-raising concerns.  In each case, extremely high availability and consistent performance in the face of large spikes of traffic are paramount to the organizations who selected Stingray.

A number of customers use VMware Fault Tolerance with Stingray Traffic Manager to achieve enhanced uptime without compromising any of the functionality that Stingray offers.  VMware Fault Tolerance maintains a perfect shadow of a running virtual machine, running on a separate host.  If the primary virtual machine fails due to a catastrophic hardware failure, the shadow seamlessly takes over all traffic, including established connections, with a typical latency of less than 1 ms. All application-level workloads, such as SSL decryption, TrafficScript processing and authentication, are maintained without any interruption in service:

lockstep1.jpg

VMware Fault Tolerance runs a secondary virtual machine in ‘lock step’ with the primary. Network traffic and other non-deterministic events are replicated to the secondary, ensuring that it maintains an identical execution state to the primary.

If the primary fails, the secondary takes over seamlessly and a new secondary is started.
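The lock-step principle can be shown with a toy example: two deterministic replicas that receive the same sequence of external events end up in the same state. This is only an illustration of the principle and not how VMware Fault Tolerance is implemented; the `Replica` class and the event strings are invented for the sketch.

```python
import hashlib


class Replica:
    """Toy deterministic machine: its state depends only on the events applied."""

    def __init__(self) -> None:
        self._state = hashlib.sha256()
        self.events_applied = 0

    def apply(self, event: bytes) -> None:
        self._state.update(event)
        self.events_applied += 1

    def digest(self) -> str:
        return self._state.hexdigest()


primary, secondary = Replica(), Replica()

# Hypothetical non-deterministic inputs (packets, timer ticks) seen by the primary.
events = [b"SYN from client", b"timer tick", b"GET /index.html"]

for event in events:
    primary.apply(event)
    secondary.apply(event)  # the same event is replicated to the secondary

print(primary.digest() == secondary.digest())  # True
```

Because the secondary's state matches the primary's at every step, it holds everything needed to carry on with established connections, including any layer 7 processing state.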

Such configurations leverage standard VMware technology and are fully supported by Riverbed.  They have been proven in production and offer enhanced connection mirroring functionality compared to proprietary ADC solutions.