
Managing consistent caches across a Stingray Cluster


Stingray's content caching capability (described in the Feature Brief: Stingray Content Caching) allows the traffic manager to operate as a high-performance proxy cache, with full control using TrafficScript.

When Stingray is deployed in a cluster, each Stingray device manages its own local cache.  When an item expires from one Stingray's cache, that Stingray retrieves the item from the origin server the next time it is requested.

ccache1.png

Each Stingray manages its own cache independently

In the majority of situations, this deployment pattern is entirely appropriate.  In some specialized situations, it is preferable to have 100% consistency between the caches in the traffic managers and to absolutely minimize the load on the origin servers.  This article describes the recommended deployment configuration, meeting the following technical goals:

  • Fault Tolerance - Failure: the pair of Stingray Traffic Managers operates in a fully fault-tolerant fashion.  Cache consistency between peers ensures that in the event of a planned or unplanned failure of one of the peers in the cluster, the remaining peer has a complete and up-to-date cache of content and the load on the origin servers does not increase.
  • Fault Tolerance - Recovery: when the failed Stingray traffic manager recovers, it is automatically reintroduced into the cluster.  Its cache is automatically repopulated over a period of time (corresponding to the cache refresh period) without increasing the load on the origin servers.
  • Load Minimization: the configuration minimizes the load on the origin servers.  In particular, one peer member will never request content from the origin server when the other peer has content that is still current.

Stingray’s cache flood protection is used in this configuration to manage the effect of multiple simultaneous requests for the same content once it is due for a refresh (as defined by the cache refresh time).  If multiple requests arrive at the same instant, at most one request per second is forwarded to the origin servers; the remaining requests continue to be served from the cached copy.
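
As a rough illustration of this behaviour, a burst of simultaneous requests for a single cacheable resource (the URL below is a placeholder) should result in at most one onward request to the origin servers in that second, with the remaining requests answered from the cache:

bash$ for i in $(seq 1 50) ; do wget -q -O /dev/null http://host/cacheablecontent.gif & done ; wait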

Operational Benefits

The configuration delivers the following operational benefits:

  • Resilience: The service is fully resilient in the event of planned or unplanned failure of one traffic manager, or one or more origin servers.
  • Optimal end user performance: the load on the origin servers is minimized in accordance with the fully user-configurable cache policy, so the origin servers are very unlikely to be overloaded and end user performance will not suffer.

... letting you deliver reliable services with predictable performance, whilst minimizing infrastructure costs and eliminating the overspend needed to accommodate occasional spikes in traffic or visitor numbers.

Solution Architecture

Stingray Traffic Managers are deployed in a fault-tolerant pair, and external traffic is directed to a public-facing IP address (“Traffic IP address”).  Both traffic managers are active, and the one that owns the traffic IP handles incoming traffic:

ccache2.png

Failover Scenarios

During normal operation, the primary traffic manager receives traffic and responds directly from its local cache.  If a cache miss occurs (i.e. when the cache is being primed, or when content has expired or needs refreshing), the traffic manager first checks with its secondary peer and retrieves the secondary’s cached version if that is still valid (not expired or in need of a refresh).  It caches the copy retrieved from the secondary according to the local cache policy.

If the secondary does not have valid content, then the resource will be retrieved from the origin servers (load balancing as appropriate) and cached by both traffic managers.

If the secondary traffic manager fails, the primary will continue to respond directly from its local cache whenever possible.  If a cache miss occurs or a refresh is needed, content is retrieved from the origin servers and cached in the primary traffic manager’s cache.

When the secondary traffic manager recovers, it may have an out-of-date cache (in the event of a network failure), or the cache may be completely empty (in the event of a software restart).  This cache is fully repopulated with the working set of documents within the configured refresh time.  There is no risk of the traffic manager serving stale content, and the load on the origin server is not increased during the repopulation.

If the primary traffic manager fails, the secondary will take over with a fully primed cache and respond to all traffic.  If a cache miss occurs or a refresh is needed, the secondary will retrieve the content from the origin servers and cache it locally.

When the primary traffic manager recovers, its cache may be out-of-date or completely empty (as above).  The primary takes back responsibility for user traffic and will update its cache rapidly by retrieving fresh content from the secondary traffic manager on every cache miss or refresh.  There is no risk of the primary traffic manager returning stale content, and the load on the origin servers is not increased during the repopulation.

Stingray Configuration

The configuration instructions are based on the following context:

  • Two traffic managers, named ‘stingray-1’ and ‘stingray-2’ are configured in a fault tolerant cluster.
  • A single-hosted traffic IP group with one IP address (‘traffic IP’) is configured to receive incoming traffic, and ‘stingray-2’ is designated a passive member of that group.  External DNS is configured to resolve to the traffic IP address.  This ensures that in normal operation, stingray-1 will be the active traffic manager with respect to handling user traffic.
  • Content is stored on three webservers (webserver1, webserver2, webserver3) and the intent is that requests are load-balanced across these.
  • The desired cache policy is to cache HTTP content for a long period of time (99915 seconds), but refresh content with one origin request every 15 seconds.

Configure the pools

Configure two pools as follows:

  Name: Origin Servers, pool members: webserver1:80, webserver2:80, webserver3:80

  Name: Cache Servers, pool members: stingray-2:80, webserver1:80, webserver2:80, webserver3:80

Configure both pools to use an appropriate load balancing algorithm.  Additionally, configure priority lists for the Cache Servers pool as follows:

plists.png

Cache Servers pool: priority list configuration

This configuration ensures that when the Cache Servers pool is selected, all traffic is sent to stingray-2 if it is available, or is load-balanced across the three origin webservers if not.

Note: if any changes are made to the origin servers (nodes are added or removed), both the Origin Servers and Cache Servers pools must be updated.

Configure the Virtual Server

Create an HTTP virtual server named Cache Service, and set the default pool to Origin Servers. 

Note: This virtual server should generally listen on all IP addresses.  If it’s necessary to restrict the IP addresses it listens on, it should listen on the public traffic IP and on the IP address that the hostname ‘stingray-2’ resolves to (the first node in the Cache Servers pool).

Configure caching in the virtual server to cache content for the desired period (e.g. 99915 seconds), and to refresh it (with a maximum of one request per second) once it has been cached for 15 seconds, as follows:

cachesettings.png

Caching settings for the Cache Service virtual server

Add a request rule named Synchronize Cache to the virtual server to select the Cache Servers pool (overriding the default Origin Servers pool).

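A minimal TrafficScript sketch of such a rule follows; it assumes the hostnames used in this example, so treat it as an illustration to adapt rather than a definitive rule:

# Synchronize Cache: on the primary, prefer the secondary's cache.
# The Cache Servers pool's priority list falls back to the origin
# servers if stingray-2 is unavailable.
if( sys.hostname() == "stingray-1" ) {
   pool.use( "Cache Servers" );
}
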
This rule selects the Cache Servers pool when traffic is received on the primary traffic manager (‘stingray-1’).

Note: you will need to update this rule if the hostname for the primary traffic manager is not stingray-1.

This completes the configuration for the Cache Service virtual server and related pools and rule:

ccache3.png

Testing the caching service

Testing this configuration must be done with due care, because the presence of multiple caches can make debugging difficult.  The following techniques are useful:

Sample Traffic

Test with a small sample set of content to verify that the cache policies function as desired.  For example, the following command will repeatedly request a single cacheable resource:


bash$ while sleep 1 ; do wget http://host/cacheablecontent.gif ; done


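To check whether an individual response was served from a cache rather than the origin, inspecting the response headers can also help; if the cache adds an Age header to cached responses (as HTTP caches commonly do; verify this against your version), a non-zero value indicates a cached copy:

bash$ wget -S -O /dev/null http://host/cacheablecontent.gif
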
Activity Monitor

Use the activity monitor to chart connections made to the origin servers:

amon.png

Note that the activity monitor charts generally merge data from all traffic managers in a cluster; for clarity, this chart plots traffic individually per traffic manager.  Observe that despite the front-end load of 1 request per second (not plotted), only one request every 15 seconds is sent to the origin server to refresh the cache.  All requests originate from stingray-2.

If stingray-2 is suspended or shut down, an activity monitor report run from stingray-1 will verify that there is no increase in origin server traffic; the same applies if stingray-1 fails and the activity monitor chart is run from stingray-2.

Connections report

The connections report will assist in identifying where traffic is routed.

In the following report, requests for the same 866-byte resource were issued once a second; they were received by stingray-1 and served from its local cache (‘To’ is ‘none’).

At 10:47:25, a cache refresh event occurred.  Stingray-1 forwarded the request to stingray-2 (192.168.35.11).  Stingray-2’s cache also required a refresh (because the caches are synchronized) so stingray-2 requested the content from one of the origin servers (192.168.207.103).

conns.png

Conclusion

The synchronization solution effectively meets the goals of reliability, performance and minimization of load on the origin servers.  It may be extended from an active-active pair of Stingray traffic managers to a larger cluster if required, which will increase the level of redundancy in the system, but at the expense of a small (probably insignificant) increase in latency as caches must be synchronized across a larger set of devices.

Priming the cache

It is generally not a good idea to pre-prime a cache, because doing so places a large one-time load on the origin servers.  In the majority of situations, it is better to use this synchronization solution and allow the caches to fill on demand, in response to end user traffic.

If it is necessary to pre-prime the cache, this can be done by using synthetic transactions to submit requests through the Stingray cluster.  For example, ‘wget -r’ was used with success during testing of this solution.
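
A hedged sketch of that approach, using wget to crawl the site through the traffic IP (the URL and recursion depth below are placeholders to adjust for the site in question):

bash$ wget -r -l 3 -p --delete-after http://host/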