Application Delivery (ADX)

Health Checking 101

Posted on 06-04-2009 08:25 AM

Summary

Health checking is an important feature of an Application Delivery Controller (ADC). When an ADC does load balancing, it distributes traffic to the nodes in a pool of servers (real servers). It makes little sense to forward traffic to a real server that is not healthy.

 

 

It is therefore the ADC's (ServerIron / ADX) responsibility to ensure the chosen backend resource is available at the time of the load balancing decision. The ServerIron sends regular health checks to the backend real servers to determine whether they are healthy.

 

Specifics

 

We will look at various types of health checks and various health check configuration items, with the pros and cons of each.

 

Sample Script/Code/Configuration and Explanations

 

The ServerIron is able to do Layer 4 and Layer 7 health checks. By default, the choice between an L4 and an L7 health check is based on the TCP/UDP port the health check goes to. The ServerIron uses a Layer 7 health check for well-known ports like HTTP (80), SSL (443), LDAP (389), FTP (21), DNS (53) and many more. Have a look at the documentation for a list of well-known ports.

Unknown ports are checked using a Layer 4 health check by default. Health checks are done in multiple steps:

 

  1. ARP (aka Layer 2 health check) - This occurs only for locally connected real servers and not for so-called remote servers. The first ARP request is sent out at the time the real server is created.

  2. ICMP (aka Layer 3 health check) - This occurs for all real servers (local and remote real servers). The first ICMP echo request is sent out at the time the real server is created.

  3. Layer 4 - This occurs for all services on each real server. It happens at the time a service port is bound to a virtual server using the bind command.

  4. Layer 7 - This occurs by default for all well-known TCP/UDP ports but not for unknown ports. The first Layer 7 health check is done once the service has been bound to a virtual server and the Layer 4 health check has already succeeded.
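The four steps above can be summarised in a small Python sketch. It is only an illustration of the ordering described in this article: the function name `planned_checks` and the port table are made up for the example, and the table holds only the ports named above, not the full list from the documentation.

```python
# Ports this article names as Layer 7 checked by default.
# The complete list lives in the vendor documentation.
WELL_KNOWN_L7_PORTS = {80: "HTTP", 443: "SSL", 389: "LDAP", 21: "FTP", 53: "DNS"}


def planned_checks(port, is_local):
    """Return the health check layers, in order, for one real server port."""
    stages = []
    if is_local:
        stages.append("ARP")   # Layer 2, only for locally connected real servers
    stages.append("ICMP")      # Layer 3, local and remote real servers
    stages.append("L4")        # TCP/UDP check for every bound service port
    if port in WELL_KNOWN_L7_PORTS:
        stages.append("L7")    # by default only for well-known ports
    return stages
```

For a local real server with port 80 bound this yields all four layers, while an unknown port such as 1234 on a remote server only gets ICMP and Layer 4.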

The ServerIron offers a lot of settings to influence the default health checking behavior.

 

Layer 4 vs Layer 7 Health Checks

 

A layer 4 health check might be TCP or UDP:

 

TCP: the health check status of the TCP port being checked is based on a 3-way handshake to the port. The ServerIron sends a TCP SYN packet to the port and expects a SYN-ACK back from the port. As soon as the response comes back, the ServerIron resets the connection and declares the port as available.
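A comparable check can be approximated from any host with Python's standard `socket` module. This is a rough sketch and not the ServerIron implementation: the operating system completes the full handshake before the script gets control, and the function name `tcp_port_is_up` is invented for the example.

```python
import socket
import struct


def tcp_port_is_up(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Refused, unreachable or timed out: treat the port as down.
        return False
    # SO_LINGER with a zero linger time makes close() send a RST instead of
    # a FIN, similar to the reset described above.
    # Assumption: "ii" is the struct linger layout on Linux and most Unix systems.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()
    return True
```

Like the Layer 4 check itself, this only proves that something accepts connections on the port. It says nothing about whether the application behind it works.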

 

UDP: the health check status of the UDP port being checked is based on the reaction to a meaningless (garbage) UDP packet which the ServerIron sends to the port of the real server. The port is declared as up if there is no reply to this packet, because the application daemon running on the real server should simply discard the meaningless packet. The port is declared as failed (down) if the real server sends an "ICMP destination unreachable" message back to the ServerIron, because this means there is no daemon listening on the port.
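The "silence means up" logic can be reproduced with a connected UDP socket in Python. This is a sketch under assumptions: it relies on the operating system reporting the ICMP error as `ConnectionRefusedError` (as Linux does for connected UDP sockets), the payload is arbitrary, and treating an actual reply as "up" is my own choice since the article does not cover that case.

```python
import socket


def udp_port_is_up(host, port, timeout=1.0):
    """Send a meaningless datagram and interpret the reaction."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        # connect() on a UDP socket lets the OS hand ICMP errors back to us.
        sock.connect((host, port))
        sock.send(b"\x00garbage\x00")
        sock.recv(1024)
        return True    # assumption: any reply means a daemon is listening
    except socket.timeout:
        return True    # no reply: the daemon discarded the packet
    except ConnectionRefusedError:
        return False   # ICMP port unreachable: nothing is listening
    finally:
        sock.close()
```

The obvious weakness is the same as on the ServerIron: a firewall that silently drops the packet or the ICMP message makes a dead port look healthy.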

 

Layer 4 health checks do not have any application intelligence. L4 health checks do work at the UDP/TCP level without any knowledge of the application running via the TCP/UDP port.

 

Layer 7 health checks do look at the application level. There are various Layer 7 health checks for a lot of applications. Some examples are (please have a look at the documentation for a complete list):

 

HTTP health check:

 

With the HTTP health check, the ServerIron first establishes a connection to the real server's HTTP daemon and then sends an HTTP request to the web server. The web server needs to respond with an HTTP reply carrying a status code that indicates success (status code 200-299) for the health check to succeed. No reply OR a status code indicating a problem brings the web server down from the ServerIron's point of view.

 

FTP health check:

 

With the FTP health check, the ServerIron first establishes a connection to the real server's FTP daemon and then waits for a greeting message from the FTP server - the FTP service is declared as up if the ServerIron gets a greeting message with a 220 status code back. Anything else causes the FTP service to be declared as failed.
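The greeting test itself is simple to express in Python. The helper name `ftp_greeting_ok` is hypothetical, and the sketch only looks at the first three characters of the first line, which is an assumption about how strictly the greeting is parsed.

```python
def ftp_greeting_ok(greeting):
    """Return True if the raw FTP greeting (bytes) starts with reply code 220."""
    first_line = greeting.split(b"\r\n", 1)[0]
    return first_line[:3] == b"220"
```

A server answering "421 Service not available" or sending nothing at all would fail this test.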

 

LDAP health check:

 

With the LDAP health check, the ServerIron first establishes a connection to the real server's LDAP daemon and then sends an LDAP bind request to the LDAP service. The LDAP bind request uses LDAP version 3 by default. The ServerIron waits for a reply to the bind request - a reply with the return code 0 (no error) indicates a successful bind and the LDAP service is declared as UP. A non-successful bind is a non-successful health check and the LDAP service is declared as DOWN.

 

NOTE:

 

The same real server might be used behind multiple virtual servers and the same real server might offer multiple services. EACH service port is getting health checked independently. The ServerIron is therefore able to detect a problem with the HTTP daemon of the real server BUT other services like FTP, LDAP and DNS are still working. The ServerIron would declare the HTTP service as down but it would continue to load balance traffic to the other services which are still available because their health checks are still successful.

 

Keepalive (regular health checks)

 

The ServerIron is not going to send regular Layer 4 or Layer 7 health checks to the application ports by default. A real server's application/protocol service is getting declared as available/up at the time the first Layer 4/Layer 7 health check gets completed successfully. ARPs and ICMP request to the real server will continue but L4 and L7 checks do stop at this time. Regular health checks are called "keepalive".

 

It is possible to enable regular health checks on a per service basis. You are able to enable regular checks for a given port globally or just for a given real server. You have to configure a port profile to enable keepalive globally for a given port. The commands to enable regular health checks for the HTTP port (port 80) globally would be:

 

server port 80
  tcp

This is a so called port profile for port 80. It is of course possible to enable keepalive health checks for some real servers only and only for some of their services. This is part of the real server configuration:

 

server real name <ip>
  port http keepalive

How frequently is a server getting health checked?

 

It is necessary to enable regular Layer 4 and/or Layer 7 health checks as mentioned above. The default interval between two health checks is 5 seconds, and the number of times the ServerIron retries the health check before the server is declared as down is 2 by default (the so-called retries setting). Both values can be changed using the port profile:

 

server port 80
  tcp keepalive 20 5

This would tell the ServerIron to change the interval between two health checks to 20 seconds and to wait for 5 consecutive unsuccessful health checks before declaring the real server as down.
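To get a feeling for what interval and retries mean for detection time, here is a small Python sketch. It assumes, as described above, that a port is declared down after `retries` consecutive failed checks, and it measures time from the first check in the list. The function name `first_down_time` is invented for the example.

```python
def first_down_time(results, interval=5, retries=2):
    """Return the second at which the port is declared down, or None.

    results is a sequence of booleans, one per keepalive check,
    with checks assumed to run at 0, interval, 2 * interval, ... seconds.
    """
    failures = 0
    for index, ok in enumerate(results):
        failures = 0 if ok else failures + 1   # a success resets the counter
        if failures >= retries:
            return index * interval
    return None
```

With `tcp keepalive 20 5`, a server that passes one check and then dies is declared down at the sixth check, 100 seconds after the last good one. Longer intervals reduce health check load but delay failure detection accordingly.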

 

Important Health Check Parameters

 

Some health checks do require special settings to work correctly. The document is not covering all of them but some - please have a look at the documentation to see how to configure additional parameters.

 

LDAP:

 

An LDAP health check is basically an LDAP bind. The default LDAP version is 3 but it is possible to change this to LDAP version 2 - this is a setting below the real server definition:

 

server real name <ip>
  port ldap 2

HTTP:

 

HTTP health checks are very common and there are a lot of settings/possibilities related to HTTP health checks. A very simple HTTP health check is added by default as soon as you add the http port to a real server. The basic check is looking like

 

port http url "HEAD /"

This sends an HTTP HEAD request for the URL / to the real server's HTTP port. It is an HTTP 1.0 request looking like:

 

HEAD / HTTP/1.0\r\n
\r\n

It is possible to change the method from HEAD to GET and it is possible to change the URL which is getting requested.
Some examples are:

 

port http url "HEAD /health.html"

-> HTTP HEAD request requesting "health.html"

 

port http url "GET /myfile.html"

-> HTTP GET request requesting "myfile.html"

port http url "GET /whatever.html HTTP/1.1\r\nHost: www.domain.com\r\n\r\n"

-> HTTP GET request requesting "whatever.html" using HTTP 1.1 (host header required). The request is going to look like:

 

GET /whatever.html HTTP/1.1\r\n
Host: www.domain.com\r\n
\r\n

Additional HTTP headers can be added in the same way as the Host header above.
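The relationship between the configured string and the bytes on the wire can be sketched in Python. This mirrors the examples in this article only: the helper `build_request` is hypothetical, and how the device really parses the quoted string is an assumption.

```python
def build_request(url_setting):
    """Turn the quoted value of 'port http url' into the bytes sent on the wire."""
    # The configuration contains the two-character sequences \r and \n.
    text = url_setting.replace("\\r", "\r").replace("\\n", "\n")
    if "\r\n" not in text:
        # Short form such as "HEAD /": the article shows an HTTP/1.0 request line.
        text = text + " HTTP/1.0\r\n\r\n"
    return text.encode("ascii")
```

Calling `build_request("HEAD /")` produces the two-line HTTP 1.0 request shown earlier, while a string that already carries its own version and headers is passed through unchanged apart from the escape sequences.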

 

All the HTTP health checks above will look at the HTTP return code coming back from the real server. The ServerIron is going to declare a server as up in case the reply is a 2xx return code (a return code between 200 and 299). It is possible to declare other return codes as a success as well using

 

port http status-code 200 299 403 403

This declares the range 403 to 403 as a success as well (in addition to the range 200 to 299).
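The arguments of the status-code command are read as low/high pairs. A short Python sketch of that interpretation (the function names are made up for the example):

```python
def parse_status_ranges(args):
    """Parse '200 299 403 403' into [(200, 299), (403, 403)]."""
    numbers = [int(token) for token in args.split()]
    if len(numbers) % 2:
        raise ValueError("status codes must come in low/high pairs")
    return list(zip(numbers[0::2], numbers[1::2]))


def status_is_success(code, ranges):
    """Return True if code falls inside any of the inclusive ranges."""
    return any(low <= code <= high for low, high in ranges)
```

With the ranges from the example above, 204 and 403 count as success while 404 and 500 do not.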

 

It is as well possible to do HTTP health checks with Content Verification instead of looking at the return codes only. HTTP match-lists are used to do this. The following is a simple example:

 

http match-list healthyck
  default down
  up simple healthy
 
server real rs203 192.168.9.203
port http
port http url "GET /mypage.html"
port http content-match healthyck 

The http match-list does the following:

 

1. the default status is DOWN
2. the status is UP in case the string "healthy" is part of the response

 

The match-list is finally bound to the real server port via "port http content-match healthyck". The real server's port 80 health check is now content based. The ServerIron sends a GET request for /mypage.html and the real server responds. The real server's service is declared as up (available) when the response contains the string "healthy". The real server's http service is declared as down for all other responses.
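The decision made by such a match-list can be modelled in a few lines of Python. This is a simplified model that only covers "simple" substring rules and a default state. The function name and the rule representation are assumptions for illustration.

```python
def evaluate_match_list(response, rules, default="down"):
    """Return the state of the first rule whose pattern occurs in the response.

    rules is a list of (state, pattern) tuples, for example [("up", "healthy")].
    """
    for state, pattern in rules:
        if pattern in response:
            return state
    return default
```

One thing this model makes visible: with a plain substring match, a page saying "unhealthy" also contains "healthy". Whether the device behaves the same way is not covered here, so it is safer to pick a marker string that cannot appear in an error page.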

 

RADIUS:

 

A RADIUS health check is an authentication request. A RADIUS health check is configurable using 3 parameters: username, password and key. A RADIUS health check is successful if the RADIUS daemon sends an ACCEPT or REJECT message back to the ServerIron. The parameters are configured as part of the real server configuration:

 

server real name <ip>
  port radius username <username>
  port radius password <password>
  port radius key <key>

There are other Layer 7 health checks available as well - please have a look at the documentation to get some more details about them. I would like to proceed with some more complex things here.

 

A very common problem as soon as people move to Layer 7 health checks is flapping real server ports - real server ports are declared as up and down and up and down and so on. This looks like the following in the log files:

 

SYSLOG: Mar  3 22:20:36:<13>L4 server 192.168.9.50 r50 port 80 is up
SYSLOG: Mar  3 22:20:50:<13>L4 server 192.168.9.50 r50 port 80 is down due to healthcheck
SYSLOG: Mar  3 22:20:56:<13>L4 server 192.168.9.50 r50 port 80 is up
SYSLOG: Mar  3 22:21:10:<13>L4 server 192.168.9.50 r50 port 80 is down due to healthcheck

The most common reason for this is the fact that the ServerIron is going to declare a real server port as up as soon as the layer 4 health check passes. This is the default behaviour. There is still a chance that the layer 7 health check is not successful after a successful layer 4 health check. A reason could be the fact that the health check page is not there anymore. The layer 4 health check is going to be OK because the daemon is up and running but the layer 7 health check is not successful due to the missing health check page. Health checks start from scratch again because a failing layer 7 health check declares the real server port as down again. This would result in L4 OK -> UP -> L7 NOT OK -> DOWN -> L4 OK -> UP and so on.

 

The suggestion here is to put the following line into the configuration:

 

server no-fast-bringup

This changes the default ServerIron behaviour. With this setting, a real server port needs to pass ALL health check layers before it is declared as up. This ensures that ports stay down until they are able to pass all health check levels.
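The difference between the two behaviours can be replayed with a toy state machine in Python. It is a simplified model of the sequence described above, not of the real device timing. The function name `port_states` and its parameters are invented for the example.

```python
def port_states(l4_ok, l7_ok, fast_bringup, cycles=3):
    """Return the sequence of states reported for a port over several check cycles."""
    states = []
    for _ in range(cycles):
        if not l4_ok:
            states.append("down")
            continue
        if fast_bringup:
            states.append("up")        # declared up right after Layer 4 passes
            if not l7_ok:
                states.append("down")  # Layer 7 then fails and takes it down again
        else:
            # no-fast-bringup: only up once every layer has passed
            states.append("up" if l7_ok else "down")
    return states
```

With a running daemon but a missing health check page, the default behaviour produces the up/down/up/down pattern from the log above, while the no-fast-bringup variant simply stays down.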

 

The no-fast-bringup setting is as well available on a per service/port basis. It is possible to put it into the port profile of specific ports instead of enabling it globally:

 

server port 80
  tcp keepalive 20 5
  no-fast-bringup 

Scripted Health Checks (Creating a custom ASCII/Binary string as health check over TCP/UDP)

 

All the stuff above is the simple health checking stuff - it is based on default health checks for well known ports and so on. It might be necessary to do more complex things from time to time. Examples are non-standard services/applications which are not based on HTTP or any well-known protocol.

 

Non-standard applications do require special ways to do health checks. Scripted health checks do offer a way to check unknown applications. A scripted health check is able to send something and to expect something back from the real server port. There are two sorts of scripted health checks in the TCP area: ASCII or BINARY.

 

ASCII based scripted health checks are able to send ASCII strings and to expect ASCII strings as answers. BINARY based scripted health checks are able to send BINARY data and to expect BINARY data back.

 

Example ASCII based scripted health check:

 

Step 1: ServerIron opens TCP connection to real server port
Step 2: ServerIron sends configured ASCII string
Step 3: ServerIron is going to wait for configured answer
Step 4: ServerIron closes TCP connection to real server port

 

The following example is going to send the string "do you feel ok" in step 2 and it is expecting "yep" as answer to declare the real server port as up:

 

server real rs-test 1.2.3.4
  port 1234 keepalive
  port 1234 content-check m1
  port 1234 content-check send "do you feel ok"
http match-list m1
  up simple yep
  default down

The line "port 1234 keepalive" ensures there are regular Layer 7 health checks.
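The four steps of the ASCII scripted check map directly onto a few socket calls. The following Python sketch can be handy for testing what a real server answers before configuring the check. It is not the ServerIron implementation, and the function name and the timeout default are assumptions.

```python
import socket


def scripted_tcp_check(host, port, send_text, expect_text, timeout=2.0):
    """Open a TCP connection, send a string and look for the expected answer."""
    expected = expect_text.encode("ascii")
    try:
        # Step 1: open the TCP connection
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # Step 2: send the configured ASCII string
            sock.sendall(send_text.encode("ascii"))
            # Step 3: wait for the configured answer
            received = b""
            while expected not in received:
                chunk = sock.recv(1024)
                if not chunk:
                    break              # peer closed without sending the answer
                received += chunk
            return expected in received
        # Step 4: leaving the with-block closes the connection
    except OSError:
        return False                   # refused, unreachable or timed out
```

For the example above the call would be `scripted_tcp_check("1.2.3.4", 1234, "do you feel ok", "yep")`.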

 

Example BINARY based scripted health check:

 

Step 1: ServerIron opens TCP connection to real server port
Step 2: ServerIron sends configured BINARY data
Step 3: ServerIron is going to wait for configured answer
Step 4: ServerIron closes TCP connection to real server port

 

The following example sends the 4 bytes 0xFE 0xED 0xBA 0xBE and expects the 4 bytes 0xAA 0xBB 0xCC 0xDD back:

 

server real rs-test 1.2.3.4
  port 1234 keepalive
  port 1234 content-check-carray m1
  port 1234 content-check-carray send "0xfe,0xed,0xba,0xbe"
http match-list m1
  default down
  up simple 0xaa,0xbb,0xcc,0xdd
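The comma-separated hex notation translates into raw bytes as follows. This Python sketch assumes that the expected bytes only need to occur somewhere in the reply, which is an interpretation of "up simple" and not something confirmed here. Both function names are hypothetical.

```python
def parse_carray(text):
    """Convert '0xfe,0xed,0xba,0xbe' into the corresponding bytes object."""
    return bytes(int(token, 16) for token in text.split(","))


def binary_reply_ok(reply, expected_text):
    """Return True if the expected byte sequence occurs in the reply."""
    return parse_carray(expected_text) in reply
```

Parsing the send string from the example gives exactly four bytes, which is a quick way to double-check a hand-written byte list for typos before putting it into the configuration.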

The third and last scripted health check type is the one for UDP based services. The configuration is the same as the one for ASCII based scripted health check in the TCP area. The behaviour is slightly different due to the fact that UDP does not know anything about sessions. The following happens:

 

Step 1: ServerIron sends configured ASCII string inside a UDP packet
Step 2: ServerIron is going to wait for a UDP packet with the configured answer


The configuration is very similar to the TCP one:

 

server real rs-test 1.2.3.4
  port 1234 keepalive
  port 1234 content-check m1
  port 1234 content-check send "do you feel ok"
http match-list m1
  up simple yep
  default down

-> send a UDP packet containing the ASCII data "do you feel ok" and expect a UDP packet containing "yep".

 

Health Checks using Port-Policies

 

It is also possible to configure healthck's that use the real server's own IP as dest-ip. The configuration overhead of doing so is considerable, because you have to configure a healthck for every dest-ip/port combination. It is better to use so-called port-policies in this case. A port-policy needs neither a destination IP nor a destination port: it uses the IP of the real server and the port it is bound to as the destination. Note the difference from the healthck example in the healthck section of this document, which bases the health of all three servers on the HTTP reply from port 1234 of destination IP 192.168.9.155.

 

server port-policy check-me
protocol http
protocol http url "GET /whatever.html"
!
server real myReal1 192.168.9.101
port http
port http use-port-policy check-me
!
server real myReal2 192.168.9.102
port http
port http use-port-policy check-me
!
server real myReal3 192.168.9.103
port http
port http use-port-policy check-me

The port-policy example above does an HTTP health check as well and also queries the URL /whatever.html, but no IP address or port needs to be specified. The port-policy uses the IP address and port it is bound to. The example above results in an HTTP query to the following IP/port pairs:

 

192.168.9.101:80
192.168.9.102:80
192.168.9.103:80

 

It is possible to select a different port using a port-policy - you could do the following

 

server port-policy check-me
  port 1234
  protocol http
  protocol http url "GET /whatever.html"

to base the health of port 80 on the reply of port 1234 - the IP/port pairs which are getting checked are:

 

192.168.9.101:1234
192.168.9.102:1234
192.168.9.103:1234
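How a port-policy picks its destination can be written down as a one-line rule. The following Python sketch restates the two examples above. The function name `check_target` is invented for the illustration.

```python
def check_target(real_ip, bound_port, policy_port=None):
    """Return the (ip, port) pair a port-policy health check is sent to."""
    # Without an explicit port in the policy, the bound service port is used.
    port = policy_port if policy_port is not None else bound_port
    return (real_ip, port)


REAL_SERVERS = ["192.168.9.101", "192.168.9.102", "192.168.9.103"]
```

The same policy bound to the three real servers therefore checks three different destinations, which is exactly what makes it reusable.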

 

Port-policies are not as mighty and flexible as healthck's because they do have less options BUT they do reduce the config size and it is much easier to reuse a port-policy compared to the healthck.

 

Health Checks using healthck's and healthck policies

 

We have used various health checks so far and the stuff mentioned above is enough for 90% of the installations. There are nevertheless installations with more complex requirements. ALL of the health checks above went to the real server itself and the destination port of the health check was the one which is the final destination of the client traffic as well. Some setups do require special health checks. It is maybe necessary to load balance a very complex application depending on a lot of daemons running at the real server. The clients need to send their traffic to port 1234 to get it processed but a lot of daemons need to be running to ensure the application is OK. The application offers a special health check service via port 5432 running at another server - it is an HTTP based service. The whole suite of daemons is up and running in case the string "RUNNING" is part of the page "myhealth.html" which is available via port 5432 via the health check server.

 

What does that mean?

 

The status of service x of real server A is based on a reply from service y available via real server B.

 

Healthck's are the tool to configure health checks with the maximum flexibility. All parameters like destination IP, destination port, protocol, interval, retries, content match and so on are configurable by the user.

 

Example:

 

healthck check-server-B tcp
  dest-ip 192.168.9.155
  port 1234
  protocol http
  protocol http url "GET /whatever.html"
  retries 2
  l7-check
server real myRealA 192.168.9.101
port http
port http healthck check-server-B

It is therefore possible to use another server/service as health check destination.

 

This is a pretty mighty tool but you have to keep in mind that you do have to define a dest-ip and a port for every healthck you are configuring. This is a lot of overhead because it is normally not possible to reuse a healthck for more than 2 or 3 real server services which would look like:

 

server real myRealA1 192.168.9.101
port http
port http healthck check-server-B
server real myRealA2 192.168.9.102
port http
port http healthck check-server-B
server real myRealA3 192.168.9.103
port http
port http healthck check-server-B

The health check of the HTTP port (port 80) of all three real servers configured above is based on the healthck "check-server-B". Port 1234 of destination IP 192.168.9.155 needs to be reachable via HTTP and it needs to come back with a return code between 200 and 299 in reply to the query for the URL /whatever.html.

 

Nested Health Check Policies (using logical expressions)

 

Another advantage of healthck's is the fact that it is possible to create so called nested health check policies. That means you are able to create multiple healthck's and you can combine them using the logical operators AND / OR or NOT. You are able to bind the resulting nested health check policy to a real server service and this service is getting declared as ACTIVE/UP as long as the logical expression is TRUE and it is getting declared as FAILED/DOWN as soon as the expression is FALSE.

 

It might be necessary to take a front-end web service down as soon as none of the backend FTP servers is up and running anymore. For a working service, the front-end web server itself needs to be up and at least one of the backend FTP servers needs to be up. The logical expression behind this is

 

WEBSERVER AND ( FTPSERVER1 OR FTPSERVER2 )

 

Create a healthck for every element:

 

healthck web tcp
  dest-ip 192.168.9.101
  port http
  protocol http
  protocol http url "GET /"
  interval 2
  retries 2
  l7-check
healthck ftp1 tcp
  dest-ip 192.168.9.102
  port ftp
  protocol ftp
  interval 2
  retries 2
  l7-check
healthck ftp2 tcp
  dest-ip 192.168.9.103
  port ftp
  protocol ftp
  interval 2
  retries 2
  l7-check

Put them together step by step - first of all combine the FTP services with an OR:

 

healthck FTP1orFTP2 boolean
  or ftp1 ftp2

Combine the WEB service with the FTP services using an AND:

 

healthck WEBandFTP1orFTP2 boolean
  and web FTP1orFTP2 

Use the result "WEBandFTP1orFTP2" as healthck for the WEB service:

 

server real rs101 192.168.9.101
port http
port http healthck WEBandFTP1orFTP2

The http service of real server rs101 is now up as long as the HTTP service of the real server itself is up and as long as at least one of the FTP services is up. You can check this via "show healthck" and "show server bind":

 

Both FTP servers are up and running - the web service should be UP:

 

telnet@SI#show healthck
Total nodes: 6; Max nodes: 128
      Name   Value   Enable   Type       Dest-IP        Port   Proto    Layer
--------------------------------------------------------------------------
       web    TRUE YES    tcp     192.168.9.101    http    http   l7-chk
      ftp1    TRUE YES    tcp     192.168.9.102     ftp     ftp   l7-chk
      ftp2    TRUE YES    tcp     192.168.9.103     ftp     ftp   l7-chk
FTP1orFTP2    TRUE na   or ftp1 ftp2
WEBandFTP1    TRUE na   and web FTP1orFTP2
 
telnet@SI#show server bind
Bind info
Virtual server: vs222                    Status: enabled  IP: 192.168.9.222
        http -------> rs101: 192.168.9.101,  http (Active)

FTP server 192.168.9.102 is going down - the web service should stay UP:

 

telnet@SI#show healthck
Total nodes: 6; Max nodes: 128
      Name   Value   Enable   Type       Dest-IP        Port   Proto    Layer
--------------------------------------------------------------------------
       web    TRUE YES    tcp     192.168.9.101    http    http   l7-chk
      ftp1   FALSE YES    tcp     192.168.9.102     ftp     ftp   l7-chk
      ftp2    TRUE YES    tcp     192.168.9.103     ftp     ftp   l7-chk
FTP1orFTP2    TRUE na   or ftp1 ftp2
WEBandFTP1    TRUE na   and web FTP1orFTP2
telnet@SI#show server bind
Bind info
Virtual server: vs222                    Status: enabled  IP: 192.168.9.222
        http -------> rs101: 192.168.9.101,  http (Active)

The other FTP server is going down as well - web service should move DOWN:

 

telnet@SI#show healthck
Total nodes: 6; Max nodes: 128
      Name   Value   Enable   Type       Dest-IP        Port   Proto    Layer
--------------------------------------------------------------------------
       web    TRUE YES    tcp     192.168.9.101    http    http   l7-chk
      ftp1   FALSE YES    tcp     192.168.9.102     ftp     ftp   l7-chk
      ftp2   FALSE YES    tcp     192.168.9.103     ftp     ftp   l7-chk
FTP1orFTP2   FALSE na   or ftp1 ftp2
WEBandFTP1   FALSE na   and web FTP1orFTP2
telnet@SI#show server bind
Bind info
Virtual server: vs222                    Status: enabled  IP: 192.168.9.222
        http -------> rs101: 192.168.9.101,  http (Failed)
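The three scenarios above can be replayed with a tiny expression evaluator in Python. This only models the boolean logic. The names in `EXPRESSIONS` are taken from the configuration above, while the function and the data layout are assumptions for illustration.

```python
# Boolean healthck's from the configuration: name -> (operator, operands)
EXPRESSIONS = {
    "FTP1orFTP2": ("or", ["ftp1", "ftp2"]),
    "WEBandFTP1orFTP2": ("and", ["web", "FTP1orFTP2"]),
}


def evaluate(name, leaf_values, expressions=EXPRESSIONS):
    """Evaluate a healthck by name, recursing through nested boolean policies."""
    if name in leaf_values:
        return leaf_values[name]           # plain healthck with a measured value
    operator, operands = expressions[name]
    values = [evaluate(operand, leaf_values, expressions) for operand in operands]
    if operator == "and":
        return all(values)
    if operator == "or":
        return any(values)
    if operator == "not":
        return not values[0]
    raise ValueError("unknown operator: " + operator)
```

Feeding in the values from the three "show healthck" outputs gives TRUE, TRUE and FALSE for WEBandFTP1orFTP2, matching the Active, Active and Failed states shown by "show server bind".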

HOORAY!

 

I will add some more stuff as soon as I find some time to do this.

 

Have fun!