06-20-2017 06:03 AM
Two of three nodes failed in a pool overnight with messages like:
Node 18.104.22.168 has failed - Timeout while establishing connection (the machine may be down, or the network congested; increasing the 'max_connect_time' on the pool's 'Connection Management' page may help)
When I came in this morning, I was able to verify the IP/port was available and functioning on both nodes, but the pool still had them marked as failed. I stopped the virtual server and started it back up, and the pool connected to all three nodes and has been happy for the last hour. It seems like it just didn't try to reconnect to those nodes.
I'm new to Stingray load balancers, having only worked with F5 in the past. Is this normal? Do I really need to manually intervene to fix this? Am I missing some obvious config?
One attached screenshot is the health monitor config (listed in the catalog as Connect, I'm not sure if that is a standard monitor or was created by my predecessor).
The pool also has passive monitoring turned on.
The second screenshot shows the connection management settings:
Is something in those configs causing the LB to give up on my nodes after some outage and require me to manually intervene?
06-20-2017 06:27 AM
It looks like the passive monitor did indeed fail your nodes, in which case the passive monitor must also be the one to recover them: a working active monitor will not recover a node that was failed by the passive checks. The most common reasons for a node not recovering from a passive failure are:
1. You don't have any traffic, or
2. Your traffic is all non-idempotent (e.g. POSTs).
If your traffic is largely POSTs, then I would suggest disabling passive monitoring, because without any idempotent requests the passive monitor will never be able to recover the failed nodes.
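To illustrate why recovery depends on idempotent traffic, here is a rough sketch (hypothetical, not Stingray's actual implementation) of passive-monitoring logic: only requests that are safe to replay (e.g. GETs) can be retried against a node marked failed, so a stream of POSTs never gives a recovered node the chance to prove itself.

```python
# Hypothetical sketch of passive-monitor recovery logic.
# Assumption: a passively-failed node is only re-tried with requests
# that are safe to replay (idempotent methods).

IDEMPOTENT_METHODS = {"GET", "HEAD", "OPTIONS"}

class Node:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy   # whether the backend would actually respond
        self.failed = False      # marked failed by passive monitoring

def route(nodes, method):
    """Pick a node for a request; re-try a passively failed node only
    when the request is idempotent and therefore safe to replay."""
    for node in nodes:
        if node.failed:
            if method in IDEMPOTENT_METHODS and node.healthy:
                node.failed = False   # a successful retry recovers the node
                return node
            continue                  # never risk a POST on a failed node
        if node.healthy:
            return node
        node.failed = True            # passive monitor marks it failed
    return None

# A node that came back up overnight but is still marked failed:
nodes = [Node("web1"), Node("web2")]
nodes[0].failed = True

route(nodes, "POST")                  # POSTs avoid web1; it stays failed
assert nodes[0].failed is True
route(nodes, "GET")                   # a GET may be retried on web1...
assert nodes[0].failed is False       # ...which recovers it
```

With all-POST traffic the `route` loop never takes the retry branch, which matches the behavior you saw: the nodes stay failed until something intervenes.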
06-20-2017 07:35 AM