Yes my friend, you can have Zombie ports in your SAN, and I’ll explain what they are in just a moment.
But first, I want to point out the difference between theory and practice. In theory, everything works. In practice, all computing, storage and networking hardware and software breaks at some point. And you would be amazed at the novel ways complex systems come up with to behave badly. It’s almost as if they have innate evil intent.
Best Practice for Storage Availability
For these reasons, it's best practice in the world of storage to adopt a belt and suspenders design philosophy to create “protection in depth”. A good designer always assumes s#$% will happen. For example, mission critical applications commonly run on server clusters that have redundant IO paths, redundant SAN fabrics and redundant storage. Each server runs a multipath IO driver on top of two physical “gozinta/gozouta” (aka, HBAs), and each HBA is connected to its own SAN fabric, Fabric A or Fabric B, similar to the diagram below.
In theory, the server protects application IO from failure anywhere along the path to the storage. Every IO is expected to return a SCSI status confirming whether it succeeded or not, and if no SCSI status comes back within a specific time, that means bad things happened. When an IO does not complete, the multipath IO driver concludes that the path the IO was sent on is bad, stops sending traffic through the HBA on that path, and uses the other HBA instead. Any failed disk IO is resent (note this is not always the case for tape IO).
If a SAN fabric goes off-line for some reason (say, due to human error or the capricious whims of Mr. Murphy), then IO stacks up in the server, and once again the multipath IO driver does its thing and redirects the pending IO through the HBA connected to the other SAN fabric.
So, “in theory”, this active/active, fully redundant design provides a failure-proof storage design that can recover from all manner of bad things (server failure, PCI bus failure, HBA failure, cable failure, switch failure, SAN fabric failure, array failure, array port failure, disk failure, human failure to configure things correctly) and continue to keep the applications happy.
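To make the failover behavior described above concrete, here is a minimal Python sketch. It is not how any real multipath IO driver is written: the `Path` and `MultipathDriver` classes and the `GOOD`/`ERROR`/`TIMEOUT` outcome codes are hypothetical names invented for illustration, and a real driver works with actual SCSI status and timers rather than strings.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical outcome codes, for illustration only.
GOOD = "good"        # a SCSI status came back and the IO succeeded
ERROR = "error"      # a SCSI status came back and the IO failed
TIMEOUT = "timeout"  # no SCSI status came back in time


@dataclass
class Path:
    name: str                   # e.g. "HBA0/Fabric A"
    send: Callable[[str], str]  # sends one IO, returns GOOD, ERROR or TIMEOUT
    healthy: bool = True


class MultipathDriver:
    def __init__(self, paths: Iterable[Path]):
        self.paths = list(paths)

    def submit(self, io: str) -> str:
        """Send the IO on the first healthy path; on failure, mark that
        path bad and resend on the next one. Returns the name of the
        path that completed the IO."""
        for path in self.paths:
            if not path.healthy:
                continue
            if path.send(io) == GOOD:
                return path.name
            path.healthy = False  # stop sending traffic down this path
        raise IOError("no healthy path left for " + io)


# Fabric A has gone off-line (every IO times out); Fabric B is fine.
fabric_a = Path("HBA0/Fabric A", send=lambda io: TIMEOUT)
fabric_b = Path("HBA1/Fabric B", send=lambda io: GOOD)
driver = MultipathDriver([fabric_a, fabric_b])
print(driver.submit("write block 42"))  # prints: HBA1/Fabric B
```

Notice that the driver only reacts when an IO comes back failed or does not come back at all. A path that is merely slow still returns `GOOD`, so in this sketch it would never be marked bad.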
Enter the Zombie Port
Let me tell you a customer story about how they discovered they had a Zombie port. This customer has a sizeable SAN investment and faithfully follows the belt and suspenders best practice for designing an active/active, fully redundant storage system. But one day, latency went way up for several important applications. The analysis showed that frames were being corrupted by a partially failing component in a switch. The effect was retransmissions that slowed down storage IO, so the application was underperforming. And you know what happens then, don't you? The business owner calls you, then your boss, then your boss’s boss and up the chain of command until things return to normal. The good news in this case was that the problem was isolated and the card replaced, all without having to take the switch down.
This situation is what I call a Zombie port: it’s not quite dead enough to completely fail, but hungry for the life blood of any applications that come across its path. In this case, an active/active, completely redundant storage system didn't protect against a Zombie port. In a state similar to that of the “living dead”, the misbehaving port was able to evade detection by the multipath IO driver, so the driver never failed over, letting the Zombie port continue its mayhem on unsuspecting applications.
Although simple in concept, port fencing relies on some pretty sophisticated ASIC engineering. Based on thresholds that you can set for various ASIC event counters, the switch monitors these vital signs to determine the health of its ports. If any port’s vital signs exceed a threshold, then that port is shut off. BANG!! No more Zombie port.
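The threshold check itself is easy to picture in a few lines of Python. This is only a sketch of the idea, not the switch's actual logic or configuration syntax: the counter names, the threshold values, and the `breached` and `fence_ports` functions are all assumptions made up for this example.

```python
from typing import Dict, List

# Hypothetical thresholds: maximum events tolerated per monitoring interval.
THRESHOLDS: Dict[str, int] = {
    "crc_errors": 5,
    "invalid_words": 25,
    "link_resets": 2,
}


def breached(counters: Dict[str, int],
             thresholds: Dict[str, int] = THRESHOLDS) -> List[str]:
    """Return the sorted names of the counters that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if counters.get(name, 0) > limit)


def fence_ports(port_counters: Dict[int, Dict[str, int]]) -> Dict[int, List[str]]:
    """Map each port that should be shut off to the counters that tripped."""
    fenced = {}
    for port, counters in port_counters.items():
        tripped = breached(counters)
        if tripped:
            fenced[port] = tripped
    return fenced


# Port 3 is healthy; port 7 is corrupting frames (lots of CRC errors).
sample = {
    3: {"crc_errors": 0, "invalid_words": 4, "link_resets": 0},
    7: {"crc_errors": 112, "invalid_words": 9, "link_resets": 0},
}
print(fence_ports(sample))  # prints: {7: ['crc_errors']}
```

The point of the sketch is that the decision is made from error counters on the port itself, not from whether an IO eventually completed, which is why it can catch a port the multipath IO driver cannot see.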
Since the port is off-line, any traffic using it automatically reroutes to the next available, least-cost path in the fabric because that’s what fabrics do: they automatically find the least-cost path for traffic so you don’t have to. Should there be no other paths available, the multipath IO driver detects a path failure and moves all the server traffic to the other SAN fabric.
With Fabric Watch and port fencing, you can help make your SAN a Zombie-free zone, once again safe for application data traffic … even after dark.
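Here is a rough Python sketch of that rerouting decision. Fibre Channel fabrics use their own routing protocol (FSPF) for this; the sketch substitutes a plain Dijkstra search over a made-up three-switch topology, and the switch names, link costs, and the `least_cost_path` function are assumptions for illustration only.

```python
import heapq
from typing import FrozenSet, List, Optional, Tuple

Link = Tuple[str, str, int]  # (switch, switch, cost)


def least_cost_path(links: List[Link], src: str, dst: str,
                    fenced: FrozenSet[int] = frozenset()
                    ) -> Optional[Tuple[int, List[str]]]:
    """Return (cost, path) for the cheapest route from src to dst, or None
    if dst is unreachable. Links whose index is in `fenced` are skipped."""
    graph = {}
    for i, (a, b, cost) in enumerate(links):
        if i in fenced:
            continue  # a fenced port takes its link out of the fabric
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None


# Hypothetical fabric: sw1 reaches sw2 directly, or the long way via sw3.
links = [("sw1", "sw2", 500), ("sw1", "sw3", 500), ("sw3", "sw2", 500)]

print(least_cost_path(links, "sw1", "sw2"))
# prints: (500, ['sw1', 'sw2'])
print(least_cost_path(links, "sw1", "sw2", fenced=frozenset({0})))
# prints: (1000, ['sw1', 'sw3', 'sw2'])
print(least_cost_path(links, "sw1", "sw2", fenced=frozenset({0, 1})))
# prints: None
```

The last call is the "no other paths available" case: the fabric has nothing left to offer, so it falls to the multipath IO driver to move the traffic to the other fabric.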