How to Kill a Zombie Port in a Fibre Channel SAN

by Brook.Reams_1 on 06-27-2012 08:35 AM

Yes my friend, you can have Zombie ports in your SAN, and I’ll explain what they are in just a moment.

ZombiesAhead.jpg

Source: Pakalert Press

But first, I want to point out the difference between theory and practice. In theory, everything works. In practice, all computing, storage and networking hardware and software breaks at some point. And you would be amazed at the novel ways complex systems come up with to behave badly. It’s almost as if they have innate evil intent.

DrEvil.jpg

Source: ESPN

Best Practice for Storage Availability

For these reasons, it's best practice in the world of storage to adopt a belt-and-suspenders design philosophy to create “protection in depth”. A good designer always assumes s#$% will happen. For example, mission-critical applications commonly run on server clusters that have redundant IO paths, redundant SAN fabrics and redundant storage. Each server runs a multipath IO driver over two physical “gozinta/gozouta” (aka HBAs), and each HBA connects to its own SAN fabric, Fabric A or Fabric B, similar to the diagram below.

DualSAN.bmp

Source: Brocade

In theory, the server protects application IO from failure anywhere along the path to the storage. Every IO should end with a SCSI status that reports whether it succeeded or failed; if no SCSI status comes back within a specific time, bad things happened. When an IO does not complete, the multipath IO driver concludes that the path the IO was sent on is bad, stops sending traffic through the HBA on that path, and uses the other HBA instead. Any failed disk IO is resent (note this is not always the case for tape IO).
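
To make that failover rule concrete, here is a minimal Python sketch. It is a toy model, not how any real multipath driver is written: `Path`, `submit_io` and the `"GOOD"` status string are names I made up for illustration, and a timeout is modeled simply as a path returning `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

GOOD = "GOOD"  # hypothetical stand-in for a successful SCSI status


@dataclass
class Path:
    name: str
    # Hypothetical send function: returns a status string, or None on timeout.
    send: Callable[[bytes], Optional[str]]
    healthy: bool = True


def submit_io(paths: List[Path], payload: bytes) -> str:
    """Send one IO, failing over to the next healthy path when needed.

    A path is marked bad when it returns no status (timeout) or a
    non-GOOD status; the same IO is then resent on the next path.
    Returns the name of the path that completed the IO.
    """
    for path in paths:
        if not path.healthy:
            continue
        status = path.send(payload)
        if status == GOOD:
            return path.name
        path.healthy = False  # stop using this HBA/path from now on
    raise RuntimeError("IO failed on every available path")
```

Notice what the sketch only reacts to: a missing status or a failed status. A path that eventually returns a good status, however slowly, is never marked bad.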

If a SAN fabric goes off-line for some reason (say due to human error or the capricious whims of Mr. Murphy), then IO stacks up in the server, and once again the multipath IO driver does its thing and redirects the pending IO through the HBA connected to the other SAN fabric.

So, “in theory”, this active/active fully redundant design provides a failure proof storage design that can recover from all manner of bad things (server failure, PCI bus failure, HBA failure, cable failure, switch failure, SAN fabric failure, array failure, array port failure, disk failure, human failure to configure things correctly) and continue to keep the applications happy.

Enter the Zombie Port

Let me tell you a customer story about how they discovered they had a Zombie port. This customer has a sizeable SAN investment and faithfully follows the belt-and-suspenders best practice for designing an active/active, fully redundant storage system. But one day, latency went way up for several important applications. The analysis showed that frames were being corrupted by a partially failing component in a switch. The resulting retransmissions slowed down storage IO, so the applications underperformed, and you know what happens then, don't you? The business owner calls you, then your boss, then your boss’s boss and up the chain of command until things return to normal. The good news in this case was that the problem was isolated and the card replaced, all without having to take the switch down.

This situation is what I call a Zombie port: it’s not quite dead enough to completely fail, but it is hungry for the life blood of any applications that come across its path. In this case, an active/active completely redundant storage system didn't protect against a Zombie port. In a state similar to that of the “living dead”, the misbehaving port was able to evade detection by the multipath IO driver, so the driver never failed over, letting the Zombie port continue its mayhem on unsuspecting applications.

ZombieCartoon.bmp

Source: blog.spoongraphics.co.uk

What Can You Do to Protect Yourself?

If your data center is invaded by actual flesh-eating, “Night of the Living Dead” Zombies, you can consult this guide. But if it’s only a Zombie port in your SAN, it turns out Brocade has a nifty feature for Fibre Channel switches called “port fencing”. Port fencing is part of Brocade’s Fabric Watch suite of monitoring tools, and port fencing is a Zombie port killer.

BestZombieKillerEver.jpg

Source: www.zazzle.com

Although simple in concept, port fencing relies on some pretty sophisticated ASIC engineering. Based on thresholds that you can set for various ASIC event counters, the switch monitors these vital signs to determine the health of its ports. If any port’s vital signs exceed a threshold, then it is shut off. BANG!! No more Zombie port.
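
The idea of threshold-based fencing can be sketched in a few lines of Python. This is only an illustration of the concept, not Fabric Watch itself: the counter names, the threshold values and the function names below are all assumptions of mine, and the real monitoring happens in the switch, not in a script.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PortCounters:
    # Illustrative per-interval error counters; the names are assumptions.
    crc_errors: int = 0
    invalid_words: int = 0
    link_resets: int = 0


# Hypothetical thresholds per monitoring interval; tune for your own fabric.
THRESHOLDS: Dict[str, int] = {"crc_errors": 5, "invalid_words": 25, "link_resets": 2}


def violations(counters: PortCounters, thresholds: Dict[str, int]) -> List[str]:
    """Return the names of the counters that exceed their threshold."""
    return [name for name, limit in thresholds.items()
            if getattr(counters, name) > limit]


def ports_to_fence(ports: Dict[str, PortCounters],
                   thresholds: Dict[str, int] = THRESHOLDS) -> Dict[str, List[str]]:
    """Map each unhealthy port to the counters that tripped; healthy ports are omitted."""
    result = {}
    for port, counters in ports.items():
        tripped = violations(counters, thresholds)
        if tripped:
            result[port] = tripped
    return result
```

For example, a port reporting 12 CRC errors in an interval would be returned for fencing under these made-up thresholds, while a neighbor reporting a single CRC error would be left alone.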

Since the port is off line, any traffic using it automatically reroutes to the next available, least cost path in the fabric because that’s what fabrics do: they automatically find the least cost path for traffic so you don’t have to. Should there be no other paths available, the multipath IO driver detects a path failure and moves all the server traffic to the other SAN fabric.
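
Here is a rough Python sketch of that two-step fallback. It is a simplification under stated assumptions: routes are given as a plain list of (cost, ports) pairs, and `pick_route` is a hypothetical helper of mine, not how a fabric actually computes its paths.

```python
from typing import List, Optional, Set, Tuple

Route = Tuple[int, List[str]]  # (link cost, ports the route passes through)


def pick_route(routes: List[Route], fenced_ports: Set[str]) -> Optional[Route]:
    """Return the cheapest route that avoids every fenced port.

    Returns None when no route in this fabric survives, which is the
    point where the host's multipath IO driver has to move traffic
    to the other fabric.
    """
    usable = [route for route in routes if not set(route[1]) & fenced_ports]
    if not usable:
        return None
    return min(usable, key=lambda route: route[0])
```

With no fenced ports the cheapest route wins; once a port on that route is fenced, the next cheapest route takes over; when every route touches a fenced port, the function returns `None` and the problem becomes the multipath driver's to solve.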

With Fabric Watch and port fencing, you can help make your SAN a Zombie free zone, once again safe for application data traffic … even after dark.

zombiefreezone.jpg

Source: Horrorsigns.com

If you want more information about designing SANs, see our SAN Design and Best Practices guide and take a look at our Strategic Solutions Lab Publications forum for technical “how to” and best practice guidance.

Comments
by (anon) on ‎06-27-2012 12:15 PM

Nice article. From time to time it can be better to be a zombie than to be dead, can't it?

I would suggest using Fabric Watch in a more proactive reporting and alerting fashion, not as a zombie killer.

If you regularly collect port error statistics from your switches, along with meaningful information such as the names of the aliases connected to the affected port, that will help more, and you will see these zombies. You can then work on these issues in a controlled way during the normal day shift. If you use Fabric Watch for alerting, it will inform you before your boss calls you about poor performance, without disabling anything. Disabling a port is the last option a SAN admin has, and he has to be careful with this option.

I wonder why people do not really configure and use Fabric Watch properly. My own thinking is that it is related to a complicated configuration task.

I also miss good documents that show how all of this works and give more detail about the different error counters of the switches. I miss the story behind each error counter. In other words: what is the counter telling me? Why does this counter increase? What is going wrong?

Very often it is guessing...

On my wish list is a really good document about how FC works at the protocol level. A big chapter should cover how an admin can check whether things went wrong on the Brocade SAN switches. This book should provide deep knowledge in a readable format.

I would pay a lot for this kind of book.

An admin should understand the product to get the best result out of the product.

Regards,

Andreas

by Brook.Reams_1 on ‎07-27-2012 12:41 PM

Hi Andreas,

Thanks for posting your comments. I absolutely agree with you that proactively monitoring events from Fabric Watch can prevent unexpected Zombie Port invasions.

By the way, in case you haven't seen it, we do publish a Fabric Watch Administrator's Guide.

Here is the link:

http://www.brocade.com/forms/getFile?p=documents/product_manuals/B_SAN/FW_AG_70x.html

You can find it, and all our FOS guides, on any Fibre Channel Product Page. Go down to the bottom of the product page to see the "Resources" panel, as I show here from the Brocade 6505 Switch product page. Then click "Expand All" and look under the "Documentation" category.

Documentation.JPG

I hope this is helpful.

Best.

Brook.