This is an article from seb's sanblog, where I write mainly about SAN troubleshooting and related topics. Joe Cannata invited me to post it here, so I felt flattered and here it is! Enjoy reading! :-)
Brocade recently released its 16G platform switches and along with them a new major version of FabricOS: FOS 7.0. Besides the new features that customers' admins, architects or end-users might be interested in, I see some nice enhancements and new tools for us support people, too. In the next blog posts I would like to present some of them and show how to use them, why they are important and where they apply.
The first one I want to write about is the D-Port or Diagnostics Port. This is a special mode every port on Brocade's 16G platform can be configured to.
Imagine a two-fabric setup, both spread over two locations, connected via some trunked ISLs through a DWDM. Every once in a while I get a case like this where there is a problem with one of these ISLs. Usually the end-users report major performance problems, and hosts might even crash. The SAN admin looks into the switches, the server admins check the messages logged against their HBAs, and they quickly notice that the problem seems to be in one fabric only. With a redundant second fabric available, the decision is made: "Let's block the ISLs in the affected fabric." The workaround is effective, the situation calms down, the business impact disappears. But of course there is no redundancy anymore, and the next step is to find out what happened and then resolve it.
So a problem case is opened with technical support. The first request from the support people will be to gather a supportsave. Often they even ask you to clear the counters and wait some time before gathering the data.
But it's useless now!
Of course it's most important to stop any business impact by implementing a workaround as quickly as possible, but if I get a data collection like this, it's like being asked to heal a disease on the basis of a photo of an already dead person. Usually no customer will allow the ISLs to be re-enabled before the cause of the problem is found and solved. Welcome to a recursive nightmare!
Having Diagnostics ports on both sides of the link allows you to test a connection between two switches without having a working ISL. This means there is no user traffic and no fabric management traffic over this link, so there is no impact at all. From a fabric perspective, the ISL is still blocked. It comes with several automatic tests: an electrical loopback, an optical loopback and a link traffic test. It also measures the roundtrip link latency and estimates the cable distance.
So this can even help you to determine the right setup for your long distance connection!
Although it's very easy to set this up in Network Advisor (only supported with 16G SFP+), as a support member I prefer stuff to be done via CLI, because then I can see it in the CLI history. (By the way, a real accounting or audit log covering both CLI and GUI actions would be very useful. I look at you, Brocade!) First you need to know which ports on the two switches correspond to each other. (Network Advisor would do that for you.) Then you disable them on both sides using
portdisable port
Once disabled you can configure the D-Port:
portcfgdport --enable port
And finally enable it again using
portenable port
Of course you would do that on both sides. There's a separate command to view the results afterwards:
B6510_1:admin> portdporttest --show 7
D-Port Information:
===================
Port: 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Mode: Automatic
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Status: PASSED
================================================================================
Test                  Start time  Result  EST(secs)  Comments
================================================================================
Electrical loopback   02:58:05    PASSED  --         ----------
Optical loopback      02:58:11    PASSED  --         ----------
Link traffic test     02:58:18    PASSED  --         ----------
================================================================================
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
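If you want to keep the order of the steps straight for both ends of the link, a small script can write the command list out for you. This is only a sketch of the procedure described above, not an official tool: it just prints text, it does not talk to any switch. `portdisable` and `portenable` are the standard FabricOS commands for taking a port offline and online, port 7 and remote port 25 are taken from the sample output, and the second switch name `B6510_2` is made up for the example.

```python
# The three configuration steps, in the order used in this post.
STEPS = [
    "portdisable {port}",            # take the port offline first
    "portcfgdport --enable {port}",  # configure it as a D-Port
    "portenable {port}",             # bring it back online
]


def dport_plan(link_ends):
    """Return (switch, command) pairs for both ends of one link.

    link_ends is a list like [(switch_name, port), (switch_name, port)].
    Each step is applied to both ends before the next step starts,
    and the result query comes last.
    """
    plan = [
        (switch, step.format(port=port))
        for step in STEPS
        for switch, port in link_ends
    ]
    plan += [(switch, f"portdporttest --show {port}") for switch, port in link_ends]
    return plan


if __name__ == "__main__":
    # "B6510_2" is a hypothetical name for the remote switch.
    for switch, command in dport_plan([("B6510_1", 7), ("B6510_2", 25)]):
        print(f"{switch}: {command}")
```

Having the list written down per switch also gives you something to compare against the CLI history later.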
If you see a test failing, you have your culprit, and depending on which one fails, actions can be defined to resolve the problem. Your IBM support will of course help you with that!
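If you collect this output from many ports, picking out the failing test by eye gets tedious. Here is a minimal Python sketch that pulls the test rows out of a saved `portdporttest --show` output. It assumes the layout shown in the sample above (test name, start time, result). The only result value I have shown here is PASSED, so the sketch simply treats anything else as "not passed"; that is an assumption, not documented behaviour.

```python
import re

# Sample taken from the output shown in this post.
SAMPLE = """\
D-Port Information:
===================
Port: 7
Remote WWNN: 10:00:00:05:33:69:ba:97
Remote port: 25
Mode: Automatic
Start time: Thu Sep 15 02:57:07 2011
End time: Thu Sep 15 02:58:23 2011
Status: PASSED
Test                  Start time  Result  EST(secs)  Comments
Electrical loopback   02:58:05    PASSED  --         ----------
Optical loopback      02:58:11    PASSED  --         ----------
Link traffic test     02:58:18    PASSED  --         ----------
Roundtrip link latency: 924 nano-seconds
Estimated cable distance: 1 meters
"""

# A test row is: name, HH:MM:SS start time, then an upper-case result word.
ROW = re.compile(
    r"^(?P<name>\S.*?)\s+(?P<start>\d{2}:\d{2}:\d{2})\s+(?P<result>[A-Z]+)\b"
)


def parse_dport_tests(text):
    """Return a list of (test name, start time, result) tuples."""
    tests = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            tests.append(
                (match.group("name"), match.group("start"), match.group("result"))
            )
    return tests


def not_passed(text):
    """Return the names of all tests whose result is anything but PASSED."""
    return [name for name, _start, result in parse_dport_tests(text) if result != "PASSED"]


if __name__ == "__main__":
    for name, start, result in parse_dport_tests(SAMPLE):
        print(f"{name}: {result} (started {start})")
    print("Not passed:", not_passed(SAMPLE))
```

The "Start time:" and "End time:" header lines also contain a clock time, but they are followed by the year and not by a result word, so the pattern skips them.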
So if you face similar problems and you are already using 16G switches with 16G SFP+ installed, feel free to implement a workaround like blocking the ISLs to lower the impact. The D-Port will help to find out the reasons afterwards.
Better: Clear the counters, wait 10 minutes and then gather a supportsave before you disable the ports. And even better than that: Clear counters periodically as described here.