11-13-2013 09:50 PM
This post will be woefully short of important information, but wanted to get it out there tonight in case something obvious jumps out at someone. Will look to fill in details more tomorrow.
We have a Dell Blade Center with four blades in it, each with a QLogic HBA. This HBA actually presents two HBA's to the host operating system (in our case, ESXi 5.1). These split to each of two Brocade branded switches which sit in the Dell Blade Center (one switch is Fabric A and the other Fabric B). Each of these switches has a dual E_Port ISL that connects to an IBM SAN24B switch. There is one SAN24B switch per fabric. An IBM Gen2 XIV hangs off the SAN24B's.
This morning we started getting multipath warnings from the XIV for one of the blades. ESX on that blade told us that the FabA links were flapping. During this time we saw "Severe latency" warnings on our FabA SAN24B for *one* of the two ISL links. The CLI showed incrementing rx c3timeouts and disc c3's for one of the ports which uplinks to our XIV and incrementing tx c3timeouts and disc c3's for the aforementioned ISL port (the other member of the ISL shows no errors, yet is passing traffic).
The other blades in the blade center continued to function normally... eventually we tried downing the FabA port from the bladecenter switch to the "bad blade" and that made the errors stop. Re-enabling the port brought them back. Clearly a bad HBA, right? We swapped out the HBA for a new one, but the errors returned! Next, we replaced the entire blade...but still the issue persisted.
At this point we have the "bad" blade back online with its FabA port downed at the bladecenter FC switch level, but are at a loss as to what is going on. It seems like if there was an issue with the XIV port or the ISL's then all of the blades would be affected (yet they're not). But after replacing the HBA and Blade, I'm not sure what else to check. Some sort of weird midplane issue on the Dell bladecenter chassis would be quite odd -- and again you'd think it would affect more than one blade.
We weren't savvy enough to run the Brocade Bottleneck checker tool (we're running FabOS 7.0.2a), but will likely look to do that next -- and of course work with Dell support.
We've already reached out to IBM but apparently our support contract offers only support for failing hardware, not software issues. D'oh.
Anyone have any thoughts on what the heck could be going on here? Beyond support, our likely next steps will be to trial and error replace components (and drop half of the ISL too) -- SFP's, cables, etc.
11-14-2013 01:52 AM
In my experience, cases like this come down to a slow-draining device, e.g. a faulty SFP, a faulty cable, or a faulty FC HBA somewhere in the fabric. While frames wait to be transferred from switch A to switch B, the ISL fills up with those frames and cannot carry traffic to and from the other devices in the fabric.
Sometimes a single faulty SFP can slow down the entire fabric, because frames wait on the ISL for half a second before they are dropped, and the other devices cannot communicate quickly enough.
My recommendation, in case you are running FOS 7.x:
With FOS 7.x Brocade has added an_debug to the Supportsave.
Collect Supportsaves from all switches in the fabric. Look at an_debug and analyze the columns SID and DID - source and destination.
First find the time of the bottleneck, then look at which destinations show timeouts around that same time.
When a frame cannot be delivered because of a slow-drain device, it generates timeouts around the time of the bottleneck, typically 2 or 3 seconds earlier. During the bottleneck you will find additional timeouts from different ports and to different destinations, because the ISL is busy with the frames addressed to the slow device.
Then go to fabriclog --show and check whether there are SCN_LR entries (link resets) at this time. Usually the affected port initiates a link reset a few seconds after the bottleneck.
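The time-window check above could be sketched roughly like this in Python. Everything in it is an assumption for illustration: the records are hand-typed (timestamp, SID, DID) tuples with made-up times and addresses, not the real an_debug layout, and the function name and the 3-second window are my own choices.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical timeout records typed in by hand: (timestamp, SID, DID).
# This is NOT the actual an_debug format; times and addresses are invented.
timeouts = [
    (datetime(2013, 11, 13, 9, 14, 50), "0x140300", "0x0a0500"),
    (datetime(2013, 11, 13, 9, 15, 1), "0x140100", "0x0a0200"),
    (datetime(2013, 11, 13, 9, 15, 2), "0x140100", "0x0a0200"),
    (datetime(2013, 11, 13, 9, 15, 3), "0x140200", "0x0a0200"),
    (datetime(2013, 11, 13, 9, 15, 10), "0x140300", "0x0a0500"),
]


def timeouts_near(records, bottleneck_time, before=timedelta(seconds=3)):
    """Count timeouts per (SID, DID) pair in the window just before a bottleneck."""
    start = bottleneck_time - before
    return Counter(
        (sid, did)
        for ts, sid, did in records
        if start <= ts <= bottleneck_time
    )


# Assumed bottleneck time, taken from wherever the bottleneck was reported.
hits = timeouts_near(timeouts, datetime(2013, 11, 13, 9, 15, 4))
for (sid, did), count in hits.most_common():
    print(sid, "->", did, count)
```

The pair or destination that keeps showing up just before each bottleneck is the first candidate to check.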
How to recognize which port is the faulty one: the addresses in the SID/DID columns look like 0x140100. The 0x only marks the value as hexadecimal. Then 14 is the domain ID (hex 14, so the real-life number is 20), 01 is the area, which normally corresponds to the port number on the switch, and 00 is the last byte of the address, which as far as I know is normally 00 for a device attached directly to a fabric port. Then go to nsshow and work out from which sources (SID) to which destinations (DID) the frames are travelling.
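If you have many addresses to go through, the split can be done with a few lines of Python. This is only a sketch assuming the standard 24-bit Fibre Channel address layout of three bytes; the function name and the field names (domain, area, port) are my own labels.

```python
def decode_fc_address(addr: str) -> dict:
    """Split a 24-bit FC address such as '0x140100' into its three bytes."""
    value = int(addr, 16)  # base 16 accepts the optional "0x" prefix
    if not 0 <= value <= 0xFFFFFF:
        raise ValueError(f"not a 24-bit FC address: {addr!r}")
    return {
        "domain": (value >> 16) & 0xFF,  # switch domain ID
        "area": (value >> 8) & 0xFF,     # usually maps to the switch port
        "port": value & 0xFF,            # last byte of the address
    }


print(decode_fc_address("0x140100"))
# {'domain': 20, 'area': 1, 'port': 0}
```

The decoded values are decimal, so domain 20 here is the same switch that shows up as 14 in the hex address.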
In most cases you will end up at one faulty port. Then go and check its SFP, cable, and attached device.
Hope this helps :)
I can look at one or more of your Supportsaves, if you don't mind of course.
P.S. Do you know where all the old topics went? I think there have been some changes to the Brocade page and I am not able to find the old topics... !?!
11-14-2013 01:47 PM
Thanks for the response! Very informative.
One of our vendors is sending us a replacement Blade Center Fibre Channel switch (from looking at the support logs, they felt this may be the root cause).
Will let everyone know how it goes.