10-24-2011 03:20 AM
I have a problem that no one seems to have the solution for.
I have 2 fabrics (A and B), each with 4 Brocade Silkworm 300 switches (latest firmware).
I have two HP EVA 8100 SANs in two server rooms, so 4 of the switches are interconnected via longwave SFPs (4G). All shortwave links run at 8G, the SAN uplinks are 4G, and each SAN has 4 uplinks in each fabric, so 8 paths per SAN.
We first really saw the error after installing a new HP C7000 blade enclosure with FlexFabric. It ran for a month, and then we began seeing path failures, and our new HP 490c (blade) with ESXi 4.1 was going into a "not responding" mode.
Let me define the setup.
Non-blade ESX 4.1 and SAN + a lot of other servers
Blade ESXi 4.1 and SAN + a lot of other servers
All our VMware servers are running on LUNs on both SANs.
What we are seeing is "severe latency bottleneck detected at slot 0 port 0" on both switches that have FlexFabric connected. These switches are also connected by longwave to the other server room (port 0 is the longwave connection).
But the port is never near 4G; it is only running at 100M or so when we see the error. The same goes for the FlexFabric port.
We tried disabling the FlexFabric uplink for 20 mins, and we didn't receive any errors in that time. After enabling it again, the warnings came back after 5 mins or so.
All servers in server room 1 have all their paths to both SANs, but all servers in server room 2 are now missing paths to the SAN in room 1.
We can reboot one of the switches and all paths come back up, but only for some time; that can be 10 mins or half a day.
Any ideas? We have upgraded all switches to the latest firmware, v6.4.2a, and changed the port, switch and SFP that the FlexFabric is using, but no change.
HP are looking at logs but can't find anything, so in the meantime I would like to hear if anyone has a clue.
10-24-2011 07:18 AM
What is the FC module model in your C7000? Distance between both server rooms connected by longwave?
Are you using your C7000 FC module in NPIV mode or fabric mode? Firmware on the module?
Can you run SANHealth in your fabric and post the results?
What is the warning that you get?
10-24-2011 07:46 AM
Thanks for your reply.
When you say FC module model, do you mean the SFP model? If so, it's an HP AJ715A.
The distance between the two rooms is about 500 meters.
The FC port has NPIV enabled, and I'm pretty sure that we are using it. It's a default setup; I'm not quite sure where I can find it in the web interface.
But we are zoning by WWN, and the FlexFabric is showing up as a library on the health report, so all blades are running via the same Flex port.
What do you mean by the firmware? On the FlexFabric it is v.3.30 (just updated).
I have just created a new health report; I will upload it tomorrow when I get it from Brocade.
Have you seen a similar problem before?
10-24-2011 08:54 AM
I think neverdie wanted to know about the FlexFabric module.
It's a module manufactured by QLogic that does FCoE within the chassis and Ethernet and FC on the outside.
On the 300 the port must have NPIV enabled, and (p)WWN zoning is the correct method to zone HBAs to their targets.
What is important in such a config is that all firmware and drivers used are in the support matrix from HP.
And by all, I mean HBA driver and firmware, blade server firmware, drivers and BIOS, FlexFabric module firmware (and other modules if applicable), and Onboard Administrator firmware. It's essential that the complete blade config is up to par.
Once that's done, the config has to be on the connectivity stream with regard to the SAN switches. That can be found on the HP SPOCK website.
Otherwise strange behaviour can occur.
10-24-2011 09:55 AM
Hi, thanks for replying, dion.
All HBA firmware and drivers should be on the newest version; I will double-check tomorrow when I'm back at work. But could that result in bottleneck errors?
In my syslog I'm getting "severe latency bottleneck detected at slot 0 port 0" (longwave uplink port) on both fabrics.
The blade Flex module is updated, and so is the management controller (only 10 days ago).
I have uploaded the latest health report. There are some strange port issues on sansw7, which is pretty strange because those ports should be empty.
There are also a "few" inactive zones that will be deleted.
10-24-2011 10:52 AM
Have you opened a ticket with VMware yet? You should have them check the required driver and firmware for your mezzanine card from the VMware point of view.
This seems like a compatibility issue on the chassis side.
10-24-2011 10:55 AM
From what you're writing, you're seeing the bottleneck error reported against port 0 on domains 5 and 6, as those domains have the C7000 chassis connected, yes?
The xls sheet (SAN Ports) shows that all ISLs run at 8G.
Q How long is the long-distance link?
The blade enclosure is connected at 8G as well on domain 5/p19, but it's running at 4G on domain 6/p19.
Q Unfortunately the sheet doesn't include port 19 from domain 5, but is that port's fillword set correctly (usually 3)?
- If you use CA >>> are you seeing errors in the CVE controller logs regarding "excessive date rate changes", "resource on the inter site links reduced", or DR groups flipping between suspend and resume modes?
- If you use CA >>> is IOD set, DLS not set, and port-based routing configured? (For commands refer to the HP CA implementation guide.)
10-24-2011 11:05 AM
We had HP VMware support on it (HP VMware licenses); they couldn't see any problems.
The firmware and drivers on the ESXi hosts have been updated.
It is not only the VMware servers we are having problems with; I think the "not responding" mode is a symptom. It should be fixed in vSphere 5, something about the handling of dead paths and paths with link but no data coming through.
When we see these bottlenecks, all servers (Windows, blades and standard rack servers) are losing paths, and they only come back up when we reboot some of the SAN switches.
That's also why I think I will try disabling some of the SAN uplinks on each blade in the enclosure, to see if a specific blade is causing the problem.
We have primarily looked at the ESX hosts, but we have 7 blades on the SAN via FlexFabric, so maybe that could get me somewhere.
The bottleneck errors come every 5-8 mins.
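To put a number on how often the warning fires, the syslog can be scanned for the message and the gaps between hits measured. This is a minimal Python sketch, assuming each syslog line begins with a `YYYY-MM-DD HH:MM:SS` timestamp. That line layout, the sample lines, the host name `switch` and the function name `bottleneck_gaps` are assumptions for illustration; only the message text comes from this thread.

```python
from datetime import datetime

# Message text as quoted in this thread.
MESSAGE = "severe latency bottleneck detected at slot 0 port 0"

def bottleneck_gaps(lines, message=MESSAGE):
    """Return the gaps in minutes between successive lines containing message.

    Assumes (hypothetically) that every line begins with a
    'YYYY-MM-DD HH:MM:SS' timestamp; adjust the slice and the format
    string to whatever your syslog server actually writes.
    """
    stamps = []
    for line in lines:
        # Case-insensitive match on the message text.
        if message in line.lower():
            stamps.append(datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S"))
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]

# Made-up sample lines, only to show the call.
sample = [
    "2011-10-24 09:00:00 switch severe latency bottleneck detected at slot 0 port 0",
    "2011-10-24 09:02:10 switch some other message",
    "2011-10-24 09:06:00 switch severe latency bottleneck detected at slot 0 port 0",
    "2011-10-24 09:13:30 switch Severe latency bottleneck detected at slot 0 port 0",
]
print(bottleneck_gaps(sample))
```

Running the same function over the logs from both fabrics would show whether the warnings line up in time on A and B.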
10-24-2011 11:18 AM
The link is about 500 meters max. Now that I think about it, it is probably much less, maybe 200-300.
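As a rough sanity check on whether that distance matters, the buffer credits needed to keep a link streaming can be estimated from the round-trip time and the time it takes to send one full-size frame. This is a minimal Python sketch, not a vendor formula. It assumes about 5 µs/km propagation delay in fiber, a 2148-byte maximum FC frame, and a nominal data rate of about 100 MB/s per "Gbit" of link speed. The function name `credits_needed` is made up for illustration.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0   # approximate propagation delay of light in fiber
FULL_FRAME_BYTES = 2148       # maximum-size FC frame incl. SOF, header, CRC, EOF

def credits_needed(distance_km, speed_gbps):
    """Estimate the BB credits needed to keep a link full of full-size frames.

    Assumes a nominal FC data rate of roughly 100 MB/s per 'Gbit' of
    link speed, i.e. speed_gbps * 100 bytes per microsecond.
    """
    frame_time_us = FULL_FRAME_BYTES / (speed_gbps * 100.0)
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    return max(1, math.ceil(round_trip_us / frame_time_us))

for speed in (4, 8):
    print(speed, "G:", credits_needed(0.5, speed))
```

Under these assumptions a 500 m link needs only one or two credits at 4G or 8G, so the distance alone would not be expected to starve the ISL. The estimate says nothing about a slow-draining device further along the path.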
You are correct about the FlexFabric uplinks, 5/19 and 6/19, and you spotted correctly that they are a 4G and an 8G. I replaced one of the uplink SFPs (5/19 and Flex module) with an 8G to see if it did anything (maybe a faulty SFP), but no change.
About 5/19 and 6/19, I hadn't seen it. This health report is brand new, but you are right, it is missing the name; why, I'm not sure. I'm pretty sure they were there on the last report. Could it also be a symptom of this error?
I will look into the CVE tomorrow and get back, but HP have looked at all SAN logs without finding anything.
Also remember that servers in server room 1 don't lose any paths :-|
Again, thanks for the help.
10-24-2011 11:42 AM
Do you know what fillword is set on the 8G connection to the FlexFabric module?
Do you have an increasing txcrd_zero counter as well, as can be seen in the other thread http://community.brocade.com/message/20120#20120
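Whether the counter is "increasing" is easiest to judge from two readings taken a known time apart. This is a minimal Python sketch; the function name `credit_zero_fraction`, the sample values and the 2.5 µs tick size are assumptions, not taken from this thread, so check the units your switch prints next to the counter.

```python
def credit_zero_fraction(count_before, count_after, interval_s, tick_us=2.5):
    """Fraction of the interval the port spent with zero transmit credits.

    tick_us is the time represented by one counter increment; 2.5 us is
    an assumption here and should be checked against the switch output.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = count_after - count_before
    if delta < 0:
        raise ValueError("counter went backwards (cleared or wrapped?)")
    return delta * tick_us * 1e-6 / interval_s

# Made-up readings taken 60 seconds apart.
print(credit_zero_fraction(1_000_000, 13_000_000, 60))
```

A result near zero means the port almost always had credit to send. A large fraction would point at the device on the other end of that port not returning credits fast enough.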