03-09-2012 11:21 AM
IHAC who is suffering fabric instability and performance issues on the old FOS 5.3.0d.
They have a fabric composed of 3 directors: 2 x ED48000B and 1 x ED24000.
They are ISL'ed as follows:
48000_1 > ( 4 x 10Gbit ) > 48000_2 > ( 4 x 4Gbit trunk ) > 24000
Between the 48Ks , ISL is composed by 4 x 10Gbit connections using a blade that have
Between 48000_2 and the 24000, the ISL is composed of 4 x 4Gbit connections grouped into a trunk.
Hosts attached to 48000_1 and hosts attached to 48000_2 see extreme performance issues when accessing storage across the ISLs, even though the ISLs show only 25% utilization in the SAN Health output.
The customer tried rebooting the servers that are zoned across the ISLs; the disks are still presented to the servers, but they cannot even mount their file systems because of the slowness.
Hosts that have been reconfigured to remove the ISL traffic, accessing only storage ports on the same switch, get good results with normal performance.
Most of the affected servers are attached to NPIV blades in the Brocade 48000s.
Guys, do you know of anything related to this behaviour at this FOS level?
03-10-2012 03:01 AM
Hi, have you looked at the port error counters? Maybe a device somewhere is misbehaving / using up all the buffers? Are you seeing any errors? That could pinpoint it to a port/cable/SFP.
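A quick first pass from the CLI, assuming you have admin access on the directors (the slot/port below is just a placeholder, pick one of the suspect ISL or NPIV ports):

  switch:admin> porterrshow          # one line per port: crc err, enc out, disc c3, link fail...
  switch:admin> portstatsclear 1/4   # clear the counters on a suspect port
  switch:admin> portstatsshow 1/4    # re-check after a few minutes under load

High enc out with a clean crc err count often points at a cable/SFP; a climbing disc c3 on an ISL usually means class-3 frames are being dropped because they sat too long waiting for credits.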
"Between the 48Ks , ISL is composed by 4 x 10Gbit connections using a blade that have" - is some text missing?
"Hosts attached to switches 48000_1 and hosts attached to 48000_2 got extreme performance issues trying to access storage using the ISL traffic even with the ISL showing just 25% of utilization looking at SAN HEALTH output." - is that trying to access storage in the 24000? Or also between the two 48000?
I tried to go through the release notes for 5.3.2c (which include the fixes from the earlier 5.3.x releases) but couldn't see anything definite that could cause this.
03-10-2012 04:32 AM
Did the links work in the past with better performance?
The customer can try a link reset on the 10G ISLs. It can happen that the ports lose buffer credits, which ends up as slow performance on the links.
An indication can be a high and rapidly increasing credit-time-zero value in the portstatsshow output. I hope that counter is present in that old code; I don't have the manual at hand to check.
Keep in mind that the reset is disruptive for the traffic and visible to the servers.
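A minimal sketch of the check and the reset (slot/port is a placeholder; on later FOS the counter is called tim_txcrd_z, not sure of the exact name on 5.3; and the disable/enable really does bounce the link):

  switch:admin> portstatsshow 10/0   # look for the time-at-zero-TX-credit counter
  switch:admin> portdisable 10/0     # bouncing the port forces a re-login...
  switch:admin> portenable 10/0      # ...and re-initializes the buffer-to-buffer credits

If that counter climbs fast while the link shows low utilization, it matches the lost-credits theory.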
Bandwidth usage and performance are two different things.
You can fill a link to 100% without getting the full bandwidth. In other words, if you send very small I/Os (small payload) you will not get 360-380 MB/s on that link anyway: if every frame carries only a few hundred bytes instead of the 2112-byte maximum payload, the link can look busy while moving only a fraction of that.
How long is the 10G ISL?
Is the correct number of buffers set for the distance on the affected ISLs? (see the commands below)
What was changed before the performance got poor?
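To check the buffer/distance side, something like this (portbuffershow should exist at 5.3 as far as I remember, but treat it as a suggestion):

  switch:admin> portbuffershow   # buffer allocation and usage per port group
  switch:admin> portcfgshow      # check the long-distance mode on the ISL ports

If the 10G ISL spans a real distance, the ports may need a long-distance mode via portcfglongdistance (LE/LD/LS) instead of the default L0, so that enough credits are reserved for the link.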