11-15-2010 03:07 PM
I'm having this amazingly annoying problem: if I tell my multipathing software to use only path A, performance is fine, and if I have it use path B, it's horrible. I am using a SQL script that inserts 10,000 rows into a table and recording the time of each run. On the good path I get 1000-1500 inserts/sec; on the bad path I get 100 if I'm lucky. Sometimes if I let it sit, the first run will be semi-OK, around 800 inserts/sec, and then if I run it again it's below 100 again.
I've tried this on two different servers with the same results: good, steady performance on fabric A, horrible on fabric B.
I've cleared all the port counters involved on fabric B and rerun the tests multiple times, and I do not see any errors such as bad transmitted words. I've also tried pathing to the storage over the other storage controller on the NetApp and still got bad performance, which to me eliminates the possibility of a bad SFP or cable at the SAN itself.
The fabric consists of: SAN -> core switch -> BladeCenter integrated switch (with 3 ISL trunks to the core) -> server.
I'm just out of options and hoping someone can help.
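For anyone who wants to reproduce this kind of test, here is a minimal sketch of the insert-rate benchmark described above. The thread does not say which DBMS or script is used, so this uses SQLite as a stand-in; the point is the methodology (single-row inserts, individually committed, timed end to end), not the specific database.

```python
# Sketch of an inserts/sec benchmark, assuming SQLite as a stand-in database.
import sqlite3
import time

def insert_rate(n_rows: int = 10_000) -> float:
    """Insert n_rows one at a time and return the measured inserts/sec."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bench (id INTEGER, payload TEXT)")
    start = time.perf_counter()
    for i in range(n_rows):
        # One statement per row, committed individually, to mimic a script
        # that issues discrete INSERTs (each commit is a round trip to storage).
        conn.execute("INSERT INTO bench VALUES (?, ?)", (i, "x" * 100))
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return n_rows / elapsed

if __name__ == "__main__":
    print(f"{insert_rate():.0f} inserts/sec")
```

Running this over each path in turn (with the other path disabled in the multipathing software) gives a comparable number per fabric.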
11-16-2010 02:00 AM
This could be a case of a slow-drain device in one fabric affecting the entire fabric, and a bad SFP or cable can cause this. It's strange but true, and I have had a similar experience before.
Detecting a slow-drain device is not easy, but newer FOS (6.3+) has a command called bottleneckmon which can help you find bottlenecks.
New Bottleneck Detection Capability
With Fabric OS 6.3, you can now utilize the new Bottleneck Detection capability available with Brocade Advanced Performance Monitoring. Bottleneck Detection identifies and alerts you about "slow-drain" storage devices that can cause latency and I/O timeouts. This capability is particularly valuable for optimizing performance in highly virtualized server environments.
I'm using FOS 6.4.1, and here I can enable bottleneck detection for the whole switch with bottleneckmon --enable; on older FOS you need to enable it on specific ports.
Check for tape libraries/drives or backups that might exist in the slow fabric, and for flapping ports with fabriclog -s. A slow-drain device often logs out and back in to the fabric.
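For reference, the checks above look roughly like this from the switch CLI (only the commands mentioned in this thread; exact options and output vary by FOS release, so check the command reference for your version):

```
switch:admin> bottleneckmon --enable    (switch-wide on FOS 6.4.x)
switch:admin> bottleneckmon --show      (list any ports flagged as bottlenecked)
switch:admin> fabriclog -s              (scan for repeated logins/logouts from one device)
```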
11-16-2010 12:13 PM
This is very good advice. I am running 6.4.0b on my core switch (the one I don't think has any issue), and it said no ports are bottlenecked.
The switch that I do think is having a problem was running an old firmware version, and I upgraded it today. Problem is, bottleneckmon doesn't want to work on it because it says all my ports are not F_Ports, even though switchshow says they are. Since this is an IBM BladeCenter integrated module, I think bottleneckmon won't work because Locked_G_Port and Disabled_E_Port are set to ON in portcfgshow. No clue what to do there.
Additionally, I disabled each ISL trunk one at a time and ran the test; still bad performance.
I am going to replace all the cables in between as well. Anything you can do to help would be great.
11-16-2010 10:06 PM
In the older 6.3 FOS code there is a defect where the switch doesn't recognize an F_Port if you have changed one of the default port settings, such as disabling E_Port capability.
This is fixed in 6.4.0b.
11-16-2010 10:09 PM
What kind of storage array are you using?
Some midrange arrays have a preferred path, and if I/Os come down the secondary link the array has to reroute them internally, which takes longer than on the primary path.
If so, please use the storage system's multipathing software.
11-16-2010 10:20 PM
I have an IBM-branded NetApp (N6070). I did discover that the SFP on the NetApp itself was causing a large number of tim_txcrd_z counts; I had it replaced, along with every SFP involved from the core switch to the edge switch integrated in the BladeCenter.
The tim_txcrd_z counts stopped on the link to the SAN itself, but they still climb somewhat on the E_Ports of the edge switch.
The performance issue is still there, and it can't be preferred-path related: I have used the storage system's multipathing software, and I have also tried zoning only to that port on the SAN, and it still happens with the vendor multipathing software either enabled or disabled.
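For anyone following along, the tim_txcrd_z counter mentioned above (time spent at zero transmit buffer credits, i.e. the port could not send because the far end had not returned credits) can be watched per port like this; the port number is a placeholder, and output format varies by FOS release:

```
switch:admin> portstatsclear 3    (reset the counters on port 3 first)
switch:admin> portstatsshow 3    (rerun the test, then look at the tim_txcrd_z line)
```

A tim_txcrd_z count that keeps rising on an E_Port during the test suggests congestion or a slow-drain device somewhere behind that ISL.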
11-16-2010 10:25 PM
Additionally, the performance flip-flops: on some runs I will get 600-800 inserts a second, and consistently, if I run a test right after, it never breaks 100.