Fibre Channel (SAN)

Occasional Contributor
Posts: 6
Registered: ‎09-19-2007

Deskew discrepancy (cable length) causes I/O latencies & timeouts in our SAN 5300s v6.4.0b?

We experience I/O latencies & timeouts in our SAN fabric: two Brocade 5300s (v6.4.0b) connected via a trunked ISL (port#0 and port#1 @ 8 Gbps).

If we match the cable lengths to get matching deskew numbers, should we expect fewer I/O latency errors?

Alternatively: is downgrading the ISL speed to 4 Gbps a better way to achieve balanced throughput between the ISL ports in this trunk (to reduce these I/O latency errors)?

Summary:

On our core SAN switches there appears to be a discrepancy in the cable lengths of the fibers used to form the ISL trunk group between the core switches (connecting the 3rd and 4th floor core switches). Our goal is to make sure the deskew numbers match on both ISL ports in this trunk group.

Analysis:

FL4R1-SWF-FA: >  trunkshow -perf

  1:  0->  0 10:00:00:05:1e:ee:cf:7e   2 deskew 28 MASTER -> 13 unit difference compared to next ISL in Trunk

       1->  1 10:00:00:05:1e:ee:cf:7e   2 deskew 15

    Tx: Bandwidth 16.00Gbps, Throughput 231.11Mbps (1.68%)

    Rx: Bandwidth 16.00Gbps, Throughput 210.08Mbps (1.53%) 

    Tx+Rx: Bandwidth 32.00Gbps, Throughput 441.19Mbps (1.61%)

Run portperfshow to show the current ISL/trunk throughput on the switches:

FL4R11-SWF-FA: > portperfshow 0-1 -t 10

  0      1       Total

========================

   1.6m  55.6m  57.3m   -> Notice the significant difference in throughput between port#0 and port#1 of this trunk

   1.8m  25.0m  26.8m  -> Same as above

   1.7m  39.3m  41.0m  -> Same as above

   4.7m  44.7m  49.4m  -> Same as above
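
For reference, a rough interpretation of the sample above, assuming the "m" values are millions of bytes per second and that each 8 Gbps member carries roughly 800 MB/s at line rate (both are assumptions on our part, not something shown by the switch):

# Quick sketch of what the portperfshow sample above implies.
# Assumptions: "m" = millions of bytes/s; ~800 MB/s per 8 Gbps member.

samples = [(1.6, 55.6), (1.8, 25.0), (1.7, 39.3), (4.7, 44.7)]  # (port#0, port#1)
TRUNK_CAPACITY_MB_S = 2 * 800.0   # two 8 Gbps members (assumed)

for p0, p1 in samples:
    total = p0 + p1
    print(f"port#1/port#0 ratio ~{p1 / p0:.1f}x, "
          f"trunk utilisation ~{total / TRUNK_CAPACITY_MB_S:.1%}")

# The trunk is only a few percent utilised, yet almost all of the traffic
# rides on port#1 -- the imbalance called out in the annotations above.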

SAN Configuration:

Each of our two floors (1 & 2) is served by a Brocade 5300 that connects that floor's local hosts and storage, and the two switches are connected by TWO 8 Gbps ISLs (trunked on port#0 and port#1). All hosts (mostly Windows) in our environment access storage on either floor, so the ISLs are essential in our design.

We never get close to the throughput limit on the ISL links; instead we end up with latency bottlenecks.


For example, we see a high number of "tim_txcrd_z" ticks and high values for C3 frames received (261498201) on the ISL trunk port.


FL2_BCD5300
tim_rdy_pri                        1042        Time R_RDY high priority
tim_txcrd_z                        3508284216  Time TX Credit Zero (2.5Us ticks)

FL1_BCD5300
tim_rdy_pri                        1041        Time R_RDY high priority
tim_txcrd_z                        4259587257  Time TX Credit Zero (2.5Us ticks)
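
To put those numbers in perspective, taking the counter description at face value (each tick is 2.5 microseconds) and with the caveat that we don't know the interval since the counters were last cleared and that 32-bit counters can wrap:

# Sketch: convert the tim_txcrd_z readings above (reported in 2.5 us ticks)
# into time spent at zero transmit credit.

TICK_SECONDS = 2.5e-6   # per the counter description "2.5Us ticks"

for switch, ticks in [("FL2_BCD5300", 3508284216), ("FL1_BCD5300", 4259587257)]:
    print(f"{switch}: ~{ticks * TICK_SECONDS:,.0f} s with TX credit at zero")

# To express this as a percentage, divide by the elapsed time since the
# counters were last cleared (not known here).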

Obviously, we see I/O delays and/or timeouts accessing SAN storage (two EVA 8400s, one on each floor).
The EVA 8400s don't appear to show any performance/latency bottleneck, so we suspect the cause must be the SAN fabric and/or the ISLs.

Oct 24 2011 09:23:52 GMT   Warning  "Severe latency bottleneck detected at slot 0 port 0".  Switch  259976  1  AN-1010  FL1_BCD5300_SWF-FA

Thanks for allowing a do-over as we gathered more analytical data!

SR

Valued Contributor
Posts: 931
Registered: ‎12-30-2009

Re: Deskew discrepancy (cable length) causes I/O latencies & timeouts in our SAN 5300s v6.4.0b?

If we match the cable lengths to get matching deskew numbers, should we expect fewer I/O latency errors?

Alternatively: is downgrading the ISL speed to 4 Gbps a better way to achieve balanced throughput between the ISL ports in this trunk (to reduce these I/O latency errors)?

According to HP, ports do not necessarily have to be balanced in a trunk, but I could have misunderstood this.

What I see in my environment is unbalanced ISL trunk ports, with IOD set, port-based routing on, and DLS off.

Those settings are per the HP CA guide when using the old replication protocol for EVAs.

  1:  0->  0 10:00:00:05:1e:ee:cf:7e   2 deskew 28 MASTER -> 13 unit difference compared to next ISL in Trunk

       1->  1 10:00:00:05:1e:ee:cf:7e   2 deskew 15

That means you have a difference of approximately 25 meters between the two ISLs, with port 0 being the longer one.

Try to reduce the deskew difference by adding length (switch patch cables) to the ISL on port 1, or by reducing the length of the ISL on port 0.

A difference of 30 meters or more could lead to performance issues; 300 meters or more is not supported.
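
As a rough sanity check of that estimate (assuming, as is commonly described for FOS, that one deskew unit is about 10 ns relative to the shortest trunk member, and that light propagates through fiber at roughly 5 ns per meter; both figures are assumptions on my part, not something shown in your output):

# Rough sketch: convert a trunkshow deskew difference into an approximate
# cable-length difference. Both constants are assumptions, see above.

NS_PER_DESKEW_UNIT = 10.0   # assumed deskew granularity
NS_PER_METER_FIBER = 5.0    # ~5 ns per meter of glass

def deskew_diff_to_meters(deskew_a: int, deskew_b: int) -> float:
    delta_ns = abs(deskew_a - deskew_b) * NS_PER_DESKEW_UNIT
    return delta_ns / NS_PER_METER_FIBER

# Your trunkshow values: port 0 = 28, port 1 = 15.
print(deskew_diff_to_meters(28, 15))   # ~26 m, in line with the estimate above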

For example, we see a high number of "tim_txcrd_z" ticks and high values for C3 frames received (261498201) on the ISL trunk port.

This in itself doesn't mean anything.

Did you reset the counters, wait an hour or so, and then read that value?

Did you use portstatsshow or portstats64show to get those values?

The first one is a 32-bit counter readout and tends to wrap around quickly with high-volume traffic.

The latter one is 64-bit, which doesn't wrap around nearly as quickly as its 32-bit counterpart.
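
As a back-of-the-envelope illustration of how quickly a 32-bit counter can wrap (assuming the word counters count 4-byte FC words and that an 8 Gbps link moves roughly 800 MB/s at line rate; both are round-number assumptions):

# Sketch: how long until a 32-bit word counter wraps on a busy 8 Gbps link.

LINK_BYTES_PER_SEC = 800e6               # assumed ~800 MB/s for 8 Gbps FC
WORDS_PER_SEC = LINK_BYTES_PER_SEC / 4   # counters count 4-byte words (assumed)

seconds_to_wrap = 2**32 / WORDS_PER_SEC
print(f"32-bit word counter can wrap in ~{seconds_to_wrap:.0f} s at line rate")
# A 64-bit counter at the same rate would take thousands of years to wrap.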

But combined with

Oct 24 2011 09:23:52 GMT   Warning  "Severe latency bottleneck detected  at slot 0 port 0".  Switch  259976  1  AN-1010  FL1_BCD5300_SWF-FA

this means you've got a problem, perhaps buffer credit starvation / a slow-drain device.

I suggest resetting the counters on both ends, waiting a few hours, and then looking again using portstats64show {portnumber} -long.

- Look for frame discards due to timeouts.

- Calculate the average frame size: (# words / # frames) * 4 = average frame size in bytes.

- Calculate whether the number of buffer credits (portbuffershow tells you how many are configured for your ISL) is sufficient for the ISL length, speed, and average frame size (see the sketch below).
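
Here is a rough sketch of that last check. The exact formula isn't spelled out above; this uses the common rule of thumb that you need enough credits to keep the link busy for one round trip at the average frame size. The 5 ns/m propagation figure, the ~100 MB/s-per-Gbps payload rate, and the example numbers are all assumptions/hypothetical values, not figures from this thread:

# Rough buffer-credit check (constants and example numbers are assumptions).

NS_PER_METER_FIBER = 5.0   # propagation delay in glass (assumed)

def avg_frame_size_bytes(words: int, frames: int) -> float:
    # Step from the list above: (# words / # frames) * 4
    return words / frames * 4

def credits_needed(link_m: float, link_gbps: float, frame_bytes: float) -> float:
    # Credits required to keep the link busy for one round trip.
    round_trip_s = 2 * link_m * NS_PER_METER_FIBER * 1e-9
    bytes_per_s = link_gbps * 1e8          # ~100 MB/s per Gbps after 8b/10b (assumed)
    frame_serialisation_s = frame_bytes / bytes_per_s
    return round_trip_s / frame_serialisation_s

# Hypothetical example: a 50 m inter-floor ISL at 8 Gbps with 1 KB frames.
# Feed avg_frame_size_bytes() the word/frame counts from portstats64show.
print(credits_needed(50, 8, 1024))   # well below typical defaults, so at this
                                     # distance credits should only run dry if
                                     # something downstream drains slowly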

Super Contributor
Posts: 635
Registered: ‎04-12-2010

Re: Deskew discrepancy (cable length) causes I/O latencies & timeouts in our SAN 5300s v6.4.0b?

You have approximately 26 meters of difference between the two links.

What happens if you disable trunking? Do you still receive latency errors?

Regards,

Andreas

Valued Contributor
Posts: 539
Registered: ‎03-20-2011

Re: Deskew discrepancy (cable length) causes I/O latencies & timeouts in our SAN 5300s v6.4.0b?

Since the 4G platforms, trunked ports are not always used equally: the first link is used up to a certain limit, then the next link comes in, and so on (a simplified illustration of this fill-first behaviour is sketched below). So all the links in a trunk will be more or less equally used only when the real usage is close to the maximum throughput of the entire trunk. What you currently have is that the shorter (faster) link in your trunk is used more than the longer link.
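
If it helps to picture it, here is a deliberately simplified model of that fill-the-first-link-then-spill behaviour; it is only an illustration of the idea, not Brocade's actual frame-level trunking algorithm, and the numbers are hypothetical:

# Toy model of "fill one member up to a limit, then use the next".

def distribute(offered: float, members: int, per_member_limit: float):
    loads, remaining = [], offered
    for _ in range(members):
        take = min(remaining, per_member_limit)
        loads.append(take)
        remaining -= take
    return loads

# Hypothetical load roughly matching the portperfshow totals above (~57),
# offered to a 2-member trunk with a much larger per-member limit.
print(distribute(57.0, 2, 800.0))   # [57.0, 0.0] -> one member carries it all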

Since the distance between the switches is not very big (just adjacent floors) and the overall link usage is not very high, I would bet your performance problems are caused by the end devices, not by the switches.

+1 for clearing the stats and further monitoring with regular portstatsshow collection

+1 for disabling one of the links to see how it goes without trunking
