Fibre Channel (SAN)

New Contributor
Posts: 3
Registered: ‎05-12-2011

trunk port status reset from HEALTHY to DOWN and back

Hi,

Several times a day we get the message "FW-1424, Switch status changed from HEALTHY to DOWN", followed by the message "FW-1436, Switch status change contributing factor Marginal ports: 2 marginal ports (ports 22,23)".

Within 20 seconds to 3 minutes, the next message appears: "FW-1425, Switch status changed from DOWN to HEALTHY".

The fabric consists of 4 switches in a full-mesh topology.

Ports 22 and 23 are grouped into an ISL trunk connecting a Brocade 5000 (4 Gbps) and a Brocade 5100 (8 Gbps). The status is reported only on the Brocade 5000 side; there are no status changes on the Brocade 5100.

When we disabled the trunk, the same errors appeared on the next trunk, because a different path (switch) was then used for the traffic.

We replaced the cables and the SFPs in the Brocade 5000 switch and fixed all trunk ports at 4 Gbps, but with no success.

I have attached portstatsshow output for the 2 trunk ports, captured shortly after a failure.

Any help appreciated. Thanks

Lumir

-------

sw_23_160:admin> portstatsshow 23
stat_wtx                14801768    4-byte words transmitted
stat_wrx                938711732   4-byte words received
stat_ftx                32708       Frames transmitted
stat_frx                2010461     Frames received
stat_c2_frx             0           Class 2 frames received
stat_c3_frx             2009783     Class 3 frames received
stat_lc_rx              331         Link control frames received
stat_mc_rx              0           Multicast frames received
stat_mc_to              0           Multicast timeouts
stat_mc_tx              0           Multicast frames transmitted
tim_rdy_pri             0           Time R_RDY high priority
tim_txcrd_z             798631      Time BB credit zero (2.5Us ticks)
er_enc_in               0           Encoding errors inside of frames
er_crc                  0           Frames with CRC errors
er_trunc                0           Frames shorter than minimum
er_toolong              0           Frames longer than maximum
er_bad_eof              0           Frames with bad end-of-frame
er_enc_out              0           Encoding error outside of frames
er_bad_os               0           Invalid ordered set
er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout        16          Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable
er_other_discard        0           Other discards
er_zone_discard         0           Class 3 frames discarded due to zone mismatch
er_crc_good_eof         0           Crc error with good eof
er_inv_arb              0           Invalid ARB
open                    0           loop_open
transfer                0           loop_transfer
opened                  0           FL_Port opened
starve_stop             0           tenancies stopped due to starvation
fl_tenancy              0           number of times FL has the tenancy
nl_tenancy              0           number of times NL has the tenancy


sw_23_160:admin> portstatsshow 22
stat_wtx                66149548    4-byte words transmitted
stat_wrx                91730776    4-byte words received
stat_ftx                556020      Frames transmitted
stat_frx                187496      Frames received
stat_c2_frx             0           Class 2 frames received
stat_c3_frx             187496      Class 3 frames received
stat_lc_rx              0           Link control frames received
stat_mc_rx              0           Multicast frames received
stat_mc_to              0           Multicast timeouts
stat_mc_tx              0           Multicast frames transmitted
tim_rdy_pri             0           Time R_RDY high priority
tim_txcrd_z             798631      Time BB credit zero (2.5Us ticks)
er_enc_in               0           Encoding errors inside of frames
er_crc                  0           Frames with CRC errors
er_trunc                0           Frames shorter than minimum
er_toolong              0           Frames longer than maximum
er_bad_eof              0           Frames with bad end-of-frame
er_enc_out              0           Encoding error outside of frames
er_bad_os               0           Invalid ordered set
er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout        16          Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable
er_other_discard        0           Other discards
er_zone_discard         0           Class 3 frames discarded due to zone mismatch
er_crc_good_eof         0           Crc error with good eof
er_inv_arb              0           Invalid ARB
open                    0           loop_open
transfer                0           loop_transfer
opened                  0           FL_Port opened
starve_stop             0           tenancies stopped due to starvation
fl_tenancy              0           number of times FL has the tenancy
nl_tenancy              0           number of times NL has the tenancy
zero_tenancy            0           zero tenancy

Super Contributor
Posts: 635
Registered: ‎04-12-2010

Re: trunk port status reset from HEALTHY to DOWN and back

You have two different problems here:

1) You have to adjust your Fabric Watch settings for E_Ports and optical ports.

2) You have to find which server-storage connection is using the ISLs and, in addition, which device is sending errors through the fabric, because your errors jump from one ISL to the next. The porterrshow command can help, or use the Web Tools Status Reports to find the ports (see the first example below).

3) You can adjust your switch status policy so that 2 marginal ports do not change the switch status from HEALTHY to DOWN (see the second example below).
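
For example, roughly like this (a sketch from memory; the exact syntax depends on your FOS version, so check the Command Reference):

sw_23_160:admin> statsclear
(clear the counters first, then let the fabric run for a while under load)
sw_23_160:admin> porterrshow
(look for ports where the "disc c3" column keeps growing)

sw_23_160:admin> switchstatuspolicyshow
(displays the current DOWN/MARGINAL thresholds per contributor)
sw_23_160:admin> switchstatuspolicyset
(interactive; raise the MarginalPorts threshold so that 2 marginal ports no longer set the switch status to DOWN)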

I hope this helps. If not please let me know.

Andreas

New Contributor
Posts: 3
Registered: ‎05-12-2011

Re: trunk port status reset from HEALTHY to DOWN and back

Thanks for the good answer.

I reset the error statistics yesterday evening, and this morning there are only "disc c3" errors on

1\ 2 trunk groups (sw_21_160 ports 30,31; sw_23_160 ports 22,23),

2\ 1 server F-port (sw_21_160 port 11), and

3\ 1 storage F-port (sw_23_160 port 8).

The trunk ports went to the DOWN state several times during the night.

Attached are 2 files with porterrshow output fetched this morning.

I suppose I can suspect the server F-port and start to investigate what's wrong there.

I have tried to find some information about the "disc c3" error, but there can be many causes (ISL congestion, invalid or unreachable destinations, ...).

Ad 3\ If I understand properly, I can adjust the switch status policy so that 2 marginal ports do not change the status from HEALTHY to DOWN, but that does not solve the problem; the ports will still be reset, and only the message will no longer appear in the log. Am I right?

One more piece of info: all the switches use port-based routing; IOD is set, DLS is not set.
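
(For reference, this can be verified roughly as follows; a sketch, the exact output varies by FOS version:

sw_23_160:admin> aptpolicy
(shows the active routing policy; 1 = port-based, 3 = exchange-based)
sw_23_160:admin> iodshow
(shows whether in-order delivery of frames is set)
sw_23_160:admin> dlsshow
(shows whether dynamic load sharing is set))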

Thanks for your opinion

Lumir

Super Contributor
Posts: 635
Registered: ‎04-12-2010

Re: trunk port status reset from HEALTHY to DOWN and back

I would suggest that you add one or two additional ISLs between the switches as a short-term solution to avoid I/O errors on the servers.

Discards on an ISL can affect all servers that are sending data through that ISL! If you have no free ports, change the ISL configuration to LE mode with the portcfglongdistance command. This changes the buffer handling on the ISL, which will reduce the number of discards.
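
Roughly like this (a sketch; LE mode must be set on both ends of the ISL, and the exact syntax depends on your FOS version):

sw_23_160:admin> portdisable 22
sw_23_160:admin> portcfglongdistance 22 LE
(LE mode allocates buffers for a link of up to 10 km)
sw_23_160:admin> portenable 22
(repeat for port 23 and for the matching ports on the Brocade 5100 side)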

You have to find the device which is eating up all your buffers on the switch. Watch your error counters regularly and find the port with very high buffer-credit-zero (tim_txcrd_z) values. After that, check which device is connected there and solve the issue.
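
For example (a sketch; port 8 from your list is just one candidate, check all F-ports the same way):

sw_23_160:admin> portstatsclear 8
(clear the counters, then wait a few minutes under normal load)
sw_23_160:admin> portstatsshow 8
(compare tim_txcrd_z across ports; the port where it grows fastest points to the slow device)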

In addition, read the Admin Guide on how to use the bottleneckmon command, which can help you find your slow device.
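
From memory it looks roughly like this on FOS 6.3+ (verify the options in the Command Reference for your version):

sw_23_160:admin> bottleneckmon --enable -alert 8
(enables latency bottleneck detection with alerting on port 8)
sw_23_160:admin> bottleneckmon --show 8
(displays the measured bottleneck statistics for that port)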

Yes, if you adjust the switch status policy you will not fix the issue, only the visibility of the problem!

My intention was to give you a hint as to why you get the switch DOWN message.

The device on switch sw_21_160 port 11 is not able to handle the incoming data quickly enough!

Please provide more information about this device if you have further questions. 

If you are happy with this, please rate or mark the thread as answered. If you need further assistance, please let me know.

Andreas

Regular Contributor
Posts: 178
Registered: ‎04-21-2008

Re: trunk port status reset from HEALTHY to DOWN and back

Hi Lumir,

For your information, on the 8G platform Fabric Watch only reports c3_tx_timeout, and that is the only type of C3 discard you have to care about.

If you have an F-port with c3_tx, the devices connected there are the culprit for the c3_tx on the ISL (zoning should indicate which devices are using the ISLs to access their targets, or to be accessed by their initiators, depending on the device type). You can also have a look at portshow and the Link Reset counters; if they are not null, it indicates the link has been reset because not enough BB credit was available.
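
Roughly (a sketch; field names from memory, port 8 as an example):

sw_23_160:admin> portshow 8
(in the counter section, look at Lr_in / Lr_out and Ols_in / Ols_out;
non-zero values mean link resets, typically due to BB credit starvation)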

If possible (for instance, in a dual-fabric configuration with the issue in only one fabric), disable the F-port devices where you find c3_tx; then you should not see the issue on the ISLs again, and the root cause will be pointed out.
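
For instance (assuming the suspect is on sw_21_160 port 11 and the host still has a path through the other fabric):

sw_21_160:admin> portdisable 11
(then watch porterrshow on the ISLs; if disc c3 stops growing, that device is the root cause)
sw_21_160:admin> portenable 11
(re-enable when the test is done)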

Hope this helps,

Kind regards,

--

david

New Contributor
Posts: 3
Registered: ‎05-12-2011

Re: trunk port status reset from HEALTHY to DOWN and back-SOLVED

Isolating the server device on the port where the disc c3 errors appeared has stopped the errors on the trunk. The next step is finding the cause, but it should be simple.

Thanks to all!

Lumir

Super Contributor
Posts: 635
Registered: ‎04-12-2010

Re: trunk port status reset from HEALTHY to DOWN and back-SOLVED

Good that you were able to isolate the issue.

Check the number of LUNs and the queue depth on the server, and compare this with what your array can handle. Also check whether the array is overloaded.
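
On a Linux host, for example, a quick check could look like this (sdb is just a placeholder for one of the LUNs):

# cat /sys/block/sdb/device/queue_depth
(per-LUN queue depth; multiply by the number of LUNs behind the HBA and compare with what the array port can handle)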

Andreas

Regular Contributor
Posts: 178
Registered: ‎04-21-2008

Re: trunk port status reset from HEALTHY to DOWN and back-SOLVED

Great, Lumir!

Another step, a more fun one, would be to implement port fencing, which is able to auto-disable ports on the Fabric Watch c3_tx trigger.
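
From memory it is set up roughly like this on FOS 6.x (verify the port type and area names in the Fabric Watch Administrator's Guide):

sw_23_160:admin> portfencing --enable FOP-Port -area C3TX_TO
(auto-disables an F-port when its Fabric Watch c3_tx timeout threshold fires)
sw_23_160:admin> portfencing --show
(lists the current port fencing configuration)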

Kind regards,

--

david

Super Contributor
Posts: 635
Registered: ‎04-12-2010

Re: trunk port status reset from HEALTHY to DOWN and back-SOLVED

Hi David,

Portfencing is cool and makes it easy to find the error. And a dead server will not cause any issues on the SAN.

Maybe an issue for the users who were using the dead server, though.

Andreas

Regular Contributor
Posts: 178
Registered: ‎04-21-2008

Re: trunk port status reset from HEALTHY to DOWN and back-SOLVED

Hi Andreas,

Did you already implement portfencing in a production environment? If yes, I'm interested in hearing about your experience: which triggers/thresholds you are using, or anything interesting you noticed.

I'm about to implement it because we had a serious ISL issue with impact on numerous hosts, caused by a single faulty HBA (like Lumir).

Kind regards,

--

david
