
Contributor
Posts: 23
Registered: 06-04-2004

Storage problems on SAN

Hello,

We have had problems with the storage on our SAN for a week. We are running FOS v6.4.0c on a 48K. This is what I see on the servers:

Reservation error: Timeout
vmkernel: 4:09:16:12.443 cpu5:4119)WARNING: FS3: 4858: Failed to initialize VMFS3 distributed locking on volume 49db0c95-1ea1aaa4-f84f-003005fa7843: Timeout
vmkernel: 4:09:16:12.443 cpu5:4119)WARNING: Fil3: 1930: Failed to reserve volume f530 28 1 49db0c95 1ea1aaa4 3000f84f 4378fa05 0 0 0 0 0 0 0

In the switch log:

Switch status changed from HEALTHY to MARGINAL.
WARNING,  Switch status change contributing factor Marginal ports: 1 marginal ports. (Port(s) 152(0x98))
Switch status changed from MARGINAL to HEALTHY.

We also see such errors on the ISL.

The storage disk subsystem is connected to port 152. When I check the port:

portstatsshow -i 152

port:  152
=========
stat_wtx                397983308   4-byte words transmitted
stat_wrx                1927300580  4-byte words received
stat_ftx                556334053   Frames transmitted
stat_frx                805640217   Frames received
stat_c2_frx             0           Class 2 frames received
stat_c3_frx             805640217   Class 3 frames received
stat_lc_rx              0           Link control frames received
stat_mc_rx              0           Multicast frames received
stat_mc_to              0           Multicast timeouts
stat_mc_tx              0           Multicast frames transmitted
tim_rdy_pri             0           Time R_RDY high priority
tim_txcrd_z             23514677    Time BB credit zero (2.5Us ticks)
er_enc_in               0           Encoding errors inside of frames
er_crc                  0           Frames with CRC errors
er_trunc                0           Frames shorter than minimum
er_toolong              0           Frames longer than maximum
er_bad_eof              0           Frames with bad end-of-frame
er_enc_out              0           Encoding error outside of frames
er_bad_os               0           Invalid ordered set
er_rx_c3_timeout        0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout        1401        Class 3 transmit frames discarded due to timeout
er_c3_dest_unreach      0           Class 3 frames discarded due to destination unreachable
er_other_discard        7           Other discards
er_zone_discard         0           Class 3 frames discarded due to zone mismatch
er_crc_good_eof         0           Crc error with good eof
er_inv_arb              0           Invalid ARB
open                    0           loop_open
transfer                0           loop_transfer
opened                  0           FL_Port opened
starve_stop             0           tenancies stopped due to starvation
fl_tenancy              0           number of times FL has the tenancy
nl_tenancy              0           number of times NL has the tenancy
zero_tenancy            0           zero tenancy

The storage vendor doesn't see any problem on the storage box. I found that er_tx_c3_timeout errors mean that "the device connected to the port is not able to handle the traffic. The SAN switch cannot send any frames to the device because the device is not giving back the buffer credits (R_RDY)."
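To put the counters above in perspective, here is a rough Python sketch that parses a few lines of the portstatsshow output and converts tim_txcrd_z into seconds. The 2.5 microsecond tick length is taken from the label in the output itself; the function names and the SAMPLE excerpt layout are just illustrative assumptions, not part of any Brocade tool.

```python
import re

# Excerpt of the portstatsshow output for port 152 quoted in this post.
SAMPLE = """\
stat_ftx                556334053   Frames transmitted
tim_txcrd_z             23514677    Time BB credit zero (2.5Us ticks)
er_tx_c3_timeout        1401        Class 3 transmit frames discarded due to timeout
"""

TICK_SECONDS = 2.5e-6  # tick length as labelled in the output ("2.5Us ticks")


def parse_portstats(text):
    """Return {counter_name: int} for each 'name value description' line."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+)\s+(\d+)\s", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def summarize(counters):
    """Return (seconds spent at zero BB credit, tx timeout discards per frame sent)."""
    zero_credit_s = counters["tim_txcrd_z"] * TICK_SECONDS
    discard_ratio = counters["er_tx_c3_timeout"] / counters["stat_ftx"]
    return zero_credit_s, discard_ratio


counters = parse_portstats(SAMPLE)
zero_credit_s, discard_ratio = summarize(counters)
print(f"time at zero BB credit: {zero_credit_s:.1f} s")  # prints 58.8 s
print(f"tx timeout discards per frame sent: {discard_ratio:.2e}")
```

The counters are cumulative since they were last cleared, so these totals alone do not say how fast the port is running out of credit; two readings a known interval apart are needed for a rate.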

Has anyone already had this problem? Is it a bug in FOS 6.4.0c? We only have 4G SFPs.

Thanks.

Regards,

Pierre.

Super Contributor
Posts: 635
Registered: 04-12-2010

Re: Storage problems on SAN

Hello Pierre,

Your problem looks like an overload condition which I have seen several times.

My first question is: are the affected server and the storage on the same switch, or does the server use the ISL which you mentioned below?

Or does any other server which is zoned to the affected storage port use the ISL?

Switch status changed from HEALTHY to MARGINAL.
WARNING,  Switch status change contributing factor Marginal ports: 1 marginal ports. (Port(s) 152(0x98))
Switch status changed from MARGINAL to HEALTHY.

We've also such errors on ISL.

Have you checked that all your ports are error free, meaning no physical problems?

My primary recommendation is to check that all your ports really are error free. If they are not, please clean the cables or replace the SFPs.

The next step is to check the discard issue.

Have you mapped big LUNs (greater than 200 GB) to ESX, and do you have many ESX hosts sharing the same LUNs? In addition, if you have many guests inside your LUNs / datastores, you may have a SCSI reservation conflict issue.

In your case the storage array port is not able to handle the IOs.

Reduce the load on the affected port: lower the I/O queue depth on the server side, or move some servers away from it.

I hope this helps,

Andreas

Regular Contributor
Posts: 201
Registered: 11-24-2009

Re: Storage problems on SAN

Hi Pierre,

Unless you've just upgraded to 6.4.0c and the problem started afterwards, it looks like there is congestion on the link you mentioned. Heavy congestion may lead to frame drops and subsequent port recovery actions from the host, which show up as port flaps on the switch.

So the first thing to check is whether tim_txcrd_z is climbing on the storage port side. If it is, the storage device can't handle the traffic.

Otherwise, if you only see tim_txcrd_z climbing on the ISL, your fabric is the point of congestion. In that case you'd have to add a second link between the switches so that you have more bandwidth.
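As a rough illustration of that check, the Python sketch below takes two tim_txcrd_z readings per port, a known interval apart, and applies the storage-port-first, ISL-second rule. Everything here is an assumption made for the example: the function names, the 5% threshold, and the sample readings are hypothetical, and only the 2.5 microsecond tick comes from the portstatsshow label.

```python
TICK_SECONDS = 2.5e-6  # per the "(2.5Us ticks)" label in portstatsshow


def zero_credit_fraction(first, second, interval_s):
    """Fraction of the sampling interval spent with zero transmit BB credit."""
    delta_ticks = second - first
    if delta_ticks < 0:
        raise ValueError("counter went backwards (cleared or wrapped)")
    return delta_ticks * TICK_SECONDS / interval_s


def likely_bottleneck(storage_frac, isl_frac, threshold=0.05):
    """Check the storage port first, then the ISL.

    The 5% threshold is an arbitrary assumption for this sketch.
    """
    if storage_frac >= threshold:
        return "storage device"
    if isl_frac >= threshold:
        return "ISL / fabric"
    return "no clear congestion"


# Hypothetical readings taken 60 s apart (not measured values from this thread).
storage = zero_credit_fraction(23_514_677, 27_514_677, 60.0)
isl = zero_credit_fraction(1_000_000, 1_200_000, 60.0)
print(likely_bottleneck(storage, isl))  # prints "storage device"
```

Working from the difference between two readings matters because the counter only ever accumulates; a large total may be old history rather than current congestion.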

Hope this helps,

Linar

Regular Contributor
Posts: 201
Registered: 11-24-2009

Re: Storage problems on SAN

@andreas.bergelt doubleshot

Contributor
Posts: 23
Registered: 06-04-2004

Re: Storage problems on SAN

Andreas and Linar,

Thanks a lot for your help. I will do what you advised and try to find a solution.

Kind regards,

Pierre.
