02-12-2016 06:02 AM - edited 02-12-2016 06:10 AM
In our environment we have an IBM DS8870 storage box connected to a DCX 8510-8 on each fabric with 3x 8G short-distance links. The DS8870 reports a high number of zero-send-buffer-credit events (receive, from the switch's point of view) on one port per fabric (switch port 3/36, index 292). According to IBM's Tivoli Storage Productivity Center (TPC), the DS8870 has no send credits available 5% of the time.
Does it make any sense at all to use portbuffershow for troubleshooting here? I've browsed the forum and see that people use this command to check issues with ISLs rather than regular F-ports.
> portbuffershow 3/36
User  Port  Lx    Max/Resv  Avg Buffer Usage & FrameSize  Buffer  Needed   Link      Remaining
Port  Type  Mode  Buffers   Tx            Rx              Usage   Buffers  Distance  Buffers
----  ----  ----  -------   ----------------------------  ------  -------  --------- ----------
 34   F     -     8         - (  64)      1(1684)         8       -        -
 38   F     -     45        4( 888)       4( 776)         45      -        -
168   F     -     8         1(1340)       - ( 344)        8       -        -
169   F     -     8         - ( - )       - ( - )         8       -        -
173   F     -     8         - ( 964)      2(2184)         8       -        -
290   F     -     8         - ( 112)      2(2892)         8       -        -
292   F     -     8         1(1868)       2(2208)         8       -        -
                                                                                     4730

> portloginshow 3/36
Type  PID     World Wide Name          credit  df_sz  cos
=====================================================
 fe   6db6c0  50:05:07:63:08:10:84:5d  40      2048   c   scr=0x3
 ff   6db6c0  50:05:07:63:08:10:84:5d  12      2048   c   d_id=FFFFFA
 ff   6db6c0  50:05:07:63:08:10:84:5d  12      2048   c   d_id=FFFFFC

> portstatsshow 3/36
stat_wtx               2030258415  4-byte words transmitted
stat_wrx               924545779   4-byte words received
stat_ftx               82743311    Frames transmitted
stat_frx               693948823   Frames received
stat_c2_frx            0           Class 2 frames received
stat_c3_frx            693948823   Class 3 frames received
stat_lc_rx             0           Link control frames received
stat_mc_rx             0           Multicast frames received
stat_mc_to             0           Multicast timeouts
stat_mc_tx             0           Multicast frames transmitted
tim_rdy_pri            0           Time R_RDY high priority
tim_txcrd_z            0           Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc  0- 3:  0  0  0  0
tim_txcrd_z_vc  4- 7:  0  0  0  0
tim_txcrd_z_vc  8-11:  0  0  0  0
tim_txcrd_z_vc 12-15:  0  0  0  0
er_enc_in              0           Encoding errors inside of frames
er_crc                 0           Frames with CRC errors
er_trunc               0           Frames shorter than minimum
er_toolong             0           Frames longer than maximum
er_bad_eof             0           Frames with bad end-of-frame
er_enc_out             0           Encoding error outside of frames
er_bad_os              0           Invalid ordered set
er_pcs_blk             0           PCS block errors
er_rx_c3_timeout       0           Class 3 receive frames discarded due to timeout
er_tx_c3_timeout       0           Class 3 transmit frames discarded due to timeout
er_unroutable          0           Frames that are unroutable
er_unreachable         0           Frames with unreachable destination
er_other_discard       0           Other discards
er_type1_miss          0           frames with FTB type 1 miss
er_type2_miss          0           frames with FTB type 2 miss
er_type6_miss          0           frames with FTB type 6 miss
er_zone_miss           0           frames with hard zoning miss
er_lun_zone_miss       0           frames with LUN zoning miss
er_crc_good_eof        0           Crc error with good eof
er_inv_arb             0           Invalid ARB
er_single_credit_loss  0           Single vcrdy/frame loss on link
er_multi_credit_loss   0           Multiple vcrdy/frame loss on link
phy_stats_clear_ts     02-11-2016 CET Thu 14:01:30  Timestamp of phy_port stats clear
lgc_stats_clear_ts     02-11-2016 CET Thu 14:01:30  Timestamp of lgc_port stats clear

> portcfgshow 3/36
Area Number:              182
Octet Speed Combo:        1(16G|8G|4G|2G)
Speed Level:              AUTO(SW)
AL_PA Offset 13:          OFF
Trunk Port                ON
Long Distance             OFF
VC Link Init              OFF
Locked L_Port             OFF
Locked G_Port             OFF
Disabled E_Port           OFF
Locked E_Port             OFF
ISL R_RDY Mode            OFF
RSCN Suppressed           OFF
Persistent Disable        OFF
LOS TOV enable            OFF
NPIV capability           ON
QOS Port                  AE
Port Auto Disable:        OFF
Rate Limit                OFF
EX Port                   OFF
Mirror Port               OFF
SIM Port                  OFF
Credit Recovery           ON
F_Port Buffers            OFF
E_Port Credits            OFF
Fault Delay:              0(R_A_TOV)
NPIV PP Limit:            126
CSCTL mode:               OFF
D-Port mode:              OFF
D-Port over DWDM:         OFF
Compression:              OFF
Encryption:               OFF
FEC:                      ON
Non-DFE:                  OFF

> portshow 3/36
portIndex: 292
portName: IBM
portHealth: HEALTHY
Authentication: None
portDisableReason: None
portCFlags: 0x1
portFlags: 0x1024b03 PRESENT ACTIVE F_PORT G_PORT U_PORT LOGICAL_ONLINE LOGIN NOELP LED ACCEPT FLOGI
LocalSwcFlags: 0x0
portType: 24.0
portState: 1 Online
Protocol: FC
portPhys: 6 In_Sync   portScn: 32 F_Port
port generation number: 106
state transition count: 2
portId: 6db6c0
portIfId: 4332002e
portWwn: 2e:24:00:27:f8:56:82:3f
portWwn of device(s) connected:
    50:05:07:63:08:10:84:5d
Distance: normal
portSpeed: N8Gbps
FEC: Inactive
Credit Recovery: Inactive
Aoq: Inactive
FAA: Inactive
F_Trunk: Inactive
LE domain: 0
FC Fastwrite: OFF
Interrupts:    0   Link_failure:  0   Frjt:  0
Unknown:       0   Loss_of_sync:  0   Fbsy:  0
Lli:           0   Loss_of_sig:   0
Proc_rqrd:     0   Protocol_err:  0
Timed_out:     0   Invalid_word:  0
Rx_flushed:    0   Invalid_crc:   0
Tx_unavail:    0   Delim_err:     0
Free_buffer:   0   Address_err:   0
Overrun:       0   Lr_in:         0
Suspended:     0   Lr_out:        0
Parity_err:    0   Ols_in:        0
2_parity_err:  0   Ols_out:       0
CMI_bus_err:   0
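For anyone who wants to watch these counters over time rather than eyeball them, here is a minimal Python sketch that pulls the simple "name value description" lines out of saved portstatsshow text and reports which credit- and discard-related counters are non-zero. The function names, the WATCHED list and the sample excerpt are illustrative assumptions, not part of any Brocade tooling; the sketch assumes the output has one counter per line and deliberately skips the per-VC and timestamp lines.

```python
import re

# Matches lines such as "tim_txcrd_z 0 Time TX Credit Zero (2.5Us ticks)".
# Per-VC lines ("tim_txcrd_z_vc 0- 3: ...") and timestamp lines do not match.
STAT_LINE = re.compile(r"^\s*(\w+)\s+(\d+)\s+(.*)$")

# Counters chosen for illustration only; adjust to taste.
WATCHED = (
    "tim_txcrd_z",
    "er_rx_c3_timeout",
    "er_tx_c3_timeout",
    "er_single_credit_loss",
    "er_multi_credit_loss",
    "er_crc",
)


def parse_portstats(text: str) -> dict:
    """Return {counter_name: integer_value} for every simple counter line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


def nonzero_watched(stats: dict) -> dict:
    """Return only the watched counters whose value is greater than zero."""
    return {name: stats[name] for name in WATCHED if stats.get(name, 0) > 0}


# Excerpt of the portstatsshow 3/36 output posted above.
SAMPLE = """\
stat_ftx 82743311 Frames transmitted
stat_frx 693948823 Frames received
tim_txcrd_z 0 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 0 0
er_crc 0 Frames with CRC errors
er_tx_c3_timeout 0 Class 3 transmit frames discarded due to timeout
er_single_credit_loss 0 Single vcrdy/frame loss on link
phy_stats_clear_ts 02-11-2016 CET Thu 14:01:30 Timestamp of phy_port stats clear
"""

if __name__ == "__main__":
    parsed = parse_portstats(SAMPLE)
    print(parsed["stat_frx"], nonzero_watched(parsed))
```

With the values posted here every watched counter is zero, so the switch side shows no sign of a transmit stall or credit loss on 3/36. Note that tim_txcrd_z describes the switch port's transmit direction, so it cannot confirm or refute what the DS8870 reports about its own send credits.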
02-12-2016 07:24 AM
It seems as though what you are asking is how to confirm that Tivoli isn't lying to you.
How about assuming, just for the sake of argument, that it isn't.
Then you might ask:
Is the utilization between the six storage paths balanced?
If not, then is that a zoning issue (host(s) are only zoned to that one path per CPC)?
Or an MPIO issue (host(s) aren't properly configured to round-robin or otherwise use all available paths)?
Or you might want to know if you can assign more buffer credits to those ports...you can.
Or you might ask if the condition is really impacting latency...maybe it isn't. You can see a whole lot of buffer-credit-zeroes and still not have a performance problem that needs to be addressed. That counter is incremented every 2.5 µs, so it takes 400 ticks to equal a single millisecond of delay for one I/O.
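The tick arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 2.5 µs tick comes from the portstatsshow label for tim_txcrd_z, the function names are made up for this example, and the 300-second sample interval in the usage line is a hypothetical value, not something taken from TPC.

```python
TICK_SECONDS = 2.5e-6  # one zero-credit tick, per the "(2.5Us ticks)" label


def ticks_to_seconds(ticks: int) -> float:
    """Total time represented by a number of zero-credit ticks."""
    return ticks * TICK_SECONDS


def zero_credit_fraction(ticks: int, interval_seconds: float) -> float:
    """Share of an interval spent at zero credit, given the tick delta."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return ticks_to_seconds(ticks) / interval_seconds


def ticks_for_fraction(fraction: float, interval_seconds: float) -> int:
    """Tick delta that corresponds to a given zero-credit share of an interval."""
    return round(fraction * interval_seconds / TICK_SECONDS)


if __name__ == "__main__":
    # 400 ticks add up to one millisecond.
    print(ticks_to_seconds(400))
    # Hypothetical: 5% of a 300-second sample interval, expressed in ticks.
    print(ticks_for_fraction(0.05, 300))
```

Keep in mind that tim_txcrd_z counts the switch port's own transmit side, while the TPC figure describes the DS8870's send credits, so the two numbers measure opposite directions of the same link.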
02-15-2016 11:51 AM
thank you very much for your suggestions.
It is really not about me not trusting Tivoli. Promise :-)
Checking the zoning is actually what I started with, because not so long ago, with another DS8870, I ran into a case like the one you describe: there was a big discrepancy in which hosts the particular DS8870 ports were zoned to, so the ports served very different workloads.
In this current case, however, the workload seems to be quite evenly distributed among the DS8870 ports, but only one of the ports on each fabric exhibits a lack of credits (zero send credits 4% to 8% of the time according to TPC). If only I were able to attach a screenshot here, I could even prove it...
I was thinking that maybe, for some odd reason, one of the ports on a fabric did not receive a couple of R_RDY (Receiver Ready) primitives from the switch, and since that moment it has been running without all the credits it could use.
I suppose this is something that cannot be seen easily on the switch, and only disabling and re-enabling the switch port would show whether this theory is right?
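To make the lost-credit theory concrete, here is a toy Python model of buffer-to-buffer credit accounting on the sending side. It is a sketch of the general mechanism, not of DS8870 or Brocade internals: the class and function names are invented for illustration, and the figure of 8 credits is an assumption taken from the Max/Resv Buffers column shown for port 292 in portbuffershow.

```python
class BBCreditSender:
    """Sender-side view of buffer-to-buffer credits on one link."""

    def __init__(self, negotiated_credits: int) -> None:
        self.negotiated = negotiated_credits
        self.available = negotiated_credits

    def send_frame(self) -> bool:
        """Consume one credit; return False when the sender is stalled."""
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def receive_r_rdy(self) -> None:
        """Each R_RDY that arrives gives one credit back."""
        if self.available < self.negotiated:
            self.available += 1

    def link_reset(self) -> None:
        """A new login (for example after a port toggle) restores all credits."""
        self.available = self.negotiated


def usable_credits_after_loss(negotiated: int, lost_r_rdy: int) -> int:
    """Credits left once every frame is acknowledged except the lost R_RDYs."""
    sender = BBCreditSender(negotiated)
    sent = 0
    while sender.send_frame():
        sent += 1
    for _ in range(max(0, sent - lost_r_rdy)):
        sender.receive_r_rdy()
    return sender.available


if __name__ == "__main__":
    # Assumed 8 credits, two R_RDYs never arrive: the sender is stuck at 6.
    print(usable_credits_after_loss(8, 2))
```

In this model nothing short of link_reset() brings the missing credits back, which is why toggling the port is a reasonable test of the theory. Real links may also recover credits through a credit recovery feature when it is active on both ends; the portshow output above lists Credit Recovery as Inactive for this port.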
02-15-2016 01:00 PM
When you mention that it is one port per fabric, that has me thinking that it isn't a physical connection issue; the switch port stats seem to back that up.
You are suggesting that you might have 'lost' credits, and that is plausible. As you suggest, you could toggle one of the offending ports to force the storage to log back in and re-negotiate the connection. You could try that on one port/fabric to see if it makes a difference.
Is there a symmetry between the problem ports in the two fabrics? Are they both first, last, or whatever?
In the 8510, are the ports from the 8870 distributed among several blades, or are they bunched together? I was wondering if the odd ports might be in a different port group/blade and perhaps it was the hops between target and initiator that were different for that one (those two...). You could manually assign more buffer credits to the port; and I have seen storage units that did recommend doing that; but that would still leave the question of why that pair and not the others.
You do mention three ports per fabric...Is it that there are two ports to one CPC and one port to the other CPC in each fabric and that the single port is just showing a higher traffic density?
02-18-2016 01:55 PM
thanks for further thoughts!
As for the symmetry, there actually is some on the SAN side: each problem port is connected to port 3/36 of a different switch. The three DS8870 ports per fabric each connect to a different blade, and the servers that are zoned to them... that's something I still have to check. Are you thinking here about the possibility of a credit loss on the so-called back-end ports of the switch?
On the DS8870 the ports are called I0202 and I0602. As far as I know, these are not on the same Host Adapter card, but I don't know enough about the inner workings of the DS8870 to say whether they can share some internal buffer, link, path, or ASIC, or how they are assigned to a Central Processor Complex (that's what you meant by CPC, right?). I will have to force the guy who is taking care of the DS8870 to think about it. The more I think about it, the more suspicious it seems that both troubled ports connect to 3/36 switch ports.
Right now I am putting my efforts into convincing the server side to accept an experiment with a port disable/enable. They have a bad history of some Oracle servers panicking when seeing a path come back alive (strange but true), so it is not as easy as it should be.
03-23-2016 11:48 AM
I have seen this before. My environment is exactly like yours: DS8870, Brocade SAN, TPC alerting on zero send buffer credits.
I found an old server with a very old HBA firmware version. The server itself had a very low workload on the SAN. In my case, fortunately, the server was decommissioned and the TPC messages stopped.
It was an old Unisys Wintel server running a very old Emulex HBA.
Emulex LP9002 FV3.93A0 DV22.214.171.124
I would start by checking whether there is a similar situation in your environment.
Wish you luck!
04-21-2016 09:00 AM
So your culprit server might have been acting as the dreaded 'slow drain device'.
Admittedly, a 2G HBA can be considered somewhat slow in today's SAN.
That's an interesting clue. I don't think we have any links working below 8G in the environment now, but I can easily double-check. Unfortunately it could also be just some misbehaving obsolete firmware/driver on the server side, in which case it will be extremely difficult to find out from a storage admin's perspective.
Anyway, thank you very much!
04-21-2016 09:16 AM
->Unfortunately it could also be just some misbehaving obsolete firmware/driver on a server side, in which case it will be extremely difficult to find out from storage admin's perspective.
SAN Health can help you quickly identify the most common HBA drivers and firmware.
04-22-2016 04:43 AM - edited 04-22-2016 04:55 AM
But SAN Health would not show anything more than a nodefind would, right? "Only" without that much effort...