09-28-2015 01:37 PM
Having experienced some issues lately, I've started digging into the FOS metrics and came across this one. The topologyshow output was interesting…the Bandwidth IN to Bandwidth OUT ratio seems quite high. Of the four domains we have, three are showing Bandwidth Demand: 12300% and the fourth is showing 4100%.
When I watched the graphs for the period when our storage array threw CRC errors, the BBCreditZero errors spiked on the RealTime Network Advisor chart. I also looked at portstats on the storage port:
tim_txcrd_z 3492915 Time TX Credit Zero (2.5Us ticks)
I read that the stat “tim_txcrd_z” can increase “if this port is an Storage Port, the Storage may not be able to handle the I/O’s in an acceptable time. The port runs out of BB-Credits.”
Tonight I'll be running a 12-hour SAN Health Check, from 5pm to 5am. The previous Health Checks I ran lasted only an hour or so.
What are your thoughts on tim_txcrd_z and Bandwidth Demand being so high? Anything else to look out for?
09-28-2015 01:59 PM
The bandwidth demand in topologyshow only tells you the ratio between the ISL ports (with their speeds) and the F-Ports (also with their speeds). So if you have 2x16G ISLs and 4 end devices each connected at 8G, it would be 100%. This metric doesn't say much, because it does not take into account how much traffic from these devices will really pass over the ISLs. For exchange-based routing, it's just all devices mapped to all the ISLs.
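As a rough illustration of that ratio, here is a minimal Python sketch. It assumes the figure is simply total F-Port bandwidth divided by total ISL bandwidth, which is how the example above works out; the function name is made up for illustration and this is not how FOS itself computes the value.

```python
def bandwidth_demand_percent(f_port_speeds_gbps, isl_speeds_gbps):
    """Total F-Port bandwidth as a percentage of total ISL bandwidth.

    Assumption: this mirrors the simple ratio described in the post,
    not the actual FOS implementation.
    """
    isl_total = sum(isl_speeds_gbps)
    if isl_total == 0:
        raise ValueError("at least one ISL with non-zero speed is required")
    return 100.0 * sum(f_port_speeds_gbps) / isl_total


# The example from the post: 4 end devices at 8G behind 2 ISLs at 16G.
print(bandwidth_demand_percent([8, 8, 8, 8], [16, 16]))  # 100.0
```

The number grows with every end device you attach, regardless of whether that device ever sends a frame across an ISL, which is why a high value on its own is not alarming.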
As for tim_txcrd_z, it CAN be a problem, depending on how much you see and in what pattern (slowly but steadily increasing? short bursts? ...).
CRC errors, on the other hand, are a very clear indication of a physical port problem if you see them coming out of a storage array port.
I recommend ordering a SAN health check, including a performance analysis, from your maintenance service provider. Besides their findings about the current status of the fabrics, you also get a feel for what to expect and what to look for. And of course, also have a look at the manuals and the best practice / fabric resiliency guide.
09-28-2015 02:05 PM - edited 09-28-2015 02:19 PM
I had a feeling on the topologyshow that it might be something along those lines, thanks for clarifying.
With regard to tim_txcrd_z, it is steadily increasing. I reset the counters on two ISL ports and checked again after 3 minutes: I had 46,000 on each port.
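To put that count in perspective, here is a rough conversion into time. It assumes each tick is 2.5 microseconds, as the "(2.5Us ticks)" label in the portstats output suggests, and that the counter simply accumulates ticks spent at zero TX credit; the function name is made up for illustration.

```python
TICK_SECONDS = 2.5e-6  # assumed from the "(2.5Us ticks)" label in portstats


def credit_zero_fraction(ticks, interval_seconds):
    """Fraction of the sampling interval spent at zero TX credit."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    return ticks * TICK_SECONDS / interval_seconds


ticks = 46_000        # counter value per ISL port after the reset
interval = 3 * 60     # 3 minutes, in seconds
fraction = credit_zero_fraction(ticks, interval)
print(f"{ticks * TICK_SECONDS:.3f} s at zero credit, {fraction:.4%} of the interval")
```

Under those assumptions, 46,000 ticks is roughly 0.115 seconds out of 180, or about 0.06% of the interval. That is only an average, so it says nothing about whether the time came in short bursts or was spread evenly.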
No CRC errors are logged, if you mean er_crc and er_crc_good_eof. Would disc_c3 (from porterrshow) also be related to this?
09-28-2015 04:23 PM