09-26-2017 04:09 AM
Hi, all four ports from one of our AIX VIO servers are showing rising tim_txcrd_z counters.
We have swapped the fibre out. The SFP looks OK, and the HDS port is serving other servers with no issues.
I have a feeling the HBAs on the server aren't coping with the workload at particular times, rather than it being a faulty component.
Anyone have any thoughts?
Some diagnostics below; running FOS 7.4.1b on a DCX 8510-4.
One port in particular is suffering - small excerpt below:
Category(Rule Count)|RepeatCount|Rule Name                      |Execution Time   |Object      |Triggered Value(Units)|
Port Health(683)    |20         |defALL_OTHER_F_PORTSC3TXTO_20_C|09/25/17 02:19:52|F-Port 2/25 |6472 Timeouts         |
                    |           |                               |                 |F-Port 2/25 |4122 Timeouts         |
                    |           |                               |                 |F-Port 2/25 |4814 Timeouts         |
                    |           |                               |                 |F-Port 2/25 |8941 Timeouts         |
                    |           |                               |                 |F-Port 2/25 |6752 Timeouts         |
C3TXTO(Timeouts) - 2/25(1245171) - - - 2/25(5267) 2/25(56501)
tim_txcrd_z 1110 Time TX Credit Zero (2.5Us ticks)
tim_txcrd_z_vc 0- 3: 0 0 0 1110
tim_txcrd_z_vc 4- 7: 0 0 0 0
tim_txcrd_z_vc 8-11: 0 0 0 0
tim_txcrd_z_vc 12-15: 0 0 0 0
tim_latency_vc 0- 3: 1 1 1 3
tim_latency_vc 4- 7: 1 1 1 1
tim_latency_vc 8-11: 1 1 1 1
tim_latency_vc 12-15: 1 1 1 1
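For reference, a minimal Python sketch (not a Brocade tool) of how to turn a tim_txcrd_z delta between two portstatsshow samples into time spent at zero TX credit, using the 2.5 us tick size shown above; the 60 s poll interval in the example is an assumption:

TICK_SECONDS = 2.5e-6  # one tim_txcrd_z tick = 2.5 us at zero TX credit

def credit_starvation_pct(ticks_start, ticks_end, interval_s):
    # Fraction of the polling interval the port spent with zero TX credit
    starved_s = (ticks_end - ticks_start) * TICK_SECONDS
    return 100.0 * starved_s / interval_s

# Example: the counter grew by 1110 ticks over an assumed 60 s poll
print(f"{credit_starvation_pct(0, 1110, 60.0):.4f}% of interval at zero credit")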
09-26-2017 01:54 PM
Yes, there is an issue, but this is one small snapshot of one port on one switch in a fabric. What I'm getting at is that this symptom is likely representative of a much larger traffic pattern across your whole fabric. It would be very difficult, if not impossible, to make a decent diagnosis of where the issue lies without a great deal more data than is presented here. All I can say is that yes, you have a data traffic problem, and it's serious.
Quite often in cases like this, the device(s) discarding Class 3 frames are not the cause; they are the effect. It may be that other, slower traffic transiting the fabric is holding back the faster traffic coming from your AIX host port(s), much as a group of cement trucks (2 or 4 Gb traffic) holds up the sedans and coupes that could go faster (8 or 16 Gb traffic).
Brocade switches play no favorites with respect to speed in the default configuration. It is up to the admin to segregate the slower traffic, provide more ISLs, use traffic-shaping tools (QoS), or upgrade legacy devices to faster adapters. It's also a function of where and how the ISLs are placed on your 8510.
For a simple example: a legacy 2 Gb HBA ingresses on edge director slot/port 2/0. The only ISLs in use are on slot 10, ports 28-31; ISL ingress on the core switch is at slot 4, ports 0-3, and the ultimate destination target is on slot 10, port 5. Each 2 Gb exchange from host to target 'touches' (transits) many ASICs in each director: from the edge slot 2 ASIC to the edge core blade slot (chassis dependent), from there to the edge ISL slot/port and out of the edge switch; then into the core switch on slot 4, across the core switch's core blade slot, and finally out the egress on slot 10 of the core switch to the target. That's a lot of ASICs to transit!
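To make that concrete, here is a toy Python walk-through of the path just described (the hop labels paraphrase the example above; real hop counts depend on blade and chassis layout):

path = [
    "edge slot 2 ASIC (host ingress, 2 Gb HBA)",
    "edge core blade ASIC (chassis backplane)",
    "edge slot 10 ASIC (ISL egress, ports 28-31)",
    "core slot 4 ASIC (ISL ingress, ports 0-3)",
    "core core blade ASIC (chassis backplane)",
    "core slot 10 ASIC (target egress, port 5)",
]
for hop, asic in enumerate(path, start=1):
    print(f"hop {hop}: {asic}")
print(f"total: {len(path)} ASIC transits per exchange")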
All that time, the TX-credit-zero counter is ticking while the port waits for free buffers down the line. If the hold timer runs out, the switch has no choice but to discard the Class 3 frame(s), and the exchange must start over. Brocade has remarkable cut-through routing speed, which is why it's best in class. However, like a high-performance Ferrari, it requires regular maintenance, tuning, and good technique.
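As a rough sketch of that discard decision (the hold-time value here is purely an illustrative placeholder; the actual timeout is platform- and configuration-dependent):

HOLD_TIME_S = 0.500  # illustrative placeholder, not your platform's value

def should_discard(queued_at_s, now_s, tx_credits):
    # Discard only when no TX credit is available and the frame has
    # already waited longer than the hold time
    if tx_credits > 0:
        return False
    return (now_s - queued_at_s) > HOLD_TIME_S

# Frame queued at t=0 s, still no credit at t=0.6 s -> discarded (C3TXTO++)
print(should_discard(0.0, 0.6, tx_credits=0))  # True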
From here, I suggest you read up on SAN Health reporting. Download the utility, and run it to capture each switch in the fabric. http://www.brocade.com/en/support/support-tools/support-download-san-health-diagnostics-capture.html
SAN Health is the best and easiest tool to use for traffic and latency issues. You can also gather quite a bit more info from the MAPS Fabric Performance Impact output (http://www.brocade.com/content/html/en/deployment-guide/brocade-san-resiliency-admin-dp/GUID-9EF972C8-F33B-43B8-9CFC-BCC678071B2D.html), which is a more detail-oriented report but not as easy to view and decode.
Designing quality fabrics becomes complex as port count and device diversity increase. There are many tools and reports that will help resolve this, and if you have a vendor contact who has taken the Brocade Fabric Design course, they can help you adjust features for better throughput.
09-27-2017 12:47 AM
I assume you also see no errors in portstatsshow / portstats64show for any of the four ports?
Do you have any errors logged on the HBA, or is it clean too?
If the physical layer is OK, then it is either an HBA/driver issue or the server is running out of CPU/memory.
The number of discards is large for an F-Port over a single day, so it really looks like a server/HBA issue.
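Some quick arithmetic on the mapsdb history above, assuming the 1,245,171 C3TXTO count for port 2/25 covers a full 24-hour window:

daily_discards = 1_245_171            # C3TXTO on port 2/25 from mapsdb history
per_second = daily_discards / 86_400  # seconds in a day
print(f"~{per_second:.1f} Class 3 discards per second, sustained")  # ~14.4/s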
Can you share the portlogshow output so we can see how many credits the HBA has? It may be time to upgrade the HBA to something with more buffer memory.
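For sizing, a back-of-envelope Python sketch of the standard buffer-credit rule of thumb (credits needed = frames in flight over the round trip); the link speed, frame size, and latency values here are illustrative assumptions, not figures from this thread:

def credits_needed(rtt_s, link_gbps, frame_bytes=2112):
    # Minimum BB credits to keep streaming: frames in flight over the RTT.
    # 8b/10b encoding at these speeds means 10 line bits per data byte.
    serialization_s = (frame_bytes * 10) / (link_gbps * 1e9)
    return max(1, round(rtt_s / serialization_s))

# Example: 8 Gb link, full-size frames, 100 us round trip (all illustrative)
print(credits_needed(100e-6, 8.0))  # -> 38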