03-19-2015 06:02 AM
I've got a bit of a headscratcher here:
I have the following phenomenon on an AIX TSM server:
A TSM admin is complaining about poor backup storage pool performance.
The performance isn't terrible (+/- 150 MB/s), but the TSM admin is expecting something more along the lines of 400 MB/s.
I was asked to have a look on the SAN side to identify any possible underlying SAN issues that could cause this.
Data is being copied from library A to library B via 2 HBAs on TSM server X.
Tape drives are LTO6.
The data traverses over ISLs and DWDMs.
I have already checked for congestion on the ISLs and DWDMs: there is none, and the lines are not saturated at all.
The TSM server has 2 8G HBAs dedicated to this traffic (tape traffic only). There is also no congestion/saturation on the 2 fixed-speed 8G SAN ports that connect to these 2 HBAs.
However, when I check the portstats for the 2 SAN ports that connect to these 2 TSM server HBAs, I see a constant, steep increase in the "tim_txcrd_z" counter value.
When I look at the portstats of the ports connecting to the tape drives themselves, I see no issues whatsoever, counters are all very clean.
To completely rule out the possibility of buffer credit starvation on the ISL links, during a maintenance window, I have doubled the amount of buffer credits:
- With the portcfgeportcredits command for normal ISLs
- With the portcfglongdistance command for LS-ports (by doubling the distance)
This did not change anything at all; the situation is still exactly the same.
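For reference, a rough way to sanity-check whether the credits on a long-distance link cover the distance is to compare the round-trip time with the time needed to put one frame on the wire. The sketch below is only an estimate under stated assumptions: about 5 us/km one-way delay in fiber, full-size 2148-byte frames, and roughly 800 MB/s usable on an 8G link. The function name `credits_needed` is hypothetical, and the switch may calculate its own value differently.

```python
import math

FIBER_DELAY_US_PER_KM = 5.0   # assumption: ~5 microseconds per km, one way
FULL_FRAME_BYTES = 2148       # assumption: maximum-size FC frame

def credits_needed(distance_km, link_mb_per_s, frame_bytes=FULL_FRAME_BYTES):
    """Estimate BB credits needed to keep a link of this length busy."""
    # A credit only comes back after the frame has travelled there
    # and the R_RDY has travelled back.
    round_trip_us = 2 * distance_km * FIBER_DELAY_US_PER_KM
    # bytes / (MB/s) gives microseconds per frame on the wire.
    frame_time_us = frame_bytes / link_mb_per_s
    return max(1, math.ceil(round_trip_us / frame_time_us))

# Example: 50 km at 8G with full-size frames.
print(credits_needed(50, 800))
```

Smaller average frame sizes need proportionally more credits for the same distance, so the real requirement depends on the traffic mix.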
I don't feel that there is much more that I could investigate from a SAN perspective.
I feel like the problem is probably situated at server level (AIX OS level or TSM application level).
However, the only observation that I can make on the SAN side is that the tim_txcrd_z counter constantly goes up with big numbers on the F_Ports that connect to the 2 server HBAs that handle the tape traffic, and I can't really explain why.
Could this be caused by a bottleneck on the TSM (AIX) server itself?
Any ideas are welcome :)
03-19-2015 08:08 AM
I have encountered the same type of problem within our data center, with no long-distance ISLs involved. The problem turned out to be one of the tape drives being a wonderful slow-draining device. In my case it finally resulted in discarded C3 frames, which helped to isolate the problem. It would be something to dig into.
03-23-2015 08:04 AM
You might want to have a look at additional options you can set on that port.
I had to add the ostp option (open systems tape pipelining) and also had to set the fastwrite option (on my 360k SAN over IP connection).
Here you can find additional info.
I need to go, but will check back tomorrow to see if my answer was what you were looking for or if additional questions arise.
04-02-2015 12:46 AM
Thanks a lot for the suggestions.
I've checked the information about ostp and fastwrite, but it seems that these are only to be used when using FCIP for extending the fabric?
We're not using FCIP though, we are using DWDM lines, so I don't think it can help in this case.
04-08-2015 08:03 AM
A constant increase of the tim_txcrd_z counter on the ports connected to the TSM HBAs indicates that there are periods during which the switch cannot send frames to the HBA because no BB credits are available. If this gets worse, it could lead to Tx discards on the port.
You can ask the TSM admin to perform an independent backup to each of the tapes to check the throughput achieved during write operations, and the opposite operation to check it during reads. If he gets better performance that way, I would point at the TSM server itself as the bottleneck, since it has to read from the tape in library A and write it to the tape in library B.
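To put a number on the tim_txcrd_z growth, it can help to take two readings a known interval apart and convert the difference into the share of time the port spent with zero transmit credits. The sketch below assumes one counter tick equals 2.5 microseconds, which is how portstatsshow labels this counter as far as I know; verify that on your FOS version. The function name and the sample readings are made up for illustration.

```python
TICK_SECONDS = 2.5e-6  # assumption: one tim_txcrd_z tick = 2.5 microseconds

def zero_credit_fraction(count_start, count_end, interval_seconds):
    """Fraction of the interval the port had no Tx BB credits."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    ticks = count_end - count_start
    if ticks < 0:
        # The counter wrapped or was cleared between the two readings.
        raise ValueError("counter went backwards; take a fresh pair of samples")
    return ticks * TICK_SECONDS / interval_seconds

# Hypothetical readings taken 60 seconds apart on one F_Port.
fraction = zero_credit_fraction(1_000_000, 5_000_000, 60)
print(f"{fraction:.1%} of the interval at zero Tx credit")
```

A port that sits at zero credit for a large share of the time while its link is far from saturated suggests the attached device is returning credits slowly, which fits the idea of looking at the server side.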