09-17-2012 09:56 AM
We are in the process of moving from our old 2 Gb Brocade FC switches to new 8 Gb switches. All has gone well until now.
After moving one of our AIX production database servers, we started getting Oracle errors on the DB server, and the er_other_discard counter on the switch port is increasing by 1 approximately every 2 to 3 seconds.
Might this be related to the errors on the DB server, and should I be concerned?
All other counters remain zero (except tx and rx).
Any suggestions would help.
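To tell a steady climb from occasional bursts, two readings of the counter taken a known interval apart can be compared. Below is a minimal Python sketch, assuming the counter lines follow the "name value description" layout of `portstatsshow` output. The sample text, the numbers, and the helper names `read_counter` and `discards_per_second` are made up for illustration.

```python
def read_counter(portstats_text, name="er_other_discard"):
    """Return the integer value of one counter from portstatsshow-style text.

    Assumes each counter line starts with the counter name followed by its value.
    """
    for line in portstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == name:
            return int(fields[1])
    raise KeyError(name)


def discards_per_second(first_value, second_value, seconds_apart):
    """Average increase per second between two readings of the same counter."""
    return (second_value - first_value) / seconds_apart


# Hypothetical readings taken 60 seconds apart (not real switch output).
sample_a = """\
er_crc                  0           Frames with CRC errors
er_other_discard        1200        Other discards
"""
sample_b = """\
er_crc                  0           Frames with CRC errors
er_other_discard        1224        Other discards
"""

rate = discards_per_second(read_counter(sample_a), read_counter(sample_b), 60)
print(f"{rate:.2f} discards per second")
```

A rate that stays flat while the host is idle and busy alike would point away from load and towards something periodic, such as retries down a path that no longer exists.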
09-17-2012 11:59 AM
This probably won't help, but I had a similar experience when upgrading from 6.2.x to 6.4.x on my FC switches. I know the cause of the issue was not new to that upgrade, but I swear there must be something more sensitive in the newer code...
AIX in general is super picky and knows too much about what it is connected to. Typically if we do anything that is supposed to be transparent in our environment, AIX still usually complains.
My problem was that my AIX servers were complaining about disk errors and link errors. Looking at the errors on my switches, there was a high number of C3 discards due to timeout (so I was told - I sent it to support). Our ESX boxes also started complaining after a few days. The issue was a "slow drain" caused by a problem on a Windows blade server that was ISLed into the director where the AIX servers connect; I think it turned out to be an outdated driver. I don't know if you have any ISLs or if they are stand-alone switches, or if it is just the one port your server is plugged into, but it is something to consider anyway. Unfortunately, I am not good at troubleshooting.
09-17-2012 09:46 PM
I think Annette's suggestion could definitely help - at the very least, looking for a problem beyond that one port is a good idea. It may be that something else in the fabric is causing issues for your AIX server.
Did you move the AIX box from a 2G to an 8G switch? If the AIX box still has a 2G FC HBA, it may be worth a shot to set a fixed speed on the port.
Discards are bad.
09-18-2012 06:22 AM
Thank you both for your replies.
We have four AIX hosts in total, two application and two DB servers. There
are two ISLs, one on each fabric.
The hosts are 4 Gb and were originally connected to a 2 Gb infrastructure, so
the HBAs ran at 2 Gb with no discards.
I have now moved them to the new 8 Gb switches and set the switch port to 4 Gb
but still get discards.
The hosts can no longer see the disks down the original paths, but they now
have a number of new paths to each disk. However, I see a number of failed disks
from the OS, which I assume are the old paths. The number of failed disks corresponds
roughly to the number of discards I'm seeing. Could this be related, or is it a coincidence?
Thanks again for your time.
09-19-2012 08:22 AM
AIX is very particular about changes to the SAN infrastructure. If the hardware path has changed and the HBA driver is not the latest, it may still try to send I/O down paths that no longer exist, and as soon as such an FC frame reaches the switch it is discarded and counted under er_other_discard (according to the manual, this counter counts frames discarded due to route lookup failures or other reasons). But this is just a guess...
Please try to clear all the dead paths, and let's see afterwards if you still get those errors...
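As a rough aid for that cleanup, here is a minimal Python sketch that picks the non-working paths out of AIX `lspath` output and prints candidate `rmpath` commands for review. It assumes the listing was produced with `lspath -F "status name parent connection"` (four whitespace-separated columns) and that `Failed` and `Missing` are the statuses of interest. The sample listing, the WWPN-style connection strings, and the helper names are invented for illustration, and the script only prints commands, it never runs them.

```python
def stale_paths(lspath_text, bad_statuses=("Failed", "Missing")):
    """Return (disk, parent, connection) for each path whose status is in bad_statuses.

    Assumes four columns per line: status, disk name, parent adapter, connection.
    """
    stale = []
    for line in lspath_text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        status, disk, parent, connection = fields
        if status in bad_statuses:
            stale.append((disk, parent, connection))
    return stale


def removal_commands(paths):
    """Build one rmpath command per stale path, naming the exact connection."""
    return [
        f"rmpath -d -l {disk} -p {parent} -w {connection}"
        for disk, parent, connection in paths
    ]


# Hypothetical listing (not real host output).
sample = """\
Enabled hdisk4 fscsi0 500507680140a1b2,1000000000000
Failed  hdisk4 fscsi0 500507680110c3d4,1000000000000
Missing hdisk5 fscsi1 500507680120e5f6,2000000000000
Enabled hdisk5 fscsi1 500507680130a7b8,2000000000000
"""

for command in removal_commands(stale_paths(sample)):
    print(command)
```

Including the connection in each command is deliberate: the old and new paths may sit behind the same fscsi adapter, so removing by disk and parent alone could also take out the working paths.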