05-24-2016 02:12 AM
This is a question about identifying the source of latency bottlenecks.
One day a customer started receiving messages about latency bottlenecks on E ports on several of his switches. (There are 20 switches in the fabric, but 14 of them are Gen 3, and I think Condor does not support the bottleneck monitor, so the messages appeared only on the Condor 2 and Condor 3 switches.)
OK, so there is a slow-drain device or a faulty SFP/cable/HBA somewhere. Let's find it.
I ran porterrshow on all switches in the fabric and found no errors other than disc_c3.
So faulty media should not be the cause. Looking further into the disc_c3 errors, I found that all F ports have only rx timeouts, and tx timeouts are registered only on E ports.
As FOS 6.x does not distinguish between rx and tx timeouts, I cannot claim that there are no tx timeouts on F ports anywhere in the fabric, but there were definitely none on the 6 switches running FOS 7.x.
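To make the reasoning above easier to check, here is a minimal Python sketch of how I sorted the ports. It is only an illustration: the `PortCounters` type, the `classify` function and the label strings are hypothetical names of my own, and the counters are filled in by hand rather than parsed from porterrshow output, since the output differs between FOS versions. It assumes the usual interpretation that a tx timeout on an F port points at the attached device, while a tx timeout on an E port only shows that something downstream is holding frames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PortCounters:
    # Hypothetical container; values are entered by hand, not parsed from the CLI.
    switch: str
    port: int
    port_type: str                 # "E" or "F"
    disc_c3: int                   # total class 3 discards
    c3_tx_timeouts: Optional[int]  # None when the FOS version does not split tx/rx
    c3_rx_timeouts: Optional[int]


def classify(p: PortCounters) -> str:
    """Label a port according to the reasoning in the post (assumed heuristic)."""
    if p.disc_c3 == 0:
        return "clean"
    if p.c3_tx_timeouts is None:
        # FOS 6.x case: direction of the timeout is unknown.
        return "unknown-direction"
    if p.c3_tx_timeouts > 0:
        # tx timeout on an F port suggests the attached device is slow to drain;
        # on an E port it only shows back pressure from further downstream.
        return "slow-drain-suspect" if p.port_type == "F" else "isl-back-pressure"
    return "rx-only"


def suspects(ports):
    """Return only the F ports that look like slow-drain candidates."""
    return [p for p in ports if classify(p) == "slow-drain-suspect"]
```

With the counters I collected, every F port on the FOS 7.x switches lands in "rx-only" and the E ports land in "isl-back-pressure", so `suspects` returns an empty list for those switches. The FOS 6.x ports stay in "unknown-direction", which is exactly the gap described above.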
While the bottlenecks continued, the customer found a possible source of the problem. After one of the backup sessions was terminated, the latency bottleneck messages stopped appearing. The backup data flowed from storage to the backup server and then over the LAN to DataDomain. Both devices reside on edge switches, and the traffic flowed through the core switches. Both the storage and the server are connected to Gen 5 switches running FOS 7, and there were no c3 tx timeouts on those ports. So could the latency bottlenecks have been caused by a lack of credits on the ISLs? That is, not by delays at any device, but by too many frames traversing the 2 ISLs?