08-03-2016 06:35 AM - edited 08-03-2016 06:36 AM
There are a number of switches in the fabric: 2 core switches and several edge switches. Every edge switch is connected to both cores.
We noticed that one of the core switches (say core2) does not see devices from a certain edge switch (say edge5).
This is nsallshow from core1:
1174 Nx_Ports in the Fabric
nsallshow from core2:
1153 Nx_Ports in the Fabric
nsallshow from the edge5 switch shows:
1174 Nx_Ports in the Fabric
ISLs are healthy.
What could be the reason for such behaviour?
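To put a number on the gap, the summary lines quoted above can be compared directly. This is only a minimal sketch in Python, assuming each switch's nsallshow summary line has been saved as text; the function names and the dictionary layout are made up for illustration and are not part of any Brocade tooling.

```python
import re

# Summary lines copied from the nsallshow output quoted in this post.
SUMMARIES = {
    "core1": "1174 Nx_Ports in the Fabric",
    "core2": "1153 Nx_Ports in the Fabric",
    "edge5": "1174 Nx_Ports in the Fabric",
}


def nx_port_count(text):
    """Extract the Nx_Port total from an nsallshow summary line."""
    match = re.search(r"(\d+)\s+Nx_Ports in the Fabric", text)
    if match is None:
        raise ValueError("no Nx_Ports summary line found")
    return int(match.group(1))


def missing_per_switch(summaries):
    """Return, per switch, how many Nx_Ports it lacks versus the largest view."""
    counts = {name: nx_port_count(text) for name, text in summaries.items()}
    largest = max(counts.values())
    return {name: largest - count for name, count in counts.items()}


if __name__ == "__main__":
    for name, missing in missing_per_switch(SUMMARIES).items():
        print(f"{name}: {missing} Nx_Ports fewer than the largest view")
```

With the figures above, core2 comes out 21 Nx_Ports short of what core1 and edge5 report.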
08-03-2016 08:05 AM
One possibility is that this can occur in a setup with long-distance E_Ports, due to a rare name server query frame timeout condition.
An hafailover should recover from this issue (on directors).
What FOS version are you using?
08-03-2016 08:10 AM
If you run the fabricshow or topologyshow commands on the core2 switch, does it show switch 5 as part of the fabric? If all of the switches are truly merged into one fabric, then the name server table should be shared. Are you using TI zones? I'll see if I can find anything else that might explain this.
08-03-2016 08:27 AM
Run the nscamshow command from each core switch and pipe the output of each to a file. Compare, or diff, the entries in the two name server CAM files for devices which are not present or are unreachable. Report the findings back here.
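If a plain diff is too noisy, the comparison can be sketched in a few lines of Python. This assumes the two outputs have already been captured to text (how they are captured is up to you); the function names are hypothetical, and the sample strings in the demo are placeholders, not real nscamshow output. The two captures will differ in other ways as well, so treat the result as a list of leads, not a verdict.

```python
def meaningful_lines(text):
    """Return the set of non-blank lines, stripped of surrounding whitespace."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def compare_outputs(core1_text, core2_text):
    """Return (lines only in core1's capture, lines only in core2's capture)."""
    core1_lines = meaningful_lines(core1_text)
    core2_lines = meaningful_lines(core2_text)
    return sorted(core1_lines - core2_lines), sorted(core2_lines - core1_lines)


def compare_files(core1_path, core2_path):
    """Same comparison, reading the two saved captures from disk."""
    with open(core1_path) as f1, open(core2_path) as f2:
        return compare_outputs(f1.read(), f2.read())


if __name__ == "__main__":
    # Placeholder text only, NOT real nscamshow output.
    core1_capture = "entry A\nentry B\nentry C\n"
    core2_capture = "entry A\nentry C\n"
    only_core1, only_core2 = compare_outputs(core1_capture, core2_capture)
    print("only in core1:", only_core1)
    print("only in core2:", only_core2)
```

For real captures, call compare_files with the two file names you saved the outputs under.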
08-04-2016 02:50 AM - edited 08-04-2016 02:57 AM
Thank you, Alexey.
In my case it may be DEFECT000513776.
DEFECT000469915 probably does not apply, because we have normal-distance E-ports.
Doc, the nscamshow output from the core2 switch shows no entries for the edge5 switch:
Switch entry for 135
state rev owner cap_available
unknown v711 0xfffc67 1
Device list: count 0
No entry is found!
Path Count: 1
Out Port: 3/6
In Ports: 1/1 1/3 1/4 1/6 1/8 1/9 1/10 1/11 1/12 1/15 2/0 2/1 2/3 2/5 2/7 2/8
2/9 2/10 2/11 2/12 2/14 3/0 3/3 3/4 3/7 3/8 3/9 3/10 3/11 3/12 3/15 7/0
7/1 7/2 7/5 7/6 7/7 7/8 7/9 7/10 7/13 8/1 8/12 9/0 9/2 9/3 9/4 9/6
9/7 9/8 9/10 9/11 9/12 9/15
Total Bandwidth: 16.000 Gbps
Bandwidth Demand: 2375 %
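On a fabric this size, scanning a saved nscamshow capture for empty switch entries by eye gets tedious. Below is a rough Python sketch that does it, assuming the capture uses the same "Switch entry for N" and "Device list: count N" lines as the excerpt above; anything outside those two line patterns is ignored, and the function names are invented for this example.

```python
import re

# Excerpt copied from the nscamshow output quoted above.
SAMPLE = """\
Switch entry for 135
state rev owner cap_available
unknown v711 0xfffc67 1
Device list: count 0
No entry is found!
"""


def device_counts(text):
    """Map each 'Switch entry for N' identifier to its 'Device list' count."""
    counts = {}
    current = None
    for line in text.splitlines():
        entry = re.match(r"\s*Switch entry for\s+(\d+)", line)
        if entry:
            current = int(entry.group(1))
            continue
        listed = re.match(r"\s*Device list:\s*count\s+(\d+)", line)
        if listed and current is not None:
            counts[current] = int(listed.group(1))
    return counts


def empty_switches(text):
    """Return the switch identifiers whose cached device list is empty."""
    return sorted(sw for sw, count in device_counts(text).items() if count == 0)


if __name__ == "__main__":
    print("switch entries with no cached devices:", empty_switches(SAMPLE))
```

Run against the excerpt, it flags only the entry for 135, which matches what is shown above.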
So I'm going to issue hafailover on the core2 switch first; if that does not help, I will reboot the edge switch, setting the defZone mode to NoAccess beforehand.
But I also think that another solution may be to run reboot -f on the core2 director (48000).
It should not disrupt the traffic:
When HA is in sync, and reboot -f is issued on the Active CP of a director, the Standby CP
takes over as the active CP without traffic disruption.
Local CP (Slot 5, CP0): Active, Warm Recovered
Remote CP (Slot 6, CP1): Standby, Healthy
HA enabled, Heartbeat Up, HA State synchronized
FOS version is v6.4.1b.
08-04-2016 03:59 AM
Defect 469915, listed as fixed in the 6.4.3f release notes, may match the issue seen.
08-04-2016 09:00 AM - edited 08-04-2016 09:03 AM
OK, you didn't fully answer my request, but we'll deal with what we have. I now know the core is a 48k running 6.4.1, and it's a pretty big fabric. So, a couple of things. The 48k and this FW release are notorious for memory leaks causing OOM conditions. If memory on the core starts to fill up and there's a change in the name server causing plenty of RSCNs, you can crash the NSD (name server daemon) and fail to maintain name server distribution. Have you been adding or dropping F devices recently? That might account for the content addressable memory entry for the edge being empty, or it may not. I don't know the sequence here, so I'm kind of shooting in the dark.
As for remediation, well - you can try all kinds of things, but without accurately troubleshooting, it's hard to say what effect your actions will have, unless you reboot core2 and the edge. One way or another, if the NSD is crashed, it will need to be re-spawned somehow, but I'm not even remotely sure this is the cause of the lack of cam entries in edge switch. Rebooting will cause a build fabric, and all the devices will log back in. If you can take that kind of outage, it's a fix, but it's like smashing a Zika bug with a Mack truck.
Concur with the other poster that upgrading the fabric to the latest 6.4 release is a wise choice. Also, as a matter of management principle, I would strongly encourage keeping your IP management traffic to a minimum to avoid memory leaks (no WebTools means no WebLinker process running, get it?). Further, if you have the Fabric Watch license, set up some thresholds to check memory usage on all switches and have it report back to an email account, or via SNMP, as you see fit.
Best of luck
08-16-2016 12:12 AM
Please can you mark this as resolved?