03-13-2012 08:25 AM
We are having a problem on our network right now and it has me stumped. We have two full Layer 3 FastIron SX800 chassis that do all the routing for our entire network. Our wiring closet switches (8-storey building) are all either FES or FCX switches running Base Layer 3 code (there are 34 of them). We have an environment of approximately 10 L2 VLANs for traffic separation and we are running 802.1w (rapid spanning tree) on all of the links from the switches to the SX800s. All uplink fibre optic connections are tagged and configured as spanning tree admin point-to-point MAC ports, and the root bridge priority for every VLAN is set so that the Layer 3 switches are the root. Where a wiring closet contains more than one switch, they are connected together for redundancy.
Here is the challenge. When a switch link goes down, it takes the SX800 chassis approximately 15 seconds to renegotiate the connections with STP. During these 15 seconds, the CPU of the SX800 pegs at 99% and stays there. All network traffic slows down, and in some cases our Citrix servers register a disconnection from the SQL databases on our SAN. We have tested the same scenario using a copper link and the CPU pegs at 99% for only 4-5 seconds. This 15-second failover delay causes our users significant slowdown, so we need to resolve this issue.
Any ideas? Thanks
03-13-2012 12:51 PM
Yes, I have seen similar STP-related high-CPU issues with the FSX.
But let me ask you some questions first:
* What version are you running on your FSX?
* Are you using single 802.1w or per-VLAN 802.1w?
* Have you changed any system-max values? You can check with "show default values".
* Are all L2 VLANs going across such a fiber link?
* Are all your non-inter-switch-link ports configured as admin-edge ports?
* Are you using "link-keepalive" on your fiber uplinks?
03-13-2012 01:04 PM
*Version of software = 7.2.02.d
*Per VLAN configuration.
*No default values changed except for the maximum number of static routes allowed.
*Only inter-switch links both fibre and copper are configured as admin-point-to-point ports.
*Not using keepalive on any of the ports.
03-15-2012 03:13 AM
Hm.... I have mostly seen this when system-max values have been increased and the CPU has to go through too many allocated resources even though they are not in use.
>>*Only inter-switch links both fibre and copper are configured as admin-point-to-point ports.
You should also consistently configure all non-ISL ports as admin-edge ports.
Are you running sFlow?
Maybe you could try the following as a test...
-- disable sFlow
-- disable any local logging and also syslog and SNMP traps
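To audit the admin-edge suggestion across many switches, here is a minimal Python sketch. It assumes you have saved each switch's running-config as text, that interface stanzas look like "interface ethernet <port>" followed by indented lines, and that the edge setting appears as "spanning-tree 802-1w admin-edge-port". The function names and the list of access ports are hypothetical, so please check the stanza format against your own configs before relying on it.

```python
def port_stp_roles(config_text):
    """Map port ID -> set of 802.1w admin flags found in its interface stanza.

    Assumed layout: an unindented "interface ethernet <port>" header followed
    by indented lines belonging to that port.
    """
    roles = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if not raw.startswith(" "):
            # Any unindented line either opens a new stanza or closes the current one.
            parts = line.split()
            if len(parts) == 3 and parts[0] == "interface" and parts[1] == "ethernet":
                current = parts[2]
                roles.setdefault(current, set())
            else:
                current = None
            continue
        if current and line.startswith("spanning-tree 802-1w admin-"):
            roles[current].add(line.split()[-1])
    return roles


def ports_missing_edge(config_text, access_ports):
    """Return the access ports that have no admin-edge-port line.

    access_ports is supplied by you, because ports with no configuration
    at all may not appear in the saved config.
    """
    roles = port_stp_roles(config_text)
    return [p for p in access_ports if "admin-edge-port" not in roles.get(p, set())]
```

Calling ports_missing_edge(config_text, ["1/2", "1/3"]) with the text of one switch's config returns the ports from your list that still need the edge setting.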
03-30-2012 11:45 AM
We are not using sFlow at all, so that didn't enter into the picture. Found out that one VLAN on one of the 34 floor switches was running classic spanning tree and not rapid spanning tree (802.1w)......go figure. Fixed that one VLAN and everything works the way it should. When a spanning tree recalculation happens, the core CPU pegs at 99% for about 4 seconds while it works out the new paths and then goes right back down to its normal 20%.
Thanks for all your thoughts - appreciate them.
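For anyone chasing the same symptom, a single VLAN left on classic spanning tree is easy to miss across 34 switches. Below is a minimal Python sketch of how the hunt could be automated. It assumes the running-configs have been saved as text, that VLAN stanzas start with an unindented "vlan <id> ..." line, and that rapid spanning tree shows up as "spanning-tree 802-1w" while classic spanning tree shows up as a plain "spanning-tree" line. The function names and the switch name in the example are made up for illustration.

```python
import re


def vlan_stp_modes(config_text):
    """Map VLAN ID -> 'rstp', 'stp', or 'none' for one switch's config text.

    Assumed layout: an unindented "vlan <id> ..." header followed by
    indented lines, with "!" or another unindented line ending the stanza.
    """
    modes = {}
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        header = re.match(r"vlan (\d+)\b", line)
        if header and not raw.startswith(" "):
            current = int(header.group(1))
            modes[current] = "none"
            continue
        if line and not raw.startswith(" "):
            current = None  # left the VLAN stanza
            continue
        if current is None:
            continue
        if line.startswith("spanning-tree 802-1w"):
            modes[current] = "rstp"
        elif line.startswith("spanning-tree") and modes[current] != "rstp":
            modes[current] = "stp"
    return modes


def find_legacy_stp(configs):
    """Return (switch, vlan) pairs that run classic STP instead of 802.1w.

    configs maps a switch name to the text of its saved config.
    """
    hits = []
    for switch in sorted(configs):
        for vlan, mode in sorted(vlan_stp_modes(configs[switch]).items()):
            if mode == "stp":
                hits.append((switch, vlan))
    return hits


if __name__ == "__main__":
    example = {
        "closet-3a": (
            "vlan 10 name data by port\n"
            " tagged ethe 1/1\n"
            " spanning-tree 802-1w\n"
            "!\n"
            "vlan 20 name voice by port\n"
            " tagged ethe 1/1\n"
            " spanning-tree\n"
            "!\n"
        )
    }
    for switch, vlan in find_legacy_stp(example):
        print(f"{switch}: VLAN {vlan} is on classic spanning tree")
```

In the made-up example, VLAN 20 on "closet-3a" is the one flagged. VLANs with no spanning tree line at all are reported as 'none' by vlan_stp_modes but are not flagged, since that is a different problem from the mismatch described here.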