08-23-2013 07:49 PM
Been researching this for a couple of days but haven't come up with anything applicable to my situation. "show cpu" shows 99% cpu. "show process cpu" shows about 10% for ARP and about 50% for "IP" and <1% for anything else. Lots of packet loss. Rebooting the router brings me down to about half this cpu use for maybe 12 hours at best, but even half these cpu loads is dramatically more than it should be. No major config changes in several months and this just cropped up recently, so I'm thinking it must be some kind of traffic that's being routed in cpu instead of in hardware. The regular logging shows a handful of snmp and telnet logins but nothing to cause this kind of CPU use.
Any ideas where I should look next?
08-23-2013 08:58 PM
I think you most likely have a loop. Depending on what you can do, I would suggest the following:
Clear the port stats, then check them again to find the ports whose counters are growing at a fast rate.
Disable the port(s) identified above and monitor the CPU.
Low-tech way: pull each cable one by one while monitoring CPU usage.
Take a packet capture to see the source address, then find where that source is.
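As a rough illustration of the "find the ports growing at a fast rate" step, here is a minimal Python sketch. It assumes you have already collected per-port packet counters twice, some known number of seconds apart; the port names, counter values, and the function name `rank_ports_by_rate` are all hypothetical, and the sketch does not parse any real switch output.

```python
def rank_ports_by_rate(before, after, interval_seconds):
    """Return (port, packets_per_second) pairs, fastest-growing port first.

    `before` and `after` are dicts mapping a port name to its packet counter.
    """
    rates = []
    for port, end_count in after.items():
        start_count = before.get(port, 0)
        delta = end_count - start_count
        if delta < 0:
            # Counter was cleared or wrapped between the two samples,
            # so fall back to the second reading as the growth.
            delta = end_count
        rates.append((port, delta / interval_seconds))
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Made-up counters for three ports, sampled 60 seconds apart.
before = {"1": 1_000, "2": 5_000, "3": 200}
after = {"1": 1_600, "2": 905_000, "3": 260}

ranked = rank_ports_by_rate(before, after, interval_seconds=60)
for port, pps in ranked:
    print(f"port {port}: {pps:.0f} pps")
```

The port at the top of the list is the first candidate to disable while watching the CPU.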
08-23-2013 09:12 PM
By loop, like a switching loop or routing loop? It would be a bit odd to have a switching loop given that nothing has been plugged into or unplugged from this device in a while, though I could certainly check to make sure all my trunk groups are working and configured properly.
FWIW all the ports should be growing at a fast rate because all the ports are uplinked to other switches used as top-of-rack in a web hosting environment. I'm graphing bandwidth use and PPS and don't see anything too far from ordinary there that I can tell, aside from the packet loss and the gaps in the graphs when the switch cpu gets too overloaded. Because of that I'm not sure that clearing the port counters will be of much use here.
We have some packet captures that we sort by destination IP descending; nothing out of the ordinary for the top destinations.
About how much traffic would be likely to cause something like this? Am I looking for 10pps, 100, 10k?
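Since the captures above were sorted by destination while the earlier suggestion was to look at source addresses, a small Python sketch of tallying the same capture by either field may help. It assumes the capture has already been exported to a list of records with `src` and `dst` keys; that record layout, the addresses, and the name `top_talkers` are hypothetical.

```python
from collections import Counter


def top_talkers(packets, field, limit=3):
    """Count packets per value of `field` ("src" or "dst"), busiest first."""
    counts = Counter(packet[field] for packet in packets)
    return counts.most_common(limit)


# Made-up capture records using documentation address ranges.
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.1"},
    {"src": "192.0.2.10", "dst": "198.51.100.2"},
    {"src": "192.0.2.10", "dst": "198.51.100.1"},
    {"src": "192.0.2.20", "dst": "198.51.100.3"},
]

print(top_talkers(packets, "src"))
print(top_talkers(packets, "dst"))
```

A source that dominates the tally while no single destination stands out would not show up in a destination-sorted view.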
08-23-2013 09:20 PM
Well the FESX448 is able to forward 101 MPPS in hardware - so I would not think you are hitting that.
Yes, I was thinking of a layer 2 switching loop (as I have seen an L2 loop do exactly what you have got). Check your trunks; maybe somebody removed the LACP config from one of the connecting switches?
08-23-2013 09:26 PM
My bad, just reread your first post " Rebooting the router brings me down to about half this cpu use for maybe 12 hours at best" - if this was a loop it should be almost instant.
Can you post a "show version"? In the 7.0 through 7.3 code there were a few CPU issues with IronWare on FastIrons.
08-23-2013 09:42 PM
My vendor sent me an updated "j" version firmware (f is loaded below), but the changelog didn't indicate anything useful and it's still a 7.2 version. They claimed the 7.2.02j was the latest one available.
The switch cpu use comes and goes, sometimes as low as 30% but usually in the 60-99% range, with ARP consistently doing around 10% and "IP" doing 20-50%. It never adds up to 100% in "show process cpu"; the "show cpu" command typically shows double the total use that "show process cpu" does.
I would think with a loop it would max out pretty quickly and stay there. This time when I rebooted, it didn't stay "good" for 12 hours, but it went from critical "going to lose customers" level of cpu use to "borderline bad, can go get dinner and work on this some more" level of cpu use.
Would a failed / intermittent port in a lag cause this?
SW: Version 07.2.02fT3e3 Copyright (c) 1996-2010 Brocade Communications Systems, Inc.
Compiled on Feb 16 2012 at 20:48:44 labeled as SXR07202f
(4070551 bytes) Primary SXR07202f.bin
BootROM: Version 07.2.00T3e5 (FEv2)
HW: Stackable FESX448+1XG-PREM (PROM-TYPE FESX448-L3U)
Serial #: FL47040072
License: SX_V4_HW_ROUTER_SOFT_PACKAGE (LID: hnJMFJFFMH)
P-ASIC 0: type 00D1, rev D1 subrev 00
P-ASIC 1: type 00D1, rev D1 subrev 00
P-ASIC 2: type 00D1, rev D1 subrev 00
P-ASIC 3: type 00D1, rev D1 subrev 00
P-ASIC 4: type 01D1, rev 00 subrev 00
300 MHz Power PC processor 8245 (version 129/1014) 66 MHz bus
512 KB boot flash memory
8192 KB code flash memory
512 MB DRAM
The system uptime is 2 hours 12 minutes 59 seconds
The system : started=warm start reloaded=by "reload"
08-23-2013 10:02 PM
"Would a failed / intermittent port in a lag cause this?" No it should not..... but...
Also check the "show mem" output and see if memory use is increasing over time (it should be static).
Latest code is 7.4.00d (released on July 31 2013).
In 7.2.x and 7.3, "show process cpu" never added up to 100%. I'm not sure whether this got fixed in 7.4, as the only switch I now have access to is an ICX-6430 running 7.4 and the command is not supported on that model.
I had LOTS of issues with CPU on 7.2 through 7.3 on the FSX1600 - I suggest getting 7.4.00d (also, 4.3 code was rock solid for FESX).
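To make the "check whether memory is increasing over time" suggestion concrete, here is a minimal Python sketch. It assumes you note the used-memory figure by hand at a few points in time; the sample values and the name `memory_growth` are made up, and nothing here parses real "show mem" output.

```python
def memory_growth(samples):
    """Summarise a list of (seconds_since_first_sample, used_bytes) readings.

    Returns (total_growth_bytes, growth_bytes_per_hour, never_drops).
    """
    first_time, first_used = samples[0]
    last_time, last_used = samples[-1]
    growth = last_used - first_used
    elapsed = last_time - first_time
    per_hour = growth * 3600 / elapsed
    # True when every reading is at least as large as the one before it.
    never_drops = all(b[1] >= a[1] for a, b in zip(samples, samples[1:]))
    return growth, per_hour, never_drops


# Made-up readings taken every 10 minutes.
samples = [
    (0, 200_000_000),
    (600, 200_400_000),
    (1200, 200_900_000),
    (1800, 201_500_000),
]

growth, per_hour, never_drops = memory_growth(samples)
print(f"grew {growth} bytes, about {per_hour:.0f} bytes/hour, never drops: {never_drops}")
```

A figure that keeps climbing and never drops back would point toward a leak rather than the static usage expected on a healthy switch.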
08-23-2013 10:17 PM
FYI - I was just reading the release notes for 7.4.00d. Although there is a 7.4.00d download for the FESX448 in the Brocade downloads section, the release notes state that FESX4xx models are NOT supported.
08-23-2013 10:25 PM
Hmm, if 7.4 doesn't support the FESX448, is 7.3 an improvement over 7.2 in terms of cpu, or probably not? The version you said is rock solid, could you give a full version number for it? Any important downsides to using an older version like that?
08-23-2013 10:39 PM
OK, just checked and 7.3 is also a no-go for FESX4xx, so it looks like 7.2.02j is indeed the latest.
I went back and checked some of my old emails. The version most of my customers were using was 4.200c. A downside could be a missing feature that was added later and that you require. However, I do not remember any major new features for FESX.
Ver 5 was about adding licensing to FastIron
Ver 6 added stacking
Ver 7.00 was the first consolidated image for most of the FastIron family
Ver 7.1 - do not remember
Ver 7.2 added hitless for FSX800 and FSX1600