10-02-2009 08:24 AM
We have had an issue where eight 3900 SAN switches all suddenly rebooted at exactly the same time in two separate fabrics (four in one fabric and four in the other)!
In the error log on each switch we see the message:
Err 01 0x10339820 (tSwitch): Oct 2 04:12:59 INFO SYS-BOOT, 4, Restart reason: Watchdogport Command Group
Reset reason 10l: Watchdog NMI
Our hardware support team has been unable to fully explain this event, saying only that there was some issue deep in the firmware. Does anyone else have any idea what could have caused it?
10-02-2009 08:33 AM
This was a bug in FabricOS a long time ago, and the defect has been closed by Brocade.
Which FabricOS release is loaded on these 8 switches? Please tell me the EXACT release.
10-02-2009 08:54 AM
The 3900 doesn't support FabricOS 3.x.
The 3900 is supported only by FabOS 4.x.x and 5.x (latest 5.3.2c).
I assume you mean 3800 switches? Can you please confirm?
Log in to the switch on the command line and run the command "switchshow"; you will see the type in the line:
10 = 3900
This reboot behavior was caused by an old FabOS 4.x release, but I will check for 3.x.
10-02-2009 10:02 AM
Indeed, the 3800s are old-timers.
One thing is sure: the Watchdog NMI error is timer-caused ... like SNMP.
Can you please check whether all switches have the correct, current date and time? The command is "date".
What does the command "errshow" say?
I must check if I can find other info about this error.
10-02-2009 02:36 PM
The date and time are in sync on all the affected switches. They all use an NTP server to synchronise their time.
The errshow output only shows the error events after the reboot. The first entry on each is:
0x103397c0 (tSwitch): Oct 2 04:12:58
INFO SYS-BOOT, 4, Restart reason: Watchdog
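For anyone comparing entries like these across several switches, a small script can confirm how closely the restarts cluster in time. This is only a sketch: the pattern is based solely on the two entries quoted in this thread, the switch names are hypothetical, and other firmware releases may format errshow output differently. It also assumes an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# Pattern based only on the entries quoted in this thread (assumption);
# other firmware releases may format errshow output differently.
ENTRY_RE = re.compile(
    r"\((?P<task>\w+)\):\s+"
    r"(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+"
    r"INFO SYS-BOOT, \d+, Restart reason: (?P<reason>.+)"
)


def parse_restart(entry, year):
    """Return (timestamp, reason) for a SYS-BOOT entry, or None if no match."""
    match = ENTRY_RE.search(entry)
    if match is None:
        return None
    # The quoted timestamps carry no year, so the caller has to supply one.
    stamp = " ".join(match.group("stamp").split())
    when = datetime.strptime(f"{stamp} {year}", "%b %d %H:%M:%S %Y")
    return when, match.group("reason").strip()


def restart_spread_seconds(entries, year):
    """Seconds between the earliest and latest restart across all switches."""
    times = []
    for text in entries.values():
        parsed = parse_restart(text, year)
        if parsed is not None:
            times.append(parsed[0])
    if not times:
        return None
    return (max(times) - min(times)).total_seconds()


# Hypothetical switch names; entry text modelled on the entries in this thread.
samples = {
    "switch_a": "0x103397c0 (tSwitch): Oct 2 04:12:58\n"
                "INFO SYS-BOOT, 4, Restart reason: Watchdog",
    "switch_b": "0x10339820 (tSwitch): Oct 2 04:12:59 "
                "INFO SYS-BOOT, 4, Restart reason: Watchdog",
}

if __name__ == "__main__":
    print(restart_spread_seconds(samples, 2009))
```

A spread of a second or two across switches in different fabrics points towards a shared external trigger rather than independent faults on each switch.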
10-02-2009 10:51 PM
The only thing I've found is the bug in FabOS version 4.4.1a, but it does not apply to the 3800.
As mentioned, that error was caused by an SNMP crash.
I can imagine this could be the same error, but this version is no longer fully supported, and even if it is a bug, I do not think Brocade
will correct it with an update in the future.
I have no other ideas; maybe someone else here knows this error.
10-08-2009 05:36 AM
Just thought I would give you an update on this issue.
We traced the root cause to some network vulnerability scans that mistakenly targeted our SAN management subnet. These scans repeatedly attempted to log in to our switches, which caused the tHttp daemon to hang and the switches to perform a watchdog reboot. An added complication is that many of the switches are aggregated to a network hub, which is why the reboots were simultaneous: they all went down when the network scan hit the hub.
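In case it helps anyone guard against the same thing, a scan target list can be checked against protected subnets before a scan runs. This is a minimal sketch using Python's standard `ipaddress` module. The function name and all address ranges are hypothetical examples, not our actual configuration.

```python
import ipaddress


def overlapping_targets(scan_ranges, protected_subnets):
    """Return (scan_range, protected_subnet) pairs that overlap."""
    protected = [ipaddress.ip_network(s) for s in protected_subnets]
    hits = []
    for scan_range in scan_ranges:
        # strict=False tolerates ranges written with host bits set.
        network = ipaddress.ip_network(scan_range, strict=False)
        for subnet in protected:
            if network.overlaps(subnet):
                hits.append((scan_range, str(subnet)))
    return hits


if __name__ == "__main__":
    # Hypothetical ranges, for illustration only.
    scan_ranges = ["10.0.0.0/8", "192.168.5.0/24"]
    san_management = ["10.20.30.0/24"]
    for scan_range, subnet in overlapping_targets(scan_ranges, san_management):
        print(f"Scan range {scan_range} overlaps protected subnet {subnet}")
```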
Thanks for your assistance