
Contributor
Posts: 21
Registered: ‎04-15-2011

Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

We have two Brocade 48000 directors running Fabric OS v5.3.2c (yes, I know we're behind). At the same time yesterday morning, BOTH switches logged an error on CP0 and successfully failed over to CP1.

From CP0 error log

2013/03/24-03:54:55, , 772, FFDC, WARNING, IBM_2109_M48, kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,RgP=925,DfP=0,died=1,rt=125771772,dt=83402,to=50000,aJc=125670272,aJp=125636971,abiJc=-1027802072,abiJp=-1027835372,aSeq=7636,kSeq=0,kJc=0,kJp=0,J=1256

2013/03/24-03:54:55, , 773,, WARNING, IBM_2109_M48, HA State out of sync.

2013/03/24-03:54:55, , 774,, INFO, IBM_2109_M48, First failure data capture (FFDC) event occurred.

2013/03/24-03:54:56, , 775,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

2013/03/24-03:54:56, , 776,, WARNING, M48_Switch_2, Switch status change contributing factor CP: CP non-redundant.

2013/03/24-03:58:16, , 777,, WARNING, IBM_2109_M48, Trace dump available (Slot 5)! (reason: PANIC)

2013/03/24-03:58:16, , 778,, WARNING, IBM_2109_M48, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.

2013/03/24-03:58:17, , 779,, INFO, IBM_2109_M48, Processor rebooted - Software Fault:Software Watchdog

2013/03/24-03:58:36, , 780,, INFO, M48_Switch_2, Config file change from task:PDMIPC

2013/03/24-03:58:36, , 781,, INFO, IBM_2109_M48, HA State is in sync.

<CP1 Error Log>

2013/03/24-03:54:55, , 3168,, WARNING, IBM_2109_M48, HA State out of sync.

2013/03/24-03:55:53, , 3169,, INFO, M48_Switch_2, iSNS Client Service is disabled.

2013/03/24-03:55:53, , 3170,, INFO, M48_Switch_2, Config file change from task:PDMIPC

2013/03/24-03:55:58, , 3176,, INFO, M48_Switch_2, Previous message repeated 6 time(s)

2013/03/24-03:56:06, , 3177,, INFO, M48_Switch_2, Config file change from task:PDMIPC

2013/03/24-03:56:06, , 3178,, ERROR, IBM_2109_M48, CP in Slot 5 set to faulty because CP ERROR asserted.

2013/03/24-03:56:06, , 3179,, INFO, M48_Switch_2, Config file change from task:PDMIPC

2013/03/24-03:56:09, , 3182,, INFO, M48_Switch_2, Previous message repeated 3 time(s)

2013/03/24-03:56:43, , 3183,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

2013/03/24-03:56:43, , 3184,, WARNING, M48_Switch_2, Switch status change contributing factor CP: CP non-redundant.

2013/03/24-03:57:26, , 3185,, INFO, IBM_2109_M48, Resetting standby CP (double reset may occur)

2013/03/24-03:57:29, , 3186,, INFO, IBM_2109_M48, CP in slot 5 not faulty, CP ERROR deasserted.

2013/03/24-03:58:10, , 3187,, WARNING, IBM_2109_M48, Trace dump available (Slot 5)! (reason: PANIC)

2013/03/24-03:58:10, , 3188,, WARNING, IBM_2109_M48, Trace dump (Slot 5) was not transferred because trace auto-FTP disabled.

2013/03/24-03:58:10, , 3189,, INFO, M48_Switch_2, Config file change from task:PDMIPC

2013/03/24-03:58:35, , 3193,, INFO, M48_Switch_2, Previous message repeated 4 time(s)

2013/03/24-03:58:37, , 3194,, INFO, IBM_2109_M48, HA State is in sync.

2013/03/24-03:58:39, , 3195,, INFO, M48_Switch_2, Switch status changed from MARGINAL to HEALTHY.

More Info:

M48_Switch_2:admin> firmwareshow

Slot Name     Appl     Primary/Secondary Versions               Status

------------------------------------------------------------------------

  5  CP0      FOS      v5.3.2c                                  STANDBY *

                       v5.3.2c                                 

  6  CP1      FOS      v5.3.2c                                  ACTIVE

                       v5.3.2c                                 

M48_Switch_2:admin> hashow

Local CP (Slot 5, CP0): Standby

Remote CP (Slot 6, CP1): Active

HA enabled, Heartbeat Up, HA State synchronized

What happened? How did this happen on BOTH switches at once? They're not connected to each other.

Valued Contributor
Posts: 761
Registered: ‎06-11-2010

Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

The failover seems to have been triggered by a daemon that failed. Check the CORE file that was created to see which daemon failed, and whether it was the same one on both switches (intensive SNMP polling, for example, might be a common cause).
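The kSWD entry in the error log already names the process whose termination the software watchdog detected, so the two switches can be compared on that field before any core files are opened. A minimal Python sketch, assuming each switch's error log has been saved as text; the function name and the regular expression are illustrative and not part of any Brocade tool:

```python
import re

# Matches the process name and instance in a kSWD entry such as:
#   kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,...
KSWD_RE = re.compile(
    r"kSWD: Detected unexpected termination of:\s*'+([A-Za-z0-9_]+):(\d+)'"
)

def terminated_daemons(log_text):
    """Return (timestamp, daemon, instance) for every kSWD termination entry."""
    hits = []
    for line in log_text.splitlines():
        m = KSWD_RE.search(line)
        if m:
            # The timestamp is the first comma-separated field of the log line.
            timestamp = line.split(",", 1)[0].strip()
            hits.append((timestamp, m.group(1), int(m.group(2))))
    return hits

# Sample line taken from the CP0 error log posted in this thread (truncated).
sample = ("2013/03/24-03:54:55, , 772, FFDC, WARNING, IBM_2109_M48, "
          "kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,RgP=925")
print(terminated_daemons(sample))
```

Running the same function over the log from the second switch and comparing the daemon names shows quickly whether both failovers started with the same process.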

rgds,

F

Contributor
Posts: 21
Registered: ‎04-15-2011

Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

Would the CORE files be on CP0 or CP1? How do I read them?

Thank You

Valued Contributor
Posts: 761
Registered: ‎06-11-2010

Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

First of all you should generate a fresh supportsave.

Within the supportsave, there should be a file named XXXX.CORE_FFDC.tar.gz. In it, you may find info about the daemon that failed.
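Once the supportsave has been collected, the archive can be searched without unpacking it by hand. A minimal sketch in Python, assuming the CORE_FFDC file is an ordinary gzip-compressed tar archive; the function name and the keyword list are hypothetical choices for illustration:

```python
import tarfile

def scan_core_archive(archive_path, keywords=("kSWD", "failed", "Assert")):
    """Return (member_name, line) for each line in the archive containing a keyword."""
    matches = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links and device nodes
            fh = tar.extractfile(member)
            if fh is None:
                continue
            # Dump files may mix text and binary data, so decode leniently.
            text = fh.read().decode("ascii", errors="replace")
            for line in text.splitlines():
                if any(k in line for k in keywords):
                    matches.append((member.name, line.strip()))
    return matches
```

Calling `scan_core_archive()` with the path of the CORE_FFDC archive returns the matching lines together with the file each came from. The sketch reads every member fully into memory, which is fine for small text dumps but may be slow for large binary core files.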

Rgds

Contributor
Posts: 21
Registered: ‎04-15-2011

Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

Thank you. Un-tarring the core file created several directories. In one called "panic" there was a file called core.pd1364111880. In that file I found the text below. Is the "No space left on device" error the problem that caused the panic?

== Dumping debug information ==

2013/03/24-03:54:55, , 772, FFDC, WARNING, IBM_2109_M48, kSWD: Detected unexpected termination of: ''zoned:0'RfP=925,RgP=925,DfP=0,died=1,rt=125771772,dt=83402,to=50000,aJc=125670272,aJp=125636971,abiJc=-1027802072,abiJp=-1027835372,aSeq=7636,kSeq=0,kJc=0,kJp=0,J=1256

2013/03/24-03:54:55, , 773,, WARNING, IBM_2109_M48, HA State out of sync.

shmInit: shmget failed: No space left on device

shmInit: shmget failed: No space left on device

F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND

4     0     1     0  16   0   1700   580 -      S    ?          0:03 init

1     0     2     1  34  19      0     0 -      SN   ?          0:16 ksoftirqd/0

1     0     3     1  10  -5      0     0 -      S<   ?          0:00 events/0

1     0     4     1  20  -5      0     0 -      S<   ?          0:00 khelper

1     0     5     1  11  -5      0     0 -      S<   ?          0:00 kthread

1     0    27     5  10  -5      0     0 -      S<   ?          0:00  \_ kblockd/0

1     0    56     5  15   0      0     0 -      S    ?          0:00  \_ pdflush

1     0    59     5  14  -5      0     0 -      S<   ?          0:00  \_ aio/0

1     0    60     5  14  -5      0     0 -      S<   ?          0:00  \_ xfslogd/0

1     0    61     5  14  -5      0     0 -      S<   ?          0:00  \_ xfsdatad/0

1     0    62     5  10  -5      0     0 -      S<   ?          0:00  \_ xfsbufd

1     0    69     5  14  -5      0     0 -      S<   ?          0:00  \_ kseriod

1     0   844     5  15   0      0     0 -      S    ?          0:00  \_ pdflush

1     0    58     1  15   0      0     0 -      S    ?          0:00 kswapd0

1     0   265     1  15   0      0     0 -      S    ?          0:00 kjournald

1     0   283     1 -100  -   1680   396 -      Ss   ?          0:00 wdtd

1     0   345     1  15   0      0     0 -      S    ?          0:08 kjournald

5     1   495     1  16   0   1692   428 -      Ss   ?          0:00 portmap

5     0   513     1  16   0   2120   636 -      Ss   ?          0:00 inetd

1     0   537     1  16   0   1812   616 -      Ss   ?          0:06 crond

1     0   538     1  15   0   1948   652 -      Ss   ?          0:00 syslogd

5     0   544     1  16   0   1704   380 -      Ss   ?          0:00 klogd

1     0   550     1  15   0      0     0 -      S    ?          0:01 RASLOGK_TH

1     0   681     1  15   0      0     0 -      S    ?          2:44 module-99-th

1     0   684     1  19   0      0     0 -      S    ?          0:00 module-107-th

1     0   687     1  19   0      0     0 -      S    ?          0:00 module-126-th

1     0   712     1  19   0      0     0 -      S    ?          0:00 kmtracer

0     0   787     1  16   0  61312  5248 -      Dl   ?          0:22 raslogd

1     0  5847   787  16   0  61316  2196 -      Ss   ?          0:00  \_ raslogd

0     0  5848  5847  18   0   8648  2012 -      R    ?          0:00      \_ tracedump

5     0   789     1  16   0  16620  1708 -      Ssl  ?          7:24 ipadmd

1     0   792     1  16   0   7444  1264 -      Ss   ?          0:00 telnetmond

0     0   800     1  16   0  17088  2472 -      Sl   ?          0:00 sysctrld

4     0   898   800   0 -16  12084  1112 -      S<l  ?          0:00  \_ proxy

4     0   899   800  16   0  51548  3292 -      Sl   ?          0:00  \_ pdmd

0     0   900   800  16   0  23640  1884 -      Sl   ?          0:00  \_ hmond

0     0   901   800  16   0   8100  1788 -      S    ?          0:00  \_ diagd

0     0   902   800  18   0   8044  1780 -      S    ?          0:00  \_ porttestd

0     0   903   800  16   0  77004  3876 -      Sl   ?        128:33  \_ emd

0     0   905   800  16   0  50248  2344 -      Sl   ?          0:00  \_ bmd

4     0   906   800  16   0  24484  2176 -      Sl   ?          0:00  \_ hamd

0     0   923   800  16   0  59116  2996 -      Sl   ?          0:00  \_ essd

4     0   924   800   6 -10  53820  3708 -      S<l  ?          6:32  \_ fabricd

0     0   926   800  16   0  51504  2976 -      Sl   ?          9:06  \_ fspfd

0     0   927   800  15   0  72484  6376 -      Sl   ?         10:38  \_ nsd

0     0   929   800  16   0  33800  2476 -      Sl   ?          0:00  \_ arrd

0     0   930   800  16   0  65708  4096 -      Sl   ?         34:39  \_ msd

0     0   931   8000:'nsd:0' PS:3(C2013/03/24-03:54shmInit: shmget failed: No spaceshmInit: shmget failed: No space left on device

sysctrld tries to remove shm with id 0

Assert failure: /vobs/pWarning: bad ps chassis0(0): ACTIVE(0), Required

local = SYN_SUCC, prev = SYN_STime=7:54:56-480Time=7:54:56-480output of memshoshmInit: shmget failed: No space left on device

             total       used       free     shared    buffers     cached

Mem:     520110080  486871040   33239040          0   33091584  273969152

Swap:            0          0          0

shmInit: shmget failed: No space left on device

2013/03/24-03:54:56, , 775,, WARNING, M48_Switch_2, Switch status changed from HEALTHY to MARGINAL.

uSWD: End of Data ==========

Valued Contributor
Posts: 761
Registered: ‎06-11-2010

Re: Brocade 48000 - Simultaneous CP Error and Failover on 2 Switches

Hi,

Yes, it seems that the daemons ran out of memory, or maybe out of physical space on the flash.

You should monitor the free space (commands df and memshow) to prevent this situation from recurring.
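To make that monitoring concrete, here is a minimal Python sketch that parses memshow output in the layout captured in the dump earlier in this thread. It assumes the columns are total, used, free, shared, buffers and cached, as in that dump's header row; the function names are hypothetical:

```python
def parse_memshow(text):
    """Parse the 'Mem:' row of memshow output into a dict of integer values."""
    fields = ("total", "used", "free", "shared", "buffers", "cached")
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "Mem:":
            return dict(zip(fields, map(int, parts[1:7])))
    raise ValueError("no 'Mem:' row found")

def reclaimable_fraction(mem):
    """Fraction of RAM that is free or held in buffers and page cache."""
    return (mem["free"] + mem["buffers"] + mem["cached"]) / mem["total"]

# Values copied from the memshow output in the posted dump.
sample = """\
             total       used       free     shared    buffers     cached
Mem:     520110080  486871040   33239040          0   33091584  273969152
Swap:            0          0          0
"""
mem = parse_memshow(sample)
print(f"free: {mem['free'] / mem['total']:.1%}")
print(f"free + buffers + cached: {reclaimable_fraction(mem):.1%}")
```

For the posted figures, only about 6% of RAM was strictly free, but roughly two thirds was free or held in buffers and cache. On Linux, shmget returns "No space left on device" (ENOSPC) when the system-wide limit on shared memory identifiers or on total shared memory would be exceeded, so the failures in the dump may point to a shared memory limit rather than to overall memory or flash exhaustion. That reading is an inference from the dump, not a confirmed diagnosis.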

As you know, an upgrade would be a good option.

rgds
