04-25-2013 05:46 AM
We have a very strange situation with a pair of Brocade DCXs that are connected via ICLs.
san008 ---- san004 are the two switches in question.
We keep getting this occurring in the fabriclog for san008:
Switch 0; Thu Apr 25 02:50:03 2013 BST (GMT-1:00)
02:50:03.488166 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
04:26:40.060810 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
05:23:12.578159 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
06:06:33.578087 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
When the above happens on san008, we get the following in the raslog on san004:
2013/04/25-02:51:00, , 23365, SLOT 7 | FID 128, INFO, RELN_PRO_SAN004, Switch status changed from DOWN to HEALTHY.
2013/04/25-02:50:22, , 23364, SLOT 7 | FID 128, WARNING, RELN_PRO_SAN004, Switch status change contributing factor Marginal ports: 8 marginal ports. (Port(s) 384(0x180),385(0x181),386(0x182),387(0x183),388(0x184),389(0x185),390(0x186),391(0x187))
2013/04/25-02:50:22, , 23363, SLOT 7 | FID 128, WARNING, RELN_PRO_SAN004, Switch status changed from HEALTHY to DOWN.
We have observed that almost every time the node plugged into port 14 has a link reset, san004 then reports that ICL ports have gone marginal. The times match: in the logs above, the port resets at 02:50:03 and the marginal ports event follows at 02:50:22, 19 seconds later.
The SCN LR_PORT on san008 comes before the marginal ports event on san004.
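A rough way to check this correlation across a longer log capture is sketched below. It is only a sketch: it assumes both switches' clocks are in sync, that the fabriclog date comes from its header line, and that a 120-second window is a reasonable cut-off. The function names and the window value are made up for illustration. The sample lines are copied from the logs above.

```python
import re
from datetime import datetime, timedelta

# Date taken from the fabriclog header line; fabriclog entries carry time only.
FABRICLOG_DATE = "2013/04/25"

fabriclog = """\
02:50:03.488166 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
04:26:40.060810 SCN LR_PORT (0);g=0xffa A1,P0 A1,P0 14 NA
"""

raslog = """\
2013/04/25-02:50:22, , 23364, SLOT 7 | FID 128, WARNING, RELN_PRO_SAN004, Switch status change contributing factor Marginal ports: 8 marginal ports.
"""

def link_resets(text, date=FABRICLOG_DATE):
    """Return (timestamp, port) for each SCN LR_PORT line."""
    out = []
    for line in text.splitlines():
        m = re.match(r"(\d{2}:\d{2}:\d{2})\.\d+ SCN LR_PORT .* (\d+) NA$", line)
        if m:
            ts = datetime.strptime(f"{date} {m.group(1)}", "%Y/%m/%d %H:%M:%S")
            out.append((ts, int(m.group(2))))
    return out

def marginal_events(text):
    """Return (timestamp, marginal port count) for each marginal-ports raslog line."""
    out = []
    pattern = r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}),.*Marginal ports: (\d+) marginal ports"
    for line in text.splitlines():
        m = re.match(pattern, line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y/%m/%d-%H:%M:%S")
            out.append((ts, int(m.group(2))))
    return out

def correlate(resets, events, window=timedelta(seconds=120)):
    """Pair each link reset with marginal events that follow it within the window."""
    pairs = []
    for r_ts, port in resets:
        for e_ts, count in events:
            delta = e_ts - r_ts
            if timedelta(0) <= delta <= window:
                pairs.append((port, r_ts, count, delta.total_seconds()))
    return pairs

pairs = correlate(link_resets(fabriclog), marginal_events(raslog))
for port, r_ts, count, secs in pairs:
    print(f"port {port} reset at {r_ts:%H:%M:%S}, {count} marginal ports {secs:.0f}s later")
```

Resets with no matching marginal event (such as the 04:26:40 one in this small sample) simply produce no pair, which makes it easy to see how often the two really coincide.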
What's more, the marginal ports are always ICL ports, and always in port groups of 8.
Sometimes 8 ports go "marginal", sometimes 16, sometimes 32, occasionally 48.
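To see which groups are affected in each event, the port list can be pulled out of the raslog message and bucketed by port index. This is a sketch only: the grouping rule (index divided by 8, groups aligned on multiples of 8) is an assumption for illustration and not a confirmed ICL port layout. The message text is the one from the raslog above.

```python
import re
from collections import defaultdict

message = (
    "Switch status change contributing factor Marginal ports: 8 marginal ports. "
    "(Port(s) 384(0x180),385(0x181),386(0x182),387(0x183),"
    "388(0x184),389(0x185),390(0x186),391(0x187))"
)

def marginal_ports(msg):
    """Extract decimal port indexes written as 'NNN(0xHHH)'."""
    return [int(p) for p in re.findall(r"(\d+)\(0x[0-9a-fA-F]+\)", msg)]

def by_group(ports, size=8):
    """Bucket ports into groups of `size`, keyed by index // size (assumed alignment)."""
    groups = defaultdict(list)
    for p in ports:
        groups[p // size].append(p)
    return dict(groups)

ports = marginal_ports(message)
groups = by_group(ports)
for group, members in sorted(groups.items()):
    print(f"group {group}: ports {members[0]}-{members[-1]} ({len(members)} ports)")
```

Running this over every marginal-ports message would show whether the same groups are hit each time or whether it moves around.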
The node plugged into port index 14 is an IBM Power 7 frame.
Has anybody observed this behaviour, or know what the cause might be? It has been going on for months in our environment.
Sometimes it causes fscsi errors on the IBM hosts, with paths dropping momentarily.
We have also observed hundreds of link resets in the fabriclog, across Windows and AIX hosts.
I can provide any data required.
We have escalated this case with support and are getting nowhere fast.
04-25-2013 08:21 AM
FW-1436 indicates:
This occurred because the number of marginal ports is greater than or equal to the policy set using the switchStatusPolicySet command. A port is faulty when the port value for Link Loss, Synchronization Loss, Signal Loss, Invalid word, Protocol error, CRC error, Port state change, or Buffer Limited Port is above the high boundary.
Have you recently made any policy changes?
04-29-2013 04:19 AM
OK, I understand why the switch is reporting marginal status. I'm more interested in what is causing the link resets and the ports to fault in the first place.