Contributor
Posts: 21
Registered: ‎04-23-2008

AIX SC_DISK_ERR4 Errors - Due to "HW ASIC Chip error type" in 48K

Hi ALL,

First, I'd like to thank everyone who can give us any information.

We have a core-edge SAN with 2 fabrics, and in one of them we had the problem below:

- 2009/12/09-00:18:30, , 8057, SLOT 6 | CHASSIS, CRITICAL, ED_48000B, S9,C1: HW ASIC Chip error type = 0x21.

     PS: This Switch is our Principal in this fabric ...

SW_xpto:admin> fabricshow
Switch ID   Worldwide Name           Enet IP Addr    FC IP Addr      Name
-------------------------------------------------------------------------
20: fffc14 10:00:00:05:1e:36:3a:70 167.2.60.150    0.0.0.0         "SW_02"
21: fffc15 10:00:00:05:1e:36:3a:b6 167.2.61.23     0.0.0.0         "SW_01"
126: fffc7e 10:00:00:05:1e:36:29:34 167.2.60.153    0.0.0.0        >"sw_xpto"

26: fffce2 10:00:00:05:1e:36:3a:a8 167.2.61.26     0.0.0.0         "SW_03"

The Fabric has 4 switches

SW_xpto:admin>

We asked our OEM to replace this blade, but both before and after the replacement we have kept getting the errors below.

Lately we've been receiving errors like this on some AIX servers:

---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR4
IDENTIFIER:     DCB47997

Date/Time:       Wed Dec  9 16:36:44 2009
Sequence Number: 74880
Machine Id:      0007D511D600
Node Id:         aix_host
Class:           H
Type:            TEMP
Resource Name:   hdisk97
Resource Class:  disk
Resource Type:   SYMM_RAID5
Location:        U7311.D20.0613C5C-P1-C02-T1-W5006048C49AE3C59-LE5000000000000
VPD:
        Manufacturer................EMC
        Machine Type and Model......SYMMETRIX
        ROS Level and ID............5671
        Serial Number...............856A6580
        Part Number.................000000000000510039000287
        EC Level....................750385
        Device Specific.(Z0)........03
        Device Specific.(Z1)........51
        Device Specific.(Z2)........567100750000000000092508
        Device Specific.(Z3)........12000000
        Device Specific.(Z4)........54130008
        Device Specific.(Z5)........7F80
        Device Specific.(Z6)........4D

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0A00 2A00 013D 9558 0000 0804 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 772F 0002 E040 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0035 001D
---------------------------------------------------------------------------

The question is: what other logs or commands can I use to get more information and identify what's going on here?

Thanks

Daniel Volochen

External Moderator
Posts: 4,974
Registered: ‎02-23-2004

Daniel,

This is a hardware error on the Symmetrix; it is not caused by the 48K CP.

TechHelp24
Contributor
Posts: 21
Registered: ‎04-23-2008

Many thanks for your answer!

But at the same time, we received the same message on some AIX servers that have access to CLARiiON (CX-700, CX3-80) and other Symmetrix arrays. It seems to be something else; from my point of view, the only common point is the SAN.

We also noticed that most servers get the errors when they are trying to access an array at the other site (we mirror data through LVM in AIX). Below is another example:

---------------------------------------------------------------------------
LABEL:          SC_DISK_ERR4
IDENTIFIER:     DCB47997

Date/Time:       Fri Dec 11 02:35:35 2009
Sequence Number: 91321
Machine Id:      00C7B1BE4C00
Node Id:         aix_host_1
Class:           H
Type:            TEMP
Resource Name:   hdisk2886
Resource Class:  disk
Resource Type:   CLAR_FC_raid5
Location:        U5791.001.992034N-P1-C01-T1-W5006016B30600FCA-LA000000000000
VPD:
        Manufacturer................DGC
        Machine Type and Model......RAID 5
        ROS Level and ID............0226
        Serial Number...............CK200044500286
        Subsystem Vendor/Device ID..CX700
        Device Specific.(PQ)........00
        Device Specific.(VS)........740000CCE6CL
        Device Specific.(UI)........60060160F5A71200B6EE27F7ADBCD911
        FRU Label...................0074
        Device Specific.(Z0)........10
        Device Specific.(Z1)........10

Description
DISK OPERATION ERROR

Probable Causes
MEDIA
DASD DEVICE

User Causes
MEDIA DEFECTIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Failure Causes
MEDIA
DISK DRIVE

        Recommended Actions
        FOR REMOVABLE MEDIA, CHANGE MEDIA AND RETRY
        PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
PATH ID
           0
SENSE DATA
0A00 2A00 023B 0800 0001 0004 0000 0000 0000 0000 0000 0000 0200 0300 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 7D4B 0005 2D80 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0035 001D
---------------------------------------------------------------------------
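One way to test the "only common point is the SAN" idea is to group the errpt entries by the target port they were addressed to. Below is a minimal Python sketch, assuming the Location field ends in -W<target WWPN>-L<LUN id>, as it does in the two entries posted here. The function names are made up for illustration, and the input is assumed to be saved `errpt -a` text.

```python
import re
from collections import Counter

# Assumed layout of an AIX FC disk location code: "...-W<target WWPN>-L<LUN id>".
LOCATION_RE = re.compile(r"-W([0-9A-Fa-f]{16})-L([0-9A-Fa-f]+)$")


def parse_location(location):
    """Return (target_wwpn, lun) from a location code, or None if it does not match."""
    match = LOCATION_RE.search(location.strip())
    if match is None:
        return None
    raw = match.group(1).lower()
    # Colon-separated bytes, so the value can be pasted into switch commands.
    wwpn = ":".join(raw[i:i + 2] for i in range(0, 16, 2))
    return wwpn, match.group(2)


def count_errors_by_target(errpt_text):
    """Count "Location:" lines per target WWPN in `errpt -a` style text."""
    counts = Counter()
    for line in errpt_text.splitlines():
        if line.startswith("Location:"):
            parsed = parse_location(line.split(":", 1)[1])
            if parsed:
                counts[parsed[0]] += 1
    return counts
```

Running it over the saved output from each host gives a per-target count. The WWPNs can then be looked up on the switches to see which blades and ISLs those targets have in common.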

External Moderator
Posts: 4,974
Registered: ‎02-23-2004

Daniel,

On this point you are right, but in my opinion the CP cannot have caused this error. I agree that some errors can be caused by a defective CP, but it has been replaced, so the new CP can no longer cause a similar error.

My assumptions are:

1) During the CP replacement, an error occurred with the config, zones, WWN zones or aliases.

2) A port blade is defective ... but

3) ... the question is, are all arrays connected to the same blade? And the answer is probably no?

What does the "errshow" output say?

TechHelp24
Contributor
Posts: 21
Registered: ‎04-23-2008

1) During the CP replacement, an error occurred with the config, zones, WWN zones or aliases.

2) A port blade is defective ... but

3) ... the question is, are all arrays connected to the same blade? And the answer is probably no?

     A: No, all arrays are spread across all blades ... but on our core switches (this one is an edge switch).

SW:admin> slotshow

Slot   Blade Type     ID    Status
-----------------------------------
  1     SW BLADE     18     ENABLED
  2     SW BLADE     18     ENABLED
  3     UNKNOWN             VACANT
  4     UNKNOWN             VACANT
  5     CP BLADE     16     ENABLED
  6     CP BLADE     16     ENABLED
  7     UNKNOWN             VACANT
  8     UNKNOWN             VACANT
  9     SW BLADE     18     ENABLED
 10     SW BLADE     18     ENABLED

SW:admin>

In our case, they suggested that we replace the blade in slot 9 of this director, which we did a few days ago ...

The "errshow" output didn't show anything interesting (in my opinion); I just got a bunch of messages like this:

"2009/12/11-12:06:01, , 8218, SLOT 6 | FID 128, INFO, SW,  The last device change happened at : Fri Dec 11 12:05:50 2009"

PS: The errshow log is attached.

Thanks in Advance

Daniel

External Moderator
Posts: 4,974
Registered: ‎02-23-2004

Daniel,

--->>>....they suggested us to replace slot 9....

Why slot 9? Is the error coming from the blade in slot 9?

--->>>...all arrays are spread across all blades

If, for example, the SPs of the CX-700 are connected with SPA on blade 2 and SPB on blade 5 (or other blades; I cannot see from here which blades the SPs are connected to), can you unplug A or B as a test to see whether the error continues?

In the errshow output I see messages that can be caused by a wrong setting in the switch status policy, as below.

FW-1436

Message             <timestamp>, , <sequence-number>,, WARNING, <system-name>, Switch status
                    change contributing factor Marginal ports: <Num of marginal ports and the port
                    numbers> marginal ports. (Ports <Unknown>)

Probable Cause      Indicates that the switch status is not in a healthy state. This occurred because
                    the number of marginal ports is greater than or equal to the policy set using the
                    switchStatusPolicySet command.

                    A port is faulty when the port value for Link Loss, Synchronization Loss,
                    Signal Loss, Invalid word, Protocol error, cyclic redundancy check (CRC) error,
                    Port state change or Buffer Limited Port is above the high boundary.

Recommended Action  Replace any faulty or deteriorating small form-factor pluggables (SFPs).

Severity            WARNING

FW-1424

Message             <timestamp>, , <sequence-number>,, WARNING, <system-name>, Switch status
                    changed from <Previous state> to <Current state>.

Probable Cause      Indicates that the switch status is not in a healthy state.
                    This occurred because of a policy violation.

Recommended Action  Run the switchStatusShow command to determine the policy violation.

Severity            WARNING

TechHelp24
Super Contributor
Posts: 260
Registered: ‎04-09-2008

TEMP errors can be caused by problems in the SAN, and from some of the events you have described, that seems to be the case here.

Pick one server and trace its connectivity; draw a diagram if you wish. Run fabriclog -s on all the switches in its path. If there are any flapping ports, fix the issue. I would recommend doing this for every flapping port, relevant or not, since it will help with troubleshooting in future. There may be ISL ports in the list as well, and those could well be the cause of the problem.

It would also help to check the AIX machines for fcs or fscsi errors. If you find any, run errpt -j <identifier> (add -a for the detailed report) and post the output on the forum.

Be patient, because isolating this type of error in a large fabric takes time.
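To pick out the adapter-level entries quickly, the errpt summary can be filtered by resource name. Here is a rough Python sketch, assuming the usual six-column errpt summary layout (IDENTIFIER, TIMESTAMP, T, C, RESOURCE_NAME, DESCRIPTION) saved as text. adapter_error_ids is a hypothetical helper name, not an AIX tool.

```python
import re

# Resource names of the FC adapter (fcsN) and its protocol driver (fscsiN).
ADAPTER_RE = re.compile(r"^(fcs|fscsi)\d+$")


def adapter_error_ids(errpt_summary):
    """Map each fcs/fscsi resource to the set of error identifiers logged against it."""
    found = {}
    for line in errpt_summary.splitlines():
        # Split into at most six fields so the free-text description stays whole.
        fields = line.split(None, 5)
        if len(fields) < 5 or fields[0] == "IDENTIFIER":
            continue  # blank line, short line, or the header row
        identifier, resource = fields[0], fields[4]
        if ADAPTER_RE.match(resource):
            found.setdefault(resource, set()).add(identifier)
    return found
```

Each identifier it returns can then be passed to errpt -a -j <identifier> to get the detailed report for posting.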

Contributor
Posts: 21
Registered: ‎04-23-2008

Thanks bijukrishnan,

I'm working on it right now, because we received those messages again a few minutes ago ...

Guys, can anyone share useful commands for collecting logs and information in my environment?

Thanks

Daniel

Super Contributor
Posts: 260
Registered: ‎04-09-2008

Did you get any leads on solving the problem?

I gave a friend some useful commands a few weeks ago. They may be useful to you too.

Brocade command line for daily use

switchshow - For general switch info.

porterrshow - To check for port errors. Look for counters with high numbers.

portshow <port number> - To check for more detailed stats.

errshow -r - Shows the FOS error log one page at a time, newest entries first.

errdump -r - Dumps the same error log without page breaks, newest entries first.

portstatsshow <port number> - Shows port error counters up to 32 bits wide.

portstats64show <port number> - Shows port error counters up to 64 bits wide.

fabriclog -s - Shows fabric transitions. A very important command for seeing whether a port is flapping.

nodefind <WWPN> - To find where a device is logged in to the fabric.

To troubleshoot zoning issues

This command will help you easily locate zone info:

cfgshow | grep -A4 -B4 <server name>

For example: cfgshow | grep -A4 -B4 pat2140

fcping <initiator WWPN> <target WWPN>

For example: fcping 10:00:00:00:3e:c9:10 50:00:00:00:6e:d9:01
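For the porterrshow item in the list above, scanning the counters by eye gets tedious on a director. Here is a small Python sketch that flags ports with values above a threshold in chosen columns. Assumptions: each data row starts with the port index and a colon, counters may carry k/m/g suffixes, and the column order is supplied by the caller because it differs between FOS versions. The function names and column labels are illustrative only.

```python
import re

# Multipliers for abbreviated counters such as "1.2k" or "3m" (assumed notation).
SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}
ROW_RE = re.compile(r"^\s*(\d+):\s+(.*)$")


def to_number(token):
    """Convert a counter such as '12', '2.5k' or '3m' to an approximate integer."""
    token = token.lower()
    if token[-1] in SUFFIX:
        return round(float(token[:-1]) * SUFFIX[token[-1]])
    return int(token)


def noisy_ports(text, columns, watch, threshold=0):
    """Return {port: {column: value}} for watched columns whose value exceeds threshold."""
    flagged = {}
    for line in text.splitlines():
        match = ROW_RE.match(line)
        if not match:
            continue  # header, separator, or prompt line
        tokens = match.group(2).split()[:len(columns)]
        try:
            row = dict(zip(columns, [to_number(t) for t in tokens]))
        except ValueError:
            continue  # looked like a row but did not hold counters
        high = {c: v for c, v in row.items() if c in watch and v > threshold}
        if high:
            flagged[int(match.group(1))] = high
    return flagged
```

Pass in the saved porterrshow text, the column labels in the order your FOS version prints them, and the set of columns you care about (CRC and encoding errors are a reasonable start). Running it twice a few minutes apart shows which counters are still climbing.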

New Contributor
Posts: 2
Registered: ‎04-06-2009

I receive this same error message on only one of my 48Ks, any time I run a supportshow or supportsave. I see it in CMDCE and in the Switch Events in Web Tools.
