Fibre Channel (SAN)

New Contributor
M48MAN
Posts: 2
Registered: 08-30-2004

Has anyone experienced the following error on their DCX switch

As you can see below, we first had an error saying we had a hardware fault, then an error saying we did not. This went on for over 12 hours and then stopped again. We are running Fabric OS v6.3.0b. Has anyone else experienced this error, and how was it resolved?

2011/04/05-15:08:34, , 189040, SLOT 6 | CHASSIS, WARNING, ED_DCX_B, S1,C1: Internal link errors have been reported, no hardware faults identified, continuing to monitor for errors: fault1:10, fau

2011/04/05-15:12:33, , 189041, SLOT 6 | CHASSIS, CRITICAL, ED_DCX_B, S1,C1: Internal monitoring of faults has identified suspect hardware, blade may need to be reset or replaced: fault1:100, fault2

Contributor
sebastian.thaele
Posts: 60
Registered: 05-26-2009

Re: Has anyone experienced the following error on their DCX switch

Hi M48Man,

At least one backlink port on ASIC 1 of the port card in slot 1 seems to be receiving "freshly corrupted" frames from one of the core cards. I suggest opening a case with your service/maintenance provider so they can check which backlinks are affected and put together a step-by-step action plan to resolve it.

In any case: don't reseat or replace anything at the moment. Slot 6, for example, is certainly not the culprit, as it is just the CP card reporting the internal errors. Let support analyse the internal logs and counters first.
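If it helps to have the commands at hand, a minimal sketch of gathering that data for support from the active CP (supportsave prompts interactively for the transfer details):

    supportsave        (collects RASlog, trace and internal counter data and uploads it to an FTP/SCP host you specify)
    errdump            (dumps the persistent error log if you want a quick look at the raw messages yourself first)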

By the way, v6.3.0b has some very ugly bugs. I would update it (independently of the problem with the internal frame corruption).
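For reference, a quick sketch of checking what is currently loaded before you plan that update (the target release itself should come from your support provider):

    firmwareshow        (shows the FOS level loaded on each CP)
    firmwaredownload    (interactive upgrade from an FTP/SCP server, to be run once support has confirmed the target release)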

Cheers seb

Contributor
luke.soares
Posts: 44
Registered: 05-22-2009

Re: Has anyone experienced the following error on their DCX switch

We had a very similar error on our DCX running 6.3.0b.

If you are running 48-port or 64-port cards, then I would suggest upgrading asap.

6.3.0b and other 6.3 levels had some significant routing errors. 

In my SAN it actually downed hundreds of our DCX ports on two separate occasions before we found the issue was related to the TSB for 48-port cards.

If you aren't using 48- or 64-port cards then I suppose our issues are not similar, but our messages sure were.

Contributor
luke.soares
Posts: 44
Registered: 05-22-2009

Re: Has anyone experienced the following error on their DCX switch

I just had another customer who experienced this identical issue on 6.4.1a.

My recommendation is to run slotpoweroff on slot 1 and then power it back on. If that doesn't resolve the situation, then replace the blade.

This is what Brocade's manual indicates should be done (FOS Message Reference Guide).
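For what it's worth, a sketch of that sequence if you do go that route (slot 1 assumed per the messages above; anything on that blade loses connectivity while it is powered off):

    slotpoweroff 1      (powers off the blade in slot 1)
    slotpoweron 1       (powers the blade back on)
    slotshow            (confirms the blade comes back to an enabled state)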

Contributor
sebastian.thaele
Posts: 60
Registered: 05-26-2009

Re: Has anyone experienced the following error on their DCX switch

A reseat of the blade in slot 1 _could_ help. But it's an internal CRC that was detected on one of the internal ingress ports of an ASIC on the blade in slot 1, so that port is only one possible source of the problem. The frame could have been corrupted at one of three places: the sending port (a core card in this case), the backplane, or the receiving port (the one in slot 1). While I have never actually had to replace a backplane because of this problem, there is a fifty-fifty chance that the culprit is not the port card but the core card, and a reseat of a core card has far less impact. The error message itself will not tell you which core card it is. For this reason, it is better to open a service request with the service provider for this DCX.
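If you want to see which slots hold which blade types before anyone touches hardware, a small sketch (output details vary by chassis model and FOS level):

    slotshow        (lists each slot with its blade ID, type and status, so you can tell the core and CP blades apart from the port blades)
    chassisshow     (adds FRU details such as part and serial numbers, which are handy to include in the service request)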

Occasional Contributor
sriram.iyer
Posts: 5
Registered: 10-30-2012

Re: Has anyone experienced the following error on their DCX switch

Hi all

I faced a similar issue with one of my DCX

2012/12/17 - 12:08:04, , 107227, SLOT 6 | CHASSIS, CRITICAL, IBM_2499_384, S11,C0: Internal monitoring has identified suspect hardware, blade may need to be reset or replaced: fau1:516, fau2:-268435355 th2:0x90000064

2012/12/17 - 12:13:04, , 107228, SLOT 6 | CHASSIS, WARNING, IBM_2499_384, S11,C0: Internal link errors reported, no hardware faults identified, continuing monitoring: fault1:516, fault2:-268435456 thresh1:0x8000000a

Unfortunately, many Unix hosts reported "path down" and "path recovered" errors around the same timestamps.

These hosts were either connected directly to the blade in slot 11 or zoned with storage ports connected to that blade.

So, is it possible that the above errors on the DCX can affect the paths between the storage and the hosts?

The FOS version is v6.4.2a.

Thanks

Valued Contributor
felipon
Posts: 546
Registered: 06-11-2010

Re: Has anyone experienced the following error on their DCX switch

Hi,

I agree with the advice above, and I suggest you open a case with the switch vendor, because the solution to these issues usually involves handling delicate components, and that sort of operation needs to be fully diagnosed first. Usually, after reviewing the switch supportsaves, the first step would be to reset the core blade involved in the errors, but that depends on what information the supportsaves provide.

And to answer the question above: yes, these errors can make some hosts complain. If a frame is corrupted somewhere in the fabric, it will eventually be discarded and the whole exchange will have to be retransmitted after a SCSI timeout on the host.
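A rough way to check whether such discards are also visible on the front-end ports (a sketch; exact counter names differ slightly between FOS levels):

    porterrshow             (per-port error summary; watch for growing CRC and class-3 discard counters on the ports of the affected blade)
    portstatsshow 11/0      (detailed counters for a single port, here slot 11 port 0 as an example)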

Rgds

Occasional Contributor
sriram.iyer
Posts: 5
Registered: 10-30-2012

Re: Has anyone experienced the following error on their DCX switch

Hi all

I have opened an L2 ticket with IBM support and received the following response:

1. Upgrade FOS to 6.4.3c

2. Enable bottleneck detection on the switch with alerts using default values for threshold and time.                                          

    bottleneckmon --enable -alert      

3. Enable Link Reset recovery.                                           

    bottleneckmon --cfgcredittools -intport -recover onLrOnly          

4. Run statsclear and slotstatsclear on the switch, then

    collect a supportsave from the switch 48 hours later for a further check (see the command sketch below).
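Put together as a rough command sketch (steps 2 to 4, to be run after the firmware upgrade; the bottleneckmon syntax is exactly as support gave it and may differ on other FOS levels):

    bottleneckmon --enable -alert                                  (step 2: bottleneck detection with default threshold and time values, alerting on)
    bottleneckmon --cfgcredittools -intport -recover onLrOnly      (step 3: link-reset-based credit recovery on the internal back-end ports)
    statsclear                                                     (step 4: clear the port statistics counters)
    slotstatsclear                                                 (step 4: clear the per-slot statistics counters)
    supportsave                                                    (48 hours later, collect the data for IBM to review)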

So let's see how it goes. Thanks for the help, guys.

Contributor
sebastian.thaele
Posts: 60
Registered: 05-26-2009

Re: Has anyone experienced the following error on their DCX switch

Although the first point about upgrading FOS might sound like a "standard action plan to keep someone busy" (I once heard that from a customer), it actually makes sense here, because Brocade resolved several of these internal "hardware-related" errors (CRCs over backlinks) by adjusting and tuning the internal parameters. So before reseating or even replacing any hardware (which always carries the risk of really bending a pin or breaking something), the FOS update should be done first. Point 3 makes sense as well: if there are bit errors on a backlink (which is also what causes the internal CRC errors), they can render the primitives for buffer-credit replenishment unreadable and cause a stuck VC (see "Stuck VCs or why my switch began to nag?" on seb's sanblog).

Points 2 and 4 are more about enabling further monitoring and troubleshooting.
