04-12-2011 04:21 AM
As you can see below, we first had an error saying we had a hardware fault, then an error saying we did not. This continued for over 12 hours and then stopped again. We are running Fabric OS v6.3.0b. Has anyone else experienced this error, and how was it resolved?
2011/04/05-15:08:34, , 189040, SLOT 6 | CHASSIS, WARNING, ED_DCX_B, S1,C1: Internal link errors have been reported, no hardware faults identified, continuing to monitor for errors: fault1:10, fau
2011/04/05-15:12:33, , 189041, SLOT 6 | CHASSIS, CRITICAL, ED_DCX_B, S1,C1: Internal monitoring of faults has identified suspect hardware, blade may need to be reset or replaced: fault1:100, fault2
04-12-2011 04:41 AM
At least one backlink port on ASIC 1 of the port card in slot 1 seems to be receiving "freshly corrupted" frames from one of the core cards. I suggest opening a case with your service / maintenance provider so they can check which backlinks are affected and create a step-by-step action plan to resolve it.
In any case: don't reseat or replace anything at the moment. For example, slot 6 is certainly not the culprit, as it is just the CP card reporting the internal errors. Let support analyse the internal logs / counters first.
By the way, v6.3.0b has some very ugly bugs. I would upgrade it (independently of the problem with the internal frame corruption).
04-26-2011 10:44 PM
We had a very similar error on our DCX running 6.3.0b.
If you are running 48-port or 64-port cards, then I would suggest upgrading asap.
6.3.0b and other 6.3 levels had some significant routing errors.
In my SAN it actually downed hundreds of our DCX ports on two separate occasions before we found the issue was related to the TSB for 48-port cards.
If you aren't using 48 or 64 port cards then I suppose our issues are not similar, but our messages sure were.
04-27-2011 04:22 PM
I just had another customer that experienced this identical issue on 6.4.1a.
My recommendation is to run slotpoweroff on slot 1 and then power it back on with slotpoweron. If that doesn't resolve the situation, replace the blade.
This is what Brocade's manual indicates should be done (FOS Message Reference Guide).
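As a rough sketch, the power-cycle sequence suggested above would look like this on the FOS CLI (slot number taken from the thread; verify against your own chassis before running):

```shell
# Power off the suspect port blade (slot 1 per the discussion above)
slotpoweroff 1

# Wait for the blade to power down fully, then power it back on
slotpoweron 1

# Verify the blade's status afterwards
slotshow
```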
04-28-2011 01:32 AM
A reseat of the blade in slot 1 _could_ help. But it's an internal CRC that was detected on one of the internal ingress ports of an ASIC on the blade in slot 1, so that is just one possible port responsible for the problem. The frame could have been corrupted at one of three places: the sending port (a core card in this case), the backplane, or the receiving port (the one in slot 1). While I have actually never had to replace a backplane because of this problem, there is a fifty-fifty chance that the culprit is not the port card but the core card, and a reseat of a core card has far less impact. But the error message itself will not tell you which core card it is. For this reason, it may be better to open a service request with the service provider for this DCX.
12-20-2012 06:54 AM
I faced a similar issue with one of my DCXs:
2012/12/17 - 12:08:04, , 107227, SLOT 6 | CHASSIS, CRITICAL, IBM_2499_384, S11,C0: Internal monitoring has identified suspect hardware, blade may need to be reset or replaced: fau1:516, fau2:-268435355 th2:0x90000064
2012/12/17 - 12:13:04, , 107228, SLOT 6 | CHASSIS, WARNING, IBM_2499_384, S11,C0: Internal link errors reported, no hardware faults identified, continuing monitoring: fault1:516, fault2:-268435456 thresh1:0x8000000a
Unfortunately, many Unix hosts reported "path down" and "path recovered" errors around the same timestamp.
These hosts were either connected directly to the slot 11 blade or zoned with storage ports connected to the slot 11 blade.
So, is it possible that the above errors on the DCX can affect the path between storage and hosts?
FOS version is v6.4.2a
12-21-2012 02:06 AM
I agree, and I suggest you open a case with the switch vendor, because the solution to these issues usually involves the manipulation of delicate components, and that sort of operation needs to be fully diagnosed first. Usually, after reviewing the switch supportsaves (ssaves), the first step would be to reset the core blade involved in the errors, but that depends on what info the ssaves provide.
Yes, these errors could make some hosts complain. If a frame is corrupted somewhere in the fabric, it will eventually be discarded, and the whole exchange will have to be re-transmitted after some SCSI timeout in the host.
12-21-2012 04:07 AM
I have opened an L2 ticket with IBM support and received the following response:
1. Upgrade FOS to 6.4.3c
2. Enable bottleneck detection on the switch with alerts using default values for threshold and time.
bottleneckmon --enable -alert
3. Enable Link Reset recovery.
bottleneckmon --cfgcredittools -intport -recover onLrOnly
4. Run statsclear and slotstatsclear on the switch, then collect a supportsave from the switch 48 hours later for further checking.
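Pulled together, the action plan above would look roughly like this on the FOS CLI (command forms as given in the steps; this is a sketch, not a verified procedure, so confirm each command against the FOS Command Reference for your version first):

```shell
# 2. Enable bottleneck detection with default thresholds and alerting
bottleneckmon --enable -alert

# 3. Enable Link Reset recovery on internal (back-end) ports
bottleneckmon --cfgcredittools -intport -recover onLrOnly

# 4. Clear switch-level and slot-level statistics counters
statsclear
slotstatsclear

# ~48 hours later, collect a supportsave for further analysis
supportsave
```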
So let's see how it goes. Thanks for the help, guys.
12-21-2012 04:33 AM
Although the first point about upgrading FOS might sound like a "standard action plan to keep someone busy" (I once heard that from a customer), it actually makes sense here, because Brocade solved several of these internal "hardware-related" errors (CRCs over backlinks) by adjusting and tuning the internal parameters. So before reseating or even replacing any hardware (which always carries the danger of actually bending a pin or breaking something), the FOS update should be done first. Point 3 makes sense as well: if there are bit errors on a backlink (which is also what causes the internal CRC errors), they could make the primitives for buffer credit replenishment unreadable and cause a stuck VC (Stuck VCs or why my switch began to nag? (seb's sanblog)).
Points 2 and 4 are more to allow further monitoring and troubleshooting.