05-01-2014 06:15 AM
I've seen this topic discussed in the past, but I think there's something new to add here...
A bit about the setup: a pair of Brocade 7800s is being used between two data centers to do a data migration with Hitachi Universal Replicator between a VSP on the source side and an HUS-VM on the target side. Both 7800s are running FOS v7.1.1b. Four FC ports, 4-7, are being used for MCU/RCU communication. There is a single FCIP tunnel (on VE port 16) using 6 circuits, i.e. all 6 GE ports are configured.
Almost like clockwork at around 30 minutes past midnight each night, this happens on one of the 7800s:
Wed Apr 30 2014 00:33:21 CDT Warning S0,P-1(3): Link Timeout on internal port ftx=2000497714 tov=2000 (>1000) vc_no=2 crd(s)lost=4 complete_loss:1. Chassis 364 1 C2-101
A case was opened with Hitachi and they found nothing wrong with the switch. Their suggestion was to turn on bottleneck monitoring, which was done a couple of weeks ago, but it hasn't helped at all. These still occur around the same time each night.
Suggestions in previous posts on this topic were to reseat or replace blades, but this is obviously not possible with a 7800. It seems highly coincidental that these errors are occurring with regularity at around the same time each night. Can anyone offer any insight as to why this might be the case and what can be done to fix it? When the errors occur, the FCIP tunnel becomes unusable and HUR on the VSP array starts throwing a hissy fit. The only way to recover is to reset the FCIP tunnel. I'm getting tired of climbing out of bed every morning to reset the link.
05-08-2014 04:02 AM
If you have an active maintenance contract and a problem with the switch, I would expect the vendor to give you a solution; if they can't, I'd recommend investing that budget in something else...
That said, what bottleneck configuration did you apply? Can you post it here?
I would configure it in the following way:
bottleneckmon --enable -alert
bottleneckmon --cfgcredittools -intport -recover onLrOnly
Do you always get the error on the same back-end port? If so, you can reset that port with the command:
bottleneckmon --linkreset 3
Once that is done, if the errors recur I would reboot the switch. And if they still appear after that, I would have it replaced.
05-19-2014 07:59 PM
Can you send me that case number?
Be aware that this has everything to do with Frameflow, and especially when FCIP is in the mix you might see these errors occurring. What it basically shows is that a back-end FC port is having trouble offloading frames to the blaster chip for further processing onto the FCIP stack.
If this is ongoing for a prolonged period you might see credit loss messages like the ones you mentioned.
I'm pretty sure the same clockwork will also show either an increased workload on the HUR pairs or another workload eating up bandwidth on the IP WAN link and thereby starving the FCIP throughput. Especially when QoS is not enabled it's the wild west on that side, and there is nothing you can do from a SAN or array perspective.
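If you want to test the clockwork theory against the logs before chasing the WAN side, something like the rough Python sketch below could help. It assumes you have saved the switch's log messages to a text file and that every Link Timeout line looks like the one quoted in the first post; the regular expression and the function names (parse_link_timeouts, events_per_hour) are made up for illustration and are not part of any Brocade tooling. Other FOS releases may format the message differently, and the month name is parsed with the default English locale.

```python
import re
from collections import Counter
from datetime import datetime

# Pattern modelled only on the single message quoted in this thread
# (assumption: other Link Timeout lines share this exact layout).
LINE_RE = re.compile(
    r"^\w{3} (?P<stamp>\w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}) \w+ "
    r".*Link Timeout on internal port.*?"
    r"vc_no=(?P<vc>\d+) crd\(s\)lost=(?P<lost>\d+)"
)


def parse_link_timeouts(lines):
    """Return (timestamp, vc_no, credits_lost) for each Link Timeout line."""
    events = []
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a Link Timeout message, skip it
        # Collapse repeated spaces so single-digit days parse cleanly.
        stamp = " ".join(match.group("stamp").split())
        when = datetime.strptime(stamp, "%b %d %Y %H:%M:%S")
        events.append((when, int(match.group("vc")), int(match.group("lost"))))
    return events


def events_per_hour(events):
    """Count events by hour of day (switch local time, time zone ignored)."""
    return Counter(when.hour for when, _vc, _lost in events)


if __name__ == "__main__":
    sample = [
        "Wed Apr 30 2014 00:33:21 CDT Warning S0,P-1(3): Link Timeout on "
        "internal port ftx=2000497714 tov=2000 (>1000) vc_no=2 "
        "crd(s)lost=4 complete_loss:1. Chassis 364 1 C2-101",
    ]
    for hour, count in sorted(events_per_hour(parse_link_timeouts(sample)).items()):
        print(f"{hour:02d}:00  {count} event(s)")
```

If nearly everything lands in the same hour, line that hour up with the HUR journal activity and with whatever else is scheduled on the WAN link (backups, other replication) around that time.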