03-20-2016 09:28 AM
For various reasons we only got around to upgrading our SANs (two SANs, each with 2 core DCX-4S directors and 32 edge switches) from v7.1.0c to v7.4.0a in February. As part of this we went from FabricWatch to MAPS.
We are now seeing incidents of massive throughput on SAN ports that we never used to see. We store the errdump outputs, and prior to the upgrades FabricWatch FW-1523 reported RX performance higher than 80% just 16 times in 13 months, only twice going above 90%. Since the upgrades we have seen 239 TX and 63 RX hits where the performance was above 90%.
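For anyone curious, those figures come from simply counting the message IDs in the stored logs. A rough sketch (the directory and file names are placeholders for wherever you keep your errdump output):

    # FabricWatch performance alerts in the pre-upgrade logs (count per file)
    grep -c "FW-1523" errdump_pre/*.txt
    # MAPS bandwidth alerts in the post-upgrade logs (count per file)
    grep -c "MAPS-1003" errdump_post/*.txt
    # Total MAPS hits at 90% or above
    grep -h "MAPS-1003" errdump_post/*.txt | grep -Ec "Current Value:\[(RX|TX),9[0-9]"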
These recent events have caused issues (saturating disk ports and causing latency within the disk array), and we have been able to identify underlying application errors causing these.
It just seems odd that we suddenly start seeing this sort of thing and it seems like too much of a coincidence to say that applications have suddenly got worse since February.
Is there a fundamental difference between what FW under v7.1.0c and MAPS under v7.4.0a report?
Is there some fundamental difference between these versions that allows (or encourages) higher throughput? I could not find anything in the release notes to suggest this.
03-22-2016 07:29 AM
One thing that I noticed around that upgrade was that the Configuration Policy Manager's default SAN policy was performing some relatively aggressive scans during its daily runs. It was scanning for host issues aggressively enough to actually cause some host issues... [sigh...]
I'd suggest that you check the run time for the policy against your bandwidth peaks and see if there is any correlation.
If you find that, then test by disabling the run, and if that makes your fabric(s) happier go back and modify the policy to eliminate some of the host-based checks.
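A quick way to check that correlation from the stored logs (a rough sketch; adjust the path to wherever you keep the errdump output):

    # Count the MAPS bandwidth alerts per hour, then compare against the policy's run window
    grep -h "MAPS-1003" errdump/*.txt | cut -c1-13 | sort | uniq -c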
03-23-2016 01:29 AM
Can you collect some more concrete details?
Where are you seeing the increases?
What is connected on the ports in question? Storage type/firmware; server/HBA type and firmware?
I have noted similar issues in the past; in the end the error was related to a defective cable and/or SFP, and in another case to wrong settings on a storage port.
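For the affected ports I would start with something like the following on the switch (1/6 here is just a placeholder slot/port):

    porterrshow          # look for enc_out, crc and disc c3 counters incrementing
    sfpshow 1/6          # SFP health: RX/TX power, temperature, voltage
    portstatsshow 1/6    # detailed traffic and error counters for a single port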
03-23-2016 01:43 AM
It would also be helpful to know the output of "portcfgshow".
Please collect it in a .txt file and upload it here, with a short description of the connectivity.
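Something like this from a management host will capture it (assuming you have SSH access; the user and switch names are placeholders):

    ssh admin@yourswitch "portcfgshow" > yourswitch_portcfgshow.txt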
03-23-2016 09:46 AM
I'm replying on behalf of Tony as he's quite busy at the moment.
Let me give some more information. This isn't a 'problem' exactly, and doesn't seem to be related to a particular switch or port. I've attached a graph, which plots all the alert messages in 2016 related to RX throughput going above 60%, across all our switches. We upgraded two of our four core switches (DCX-4S) on the 26th Jan, and then started upgrading the edge switches (5100s, 5300s, 6510s and 5480 AGs) in earnest on the 4th Feb, before going on to upgrade our second SAN fabric. As part of the upgrade we started using MAPS.
As you can see, all the alerts before the 4th of February (where I've put the black line) are well below 80% RX throughput, but after this the range of alert values is much higher - often over 90%.
It seems that either:
a) MAPS under v7.4.0a measures or reports throughput differently to FabricWatch under v7.1.0c, or
b) something in the new firmware is allowing the attached devices to push more traffic than they could before.
Any thoughts on which of these (or a third option) might be the case?
For ease of comparison, here is a before and after alert:
2016/02/04-10:48:58, [FW-1523], 2481, FID 128, INFO, c40fsw101, port9 FOP Port#9,RX Performance, has crossed lower threshold boundary to in between(High=80, Low=70). Current value is 71 Percentage(%)/minute.
2016/02/04-15:32:37, [MAPS-1003], 2787, FID 128, WARNING, c40fsw101, F-Port 9, Condition=ALL_PORTS(RX/min>60.00), Current Value:[RX,90.24 %], RuleName=LHR_High_Receive_Bandwidth_Min, Dashboard Category=Fabric Performance Impact.
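One difference is visible in the messages themselves: the FW alert was judged against the High=80/Low=70 boundaries, while our MAPS rule fires on anything over 60 %/min, so the two aren't directly comparable. To see exactly what the MAPS rule checks (using the rule name from the alert above; syntax from memory, so please check the command help):

    mapsrule --show LHR_High_Receive_Bandwidth_Min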
The reason we've particularly noticed this now is that we've had a series of events where Oracle RAC clusters connecting to our XIV storage have started driving an unusually large amount of bandwidth, which has impacted other applications. It may be completely unrelated, but we're wondering whether the upgrade might be enabling these servers to do more than they used to be able to?
I've also attached a portcfgshow from one of our core switches - it's similar across the rest of the SANs (I had to zip it up to be allowed to upload it with this post).
Thanks in advance!
03-23-2016 10:09 AM - edited 03-23-2016 10:28 AM
On slots 3 and 6 I only see ports 0-15; is this a 16-port blade?
The fill word is set to Mode 0 here.
Where is the storage connected? Slot/port numbers?
Keep in mind that the XIV storage system supports any fill word except IDLE/IDLE, and that is exactly the mode set on all the ports in slots 3 and 6.
Please let me know.
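If the fill word does need changing, it is set per port. From memory of the FOS 7.x syntax (please verify with the command help first):

    portcfgfillword 3/0, 3    # mode 3 = try ARBFF/ARBFF first, fall back to IDLE/ARBFF
    portcfgshow               # confirm the Fill Word value afterwards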
03-23-2016 11:14 AM
I would start by looking at the FOS 7.4 release notes, in particular the credit recovery and resiliency improvements.
Maybe the improvement is not in the bandwidth itself, but in the way FOS works to sustain that bandwidth by acting on credit recovery and similar mechanisms.
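If you want to see what is active on your switches, a couple of starting points (availability varies by platform and FOS level, so treat these as pointers rather than a definitive list):

    creditrecovmode --show    # back-end credit loss detection/recovery settings
    bottleneckmon --status    # legacy latency/congestion monitoring status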
03-24-2016 06:13 AM
Thanks Antonio for your time looking at this. I should have said before: our DCX-4S chassis only have two port blades, in slots 1 and 8. Slots 3 and 6 have inter-chassis link (ICL) blades that we don't use.
The XIVs are connected to 1/6, 1/11, 1/43, 8/22, 8/27 and 8/35.
03-24-2016 07:53 AM
Johnny, thanks for the update.
Whether the core (ICL) blades are used or not is another story; my mistake, I didn't read the first post about the 4S accurately enough.
In your case the fill word settings look OK.
How is the MAPS policy configured?
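e.g. the output of:

    mapspolicy --show -summary    # which policy is active on the switch
    mapsconfig --show             # global MAPS actions and settings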
Can you run SAN Health and share the output with me?