06-13-2017 12:58 PM
We have two partial-mesh FOS 7.2.x SAN fabrics with DCX-8510 SAN Directors. One fabric uses Port Based Routing (PBR) and the other uses Exchange Based Routing (EBR). Whenever we perform routine maintenance on ISLs (add/remove) we get I/O path failures and server crashes in the PBR fabric. We understand the problem: any change in the fabric that requires routes to be recalculated can cause I/O paths to fail. We understand that the solution is to change the routing policy to EBR, but that change requires a switch disable/enable, which itself causes I/O path failure. An intermediate step is to enable Lossless Dynamic Load Sharing (DLS). Lossless DLS should eliminate dropped frames and I/O failures by rebalancing the paths going over the ISLs whenever there is a fabric event, and eliminate the frame delay caused by establishing a new path when a topology change occurs.
Before I make the changes I need a little help understanding why paths fail when using PBR, and what specifically EBR and Lossless DLS do to prevent I/O path failure.
Also, does anyone have experience making the parameter change for Lossless DLS? Can you do it hot? What happens to existing traffic? What commands can I use to see what is happening during the change?
06-13-2017 01:23 PM
In the Admin Guide there is a table of supported combinations of Port Based and Exchange Based Routing with Lossless DLS.
Is this an Open Systems or a Mainframe environment?
In any case, I think you have to make such a change within a maintenance window. To be honest, I've never made this change in a production fabric, but I believe it may be disruptive; I'm really not sure.
--->>>What commands can I use to see what is happening during the change?
With the commands fabricshow and nsshow you should be able to see what happens during the change.
06-13-2017 11:33 PM
You do not say or verify, but it sounds like IOD (in-order delivery) is enabled - please check with iodshow.
If you are using PBR (Port Based Routing) with IOD enabled (but without lossless DLS), then when a topology change occurs (adding or removing an ISL), traffic on the affected switches "freezes" for about 500 ms and you will see discarded frames.
Because frames are discarded for this extended period of time (500 ms), servers will time out, lose paths (name server queries or SCSI TUR losses), etc. Most interrupted SCSI exchanges (server to disk) will be aborted.
You might be able to tune the multipath software to increase its robustness.
So you can run PBR with lossless DLS (dls --lossless --enable) if your targets and/or initiators require PBR.
If you enable lossless DLS, then when a topology change occurs the switches pause the senders of frames (servers/initiators) while the route change takes place, and re-enable the senders once it is finished.
The time for the change is also an order of magnitude lower than the 500 ms above.
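The contrast described above can be shown with a toy model (this is not Brocade's implementation; the frame rate and the 50 ms lossless pause are illustrative numbers chosen only to reflect the "order of magnitude lower than 500 ms" point):

```python
FRAME_INTERVAL_MS = 1  # hypothetical: one frame per millisecond

def transit(total_ms, freeze_start, freeze_ms, lossless):
    """Count frames delivered vs dropped across a route-change window."""
    delivered, dropped, paused = 0, 0, 0
    for t in range(0, total_ms, FRAME_INTERVAL_MS):
        in_freeze = freeze_start <= t < freeze_start + freeze_ms
        if not in_freeze:
            delivered += 1
        elif lossless:
            paused += 1   # sender is flow-controlled; frame is sent later
        else:
            dropped += 1  # frame discarded; SCSI exchange gets aborted
    return delivered + (paused if lossless else 0), dropped

# PBR + IOD without lossless: a 500 ms freeze discards ~500 frames.
ok_pbr, lost_pbr = transit(1000, 200, 500, lossless=False)
# Lossless DLS: senders paused briefly (50 ms here), nothing dropped.
ok_dls, lost_dls = transit(1000, 200, 50, lossless=True)

print(f"without lossless: {lost_pbr} frames dropped")
print(f"with lossless DLS: {lost_dls} frames dropped, traffic merely delayed")
```

The key difference the model captures: without lossless DLS the freeze turns into frame loss and aborted exchanges, while with it the same event becomes a short delay the hosts never notice.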
EBR in itself will not help your situation if you have IOD set, but its advantage over PBR is that traffic will be rebalanced over all ISLs after a change, which does not happen under PBR. The reason to prefer EBR is that it distributes traffic from one initiator over all ISLs based on a hash of SID, DID, and OXID, so all frames of a given exchange travel over the same ISL.
That means that instead of ALL traffic from a SID to a DID going over the same ISL, as with PBR (where bringing that ISL down removes all of that traffic), with EBR bringing down one ISL of four impacts only about 25% of the traffic from that SID to that DID instead of 100%.
Lossless DLS can be enabled online (hot), but the fabric should be stable; enabling it is best practice. Once you have lossless DLS enabled, you can use portdecom to drain an ISL before taking it offline without impacting traffic (you need at least one other ISL between the two switches). Also verify whether you have IOD set and whether you really need it - lossless DLS will help here. EBR balances traffic from a SID to a DID over all ISLs using SID, DID, and OXID.
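A small sketch of the routing difference described above (this is an illustration, not Brocade's actual hash; CRC32 stands in for whatever hash the ASIC uses, and the ISL names are hypothetical):

```python
import zlib

ISLS = ["isl0", "isl1", "isl2", "isl3"]  # hypothetical 4 parallel ISLs
N = 1000                                  # exchanges to simulate

def pbr_route(sid: int, did: int) -> str:
    # Port-based: one pinned route per (SID, DID); OXID is ignored,
    # so every exchange of this flow takes the same ISL.
    return ISLS[zlib.crc32(f"{sid}:{did}".encode()) % len(ISLS)]

def ebr_route(sid: int, did: int, oxid: int) -> str:
    # Exchange-based: each exchange hashes independently over the ISLs.
    return ISLS[zlib.crc32(f"{sid}:{did}:{oxid}".encode()) % len(ISLS)]

sid, did = 0x010100, 0x020200  # made-up FC addresses

pbr = {pbr_route(sid, did) for _ in range(N)}        # always one ISL
ebr_counts: dict[str, int] = {}
for oxid in range(N):
    isl = ebr_route(sid, did, oxid)
    ebr_counts[isl] = ebr_counts.get(isl, 0) + 1

print("PBR uses ISLs:", sorted(pbr))
print("EBR spread:", dict(sorted(ebr_counts.items())))

# Losing one of four ISLs: PBR loses 100% of this flow if it was the
# pinned link; EBR loses only the ~25% of exchanges hashed to it.
failed = "isl2"
affected = ebr_counts.get(failed, 0) / N
print(f"EBR exchanges affected by losing {failed}: {affected:.0%}")
```

Running this shows PBR concentrating the whole SID-to-DID flow on a single ISL while EBR spreads the exchanges roughly evenly, which is exactly why one ISL failure hits 100% of the flow under PBR but only about a quarter of it under EBR.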
06-14-2017 12:15 PM
Thanks for all of the great info. From your description things just got a bit more complicated. We have two fabrics with four SAN Directors each: two Directors per fabric at one data center and two per fabric at a second, remote data center. One fabric has all of its Directors set to EBR, but two of those Directors have IOD set. The other fabric has two Directors set for EBR with IOD and two Directors set for PBR without IOD.
Our specific problem is with Oracle RAC servers evicting themselves from their clusters and rebooting when SAN changes occur. The Oracle Clusterware 'misscount' timeout value is set to 30 seconds. If a node can't reach any of the other RAC nodes within the misscount timeframe it will start split-brain resolution and will probably evict itself from the cluster by rebooting.
We see other hosts (SQL Server, non-RAC Oracle, non-DB servers) lose I/O paths, but multipathing handles them well.
I believe that if PBR is set, the ISLs in the fabric won't load balance even though other switches are set to EBR. Are you saying that if we have IOD set and EBR enabled we don't get the benefits of DLS?
06-15-2017 02:00 AM
For the Oracle RAC, ensure that lossless DLS is enabled; this will reduce the issues seen. Once you have enabled lossless DLS (under both PBR and EBR), IOD has less of an impact since the freeze time is much shorter.
For load balancing, I would enable EBR on all switches/directors. Further, if possible, unify the settings - either IOD off everywhere (which I prefer) or on everywhere - rather than mixing settings.
With PBR, the traffic will not load balance, though over time connections (SID to DID) will even out. Under PBR, do you have DLS enabled or not? If DLS is enabled, then when a change occurs (ISL or F-port down/up), connections are rebalanced over the ISLs.
In short - enable lossless DLS on all switches, if possible switch off IOD, and enable EBR.
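As a rough outline of that summary, a possible command sequence per director might look like the following. This is a sketch only - command names and flags vary by FOS release (the poster wrote the lossless command as "dls --lossless --enable"; recent FOS Admin Guides document it under dlsSet), so verify everything against the Admin Guide for your 7.2.x code before running it:

```
# Check current state
aptpolicy            # routing policy (1 = port-based, 3 = exchange-based)
iodshow              # is in-order delivery set?
dlsshow              # DLS / lossless state

# Enable lossless DLS (can be done hot on a stable fabric)
dlsset --enable -lossless

# Disable IOD if it is not actually required
iodreset

# Changing the routing policy to EBR is disruptive per switch
switchdisable
aptpolicy 3
switchenable
```

The aptpolicy change is the one step that cannot be done hot, which matches the earlier point about needing a maintenance window for the EBR conversion.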