Fibre Channel (SAN)

Contributor
Posts: 55
Registered: ‎05-12-2013

SCN saturation

We’ve been troubleshooting an issue lately where we have around 170 initiators zoned to the same 8 target ports on an HPE 3PAR array.

The initiators are spread across 3 edge switches per fabric and are AIX and Solaris servers. 90% of initiators are NPIV devices spread over 8 N ports per fabric.

All servers are seeing random path drops. Also, when any LPAR or VIOS server reboots, it causes a “storm” where several servers connected to these same eight target ports lose all their paths and crash.

After several iterations of support saves and array log collections, the theory is that there are too many SCNs happening on the target switch ports and the ports can’t keep up.

We are trying to reduce the number of initiators per port by spreading them across more array ports. After that, the vendor will install an FC analyzer to try to capture frames for diagnosing the problem. I am also seeing the CPU spike to 100% on the 8510s where the targets are.

I thought I’d ask the community if they had any suggestions.

Thanks.
Contributor
Posts: 37
Registered: ‎01-19-2018

Re: SCN saturation

My first question: did you set up any kind of zoning?
Contributor
Posts: 55
Registered: ‎05-12-2013

Re: SCN saturation

Every zone is using WWN zoning. Also, the fabrics look clean as far as errors go.
Contributor
Posts: 37
Registered: ‎01-19-2018

Re: SCN saturation

How many members do you have in every zone? What kind of zones? Manually created legacy zones? Manually created peer zones? Auto created target driven zones?
Contributor
Posts: 55
Registered: ‎05-12-2013

Re: SCN saturation

Legacy zoning, manually created, 8 targets per zone. AIX zones have two initiators per zone (one of those initiators is a standby).
Contributor
Posts: 37
Registered: ‎01-19-2018

Re: SCN saturation

Did I get it right that each of your zones contains two AIX HBAs and eight 3PAR ports?

Best practice is to have one-to-one zoning; it was always believed to reduce RSCN/SCN storms.

Now we seem to have better options with peer zoning.

I think you'd better either employ the TDZ built into the 3PAR microcode, or convert your own legacy zones into peer zones.
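To put rough numbers on the zoning options above, here is a minimal sketch that counts how many device pairs a single zone allows to communicate. It assumes the zone described in this thread (two initiators, eight targets) and assumes a peer zone in which the targets are the principal members, so only initiator-to-target pairs are permitted. The function names are made up for illustration and are not part of any Brocade tool.

```python
def legacy_zone_pairs(initiators: int, targets: int) -> int:
    """Pairs allowed in a regular zone, where every member can see every other member."""
    members = initiators + targets
    return members * (members - 1) // 2


def peer_zone_pairs(initiators: int, targets: int) -> int:
    """Pairs allowed in a peer zone, assuming the targets are the principal members.

    Only principal-to-peer (target-to-initiator) pairs are counted.
    """
    return initiators * targets


if __name__ == "__main__":
    # Zone layout from this thread: 2 AIX initiators and 8 array ports per zone.
    print("legacy zone pairs:", legacy_zone_pairs(2, 8))
    print("peer zone pairs:  ", peer_zone_pairs(2, 8))
```

The same 16 initiator-to-target pairs could also be expressed as 16 one-to-one zones; the peer zone is simply a more compact way to avoid the initiator-to-initiator and target-to-target pairs.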
Regular Contributor
Posts: 166
Registered: ‎02-05-2014

Re: SCN saturation

The problem is that it most often works both ways. If an initiator observes a problem, it will propagate to a target at some stage. When that port goes into an error recovery scenario, it will have a flow-on effect on all initiators mapped to it. That is irrespective of how zoning is configured.

 

A fan-in ratio of 170 to one (NPIV or not) is a very, very bad idea.

Kind regards,
Erwin van Londen
Brocade Distinguished Architect
http://www.erwinvanlonden.net The Fibre Channel blog



Q&A -> https://hackhands.com/elo/


-------
Broadcom Moderator
Posts: 87
Registered: ‎03-29-2010

Re: SCN saturation

Any port may be configured to suppress RSCN notification.

 

portcfg rscnsupr --enable  (*two dashes in front of 'enable'*)

 

Use with caution. This does not address the original issue of devices leaving the fabric without explanation. However, it will reduce the RSCN issue. I suggest you resolve the device dropping problem, and the RSCN issue will not cause trouble, unless it is the RSCN traffic which is causing the device to drop?

doc


Regular Contributor
Posts: 166
Registered: ‎02-05-2014

Re: SCN saturation

Which I think is a very, very bad idea. The RSCNs are there for a reason. If you disable this on the array port, then that port has no idea whether one or more initiators or NPIV addresses have left the fabric or whether a zone change has occurred.

 

This in turn means the array has to wait for each outstanding exchange to time out, for at least one R_A_TOV, and it cannot use those resources during that time.

 

Under a normal operating workload thou shalt not exceed a 40:1 fan-in ratio, which I personally think is still a potential killer when the workload is uncontrolled.

Kind regards,
Erwin van Londen
Brocade Distinguished Architect
http://www.erwinvanlonden.net The Fibre Channel blog



Q&A -> https://hackhands.com/elo/


-------
Contributor
Posts: 37
Registered: ‎01-19-2018

Re: SCN saturation

Erwin, 170:8 here is less than 40:1, or how do you calculate the ratio in this case?

I've seen environments where more than 1000 HBAs were zoned against one set of 8 or 12 target ports on a single VNX array, with all the initiators being NPIV ESX hosts from a huge number of blade chassis, and it worked fine.

A 3PAR is a buggy device by itself, so maybe the root cause is not in the SAN?
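The two fan-in figures quoted in this thread (170:1 and 170:8) can both be derived from the same setup, depending on what is counted. The sketch below is one possible reading, not necessarily how either poster calculates it: it assumes the initiators are spread evenly and counts the average number of initiator logins each target port has to serve. The function name and parameters are hypothetical.

```python
def logins_per_target_port(initiators: int, target_ports: int, ports_zoned_per_initiator: int) -> float:
    """Average initiator logins per target port, assuming an even spread."""
    if not 1 <= ports_zoned_per_initiator <= target_ports:
        raise ValueError("each initiator must be zoned to between 1 and target_ports ports")
    return initiators * ports_zoned_per_initiator / target_ports


if __name__ == "__main__":
    # As described in the thread: every initiator is zoned to all 8 target ports,
    # so each port can carry a login from every one of the 170 initiators.
    print(logins_per_target_port(170, 8, 8))

    # Plain division of initiators by ports only matches the per-port load
    # if each initiator is zoned to a single target port.
    print(logins_per_target_port(170, 8, 1))

    # Hypothetical middle ground: each initiator zoned to 2 of the 8 ports.
    print(logins_per_target_port(170, 8, 2))
```

Under these assumptions, zoning every initiator to all eight ports leaves each port with the full 170 logins, which sits above the 40:1 guideline mentioned earlier, while the simple 170/8 division stays below it.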
