When I told my son, a recent Ohio State graduate, that I was writing a blog on RoCE this month, he asked me when I had started doing movie reviews. It took me a while to figure out what he meant: apparently he was thinking of the recent Rocky movie “Creed”. I got a chuckle and asked myself whether I should have said I was writing an article on RDMA over Converged Ethernet. Well, that might not have been enough either, since RDMA is itself another acronym. According to my daughter, a college-age English Writing major, what we really have here with RoCE is a recursive acronym that also happens to be a homophone. Great, and here I thought we had a great technology that made for an interesting subject for this month’s blog. Now that we have all of that out of the way and I have a headache, let’s talk about some technology.
What Are RDMA, RoCE and SMC-R?
Remote Direct Memory Access (RDMA) is a remote memory management capability that allows server-to-server data movement directly between application memory spaces with minimal CPU involvement. RDMA allows a host to write to or read from a remote host’s memory without involving the remote host’s CPU and operating system. Because the data transfer bypasses the operating system and communications protocol layers that are otherwise required for communication between applications, RDMA requires a specialized network adapter. RDMA is a critical technology at the heart of many of the world’s fastest supercomputers as well as some of the world’s largest data centers. It was originally adopted with InfiniBand but has since been widely adopted over Ethernet networks as RDMA over Converged Ethernet, aka RoCE (pronounced “rocky”). RoCE features have been available for z Systems since the zEC12.
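To make the “without involving the remote host’s CPU” idea a little more concrete, here is a toy, in-process Python model of a one-sided RDMA write. It is a conceptual sketch only: Adapter, MemoryRegion, register_region and rdma_write are hypothetical names, not the ibverbs API or any z/OS or RoCE Express interface. The target registers a buffer in advance and shares a remote key (rkey) with its peer; after that, a write is validated and placed into the buffer by the adapter alone, and no target application code runs.

```python
# Toy model of a one-sided RDMA WRITE. Adapter, MemoryRegion,
# register_region and rdma_write are hypothetical names chosen for this
# sketch; they are not the ibverbs API or any z/OS or RoCE Express interface.
import secrets
from dataclasses import dataclass, field


@dataclass
class MemoryRegion:
    """A buffer the target application registered with its adapter in advance."""
    buf: bytearray
    rkey: int  # remote key the target hands to its peer during setup


@dataclass
class Adapter:
    """Stands in for the RDMA-capable network adapter on the target host."""
    regions: dict = field(default_factory=dict)

    def register_region(self, size: int) -> MemoryRegion:
        # Setup step: the only point where target-side software is involved.
        mr = MemoryRegion(bytearray(size), secrets.randbits(32))
        self.regions[mr.rkey] = mr
        return mr

    def rdma_write(self, rkey: int, offset: int, data: bytes) -> None:
        # Models a WRITE arriving from the initiator: the adapter checks the
        # key and bounds, then copies bytes straight into the registered
        # buffer. No target application code or OS callback runs here.
        mr = self.regions.get(rkey)
        if mr is None:
            raise PermissionError("unknown rkey")
        if offset < 0 or offset + len(data) > len(mr.buf):
            raise IndexError("write outside the registered region")
        mr.buf[offset:offset + len(data)] = data


if __name__ == "__main__":
    target_nic = Adapter()
    region = target_nic.register_region(16)
    # The initiator only needs the rkey that was exchanged at setup time.
    target_nic.rdma_write(region.rkey, 4, b"hello")
    print(bytes(region.buf[4:9]))  # b'hello'
```

The key and bounds checks stand in for the memory protection a real adapter enforces; everything else (queue pairs, completions, the wire protocol) is deliberately left out.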
RoCE uses the highly efficient mechanisms in RDMA to lower CPU overhead and increase mainstream data center application performance at 10GigE link speeds and beyond. RoCE provides low-latency, high-bandwidth, high-throughput data transfer between hosts with low processor utilization, while taking advantage of existing Ethernet networks. Ideally, IBM z Systems RoCE implementations will take advantage of the newer Converged Enhanced Ethernet (CEE) and Ethernet Fabrics capabilities to optimize performance and availability.
Shared Memory Communication over RDMA (SMC-R) is a sockets-over-RDMA communication protocol that allows existing sockets applications that use TCP to benefit transparently from RDMA when exchanging data over a RoCE network. Transparent here means that no application changes are required to implement SMC-R. SMC-R provides host-to-host direct memory access without the traditional TCP/IP processing overhead, offers high availability and load balancing when redundant network paths are available, and sets up RDMA connections over RoCE fabrics dynamically. Support for SMC-R is included in z/OS V2R1 and requires the 10GbE RoCE Express feature.
For those of you keeping track of the acronyms, when SMC-R is used with RoCE, we have a doubly recursive acronym (SMC over RDMA over Converged Ethernet).
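SMC-R’s transparency to sockets applications is easiest to see in what an application does not contain. The sketch below is an ordinary TCP request/reply exchange using Python’s standard socket module; the language choice is ours, and a COBOL, C or Java sockets program makes the same point. Nothing in it refers to RDMA or SMC-R: on z/OS, whether a given TCP connection is carried over SMC-R is decided by the TCP/IP stacks at connection setup, based on their configuration and the peer’s capabilities, not by application code. Run here over loopback, it of course uses plain TCP; serve_once and request are hypothetical helper names.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives, upper-cased."""
    conn, _ = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def request(host: str, port: int, payload: bytes) -> bytes:
    """A plain TCP client: connect, send, receive. No SMC-specific calls."""
    with socket.create_connection((host, port)) as s:
        s.sendall(payload)
        return s.recv(1024)


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()
    print(request("127.0.0.1", port, b"ping"))  # b'PING'
    server.join()
    listener.close()
```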
What is the Value of SMC-R and RoCE?
The value of using RDMA technology in your z Systems environment lies in significantly improved network performance for CPC-to-CPC communications. The specific attributes behind this improvement are latency, throughput and scalability. Better network performance can potentially raise application workload transaction rates while reducing your CPU costs.
With IBM z13, the combination of the 10GbE RoCE Express feature and the SMC-R protocol provides significant improvements in network performance. When combined with an Ethernet Fabrics-based 10GbE switching network that does not rely on Spanning Tree Protocol (STP), the performance benefits are even greater. The network latency characteristics of this combined solution are highly compelling: reduced network round-trip times translate into an improved overall application transaction rate for z/OS-to-z/OS workloads. One example of a workload that benefits from this technology is a WebSphere Application Server communicating with a remote database server such as DB2. Another example is CICS-to-CICS communication with IPIC (IP interconnectivity). In this CICS example, IBM benchmarking of SMC-R versus standard TCP/IP showed a 48% reduction in response time and up to 10% CPU savings.
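What a 48% response-time reduction means for throughput depends on the workload. As a back-of-the-envelope illustration only, assume a single client that waits for each reply before sending its next request, so its transaction rate is simply 1 / response time. Under that assumption a 48% reduction leaves 52% of the original response time, and the per-client rate rises by a factor of 1 / 0.52. The 1 ms baseline below is hypothetical, not a benchmark figure.

```python
def serial_rate_gain(response_time_reduction: float) -> float:
    """Throughput multiplier for a client that waits for each reply before
    sending the next request (rate = 1 / response time)."""
    if not 0 <= response_time_reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - response_time_reduction)


if __name__ == "__main__":
    baseline_ms = 1.0  # hypothetical baseline response time
    reduced_ms = baseline_ms * (1 - 0.48)
    print(f"{reduced_ms:.2f} ms")  # 0.52 ms
    print(f"{serial_rate_gain(0.48):.2f}x")  # 1.92x
```

Real workloads with many concurrent requests, or where most of the response time is spent outside the network, will see a different (usually smaller) transaction-rate gain.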
Sharing the RoCE Express Features Between Multiple LPARs
Based on this preliminary discussion, it does appear that RoCE and SMC-R provide substantial benefits for IBM z Systems end users. One frequently asked question is whether the 10GbE RoCE Express features can be shared between multiple LPARs. The answer depends on whether you have upgraded from the zEC12 (or zBC12) to the z13. On the zEC12 and zBC12, each 10GbE RoCE Express feature can be used by only a single LPAR. However, z/OS does support sharing a feature among multiple TCP/IP stacks within the same z/OS instance.
On the other hand, z13 provides RoCE Express virtualization using Single Root I/O Virtualization (SR-IOV). This allows up to 31 OS instances (LPARs or 2nd level guests) to share the same RoCE Express feature. So, yet another reason to move from your older mainframe to z13!
Brocade Ethernet Fabrics and RoCE
An Ethernet fabric network is a network that is aware of all its paths, nodes, requirements and resources. Ethernet fabrics can automatically manage themselves to scale up or down, depending on demand. They also eliminate the need for the challenging and comparatively less efficient Spanning Tree Protocol (STP), along with the blocked redundant links it leaves idle. Compared to classic hierarchical Ethernet architectures, Ethernet fabrics provide the higher levels of performance, utilization, availability and simplicity that data centers need today and into the future. Ethernet fabric systems can also be integrated with pre-existing networks.
In an Ethernet fabric, the control path replaces STP with link-state routing, while the data path provides equal-cost multipath forwarding at Layer 2, so data always takes the shortest path across multiple interswitch link (ISL) connections without loops. Together with the fabric’s control plane, this makes scaling bandwidth simple. For example, the formation of a new trunk can be automated when a new switch connects to any other switch in the fabric. If a trunk link fails or is removed, traffic is rebalanced across the remaining links non-disruptively. Finally, if an ISL is added or removed anywhere in the fabric, traffic on other ISLs continues to flow instead of halting as with STP.
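Here is a generic, simplified Python sketch of that forwarding idea, assuming every ISL has equal cost. Each switch knows the full topology (which link-state routing provides), computes hop counts to the destination, treats every neighbor that lies on a shortest path as an equal-cost next hop, and hashes each flow onto one of them so a given flow’s frames stay in order while different flows spread across all active links. Removing a link and recomputing shows traffic moving to the surviving path. This is a textbook illustration, not Brocade’s VCS implementation; the switch names and functions are hypothetical.

```python
from __future__ import annotations

import zlib
from collections import deque


def hop_distances(adj: dict[str, set[str]], dst: str) -> dict[str, int]:
    """BFS outward from dst. With every link costing 1, this yields each
    switch's shortest-path hop count to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def equal_cost_next_hops(adj: dict[str, set[str]], src: str, dst: str) -> list[str]:
    """All neighbors of src that lie on some shortest path to dst."""
    dist = hop_distances(adj, dst)
    if src not in dist:
        return []  # dst unreachable from src
    return sorted(n for n in adj[src] if dist.get(n) == dist[src] - 1)


def pick_next_hop(adj: dict[str, set[str]], src: str, dst: str, flow: str) -> str | None:
    """Hash a flow identifier onto one equal-cost next hop: the same flow
    always takes the same hop, while different flows spread across hops."""
    hops = equal_cost_next_hops(adj, src, dst)
    if not hops:
        return None
    return hops[zlib.crc32(flow.encode()) % len(hops)]


def remove_link(adj: dict[str, set[str]], a: str, b: str) -> None:
    """Model an ISL failing or being removed."""
    adj[a].discard(b)
    adj[b].discard(a)


if __name__ == "__main__":
    # Edge switches A and D, each connected to two core switches B and C.
    fabric = {"A": {"B", "C"}, "B": {"A", "D"}, "C": {"A", "D"}, "D": {"B", "C"}}
    print(equal_cost_next_hops(fabric, "A", "D"))  # ['B', 'C']
    remove_link(fabric, "A", "B")
    print(equal_cost_next_hops(fabric, "A", "D"))  # ['C']
```

Unlike STP, which would block one of the two A-to-D paths outright, both paths carry traffic until a link is removed, and the recomputation simply narrows the set of next hops.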
With an Ethernet fabric architecture in place, a group of switches can be defined as part of a “logical chassis,” similar to port cards in a chassis switch. This simplifies management, monitoring and operations, since policy and security configuration parameters can be easily shared across all switches in the logical chassis. In addition, because information about connections to physical and virtual servers and storage is known to all switches in the fabric, the fabric can ensure that all network policies and security settings continue to apply to a given virtual machine regardless of where it resides or whether it moves.
A Brocade Ethernet Fabrics-based network delivers wire-speed performance and high network resiliency. Ethernet fabrics are also efficient: all paths between switches are fully active, and traffic is continuously routed over the most efficient path. The network topology is scalable, flexible and dynamic, changing quickly as the needs of the business change. And, if appropriate to the application, IP and storage traffic can be converged over a common network infrastructure, further reducing cost. Network administration is simplified since all switches in the fabric can be managed as a single entity or individually as needed. Lastly, Ethernet fabrics are self-forming and self-aggregating: when an administrator adds a switch to the fabric, ISLs are automatically configured and aggregated.
I have written a few magazine articles recently that discussed Ethernet Fabrics and their advantages for connectivity in a z Systems environment when high performance and high availability are a concern. One such use case is 10GigE connectivity between a z13 OSA Express 3/4/5 channel and the IBM DB2 Analytics Accelerator for z/OS (IDAA). RoCE network connectivity is an even more compelling use case for implementing Ethernet fabrics in your z Systems environment. Brocade is a leader in Ethernet Fabrics technology with our VDX family of switches.
Once you figure out what all the acronyms are about, it is readily apparent that RoCE and SMC-R technology is a great fit for the modern IBM z Systems mainframe. When combined with a Brocade Ethernet Fabric, it provides network performance benefits along with improved CPU utilization that can lead to reduced operational costs. IBM really hit a home run by bringing this technology over from the world of supercomputing to the data center and z Systems. If you are a z13 mainframe shop, or are considering moving to z13, you should take a look at RoCE and the Brocade VDX switches.