A colleague recently pointed me to “11 Myths of RoCE.” Previously, RoCE’s version proliferation had put me in mind of a certain boxing movie hexology. But this article’s remarkable assertions brought to mind a more cultish classic, the Rocky Horror franchise, which is many things: a parody, a tribute, a stage musical and a movie, often with live performance art. Its characters are never quite what they seem.
Not all the article’s 11 myths seem like myths (as in, “does anyone really think that?”) but, in fairness, once upon a time debunking 5 or 7 myths was plenty. Now, thanks to another cult parody tribute, you must go up to 11, spawning extra myths to debunk. For the sake of enjoyment, a willing suspension of disbelief comes in handy.
Also, to appreciate a good parody, familiarity with the original will help. I can only recap the species of RoCE I know of:
RoCE, the original. “RDMA over Converged Ethernet” premiered in 2010, produced and directed by the IBTA. A link layer protocol that runs directly over “lossless” Ethernet using PFC (Priority Flow Control), RoCEv1 is definitely not routable. Since iWARP does RDMA over TCP, RoCE is faster. Okay, check.
The first sequel, RoCEv2, a.k.a. “routable RoCE”, debuted in 2014, with UDP/IP added to the cast. V2 adds the optional “Congestion Notification Packet,” or CNP, to exploit the IETF’s Explicit Congestion Notification (ECN) end-to-end congestion signaling scheme. Using CNPs assumes RFC 3168 routers plus sender and receiver algorithms, but these apparently didn’t make the final editing cut.
A 2015 SIGCOMM DCQCN paper (by Microsoft, Mellanox, and UC Santa Barbara) defines some CNP algorithms. It’s not really a sequel (more like a director’s cut reissue with “new unseen footage”?) but arguably a distinct version.
Next up, another Microsoft (Azure) production, also presented in a SIGCOMM paper (that says RDMA needs lossless and PFC). The paper (which weirdly says the C in RoCE is “commodity”) seems pro-RoCE but also lists several RoCE-related problems, including “livelock” and “deadlock”. The unlikely solution involved plenty of special vendor code, plus setting Ethernet’s PFC field from IP’s DSCP field. It’s not a “layer violation”, it’s a “feature!” As an oft-referenced “large network,” it’s arguably a “de facto” standard. One competitor snarks that it’s “RoCEv4.” The IBTA disavows the term, but it’s a bit sticky, or at least tacky.
Well, that’s where I thought things stood as I studied the article… 2 to 4 versions of RoCE, all needing lossless Ethernet for good performance. So I was boggled that the first “myth” of RoCE was that it needs a lossless network. Wait… what? Without lossless, RoCE needs to retransmit, just like iWARP. Then the article soon says that RoCE beats other Ethernet-based RDMA like iWARP. With my “parody bit” still not turned on, I shook my head, and slogged through a dismissal of “deployment difficulties”. As if. Dell EMC’s Erik Smith posted about “the level of complexity required to properly configure it to avoid issues with congestion spreading.” (Erik’s blog isn’t official, but this related video is.) “Interoperability between vendors is unreliable” is another supposed myth, though there’s no standard for CNP algorithms: vendors are free to choose their own, yet somehow they must interoperate?
As I grumbled about these pseudo-myths, I was startled to hear from another colleague about a quiet (art film?) new RoCE production, disavowing PFC. Whoa! Time to extend that earlier recap!
This new “implementation” uses new CNP sender and receiver algorithms (and format?) to enable UDP-based RoCE to do “selective retransmit” (like TCP in RFC 2018). This oxymoronic RoCE runs on vanilla Ethernet, and outperforms RoCE over PFC-based lossless Ethernet.
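For readers wondering why “selective retransmit” matters so much on a lossy link, here is a toy sketch in Python. It is not taken from any RoCE specification or vendor implementation: the function names, the window model (send a full window, then learn what was lost), and the assumption that the alternative is go-back-N style recovery are all mine, purely for illustration.

```python
def go_back_n_sends(total, lost_first_try, window):
    """Toy model: count transmissions when a loss forces a resend
    of the lost packet and everything sent after it (go-back-N).

    Hypothetical simplification: the sender emits a full window,
    then learns about the earliest loss in that window. Each packet
    in lost_first_try is dropped only on its first transmission.
    """
    lost = set(lost_first_try)
    sends = 0
    base = 0  # oldest packet not yet delivered in order
    while base < total:
        end = min(base + window, total)
        sends += end - base  # the whole window goes on the wire
        dropped = sorted(s for s in range(base, end) if s in lost)
        if not dropped:
            base = end  # window delivered cleanly
        else:
            # Every packet in this window has now had its first try,
            # so none of them will be dropped again in this model.
            lost.difference_update(dropped)
            base = dropped[0]  # rewind to the earliest gap
    return sends


def selective_repeat_sends(total, lost_first_try):
    """Toy model: only the dropped packets are sent a second time."""
    return total + len(set(lost_first_try))


if __name__ == "__main__":
    losses = {10, 500}
    print(go_back_n_sends(1000, losses, window=100))
    print(selective_repeat_sends(1000, losses))
```

In this model, two drops in a 1,000-packet transfer cost go-back-N 100 extra transmissions with a 100-packet window, while selective retransmit pays for exactly two. Real hardware is messier, but the gap grows with window size and loss rate, which is the intuition behind pairing best-effort Ethernet with selective retransmit.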
Darn, they were right! RoCE has morphed again, and its need for lossless Ethernet is, bizarrely, now a myth. And I’d bought the myth! Hilarious! What a knee slapper!
In self-defense, the acronym itself says “converged Ethernet,” an old synonym for DCB, which uses PFC. This latest non-PFC RoCE is clearly a new version. I shall call him RoCEv5. (Side note: I asked a Broadcom contact, who told me that their RNICs cannot, ahem, interoperate with this new mode.) A brief Inigo Montoya moment is understandable among observers. What, exactly, does “RoCE” mean?
The “myth buster” article says:
“RoCE” started in 2010: v1, directly on converged PFC Ethernet and can’t scale
“RoCE” has been deployed at scale: v3 (or v4) emerged in 2015 (?), needs PFC Ethernet
“RoCE” doesn’t need lossless Ethernet: v5, described in 2017, not (yet) deployed at scale
But each bullet is only true for one version, and they are all different! The ambiguous language glosses over RoCE’s lack of a stable, well-specified version and adds to confusion about the protocol.
In all honesty, this new PFC-free version is actually a good thing. I hope that the RoCE (re-)inventors can get the word out and make v5 sit still. Maybe they can even write a fully specified, interoperable standard and give it a less oxymoronic name!
It is worth recognizing, though, that when the RoCE crowd cries U.N.C.L.E. on lossless Ethernet, they are staking a claim for pretty good performance on non-deterministic, best-effort infrastructure. It’ll work great much of the time, but now and then it won’t work as well. That’s good stuff for a number of non-mission-critical applications. Mission-critical enterprise storage is just not one of those applications.