on 08-20-201303:00 AM - last edited on 10-28-201309:21 PM by
There has been a great deal of discussion of late of the relative merits of overlay approaches to SDN vs methods in which network hardware has a more active role to play. The arguments go back a year or more, and are fundamentally rooted in varying assumptions about the primacy of the hypervisor in modern data center architectures.
First, a bit of history. In August, 2012, several Nicira founders published a paper exploring the role of a physical network fabric in an SDN architecture. The paper observed that OpenFlow, and more broadly SDN, in its then-current instantiation didn’t actually solve some fundamental networking problems: most notably, it didn’t do anything to make network hardware any simpler, nor remove the dependency of the host on behavior in the network core. The proposed solution was effectively ‘smart edge/fast, dumb core’, though there were also two key observations that blunt the problems with that oversimplification.
Fabric and edge serve different purposes; the fabric should provide fast, reliable transport with as little complexity as possible, while the edge “is responsible for providing more semantically rich services”.
Therefore fabric and edge need to be able to evolve independently of one another.
However, as the term “network virtualization” entered the discussion, the temptation quickly became to discuss overlays as though they are completely analogous to server virtualization. Joe Onisick did an excellent job of unwinding that analogy, and I’d encourage you to read his post in its entirety, as well as the comments. His key point was this:
The act of layering virtual networks over existing infrastructure puts an opaque barrier between the virtual workloads and the operation of the underlying infrastructure. This brings on issues with performance, quality of service (QoS) and network troubleshooting…this limitation is not seen with compute hypervisors which are tightly coupled with the hardware, maintaining visibility at both levels.
In other words, network virtualization *still* does not address the fundamental problem of overly complex, rigid and manual physical infrastructure, nor prevent interdependent (physical-virtual) failure. In fact, it adds complexity in the form of additional layers of networks to manage, even while it simplifies the configuration and deployment of specific network services to specific clients. (In the absence of hardware termination, it is also unavailable as a solution to a broad swath of non-virtualized workloads, a problem which VMware is moving swiftly to address.)
Where does this leave us, then? The earlier articulation of the distinct purposes of the core and edge is important. True fabrics such as VCS explicitly address the need for simpler physical network operations through automation of common routines, as well as a more resilient, low latency and highly available architecture—a fast, simple, highly efficient forwarding mechanism. However, precisely because the physical network needs to be able to operate and evolve independently of what goes on at the software-defined level, fabrics can not be “dumb”. The individual nodes must in fact have sufficient local intelligence as well as environmental awareness to make forwarding decisions both efficiently and automatically. Not all fabrics are architected thus, but VCS fabrics have a shared control plane as well as unusual multipathing capabilities that allow them to function largely independently after initial set-up. There can also be utility in horizontal, fabric-native services that may be different from those deployed at the edge, or which may in some use cases be simpler to deploy natively.
VCS fabrics also maintain visibility to VMs of any flavor, wherever they may reside or move to, as well as mechanisms for maintaining awareness of overlay traffic, restoring the loss of visibility highlighted by Onisick. In addition, the VCS Logical Chassis management construct provides a much simpler means of scaling the host-network interface. Although VCS fabrics are actually masterless, the logical centralization of management allows the Logical Chassis view to serve as a physical network peer to the SDN controller, while providing the SDN controller a means of scaling across many fabric domains (each domain appears as a single switch), vs a plethora of interactions with each individual node. (NB I'm highlighting some of the specifics of VCS fabrics for the sake of concrete illustration, but broadly speaking, similar principles apply to other fabrics.)
Where many disagree with the Nicira stance is in the claim that an ideal network design would involve hardware that is cheap and simple to operate, and “vendor-neutral”, eg easily replaced. I would argue that what matters in terms of network portability is not that hardware needs to be indistinguishable from one vendor to the next—rather, it needs to be able to present vendor neutrality at a policy level. Hardware performance and manageability absolutely continue to matter and remain primary purchasing criteria assuming equivalent support for higher-level policy abstraction.