I recently returned from IETF 87 and would like to provide a brief synopsis of the event. First, who would have expected the weather to be that hot in Berlin? It hit 96F on Sunday, the first day of the event!
Overall, it was a very well attended event with a very relevant agenda. My first point to make is that the SDN trend continues to gain interest and acceptance. If you’ve been following IETF activity over the last 12 – 18 months or so, you know that this trend started kind of slow but is gaining more traction at each of the IETF events. The future meeting schedule can be found here, and if you’d like a quick review of the IETF 86 event, you can go here.
The NVO3 WG and SDN RG meetings continue to draw the most relevance to the SDN problem space. However, the L3VPN WG and to a lesser extent, the I2RS & PCE WGs, also have correlated activities worth following if SDN is of interest to you.
So, here are the brief updates, organized by WG/RG.
The SDN Research Group kicked off the Monday morning sessions. This RG is chaired by our Brocade SP CTO, Dave Meyer. Dave started the session with a thought provoking presentation that set the bar early. The session included some other quite interesting presentations; such as an SDN-enabled IXP, a presentation on how NFV fits with other IETF SDN activities, and how I2RS and SDN are related (or not).
This was a *very* well attended session, as it was the last time. With the advent of cloud computing and private/public/hybrid data-center clouds, this is where most of that activity is taking place in terms of defining the data-center virtualization problem space, its requirements and eventually the solution space.
The NVO3 architecture team provided an update on their activities. Brocade Principal Engineer, Jon Hudson, is part of this architecture team. This was quite an interactive session. They recommended promoting the current “as-is” VXLAN and NVGRE IDs to Informational RFC status. This is primarily to document these technologies since they are already implemented and deployed. In addition, other IETF activities relate back to these drafts, so as Informational RFCs they can be formally referenced.
The NVO3 reference architecture that is being developed will identify the key system components and describe how they fit together. This architecture phase will then drive the requirements definition work, and that will then feed into a gap analysis. Some key terms are being defined, such as NVE (Network Virtualization Edge) and NVA (Network Virtualization Authority). Some critical ‘on-the-wire’ protocols are being fleshed out, such as the NVE to NVA protocol and the Inter-NVA protocol. Finally, consensus was reached on the need for both a push and a pull model of control plane distribution.
The L3VPN sessions are always worth attending, and more recently so since this WG has become closely correlated to the NVO3 activities. Besides the usual discussions of MPLS-related internet drafts and technologies, this session included a very active discussion on how the activities in this WG remain aligned (or not) with the NVO3 activities. This discussion was very much needed, since the last few L3VPN WG sessions included some internet drafts that overlap in some sense with the NVO3 charter. Most of this discussion was around which problem areas should remain in the L3VPN WG and which problem areas should perhaps move to the NVO3 WG. As an example, there was a good amount of consensus that technologies “inside” the DC should belong in the NVO3 WG. However, technologies around “inter-DC” solutions should perhaps remain in this WG. Well, this makes a large assumption that the inter-DC solutions and technologies are based on L3VPN solutions. L3VPN solutions could clearly be one answer to the inter-DC problem, but it’s not the only answer. That’s where it gets fuzzy.
I think one way to clarify this is that if the applicability statement in an ID includes intra-DC problems, then it should be part of the NVO3 WG. If protocol extensions to L3VPN solutions are needed for a particular solution, then clearly that work must be done in the L3VPN WG. But if an ID touches on inter-DC problems, then perhaps it needs to be presented in both WGs, at least until it gets further fleshed out. Clear as mud yet?
Another WG that continues to generate a fair amount of activity and interest is PCE. This is also an area of IETF work that is somewhat related to the SDN solution space. This WG is focused on how to enhance traffic-engineering decisions in MPLS networks.
I had two key takeaways from this session. One is that this WG has quickly moved from working on solutions that only “recommend” traffic-engineered LSPs to the network to also now including solutions that actually “instantiate” those LSPs into the network. In other words, the solutions being discussed here include both the centralized TE control-plane and the actual distributed data-plane. The other important takeaway is that there is more activity around providing PCE redundancy; that is, how to provide multiple PCE databases for the network and how to keep them synchronized. This is a hard problem to solve and in some sense can help flesh out the entire “logically centralized” notion in SDN.
The session ended with an interactive discussion that I thought was quite interesting; well actually, almost amusing. The topic was about whether Auto-BW mechanisms should be pushed back into the network nodes so they can each dynamically adjust their LSP BW. Recall that the primary goal of PCE is to logically centralize the traffic-engineering decisions. It’s all about having a central holistic view of network traffic loads in order to fully optimize the entire network. So, this discussion was about whether it makes sense to then allow each network node to adjust its LSP bandwidth, using auto-BW mechanisms, in a distributed way. Doesn’t this sound counter to the goal of PCE? So, is it centralized or is it distributed? I hope you see why I thought this entire discussion was somewhat amusing.
I2RS was also very well represented and generated lots of interesting dialogue. I believe this was only the second time this WG met, so the activity here is very early in its definition. The architecture team started off discussing the high-level architecture and a policy framework. All this discussion is centered on the Routing Information Bases (RIBs) of a network node; for example, are multiple RIBs needed? What mechanisms are needed to inject state into the RIB(s)? What mechanisms are needed to extract state from the RIB(s)?
The I2RS activity focuses on the southbound problem space; in other words, the interfaces and protocols needed from the controller or client down to the network node. It does not focus on the northbound interfaces or applications that live on top of the controller.
There was also a question about whether I2RS protocol extensions are being developed in private; or outside the knowledge of this WG. There was a request to encourage people to share those discussions and potential experiments with the general WG to spur discussion.
An interesting I2RS service chaining use case was discussed that is being co-authored by Brocade Principal Architect, Ramki Krishnan.
I dropped into the FORCES session to see how this WG is progressing. This WG has been around for many years but it never really gained much attention in terms of implementation or deployment. Now that SDN is here to stay, my sense is that this WG is trying to re-emerge to become relevant to that conversation.
To that end they have a new charter and some of the terminology being used in this WG is more aligned with the SDN problem space. For example, there are drafts that discuss Virtual Control Element (VCE) nodes and Virtual Forwarding Element (VFE) nodes.
The OPSA WG continued the discussion around a draft presented by Brocade Principal Architect, Ramki Krishnan, on mechanisms for optimal LAG/ECMP link utilization. This capability is important in not only service provider networks but also in research and education networks. Researchers are often the sources for very large IP flow dissemination and the ability to properly load-balance these large flows on LAG bundles is becoming increasingly important. I also presented this draft, on behalf of Ramki, at the recent ESnet Site Coordinators Committee (ESCC) Meeting at the Lawrence Berkeley National Labs and it was very well received by this group.
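To illustrate why large flows are a problem for LAG bundles, here is a minimal Python sketch. It is not the mechanism from the draft, whose details are not covered in this post; it only contrasts static hashing with a simple "place large flows on the least-loaded member" policy. The flow names, rates, and the threshold are hypothetical values chosen for the example.

```python
import zlib

def hash_member(flow, num_links):
    """Pick a LAG member from a hash of the flow's 5-tuple (static hashing)."""
    key = "|".join(str(field) for field in flow["five_tuple"]).encode()
    return zlib.crc32(key) % num_links

def place_flows(flows, num_links, large_flow_threshold_mbps):
    """Hash small flows, then put each large flow on the least-loaded member."""
    load = [0.0] * num_links
    placement = {}
    small = [f for f in flows if f["rate_mbps"] < large_flow_threshold_mbps]
    large = [f for f in flows if f["rate_mbps"] >= large_flow_threshold_mbps]
    for flow in small:
        link = hash_member(flow, num_links)
        load[link] += flow["rate_mbps"]
        placement[flow["name"]] = link
    # Handle the biggest flows first so they spread across members.
    for flow in sorted(large, key=lambda f: f["rate_mbps"], reverse=True):
        link = load.index(min(load))
        load[link] += flow["rate_mbps"]
        placement[flow["name"]] = link
    return placement, load

# Hypothetical traffic mix: four small flows and two large research transfers.
flows = [
    {"name": f"mouse-{i}", "rate_mbps": 1,
     "five_tuple": ("10.0.0.1", "10.0.1.1", 6, 40000 + i, 443)}
    for i in range(4)
] + [
    {"name": "elephant-a", "rate_mbps": 5000,
     "five_tuple": ("10.0.0.2", "10.0.1.2", 6, 50000, 2811)},
    {"name": "elephant-b", "rate_mbps": 5000,
     "five_tuple": ("10.0.0.3", "10.0.1.3", 6, 50001, 2811)},
]

placement, load = place_flows(flows, num_links=2, large_flow_threshold_mbps=1000)
print(placement)
print(load)
```

With pure hashing, both large flows could land on the same member and saturate it while the other sits nearly idle. Treating large flows separately keeps the two members within a few Mbps of each other in this example.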
Regarding the various Routing WGs, here is a short recap.
An interesting presentation in the GROW WG was on a use case of using route reflectors for traffic steering. This idea is not new but it does provide an additional data point on how service providers desire enhanced capabilities to influence traffic patterns in their networks.
Similar to the last IETF, there was more discussion in the IDR WG about the ability to distribute link-state and TE information northbound, using extensions to BGP. This type of capability would allow higher layer applications to make more intelligent traffic-engineering decisions. Kind of sounds SDN-like, doesn’t it?
So, that wraps up this short update on the IETF 87 SP related activities. Please let me know if you have any comments or questions.
As every service provider already knows, performance and reliability are always high priorities for enterprise IT. Add cloud computing to the mix, and their importance multiplies. In this week’s blog, I continue discussing the WaveLength Market Analytics study on enterprise infrastructure service needs. Specifically, I address the market opportunity for server load balancing services in terms of both size and demand, as well as service packaging and pricing.
Like the market for IPv6 translation services, the market for server load balancing (SLB)-as-a-service is very large. When asked about their top three IT management priorities, the graph above shows that more than half of enterprise IT survey respondents chose improving user experience, service quality, and reliability as the top IT management priority. High performance and reliability requirements drive demand for server load balancing solutions, so if half of America’s 18,000 large organizations outsource just one application’s server load-balancing, it’s a good-sized market.
Of course, server load balancing is not a new concept. Enterprises have been using the technology for more than a decade and as the table below shows, they use a mix of server load balancing solutions. Most organizations use an internally managed solution and about half already use a service provider for some type of server load balancing. Forty five percent of both the medium and large enterprises outsource using dedicated load balancers from a service provider. Although less mature, even 38% of medium and 20% of large enterprises say they outsource to a service provider using shared load balancers. As public and hybrid cloud acceptance grows, this shared load balancing service is expected to grow along with them.
Where are enterprises on their willingness to buy a new variant, server load balancing-as-a-service? What value-added features would they be willing to buy and what would they be willing to spend on them? Among the three new infrastructure services, IPv6 translation, storage area network extension, and server load balancing, server load balancing has the highest percentage of respondents who are willing to pay for services. At 86%, large enterprise respondents are more likely to be willing to pay for server load-balancing services than for the other two infrastructure services discussed in my previous blogs, IPv6 translation and SAN extension.
As the above table shows, server load balancing services offer significant opportunity for differentiated services or to sell additional value-added services. Nearly three-quarters of large enterprises are willing to pay for all SLB capabilities included in the study. A service provider can create a service for global load balancing between two data centers as a stand-alone service and about 77% of large enterprises said they’d be willing to pay for it. A service provider can also earn extra revenue by packaging device redundancy for highly available SLB-as-a-service. This value-add can entice 45% of medium and 78% of large enterprises to part with some dollars.
So how much extra will enterprises likely pay for SLB value-added features? We asked how much extra they would pay, in the form of a premium over the base service fees for SLB-as-a-service, for 3 specific value-adds. On average, large enterprises are willing to pay an additional 27% for IPv6 migration support and for SSL offload, and an additional 31% for device redundancy. The more budget-constrained medium-sized enterprise segment reports they are willing to pay approximately an additional 25% for each of the three value-adds.
The enterprise IT imperative to improve user experience, service quality and reliability, along with increasing cloud apps, creates a large and growing service provider opportunity. With its many value-added features, SLB-as-a-service is potentially lucrative; it is easily differentiated, and average revenue per user (ARPU) increases with each a la carte value-added feature. Service providers should invest in their SLB-as-a-service portfolio today to take those new SLB revenue streams to the bank.
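To make those premiums concrete, here is a small Python sketch of the arithmetic for the large enterprise segment. It assumes each premium is applied to the base fee and that premiums simply add up when several value-adds are bought together; the survey figures do not say how combinations would be priced, and the base fee of 1000 is a hypothetical number.

```python
# Average premiums (percent over base fee) reported by large enterprises.
LARGE_ENTERPRISE_PREMIUM_PCT = {
    "ipv6_migration_support": 27,
    "ssl_offload": 27,
    "device_redundancy": 31,
}

def monthly_fee(base_fee, selected_addons, premium_pct=LARGE_ENTERPRISE_PREMIUM_PCT):
    """Base fee plus an additive premium for each selected value-add (assumption)."""
    total_pct = 100 + sum(premium_pct[name] for name in selected_addons)
    return base_fee * total_pct / 100

# Hypothetical base fee of 1000 per month with two value-adds selected.
print(monthly_fee(1000, ["ssl_offload", "device_redundancy"]))  # 1580.0
```

Under these assumptions, a customer taking all three value-adds would pay 85% more than the base service.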
Multi-Chassis Trunking (MCT) is a key Brocade technology that helps network operators build scalable and resilient networks, and we are continuing to add more enhancements to MCT that provide advanced redundancy. MCT-aware routers appear as a single logical router that is part of a single link aggregation trunk interface to connected devices. While standard LAG provides link- and module-level protection, MCT adds node-level protection, and provides sub-200 ms link failover times. It works with existing devices that connect to MCT-aware routers and does not require any changes to the existing infrastructure. Pete Moyer wrote about the multicast over MCT features we added in NetIron software release 05.4.00 in his blog earlier this year, and we recently released version 05.5.00 with support for Layer 3 dynamic routing over MCT which is what I want to write about today. Together these two enhancements give network operators the ability to deploy MCT Layer 3 active-active or active-passive redundancy at the network edge or border for IP unicast and multicast.
Layer 3 Routing Over MCT Highlights
Here’s how it works in a nutshell, using OSPF as a quick example. The CCEP and CEP in the diagram are on different Layer 3 networks and the routers run OSPF as the IGP. The MCT CCEPs can be configured as active in OSPF so that they establish an OSPF adjacency with the connected device. Or they can be configured as passive in the IGP so that the interface is advertised via OSPF, and a static route is configured on the connected device.
Layer 3 traffic that is sent to the network connected to the CEP is load-balanced over the two MCT routers by the connected device, assuming an active-active configuration. If one of the CCEPs or MCT routers fails, Layer 3 traffic will still be forwarded over the MCT Inter-Chassis Link (ICL) to the remaining MCT router. I left out some details for simplicity to illustrate the functionality; obviously you’d want redundant connections via CEPs on each MCT router in order to provide full redundancy.
For more information on Layer 3 routing with MCT, refer to the “MCT L3 Protocols” section in the MCT chapter in the Multi-Service Switching Configuration Guide. If you need more information on MCT then we have two technical white papers that you can read. “Multi-Chassis Trunking for Resilient and High-Performance Network Architectures” provides an overview of the technology, and “Implementing Multi-Chassis Trunking on Brocade NetIron Platforms” has design and deployment details.
sFlow is a very interesting technology that often gets overlooked in terms of network management, operations and performance. That’s a shame; as it can be a very powerful tool in the network operator’s tool-kit. In this brief blog, I hope to shed some light on the appealing advantages of sFlow. To start with - if you are a network operator and you are not gathering network statistics from sFlow, I hope you will carefully read this blog!
While the title of this blog says enhancements to sFlow, I’d like to focus a good portion of this piece on sFlow itself and explain why it’s not just useful, but why it should be considered a necessary component of any overall network architecture. I’ll also point out some differences between sFlow, NetFlow and IPFIX (since I frequently get asked about these when I talk about sFlow with customers).
sFlow was originally developed by InMon and has been published in Informational RFC 3176. In a nutshell, sFlow is the leading multi-vendor standard for monitoring high-speed switched and routed networks. Additional information can be found at sFlow.org.
sFlow relies on sampling; which enables it to scale to the highest speed interfaces, such as 100GbE. It provides very powerful statistics and this data can be aggregated into very edifying graphs. Here is a pretty cool animation describing sFlow in operation. sFlow provides enhanced network visibility & traffic analysis; can contribute relevant data to an overall network security solution; and can be used for SLA verification, accounting and billing purposes. sFlow has been implemented in network switches & routers for many years and is now often implemented in end hosts.
Here are some publicly available sFlow generated graphs from AMS-IX (the sFlow samples are taken from Brocade NetIron routers).
Here is another simple output from sFlow, showing the top talkers in a specific IP subnet.
Short Comparison of sFlow, Netflow and IPFIX
While sFlow was explicitly invented as an open standards based protocol for network monitoring, Netflow was originally developed to accelerate IP routing functionality in Cisco routers (it remains proprietary to Cisco). The technology was subsequently modified to support network monitoring functions instead of providing accelerated IP routing; however, it can exhibit performance problems on high-speed interfaces. Furthermore, sFlow can provide visibility and network statistics from L2 – L7 of the network stack, while Netflow is predominantly used for L3 – L4 (there is now limited L2 support in Netflow but there is still no MPLS support).
Another key difference between the two protocols is that sFlow is a packet sampling technology, while Netflow attempts to capture entire flows. Attempting to capture an entire flow often leads to performance problems on high-speed interfaces, that is, interfaces of 10GbE and beyond.
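To show how packet sampling still yields useful statistics, here is a minimal Python sketch of what a collector might do with sampled packet headers. It is not a real sFlow decoder: the sample tuples, the 1-in-2048 sampling rate, and the function names are assumptions made up for the example.

```python
from collections import Counter

def estimate_bytes_per_source(samples, sampling_rate):
    """Scale sampled packet sizes up to estimated totals per source.

    samples: iterable of (source_ip, packet_length_bytes), one per sampled packet.
    sampling_rate: N in "1-in-N" packet sampling (assumed configuration value).
    """
    totals = Counter()
    for source_ip, length in samples:
        # Each sampled packet stands in for roughly `sampling_rate` packets.
        totals[source_ip] += length * sampling_rate
    return dict(totals)

def top_talkers(samples, sampling_rate, count=3):
    """Return the sources with the largest estimated byte counts."""
    estimates = estimate_bytes_per_source(samples, sampling_rate)
    return sorted(estimates.items(), key=lambda item: item[1], reverse=True)[:count]

# Hypothetical samples as a collector might see them after decoding.
samples = [
    ("10.0.0.1", 1500),
    ("10.0.0.2", 64),
    ("10.0.0.1", 1500),
    ("10.0.0.3", 512),
]
print(top_talkers(samples, sampling_rate=2048))
```

The device only has to export one packet header in every N, which is why the approach scales to very fast interfaces. The estimates become statistically more accurate as more samples accumulate.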
IPFIX is an IETF standards based protocol for extracting IP flow information from routers. It was derived from Netflow (specifically, Version 9 of Netflow). IPFIX is standardized in RFC 5101, 5102, and 5103. As its name correctly implies, IPFIX remains specific to L3 of the network stack. It is not as widely implemented in networking gear as sFlow is.
sFlow and OpenFlow?
There is some recent activity around integrating sFlow with OpenFlow to provide some unique “performance aware” SDN applications. For example, take a look at this diagram:
[Diagram referenced from here.]
In this example, sFlow is used to provide the real-time network performance characteristics to the SDN application running on top of an OpenFlow controller, and OpenFlow is used to re-program the forwarding paths to more efficiently utilize the available infrastructure. Pretty slick, huh? This example uses sFlow-RT, a real-time analytics engine, in place of a normal sFlow collector.
NetIron sFlow Implementation Enhancements
Brocade devices have been implementing sFlow in hardware for many years. This hardware based implementation provides key advantages in terms of performance. The sampling rate is configurable and sFlow provides packet header information for ingress and egress interfaces. sFlow can provide visibility in the default VRF and non-default VRFs. NetIron devices support sFlow v5, which replaces the version outlined in RFC 3176.
In addition to the standard rate-based sampling capability, NetIron devices are capable of using an IPv4 or IPv6 ACL to select which traffic is to be sampled and sent to the sFlow collector. This capability provides more of a flow-based sampling option, rather than just sampling packets based on a specified rate. In addition to sampling L2 and L3 information, sFlow can be configured to sample VPN endpoint interfaces to provide MPLS visibility. Neither Netflow nor IPFIX can provide this type of visibility.
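The idea of ACL-based selection can be sketched in a few lines of Python. This is only a conceptual illustration, not NetIron configuration syntax or a description of how the hardware implements it; the ACL entries and packets below use documentation address ranges and are hypothetical.

```python
import ipaddress

def acl_permits(packet, acl):
    """Return True if the packet's source and destination match any ACL entry."""
    src = ipaddress.ip_address(packet["src"])
    dst = ipaddress.ip_address(packet["dst"])
    # An address never matches a network of the other IP version.
    return any(src in entry["src"] and dst in entry["dst"] for entry in acl)

def select_for_sampling(packets, acl):
    """Keep only the packets that the ACL selects for export to the collector."""
    return [packet for packet in packets if acl_permits(packet, acl)]

# Hypothetical ACL: one IPv4 entry and one IPv6 entry.
acl = [
    {"src": ipaddress.ip_network("192.0.2.0/24"),
     "dst": ipaddress.ip_network("0.0.0.0/0")},
    {"src": ipaddress.ip_network("2001:db8::/32"),
     "dst": ipaddress.ip_network("::/0")},
]
packets = [
    {"src": "192.0.2.10", "dst": "198.51.100.5"},
    {"src": "203.0.113.7", "dst": "198.51.100.5"},
    {"src": "2001:db8::1", "dst": "2001:db8:1::2"},
]
selected = select_for_sampling(packets, acl)
print(selected)
```

Only traffic of interest reaches the collector, which is what makes this feel closer to flow-based monitoring than plain rate-based sampling.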
One of the new enhancements to the NetIron sFlow implementation is the ability to provide Null0 interface sampling. Service providers often use the Null0 interface to drop packets during Denial of Service (DoS) attacks. sFlow can now be configured to sample those dropped packets to provide visibility into the DoS attack. This feature is in NetIron Software Release 5.5.
The other new enhancement that I’d like to mention is the ability to now capture the MPLS tunnel name/ID when sampling on ingress interfaces. This feature is coming very soon and will provide additional visibility into MPLS-based networks.
In summary, I hope you gained some additional insight into the advantages of leveraging the network visibility that sFlow provides. One last thing I’d like to correlate to sFlow is Network Analytics. These are complementary technologies which can co-exist together in the same network, while performing different functions. Brocade continues to innovate in both of these areas and I welcome any questions or comments you may have on sFlow or Network Analytics.
As data center networks scale to support thousands of servers running a variety of different services, a new network architecture using the Border Gateway Protocol (BGP) as a data center routing protocol is gaining popularity among cloud service providers. BGP has traditionally been thought of as only usable as the protocol for large-scale Internet routing, but it can also be used as an IGP between data center network layers. The concept is pretty simple and has a number of advantages over using an IGP such as OSPF or IS-IS.
Large-Scale BGP is Simpler than Large-Scale IGP
While BGP in itself may take some heavy learning to fully grok, BGP as a data center IGP uses basic BGP functionality without the complexity of full-scale Internet routing and traffic engineering. BGP is especially suited for building really big hierarchical autonomous networks, such as the Internet. So, introducing hierarchy with EBGP and private ASNs into data center aggregation and access layers down to the top of rack behaves just like you would expect. We’re not talking about carrying full Internet routes down to the top of rack here, just IGP-scale routes, so even lightweight BGP implementations that run on 1RU top of rack routers will just work fine in this application.
The hierarchy and aggregation abilities of an IGP are certainly quite extensive, but each different OSPF area type, for example, introduces different behaviors between routers, areas and how different LSA types are propagated. There’s a lot of complexity to consider when designing large-scale IGP hierarchy, and a lot of information that is flooded and computed when the topology changes. The other advantages of BGP are the traffic engineering and troubleshooting abilities. With BGP you know exactly what prefix is sent to and received from each peer, what path attributes are sent and received, and you even have the ability to modify path attributes. Using AS paths you can tell precisely where the prefix originated and how it propagated, which can be invaluable in troubleshooting routing problems.
How it Works
What you basically do is divide the network into modular building blocks made up of top of rack access routers, aggregation routers, and data center core routers. Each component uses its own private ASN, with EBGP peering between blocks to distribute routing information. The top of rack component doesn’t necessarily need to be a single rack; it could certainly be a set of racks and a BGP router.
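As a rough illustration of that layout, the following Python sketch assigns one private ASN to each building block and lists the EBGP sessions between blocks. It is a planning aid under simple assumptions, not a configuration generator: the block names are hypothetical, and it draws only from the 16-bit private ASN range (64512 to 65534).

```python
PRIVATE_ASN_FIRST = 64512
PRIVATE_ASN_LAST = 65534

def build_fabric(num_aggregation_blocks, racks_per_block):
    """Assign one private ASN per building block and list the EBGP sessions."""
    next_asn = PRIVATE_ASN_FIRST

    def allocate():
        nonlocal next_asn
        if next_asn > PRIVATE_ASN_LAST:
            raise ValueError("ran out of 16-bit private ASNs")
        asn = next_asn
        next_asn += 1
        return asn

    asns = {"core": allocate()}
    sessions = []
    for block in range(1, num_aggregation_blocks + 1):
        agg = f"agg{block}"
        asns[agg] = allocate()
        sessions.append(("core", agg))      # core <-> aggregation
        for rack in range(1, racks_per_block + 1):
            tor = f"agg{block}-tor{rack}"
            asns[tor] = allocate()
            sessions.append((agg, tor))     # aggregation <-> top of rack
    return asns, sessions

asns, sessions = build_fabric(num_aggregation_blocks=2, racks_per_block=2)
for upper, lower in sessions:
    print(f"EBGP: {upper} (AS {asns[upper]}) <-> {lower} (AS {asns[lower]})")
```

Because every block has its own ASN, each session in the list is a true EBGP session, and the AS path on any prefix shows exactly which blocks it crossed.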
Petr Lapukhov of Microsoft gave a great overview of the concept at a NANOG conference recently in a presentation called “Building Scalable Data Centers: BGP is the Better IGP”, which goes into a lot more background on their design goals and implementation details. If you’d like to experiment with the network design as Petr describes, the commands for the BGP features on slide 23 for the Brocade NetIron software are:
AS_PATH multipath relax: multipath multi-as (router bgp)
Allow AS in: no enforce-first-as (router bgp or neighbor)
Fast EBGP fallover: fast-external-fallover (router bgp)
Remove private AS: remove-private-as (router bgp or neighbor)
Taking it a Step Further
An alternative that takes the design even further from top of rack down into the virtual server layer for high-density multitenant applications is to also use the Brocade Vyatta vRouter. In this design, EBGP would be run from the data center core at each layer to a virtual server that routes for a set of servers in the rack. This addition gives customers a lot of flexibility in controlling their own routing, for example, if they wanted to announce their own IP address blocks to their hosting provider as part of their public cloud. Customers could also use some of the other vRouter VPN and firewall features to control access into their private cloud.
In addition to using BGP to manage routing information, you can also build an OpenFlow overlay to add application-level PBR to the network. With the Brocade hybrid port feature, which enables routers to forward using both OpenFlow rules and Layer 3 routing on the same port, introducing SDN into this network as an overlay is easy. In fact, this is exactly what Internet2 is doing in production on their AL2S (Advanced Layer 2 Services) network to enable dynamically provisioned Layer 2 circuits.
So is BGP better as a data center IGP? I think the design lends itself especially well to building modular data center networks with independent and autonomous modular components that can be built all the way down to the virtual server level. Perhaps you even have different organizations running their own pieces of the network, or servers that you’d rather not invite into your OSPF or IS-IS IGP.
For more information on Brocade’s high density 10 GbE, 40 GbE and 100 GbE routing solutions, please visit the Brocade MLX Series product page.
Today, Brocade announced its strategy to bridge the physical and virtual worlds of networking to enable customers to build an “On-Demand Data Center”. For service providers, an On-Demand Data Center means getting closer to becoming the greatly sought after cloud provider by increasing business agility, reducing complexity and scaling virtualization. In this blog I will focus on the announcement of the new 40 GbE interface module we have added to the Brocade MLX Series to enhance the physical aspects of the data center core that are required as the foundation for the On-Demand Data Center.
In the core of the service provider data center, network operators need to be able to respond in real time to dynamic business needs by delivering applications and services on demand. At the same time, they must contain costs through more efficient resource utilization and simpler infrastructure design. Traditional network topologies and solutions are not designed to support increasingly virtualized environments. With the Brocade MLX 4-port 40 GbE module, in conjunction with Brocade VCS Fabric technology, you can scale the data center fabric and extend across the Layer 3 boundary between data centers. High 40 GbE density with advanced Layer 3 capabilities helps consolidate devices and links needed in the data center core. Large Link Aggregation Group (LAG) capabilities provide capacity on-demand and reduce management overhead. By consolidating devices and simplifying the network, customers can reduce capital expenditures and operational expenditures in terms of power, space, and management savings, minimizing TCO. In addition to massive scalability from the 40 GbE density, the rich feature set of the Brocade MLX 4-port 40 GbE module eliminates the need for additional edge routers by enabling Layer 3 data center interconnect with full featured support for Access Control Lists (ACLs), routing, and forwarding in the data center core.
Prior to 2012, optical equipment dominated the 40 GbE market. 40 GbE is now taking off on Ethernet routers and switches, principally in data centers because it helps to bridge the bandwidth and economics gap between 10 GbE and 100 GbE for customers. The market for 40 GbE in high-end routing applications is expected to ramp up quickly, with CAGR from 2013 to 2016 expected to be 125% with a total market size in 2016 of $239M (Source: Dell’Oro, 2012). Similar to 10 GbE, business drivers will be the growth of bandwidth-intensive applications:
The image shows a primary deployment model for the 40 GbE module in the core of the data center. The high density, wire-speed performance enables 40 GbE connection with the aggregation layer – in this case the Brocade VDX 8770 supporting the VCS Fabric. Also supporting advanced MCT, this new module enables data center cores to scale in a highly resilient and efficient manner.
The MLXe also serves as an ideal border router to interconnect the data center to the WAN – or other data centers. Here 40 GbE or 100 GbE is typically used. The new 40 GbE module is often used, especially where underlying WAN optical infrastructure does not yet support 100G.
There has been lots of recent discussion about Google and AT&T targeting to provide the city of Austin, TX with a 1-gigabit-per-second Internet serv.... While the competitive and innovative spirit should make Austin feel like one of the luckiest towns in the world, I would like to tell you about a metro service provider in Clarksville, Tennessee that already provides its residential and commercial customers with 1 Gigabit Ethernet services to the premises.
CDE Lightband is the leading municipal utility provider of electricity, digital television, Internet and voice to all of the 100 square miles located within the boundaries of Clarksville, TN. They serve more than 64,000 customers and maintain 892 miles of power lines and 960 miles of fiber optic cable. Most notably, CDE Lightband provides a true Active Ethernet network to their customers. This means that each and every one of their residential and commercial customers has their own active Ethernet, Fiber-to-the-Premise port. The value of an Active Ethernet network is that the bandwidth on the connection is not shared, and is thus an effective way of ensuring a 1-Gbps connection to each subscriber. It is certainly a feather in the hat for a service provider of any size.
Brocade is proud to support CDE Lightband’s Active Ethernet project. By using the Brocade NetIron CES series switches, CDE Lightband can sell Gigabit Internet service and provide bandwidth throughput. In the future, CDE Lightband plans to use the 10G ports on the Brocade CES so they can grow the switches into the network as they expand their internal infrastructure.
Like all service providers, CDE Lightband’s top priority is to provide world class performance and reliability. Because of the Brocade CES series switches, CDE Lightband is able to offer their customers unique and powerful Ethernet services (as exemplified in their Active Ethernet project) and deliver them on pace with their customers’ business and personal requirements. Brocade is very honored to be the backbone of CDE Lightband’s network!
To learn more about the Brocade and CDE Lightband partnership, please watch this video.
I recently returned from IETF 86 and would like to update the folks in this community with a brief synopsis of the event. Overall, it was a very well attended, interactive and relevant event! But I think that’s pretty much the norm these days, particularly with all the interest in SDN related technologies and use cases. I will post a separate blog in our SDN community on the SDN related IETF activities, so please go there for that update. In this blog, I will focus on IETF activities related to service providers.
I’ll start off with the discussion around IPv6 in MPLS networks. While we all know that there has been some interest and IETF standards work in the area of MPLS/IPv6, it has yet to garner much real deployment interest. Techniques for providing IPv6 over IPv4-based MPLS networks, such as 6PE and 6VPE, have solved some of the issues with IPv6 and MPLS. However, it appears the IETF community is now getting behind full IPv6 support in MPLS. This would include native IPv6 LDP and RSVP-TE support. Some folks believe that although full IPv6 MPLS networks may not be needed for another 3-5 years, the IETF community should get on board now and start officially driving this. The MPLS WG will start formally tracking progress in this area, as it’s deemed important work.
Entropy labels to improve load balancing in MPLS networks were briefly discussed, and this appears to be a done deal in terms of standards (RFC 6790), with broad community support and consensus.
TRILL over Pseudo-Wires was discussed in the PWE3 WG. This is cool stuff and appears to have some degree of consensus. This basically would allow a TRILL domain in one data center to have layer-2 connectivity to another TRILL domain in another data center.
A similar topic of VXLAN over L2 VPNs was discussed in the L2VPN WG. This would provide a layer-2 MPLS connection between VXLAN or NVGRE logical overlay networks. This is also a pretty cool use case and this appears to be a needed solution if VXLAN/NVGRE solutions become more widely deployed in data centers. A somewhat related topic was discussed on how Ethernet VPNs (E-VPNs) could be leveraged to provide a data center overlay solution. In this context, E-VPNs are based on MPLS technologies. While this solution revolves around Network Virtualization Overlays, it was discussed in the L2VPN WG due to it leveraging MPLS technologies. This Internet Draft was also discussed in the NVO3 WG.
Interesting work on MPLS forwarding compliance and performance requirements was discussed in the different MPLS WGs. This work intends to document the MPLS forwarding paradigm from the perspective of the MPLS implementer, MPLS developer and MPLS network operator. Very useful work!
In the L3VPN WG, there were quite a few IDs that overlap with the NVO3 WG and data center overlay technologies. The general support for MPLS-based solutions for data center overlay architectures appears to be gathering momentum. From a high level, this does make sense as MPLS VPN technologies provide a logical network overlay in the wide area of service provider networks. As data center overlay architectures evolve, why not leverage this work and experience? I will discuss more on this topic in my SDN community blog.
To wrap up the MPLS activities: there were a number of other MPLS-related developments and enhancements that I won’t go into detail about here. Areas such as P2MP LSPs, special purpose MPLS label allocations, OAM, and additional functionality for advertising MPLS labels into the IGP (like an enhanced “forwarding adjacency”) were all discussed and are progressing at various stages through the IETF standards process.
Another WG that generated a fair amount of activity and interest is PCE. This is also an area of IETF work that is somewhat related to the SDN solution space. This WG is focused on how to enhance traffic-engineering decisions in MPLS networks. PCE functionality would “recommend” traffic-engineered LSPs for the network but would not be responsible for the actual instantiation of those LSPs into the network. That would be done by another function; and is deemed outside the scope of the PCE WG.
The WG agreed to make the PCE MIB “read-only”. This makes sense since the MIB is not a good place to implement PCE functionality. They also discussed P2MP LSPs, Service aware LSPs and even the support of wavelength switched optical networks. They also agreed that “Stateful” PCE was indeed in scope and in the charter.
Overall, nothing really ground breaking to report on in the area of routing activity at this IETF. One topic worth a mention is the North-bound distribution of link-state and TE routing information using BGP. This area is somewhat related to the SDN solution space, as it could provide upper layer applications (such as ALTO or PCE) the knowledge of link-state topology state from the network. This would allow those applications to make more intelligent traffic-engineering decisions.
Another area of routing that is interesting to mention is having the ability to make routing decisions based upon additional link-state metric information; such as latency, jitter and loss. This seems like a very logical evolution of IP routing.
And to wrap up the routing activity: as expected, the security of inter-domain routing continues to generate lots of interest. It was interesting that immediately after the IETF, there was a paper published by Boston University on the security implications of the Resource Public Key Infrastructure (RPKI) being discussed on the SIDR mailing list. This paper seems to re-ignite some of the controversy around secure routing.
I2RS was also very well represented and generated lots of interesting dialogue and debate.
This WG is fairly new. The primary goal of this WG is to provide a real-time interface into the IP routing system. This interface will not only provide a configuration and management capability into the routing system, but will also allow the retrieval of useful information from the routing system. Quite a bit of the discussion was centered around what type of state information needs to be injected into the routing system, what type of state information should be extracted from the routing system, and interestingly enough, what specifically is the “routing system”. The routing system is generally understood to be the Routing Information Base (RIB) in IP routers, but there was a good amount of debate on exactly what constitutes a RIB, what information it holds, and what the interface to this RIB might look and behave like. It appears this WG may have taken a step back to re-group and get more focused before moving on to solutions too rapidly.
There were five use case drafts that were presented and discussed. So, while this WG may have taken a step back to more clearly understand and define the problem space, they are also continuing to move forward with relevant use case definitions and then onward to solutions.
So, that wraps up this short update on the IETF 86 SP related activities. I should mention before closing that the I2RS WG intends to hold an interim meeting after the ONS event in April, so if you are attending the ONS event you may want to attend the I2RS interim meeting as well.
I’d like to follow Greg’s great blog from last week with a related topic. Like his blog, this blog will be focused on router hardware (unlike my previous blogs which were NetIron software related). The topic at hand is a brief discussion of the differences and the pros/cons of FPGA and ASIC technology. I’ll also briefly touch on the advantages of each of these technologies as they apply to high-end IP routers.
FPGAs (Field Programmable Gate Arrays) are specialized chips that are programmed to perform very specific functions in hardware. An FPGA is basically a piece of programmable logic. The first FPGA was invented in 1985, so this technology has been around for quite some time. Rather than executing a function in software, the same function can be implemented in an FPGA and executed in hardware. One can think of an FPGA as “soft-hardware”, since it can be reprogrammed after manufacturing. How many of you remember the bygone days of software-based IP routers? If you do, then you should also remember how poorly the Internet performed at that time! Performance was poor in software-based routers due to the fact that a centralized CPU executed all functions, both the control/management plane functions and the data plane functions of the router. Today, all modern routers execute the data plane functions in hardware; and more frequently, some vendors are moving certain control plane functions into the router hardware as well. The Bidirectional Forwarding Detection (BFD) protocol is one example of this, where portions of the BFD keep-alive mechanisms are implemented in the line card of the router.
While FPGAs contain vast amounts of programming logic and millions of gates, one thing to note is that there is some programming logic in an FPGA that is not used for the “customer facing” or "mission specific" application or function. In other words, not all the logic in an FPGA is designed to be directly used by the application the FPGA is providing for the customer. There are additional gates needed to connect all the internal logic that is needed to make it programmable; so an FPGA is not fully optimized in terms of “customer facing” logic.
Now, what I find interesting is that some people will still claim that FPGAs cannot scale to the speeds that are required in today’s Internet. However, Brocade has proven this claim to be quite false and has been shipping line-rate, high-end performance routers using FPGAs for over 10 years. As shown in the line card diagram in Greg’s blog, an FPGA in this context is really a programmable network processor.
One great advantage of an FPGA is its flexibility. By flexibility, I’m referring to the ability to rapidly implement or reprogram the logic in an FPGA for a specific feature or capability that a SP customer requires. When a networking vendor has a new feature that it wants to implement, the vendor may have the choice of deciding whether to put the feature in software or hardware. This is not always the case; for example, OSPF needs to be run in the control plane of the router and cannot be implemented in hardware. The question of whether to implement something in software or hardware basically comes down to a decision of flexibility versus scalability (and cost is always part of that decision process, as one would expect). Implementing something in software usually results in a rapid implementation timeframe, but often at the detriment to performance. As usual, there is always a trade-off to be made. However, if the vendor supports programmable network processors, they can implement the feature in hardware with no detriment to performance. While it takes more time to get the feature into an FPGA rather than implementing it in software, the time-to-market timeframe is still considerably less than doing a similar feature in an ASIC. The real advantage of this becomes evident with deployed systems in a production network. When a customer requires a feature that needs to be implemented in the forwarding plane of a router, once this feature is developed by the vendor the deployed systems in the field can be upgraded to use the new feature. This requires only a software upgrade of the system; no new hardware or line cards would be required. The routers’ software image contains code for the FPGAs, as well as the code for the control and management plane of the router.
Back to the performance question: Industry has shown that high-end FPGAs are growing in density while handling higher-speed applications and more complex designs. Furthermore, if you look at the evolution of FPGAs over the years, they follow Moore's Law just like CPUs have been doing in terms of the amount of logic that you can implement into them. Recent history has shown that FPGA development in terms of density is on an exponential growth curve.
FPGAs can also be used for developing a “snapshot” version of a final ASIC design. In this way, FPGAs can be re-programmed as needed until the final specification is done. The ASIC can then be manufactured based on the FPGA design.
ASICs have very high density in terms of logic gates on the chip, and the resulting higher scalability for the same power budget can give ASICs a competitive edge over an FPGA. One thing to note is that an ASIC is designed to be fully optimized in terms of gates and logic. All the internal structures are used for customer facing or mission specific applications or functions. So, while an ASIC may consume more power per unit die size than an FPGA, this power is amortized over a higher density solution and hence provides better power efficiency.
Compare/Contrast of FPGA-ASIC
So, FPGAs and ASICs are both specialized chips that perform complex calculations and functions at high levels of performance. FPGAs, however, can be re-programmed after fabrication, allowing the line card's feature set to be upgraded in the field after deployment. Being able to upgrade the data plane of a deployed router extends the useful lifespan of the system; which correlates to extended investment protection. Since an ASIC is not re-programmable, an ASIC-based line card cannot be upgraded in the field. This is a huge differentiator between the two technologies.
One excellent real-world example of this is when Brocade introduced support for 64 ports in a single LAG. This is industry leading scale (64 10GbE ports in a single LAG!) and since this functionality is implemented in the forwarding plane of the line card, it required reprogramming the Brocade network processor. While this type of capability is in the hardware of the router, it was implemented with a system software upgrade and no hardware needed to be replaced.
There are network scenarios or use cases where it makes more sense to have an FPGA-based product and there are use cases when it makes more sense to have an ASIC-based product. For example, a SP may determine that a high density solution is more important than a solution that provides quicker feature velocity and, thus, may choose an ASIC-based product. ASIC-based line cards are often denser in terms of numbers of ports and the cores of SP networks typically do not require high feature velocity. Most of the feature velocity in today’s SP networks is at the edge of the network (ie: at the PE router) or in the data center, where innovation is currently happening at a rapid pace. The general flexibility of an FPGA results in time-to-market advantages for feature implementation and soft-hardware bug fixes.
For smaller applications and/or lower production volumes, FPGAs may be more cost effective than an ASIC. The non-recurring engineering (NRE) cost of an ASIC can run into the millions of dollars. Conversely, in high volume applications the front-end R&D costs of an ASIC are offset by a lower cost to manufacture and produce. For example, in high-end IP core routers, ASIC-based line cards are more economical due to the lower manufacturing cost, combined with the higher port density of the line card that ASICs can provide.
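The volume trade-off above can be made concrete with a quick break-even calculation. This is only a sketch in Python: the NRE and per-unit dollar figures below are hypothetical placeholders chosen for illustration, not actual Brocade or industry numbers.

```python
def total_cost(nre, unit_cost, volume):
    """Total program cost: one-time NRE plus per-unit manufacturing cost."""
    return nre + unit_cost * volume


def break_even_volume(asic_nre, asic_unit, fpga_nre, fpga_unit):
    """Volume above which the ASIC program becomes cheaper than the FPGA one.

    Returns None if the ASIC unit cost is not lower, since the ASIC can
    then never recover its higher up-front cost.
    """
    if asic_unit >= fpga_unit:
        return None
    return (asic_nre - fpga_nre) / (fpga_unit - asic_unit)


# Hypothetical figures, for illustration only.
ASIC_NRE, ASIC_UNIT = 2_000_000, 50   # high up-front cost, cheap to produce
FPGA_NRE, FPGA_UNIT = 100_000, 450    # low up-front cost, pricier per chip

volume = break_even_volume(ASIC_NRE, ASIC_UNIT, FPGA_NRE, FPGA_UNIT)
print(f"break-even at about {volume:.0f} units")
```

With these made-up inputs, the FPGA is the cheaper choice below the break-even volume and the ASIC wins above it, which is the shape of the argument in the paragraph above.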
As costs related to ASIC development are increasing, some recent trends may suggest that FPGAs could be a better alternative even for high volume applications that traditionally used ASICs. It is unclear whether this trend is indeed sustaining or a somewhat temporary aberration.
To summarize the primary differences between FPGA and ASIC based line cards; at the highest level it basically comes down to a scalability versus a flexibility question (again, with cost a large contributing factor). ASICs are advantageous when it comes to high port density applications. FPGAs are advantageous when it comes to feature velocity with a shortened time-to-market requirement. In high end core routers, high density ASIC-based line cards can provide higher density at a lower cost than FPGA-based line cards. So, it’s based upon the use case and network application to determine which type of technology would be favored over the other.
As usual, any questions or comments are welcome!
It’s hard to believe that Ethernet is turning 40 this year, isn’t it? Since its conception by Bob Metcalfe and the team of engineers at XEROX PARC in the 1970s, Ethernet technology has continued to evolve to meet the increasing bandwidth, media diversity, cost, and reliability demands of today’s networks. The next Ethernet evolution has officially started, and I'm excited to follow the latest developments on this new technology that will enable networks to support even higher capacities.
“Here is more rough stuff on the ALTO ALOHA network.” Memo sent by Bob Metcalfe on May 22, 1973.
I wrote about 400 GbE in my blog recently as the next likely Ethernet speed, and now it’s official. Last week at the March 2013 IEEE 802 Plenary Session, 400 GbE became an official IEEE 802.3 Study Group that will start work on developing the new standard. Though 100 GbE is only a few years old, it’s important that we start working on the next speed now, so that we have the technology shipping when there is demand from network operators to deploy higher speed Ethernet.
The 400 Gb/s Ethernet Study Group is starting with strong industry consensus this time, which will enable the standard to be developed faster than before. The 400 GbE Call-For-Interest presentation was given last week to measure the interest in starting a 400 GbE Study Group in the IEEE. Based on the hard work of the IEEE 802.3 Ethernet Bandwidth Assessment (BWA) Ad Hoc and the IEEE 802.3 Higher Speed Ethernet (HSE) Consensus Ad Hoc, there was clear consensus on the direction the industry should take on the next Ethernet speed. The straw polls and official vote on the motion to authorize the Study Group formation were all in favor with a few abstentions, which showed a high degree of consensus from the individuals and companies represented. This was not so with the last Ethernet speed evolution, which was simply called the Higher Speed Study Group (HSSG) when it was formed. First, the HSSG had to analyze the market and come up with feasible higher speed solutions before even deciding on the speed. This made the standardization process much longer as the HSSG debated 40 GbE and 100 GbE, and eventually standardized both speeds for different applications. Since we are already starting the 400 Gb/s Ethernet Study Group with a clear speed objective in mind, the standardization process should be much faster. This means the Study Group could have the 400 GbE standard finished in 2016 with the first interfaces available on the market soon after.
Stay tuned for more updates as we follow the road to 400 GbE! If you happen to be in the Bay Area next week, check out the Ethernet 40th Anniversary Celebration at the Ethernet Technology Summit on Wednesday evening at 6 pm, April 3, 2013.
While considering what to write about for this blog, after my previous blog about a really cool NetIron 5.3 feature, I thought I’d stick with that trend for now and talk about another highly anticipated 5.3 feature. This one also happens to be MPLS-based and it’s often a required SP capability within an MPLS-based solution. It’s called Automatic Bandwidth Label-Switched Paths, or Auto-BW LSPs for short.
The Good News
As we know, RSVP-TE based networks are capable of considerable optimizations in terms of bandwidth reservations and traffic engineering. Operators can “plumb” their networks more intelligently, by reserving LSP bandwidth onto specific paths within their network for certain traffic types or overlay services. This makes their networks run more efficiently and with better performance. Operators and network managers like this, as they are getting the most out of the network. In other words, they are “getting their money’s worth”.
The Not So Good News
While bandwidth reservations and traffic engineering provide great capabilities in MPLS networks, oftentimes the configured bandwidth reservations turn out to be less than optimal. In other words, it’s great for an operator to be able to say “for this LSP between these two endpoints I want to reserve 2.5 Gbps of bandwidth” and then make that happen in the network. The operator knows that the topology can support the 2.5 Gbps of capacity due to capacity planning exercises or from offline MPLS-based TE tools. Cool. (btw: It may be desirable to integrate sFlow data into the capacity planning capability or maybe even into an offline MPLS-based TE tool, but that’s a topic for a future blog.)
But what if there is a sustained increase in traffic, well above the reserved 2.5 Gbps, for that service? How are those surges handled by the LSP? Or what if the actual sustained traffic load is only 1.5 Gbps? In that case, no other LSP may be able to reserve the “extra” 1 Gbps of capacity since it is already reserved for that specific LSP. Now the operator’s network is plumbed in a less than optimal fashion. They are no longer “getting their money’s worth” out of the network.
The (now) Gooder News
Here is where Auto-BW LSPs come onto the scene to save the day and make the operator a hero (again).
Auto-BW LSPs can solve both problems mentioned: handling a sustained surge in traffic above what was previously planned for, and freeing up “extra” capacity that is not actually being used but, because it is allocated to an LSP, is not available to be reserved by other LSPs.
Overview of Auto-BW
In its simplest definition: auto-bandwidth is an RSVP feature which allows an LSP to automatically and dynamically adjust its reserved bandwidth over time (ie: without operator intervention). The bandwidth adjustment uses the ‘make-before-break’ adaptive signaling method so that there is no interruption to traffic flow.
The new bandwidth reservation is determined by sampling the actual traffic flowing through the LSP. If the traffic flowing through the LSP is lower than the configured or current bandwidth of the LSP, the “extra” bandwidth is being reserved needlessly. Conversely, if the actual traffic flowing through the LSP is higher than the configured or current bandwidth of the LSP, it can potentially cause congestion or packet loss. With Auto-BW, the LSP bandwidth can be set to some arbitrary value (even zero) during initial setup time, and it will be periodically adjusted over time based on the actual bandwidth requirement. Sounds neat, huh? Here’s how it works…
First, determine what the desired sample-interval and adjustment-interval should be set at. The traffic rate is repeatedly sampled at each sample-interval. The default sampling interval is 5 minutes. The sampled traffic rates are accumulated over the adjustment-interval period, which has a default of 24 hours. The bandwidth of the LSP is then adjusted to the highest sampled traffic rate amongst the set of samples taken over the adjustment-interval. Note that the highest sampled traffic rate could be higher or lower than the current LSP bandwidth.
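To make the arithmetic concrete, here is a minimal Python sketch of the basic adjustment rule just described. The function names are my own and purely illustrative (this is not NetIron code or CLI); only the 5-minute and 24-hour defaults come from the text.

```python
SAMPLE_INTERVAL_S = 5 * 60           # default sample-interval: 5 minutes
ADJUSTMENT_INTERVAL_S = 24 * 3600    # default adjustment-interval: 24 hours


def samples_per_adjustment(sample_interval_s=SAMPLE_INTERVAL_S,
                           adjustment_interval_s=ADJUSTMENT_INTERVAL_S):
    """Number of traffic samples accumulated in one adjustment-interval."""
    return adjustment_interval_s // sample_interval_s


def new_reservation(sampled_rates_mbps):
    """The LSP is re-sized to the highest rate sampled during the interval."""
    return max(sampled_rates_mbps)


# An LSP reserved at 2500 Mbps whose samples peak at 1500 Mbps
# would be re-sized downward to 1500 Mbps.
print(samples_per_adjustment())               # samples per interval with the defaults
print(new_reservation([1200, 1500, 1350]))    # new bandwidth in Mbps
```

With the defaults, 24 hours of 5-minute samples gives 288 samples per adjustment-interval.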
That’s basically it in a nutshell, but there are other knobs available to tweak for further control (as expected, operators want more knobs to tweak).
In order to reduce the number of readjustment events (ie: too many LSPs constantly re-sizing), we allow the operator to configure an adjustment-threshold. For example, if the adjustment-threshold is set to 25%, the bandwidth adjustment will only be triggered if the difference between the current bandwidth and the highest sampled bandwidth is more than 25% of the current bandwidth.
As mentioned, the adjustment-interval is typically set pretty high, at around 24 hours. But a high value can lead to a situation where the bandwidth requirement becomes suddenly high but the LSP waits for the remaining adjustment-interval period before increasing the bandwidth. In order to avoid this, we allow the operator to configure an overflow-limit. For example, if this value is set to 3, the LSP bandwidth readjustment will be triggered as soon as the adjustment-threshold is crossed in 3 consecutive samples.
The feature will also allow the operator to set a max-bandwidth and a min-bandwidth value to constrain the re-sizing of an LSP to within some reasonably determined bounds.
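The knobs described above (adjustment-threshold, overflow-limit, and the min/max bounds) can be combined into one decision routine. The Python below is a rough sketch under my own assumptions, not the actual NetIron implementation: in particular, I assume that only samples above the current bandwidth count toward the overflow-limit, and that an overflow re-sizes the LSP to the highest sample seen so far. All names are hypothetical.

```python
def exceeds_threshold(current_bw, candidate_bw, threshold_pct):
    """True if the difference is more than threshold_pct of the current bandwidth."""
    return abs(candidate_bw - current_bw) > current_bw * threshold_pct / 100.0


def clamp(bw, min_bw, max_bw):
    """Constrain a bandwidth value to the configured min/max bounds."""
    return max(min_bw, min(bw, max_bw))


def run_adjustment_interval(current_bw, samples, threshold_pct,
                            overflow_limit, min_bw, max_bw):
    """Process one adjustment-interval of samples; return (new_bw, reason)."""
    if not samples:
        return current_bw, "unchanged"
    consecutive = 0
    highest = 0
    for rate in samples:
        highest = max(highest, rate)
        # Assumption: only upward threshold crossings count toward overflow.
        if rate > current_bw and exceeds_threshold(current_bw, rate, threshold_pct):
            consecutive += 1
            if consecutive >= overflow_limit:
                # Re-size early instead of waiting for the interval to end.
                return clamp(highest, min_bw, max_bw), "overflow"
        else:
            consecutive = 0
    # Normal end-of-interval check against the highest sample.
    if exceeds_threshold(current_bw, highest, threshold_pct):
        return clamp(highest, min_bw, max_bw), "interval"
    return current_bw, "unchanged"


# 1000 Mbps LSP, 25% threshold, overflow-limit 3, bounds 100..2000 Mbps.
print(run_adjustment_interval(1000, [1300, 1400, 1500, 900], 25, 3, 100, 2000))
print(run_adjustment_interval(1000, [900, 1100, 1200], 25, 3, 100, 2000))
```

In the first call, three consecutive samples exceed 1250 Mbps, so the re-size happens early; in the second, the peak of 1200 Mbps is within 25% of the current bandwidth, so nothing changes.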
It is also possible to simply gather statistics based on the configured parameters, without actually adjusting the bandwidth of an LSP. This option involves setting the desired mode of operation to either monitor-only or monitor-and-signal.
The Auto-BW feature also provides a template-based configuration capability, where the operator can create a template of auto-bandwidth parameter values and apply the templates on whichever path of an LSP that needs the same configuration or across multiple LSPs.
This example below shows three adjustment-intervals on the horizontal axis and traffic load of the LSP on the vertical axis. After each adjustment-interval, the LSP bandwidth is automatically adjusted based upon the sampled traffic rate. The diagram also shows where the adjustment-threshold is set and exceeded by the actual traffic rate, which then results in the bandwidth adjustment. The red line is the bandwidth of the LSP, after being adjusted at each adjustment-interval.
In the example above, each adjustment-interval has three sample-intervals. The following graphic shows the relationship between the sample-interval and the adjustment-interval.
Auto-BW Solves Real Problems
Here is one simple but real-life scenario where Auto-BW can prevent packet loss. Consider the topology below.
In this topology there are two LSPs between PE1 and PE3; each with 400 Mbps of reserved bandwidth and each with actual traffic loads approaching the 400 Mbps reservations. The entire topology consists of 1 GbE links so both of these LSPs can share any of the links since their combined bandwidth reservation is 800 Mbps. Constrained Shortest-Path First (CSPF) calculations put both LSPs onto the PE1-PE2-PE3 path.
However, over time LSP2's actual traffic load grows in size to 650 Mbps. Now the combined traffic load of both LSPs exceeds the capacity of a 1 GbE link and packet loss is now happening on the PE1-PE2-PE3 path. RSVP, specifically the CSPF algorithm, cannot take the additional “actual” traffic load into account so both LSPs remain on the same path. This is not good. The reason for this is the Traffic-Engineering Database (TED) that CSPF uses to determine paths in the network is not updated by actual traffic loads on links or in LSPs. This is just how RSVP-TE works.
When Auto-BW is enabled, both LSPs are sampled to determine their actual traffic loads. After the adjustment-interval, LSP2 is re-sized to 650 Mbps. Now both LSPs can no longer share the same path as the CSPF algorithm will compute that the combined bandwidth of the LSPs now exceeds a single 1 GbE link. So, the result is that CSPF will look into the TED for a new path in the network from PE1 to PE3 that meets the bandwidth requirement of LSP2 and will traffic engineer that LSP onto the PE1-PE4-PE5-PE3 path.
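The scenario above can be walked through with a toy CSPF in Python. This is a sketch under stated assumptions: the topology is reconstructed from the path names in the text, the metric is simply hop count, and the old reservation is released before the new path is computed, which simplifies away the make-before-break signaling. None of the names below are real router APIs.

```python
from collections import deque

LINK_CAPACITY_MBPS = 1000  # every link in the example topology is 1 GbE

# Undirected links, assumed from the two paths named in the text.
LINKS = [("PE1", "PE2"), ("PE2", "PE3"),
         ("PE1", "PE4"), ("PE4", "PE5"), ("PE5", "PE3")]


def link_key(a, b):
    """Order-independent key for an undirected link."""
    return tuple(sorted((a, b)))


def cspf(reserved, src, dst, bandwidth):
    """Fewest-hop path using only links with enough unreserved bandwidth."""
    neighbors = {}
    for a, b in LINKS:
        # Constraint step: prune links that cannot fit the request.
        if LINK_CAPACITY_MBPS - reserved.get(link_key(a, b), 0) >= bandwidth:
            neighbors.setdefault(a, []).append(b)
            neighbors.setdefault(b, []).append(a)
    queue = deque([[src]])
    seen = {src}
    while queue:  # breadth-first search = shortest path by hop count
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def reserve(reserved, path, bandwidth):
    """Add (or, with a negative value, release) bandwidth on each link of a path."""
    for a, b in zip(path, path[1:]):
        key = link_key(a, b)
        reserved[key] = reserved.get(key, 0) + bandwidth


reserved = {}  # plays the role of the TED's reserved-bandwidth view
lsp1_path = cspf(reserved, "PE1", "PE3", 400)
reserve(reserved, lsp1_path, 400)
lsp2_path = cspf(reserved, "PE1", "PE3", 400)
reserve(reserved, lsp2_path, 400)

# Auto-BW re-sizes LSP2 to 650 Mbps: release the old reservation and recompute.
reserve(reserved, lsp2_path, -400)
lsp2_new_path = cspf(reserved, "PE1", "PE3", 650)
reserve(reserved, lsp2_new_path, 650)

print(lsp1_path, lsp2_path, lsp2_new_path)
```

The key point is that CSPF only sees reservations, not actual load. Once the reservation itself changes to 650 Mbps, the PE1-PE2-PE3 links (with 400 Mbps already reserved for LSP1) no longer qualify, and LSP2 moves to the longer path.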
The operator is now a hero (again) because the network is back to working at its maximum efficiency and performance levels.
As usual, any questions or comments are welcome! Also, if there are future topics related to MPLS that you would like to see a blog about, please post them in the comments and we will see what we can do.
I’d like to continue a previous discussion about Brocade’s Multi-Chassis Trunking (MCT) technology. Please see the earlier blog: MCT with VPLS. The MCT w/VPLS capability was part of NetIron Software Release 5.3. In NetIron Software Release 5.4, we added a powerful enhancement to provide Multicast over MCT.
A diagram of this capability is shown below. In the diagram, there are two MLXe routers who are MCT peers. They have multicast receivers downstream and multicast sources upstream.
The diagram shows that the MCT Cluster Client Edge Ports (CCEPs) now have the ability to support the Internet Group Management Protocol (IGMP) and the Protocol Independent Multicast (PIM) protocol. As you recall, the CCEPs are the MCT customer facing edge ports. The diagram shows multiple host receivers behind two layer-2 switches, who are the MCT clients, and the host receivers are sending IGMP join requests toward the network. IGMP is used by hosts to establish and manage their multicast group membership. The MCT client layer-2 switches are directly connected to the CCEPs of the MCT cluster. Each layer-2 switch is doing standard link aggregation (LAG) to connect to both of the MLXe routers. As with all MCT configurations, the client layer-2 switches are unaware that they are connected to two MLXe routers; this is the active/active redundancy that MCT provides.
Both of the MCT peers will receive IGMP join requests and will subsequently send PIM joins toward the multicast Rendezvous Point (RP) or multicast source, depending on whether PIM-SM (*, G) or PIM-SSM (S, G) is being used. So, PIM runs on the network facing interfaces of the MCT peers, including the Inter-Chassis Link (ICL). The MCT ICL is also used to synchronize the IGMP membership state between the MCT peers. The result is that both of the MCT peers will install the correct multicast membership state. The diagram shows a few of the scenarios that are possible; where sources can be somewhere inside the IP network or directly attached to either MLXe router. However, the sources and receivers can actually be reversed such that sources are behind the MCT client layer-2 switches and the host receivers are either somewhere in the IP network or directly attached to either MLXe router. All variations are supported.
A hashing algorithm determines if the active multicast outgoing interface (OIF) is a local CCEP interface or the ICL. As shown in the diagram below, multicast traffic can arrive on both MLXe routers from the source but the MLXe with the local forwarding state is the only one that forwards traffic to the host receiver.
Since both MCT peers are properly synchronized, forwarding is performed as expected on the multicast shortest-path tree.
Some of the benefits of this compelling enhancement are:
As an example of a possible failure scenario: if CCEP2 fails, the MCT peers will remain synchronized such that the redundant MCT peer immediately takes over as the active forwarding device.
So, you can see that this is a very powerful feature as it provides for an active/active redundancy capability while maintaining the optimal multicast forwarding tree under failure scenarios. This is no easy feat!
Continue to stay tuned to this page for additional NetIron enhancements, as Brocade continues to lead the industry in innovation!
This is a great opportunity for me to introduce a really cool and highly anticipated feature that is part of the Brocade NetIron 5.3 Software Release. The official release date for this software is sometime next week, but because you are part of this awesome SP community, you get a sneak peek!
While 5.3 contains many new innovative features that our SP customers have been clamoring for, I thought I’d pick one in particular and write a bit about it here. The feature is Multi-Chassis Trunking integration with Virtual Private LAN Service, or MCT w/VPLS for short.
First, a short background refresher on what problem MCT solves. (BTW: Brocade has been supporting MCT for well over a year now.)
Brocade developed MCT to provide a layer-2 “active/active” topology in the data center without the need to run a spanning-tree protocol (STP). STP has traditionally been used to prevent layer-2 forwarding loops when there are alternate paths in a layer-2 switched domain. However, STP has its issues in terms of convergence, robustness, scalability, etc. Orthogonal to STP, link aggregation (IEEE 802.3ad) is also often deployed to group or bundle multiple layer-2 links together. The advantages of link aggregation are:
So, MCT leverages standards-based link aggregation but is capable of providing this “across” two switch chassis instead of just one chassis. This is shown below.
As you can see, there are two chassis that act like a single logical switch. This is called an MCT pair or cluster. The devices on either side of the MCT logical switch believe they are connected to a single switch. Standard LAG is used between these devices and the MCT logical switch. The advantage of doing this is that now both switches in the MCT cluster are functioning at layer-2 in an “active/active” manner. Both can forward traffic and if one chassis has a failure, standard failover mechanisms for a LAG bundle take effect. In addition, there are no layer-2 loops formed by an MCT pair so no STP is needed!
Now, for a short background refresher on what VPLS provides. (BTW: Brocade has been supporting VPLS for many years now.)
VPLS provides a layer-2 service over an MPLS infrastructure. The VPLS domain emulates a layer-2 switched network by providing point-to-multipoint connectivity across the MPLS domain, allowing traffic to flow between remotely connected sites as if the sites were connected by one or more layer-2 switches. The Provider Edge (PE) devices connecting the customer sites provide functions similar to a switch, such as learning the MAC addresses of locally connected customer devices, and flooding broadcast and unknown unicast frames to other PE devices in the VPLS VPN.
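The switch-like behavior of a PE described above (learn source MACs, flood broadcast and unknown unicast) can be sketched in a few lines. This is a toy model under stated assumptions: the class name VplsPe, the port names and the MAC addresses are hypothetical, and real VPLS details such as split horizon between pseudowires, MAC aging and per-instance tables are deliberately left out.

```python
class VplsPe:
    """Toy model of one PE's forwarding logic in a single VPLS instance."""

    BROADCAST = "ff:ff:ff:ff:ff:ff"

    def __init__(self, ports):
        self.ports = set(ports)   # local attachment circuits and pseudowires
        self.mac_table = {}       # learned MAC address -> port it was seen on

    def receive(self, src_mac, dst_mac, in_port):
        """Learn the source MAC, then return the set of ports to forward on."""
        self.mac_table[src_mac] = in_port
        out_port = self.mac_table.get(dst_mac)
        if dst_mac == self.BROADCAST or out_port is None:
            # Broadcast or unknown unicast: flood to every port except ingress
            return self.ports - {in_port}
        if out_port == in_port:
            # Destination lives behind the ingress port: nothing to forward
            return set()
        return {out_port}

# Hypothetical PE with one customer-facing port and two pseudowires.
pe = VplsPe(["ac1", "pw-to-pe2", "pw-to-pe3"])
flooded = pe.receive("00:00:00:00:00:0a", "00:00:00:00:00:0b", "ac1")
learned = pe.receive("00:00:00:00:00:0b", "00:00:00:00:00:0a", "pw-to-pe2")
```

The first frame is flooded to both pseudowires because the destination is unknown; the reply is sent only to ac1 because the PE has already learned where the first host lives.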
MCT with VPLS
Very frequently, a customer network needs to provide layer-2 connectivity between multiple data centers, for instance to enable VM mobility. The MCT w/VPLS feature I’m describing provides this type of connectivity in a redundant and highly available fashion. MCT provides the “active/active” layer-2 connectivity from the server farm or access layer to the core layer of the data center. The customer then leverages VPLS on the core layer data center routers to transport the layer-2 Ethernet frames between data centers. This is shown below.
In the diagram above, the CE switch uses a standard LAG to connect to the redundant MCT cluster. The same NetIron routers that form the MCT cluster are also configured to support VPLS to connect to the backbone network. So, the connection from the CE switch in one data center to a remote CE switch in another data center is a layer-2 service. VM mobility between the data centers is now provided in a redundant end-to-end fashion.
Fast failover in the VPLS network is provided by using redundant Pseudo-Wires (PWs), based on IETF Internet Draft <draft-ietf-pwe3-redundancy-03>. As shown below, each PE router signals its own PW to the remote PE. These local PE routers determine, based on configuration, which PE signals an active PW and which PE signals a standby PW. There is also a spoke PW signaled between the PE routers. In the case of the active PW failing, the primary PE router signals to the secondary PE router to bring up its standby PW. This failover is provided in a rapid manner.
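The active/standby behavior can be pictured as a very small state change. The sketch below only models the outcome described above (the standby PW is brought up when the active one fails); the function name fail_over, the state labels and the PE names are assumptions for illustration, and none of the actual signaling from the PW redundancy draft is represented.

```python
def fail_over(pw_state: dict, failed_pe: str) -> dict:
    """Return the PW states after the PW signaled by failed_pe goes down.

    pw_state maps a PE name to "active", "standby" or "down".
    """
    new_state = dict(pw_state)
    was_active = new_state[failed_pe] == "active"
    new_state[failed_pe] = "down"
    if was_active:
        # Promote the first standby PW found; with two PEs there is only one.
        for pe, state in new_state.items():
            if state == "standby":
                new_state[pe] = "active"
                break
    return new_state

# Hypothetical starting point: PE1 signals the active PW, PE2 the standby PW.
before = {"PE1": "active", "PE2": "standby"}
after = fail_over(before, "PE1")
print(after)
```

If the standby PW is the one that fails, nothing is promoted and the active PW simply keeps forwarding.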
The benefits of this solution are:
So, as you can see this is a really awesome capability for SPs who need to integrate their data center infrastructure with their MPLS/VPLS backbone network. We expect this solution to become a very common data center network architecture going forward for providing inter-data center layer-2 connectivity. I should also note that this solution works with Virtual Leased Line (VLL), in addition to VPLS. And, on top of that, it integrates with Ethernet Fabrics in the data center extremely well!
Stay tuned to this forum for more blogs like this.
Last year was another exciting year for 100 GbE as we saw several new technical developments and large deployments by service provider, data center, research and HPC network operators. Here's a quick recap of the highlights for 2012. AMS-IX, our biggest 100 GbE customer and one of the biggest 100 GbE networks in the world, upgraded their 10 GbE core to a 100 GbE core with over 90 x 100 GbE ports in their backbone alone for a capacity of over 7.8 Tbps. The IEEE 802.3ba standard for 40 GbE and 100 GbE, now over 2½ years old, was added to the latest IEEE 802.3-2012 "Standard for Ethernet". 2nd generation 100 GbE projects in the IEEE P802.3bj and P802.3bm Task Forces are in progress that will lower cost and increase density. We’re now well underway to the next evolution of 100 GbE technology and even to the next speed of Ethernet, 400 GbE.
One trend that I’ve noticed among service providers is that 100 GbE peering at IXPs (Internet Exchange Points) is on the rise. We saw a lot of 100 GbE deployments primarily in core networks over the past couple of years, and now 100 GbE peering is taking off too. Several IXPs around the world, most of whom are Brocade customers, have announced the availability of 100 GbE peering ports or the intent to offer them this year: AMS-IX (Amsterdam), DE-CIX (Frankfurt), JPIX (Tokyo), JPNAP (Tokyo), LINX (London), Netnod (Stockholm) and NIX.CZ (Prague). AMS-IX, for example, has deployed three 100 GbE customer ports already, and has six more on order that are expected to go live in the next several weeks. They will also have the first customer 2 x 100 GbE LAG, which will upgrade a 12 x 10 GbE LAG.
The motivation for 100 GbE peering is obvious: to reduce the number of 10 GbE LAGs that connect to an IXP for cheaper and simpler peering. 10 GbE LAG is a great solution but when you consider the port costs, cross connect costs, management and troubleshooting costs, etc. it does start to add up. Costs are different for every network operator as all networks are different, but in general 100 GbE starts to make sense when 10 GbE LAGs exceed six to seven links. Incidentally it also made sense to upgrade to a DS3 when a link exceeded six to seven inverse-multiplexed DS1s when I was a network engineer at MindSpring in the late 1990s, so there is some strange commonality in that number of links. AMS-IX’s 100 GbE port price for example is €9000/month, which is six times the 10 GbE port price of €1500/month.
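Using only the two AMS-IX port prices quoted above, the break-even point is easy to work out. The sketch below is a back-of-the-envelope comparison of monthly port fees; it ignores cross connect, management and troubleshooting costs, which differ per operator, and the function and variable names are just for illustration.

```python
import math

PRICE_10G_EUR = 1500    # AMS-IX 10 GbE port price per month (from the text)
PRICE_100G_EUR = 9000   # AMS-IX 100 GbE port price per month (from the text)

def lag_port_cost(links: int) -> int:
    """Monthly port fees for a LAG built from 10 GbE links (port fees only)."""
    return links * PRICE_10G_EUR

# Smallest LAG size whose port fees reach the price of one 100 GbE port.
break_even_links = math.ceil(PRICE_100G_EUR / PRICE_10G_EUR)

for links in range(4, 9):
    cost = lag_port_cost(links)
    # At equal cost the 100 GbE port wins, since it offers more capacity.
    choice = "100 GbE" if cost >= PRICE_100G_EUR else "10 GbE LAG"
    print(f"{links} x 10 GbE: EUR {cost}/month -> {choice}")
```

On port fees alone the crossover lands at six links, which lines up with the six-to-seven-link rule of thumb; the other per-link costs only push the decision further toward 100 GbE.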
There is another motivation for 100 GbE peering that is not so obvious too, and this demand comes from IXP resellers. IXP resellers are a relatively new development in the peering industry that enables service providers to peer remotely from anywhere in the world through a reseller port. Until recently, service providers were required to have a physical presence at an IXP in order to peer, because IXPs do not offer long haul transport services. Now IXP resellers, in partnership with an IXP, can resell peering ports remotely over their network to their customers. Remote peering capacity demand is what’s driving these 100 GbE ports. In order for a reseller to offer a high capacity service to their customer, say for example 20 Gbps or 40 Gbps, their own peering port to the IXP has to have the capacity available. Deploying a 100 GbE port to the IXP gives a reseller both the capacity and the flexibility to offer more capacity on demand, without having to constantly manage 10 GbE LAGs.
So, expect more announcements from IXPs about 100 GbE ports this year as 100 GbE peering goes mainstream in 2013.
Acknowledgements: I’d like to thank Henk Steenman, AMS-IX CTO, for his valuable insight and interesting data on 100 GbE peering.
There are some interesting developments in the Research & Education Networks (RENs) space. While there is continued interest and innovation in the REN space around OpenFlow and SDN, there are some related developments in terms of network architecture. After a brief overview of one such interesting development, I will then relate this development back to Brocade.
To start with, I’d like to describe an emerging network architecture called a “Science-DMZ”; which basically moves the high-performance computing (HPC) environment of a research & education campus network into its own DMZ. The reference architecture looks something like this:
As the diagram shows, the traditional “general purpose” campus network sits behind one or more security appliances, which are typically stateful firewalls. This DMZ, or perimeter network, protects the internal network, systems and hosts on the campus network from external security threats. The research and science HPC environment also traditionally connected into the same campus network; so it was also behind the DMZ firewall. This presented some challenges to the HPC environment in terms of data throughput (e.g., TCP performance), dynamic “ad-hoc” network connectivity, and general network complexity.
The concept of a Science-DMZ emerged where the connectivity to the HPC environment is moved to its own DMZ; in other words, this environment is no longer connected behind the campus DMZ and firewalls. It now sits on a network that is purposely engineered to support the high performance HPC requirements. As the diagram shows, the science and research environment is now connected to a Science-DMZ switch, which in turn connects to a Science-DMZ border router. Access control lists (ACLs) in the border router are leveraged to maintain security of this HPC environment. In addition to simpler access control mechanisms, when a scientist or researcher needs to set up a logical connection to another scientist or researcher to share data, the HPC network can be directly provided that connectivity with provisioning in the border router. For network performance testing and measurement, the perfSONAR tool is included in the reference architecture.
The Science-DMZ concept emerged out of work from the engineers at the Energy Sciences Network (ESnet). Please take a look at their website for additional details on this architecture. As I have explained, the idea here is pretty simple: to allow the local HPC environment to have better connectivity to other research & education networks by putting it on its own DMZ. The external connectivity is often provided via the national Internet2 backbone, or it could be provided via a regional REN backbone. To deliver this type of high performance connectivity, there are some hard requirements in terms of scale, performance and feature set of the Science-DMZ Border Router. This is where Brocade enters the conversation.
The hard requirements for this border router are:
• Must be capable of line-rate 100GbE, including support for very large, long-lived flows
• Must support pervasive OpenFlow & SDN, for ease of provisioning and innovative applications
• Must support deep packet buffers to handle short data bursts
• Must support line-rate ACLs to provide the security mechanisms needed, without impact to data throughput or performance
The Brocade MLXe high-performance router uniquely fits the bill of these requirements! As of software version 5.4, which started shipping in September of this year, the MLXe supports OpenFlow v1.0 in its GA release. The OpenFlow rules are pushed into hardware so the MLXe maintains its high forwarding performance; as it does with IPv4, IPv6 and MPLS forwarding. The largest MLXe chassis can scale to 32 ports of 100GbE or 768 ports of 10GbE, possesses deep packet buffers to handle bursty traffic and performs ACL functions in hardware.
In summary, the Science-DMZ architecture has emerged to solve some of the performance challenges for HPC environments and this reference architecture includes innovative features such as OpenFlow & SDN. The Brocade MLXe platform possesses the unique performance, functionality, and feature set that is required to perform the role of the Science-DMZ border router.
Stay tuned to this space for additional emerging developments in the research & education network arena.
I want to give you a quick overview and update on the industry’s progress toward 400 GbE as the next Ethernet speed. Though 100 GbE is only two years old, it’s important that we start working on the next speed now, so that we have the technology shipping when there is demand from network operators to deploy higher speed Ethernet. The Call for Interest (CFI) to start the 400 GbE Study Group that will work on defining a new Ethernet standard was just announced yesterday, and is scheduled to be held on March 18, 2013 at the next IEEE Plenary meeting.
Here’s a little history on how we chose 400 GbE as the next Ethernet speed. First, the IEEE 802.3 Ethernet Bandwidth Assessment (BWA) Ad Hoc was formed in 2011 to evaluate future Ethernet wireline bandwidth needs. The BWA gathered input from the industry so that we would have accurate bandwidth growth data and requirements for the next Ethernet speed. The full report was released in July, and found continuing growth of bandwidth demands in core and transport layers beyond 100 GbE. If you are just interested in a summary and overview of the findings, then have a look at Scott Kipp’s NANOG56 presentation from last month. Next, the IEEE 802.3 Higher Speed Ethernet Consensus (HSE) Ad Hoc first met in July to develop consensus on the next speed of Ethernet based on the BWA data. The November IEEE Plenary meeting was just held a couple of weeks ago, where the HSE Ad Hoc made progress on the draft 400 GbE CFI presentation.
Why Not TbE?
I’d really love for us to build TbE! But, in order to make TbE economically feasible the cost per bit needs to be at or below the cost of 100 GbE. This means it would make sense for us to reuse current 100 Gbps technology, which implies a TbE architecture using 40 x 25 Gbps signaling lanes. Unfortunately, reusing 25 Gbps signaling means that the resulting size of the pluggable media module and the large number of interface signals would simply be impractical to develop. Several good presentations were given at the IEEE HSE Consensus Ad Hoc meeting in September about why we should work on 400 GbE now and defer TbE for a few years. There are a couple of alternatives to 25 Gbps signaling, such as using advanced multilevel or phase modulation signaling, but these are still immature technologies that need more development before we can get the performance and low-cost manufacturing needed for volume production. Higher signaling rates will make TbE more feasible, but this technology isn’t expected to be available for the next several years.
As the 400 GbE CFI is now scheduled for March 2013, it means we will have the 400 GbE standard in mid-2015 at the earliest. It’s likely that the first generation of 400 GbE will use 16 x 25 Gbps signaling and that the first interfaces will be available in the 2016 timeframe. The questions that still need to be answered are the physical layer specifications for reaches and media, and this is what the Study Group will start working to define first. As we had for 100 GbE, the interfaces for 400 GbE will use a pluggable media module which gives network operators the most flexibility and choice. It’s likely that the 400 GbE media module will be called CDFP, which is short for “CD (400) Form-factor Pluggable”. As 400 GbE evolves with faster signaling technology, the second generation is expected to use 8 x 50 Gbps signaling which the Optical Internetworking Forum is already beginning to define. The third generation of 400 GbE is expected to use 4 x 100 Gbps signaling which has more advanced electrical and optical signaling technology that is being worked on in labs today. These key 100 Gbps signaling technologies will also be the building blocks for TbE and aren’t expected until after 2020. Stay tuned for more updates as we follow the road to 400 GbE!
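The lane counts mentioned in this post all follow from dividing the Ethernet speed by the per-lane signaling rate. Here is a small sketch of that arithmetic; the function name lanes_needed is made up for the example, and the speeds and lane rates are the ones quoted in the text.

```python
def lanes_needed(speed_gbps: int, lane_gbps: int) -> int:
    """Number of signaling lanes required to carry speed_gbps."""
    if speed_gbps % lane_gbps:
        raise ValueError("speed is not a whole multiple of the lane rate")
    return speed_gbps // lane_gbps

# (label, Ethernet speed in Gbps, per-lane signaling rate in Gbps)
options = [
    ("400 GbE, first generation", 400, 25),
    ("400 GbE, second generation", 400, 50),
    ("400 GbE, third generation", 400, 100),
    ("TbE reusing 25 Gbps signaling", 1000, 25),
]

for label, speed, lane in options:
    print(f"{label}: {lanes_needed(speed, lane)} x {lane} Gbps")
```

The last entry shows why TbE on today's signaling is unattractive: it needs 40 lanes, two and a half times as many as first-generation 400 GbE.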