
Data Center Solution-Design Guide: VMware vSphere 5.0 SRM with Dual Active/Active Data Centers


Synopsis: A design guide with best practices for a two-site data center disaster recovery solution with VMware vSphere 5.0, SRM and array-to-array storage replication over Fibre Channel over IP (FCIP).

 

 

Preface

 

Overview

Server virtualization has broken the tight coupling between application workloads and the server hardware and operating system supporting them. Application and operating system state are now portable between servers in a cluster. This innovation defines a new way to achieve disaster recovery that leverages built-in capabilities of the virtualization platform.

 

With an increased number of business critical applications deployed on server virtualization hypervisor platforms, such as VMware® vSphere™ 5.0, many data centers are updating their Business Continuity / Disaster Recovery (BC/DR) designs to support virtualization platforms.

 

For example, using VMware vSphere 5.0 with vCenter Site Recovery Manager (SRM), application workloads and data can automatically be synchronized between servers in a cluster. If the server hosting a workload goes offline, the application restarts on another server in the SRM cluster. If the SRM cluster includes servers in different data centers, then SRM provides a reliable disaster recovery tool. A key component of this solution is highly reliable data replication which relies on array-to-array replication using either Fibre Channel, Fibre Channel over IP (FCIP) or IP for storage replication traffic.

 

The following Brocade platforms are used in this solution design.

  • Brocade Fabric OS® (FOS) for SAN switches and DCX® Backbones
  • Brocade Network Operating System (NOS) for VDX™ series switches
  • Brocade NetIron® Operating System for MLX™ series routers

 

Purpose of This Document

This guide describes how to design a disaster recovery (DR) solution for VMware vSphere 5.0 with SRM using Brocade IP networking and SAN products.

 

Design best practices are included for the SAN, SAN extension service, data center network and inter-data center network. This design was developed, built and validated in Brocade’s Strategic Solution Validation Lab.

 

This design incorporates array-based replication due to its excellent scalability, high performance and low latency.

 

A companion Deployment Guide document describes configuration of all products in this solution.

 

Audience

This document is intended for disaster recovery planners, and solution, network and SAN architects who are evaluating and deploying DR solutions for VMware vSphere 5.0.

 

Objectives

This design guide provides guidance and recommendations, based on best practices, for a two-site data center disaster recovery solution with VMware vSphere 5.0, SRM and array-to-array storage replication using Fibre Channel over IP (FCIP).

 

Restrictions and Limitations

The design is limited to dual data centers and does not address environments with more than two data centers.

 

When selecting a storage array for an SRM solution, the storage vendor must provide a VMware Storage Replication Adapter (SRA) for vSphere 5.0.

 

Related Documents

The following documents are valuable resources for the designer. In addition, review any Brocade release notes that have been published for NOS, FOS and NetIron.

 

References

 

About Brocade

Brocade® (NASDAQ: BRCD) networking solutions help the world’s leading organizations transition smoothly to a world where applications and information reside anywhere. This vision is designed to deliver key business benefits such as unmatched simplicity, non-stop networking, application optimization, and investment protection.

 

Innovative Ethernet and storage networking solutions for data center, campus, and service provider networks help reduce complexity and cost while enabling virtualization and cloud computing to increase business agility.

 

To help ensure a complete solution, Brocade partners with world-class IT companies and provides comprehensive education, support, and professional services offerings. (www.brocade.com)

 

Key Contributors

The content in this guide was provided by the following key contributors.

  • Lead Architect: Marcus Thordal, Strategic Solutions Lab

 

Document History

Date                  Version        Description

2012-06-20       1.0                  Initial Version

2012-07-12       1.1                  Edited graphics for better PDF display/print

 

Reference Architecture

 

Business Requirements

To comply with the business requirements and service level agreements (SLAs), applications must perform equally well whether running at the primary or the recovery data center site.

 

Note:

SRM supports bidirectional configuration. Therefore, in a dual data center configuration, the primary site for an application can be either data center. This requires each application to be on different storage LUNs at each site, and the array must support replication from the primary LUNs to separate LUNs at the secondary site. The array replication software must support independent replication directions for individual LUN pairs.
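To make the independent replication directions concrete, the following minimal sketch (Python, illustration only; the site, application and LUN names are hypothetical and not tied to any vendor's replication software) models two applications whose LUN pairs replicate in opposite directions across the same array pair.

from dataclasses import dataclass

@dataclass
class LunPair:
    app: str            # application using this LUN pair
    primary_site: str   # site currently holding the read/write copy
    recovery_site: str  # site holding the replicated copy
    lun_primary: str    # LUN at the primary site
    lun_recovery: str   # replicated LUN at the recovery site

# Two applications, each active at a different site, replicating in
# opposite directions over the same array pair.
pairs = [
    LunPair("ERP",  "DC-A", "DC-B", "LUN_0010", "LUN_0110"),
    LunPair("Mail", "DC-B", "DC-A", "LUN_0210", "LUN_0310"),
]

for p in pairs:
    print(f"{p.app}: {p.lun_primary} ({p.primary_site}) -> "
          f"{p.lun_recovery} ({p.recovery_site})")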

 

For bidirectional configurations, equal compute, network and storage resources must be available at both sites. This means each site is a mirror of the other one. While it is possible to design configurations where the recovery site has fewer resources than the primary site, such a deployment is restricted to an active/passive configuration resulting in degraded application availability and performance. Best practice for designing a two-site DR configuration is to equip both sites equally as this greatly reduces complexity when provisioning and maintaining applications as well as the DR infrastructure.

 

Increasingly, it is common practice to design DR solutions with active/active configurations and distribute production applications more or less equally across both sites. The benefit is that, in the event of a site failure, only half of the production servers and applications are impacted and subject to the recovery process. Of course, network latency has to be taken into consideration so that end-to-end response time from client to application does not exceed best practice.

 

When designing a DR configuration, the highest level of automation for failover and recovery is desirable. Therefore, this design provides a SAN able to support a fully automated DR solution using VMware SRM. However, the design principles and Brocade products used in this design are not specific to a single virtualization platform and also apply to physical server configurations.

 

Architecture

This design guide is based on Brocade’s Data Center Network Infrastructure Reference Architecture (DCNI-RA). The reference architecture (RA) provides a flexible, modular way to architect and design data center networks. As shown below, VMware vSphere 5.0 SRM is overlaid on top of the DCNI-RA to create this solution design.

 

The DCNI-RA is constructed from building blocks which are combined into templates. Building blocks are repeatable component configurations commonly found in data center networks. Templates provide repeatability, incorporate scale-out and simplify the architecture.

 

Figure: Reference Architecture Templates with Building Blocks

 

References

 

Special Considerations

The inter-site connectivity (bandwidth/cost) and distance (RTT latency) between data centers have a fundamental impact on the design options for the WAN network and the storage replication technology, and they directly affect the RPO and RTO of the disaster recovery solution. Commonly, the cost of inter-site connectivity is the determining factor for the data replication design, which must then be accommodated in the selection of network and replication technology.
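As a rough illustration of how inter-site bandwidth bounds RPO for asynchronous replication, the sketch below checks whether an assumed WAN link can drain an assumed data change rate. All numbers (bandwidth, compression ratio, change rate) are placeholders to be replaced with measured values from your environment.

# Back-of-the-envelope sizing sketch for asynchronous replication
# (illustrative numbers only; substitute measured values).

wan_bandwidth_gbps = 1.0        # usable inter-site bandwidth
compression_ratio  = 2.0        # assumed compression on the replication path
change_rate_gb_per_hour = 500.0 # measured data change rate at the protected site

effective_gbps = wan_bandwidth_gbps * compression_ratio
# GB/hour the link can drain: Gbps -> GB/s -> GB/hour (1 byte = 8 bits)
drain_gb_per_hour = effective_gbps / 8 * 3600

if drain_gb_per_hour >= change_rate_gb_per_hour:
    print("Link can keep up; RPO is bounded by the replication cycle interval.")
else:
    backlog_per_hour = change_rate_gb_per_hour - drain_gb_per_hour
    print(f"Link falls behind by {backlog_per_hour:.0f} GB/hour; "
          "RPO grows over time, so add bandwidth or reduce the change rate.")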

 

Design

 

Topology

The diagram below shows the topology used to validate this design. The tested design can scale to larger configurations as required.

 

Figure: Typical Dual Active-Active Data Center Topology for VMware vSphere SRM

 

The design includes:

  • Dual Fibre Channel backbones with embedded array-to-array SAN replication blades at one site and rack-mount replication switches at the other
  • Brocade VCS Fabrics with VDX switches connecting the server edge to the core
  • Dual Brocade ADX Application Delivery Controllers for global load balancing of client connections
  • Dual Brocade MLX routers at the data center core, with Brocade CER routers attached to dual WAN links providing VPLS service between the data center core routers at each site

 

The topology provides a network that is highly available and resilient, with excellent scalability, and it avoids the use of the Spanning Tree Protocol. It integrates Fibre Channel SAN fabrics with VCS Fabrics for server and storage connectivity, and it provides storage replication over distance for both virtual machine state and application storage between the two data centers.

 

Base Design

This section describes the base design for the solution. Any design options are optimizations of this base design and documented in later sections.

 

The network design has three templates, data center, SAN and core, which are constructed from several building blocks documented in the Data Center Infrastructure Reference Architecture.

 

The network is designed in an open way to accommodate a variety of server and storage vendor products. However, the choices are restricted by the VMware hardware compatibility requirements for vSphere 5.0 and SRM.

 

VMware vSphere 5.0 SRM

Synopsis

The vSphere 5 platform must be configured in accordance with the requirements for VMware SRM. It is highly recommended that servers in the same protection group have the same specifications, providing equal compute and memory resources at both sites and thereby equal performance at the protected and recovery sites.

 

With most SRM implementations, the vCenter server at each site must have in-band Fibre Channel connectivity to the storage subsystem, as well as access to the management IP network, in order to coordinate LUN replication states during failover, recovery and reprotect operations.

 

When configuring the vNetworks on the vSphere hosts, use either vNetwork Standard Switches (vSS) or vNetwork Distributed Switches (vDS). Using the same names and switch types at both sites makes configuration and troubleshooting less complicated and underscores the goal that the configuration should be the same at both sites.
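As one way to check that vNetwork names match across sites, the sketch below compares the port group names reported by the two vCenter servers. It assumes the open-source pyVmomi library is installed; the vCenter host names and credentials are placeholders, and the unverified SSL context is for lab use only.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def portgroup_names(host, user, pwd):
    """Return the set of network/port group names known to one vCenter."""
    ctx = ssl._create_unverified_context()   # lab use only
    si = SmartConnect(host=host, user=user, pwd=pwd, sslContext=ctx)
    try:
        content = si.RetrieveContent()
        view = content.viewManager.CreateContainerView(
            content.rootFolder, [vim.Network], True)
        return {net.name for net in view.view}
    finally:
        Disconnect(si)

site_a = portgroup_names("vcenter-a.example.com", "administrator", "secret")
site_b = portgroup_names("vcenter-b.example.com", "administrator", "secret")
print("Only at site A:", sorted(site_a - site_b))
print("Only at site B:", sorted(site_b - site_a))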

 

When designing the SRM configuration, it is recommended to classify your applications based on SLAs and to identify the corresponding VMs and VM interdependencies. Based on a VM’s classification, place it into the designated protection group(s), prioritize the VMs within each protection group, and define any pre- or post-recovery tasks necessary for recovery. In this design, the network infrastructure provides layer-2 adjacency as well as a site-local default gateway for application servers using VRRP-E with short-path forwarding, which eliminates the need to reassign the IP addresses of the VMs protected by SRM. This simplifies deployment and ongoing operations and management.
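The planning step of classifying VMs into protection groups and priorities can be prototyped in a few lines. The sketch below is illustrative only and does not call the SRM API; the VM names, tiers, group names and priority values are hypothetical.

# Group VMs by SLA tier into protection groups and assign recovery priorities
# (1 = recovered first). Dependencies are noted for recovery-plan ordering.
vms = [
    {"name": "db01",   "tier": "gold",   "depends_on": []},
    {"name": "app01",  "tier": "gold",   "depends_on": ["db01"]},
    {"name": "web01",  "tier": "silver", "depends_on": ["app01"]},
    {"name": "test01", "tier": "bronze", "depends_on": []},
]

tier_to_group    = {"gold": "PG-Tier1", "silver": "PG-Tier2", "bronze": "PG-Tier3"}
tier_to_priority = {"gold": 1, "silver": 3, "bronze": 5}

plan = {}
for vm in vms:
    group = tier_to_group[vm["tier"]]
    plan.setdefault(group, []).append(
        (tier_to_priority[vm["tier"]], vm["name"], vm["depends_on"]))

for group, members in sorted(plan.items()):
    print(group)
    for priority, name, deps in sorted(members):
        deps_note = f" (after {', '.join(deps)})" if deps else ""
        print(f"  priority {priority}: {name}{deps_note}")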

 

Block Diagram

Figure: VMware vSphere 5.0 with Site Recovery Manager Template

 

Key Features

Third-party storage-based replication

Support a broad range of storage-based replication products for large, business-critical environments.

Non-disruptive testing

Enable frequent non-disruptive testing of recovery plans to ensure that they meet business requirements.

Centralized recovery plans

Replace traditional, error-prone manual run books with simple, automated recovery plans.

Automated DR failover

Monitor site availability and alert users of possible site failures. Initiate recovery plan execution from VMware vCenter Server with a single button. Execute user-defined scripts and pauses during recovery. Reconfigure virtual machines’ IP addresses to match network configuration at failover site. Manage and monitor execution of recovery plans within VMware vCenter Server.

 

References

 

Data Center Template

The following diagram shows the Data Center template and the building blocks used.

 

Figure: Data Center Template

 

The IP network is deployed the same way in both data centers with redundancy so there is no single point of failure.

 

The following blocks are used to construct the data center template.

 

VCS Fabric, ToR Block

Synopsis

This block moves the L2/L3 boundary out of the access block. The ToR switches support Brocade’s VCS Fabric technology, eliminating spanning tree while providing multi-path traffic flow, automatic load balancing, and automatic trunking with very low fabric convergence times. When switches are connected together, they use 10 GbE ports and automatically form Brocade ISL Trunks. A LAG to servers and uplink switches provides HA, with the added benefit that a Brocade Virtual LAG (vLAG) allows the links within the LAG to connect to multiple VCS Fabric switches.

 

When using vLAGs for redundant connectivity of the vSphere servers, it is recommended to attach both NICs to the same vNetwork and configure the port group’s load-balancing option to “Route based on IP hash”.
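The snippet below is a simplified illustration of why “Route based on IP hash” spreads traffic across a vLAG: each source/destination IP pair hashes to one uplink, so many flows together use both VDX switches. The toy hash shown is not VMware’s exact algorithm, and the addresses are placeholders.

import ipaddress

def uplink_for(src, dst, uplink_count=2):
    # Toy hash: XOR of the two 32-bit addresses, then modulo the uplink count.
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % uplink_count

flows = [("10.1.1.11", "10.2.2.21"),
         ("10.1.1.11", "10.2.2.22"),
         ("10.1.1.12", "10.2.2.21"),
         ("10.1.1.12", "10.2.2.22")]

for src, dst in flows:
    print(f"{src} -> {dst}: uplink {uplink_for(src, dst)}")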

 

The data center template uses Brocade’s VCS Fabric (NOS 2.1.1 or later) with two ToR VDX switches in each rack at the server edge. A vLAG spanning both VDX switches, combined with NIC teaming in the server, provides link resiliency and load balancing.

 

The VDX switches are connected to the MLX using Multi Chassis Trunking (MCT) at the routing core with a VCS Fabric vLAG for ease of configuration and optimal link utilization.

 

For ease of management it is recommended to use VCS vCenter integration with VCS Fabric AMPP, which enables automatic VLAN placement based on the vNetwork port group configuration on the vSphere servers. Therefore, the vLAG ports connecting to the vSphere servers are configured as port-profile ports.

 

Block Diagram

Figure: Access Block, ToR VCS Fabric

 
Key Features

Automatic network formation

VCS Ethernet fabrics form automatically when switches are connected, enabling ease of deployment and non-disruptive scaling on demand

All links are forwarding

VCS Ethernet fabrics automatically provide multipath traffic flow and eliminate the need for spanning tree

Adjacent links automatically trunk

All VLANs are automatically carried on fabric Inter-Switch Links (ISLs), and traffic is load-balanced at the frame level, providing completely even traffic distribution

Topology agnostic

The VCS Ethernet fabric is topology agnostic, enabling the topology design to support the traffic flows

AMPP with VMware vCenter plug-in

Brocade VM-aware network automation provides secure connectivity and full visibility to virtualized server resources with dynamic learning and activation of port profiles. By communicating directly with VMware vCenter, it eliminates manual configuration of port profiles and supports VM mobility across VCS fabrics within a data center. In addition to providing protection against VM MAC spoofing, AMPP and VM-aware network automation enable organizations to fully align virtual server and network infrastructure resources, and realize the full benefits of server virtualization.

 

References

 

SAN Template

The following diagram shows the SAN template and the building blocks used.

 

Figure: SAN Template

 

The storage platform must have array-based replication enabled (requires licensing), either synchronous or asynchronous, depending on the distance (metro or geo) between the two sites and the RPO requirements. The selected storage platform must have a VMware Storage Replication Adapter (SRA) for vSphere 5.0. Refer to the VMware SRM compatibility matrix for the list of storage vendors providing an SRA.

 

While array-based replication is generally compatible between adjacent array generations, it is highly recommended to keep the microcode levels in sync across array pairs.

 

The storage network is built out using DCX 8510 backbone directors. With optical ICLs, a full-mesh topology can be used, enabling very low latency across the fabric and optimal flexibility for connecting storage and servers, with scalability of each SAN fabric beyond 3,000 ports of 16 Gbps with no oversubscription.

 

Data replication over the WAN can run on dark fiber, if available, using embedded FC compression and encryption for optimal performance and data security between the data centers. For Ethernet-circuit WANs, the fabric is extended between the two data centers using FCIP with either the Fibre Channel extension blade (FX8-24) in the DCX or the Brocade 7800. Best practice for SAN extension across the WAN is to configure Fibre Channel Routing (FCR) with the WAN connection as the backbone fabric, so that WAN connectivity disruptions do not propagate and create fabric disruptions at both sites. With fabric extension over FCIP, best practice is to enable Fastwrite, compression and encryption for optimal performance and data security.

 

Note:

Prior to deploying Fastwrite, verify that your selected storage replication solution supports Fastwrite.
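For a feel of what Fastwrite buys over distance, the sketch below estimates the WAN latency added to each replication write. It assumes roughly 5 microseconds per kilometer of fiber one way, two WAN round trips per SCSI write without Fastwrite, and about one with it; the distance is a placeholder.

distance_km = 300                      # inter-site fiber distance (example)
one_way_ms  = distance_km * 0.005      # ~5 us/km, expressed in milliseconds
rtt_ms      = 2 * one_way_ms

print(f"RTT over {distance_km} km        : {rtt_ms:.1f} ms")
print(f"Write without Fastwrite : ~{2 * rtt_ms:.1f} ms of WAN latency")
print(f"Write with Fastwrite    : ~{rtt_ms:.1f} ms of WAN latency")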

 

The following blocks are used to construct the SAN template.

 

Core Backbone Block

Synopsis

The Brocade DCX is a backbone-class chassis switch. Chassis with different slot capacities are available for inserting port cards and SAN Services cards. It’s common practice to deploy dual physical fabrics called Fabric “A” and Fabric “B” for high availability. Multiple core switches can be connected using either ISL connections, or with the DCX series, using inter-chassis links (ICL) which consume no ports on port cards as they are built-in to the chassis.

 

Block Diagram

 

Figure: SAN Block, Core Backbone

 

Key Features

ISL trunking across redundant circuits

Enabling optimized WAN bandwidth utilization across redundant circuits

Optical inter-chassis links (ICLs) enable low-latency, highly scalable topologies at 16 Gbps with no oversubscription

With optical ICLs, full-mesh topologies can provide up to 6,000-port, low-latency SAN fabrics

 

References

 

Integrated SAN Services Block

Synopsis

Service blocks connect SAN services with specific traffic flows between servers and storage in a shared storage environment. Important SAN services include data-at-rest encryption and SAN distance extension with device emulation and compression for storage replication. Device emulation reduces latency for FICON and SCSI I/O over distance, and compression reduces the amount of data sent; together they reduce the WAN bandwidth required and extend the distance over which array-based storage replication is practical.
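As a simple illustration of the bandwidth benefit, the sketch below estimates the initial full-copy time for an assumed data set over an assumed compressed WAN link. The data set size, bandwidth and compression ratio are placeholders, and protocol overhead is ignored.

dataset_tb        = 20.0    # initial data set to replicate
wan_gbps          = 1.0     # usable WAN bandwidth
compression_ratio = 2.0     # assumed compression on the extension platform

effective_gbps = wan_gbps * compression_ratio
# TB -> bytes -> bits, divided by effective bits per second
seconds = dataset_tb * 1e12 * 8 / (effective_gbps * 1e9)
print(f"Initial sync of {dataset_tb} TB: ~{seconds / 3600:.1f} hours")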

 

Integrated SAN services are deployed either as blades inserted into chassis slots in backbone switches (shown in Fabric-A below), or as separate switches connected to the core (shown in Fabric-B below).

 

Block Diagram

 

Figure: SAN Block, Core Integrated SAN Services

 

Key Features

SAN Extension

Facilitating high performance data replication between sites

FCIP Compression and Encryption

Compression and encryption of data replication with FCIP provides higher WAN bandwidth utilization and data security

Native FC ISL Compression and Encryption

For deployments where dark fiber is available, compression and encryption of data replication on native FC provide higher WAN bandwidth utilization and data security

 

References

 

Core Template

The following diagram shows the core template and the building blocks used.

 

Figure: Core Template

 

The WAN connectivity between the data centers can carry layer-2 Ethernet traffic, layer-3 IP traffic or both. Note that SRM does not require layer-2 adjacency between the data centers. For metropolitan-distance deployments, it can be advantageous to extend the layer-2 network between the two sites, enabling increased application mobility with live VM migration, dynamic application load balancing and flexible provisioning when servers in the same cluster run in both data centers.

 

In the case of extending layer-2 between data centers, the recommended technologies include Metro Ring Protocol (MRP), Virtual Leased Line (VLL), and Virtual Private LAN Service (VPLS) over Multi-protocol Label Switching (MPLS). Each is applicable depending on the specific characteristics of the available WAN connectivity.

 

The following block is used to construct the core template.

 

Data Center Interconnect, Core Block

Synopsis

Data center interconnects carry data center-to-data center traffic in support of disaster recovery and business continuity. Traffic can include server clusters, data replication using IP or Fibre Channel over IP (FCIP), and high availability or disaster recovery clusters for virtual server environments.

 

For virtualization DR configurations, live migration can be a cost-effective method to meet very low recovery time and recovery point objectives (RTO and RPO, respectively). Live migration of virtual machines requires that the same subnet be maintained after recovery to avoid client disruption. One way to provide layer-2 adjacency is to define layer-2 tunnels over MPLS. This uses customer premises equipment (CPE) routers between the core routers and the service provider’s MPLS routers.

 

Application recovery also requires data replication between data centers. A common method to accomplish this with Fibre Channel storage is to use Fibre Channel extension with Fibre Channel over IP (FCIP) over the WAN. Note that virtual machines are files, so data replication protects both the machine state as well as the application’s data. In this block, separate CPE routers are used for handling this replication traffic.

 

For other applications, application outage is acceptable provided the time to restart the application and connect to a copy of its data does not exceed RPO/RTO requirements. Common methods to achieve this rely on data replication between data centers, again, using array replication and FCIP to transport data between sites.

 

In smaller environments, the core block connects to access blocks and IP service blocks. In larger environments, the core block would connect to one or more aggregation blocks.

 

Block Diagram

 

Figure: Core Block, Data Center Interconnect

 

Key Features

Metro Ring Protocol (MRP)

Metro Ring Protocol enables very simple layer-2 adjacency between sites

Virtual Private LAN Service/Multi-protocol Label Switching (VPLS/MPLS)

Virtual Private LAN Service/Multi-protocol Label Switching enables layer-2 adjacency for multi-point to multi-point communication over MPLS

Virtual Leased Line/Multi-protocol Label Switching (VLL/MPLS)

Virtual Leased Line/Multi-protocol Label Switching enables layer-2 adjacency between two sites over MPLS

Multi Chassis Trunking (MCT)

Multi Chassis Trunking allows two switches to appear as one, enabling the design of a resilient and redundant router implementation

 

References

 

Layer-2 Lollipop IP Services Block

Synopsis

The data center template includes an IP Services block with global load balancing.

 

An active/active data center design should provide minimal disruption to client connections during a failover to the secondary data center. In each data center, Brocade ADX Application Delivery Controllers are used to provide server load balancing of client connections. The Application Resource Broker, a feature of the ADX, works with the ADX hardware to provide optimal distribution of client access and integrates with vCenter to automatically provision and de-provision VMs based on the client connection load. Between data centers, the ADX provides global server load balancing of client connections, with the ability to direct clients to the closest data center, reducing connection latency.

 

Although not part of the validation of this design, third party security and IDS/IPS products can be used with the ADX Application Delivery Controllers.

 

Block Diagram

 

Figure: IP Services Block, Layer-3 Lollipop

 

Key Features

Global Load Balancing

Directs client connections to the closest data center and distributes client loads between data centers

Application Resource Broker (ARB)

Provides dynamic server resource allocation / deallocation based on application workload via a plug-in to vCenter.

Brocade OpenScript Engine

A scalable application scripting engine that can help application developers and delivery professionals create and deploy network services faster, and is based on the widely used Perl programming language. Organizations can use OpenScript to augment Brocade ADX Series services with user-provided custom logic.

Brocade Application Resource Broker vSphere client plug-in

Monitoring of application workload and automatic allocation/deallocation of VMs as required to maintain client SLAs.

 

References

 

Management Template

Synopsis

Monitoring and management of the underlying network infrastructure in a unified way minimizes risk, reduces configuration error and provides early detection of traffic flows that are experiencing high latency and bottlenecks. Further, integration of monitoring and reporting of the network (SAN and IP) with VMware vCenter provides virtualization administrators with needed status and insights about the operational health of the storage SAN, client connections and application resource requirements. Brocade Network Advisor provides this network management platform.

 

Other vCenter plug-ins for management include Application Resource Broker support for the ADX series of Application Delivery Controller switches and the VCS Fabric Automatic Migration of Port Profiles (AMPP) plug-in, which automatically creates and synchronizes VCS Fabric port profiles with virtual machine port groups.

 

Traffic monitoring is a valuable service for active-active dual data centers, and Brocade includes open-standard sFlow monitoring in its NetIron, ServerIron and VDX families of products. Via third-party sFlow monitoring tools, network and virtualization administrators can see traffic performance down to the individual VM and workload in both data centers.
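As a minimal illustration of the sFlow plumbing, the sketch below listens on the standard sFlow UDP port (6343) and counts datagrams per sending switch or router. A real deployment would use a full sFlow analysis tool; this only shows where the export traffic originates.

import socket
from collections import Counter

SFLOW_PORT = 6343  # standard sFlow collector port (UDP)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", SFLOW_PORT))

counts = Counter()
try:
    while True:
        datagram, (agent_ip, _) = sock.recvfrom(65535)
        counts[agent_ip] += 1
        print(f"{agent_ip}: {counts[agent_ip]} sFlow datagrams "
              f"(last size {len(datagram)} bytes)")
except KeyboardInterrupt:
    sock.close()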

 

Block Diagram

Figure: Management Template

 

Key Features

sFlow

Traffic monitoring down to the individual virtual machine.

Brocade Network Advisor vCenter management plug-in for Brocade SAN

Monitoring and alerts for storage traffic flows.

 

References

 

 

Components

The following lists typical components that can be used in the design templates for this solution.

 

VMware vSphere 5.0 SRM Components

  • VMware vSphere 5: VMware license
  • VMware vCenter 5: VMware license
  • VMware SRM 5: VMware license

 

Data Center Template Components

  • Brocade VDX® 6720: Select based on the number of ports required
  • VCS software license for Brocade VDX 6720

 

SAN Template Components

  • Brocade DCX 8510: Fibre Channel SAN backbone switch
  • Brocade FX8-24 Extension Blade: DCX blade for encryption/extension
  • Brocade 7800 Extension Switch: Option for encryption/compression of array-to-array replication

 

Core Template Components

  • Brocade MLX Router: Select based on the number of slots to meet scalability requirements and AC/DC to meet power requirements.
  • Brocade NetIron CER 2000 Series Router: Select based on the number of ports to meet scalability and physical interface requirements.
  • ADV_SVCS_PREM for CER 2000: License for NetIron CER 2000 Series Advanced Services with MPLS support.
  • Brocade ServerIron ADX Application Delivery Controller: Select based on the number of CPU cores and number of ports to meet scalability requirements.
  • -PREM: License for ADX premium features (Layer 3 routing, IPv6, and Global Server Load Balancing (GSLB)).
  • Brocade OpenScript Engine: A scalable application scripting engine, based on the widely used Perl programming language, that can help create and deploy network services faster. OpenScript is used to augment Brocade ADX Series services with user-provided custom logic. The OpenScript engine includes the Brocade OpenScript performance estimator tool, which mitigates deployment risk by predicting the performance of scripts before they go live.
  • Brocade Application Resource Broker: Brocade Application Resource Broker (ARB) is a software component that simplifies the management of application resources within IT data centers by automating on-demand resource provisioning. ARB helps ensure optimal application performance by dynamically adding and removing application resources such as Virtual Machines (VMs). ARB, working in tandem with the Brocade ADX Series, provides these capabilities through real-time monitoring of application resource responsiveness, traffic load information, and capacity information from the application infrastructure.

 

Management Template Components

  • Brocade Network Advisor (BNA) 12.1: Single pane of glass management platform for the SAN and IP network
  • VMware vCenter management plug-in for Brocade SAN: vCenter integration plug-in
  • VMware vSphere client plug-in for Brocade Application Resource Broker: vSphere integration plug-in
