A resilient IT architecture gives an organization the ability to respond and adapt to a wide variety of external and internal demands, disruptions, disturbances, and threats while continuing business operations without significant impact. One of the key components in a resilient IT architecture for IBM z Systems customers is the IBM TS7700 with its multi-cluster grid configuration.
Your goal in designing your TS7700 grid architecture should be to maximize IT resiliency at the lowest cost. In other words, make efficient use of the network with connectivity devices that offer the highest performance, highest availability, and lowest operating costs, and that are easy to manage. This can now be achieved thanks to new developments in distance extension switching, a technology known simply as IP Extension, or IPEX. The remainder of this week's post reviews the basics of the IBM TS7700 Grid solution, and then discusses how the Brocade IPEX technology can enhance your TS7700 Grid reliability, availability, and performance.
IBM TS7700 Grid Basics
The prior-generation IBM Virtual Tape Server (VTS) offered peer-to-peer (PtP) VTS, a multisite-capable business continuity and disaster recovery solution. PtP VTS-to-VTS data transmission was originally done over Enterprise Systems Connection (IBM ESCON®), then FICON, and finally Transmission Control Protocol/Internet Protocol (TCP/IP).
With the TS7700, the virtual tape controllers and remote channel extension hardware for the PtP VTS were eliminated. This change significantly simplified both the infrastructure needed for a business continuity solution and its management. Hosts attach directly to the TS7700s. Instead of FICON or ESCON, the connections between the TS7700 clusters use standard TCP/IP. As with the PtP VTS of the previous generation, in a TS7700 Grid configuration data can be replicated between the clusters based on customer-established policies. Any data can be accessed through any of the TS7700 clusters, regardless of which system the data is on, as long as the grid contains at least one available copy.
As a business continuity solution for high availability and disaster recovery, multiple TS7700 clusters can be interconnected using standard Ethernet connections. Local and geographically separated connections are supported, which provides a great deal of flexibility to address customer needs. A set of clusters interconnected in this way is known as a TS7700 Grid: two to six physically separate TS7700 clusters that are connected to each other with a customer-supplied Internet Protocol network. The TCP/IP infrastructure that connects a TS7700 Grid is known as the grid network. The grid configuration is used to form a high availability, disaster recovery solution and to provide metro and remote logical volume replication. The clusters in a TS7700 Grid can, but do not need to, be geographically dispersed. In a multiple-cluster grid configuration, two TS7700 clusters are often located within 100 kilometers (km) of each other, while the remaining clusters can be more than 1,000 km away. This arrangement provides a highly available and redundant regional solution, and it also provides a remote disaster recovery solution outside of the region.
The TS7700 Grid is a robust business continuity and IT resilience solution. By using the TS7700 Grid, organizations can move beyond the inadequacies of on-site backup (disk-to-disk or disk-to-tape) that cannot protect against regional (non-local) natural or man-induced disasters. With the TS7700 Grid, data can be created and accessed remotely through the grid network. Many TS7700 Grid configurations rely on this remote access, which further increases the importance of the TCP/IP fabric. With increased storage flexibility, your organization can adapt quickly and dynamically to changing business environments. Switching production to a peer TS7700 can be accomplished in a few seconds with minimal operator skills. With a TS7700 Grid solution, z Systems customers can eliminate planned and unplanned downtime. This approach can potentially save thousands of dollars in lost time and business and can address today’s stringent government and institutional data protection regulations.
The network infrastructure that supports a TS7700 Grid solution faces some challenges and requirements of its own. First, the network components need to individually provide reliability, high availability, and resiliency. A TS7700 Grid network requires non-stop predictable performance with components that have “five-9s” availability, that is, 99.999% uptime. Second, a TS7700 Grid network must be designed with highly efficient components that minimize operating costs. Third, today’s rapidly changing and growing amounts of data require a TS7700 Grid network that meets the following specifications:
Highly scalable components that support business and data growth and application needs
Components that help accelerate the deployment of new technologies as they become available
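To put the “five-9s” figure in perspective, here is a small back-of-the-envelope sketch in Python. It converts an availability percentage into expected downtime per year, and shows what happens when several components sit in series on the same path. The function names are made up for this post, and the series calculation assumes component failures are independent, which real networks only approximate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap days


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Expected unavailable minutes per year at a given availability."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


def serial_availability(*component_pcts: float) -> float:
    """Availability (%) of a path whose components must ALL be up.

    Assumes independent failures, which is a simplification.
    """
    total = 1.0
    for pct in component_pcts:
        total *= pct / 100
    return total * 100


if __name__ == "__main__":
    print(f"{downtime_minutes_per_year(99.999):.2f} minutes/year at 99.999%")
    print(f"{serial_availability(99.999, 99.999):.5f}% for two such devices in series")
```

Five-9s works out to a little over five minutes of downtime per year for a single component. Chaining components in series lowers the end-to-end figure, which is one reason redundant paths matter so much in a grid network.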
IP Extension (IPEX) Basics
IP Extension (IPEX) is a new technology introduced by Brocade in 2015. Over the past several years, storage applications and replication solutions have been developed that perform their replication via IP instead of traditional mechanisms such as ESCON, FICON, or Fibre Channel over IP (FCIP). In addition, existing replication solutions such as EMC’s SRDF or IBM’s TS7700 Grid have been modified to perform replication over IP as an alternative, or even as the only option. Unfortunately for the end user, the performance of this IP-based replication has often not been on par with what was achievable with FCIP. There simply have not been IP switching platforms with features similar to what FCIP devices offered, and there is real benefit to using a technology designed for data storage traffic. With the advent of IPEX earlier this year, this has now changed. IPEX, and extension switching platforms that support it (such as the Brocade 7840), offer the same robust performance- and reliability-enhancing features for data replication that have been standard for many years with FCIP.
Security of IP Extension Flows: IPsec
Unsecured data leaving your data center potentially could cause data breaches and even unwanted publicity for an enterprise. Increasingly, end users are facing requirements to encrypt all data leaving the data center (encryption in flight). These requirements are typically driven by government regulation, internal audit requirements, or a combination of the two. Any data leaving the safe confines of the data center should be protected using encryption.
The Brocade 7840 offers hardware-based IPsec to secure data in flight across Extension Inter-Switch Links (ISLs). Ideally, you should look for an IPsec implementation that operates at line rate and introduces only a couple of microseconds of added latency, making it usable for synchronous applications. IPsec with the Brocade 7840 is part of circuit formation and protects data from virtually every type of attack, including sniffers, data modification, identity spoofing, man-in-the-middle, and denial of service. It requires no additional licenses or costs, is very simple to configure, and avoids the need for costly and complex firewalling. Because firewalls are software-based, they tend to provide poor performance. The Brocade 7840 IPsec implementation offers better performance than the TS7700 native encryption solution while providing prudent security at no additional cost.
Acceleration of IP Extension Flows: WAN-Optimized TCP
Transmission Control Protocol (TCP) is central to the high-speed transport of large data sets that are common in storage extension. Acceleration of flows across the WAN improves IP storage performance dramatically. Long-distance links add latency and are prone to packet loss. Tested applications have demonstrated improvements of up to 50 times, due to the ability to handle latency and packet loss without performance degradation. This performance has nothing to do with compression; any compression achievable is in addition to flow acceleration. Flow acceleration is purely a function of enhanced protocol efficiency across the network.
The Brocade 7840 terminates IP storage TCP flows locally and transports the data across the WAN using a specialized TCP transport called WAN-Optimized TCP (WO-TCP). The primary benefit here is the local acknowledgement (ACK). Because ACKs are limited to the local data center, the TCP stack on an end IP storage device only has to be capable of high-speed transport within the data center. Most native IP storage TCP stacks are capable of high speeds only over short distances. Beyond the limits of the data center, “droop” becomes a significant factor. Droop refers to the inability of TCP to maintain line rate across distance, and it worsens progressively as distance increases. WO-TCP is an aggressive TCP stack designed for Big Data movement, operating on the purpose-built hardware of the Brocade 7840. In performance tests, WO-TCP shows no droop across two 10 Gbps connections at up to 160 ms RTT per data processor. This is equivalent to two fully utilized 10 Gbps WAN connections (OC-192) between Los Angeles and Hong Kong.
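A quick way to see why droop happens is the classic window-limited TCP model: a sender can have at most one window of unacknowledged data in flight per round trip, so throughput is capped at window size divided by RTT. The Python sketch below uses that textbook model only. It says nothing about how WO-TCP works internally, and the 1 MiB window is simply an illustrative value I picked, not a TS7700 or Brocade setting.

```python
def bdp_bytes(link_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: bytes that must be in flight to fill the link."""
    return link_bps * rtt_s / 8


def window_limited_bps(window_bytes: int, rtt_s: float, link_bps: float) -> float:
    """Upper bound on throughput for a fixed send window (textbook model)."""
    return min(link_bps, window_bytes * 8 / rtt_s)


if __name__ == "__main__":
    link = 10e9          # one 10 Gbps connection
    window = 2 ** 20     # illustrative 1 MiB window (an assumption)
    for rtt_ms in (0.5, 10, 160):
        rate = window_limited_bps(window, rtt_ms / 1000, link)
        print(f"RTT {rtt_ms:>5} ms -> {rate / 1e6:,.1f} Mbps")
    print(f"Bytes in flight needed at 160 ms: {bdp_bytes(link, 0.160):,.0f}")
```

In this model the 1 MiB window fills a 10 Gbps link at data-center latencies but manages only about 52 Mbps at 160 ms, because filling the pipe at that distance takes roughly 200 MB in flight. Terminating the flow locally keeps the end device's TCP in the short-RTT regime.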
Extension Trunking is a technology originally developed by Brocade for mainframes and FCIP extension, and it has evolved to include IP Extension flows. Extension Trunking bundles multiple circuits together into a single logical trunk. Those circuits can span multiple service providers and different data center LAN switches for redundancy. Bandwidth is managed in such a way that if a data center LAN switch goes offline, or any disruption occurs along a path, the bandwidth of the remaining paths adjusts to compensate for the offline path. With the proper design, bandwidth can be maintained during outages of various devices in the pathway. Extension Trunking shields end devices from IP network disruptions, making network path failures transparent to replication traffic. Two or more circuits from the IBM SAN42B-R are applied to various paths across the IP network, and each added circuit adds bandwidth to the pool. Extension Trunking uses Deficit Weighted Round Robin (DWRR) scheduling when placing batches into the egress. A feature of Extension Trunking called Lossless Link Loss (LLL) ensures lossless data transmission across the trunk when data is lost in flight because a circuit has gone offline and WO-TCP is no longer operational across that circuit. If the circuit is still operational, WO-TCP itself recovers lost or corrupted data across the link. All data is delivered to the upper-layer protocol (ULP) in order.
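For readers who have not met deficit-style weighted round robin before, here is a minimal Python sketch of the idea applied to spreading batches across circuits. It is a simplified adaptation of the textbook algorithm, not Brocade's implementation: the function name, the circuit names, and the notion of a per-round byte "quantum" for each circuit are all assumptions made for illustration.

```python
from collections import deque


def dwrr_distribute(batch_sizes, quanta):
    """Assign batches (sizes in bytes) to circuits using deficit counters.

    quanta maps a circuit name to the bytes of credit it earns each round;
    a circuit with twice the quantum carries roughly twice the bytes.
    """
    if not quanta or any(q <= 0 for q in quanta.values()):
        raise ValueError("every circuit needs a positive quantum")
    pending = deque(batch_sizes)
    deficit = {name: 0 for name in quanta}
    assignment = {name: [] for name in quanta}
    while pending:
        for name, quantum in quanta.items():
            if not pending:
                break
            deficit[name] += quantum  # credit earned this round
            # Send batches while the circuit has enough credit for the next one.
            while pending and pending[0] <= deficit[name]:
                size = pending.popleft()
                deficit[name] -= size
                assignment[name].append(size)
    return assignment


if __name__ == "__main__":
    plan = dwrr_distribute([1000] * 12, {"circuit0": 2000, "circuit1": 1000})
    for name, sizes in plan.items():
        print(name, len(sizes), "batches,", sum(sizes), "bytes")
```

Unused credit carries over to the next round, so a batch larger than one quantum still gets sent eventually, and the long-run byte split tracks the ratio of the quanta.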
Extension Trunking performs failover and failback, and no data is lost or delivered out-of-order during such events. Circuits can be designated as backup circuits, which are passive until all the active circuits within the failover group have gone offline. This protects users against a WAN link failure and avoids a restart or resync event. Extension Trunking supports aggregation of multiple WAN connections with different latency or throughput characteristics (up to a 4:1 ratio), allowing you to procure WAN circuits from multiple service providers with different physical routes, to ensure maximum availability.
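The active/backup behavior described above can be modeled in a few lines of Python. This is only a sketch of the selection rule as stated in this post (backups stay passive until every active circuit in the group is offline). The `Circuit` class and its fields are hypothetical and do not correspond to actual switch configuration objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    name: str
    backup: bool   # True if designated as a backup circuit (hypothetical field)
    online: bool


def carrying_circuits(group):
    """Return the names of circuits that carry traffic for one failover group."""
    active = [c.name for c in group if c.online and not c.backup]
    if active:
        return active  # backups stay passive while any active circuit is up
    return [c.name for c in group if c.online and c.backup]


if __name__ == "__main__":
    group = [
        Circuit("provider-a", backup=False, online=False),
        Circuit("provider-b", backup=False, online=False),
        Circuit("provider-c", backup=True, online=True),
    ]
    print(carrying_circuits(group))
```

With both active circuits offline, only the backup carries traffic. Bringing either active circuit back online returns the backup to passive, which is the failback case.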
Extension Trunking offers more than the ability to load balance and fail over or fail back data across circuits. Extension Trunking is always a lossless function, providing in-order delivery within an extension trunk (defined by a Virtual Expansion_Port, or VE_Port). Even when data in flight is lost due to a path failure, data is retransmitted over the remaining circuits via TCP and placed back in order before it is sent to the Upper Layer Protocol (ULP). IP storage applications are never subjected to lost data or out-of-order data across the WAN.
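The "placed back in order before it is sent to the ULP" step is essentially a resequencing buffer. Here is a minimal Python sketch of that general technique, assuming each batch carries a sequence number. The class and method names are invented for this post, and the real trunk logic is certainly more involved.

```python
class Resequencer:
    """Hold out-of-order batches and release them strictly in sequence."""

    def __init__(self, first_seq: int = 0):
        self.next_seq = first_seq
        self.held = {}  # seq -> payload, waiting for earlier batches

    def receive(self, seq: int, payload):
        """Accept one batch; return the payloads that are now deliverable."""
        if seq < self.next_seq or seq in self.held:
            return []  # duplicate, for example a retransmission that raced
        self.held[seq] = payload
        ready = []
        while self.next_seq in self.held:
            ready.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return ready


if __name__ == "__main__":
    rs = Resequencer()
    for seq, data in [(1, "b"), (2, "c"), (0, "a")]:
        print(seq, "->", rs.receive(seq, data))
```

Batches 1 and 2 arrive first (say, over a faster circuit) and are held. When batch 0 shows up, perhaps retransmitted over a surviving circuit, all three are released in order.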
ARL (Adaptive Rate Limiting)
ARL automatically adjusts the rate limiting on all associated circuits replicating across the IP network, regardless of the ingress device and the WAN path or paths. ARL adjusts rate limiting automatically when other extension circuits go online or offline, or when the available IP bandwidth changes.
ARL is used with Extension Trunking to maintain available bandwidth to storage applications. Rate limiting is used to prevent oversubscribing the WAN and any associated contention or congestion. Congestion events force TCP to perform flow control, which is extremely inefficient, slow to react, and results in poor performance. When a path fails, ARL moves the remaining circuits from their normal, non-oversubscribed rates to higher rates so that the same total bandwidth is maintained during the outage. Clearly, this is essential to continuous availability. ARL in conjunction with Extension Trunking is a uniquely well-suited performance optimizer for TS7700 Grid implementations.
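To make the normal-versus-outage behavior concrete, here is a deliberately simple Python model. It assumes each circuit has a configured minimum and maximum rate and that online circuits split the WAN capacity evenly within those bounds. That is an assumption made for illustration: it only shows the steady-state end points and does not model the feedback mechanism ARL actually uses to find the available bandwidth. All names and numbers are hypothetical.

```python
def arl_rates(circuits, online, wan_capacity):
    """Steady-state rate per online circuit under a simple even-split model.

    circuits: dict of name -> (min_rate, max_rate), all in the same unit.
    online:   set of circuit names that are currently up.
    """
    live = [name for name in circuits if name in online]
    if not live:
        return {}
    share = wan_capacity / len(live)
    rates = {}
    for name in live:
        lo, hi = circuits[name]
        rates[name] = max(lo, min(hi, share))  # clamp the share to [min, max]
    return rates


if __name__ == "__main__":
    circuits = {"c0": (5, 10), "c1": (5, 10)}   # Gbps, illustrative values
    print(arl_rates(circuits, {"c0", "c1"}, 10))  # normal condition
    print(arl_rates(circuits, {"c0"}, 10))        # c1 offline
```

In the normal case each circuit runs at 5 Gbps and the WAN is not oversubscribed. When one circuit drops, the survivor rises to 10 Gbps, so replication sees the same total bandwidth.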
Prioritization of IP Extension Flows (QoS)
Frequently, the IP network does not have QoS configured, at least for the storage applications. Therefore, at a minimum it is important to deliver data to the IP network sequenced according to the Storage Administrator’s priorities. Prioritization of flows across the WAN using QoS can be achieved in various ways. The first and simplest method is to configure priorities on the IPEX switch and feed the prioritized flows into the IP network. There are three priorities for Fibre Channel over IP (FCIP): high, medium, and low, and three for IP Extension: high, medium, and low, for a total of six priorities. In addition, the percentage of bandwidth during contention that is apportioned to IP Extension and FCIP is configurable. When there is no contention for bandwidth, all available bandwidth can be utilized by a flow.
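Here is a small Python sketch of how the six priorities might divide a WAN link during contention, assuming a two-level split: first between FCIP and IP Extension, then among high, medium, and low within each. The percentages below are illustrative values chosen for the example, not defaults taken from any product documentation, and the function name is made up.

```python
def contention_shares(wan_mbps, protocol_pct, priority_pct):
    """Bandwidth (Mbps) per (protocol, priority) when the WAN is contended.

    protocol_pct: e.g. {"fcip": 40, "ipex": 60}, must sum to 100.
    priority_pct: e.g. {"high": 50, "medium": 30, "low": 20}, must sum to 100.
    """
    for label, table in (("protocol", protocol_pct), ("priority", priority_pct)):
        if sum(table.values()) != 100:
            raise ValueError(f"{label} percentages must sum to 100")
    return {
        (proto, prio): wan_mbps * proto_share / 100 * prio_share / 100
        for proto, proto_share in protocol_pct.items()
        for prio, prio_share in priority_pct.items()
    }


if __name__ == "__main__":
    shares = contention_shares(
        10_000,
        {"fcip": 40, "ipex": 60},               # illustrative split
        {"high": 50, "medium": 30, "low": 20},  # illustrative split
    )
    for key, mbps in shares.items():
        print(key, f"{mbps:,.0f} Mbps")
```

These shares only apply under contention. As noted above, when there is no contention a single flow can use all of the available bandwidth.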
The Brocade 7840 is capable of doing both traditional FICON FCIP extension and IP extension. Therefore, it is an ideal platform for use in both TS7700 Grid disaster recovery and high availability (HA) configurations. In the TS7700 high availability architecture, it is necessary for the local TS7700 cluster to communicate with the remote cluster(s), and possibly the remote cluster with the local cluster. In the event that either the local or remote cluster is offline, tape processing can continue for mission-critical applications, such as SAP. Both hosts have connectivity to the remote cluster. The FICON connection across the WAN uses FCIP extension over the same tunnel. Bandwidth and prioritization are managed by the extension switch to ensure reliable operation.
As for the IP connection between local and remote TS7700 clusters, it no longer has to share the WAN connection with FCIP as a separate flow, an arrangement that is often contentious and has to be continuously monitored and managed. Instead, the TS7700 grid IP connectivity is joined into the extension tunnel and managed by a single WAN scheduler. The extension switch manages both the FCIP and IP Extension flows for optimal performance without contention or oversubscription on the WAN. Flows can be further managed with QoS, compression, IPsec, and ARL without involving long, complex projects on the network side that might cause ongoing operational issues. The IBM SAN42B-R also accelerates data transfers across the WAN by using WO-TCP.
Finally, the Brocade 7840 can manage applications that operationally use a combination of FC+IP or FICON+IP. For example, it can perform FICON-based z/OS Global Mirror extension with the Advanced FICON Accelerator emulation technology while also managing the TS7700 grid IP replication through its IPEX functionality and features.
The IBM TS7700 Grid solution is a key component of z Systems disaster recovery and business continuity implementations. The TS7700 Grid solution offers the end user a highly available, resilient mechanism for improving RPO/RTO. The Brocade 7840 Extension Switch, in combination with Brocade’s Fabric Vision suite of tools, offers an innovative, unique IP Extension solution for TS7700 Grid implementations. This Brocade solution improves TS7700 grid replication performance, data availability, and security while providing a suite of management features that make management of the TS7700 grid network a more proactive, simplified process. My next blog post will cover these Fabric Vision management features and how they benefit a TS7700 grid architecture.