As a mainframe end user, you very likely have a disaster recovery/business continuity strategy that requires you to replicate data to a remote site. That data may be stored on a variety of media: spinning disk, flash/SSD, virtual tape, or physical tape cartridges. While some of you may still be using PTAM (the pickup truck access method) and have tape cartridges physically transported to an offsite location, the overwhelming majority of you are doing some form of electronic data replication. For longer distances the replication is usually asynchronous, while for shorter distances (less than 50 miles) it is often synchronous. The replication can run over a variety of platforms and protocols, including DWDM, FCIP, and IP.
For long distances with asynchronous replication, many of you likely use specialized FCIP and/or IP extension switches (such as the Brocade 7840) or extension blades in FICON directors (such as the Brocade FX8-24 or SX6), and take advantage of the protocol emulation and acceleration technology they offer to improve the performance of data replication over long distances. Such protocol emulation technology is not required for synchronous replication over short distances, so the extension switches/blades are often not used for synchronous replication. However, since synchronously replicated data usually belongs to your most business-critical applications, these devices include a technology that can be very valuable for improving both the availability and performance of your synchronous replication. That technology is known as extension trunking, and the remainder of this blog post explains some of its benefits.
Extension trunking provides a single logical tunnel composed of multiple circuits. Circuits are individual connections within the trunk, each with its own unique source and destination IP address. A tunnel with a single circuit is simply referred to as a tunnel. A tunnel with multiple circuits is referred to as a trunk, because multiple circuits are being trunked together. A tunnel or trunk is a single Inter-Switch Link (ISL) that aggregates the bandwidth of all member circuits into one logical “big pipe,” and it should be treated as such in architectural designs. These extension ISLs can carry Fibre Channel (FC), FICON, and Internet Protocol (IP) storage traffic using IP extension.
Link failover is fully automated with extension trunking. By using metrics and failover groups, active and passive circuits can coexist within the same trunk. Each circuit sends keepalives, and each keepalive that arrives resets that circuit's keepalive timer. The time interval between keepalives is a configurable setting on the extension switch. If the keepalive timer expires, the circuit is deemed to be down and is removed from the trunk. In addition, if the Ethernet interface assigned to one or more circuits loses light, those circuits are immediately considered down. When a circuit goes offline, the egress queue for that circuit is removed from the load-balancing algorithm, and traffic continues across the remaining circuits, although at a reduced bandwidth because the offline link no longer contributes.
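To make the failover behavior a little more concrete, here is a minimal Python sketch of a trunk that drops a circuit when its keepalive timer expires or its interface loses light. This is only an illustration and not how any Brocade product is implemented: the class names, the one-second timeout, and the simplified rule that "the lowest metric among the circuits that are up carries the traffic" (standing in for metrics and failover groups) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Circuit:
    """One circuit in a trunk (hypothetical model, not a vendor API)."""
    name: str
    bandwidth_mbps: int
    metric: int = 0                  # assumed: 0 = active, 1 = standby
    keepalive_timeout: float = 1.0   # seconds; illustrative value only
    last_keepalive: float = 0.0
    link_up: bool = True             # False models "interface lost light"

    def on_keepalive(self, now: float) -> None:
        # Every keepalive that arrives resets the circuit's timer.
        self.last_keepalive = now

    def is_up(self, now: float) -> bool:
        timer_ok = (now - self.last_keepalive) < self.keepalive_timeout
        return self.link_up and timer_ok


@dataclass
class Trunk:
    circuits: list

    def eligible(self, now: float) -> list:
        """Circuits the load balancer may use at time `now`."""
        up = [c for c in self.circuits if c.is_up(now)]
        if not up:
            return []
        # Simplification: standby circuits are used only when no
        # lower-metric circuit is still up.
        best = min(c.metric for c in up)
        return [c for c in up if c.metric == best]

    def bandwidth(self, now: float) -> int:
        """Aggregate bandwidth of the circuits currently carrying traffic."""
        return sum(c.bandwidth_mbps for c in self.eligible(now))


if __name__ == "__main__":
    trunk = Trunk([
        Circuit("c0", 1000),
        Circuit("c1", 1000),
        Circuit("standby", 500, metric=1),
    ])
    for c in trunk.circuits:
        c.on_keepalive(0.0)
    print(trunk.bandwidth(0.5))      # both active circuits are in use
    trunk.circuits[0].on_keepalive(1.0)
    trunk.circuits[2].on_keepalive(1.0)
    print(trunk.bandwidth(1.5))      # c1 missed its keepalives
```

The point of the sketch is that traffic never stops as long as at least one circuit remains eligible; only the aggregate bandwidth changes.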
When a connection is lost, data in flight is almost always lost as well. Any frames being transmitted at the time of the link outage are lost because they were only partially transmitted. This causes an Out-Of-Order-Frame (OOOF) problem: some frames have already arrived, then one or more are lost, and frames continue to arrive over the remaining link(s). The frames are now out of sequence because of the missing frame(s). This is problematic for some devices (particularly mainframes), resulting in an Interface Control Check (IFCC) in mainframe environments or a Small Computer Systems Interface (SCSI) error in open systems. Brocade Extension Trunking resolves this problem with Lossless Link Loss (LLL) technology, which retransmits the frames that were lost in flight when the link went down. Normally, when a TCP segment is lost due to a dirty link, bit error, or congestion, TCP retransmits that segment. In the case of a broken connection (circuit down), however, there is no way for TCP to retransmit the lost segment, because TCP is no longer operational across the link.
The simple solution is to encapsulate all of the per-circuit TCP sessions, each of which is associated with an individual circuit within the trunk, inside an outer TCP session. This outer TCP session is referred to as the “Supervisor” TCP session, and it feeds each circuit’s TCP session through a load balancer and a sophisticated egress queuing/scheduling mechanism that compensates for different link bandwidths, latencies, and congestion events. The Supervisor TCP session operates at the presentation layer (Layer 6) of the Open Systems Interconnection (OSI) model and does not affect any LAN or WAN network device that might monitor or provide security at the TCP layer (Layer 4) or below, such as firewalls, ACLs, or sFlow (RFC 3176). It works at a level above FC and IP, so it can support both Fibre Channel over IP (FCIP) and IP extension across the same extension trunk. If a connection goes offline and data continues to be sent over the remaining connections, the missing frames show up as noncontiguous sequence numbers in the header. This causes the receiving Supervisor TCP session to send an acknowledgement back to the Supervisor source that triggers retransmission of the missing segment, even though that segment was originally sent by a different link TCP session that is no longer operational. This means that segments held in memory for transmission by TCP must be managed by the Supervisor, so that a segment can be retransmitted over a surviving link TCP session when necessary.
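For readers who think better in code, the following Python sketch illustrates the general idea of a supervising session that owns the retransmit buffer. It is a toy model under stated assumptions and not the actual Brocade implementation: the `Supervisor` and `Receiver` classes, the round-robin circuit selection, and the way the receiver reports gaps are all hypothetical and chosen only to keep the example short.

```python
class Supervisor:
    """Sender side: numbers segments and keeps them until acknowledged."""

    def __init__(self, circuit_names):
        self.up = {name: True for name in circuit_names}
        self.unacked = {}      # seq -> payload, owned by the supervisor
        self.next_seq = 0
        self._turn = 0         # round-robin counter (assumed scheduler)

    def _pick_circuit(self):
        live = [name for name, ok in self.up.items() if ok]
        if not live:
            raise RuntimeError("no circuits left in the trunk")
        name = live[self._turn % len(live)]
        self._turn += 1
        return name

    def send(self, payload):
        """Return (circuit, seq, payload) for a new segment."""
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return self._pick_circuit(), seq, payload

    def retransmit(self, seq):
        """Resend a segment over whichever circuit is still alive."""
        return self._pick_circuit(), seq, self.unacked[seq]

    def ack(self, seq):
        """Cumulative acknowledgement: drop everything up to `seq`."""
        for s in [s for s in self.unacked if s <= seq]:
            del self.unacked[s]


class Receiver:
    """Receiver side: delivers in order and reports sequence gaps."""

    def __init__(self):
        self.expected = 0
        self.pending = {}      # out-of-order segments held back
        self.delivered = []

    def receive(self, seq, payload):
        """Accept a segment and return the sequence numbers still missing."""
        if seq >= self.expected:
            self.pending[seq] = payload
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        if not self.pending:
            return []
        return [s for s in range(self.expected, max(self.pending))
                if s not in self.pending]


if __name__ == "__main__":
    sup, rx = Supervisor(["A", "B"]), Receiver()
    sent = [sup.send(f"frame{i}") for i in range(4)]
    sup.up["A"] = False                      # circuit A fails
    missing = []
    for circuit, seq, payload in sent:
        if circuit == "A" and seq == 2:      # frame2 was in flight on A
            continue
        missing = rx.receive(seq, payload)
    for seq in missing:
        _, seq, payload = sup.retransmit(seq)
        rx.receive(seq, payload)
    print(rx.delivered)
```

The design choice worth noticing is that the retransmit buffer belongs to the supervisor rather than to any one circuit's TCP session, which is what lets a segment originally sent on a failed circuit be resent on a surviving one, and lets the receiver hand frames upward in order.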
The space constraints of a blog post preclude a deeper dive into the technology. For those of you interested in learning more, please read the Brocade white paper. Hopefully, though, you can see that extension trunking provides a simple yet highly effective solution for improving the performance and availability of mainframe storage data replication networks. Its lossless technology, seamless failover, and bandwidth aggregation make it especially well suited to synchronous replication of mission-critical data and applications.