This blog post is based on an article on the same subject I recently wrote for Enterprise Tech Journal Magazine. The article appeared in a slightly different format in the March/April 2013 issue.
The introduction of the FICON I/O protocol to the mainframe I/O subsystem provided the ability to process data rapidly and efficiently. FICON made two main changes to the mainframe channel I/O infrastructure, and these created the need for a new Resource Measurement Facility (RMF) record. The first change was that, unlike ESCON, FICON uses buffer credits to account for packet delivery and provide a data flow control mechanism. The second change was the introduction of FICON cascading and the long-distance capabilities that came with it, which were not as practical with ESCON.
Similar to the ESCON directors that preceded them, FICON directors and switches have a feature called Control Unit Port (CUP). Among the many functions of the CUP feature is the ability to provide host control functions such as blocking and unblocking ports, safe switching, and in-band host communication functions such as port monitoring and error reporting. Enabling CUP on FICON directors while also enabling RMF 74 subtype 7 (RMF 74-7) records for your z/OS system yields a new RMF report called the “FICON Director Activity Report”. Data is collected for each RMF interval if FCD is specified in your ERBRMFnn parmlib member and the IECIOSnn member in SYS1.PARMLIB specifies FICON STATS=YES. RMF will format one of these reports per interval for each FICON director that has CUP enabled and these parmlib options specified. If you are using FICON Virtual Fabrics and have your FICON directors partitioned into multiple, smaller logical switches, you will have one RMF 74-7 report per switch address (per logical switch). The FICON Director Activity Report is produced on the RMF interval, which determines when RMF creates this report along with the others. In essence, the report captures a snapshot of the data and counters over a time interval, such as 20 minutes. Often, you need to run these reports more than once, and with different interval lengths, when troubleshooting to determine whether there is a trend.
This RMF report is often overlooked but contains very meaningful data concerning FICON I/O performance, in particular frame pacing delay. Frame pacing delay has been around since fibre channel SAN was first implemented in the late 1990s by our open systems friends. But until the increased use of cascaded FICON, its relevance in the mainframe space had been completely overlooked.
The article “Performance Troubleshooting Using the RMF Device Activity Report” (Oct/Nov 2012 Enterprise Tech Journal) continued a series of articles I have been writing on System z I/O performance. As a quick review, that article assumed there was an application Service Level Agreement (SLA)/Service Level Objective (SLO) for transaction response time that wasn't being met. It then went through one of the key RMF reports used in mainframe I/O performance management/troubleshooting: SMF 74-1, the RMF Direct Access Device Activity report. This report contains response time information and information on the various components of response time. It can be used to further narrow down what may be the root cause of the problem and provide a good idea of what other RMF reports we should check. This article continues where that discussion left off, and will examine the RMF 74-7 record, the RMF FICON Director Activity Report.
Figure 1 illustrates our environment and where the various I/O related RMF reports fit. Let us assume that our review and analysis of the RMF Device Activity Report showed that the specific components of response time that are outside of our normal parameters are PEND and CONN. We wish to determine if our FICON SAN could be the component of our infrastructure causing PEND and/or CONN to be abnormally high.
Figure 2 shows an example of a FICON Director Activity Report. It should be noted that this report is vendor agnostic, meaning that it will contain the same fields and information regardless of which FICON director vendor manufactured the director. It also is the same for any model director/switch from a given vendor.
The fields of critical interest contained in the report are:
AVG FRAME PACING: the number of times during the RMF reporting interval that there was a 2.5 microsecond period in which data (frame) transmission was stopped by this port.
AVG FRAME SIZE READ: the average frame size (in bytes) of data received by a given port during the RMF reporting interval.
AVG FRAME SIZE WRITE: the average frame size (in bytes) of data transmitted by a given port during the RMF reporting interval.
PORT BANDWIDTH READ: the average rate (in MB/sec) of data received by a given port during the RMF reporting interval.
PORT BANDWIDTH WRITE: the average rate (in MB/sec) of data transmitted by a given port during the RMF reporting interval.
ERROR COUNT: the number of errors which were encountered by a given port during the RMF reporting interval. This should be cross referenced with hardware errors reported on the operations console(s).
Other data in the report include:
SWITCH DEVICE: the hexadecimal number of the switch device of the FICON Director for which measurements are being reported.
SWITCH ID: the hexadecimal switch ID of the FICON Director which is associated with this switching device. In case of cascaded switches, a double asterisk may be shown.
TYPE MODEL MAN PLANT SERIAL: a hardware description of the switch device.
PORT ADDR: the hexadecimal address of the port.
CONNECTION: provides information about what is attached to a given port (CU, CHPID, or SWITCH). CHP-H indicates that this port is attached to a CHPID being used to collect the FICON CUP data from the FICON director.
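If you post-process these reports, the per-port fields listed above map naturally onto a simple record. The following is a minimal Python sketch only: the class name `PortRow`, its attribute names, and the sample values are hypothetical, chosen for illustration, and are not part of RMF or any vendor tooling.

```python
from dataclasses import dataclass


@dataclass
class PortRow:
    """One port line of a FICON Director Activity Report (hypothetical model)."""
    port_addr: str             # PORT ADDR, hexadecimal, e.g. "25"
    connection: str            # CONNECTION, e.g. "CU", "CHP-H" or "SWITCH"
    avg_frame_pacing: int      # AVG FRAME PACING
    avg_frame_size_read: int   # AVG FRAME SIZE READ, in bytes
    avg_frame_size_write: int  # AVG FRAME SIZE WRITE, in bytes
    bandwidth_read: float      # PORT BANDWIDTH READ, in MB/sec
    bandwidth_write: float     # PORT BANDWIDTH WRITE, in MB/sec
    error_count: int           # ERROR COUNT

    @property
    def is_isl(self) -> bool:
        # A port attached to another director is an interswitch link.
        return self.connection == "SWITCH"


# Illustrative values only; they are not taken from any figure in this post.
row = PortRow("25", "SWITCH", 570, 1200, 900, 10.0, 5.0, 0)
print(row.port_addr, row.is_isl)
```

Keeping the connection type next to the pacing value is a deliberate choice, since how you interpret AVG FRAME PACING depends on what the port is attached to.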
What is frame pacing and what is the difference between frame pacing and frame latency?
Frame Pacing is an FC-4 application data exchange level measurement and/or throttling mechanism. It uses buffer credits to provide a flow control mechanism for FICON to assure delivery of data across the FICON fabric. When all buffer credits for a port are exhausted a frame pacing delay can occur. Frame Latency, on the other hand, is a frame delivery level measurement. It is somewhat akin to measuring frame friction. Each element that handles the frame contributes to this latency measurement (CHPID port, switch/Director, storage port adapter, link distance, etc.). Frame latency is the average amount of time it takes to deliver a frame from the source port to the destination port.
If frame pacing delay is occurring, then the buffer credits have reached zero on a port for one or more intervals of 2.5 microseconds. Data transmission by the port reporting frame pacing delay ceases until a credit has been added back to the buffer credit counter kept by that port. Frame pacing delay causes unpredictable performance delays. These delays generally result in elongated FICON CONNect time and/or elongated PEND times that show up on the volumes attached to these links. Therefore, when you see abnormally high PEND and CONN metrics, particularly in a multi-site cascaded FICON architecture, one of the first places to look should be the RMF 74-7 records.
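To put a reported value in perspective, the count can be converted into time. This is a small sketch that takes the definition used in this post at face value (each counted occurrence is one 2.5 microsecond period with transmission stopped); the function and constant names are my own and hypothetical.

```python
PACING_UNIT_US = 2.5  # each counted period is 2.5 microseconds, per the text


def stopped_time_us(avg_frame_pacing: int) -> float:
    """Approximate time, in microseconds, a port spent unable to transmit."""
    if avg_frame_pacing < 0:
        raise ValueError("AVG FRAME PACING cannot be negative")
    return avg_frame_pacing * PACING_UNIT_US


print(stopped_time_us(570))  # 1425.0
```

Under that reading, a value of 570 corresponds to roughly 1.4 milliseconds of stalled transmission in the interval. The number looks small, but the stalls land unpredictably on individual I/Os, which is why they surface as elongated PEND and CONN times.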
Figure 2 shows an example of a “clean” FICON Director Activity Report, meaning that none of the ports are reporting frame pacing issues. All ports show an AVG FRAME PACING of “0”. This is the ideal. Note that this is a report from a non-cascaded FICON director (there are no ports connected to another SWITCH). Next, let’s look at a potential problem with frame pacing.
Figure 3 is an example of another RMF 74-7 report from a different FICON director. As you can see, Ports (PORT ADDR) 27, 29, 2E, 5E, and 5F are all reporting some degree of frame pacing issues of varying magnitudes. In other words, these ports all stopped transmitting for “x” intervals of 2.5 microsecond duration. This happened because each of these ports, for some reason, had an indication that the port it was connected to had run out of buffer credits. Recall that for two ports connected together, the transmit half of each port is in constant communication with the receive half of its partner port. The buffer-credit-based fibre channel flow control mechanism used by FICON storage networks has the transmit half keep track of how many buffer credits its partner port’s receive half has remaining. When this counter reaches “0”, frame transmission temporarily stops. This, in a nutshell, is what is reflected in the AVG FRAME PACING field.
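The credit counter described above can be illustrated with a toy model. This is a simplified sketch, not how any director firmware is implemented; the class `TransmitPort` and its method names are hypothetical.

```python
class TransmitPort:
    """Transmit half of a link, tracking the partner's remaining receive buffers."""

    def __init__(self, credits: int) -> None:
        self.credits = credits       # buffer credits currently available
        self.frames_sent = 0
        self.stalled_attempts = 0    # sends refused because the counter was zero

    def try_send(self) -> bool:
        """Send one frame if a credit is available; otherwise hold the frame."""
        if self.credits == 0:
            self.stalled_attempts += 1
            return False
        self.credits -= 1
        self.frames_sent += 1
        return True

    def credit_returned(self) -> None:
        """The partner freed a receive buffer and handed one credit back."""
        self.credits += 1


port = TransmitPort(credits=2)
results = [port.try_send() for _ in range(3)]  # third attempt finds no credit
port.credit_returned()
results.append(port.try_send())
print(results)  # [True, True, False, True]
```

The third send is refused because the counter has reached zero, and transmission resumes only after a credit comes back. Those refused moments are what frame pacing delay measures.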
The question then becomes: is every non-zero value of AVG FRAME PACING bad? The answer is the usual “it depends”. Remember, you are likely looking closely at the RMF 74-7 reports because you are doing some performance troubleshooting to determine what is causing an abnormally high response time. You noticed higher than normal PEND or CONN time. Most of us do not look at the RMF 74-7 report as a first step. If the ports associated with the devices exhibiting abnormally high PEND and/or CONN times are showing non-zero values for AVG FRAME PACING, you likely have narrowed down the problem. As a next step, you should do some trend analysis by looking at a series of the RMF 74-7 reports for this specific FICON director to determine whether the frame pacing issue occurs consistently and whether the values are getting worse.
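That trend check can be sketched in a few lines of Python. It assumes you have already collected the AVG FRAME PACING values for one port from a series of consecutive reports. The function name, the labels it returns, and the crude first-versus-last comparison are all my own assumptions, not anything RMF provides.

```python
from typing import Sequence


def pacing_trend(values: Sequence[int]) -> str:
    """Summarize AVG FRAME PACING for one port across consecutive RMF intervals."""
    if not values:
        raise ValueError("need at least one interval")
    if all(v == 0 for v in values):
        return "clean"
    if any(v == 0 for v in values):
        return "intermittent"
    # Non-zero in every interval: compare the last value with the first.
    if values[-1] > values[0]:
        return "consistent, worsening"
    return "consistent"


print(pacing_trend([50, 120, 300]))  # consistent, worsening
```

A real analysis would look at more than the end points, but even this rough split into intermittent and consistent pacing helps decide which ports to chase first.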
As a personal rule of thumb, I look for AVG FRAME PACING values that are >100. Rarely have I seen AVG FRAME PACING values <100 be a cause for concern, although they still deserve scrutiny and a check for a trend to see if they become worse. Values greater than 100, such as those shown by ports 29, 2E, and especially 5F in this example, deserve further analysis. Port 5F is attached to a control unit (CU), so this is likely a slow drain device issue on the storage host/fibre adapter. This is when we would turn to the ESS Link Statistics report for further analysis (which will be the subject of the next article in this series).
The one exception to the aforementioned rule of thumb is when the port exhibiting AVG FRAME PACING >0 is an interswitch link (ISL) connecting cascaded FICON directors. An example is illustrated in Figure 4. Port (address) 25 shows an AVG FRAME PACING value of 570. Since we know that this port is attached to a port on another FICON director, there are more troubleshooting options available outside of RMF, such as using the FICON director management software and/or Command Line Interface (CLI). The cause may be something as simple as not having enough buffer credits configured on the port attached to port 25 in this example, in which case the port buffer credit configuration can be altered. Since ISLs are typically used for remote data replication such as PPRC or XRC, any frame pacing delay on them is cause for concern.
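The rule of thumb and its ISL exception can be written down as a small triage function. This is a sketch of my personal heuristic, not a standard or an RMF feature. The function and constant names are hypothetical, and it assumes the CONNECTION column shows SWITCH for a port attached to another director.

```python
ENDPOINT_THRESHOLD = 100  # personal rule of thumb from this post, not a standard


def needs_analysis(connection: str, avg_frame_pacing: int) -> bool:
    """ISL ports are flagged at any non-zero value, other ports only above 100."""
    if connection == "SWITCH":  # port is an ISL to another director
        return avg_frame_pacing > 0
    return avg_frame_pacing > ENDPOINT_THRESHOLD


print(needs_analysis("SWITCH", 570))  # True, as with port 25 above
print(needs_analysis("CU", 40))       # False, but still worth watching for a trend
```

Values at or below the threshold on non-ISL ports are not flagged here, yet they still deserve a look across several intervals before being dismissed.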
This article has explained how you can use the FICON Director Activity Report (RMF 74-7) to drill down further into an I/O performance problem. It provides a valuable way to narrow down the potential root cause(s), but it is not a one-stop place to completely solve the problem you are troubleshooting. Problems such as slow drain devices need further examination, which can be done using the ESS Link Statistics Report. Fabric contention issues such as frame pacing delay being exhibited on ISLs should also be examined further with the IBM z/OS I/O health check mechanism. That and the ESS Link Statistics report will be discussed in greater detail in future articles.
I look forward to hearing your questions, comments and concerns. Thanks for reading!