Abstract
Computational science applications are driving a demand for increasingly powerful storage systems. While many techniques are available for capturing the I/O behavior of individual application trial runs and specific components of the storage system, continuous characterization of a production system remains a daunting challenge for systems with hundreds of thousands of compute cores and multiple petabytes of storage. As a result, these storage systems are often designed without a clear understanding of the diverse computational science workloads they will support.
In this study, we outline a methodology for scalable, continuous, systemwide I/O characterization that combines storage device instrumentation, static file system analysis, and a new mechanism for capturing detailed application-level behavior. This methodology allows us to quantify systemwide trends such as the way application behavior changes with job size, the “burstiness” of the storage system, and the evolution of file system contents over time. The data also can be examined by application domain to determine the most prolific storage users and also investigate how their I/O strategies correlate with I/O performance. At the most detailed level, our characterization methodology can also be used to focus on individual applications and guide tuning efforts for those applications.
We demonstrate the effectiveness of our methodology by performing a multilevel, two-month study of Intrepid, a 557-teraflop IBM Blue Gene/P system. During that time, we captured application-level I/O characterizations from 6,481 unique jobs spanning 38 science and engineering projects with up to 163,840 processes per job. We also captured patterns of I/O activity in over 8 petabytes of block device traffic and summarized the contents of file systems containing over 191 million files. We then used the results of our study to tune example applications, highlight trends that impact the design of future storage systems, and identify opportunities for improvement in I/O characterization methodology.
I. INTRODUCTION
Computational science applications are driving a demand for increasingly powerful storage systems. This situation is true especially on leadership-class systems, such as the 557 TFlop IBM Blue Gene/P at Argonne National Laboratory, where the storage system must meet the concurrent I/O requirements of hundreds of thousands of compute elements [1]. Hardware architecture, file systems, and middleware all contribute to providing high-performance I/O capabilities in this environment. These components cannot be considered in isolation, however. The efficiency of the storage system is ultimately determined by the nature of the data stored on it and the way applications choose to access that data. Understanding storage access characteristics of computational science applications is therefore a critical—and challenging—aspect of storage optimization.
Various methods exist for analyzing application access characteristics and their effect on storage. Synthetic I/O benchmarks are easily instrumented and parameterized, but in many cases they fail to accurately reflect the behavior of scientific applications [2], [3], [4]. Application-based benchmarks are more likely to reflect actual production behavior, but the available benchmarks do not represent the variety of scientific domains and applications seen on leadership-class machines, each with unique access characteristics and data requirements. I/O tracing at the application, network, or storage level is another technique that has been successful in analyzing general-purpose network file systems [5], [6]. However, these tracing techniques are impractical for capturing the immense volume of I/O activity on leadership-class computer systems. These systems leverage high-performance networks and thousands of hard disks to satisfy I/O workloads generated by hundreds of thousands of concurrent processes. Instrumentation at the component level would generate an unmanageable quantity of data. Such systems are also highly sensitive to perturbations in performance that may be introduced by comprehensive tracing.
As a result, key gaps exist in our understanding of the storage access characteristics of computational science applications on leadership-class systems. To address this deficiency, we have developed an application-level I/O characterization tool, known as Darshan [7], that captures relevant I/O behavior at production scale with negligible overhead. In this work we deploy Darshan in conjunction with tools for block device monitoring and static file system analysis in order to answer the following questions for a large-scale production system:
- What applications are running, what interfaces are they using, and who are the biggest I/O producers and consumers?
- How busy is the I/O system, how many files are being created of what size, and how “bursty” is I/O?
- What I/O interfaces and strategies are employed by the top I/O producers and consumers? How successful are they in attaining high I/O efficiency? Why?
To answer these questions, we performed a long-running, multilevel I/O study of the Intrepid Blue Gene/P (BG/P) system at Argonne National Laboratory. The study spanned two months of production activity from January to March 2010. During that time we recorded three aspects of I/O behavior: storage device activity, file system contents, and application I/O characteristics. We combine these three sources of information to present a comprehensive view of storage access characteristics and their relationships. We also show how this data can be used to identify production applications with I/O performance problems as well as guide efforts to improve them.
In Section II we outline a methodology for comprehensive I/O characterization of HPC systems. In Sections III, IV, and V we demonstrate how this methodology enables continuous analysis of production I/O activity at an unprecedented scale. We investigate the I/O characteristics of a 557-teraflop computing facility over a two-month period and uncover a variety of unexpected application I/O trends. In Section VI we demonstrate how this same methodology can be used to tune and improve the I/O efficiency of specific applications. We conclude in Section VIII by identifying the impact of our analysis on future I/O subsystem designs and research directions.
II. TARGET SYSTEM AND METHODOLOGY
This study was conducted on Intrepid, the IBM BG/P system at the Argonne Leadership Computing Facility (ALCF). The ALCF makes large allocations available to the computational science community through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program [8]. Systems such as Intrepid therefore host a diverse set of applications from scientific domains including climate, physics, combustion, and Earth sciences.
Intrepid is a 163,840-core production system with 80 TiB of RAM and a peak performance of 557 TFlops. The primary high-performance storage system employs 128 file servers running both PVFS [1] and GPFS [9], with a separate, smaller home directory volume. Data is stored on 16 DataDirect Networks S2A9900 SANs. The storage system has a total capacity of 5.2 PiB and a peak I/O rate of approximately 78 GiB/s. The architecture and scalability of this storage system have been analyzed in detail in a previous study [1].
Intrepid groups compute nodes (CNs) into partitions of sizes 512, 1,024, 2,048, 4,096, 8,192, 16,384, 32,768, and 40,960 nodes. All jobs must select one of these partition sizes, regardless of the number of nodes that will be used. Each set of 64 compute nodes utilizes a single, dedicated I/O forwarding node (ION). This ION provides a single 10 gigabit Ethernet link to storage that is shared by the CNs.
The goal of our characterization methodology on this system was to gain a complete view of I/O behavior by combining data from multiple levels of the I/O infrastructure without disrupting production activity. To accomplish this goal, we captured data in three ways. We monitored storage device activity, periodically analyzed file system contents, and instrumented application-level I/O activity using a tool that we developed for that purpose.
A. Characterizing storage device activity
On the storage side of the system, the DataDirect Networks SANs used by PVFS and GPFS are divided into sets of LUNs that are presented to each of the 128 file servers. Activity for both file systems was captured by observing traffic at the block device level. We recorded high-level characteristics such as bandwidth, amount of data read and written, percentage of utilization, and average response times.
Behavior was observed by using the iostat command-line tool included with the Sysstat collection of utilities [10]. Iostat can report statistics for each block device at regular intervals. We developed a small set of wrappers (known as iostat-mon) to monitor iostat data on each file server. Data was collected every 60 seconds, logged in a compact format, and then postprocessed to produce aggregate summaries. All local disk activity was filtered out to eliminate noise from operating system activity. The data was collected continuously from January 23 to March 26, but four days of data were lost in February because of an administrative error.
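The iostat-mon wrappers are not reproduced here; the following sketch only illustrates the general approach of sampling iostat at 60-second intervals and logging one compact record per SAN device while filtering out local disks. The device-name prefixes and column indices are placeholders, since both depend on the local configuration and sysstat version.

```python
import subprocess
import time

# Placeholder prefixes for the SAN LUN device names to keep; local
# disks are filtered out to remove operating system noise.
SAN_DEVICE_PREFIXES = ("dm-", "sdb", "sdc")

def monitor(interval=60, logfile="iostat-mon.log"):
    """Sample iostat's extended device report and append one compact
    line per SAN device per interval."""
    proc = subprocess.Popen(["iostat", "-d", "-x", "-k", str(interval)],
                            stdout=subprocess.PIPE, text=True)
    with open(logfile, "a") as log:
        for line in proc.stdout:
            fields = line.split()
            if not fields or not fields[0].startswith(SAN_DEVICE_PREFIXES):
                continue
            # Column layout varies across sysstat versions; these indices
            # are illustrative only (reads/s, writes/s, rKiB/s, wKiB/s, %util).
            dev, rps, wps, rkb, wkb, util = (fields[0], fields[3], fields[4],
                                             fields[5], fields[6], fields[-1])
            log.write("%d %s %s %s %s %s %s\n"
                      % (time.time(), dev, rps, wps, rkb, wkb, util))
            log.flush()

if __name__ == "__main__":
    monitor()
```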
B. Characterizing file system contents
The block device instrumentation described above captures all data movement, but it does not describe the nature of the persistent data that is retained on the system. To capture this information, we used the fsstats [11] tool. Fsstats analyzes entire directory hierarchies to collect a snapshot of static characteristics, such as file sizes, file ages, capacity, and a variety of namespace attributes. We ran the fsstats tool at the beginning and end of the study on both primary file systems. GPFS was measured on January 23 and March 25, while PVFS was measured on January 27 and March 31. This approach allowed us to observe the change in file system contents over the course of the study.
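fsstats reports a much richer set of namespace statistics than shown here; the sketch below only illustrates the underlying idea of walking a directory hierarchy once and accumulating file size and age histograms, assuming power-of-two size buckets and day-granularity ages.

```python
import os
import time
from collections import Counter

def scan(root):
    """Walk a directory tree and histogram file sizes (power-of-two
    KiB buckets) and ages in days since last modification."""
    size_hist, age_hist = Counter(), Counter()
    now = time.time()
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            try:
                st = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable during the scan
            bucket_kib = 1
            while bucket_kib < max(st.st_size // 1024, 1):
                bucket_kib *= 2
            size_hist[bucket_kib] += 1
            age_hist[int((now - st.st_mtime) // 86400)] += 1
    return size_hist, age_hist

# Example: scan the file system root at the start and end of the study
# and compare the histograms to observe growth over time.
```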
C. Characterizing application behavior
The most detailed level of characterization was performed by analyzing application-level access characteristics. Application-level characterization is critical because it captures I/O access patterns before they are altered by high-level libraries or file systems. It also ensures that system behavior can be correlated with the specific job that triggered it. Application-level I/O characterization has traditionally been a challenge for arbitrary production workloads at scale, however. Tracing and logging each I/O operation become expensive (in terms of both overhead and storage space) at scale, while approaches that rely on statistical sampling may fail to capture critical behavior. In previous work we developed a tool called Darshan [7] in order to bridge this gap. Darshan captures information about each file opened by the application. Rather than trace all operation parameters, however, Darshan captures key characteristics that can be processed and stored in a compact format. Darshan instruments POSIX, MPI-IO, Parallel netCDF, and HDF5 functions in order to collect a variety of information. Examples include access patterns, access sizes, time spent performing I/O operations, operation counters, alignment, and datatype usage. Note that Darshan performs explicit capture of all I/O functions rather than periodic sampling in order to ensure that all data is accounted for.
The data that Darshan collects is recorded in a bounded (approximately 2 MiB maximum) amount of memory on each MPI process. If this memory is exhausted, then Darshan falls back to recording coarser-grained information, but we have yet to observe this corner case in practice. Darshan performs no communication or I/O while the job is executing. This is an important design decision because it ensures that Darshan introduces no additional communication synchronization or I/O delays that would perturb application performance or limit scalability. Darshan delays all communication and I/O activity until the job is shutting down. At that time Darshan performs three steps. First it identifies files that were shared across processes and reduces the data for those files into an aggregate record using scalable MPI collective operations. Each process then compresses the remaining data in parallel using Zlib. The compressed data is written in parallel to a single binary data file. Figure 1 shows the Darshan output time for various job sizes on Intrepid as measured in previous work [7]. This figure shows four cases for each job size: a single shared file, 1,024 shared files, one file per process, and 1,024 files per process. The largest case demonstrates that the Darshan shutdown process can be performed in less than 7 seconds even for jobs with 65,536 processes that opened 67 million files [7]. This time is not likely to be noticeable because jobs at this scale take several minutes to boot and shut down. In addition, we have measured the overhead per file system operation to be less than 0.05%, even for operations that read only a single byte of data.
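Darshan's internal record format is documented in [7]; the fragment below is only a simplified, Python-level illustration of the core idea of keeping a small set of per-file counters in a bounded table instead of a trace, with the 2 MiB bound approximated by a record-count limit and the coarser-grained fallback modeled as a single catch-all record.

```python
from dataclasses import dataclass, field

MAX_RECORDS = 1024  # stand-in for the ~2 MiB per-process memory bound

@dataclass
class FileRecord:
    """Compact per-file counters kept in place of a full trace."""
    bytes_read: int = 0
    bytes_written: int = 0
    read_time: float = 0.0
    write_time: float = 0.0
    meta_time: float = 0.0
    access_sizes: dict = field(default_factory=dict)  # size -> count

class Characterizer:
    def __init__(self):
        self.records = {}             # filename -> FileRecord
        self.fallback = FileRecord()  # coarser aggregate when table is full

    def _record(self, filename):
        rec = self.records.get(filename)
        if rec is None:
            if len(self.records) >= MAX_RECORDS:
                return self.fallback  # bounded memory: stop adding records
            rec = self.records[filename] = FileRecord()
        return rec

    def on_write(self, filename, nbytes, elapsed):
        rec = self._record(filename)
        rec.bytes_written += nbytes
        rec.write_time += elapsed
        rec.access_sizes[nbytes] = rec.access_sizes.get(nbytes, 0) + 1
```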
Darshan was installed on Intrepid through modifications to the default MPI compilers. Users who built MPI applications using these default compilers were therefore automatically included in the study. Darshan did not achieve complete coverage of all applications, however. Some applications were compiled prior to the Darshan deployment on January 14, 2010. Other applications either did not use MPI at all or used custom build scripts that had not been modified to link in Darshan. Users do have the option of explicitly disabling Darshan at compile time or run time, though this option is rarely chosen. In total, Darshan characterized 27% of all jobs executed during the interval studied in this work.
To analyze the data collected by Darshan, we postprocessed all log files and loaded the results into a unified SQL database. We also utilized a graphical tool included with Darshan to generate summary reports for particular jobs of interest. This tool is available to users and system administrators; it enables immediate feedback on the I/O behavior of any production job, in many cases eliminating the need to explicitly instrument or rerun jobs in order to troubleshoot I/O performance problems.
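The schema of the unified database is not given in the text; as a rough sketch with hypothetical column names, the postprocessed per-job summaries could be loaded into a single table along the following lines.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    jobid         INTEGER PRIMARY KEY,
    project       TEXT,
    nprocs        INTEGER,
    ncn           INTEGER,  -- compute nodes allocated
    bytes_read    INTEGER,
    bytes_written INTEGER,
    io_time       REAL,     -- slowest-process I/O time, in seconds
    meta_time     REAL,
    files_opened  INTEGER,
    files_created INTEGER
);
"""

def load(db_path, job_summaries):
    """job_summaries: iterable of dicts keyed by the column names above,
    produced by postprocessing Darshan log files."""
    con = sqlite3.connect(db_path)
    con.executescript(SCHEMA)
    con.executemany(
        """INSERT OR REPLACE INTO jobs VALUES
           (:jobid, :project, :nprocs, :ncn, :bytes_read, :bytes_written,
            :io_time, :meta_time, :files_opened, :files_created)""",
        job_summaries)
    con.commit()
    con.close()
```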
D. Performance metrics
One important metric for applications is aggregate I/O bandwidth. For parallel I/O benchmarks we typically calculate this by dividing the amount of data moved by the time of the slowest MPI process, with some coordination ensuring that I/O overlapped. In this work we are observing real applications running in production, and these applications may not have the same coordination seen in I/O benchmarks. Additionally, because the applications are running across a range of job sizes, it is useful for comparison purposes to examine performance relative to job size, rather than as an absolute value. We introduce a generic metric for read and write performance that can be derived from the Darshan statistics of unmodified applications and scaled across a variety of job sizes.
Darshan records independent statistics for each file accessed by the application, including the number of bytes moved, cumulative time spent in I/O operations such as read() and write(), and cumulative time spent in metadata operations such as open() and stat(). The aggregate I/O bandwidth can be estimated by dividing the total amount of data transferred by the amount of I/O time consumed in the slowest MPI process. To make comparisons across jobs of different sizes, we divide the aggregate performance by the number of compute nodes allocated to the job. The result is a MiB per second per compute node (MiB/s/CN) metric, calculated as follows:
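(The following expression is a reconstruction based on the definitions in the next paragraph.)

\[
\mathrm{MiB/s/CN} \;=\; \frac{\sum_{i=1}^{n} \left( \mathit{bytes}_{r,i} + \mathit{bytes}_{w,i} \right) / 2^{20}}
{\max_{i=1}^{n} \left( t_{md,i} + t_{r,i} + t_{w,i} \right) \cdot N_{cn}}
\]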
In this equation, n represents the number of MPI processes, while N_cn represents the number of compute nodes. Intrepid has four cores per compute node, so those two numbers seldom match. Here bytes_r and bytes_w represent the number of bytes read and written by each MPI process, respectively, while t_md, t_r, and t_w represent the time spent in metadata, read, and write operations, respectively. A slight variation is used to account for shared files because, in that scenario, Darshan combines statistics from all MPI processes into a single record and, in doing so, loses track of which process was the slowest. This situation has since been addressed in the Darshan 2.0.0 release. For shared files we therefore estimate the I/O time as the elapsed time between the beginning of the first open() call and the end of the last I/O operation on the file.
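As a concrete illustration only (following the reconstruction above and the elapsed-time approximation for shared files, not Darshan's exact implementation), the per-job calculation could be written as follows; how the unique-file and shared-file time estimates are combined here is an assumption.

```python
def mib_per_sec_per_cn(per_process, shared_files, n_cn):
    """per_process: one (bytes_r, bytes_w, t_md, t_r, t_w) tuple per MPI
    process, covering files that were not shared.
    shared_files: one (bytes_total, t_first_open, t_last_io) tuple per
    shared file, since per-process times are not retained for them.
    n_cn: number of compute nodes allocated to the job."""
    total_bytes = sum(br + bw for br, bw, _, _, _ in per_process)
    io_time = max((tmd + tr + tw for _, _, tmd, tr, tw in per_process),
                  default=0.0)
    for nbytes, t_open, t_last in shared_files:
        total_bytes += nbytes
        # Elapsed-time estimate: first open() to last I/O operation.
        io_time += max(t_last - t_open, 0.0)
    if io_time == 0.0:
        return 0.0
    return (total_bytes / 2**20) / io_time / n_cn
```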
To verify the accuracy of this approach, we used the IOR benchmark. The results are shown in Table I. All IOR jobs used 4,096 processes and transferred a total of 1 TiB of data to or from GPFS. Our examples used both shared files and unique files per process (notated as N-1 and N-N, respectively). The aggregate performance derived from Darshan deviated by less than 3% from the value reported by IOR in each case. Note that reads obtained higher performance than did writes, likely because current SAN configuration settings disable write-back caching but still cache read operations. The variance between N-1 and N-N is probably due to metadata operations because the IOR measurement includes both open and close time as well as the read and write time.
IOR, as configured in these examples, issues perfectly aligned 4 MiB operations concurrently on all processes. Table I therefore also indicates the approximate maximum MiB/s/CN that can be observed on Intrepid. The general maximum performance is bounded by the network throughput the ION can obtain. The BG/P tree network, which connects CNs and IONs, supports approximately 700 MiB/s, which gives a theoretical maximum of 10.94 MiB/s/CN. The ION-to-storage network supports approximately 350 MiB/s, which results in a theoretical maximum of 5.47 MiB/s/CN. With GPFS, a read workload can take advantage of read-ahead and caching to get above the storage network maximum. One final caveat is that the maximum possible MiB/s/CN rate will diminish as the total performance approaches the limit of the file system [1]. The theoretical maximum performance for a 40,960-node, 163,840-process job is 1.59 MiB/s/CN. These metrics are not perfect representations of I/O performance; however, with the exception of three jobs that achieved unreasonably high measures by our estimate, the metric provides meaningful insight into relative performance for production jobs that cannot otherwise be explicitly instrumented. Further analysis indicates that the three outliers exhibited very sparse, uncoordinated I/O that did not fit the model.
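As a worked check of the per-node ceilings quoted above, assuming each ION link is shared evenly by its 64 CNs:

\[
\frac{700\ \mathrm{MiB/s}}{64\ \mathrm{CN}} \approx 10.94\ \mathrm{MiB/s/CN},
\qquad
\frac{350\ \mathrm{MiB/s}}{64\ \mathrm{CN}} \approx 5.47\ \mathrm{MiB/s/CN}.
\]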
III. APPLICATION TRENDS AND I/O-INTENSIVE PROJECTS
From January 23 to March 26, Intrepid executed 23,653 jobs that consumed a total of 175 million core-hours. These jobs were divided into 66 science and engineering projects (not counting maintenance and administration). Of these, 37 were INCITE projects, as described in Section II. The remainder were discretionary projects that were preparing INCITE applications or porting codes to the Blue Gene/P architecture. Of the total workload, Darshan instrumented 6,480 (27%) of all jobs and 42 million (24%) of all core-hours. At least one example from 39 of the 66 projects was captured.
A. Overall application trends
As part of our analysis we investigated the overall trends of the applications instrumented through Darshan on Intrepid. We were interested primarily in two characteristics. First, we wanted to discover which I/O interfaces were used by applications at various job sizes. High-level interfaces such as PnetCDF and HDF5 ease the data management burden and provide data portability, but it has remained unclear how many applications utilize these interfaces and how much data is moved through them. MPI-IO provides useful optimizations for accessing the parallel file systems deployed at leadership-class supercomputing centers, but applications may continue to use POSIX interfaces for a variety of reasons.
Second, we wanted a clearer understanding of the patterns of access in these applications. We focused on the type of file access at various job sizes, looking at the frequency and amount of I/O done to unique files (also known as N:N), shared files (N:1), and partially shared files (N:M, where M < N; that is, only a subset of processes perform I/O to the file). Intuitively, applications performing I/O across a large number of processes will be more likely to use shared files in order to ease the file management burden that results when a large job writes 10,000–100,000 files, but it has remained unclear when applications choose shared files. Also, various studies have shown that at the largest job sizes, some file systems perform better under unique file workloads because lock contention is avoided [12], whereas others perform better under shared file workloads because they ease the metadata management burden [1]. We wanted to see how application performance varied according to the strategy chosen.
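The text does not spell out how jobs were binned into these categories; one plausible classification from per-file sharing counts, sketched here purely for illustration, is:

```python
def classify_job(file_records, nprocs):
    """file_records: iterable of (filename, n_sharing_procs) pairs for a
    job, where n_sharing_procs is how many MPI processes opened the file.
    Returns the job's dominant access pattern by file count."""
    counts = {"N:N": 0, "N:1": 0, "N:M": 0}
    for _, sharers in file_records:
        if sharers <= 1:
            counts["N:N"] += 1  # file unique to a single process
        elif sharers >= nprocs:
            counts["N:1"] += 1  # file shared by all processes
        else:
            counts["N:M"] += 1  # file shared by a proper subset
    if not any(counts.values()):
        return "no files"
    return max(counts, key=counts.get)
```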
Figures 2 and 3 give an overview of the I/O interfaces and access patterns, respectively, used by applications at various job sizes. We see in Figure 2 that the POSIX interfaces were used by the majority of jobs and performed the bulk of the I/O, especially for reading and at smaller process counts. Some applications used MPI-IO, particularly at the highest process counts and for applications that primarily wrote data. So few of the applications in our study used high-level libraries that they would not have been visible in the graph.
Figure 3 shows that the I/O strategy used by jobs varied considerably depending on the size of the job. Unique files were the most common access method for small jobs, whereas partially shared files were the most common access method for large jobs. In terms of quantity of data, most I/O was performed to files that were shared or partially shared. The widespread use of partially shared files indicates that applications are not relying on the MPI-IO collective buffering optimization but, rather, are performing their own aggregation by writing and reading shared files from subsets of processes.
B. I/O-intensive projects
Not all of the 39 projects captured by Darshan were significant producers or consumers of data. Figure 4 illustrates how much data was read and written by the ten projects that moved the most data via Darshan-enabled jobs. The projects are labeled according to their general application domain. The first observation from this figure is that a few projects moved orders of magnitude more data than most others. The project with the highest I/O usage, EarthScience, accessed a total of 3.5 PiB of data. Another notable trend in Figure 4 is that eight of the top ten projects read more data than they wrote. This is contrary to the findings of previous scientific I/O workload studies [13]. By categorizing the data by project, we see that the read/write mix varies considerably by application domain.
Table II lists coverage statistics and application programmer interfaces (APIs) used by each of the projects shown in Figure 4. Darshan instrumented over half of the core-hours consumed by seven of the ten projects. NuclearPhysics, Chemistry, and Turbulence3 were the exceptions and may have generated significantly more I/O activity than is indicated by Figure 4. The fourth column of Table II shows which APIs were used directly by applications within each project. P represents the POSIX open() interface, S represents the POSIX stream fopen() interface, M represents MPI-IO, and H represents HDF5. Every project used at least one of the two POSIX interfaces, while four projects also used MPI-IO. Energy1 notably utilized all four of the HDF5, MPI-IO, POSIX, and POSIX stream interfaces in its job workload.
This subset of projects also varies in how many files are used. Figure 5 plots the number of files accessed by each application run against its processor count for our ten most I/O-intensive projects. If several application instances were launched within a single job (as is common on Intrepid), each instance is shown independently. Reinforcing Figure 3, we see four rough categories: applications that show an N:N trend, ones that show an N:1 trend, a group in the middle, exemplified by Turbulence3, that are subsetting (N:M), and a fourth category of applications operating on no files at all. The large number of application runs that operated on zero files is surprising. Darshan does not track standard output or standard error. One possible explanation is that projects run a few debugging jobs for diagnostic or preliminary tests that write results only to standard output or standard error and then proceed to run "real" jobs.
Of the N:N applications, some access as many as 100 files per process. Programs accessing multiple files per process might need special attention when scaling to full-machine runs because of challenges in metadata overhead and file management.
Figure 5 also demonstrates that some projects have both N:1 and N:N jobs. Perhaps the clearest example is NuclearPhysics, the purple rectangle, about which more will be said in Section V-B.
IV. STORAGE UTILIZATION
The previous section provided an overview of how applications and jobs of varying sizes interacted with the storage system. In this section we investigate how this interaction translates into utilization at the storage device and file system level.
Figure 6 shows the combined aggregate throughput at the block device level of Intrepid's main storage devices from January 23 to March 26. This includes both GPFS and PVFS activity. It also includes interactive access from login nodes as well as analysis access from the Eureka visualization cluster. Notable lapses in storage activity have been correlated with various maintenance windows and labeled accordingly. There were four notable scheduled maintenance days, as well as three unplanned maintenance windows due to network, storage, or control system issues. Note that the data from 9:00 am February 1 to 10:00 am February 5 was lost because of administrative error, but the system was operating normally during that time.
The peak read throughput achieved over any one minute interval was 31.7 GiB/s, while the peak write throughput was 35.0 GiB/s. In previous work, we found that end-to-end throughput on this system varied depending on the access pattern [1]. In that study we measured maximum read performance from 33 to 47 GiB/s and maximum write performance from 30 to 40 GiB/s, both using PVFS. The system did not quite reach these numbers in practice during the interval shown in Figure 6. This may be caused by SAN hardware configuration changes since the previous study. Our previous study also took advantage of full system reservations during Intrepid’s acceptance period, with no resource contention.
From the iostat logs we can also calculate the amount of data moved over various time intervals. An average of 117.1 TiB were read per day, and 31.5 TiB were written per day. A total of 6.8 PiB and 1.8 PiB were read and written over the study interval, not counting the four missing days of data. Although reads made up 78.8% of all activity over the course of the study, this was largely due to the behavior of a single project. The EarthScience project was noted in Section III for having the most read-intensive workload of all projects captured by Darshan. It read over 3.4 PiB of data during the study, or approximately half the total read activity on the system. We investigated that project's usage activity in scheduler logs and found that it significantly tapered off around February 25. This corresponds to a visible change in the read/write mixture at the same time in Figure 6. For the following two weeks, reads accounted for only 50.4% of all I/O activity.
One explanation for the unexpectedly high level of read activity is that users of Intrepid do not checkpoint as frequently as one might expect. The maximum job time allowed by scheduling policy is 12 hours, which is significantly less than the mean time to interrupt of the machine (especially if boot-time errors are excluded). Many applications therefore checkpoint only once per run unless they require more frequent checkpoints for analysis purposes. We also note that some fraction of read activity was triggered by unaligned write accesses at the application level. Both GPFS and PVFS must ultimately perform read/modify/write operations at the block level in order to modify byte ranges that do not fall precisely on block boundaries. As we will see in Section V, unaligned access is common for many applications.
Figure 6 suggests that the I/O activity is also bursty. To quantify this burstiness, we generated a cumulative distribution function of the combined read and write throughput on the system for all 63,211 one-minute intervals recorded from iostat. The result is shown in Figure 7. The average total throughput was 1,984 MiB/s. The peak total throughput was 35,890 MiB/s. For 98% of the time, the I/O system was utilized at less than 33% of peak I/O bandwidth. This performance matches with the common understanding of burstiness of I/O at these scales. Because leadership-class I/O systems are provisioned to provide a very high peak bandwidth for checkpointing, during computation phases the I/O system will be mostly idle.
The duration of the idle periods varied considerably. Table III summarizes the duration of idle periods if we define an idle period as any span of consecutive 60-second intervals in which the aggregate throughput was less than 5% of peak. Idle periods lasting less than 10 minutes by this definition are quite common, but the majority of the idle time is coalesced into much longer periods. Over 37% of the total system time was spent in idle periods that lasted at least one hour. The longest example lasted 19 hours on March 1, which was a scheduled maintenance day for Intrepid. If we limit the definition of idle periods to only those in which no disk activity was observed at all, then minute-long idle periods are much less common. We observed 935 such intervals, which accounted for just over 1% of the total system time. This stricter definition of idle time would be better investigated by using tools with finer granularity [14] in order to capture shorter intervals.
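The idle-period accounting above reduces to coalescing consecutive below-threshold samples; a small sketch over the per-minute aggregate throughput series, assuming the 5%-of-peak cutoff, follows.

```python
def idle_periods(samples_mib_s, peak_mib_s, threshold=0.05):
    """Coalesce consecutive 60-second samples whose aggregate throughput
    falls below threshold * peak; returns idle durations in minutes."""
    cutoff = threshold * peak_mib_s
    durations, run = [], 0
    for sample in samples_mib_s:
        if sample < cutoff:
            run += 1
        else:
            if run:
                durations.append(run)
            run = 0
    if run:
        durations.append(run)
    return durations

# Example: fraction of total time spent in idle periods of an hour or more
# durations = idle_periods(samples, peak_mib_s=35890)
# frac = sum(d for d in durations if d >= 60) / float(len(samples))
```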
We believe that both definitions of idle time are relevant for different aspects of storage system design. Time in which the storage system is mostly idle presents an opportunity for file systems to leverage unused capacity for autonomous storage activity, while time in which parts of the storage system are completely idle may be more useful for component-level optimizations.
A. File system contents
Despite the quantity of data that was transferred through the storage system on Intrepid, a surprising amount of it was stored in relatively small files at the file system level. Figure 8 illustrates the cumulative distribution function of file sizes in March 2010. The most popular size range was 64 KiB to 128 KiB, with over 71 million files. Of all files, 86% were under 1 MiB, 95% were under 2 MiB, and 99.8% were under 128 MiB. The largest file was 16 TiB.
Closer investigation, however, revealed that a single project, EarthScience, was significantly altering the file size characteristics. Figure 8 shows that without EarthScience’s files, the most popular file size would have been 512 KiB to 1 MiB and only 77% of the files would have been under 1 MiB in size.
We can also observe how the file systems changed over the study interval by comparing fsstats results from the beginning and end of the two-month study. Table IV shows growth of the primary file systems on Intrepid, in terms of both total capacity and number of files. Over the study period, the number of files doubled. Just as in the static analysis, however, this growth was largely the result of the EarthScience project data. EarthScience was responsible for 88% of the additional files created during the study period but generated only 15% of the new data by capacity. If the number of files continues to increase at the same rate, the file system will reach 1 billion files in September 2011.
B. File age and overwriting data
Although a large number of files are stored on Intrepid, not all of them are accessed frequently. Figure 9(a) shows a cumulative distribution plot of the last access time for each file on Intrepid in March 2010. Shown are the cumulative percentage based on the number of files and on the amount of data in each file. By either metric, over 90% of the data on Intrepid has not been accessed in at least a month. Moreover, although 55% of all files have been accessed in the past 64 days, those files accounted for only 15% of the data stored on the file system by volume. Figure 9(b) shows similar data for the modification time for the same files: 8% of all files were written in the past 64 days, but those files accounted for 15% of the data on the system by volume. The most pronounced difference between Figure 9(a) and Figure 9(b) is that the lines for file count and file size percentages are reversed. This suggests that small files tend to be read more frequently than large files on Intrepid.
The modification data and access time data suggest that files are rarely overwritten once they are stored on the file system. Darshan characterization supports this observation. We found that of the 209.5 million files written by jobs instrumented with Darshan, 99.3% either were created by the job or were empty before the job started.
We note that the jobs characterized by Darshan wrote more files in the two-month study than were actually present in the file system at the end of the study. This fact, in conjunction with the earlier observations, indicates that files either are deleted within a relatively short time frame or are stored unchanged for extended periods of time. These characteristics would be beneficial to algorithms such as replication, compression, and hierarchical data management that can take advantage of infrequently modified files to improve efficiency.
V. I/O CHARACTERISTICS BY PROJECT
We have established that the I/O workload on Intrepid consists of a variety of access patterns and file usage strategies and that the underlying storage system experiences bursts of I/O demand. In this section we explore in greater detail how storage access characteristics vary by application domain and how those characteristics correlate with I/O performance.
Figure 11 is a box plot of performance measured by using the MiB/s/CN metric outlined in Section II-D. We have filtered the jobs captured by Darshan to include only those that used at least 1,024 processes and moved at least 500 MiB of data. This approach eliminates noise from jobs that moved trivial amounts of data. All statistics shown in the remainder of this study are filtered by the same criteria. For each project in Figure 11, we have shown the minimum, median, and maximum, as well as the Q1 and Q3 quartiles. Some projects exhibited very consistent performance, whereas others varied over a relatively wide range. Very few jobs from any project approached the maximum values established in Section II-D.
Table V summarizes a set of key storage access characteristics averaged across the jobs within each project. The MiB/s/CN and metadata overhead are computed as described in Section II-D. The second column shows the percentage of overall job run time that was spent performing I/O operations. The third column shows the percentage of that I/O time that was spent performing metadata operations rather than read() or write() operations. This value is much higher than expected in some cases because the GPFS file system flushes writes to small files at close() time, and Darshan counts all close() operations as metadata. The other columns show the number of files accessed and created per MPI process, the percentage of sequential and aligned accesses, and the amount of data moved per process. Often accesses by a given process are highly sequential, as has been seen in previous studies [13]. Figure 10 illustrates in greater detail the access sizes used by each project; histograms represent the percentages of accesses that fell within each range.
In the remainder of this section we will refer to Table V and Figure 10 as we explore the characteristics of each project in greater depth.
A. EarthScience
The EarthScience project has already featured prominently in previous sections because it dominated both the read activity and the number of files stored on Intrepid during the study interval. Despite its high level of I/O usage, however, EarthScience ranked near the low end in median I/O performance. Other than the three outliers discussed earlier, the performance is consistent, with an interquartile range (IQR) of only 0.17 MiB/s/CN. Further inspection indicated that the EarthScience workload is dominated by 450 nearly identical jobs, each of which utilized 4,096 processes. These jobs were often further subdivided into a sequence of up to 22 repeated instances of the same application within a job allocation. Each instance accessed approximately 57,000 files, leading some jobs to access a total of more than 1 million distinct files over the lifetime of the job.
The EarthScience project read over 86 times more data than it wrote. The data that it did write, however, was broken into a large number of newly created files. Of the 141 files accessed per process on average, 99 were created and written by the job itself. As noted in Section IV, this project alone contributed over 96 million files to the 191.4 million stored on Intrepid at the end of the study. The direct result of splitting data into so many files is that each job spent more of its I/O time performing metadata operations than actually reading or writing application data. Over 20 TiB of data were written into files averaging 109 KiB in size, leaving the file system little opportunity to amortize metadata overhead. The apparent metadata cost is exaggerated somewhat by I/O time that is attributed to close() rather than write(), but that does not change the fact that this metadata overhead is a limiting factor in overall I/O efficiency for the project.
B. NuclearPhysics
NuclearPhysics exhibited the widest IQR of job performance of any of the ten most I/O-intensive projects. This variability was not caused by fluctuations in performance of a single application. Two applications with different I/O characteristics were run by users as part of this project. In one set, 809 nearly identical jobs accounted for the upper quartile and were among the most efficient of any frequently executed application during the study. In the other set, 811 jobs accounted for the lower quartile. This example illustrates that access characteristics may vary significantly even across applications from the same domain on the same system.
The faster of the two applications utilized a partially shared file access pattern (N:M) and was atypical among jobs observed in this study because many of its files were both read and written during the same job. The metadata overhead of creating and writing multiple files was amortized by the amount of I/O performed to each file. An example job read 1.38 TiB of data and wrote 449.38 GiB of data. This job is also a clear example of a behavior first speculated on in Section III-A, namely, that some applications are implementing their own form of I/O aggregation rather than using the collective functionality provided by MPI-IO. This particular application used POSIX exclusively and was run with 4,096 processes, but the first 512 MPI ranks performed all of the I/O for each job.
The slower of the two applications that dominated this project presents an example of an application that performs “rank 0” I/O, in which a single process is responsible for all of the I/O for the job. In this case the jobs were either 2,048 or 4,096 processes in size. The fact that all I/O was performed by a single rank resulted in a MiB/s/CN score as low as 0.2 in most cases. At first glance this appears to be very poor I/O behavior, but in practice these jobs read only 4 GiB of data, and the time to read that data with one process often constituted only 1% of the run time for this application. So while the storage access characteristics were poor, it will likely not be a significant problem unless the application is scaled to a larger problem size. This application accounted for the earlier observation in Figure 5 that NuclearPhysics exhibited both N:M and N:1 styles of access patterns.
C. Energy1
The performance fluctuation in the Energy1 project results from variations within a single application that used different file systems, job sizes, APIs, data sizes, and file sharing strategies. Discussions with the scientists involved in the project revealed that this behavior was the result of experimental I/O benchmarking and does not represent production application behavior. However, it was interesting to capture an applicationoriented I/O tuning experiment in progress.
D. Climate
The Climate project executed 30 jobs. The jobs tended to use co-processor mode, which means 2 MPI processes per node with 2 threads per MPI process. The application performance was likely dominated by three factors. First, each process created two files, translating to a higher metadata overhead. Second, the application performed a seek for every read/write operation. All seeks need to be forwarded to the ION to be processed, making the calls unusually expensive relative to a cluster system; 82% of I/O time was spent in metadata. Third, write operations were only 64 KiB when the file system block size was 4 MiB. Writes this small are not efficient on this system.
E. Energy2
The Energy2 project executed 58 jobs at one of two sizes, 2,048 or 4,096 processes. The overall time spent in I/O was very small (less than 1%). The I/O performance of this project was low compared with that of the other projects, even though it had low metadata overhead. The performance loss was due to small independent writes (less than 10 KiB) that occurred only on rank 0. This project does utilize a single, shared file for reading, in which all processes read significantly more bytes than are written, and at a larger access size, producing very good performance. Given the small amount of time spent in I/O as compared to the overall application run time and the minimal number of bytes, maximizing the write performance does not seem to be a priority.
F. Turbulence1
The Turbulence1 project sample contained 118 diverse jobs. The jobs were run by three users, with process counts between 1,024 and 32,768. Each user ran a few different applications, which led to a wide performance range across all applications. Table VI details example jobs at the different performance scales.
All of these jobs had a common I/O profile. Each application used shared files as well as unique files on a file-per-process basis. For applications that needed to read or write more substantial amounts of data, a single shared file was used with a 4 MiB access size. The bulk of the accesses, however, involved very small read or write operations. As a result, performance was determined by the effectiveness of either collective I/O or POSIX stream operations in combining these small I/O operations into larger requests. The fastest job performed the bulk of its I/O to shared files using POSIX read operations on the fast scratch file system. The access size was not particularly large but probably benefited from GPFS read-ahead caching. The slowest application used a large number of independent POSIX reads of a very small access size, on the order of four bytes, to the slower home file system.
G. CombustionPhysics
The CombustionPhysics project comprised only 11 jobs in the sample, based on the selection criteria. Those 11 jobs, however, spanned a wide range of sizes, and job size had a significant impact on the I/O rate.
This project appeared to be studying strong scaling, since the total number of bytes transferred was similar for each job regardless of its size. Hence, the number of bytes transferred per process shrank at each larger job size. At the two smaller node counts (1,024 and 2,048), the total I/O time for a job was small (under 1%) compared with the total compute time. At 4,096 nodes, however, the total I/O time grew to 40% of the run time, and the percentage of time spent in metadata jumped to 35%. At each larger node count the percentage of I/O time spent in metadata increased further, eventually topping out at 99%.
Table V indicates that this project created about three files per process on average. This I/O strategy did not scale well on Intrepid at higher processor counts and smaller amounts of data per process (see Table VII). In Section VI we will revisit this project and evaluate the impact of various I/O tuning strategies that were guided by the findings of our integrated I/O characterization methods.
H. Chemistry
All the data captured for the Chemistry project corresponds to a single application. The bulk data accesses from the application were all perfectly aligned to the 4 MiB file system block size. The metadata overhead was also low, because the majority of jobs accessed fewer than 10 files in total regardless of job size. Despite these characteristics, however, the I/O efficiency was poor. The reason is that (with one exception) all of the Chemistry jobs captured by Darshan performed I/O exclusively from a single process, regardless of the size of the job. These jobs achieved performance similar to that of the lower-quartile NuclearPhysics jobs that utilized the same strategy.
One instance of the same application executed with notably different characteristics. The job size in that case was 2,048 processes, and half of the processes performed I/O. As in the upper-quartile NuclearPhysics cases, the application appears to have manually aggregated data on behalf of the other processes, since no MPI-IO was involved. The 1,024 I/O tasks combined to read 10 TiB of data and write 1.35 TiB. Each I/O process used a unique file, and all accesses were perfectly aligned. The application was therefore able to sustain I/O for an extended period with no significant metadata or misalignment overhead. This job achieved the highest observed efficiency among the jobs analyzed in the case studies.
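The project's source code was not available to us, so the sketch below is only a guess at the kind of manual aggregation the Darshan data suggests: half the ranks act as writers, receive a partner's buffer with plain MPI point-to-point calls, and write both buffers, block-aligned, to a unique file. All names and sizes are illustrative, and an even number of ranks is assumed.

/* Hedged sketch of manual aggregation without MPI-IO. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (4 * 1024 * 1024)   /* matches the 4 MiB GPFS block size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *mine = calloc(1, CHUNK);

    if (rank % 2 == 1) {
        /* Non-writer: hand the data to the neighboring aggregator, do no I/O. */
        MPI_Send(mine, CHUNK, MPI_BYTE, rank - 1, 0, MPI_COMM_WORLD);
    } else {
        char *partner = calloc(1, CHUNK);
        MPI_Recv(partner, CHUNK, MPI_BYTE, rank + 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        char fname[64];
        snprintf(fname, sizeof(fname), "chem.out.%05d", rank); /* unique file per writer */
        FILE *fp = fopen(fname, "w");
        fwrite(mine,    1, CHUNK, fp);    /* aligned, block-sized writes */
        fwrite(partner, 1, CHUNK, fp);
        fclose(fp);
        free(partner);
    }

    free(mine);
    MPI_Finalize();
    return 0;
}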
I. Turbulence2
The Turbulence2 project illustrates another example in which job variability arose from differences in the performance of a single application at different scales. The jobs took very little run time, with several examples executing for less than one minute. There was an unusual mix of access sizes, as illustrated in Figure 10. Writes were dominated by very large access sizes, but many reads were less than 100 bytes each. This strategy performed best at relatively small job sizes of 2,048 or 2,304 processes. The same application did not fare as well when scaled up to 65,536 or 131,072 processes, though the run time was still only a few minutes. The application used MPI-IO but did not leverage collective operations or derived datatypes. The I/O-intensive jobs in this project may have been the result of a benchmarking effort, as only 40 jobs met the filtering criteria used in this section; all 40 were the same application executed with different parameters and at different scales.
J. Turbulence3
The Turbulence3 project consisted of 49 jobs. Common job sizes were 1,024, 1,600, and 2,048 processes, along with two jobs of 8,192 processes. The jobs shared a similar I/O pattern: a mix of MPI-IO independent reads and writes at a specific request size, issued from only a small subset of the MPI ranks (either 4 or 8). The lowest-performing job used 16 KiB MPI-IO independent reads and writes. Performance increased as jobs used larger request sizes, up to 320 KiB and 512 KiB. The highest-performing job used a 320 KiB request size but issued more than twice as many reads as writes; the reads were able to take advantage of GPFS read-ahead and caching.
VI. APPLICATION TUNING CASE STUDY
At the conclusion of the I/O study, we selected the CombustionPhysics project from Section V-G as a case study of how continuous characterization can be applied to tuning a specific application. As noted earlier, production jobs from this project ranged from 1,024 to 32,768 nodes but achieved progressively worse I/O performance as the scale increased. This application used OpenMP with four threads per node to process an AMR data set ranging from 2¹⁰ to 2¹³ data points. To simplify the tuning process, we decided to investigate the I/O of a similar but smaller application example from the CombustionPhysics project. This target application utilized the same OpenMP configuration and the same I/O strategy and likewise achieved poor I/O performance at scale. However, its data set is a uniform 2⁹ mesh that produces fixed-size checkpoints of approximately 20 GiB each. We focused on the 8,192-node (32,768 cores) example of this application as a test case and configured it to generate two checkpoints.
To reduce the metadata overhead in a more practical manner, we decided to modify the application to dump its checkpoint data to a single, shared file. Rather than dumping data using independent POSIX operations, however, we updated the application to use MPI-IO collective operations. As a result, the application not only reduced metadata overhead but also enabled a range of transparent MPI-IO optimizations. Of particular importance for strong scaling algorithms, MPI-IO can aggregate access to shared files in order to mitigate the effect of smaller writes at scale. Collective MPI-IO routines also introduce precise block alignment, which has been shown to be an important factor in GPFS performance [15], [12].
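A minimal sketch of this shared-file collective approach is shown below; it is not the CombustionPhysics source itself, and the file name and per-rank slice size are placeholders. Each rank writes its contiguous slice of the checkpoint with a single collective call, leaving aggregation and alignment to the MPI-IO layer.

/* Sketch of a collective shared-file checkpoint, assuming contiguous
 * per-rank slices and the placeholder sizes shown. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const MPI_Offset local_bytes = 2 * 1024 * 1024;   /* per-rank checkpoint slice */
    char *buf = calloc(1, local_bytes);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",    /* one shared file */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: every rank participates, offsets are disjoint. */
    MPI_Offset offset = (MPI_Offset)rank * local_bytes;
    MPI_File_write_at_all(fh, offset, buf, (int)local_bytes, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}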
Figure 12 shows the per-node I/O performance achieved by all three versions of the test application with two checkpoints. The MPI-IO version achieved a factor of 41 improvement, dumping two time steps in approximately 17 seconds. Darshan analysis confirmed that all application-level writes were between 1 and 4 MiB, as in the original example. However, MPI-IO aggregated these into 16 MiB writes at the file system level. The total number of write operations processed by the file system was reduced from 16,388 to 4,096. This, in conjunction with improved block alignment and a dramatic reduction in the number of files created, resulted in more efficient use of available resources and faster turnaround for science runs.
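The 16 MiB file-system-level writes reported above came from the collective buffering defaults of the MPI-IO (ROMIO) implementation on Intrepid; the fragment below, which extends the previous sketch, shows how the same aggregation could be requested explicitly through standard ROMIO hints. The hint values are examples, not the settings used in this study.

/* Opening the shared checkpoint file with explicit collective-buffering hints. */
#include <mpi.h>

MPI_File open_checkpoint_with_hints(const char *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_cb_write", "enable");    /* force collective buffering */
    MPI_Info_set(info, "cb_buffer_size", "16777216");  /* 16 MiB aggregation buffer  */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, path,
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
    MPI_Info_free(&info);   /* safe to free once the file is open */
    return fh;
}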
VII. RELATED WORK
A number of past studies have investigated the I/O access patterns of scientific applications. Nieuwejaar et al. initiated the influential Charisma project in 1993 to study multiprocessor I/O workloads [13]. That effort culminated in an analysis of three weeks of data from two high-performance computing systems with up to 512 processes. Their study identified access pattern characteristics and established terminology to describe them. Smirni and Reed analyzed five representative scientific applications with up to 64 processes [16]; the Pablo environment [17] was used for trace capture in that work. While both Charisma and Pablo measured application characteristics in a manner similar to our work, neither was performed at a comparable scale or correlated with system-level I/O activity. Wang et al. investigated synthetic benchmarks and two physics applications with up to 1,620 processes and found similar results [18]. Uselton et al. developed a statistical approach to I/O characterization in a more recent study [19]. They leveraged IPM [20] for raw trace capture of two scientific applications with up to 10,240 processes on two different platforms; statistical techniques were then used to identify and resolve I/O bottlenecks in each application. These studies utilized complete traces and focused on specific applications rather than on a general production workload.
Kim et al. performed a workload characterization of a 10 PiB storage system at Oak Ridge National Laboratory that provides storage for over 250 thousand compute cores [14]. They observed and modeled production disk controller characteristics such as bandwidth distribution, the correlation of request size to performance, and idle time. These characteristics were not correlated back to application-level behavior, however.
Other recent system-level studies have focused on large network file systems. In an investigation of two CIFS file systems that hosted data for 1,500 industry employees, Leung et al. discovered a number of recent trends in I/O behavior [6]. Anderson presented a study of NFS workloads with up to 1,634 clients [5]. These studies were similar in scope to our work but were not performed in a high-performance computing environment.
A wide variety of tools are available for capturing and analyzing I/O access from individual parallel applications, including IPM, HPCT-IO, LANL-Trace, IOT, and mpiP [20], [21], [22], [23], [24]. Multiple I/O tracing mechanisms were surveyed by Konwinski et al. [25]. Klundt, Weston, and Ward have also investigated tracing of user-level I/O libraries on lightweight kernels [26].
VIII. CONCLUSIONS
In this work we have introduced a methodology for continuous, scalable, production characterization of I/O workloads, and we have demonstrated its value in both understanding system behavior and accelerating debugging of individual applications. We used data collected over a two-month period to investigate critical questions about the nature of storage access characteristics on leadership-class machines. We performed our investigation on Intrepid, a 557-teraflop IBM Blue Gene/P deployed at Argonne National Laboratory. Intrepid’s storage system contained over 191 million files and moved an average of nearly 150 TiB of data per day. We captured detailed application-level I/O characteristics of 27% of all jobs executed on Intrepid, ranging in size from 1 to 163,840 processes. In doing so, we demonstrated that it is possible to instrument several aspects of storage systems at full scale without interfering with production users. We have also developed a performance metric that enables relative comparison of a wide variety of production applications.
The findings of this study will influence future research directions as well as the design of future I/O subsystems to be used at the ALCF. We found that POSIX is still heavily used by many applications, although on the BG/P it offered no discernible performance advantage. MPI-IO, HDF5, and Parallel NetCDF are also used by the top 10 I/O producers and consumers. We also found several examples of I/O performance being constrained by the metadata overhead that results from accessing large numbers of small files. These examples of suboptimal behavior highlight the need for tools such as Darshan to simplify the task of understanding and tuning I/O behavior.
Shared or partially shared file usage becomes the predominant method of file access at the 16,384-processor mark on Intrepid. Jobs of this size and larger will be increasingly common on the next generation of computing platforms, which implies that the ALCF should invest in helping applications transition to shared or partially shared file models. We demonstrated the benefit of this approach through a case study of I/O tuning in the CombustionPhysics project, one of the most active INCITE projects on Intrepid in terms of data usage.
From the aspect of I/O system design, we found two major items of interest. We were able to verify the "burstiness" of I/O on Intrepid, indicating that we have a significant opportunity to utilize idle storage resources for tasks such as performance diagnosis. In addition, we found that files are rarely overwritten once they are closed. This suggests that there is an opportunity to leverage hierarchical storage for more cost-effective storage of infrequently accessed data. We also found that while I/O characterization data was easy to use on a per-job basis, analyzing and summarizing many jobs in aggregate proved more difficult than anticipated. As a result, we have enhanced Darshan in the 2.0.0 release to streamline the analysis process [27]. For example, more shared file statistics (such as minimum, maximum, and variance among participating processes) are now computed at run time.
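The sketch below is not Darshan's implementation; it only illustrates the idea of computing such cross-process statistics at run time. Per-rank counters for a shared file are reduced at shutdown, so minimum, maximum, and variance across ranks are available without post-processing per-process records. The function name and the single timing counter are assumptions for illustration.

/* Illustrative run-time reduction of per-rank shared-file statistics. */
#include <mpi.h>
#include <stdio.h>

void reduce_shared_file_stats(double my_io_time)
{
    double tmin, tmax, tsum, tsumsq, sq = my_io_time * my_io_time;
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_Reduce(&my_io_time, &tmin,   1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
    MPI_Reduce(&my_io_time, &tmax,   1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&my_io_time, &tsum,   1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Reduce(&sq,         &tsumsq, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        double mean = tsum / nprocs;
        printf("I/O time: min %.3f max %.3f variance %.3f\n",
               tmin, tmax, tsumsq / nprocs - mean * mean);
    }
}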
We compared storage access characteristics of different scientific application domains and found an extraordinary variety of both data usage and application-level I/O performance. Several distinct I/O strategies were identified, including shared file usage, unique file usage, rank 0 I/O, and examples of both weak and strong scaling of data. We also discovered examples of applications that appeared to be utilizing custom aggregation algorithms without the assistance of MPI-IO. In general we found that metadata overhead, small access sizes, and I/O imbalance were the most significant barriers to I/O performance. However, no single technique employed by applications emerged overwhelmingly as the most successful. In future work we would like to perform similar studies at other sites that may offer a different collection of scientific computing applications to compare and contrast.