The International Conference for High Performance Computing, Networking, Storage and Analysis
Bandwidth-Aware Resource Management for Extreme Scale Systems
Authors: Zhou Zhou (Illinois Institute of Technology), Xu Yang (Illinois Institute of Technology), Zhiling Lan (Illinois Institute of Technology), Paul Rich (Argonne National Laboratory), Wei Tang (Argonne National Laboratory), Vitali Morozov (Argonne National Laboratory), Narayan Desai (Ericsson)
Abstract: As systems scale toward exascale, many resources, including traditional CPU cycles as well as non-traditional resources (e.g., communication bandwidth), will become increasingly constrained. This change will pose critical challenges for resource management and job scheduling. As systems continue to evolve, we expect non-traditional resources such as communication bandwidth to be increasingly allocated explicitly, where they have previously been managed implicitly. In this paper we investigate smart allocation of communication bandwidth on Blue Gene systems. The partition-based design of Blue Gene systems gives us a unique opportunity to allocate bandwidth to jobs explicitly, in a way that is not possible on other systems. While this capability is currently rare, we expect it to become more common in the future. This paper makes two major contributions. The first is substantial benchmarking of leadership applications, focusing on assessing application sensitivity to communication bandwidth at large scale.