SC14 New Orleans, LA

The International Conference for High Performance Computing, Networking, Storage and Analysis

Scalable and Highly Available Fault Resilient Programming Middleware for Exascale Computing


Authors: Atsuko Takefusa (National Institute of Advanced Industrial Science and Technology), Tsutomu Ikegami (National Institute of Advanced Industrial Science and Technology), Hidemoto Nakada (National Institute of Advanced Industrial Science and Technology), Ryousei Takano (National Institute of Advanced Industrial Science and Technology), Takayuki Tozawa (National Institute of Advanced Industrial Science and Technology), Yoshio Tanaka (National Institute of Advanced Industrial Science and Technology)

Abstract: Falanx is a programming middleware for developing applications for exascale computing. Because the computing environment is fragile, applications must be not only scalable but also fault resilient. Falanx employs an MPI-based hierarchical parallel programming model for scalability. It consists of a Resource Management System (RMS) and a Data Store (DS): the RMS allocates the processes of each task to computing nodes while avoiding failed nodes, and the DS redundantly preserves the data required by each application to prevent data loss. These components must themselves be scalable and implemented in a fault resilient manner. We design a scalable and highly available middleware consisting of the RMS and the DS, and implement them using Apache ZooKeeper and Kyoto Cabinet. We then investigate the basic performance in preliminary experiments and confirm the feasibility in experiments using an actual application, OpenFMO.
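
The division of labor between the two components can be illustrated with a toy, single-process sketch. It is not Falanx code and does not use Apache ZooKeeper or Kyoto Cabinet; the class `Cluster`, its methods, and the replication factor of 2 are hypothetical names and values chosen only for illustration. The sketch assumes that an RMS-like step places processes only on nodes not marked as failed, and that a DS-like step writes each value to several distinct healthy nodes so that a single node failure does not lose data.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of node health plus a replicated key-value store.

    All names here are hypothetical; this is not the Falanx API.
    """
    nodes: list
    replicas: int = 2                      # assumed replication factor
    failed: set = field(default_factory=set)
    store: dict = field(default_factory=dict)

    def __post_init__(self):
        # One in-memory dictionary per node stands in for its local storage.
        for node in self.nodes:
            self.store.setdefault(node, {})

    def healthy(self):
        return [n for n in self.nodes if n not in self.failed]

    def fail(self, node):
        """Mark a node as failed (its stored data becomes unreachable)."""
        self.failed.add(node)

    def allocate(self, nprocs):
        """RMS-like step: place nprocs processes round-robin on healthy nodes."""
        alive = self.healthy()
        if not alive:
            raise RuntimeError("no healthy nodes available")
        return [alive[i % len(alive)] for i in range(nprocs)]

    def put(self, key, value):
        """DS-like step: write the value to `replicas` distinct healthy nodes."""
        alive = self.healthy()
        if len(alive) < self.replicas:
            raise RuntimeError("not enough healthy nodes for replication")
        start = zlib.crc32(key.encode()) % len(alive)  # deterministic placement
        targets = [alive[(start + i) % len(alive)] for i in range(self.replicas)]
        for node in targets:
            self.store[node][key] = value
        return targets

    def get(self, key):
        """Read from any healthy node that still holds a copy."""
        for node in self.healthy():
            if key in self.store[node]:
                return self.store[node][key]
        raise KeyError(key)


if __name__ == "__main__":
    cluster = Cluster(nodes=["n0", "n1", "n2", "n3"])
    holders = cluster.put("block-0", b"payload")
    cluster.fail(holders[0])               # lose one replica holder
    print(cluster.get("block-0"))          # still readable from the other copy
    print(cluster.allocate(6))             # placement skips the failed node
```

In this sketch the failure information is held in one object; a real deployment would need that state to be replicated and agreed upon across nodes, which is the kind of role a coordination service plays.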


