The International Conference for High Performance Computing, Networking, Storage and Analysis
Designing Scalable and Efficient I/O Middleware for Fault-Resilient High-Performance Computing Clusters
Student: Raghunath Raja Chandrasekar (Ohio State University)
Advisor: Dhabaleswar Kumar Panda (Ohio State University)
Abstract: This dissertation proposes a cross-layer framework that leverages the hierarchy of storage media (memory, ramdisk, flash/NVM, disk, parallel file system, and so on) to design scalable, low-overhead fault-tolerance mechanisms that are inherently I/O bound. The key components of the framework are: CRUISE, a highly scalable in-memory checkpointing system that leverages both volatile and non-volatile memory technologies; Stage-FS, a lightweight data-staging system that leverages burst buffers and SSDs to asynchronously move application snapshots to a remote file system; Stage-QoS, a file-system-agnostic Quality-of-Service mechanism for data-staging systems that minimizes network contention; MIC-Check, a distributed checkpoint-restart system for coprocessor-based supercomputing systems; and FTB-IPMI, an out-of-band fault-prediction mechanism that proactively monitors for failures.
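
The general idea of checkpointing to a fast storage tier and asynchronously staging snapshots to a slower, more durable tier can be illustrated with a minimal sketch. This is not the CRUISE or Stage-FS implementation or API; the class name `StagedCheckpointer`, its methods, and the use of two ordinary directories as stand-ins for a fast tier (such as a ramdisk) and a remote file system are all assumptions made for illustration.

```python
import os
import queue
import shutil
import threading


class StagedCheckpointer:
    """Hypothetical two-tier checkpointer: write fast, drain to slow storage later."""

    def __init__(self, fast_dir, slow_dir):
        self.fast_dir = fast_dir      # stand-in for memory/ramdisk/SSD tier
        self.slow_dir = slow_dir      # stand-in for a remote parallel file system
        self._pending = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def checkpoint(self, name, data):
        """Write the snapshot to the fast tier and return without waiting for staging."""
        path = os.path.join(self.fast_dir, name)
        with open(path, "wb") as f:
            f.write(data)
        self._pending.put(name)       # staging happens in the background
        return path

    def _drain(self):
        """Background loop: copy queued snapshots from the fast tier to the slow tier."""
        while True:
            name = self._pending.get()
            try:
                if name is None:      # sentinel pushed by close()
                    return
                src = os.path.join(self.fast_dir, name)
                tmp = os.path.join(self.slow_dir, name + ".partial")
                shutil.copyfile(src, tmp)
                # Rename only after the copy completes, so a partial file is never
                # mistaken for a complete checkpoint.
                os.replace(tmp, os.path.join(self.slow_dir, name))
            finally:
                self._pending.task_done()

    def wait(self):
        """Block until every queued snapshot has been staged."""
        self._pending.join()

    def restore(self, name):
        """Read a snapshot from the fastest tier that still holds it."""
        for tier in (self.fast_dir, self.slow_dir):
            path = os.path.join(tier, name)
            if os.path.exists(path):
                with open(path, "rb") as f:
                    return f.read()
        raise FileNotFoundError(name)

    def close(self):
        """Stop the background worker after it finishes the queued work."""
        self._pending.put(None)
        self._worker.join()
```

In this sketch the application blocks only for the fast-tier write, while the copy to the slow tier overlaps with computation. A real system would also have to handle fast-tier capacity limits, failures during staging, and contention on the shared network, none of which are modeled here.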