The International Conference for High Performance Computing, Networking, Storage and Analysis
Monetary Cost Optimizations for HPC Applications on Amazon Clouds: Checkpoints and Replicated Execution
Authors: Yifan Gong (Nanyang Technological University), Bingsheng He (Nanyang Technological University), Amelie Chi Zhou (Nanyang Technological University)
Abstract: In this paper, we propose monetary cost optimizations for MPI-based applications with deadline constraints on Amazon EC2 clouds. In particular, we develop an MPI runtime system called SOMPI that minimizes the monetary cost by considering two kinds of Amazon EC2 instances (on-demand and spot instances). Because a spot instance can fail at any time due to out-of-bid events, fault-tolerant execution is necessary. Through detailed studies, we have found that two common fault-tolerance mechanisms, i.e., checkpointing and replicated execution, are complementary to each other for cost-effective MPI execution on spot instances. Therefore, we propose a novel cost model to minimize the expected monetary cost. The experimental results with NPB benchmarks on Amazon EC2 demonstrate (1) significant monetary cost reduction and performance improvement and (2) the necessity of adaptively choosing checkpoint and replication techniques due to spot instance price dynamics.