The International Conference for High Performance Computing, Networking, Storage and Analysis
Lossy Compression for Checkpointing: Fallible or Feasible?
Authors: Xiang Ni (University of Illinois at Urbana-Champaign), Tanzima Islam (Lawrence Livermore National Laboratory), Kathryn Mohror (Lawrence Livermore National Laboratory), Adam Moody (Lawrence Livermore National Laboratory), Laxmikant Kale (University of Illinois at Urbana-Champaign)
Abstract: As HPC applications scale to hundreds of thousands of processors, their checkpoints grow large enough that fitting them in memory or burst buffers becomes costly, and transferring them to stable storage takes a significant amount of time. Lossless compression fails to reduce the size of such checkpoints because of randomness in the lower bits of typical floating-point scientific data. To address this challenge, we propose the use of lossy compression to reduce checkpoint size significantly, and we study its impact on the correctness of application execution. We study the trade-off among loss of precision, compression ratio, and correctness under lossy compression. For ChaNGa, we show that moderate lossy compression reduces checkpoint size by 3-5x while maintaining correctness. Finally, we inject failures following different distributions to study whether an application is more sensitive to precision loss at earlier or later stages of the simulation.
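The point about low-order bits can be illustrated with a minimal sketch. It is not the compression scheme evaluated in the paper. It assumes a simple stand-in for lossy compression (zeroing the lowest mantissa bits of each 64-bit float) followed by zlib, applied to synthetic data. The helper names `truncate_mantissa` and `compressed_ratio`, the choice of 32 dropped bits, and the sine-plus-noise input are all hypothetical choices made for illustration.

```python
import math
import random
import struct
import zlib


def truncate_mantissa(values, drop_bits):
    """Zero the lowest `drop_bits` mantissa bits of each float64 (lossy step).

    Assumption: this stands in for a real lossy compressor; it is not the
    method used in the paper.
    """
    if not 0 <= drop_bits <= 52:
        raise ValueError("drop_bits must be in [0, 52]")
    mask = 0xFFFFFFFFFFFFFFFF ^ ((1 << drop_bits) - 1)
    out = []
    for v in values:
        # Reinterpret the double as a 64-bit integer, mask, and convert back.
        (bits,) = struct.unpack("<Q", struct.pack("<d", v))
        (w,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
        out.append(w)
    return out


def compressed_ratio(values):
    """Return raw size divided by zlib-compressed size for a list of floats."""
    raw = struct.pack("<%dd" % len(values), *values)
    return len(raw) / len(zlib.compress(raw, 9))


# Synthetic "checkpoint": a smooth field plus small noise, so the low
# mantissa bits look random, as the abstract describes for scientific data.
rng = random.Random(0)
data = [math.sin(0.001 * i) + 1e-6 * rng.random() for i in range(20000)]

lossy = truncate_mantissa(data, 32)  # keep 20 of the 52 mantissa bits

lossless_ratio = compressed_ratio(data)
lossy_ratio = compressed_ratio(lossy)
max_rel_error = max(abs(a - b) / abs(a) for a, b in zip(data, lossy) if a != 0.0)
```

Comparing `lossless_ratio` with `lossy_ratio` shows how much of the incompressibility comes from the low-order bits, while `max_rel_error` bounds the precision given up. The ratios on this synthetic input say nothing about the 3-5x reported for ChaNGa, which depends on the application data and the compressor actually used.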