Verifiable Application-Level Checkpoint and Restart Framework for Parallel Computing

Ivan Gankevich, Ivan Petriakov, Anton Gavrikov, Dmitrii Tereshchenko, Gleb Mozhaiskii

Fault tolerance of parallel and distributed applications is a concern that has become topical for large computer clusters and large distributed systems. For a long time the common solution to this problem was checkpoint and restart mechanisms implemented at the operating system level; however, these are inefficient for large systems, and application-level checkpoint and restart is now considered a more efficient alternative. In this paper we manually implement application-level checkpoint and restart for well-known parallel computing benchmarks to evaluate this alternative approach. We measure the overheads introduced by creating a checkpoint and restarting from it, and the amount of effort needed to implement the mechanism and verify the correctness of the resulting programme. Based on the results we propose a generic framework for application-level checkpointing that simplifies the process and makes it possible to verify that the application gives correct output when restarted from any checkpoint.
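
The two ideas in the abstract, saving application state explicitly and checking that a restart from any checkpoint reproduces the uninterrupted result, can be illustrated with a minimal Python sketch. This is not the authors' framework: the toy kernel (a running sum of squares), the function names `save_checkpoint`, `load_checkpoint`, `run` and `verify_restart`, and the one-file-per-step layout are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temporary file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path):
    """Read back a state previously written by save_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(n_steps, state=None, checkpoint_dir=None):
    """Hypothetical iterative kernel: a running sum of squares.

    The application decides what its state is (here: step counter and
    partial sum). If checkpoint_dir is given, a checkpoint is written
    after every step.
    """
    if state is None:
        state = {"step": 0, "total": 0}
    while state["step"] < n_steps:
        i = state["step"]
        state = {"step": i + 1, "total": state["total"] + i * i}
        if checkpoint_dir is not None:
            name = os.path.join(checkpoint_dir, f"step{i + 1}.ckpt")
            save_checkpoint(name, state)
    return state["total"]


def verify_restart(n_steps):
    """Restart from every checkpoint and compare with the reference output."""
    with tempfile.TemporaryDirectory() as d:
        reference = run(n_steps, checkpoint_dir=d)
        for k in range(1, n_steps + 1):
            state = load_checkpoint(os.path.join(d, f"step{k}.ckpt"))
            if run(n_steps, state=state) != reference:
                return False
    return True
```

Calling `verify_restart(10)` runs the kernel once with checkpointing enabled and then once more from each saved state. A real parallel benchmark would also have to agree on a consistent checkpoint across processes, which this single-process sketch does not attempt.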

  title={Verifiable Application-Level Checkpoint and Restart Framework for Parallel Computing},
  author={Ivan Gankevich and Ivan Petriakov and Anton Gavrikov and Dmitrii Tereshchenko and Gleb Mozhaiskii},
  publisher={RWTH Aachen University},
  booktitle={Proceedings of GRID'21},
  series={CEUR Workshop Proceedings},
  editor={Vladimir Korenkov and Andrey Nechaevskiy and Tatiana Zaikina},

Publication: Proceedings of GRID'21
Publisher: RWTH Aachen University