Factory: Non-stop batch jobs without checkpointing

I. Gankevich, Y. Tipikin, V. Korkhov, V. Gaiduchok

Many job schedulers today rely on checkpoint mechanisms to make long-running batch jobs resilient to node failures. At large scale, stopping a job and creating its image consumes a considerable amount of time. The aim of this study is to propose a method that eliminates this overhead. To this end, we decompose the problem being solved into computational microkernels that have a strict hierarchical dependence on each other. When a kernel abruptly stops executing because of a node failure, it is the responsibility of its principal to restart the computation on a healthy node. In our experiments we successfully applied this method to make a hydrodynamics HPC application run on a constantly changing number of nodes. We believe that this technique can be generalised to other types of scientific applications as well.
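The principal-restarts-subordinate idea described in the abstract can be illustrated with a minimal Python sketch. It is not the Factory implementation or its API: the names `Node`, `Kernel`, `SquareKernel`, `SumOfSquares`, and `NodeFailure` are hypothetical, node failure is simulated with a boolean flag, and everything runs in a single process. The sketch only shows that, when a subordinate kernel is lost, its principal can rerun that one kernel on another node without restoring the whole job from a saved image.

```python
class NodeFailure(Exception):
    """Raised when a (simulated) node dies while running a kernel."""


class Node:
    """Hypothetical stand-in for a cluster node; `healthy` simulates failure."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run(self, kernel):
        if not self.healthy:
            raise NodeFailure(self.name)
        return kernel.act()


class Kernel:
    """A unit of work that knows its principal (parent kernel)."""

    def __init__(self, principal=None):
        self.principal = principal

    def act(self):
        raise NotImplementedError


class SquareKernel(Kernel):
    """Subordinate kernel: squares a single value."""

    def __init__(self, value, principal=None):
        super().__init__(principal)
        self.value = value

    def act(self):
        return self.value * self.value


class SumOfSquares(Kernel):
    """Principal kernel: spawns subordinates and restarts them on failure."""

    def __init__(self, values, nodes):
        super().__init__()
        self.values = values
        self.nodes = nodes
        self.restarts = 0

    def run_subordinate(self, kernel, start):
        # Try each node once, round-robin from `start`; a failed attempt
        # means only this kernel is resubmitted, nothing else is redone.
        for attempt in range(len(self.nodes)):
            node = self.nodes[(start + attempt) % len(self.nodes)]
            try:
                return node.run(kernel)
            except NodeFailure:
                self.restarts += 1
        raise RuntimeError("no healthy node left")

    def act(self):
        total = 0
        for i, value in enumerate(self.values):
            subordinate = SquareKernel(value, principal=self)
            total += self.run_subordinate(subordinate, i)
        return total


# Node n1 is down: the kernel first sent to it is rerun on n2.
nodes = [Node("n0"), Node("n1", healthy=False), Node("n2")]
job = SumOfSquares([1, 2, 3, 4], nodes)
result = job.act()
print(result, job.restarts)
```

In this sketch the cost of a failure is bounded by the work of the one kernel that was lost, which is the property that lets the job keep running without periodic checkpoints. A real system would also have to handle the failure of the principal itself, which this example does not attempt.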

Publication: Proceedings of HPCS'16