Research activities
CAPE — Checkpointing Aided Parallel Execution
Checkpoints were introduced many years ago to ensure that a program runs to completion: its state is backed up at regular intervals, so that after a failure execution can resume from the last backup rather than from the beginning. Various techniques have emerged, for example making a complete backup of the program state, or saving only the changes made since the last backup.
OpenMP is a set of compiler directives with which the programmer tells the compiler which regions of a program are worth parallelizing. Its relative simplicity compared to MPI makes it an interface of choice for expressing parallelism. Although OpenMP was originally defined for shared-memory machines, several more or less successful attempts have been made to implement it on top of MPI or Global Arrays so that it can also be used on distributed-memory machines.
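The two checkpointing strategies mentioned above, a complete backup versus saving only the changes since the last backup, can be illustrated with a small sketch. It is a toy model only: the program state is assumed to be a Python dictionary, and the names full_checkpoint, incremental_checkpoint and restore are hypothetical helpers introduced for this illustration, not part of any real checkpointer.

```python
import copy

def full_checkpoint(state):
    """Complete backup: copy every variable of the program state."""
    return copy.deepcopy(state)

def incremental_checkpoint(previous, state):
    """Incremental backup: keep only what changed since `previous`."""
    changed = {k: copy.deepcopy(v) for k, v in state.items()
               if k not in previous or previous[k] != v}
    removed = [k for k in previous if k not in state]
    return {"changed": changed, "removed": removed}

def restore(base, increments):
    """Rebuild the last saved state from a full backup plus increments."""
    state = copy.deepcopy(base)
    for inc in increments:
        state.update(copy.deepcopy(inc["changed"]))
        for k in inc["removed"]:
            state.pop(k, None)
    return state

# Toy program: sum the squares of 1..6, backing up every two iterations.
state = {"i": 0, "total": 0, "label": "sum of squares"}
base = full_checkpoint(state)
increments, last_saved = [], base
for i in range(1, 6):            # pretend the program fails after i == 5
    state["i"] = i
    state["total"] += i * i
    if i % 2 == 0:
        increments.append(incremental_checkpoint(last_saved, state))
        last_saved = copy.deepcopy(state)

recovered = restore(base, increments)   # state as of the last backup (i == 4)
resumed = copy.deepcopy(recovered)
for i in range(resumed["i"] + 1, 7):    # resume from there, not from i == 1
    resumed["i"] = i
    resumed["total"] += i * i
```

In this sketch each incremental backup contains only the variables that changed (here the counter and the running total, never the unchanged label), and execution restarts from iteration 5 instead of from the beginning.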
CAPE is an alternative way of carrying out this port. It consists of saving an image of the program, using a checkpoint, at the beginning of a parallel section and distributing it to all the nodes of the parallel machine so that each one executes a part of the program. Once the nodes have finished executing their share of the code, each sends back to the initial node the modifications it made locally, so that they can be injected into the original program. Once all the modifications have been applied, the initial program resumes execution as if the parallel section had been executed locally on the initial node. Building the successive prototypes required substantial development work, but the results obtained are very encouraging. In particular, the experiments we conducted on matrix products (a very common operation in high-performance computing) show that the solution scales with very good speedup, even for a large number of nodes.
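The sequence described above (checkpoint, distribute, execute a share, return the modifications, inject them) can be sketched on a small matrix product. This is a single-process simulation under strong assumptions, not the CAPE implementation: the program state is modeled as a Python dictionary rather than a real program image, the "nodes" are plain function calls executed one after another, each row of the result is treated as a separate variable, and the names diff, node_execute and cape_parallel_section are hypothetical.

```python
import copy

def diff(before, after):
    """Entries of `after` that are new or differ from `before`."""
    return {k: v for k, v in after.items()
            if k not in before or before[k] != v}

def node_execute(image, rows):
    """One node: start from the checkpoint image, compute its share of
    the rows of C = A * B, and return only its local modifications."""
    local = copy.deepcopy(image)
    a, b = local["A"], local["B"]
    for i in rows:
        local[f"C[{i}]"] = [sum(a[i][k] * b[k][j] for k in range(len(b)))
                            for j in range(len(b[0]))]
    return diff(image, local)

def cape_parallel_section(state, n_nodes):
    """Initial node: checkpoint, hand out the work, merge the results."""
    image = copy.deepcopy(state)               # checkpoint at section entry
    n_rows = len(state["A"])
    updates = [node_execute(image, range(node, n_rows, n_nodes))
               for node in range(n_nodes)]     # stand-in for remote nodes
    for update in updates:
        state.update(update)                   # inject the modifications
    return state

state = {"A": [[1, 2], [3, 4], [5, 6]], "B": [[7, 8], [9, 10]]}
cape_parallel_section(state, n_nodes=2)
C = [state[f"C[{i}]"] for i in range(len(state["A"]))]
```

After the merge, the state on the initial node holds every row of C, as if the loop had run locally. Because each simulated node writes a disjoint set of rows, the order in which the modifications are injected does not matter in this sketch.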