On systems that do not support kernel-level checkpointing, LSF provides a method to checkpoint jobs called user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.
By default, the checkpoint libraries are installed in LSF_LIBDIR, and echkpnt and erestart are installed in LSF_SERVERDIR.
Optionally, third-party checkpoint and restart implementations can be used with LSF. You must use the echkpnt and erestart supplied with the third-party implementation. To avoid overwriting the echkpnt and erestart supplied by LSF, install any third-party implementation in a separate directory and define LSB_ECHKPNT_METHOD and LSB_ECHKPNT_METHOD_DIR, either as environment variables or in lsf.conf.
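As a minimal sketch, the two variables named above could be set in the job submission environment as follows. The method name myckpt and the directory /opt/myckpt/bin are hypothetical placeholders, not values shipped with LSF; substitute the values that match your third-party implementation.

```shell
# Hypothetical values: replace with the method name and the install
# directory of your third-party checkpoint implementation.
LSB_ECHKPNT_METHOD=myckpt
LSB_ECHKPNT_METHOD_DIR=/opt/myckpt/bin
export LSB_ECHKPNT_METHOD LSB_ECHKPNT_METHOD_DIR
```

To apply the setting to all jobs instead, the same two NAME=value lines (without the export line) would go in lsf.conf.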
The current implementation of the checkpoint library for user-level checkpointing has the following restrictions:
The checkpointed process can only be restarted on hosts of the same architecture and with the same operating system as the host on which the checkpoint was created.
Processes with open pipes and sockets can be checkpointed but might not restart properly, because the pipes and sockets are not reopened on restart.
If a process has stdin, stdout, or stderr as open pipes, all data in the pipes is lost on restart.
The checkpointed process cannot be operating on a private stack when the checkpoint happens.
The checkpointed program must be statically linked.
SIGHUP is used internally to implement checkpointing. Do not use this signal in programs to be checkpointed.
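One way to check the static-linking restriction before submitting a job is to inspect the executable with file(1). The sketch below assumes a platform whose file command reports "statically linked" (or "static-pie linked") for such binaries, as on typical Linux systems. The helper name is_static and the path ./my_job are hypothetical.

```shell
# Hypothetical helper: succeeds only if file(1) reports a static executable.
# The matched strings depend on the platform's file(1) output.
is_static() {
    file -L "$1" 2>/dev/null | grep -Eq 'statically linked|static-pie linked'
}

# ./my_job is a placeholder for your re-linked program.
if is_static ./my_job; then
    echo "my_job is statically linked"
else
    echo "my_job is not reported as statically linked (or was not found)"
fi
```

A negative result means only that file(1) did not report a static executable. On platforms with different file(1) output, use the linker or loader tools documented for that system.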