Job checkpoint and restart

The job checkpoint and restart feature enables you to stop jobs and then restart them from the point at which they stopped, which optimizes resource usage. LSF can periodically capture the state of a running job and the data required to restart it. This feature provides fault tolerance and allows LSF administrators and users to migrate jobs from one host to another to achieve load balancing.

Contents

  • About job checkpoint and restart

  • Scope

  • Configuration to enable job checkpoint and restart

  • Job checkpoint and restart behavior

  • Configuration to modify job checkpoint and restart

  • Job checkpoint and restart commands

About job checkpoint and restart

Checkpointing enables LSF users to restart a job on the same execution host or to migrate a job to a different execution host. LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.

When LSF checkpoints a job, the echkpnt interface creates a checkpoint file in the directory checkpoint_dir/job_ID, and then checkpoints and resumes the job. The job continues to run, even if checkpointing fails.

When LSF restarts a stopped job, the erestart interface recovers job state information from the checkpoint file, including information about the execution environment, and restarts the job from the point at which the job stopped. At job restart, LSF:
  1. Resubmits the job to its original queue and assigns a new job ID

  2. Dispatches the job when a suitable host becomes available (not necessarily the original execution host)

  3. Re-creates the execution environment based on information from the checkpoint file

  4. Restarts the job from its most recent checkpoint
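For example, a user can restart checkpointed job 123 with brestart, and LSF assigns a new job ID (the job IDs and queue name shown are illustrative):

brestart my_dir 123
Job <456> is submitted to queue <normal>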

[Figure: Default behavior (job checkpoint and restart not enabled)]

[Figure: With job checkpoint and restart enabled]

Kernel-level checkpoint and restart

Where the operating system provides checkpoint and restart functionality, it is transparent to your applications. In LSF, kernel-level checkpoint and restart is enabled by default. To implement job checkpoint and restart at the kernel level, the LSF echkpnt and erestart executables invoke operating system-specific calls.

LSF uses the default executables echkpnt.default and erestart.default for kernel-level checkpoint and restart.

User-level checkpoint and restart

For systems that do not support kernel-level checkpoint and restart, LSF provides a user-level job checkpoint and restart implementation that does not require you to change your application source code. User-level job checkpoint and restart is enabled by linking your application object files to the LSF checkpoint libraries in LSF_LIBDIR. LSF uses the default executables echkpnt.default and erestart.default for user-level checkpoint and restart.

Application-level checkpoint and restart

Different applications have different checkpointing implementations that require the use of customized external executables (echkpnt.application and erestart.application). Application-level checkpoint and restart enables you to configure LSF to use specific echkpnt.application and erestart.application executables for a job, queue, or cluster. You can write customized checkpoint and restart executables for each application that you use.

LSF uses a combination of corresponding checkpoint and restart executables. For example, if you use echkpnt.fluent to checkpoint a particular job, LSF will use erestart.fluent to restart the checkpointed job. You cannot override this behavior or configure LSF to use a specific restart executable.
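For example, a user might submit a job with a specific checkpoint method (the checkpoint directory my_dir and the job command my_app are illustrative):

bsub -k "my_dir method=fluent" my_app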

Scope



Operating system

  • Kernel-level checkpoint and restart works only with supported operating system versions and architectures for:
    • SGI IRIX 6.4 and later

    • SGI Altix ProPack 3 and later

Job types

  • Non-interactive batch jobs submitted with bsub or bmod

  • Non-interactive batch jobs, including chunk jobs, checkpointed with bchkpnt

  • Non-interactive batch jobs migrated with bmig

  • Non-interactive batch jobs restarted with brestart

Dependencies

  • UNIX and Windows user accounts must be valid on all hosts in the cluster, or the correct type of account mapping must be enabled.
    • For a mixed UNIX/Windows cluster, UNIX/Windows user account mapping must be enabled.

    • For a cluster with a non-uniform user name space, between-host account mapping must be enabled.

    • For a MultiCluster environment with a non-uniform user name space, cross-cluster user account mapping must be enabled.

  • The checkpoint and restart executables run under the user account of the user who submits the job. User accounts must have the correct permissions to:
    • Successfully run executables located in LSF_SERVERDIR or LSB_ECHKPNT_METHOD_DIR

    • Write to the checkpoint directory

  • The erestart.application executable must have access to the original command line used to submit the job.

  • For user-level checkpoint and restart, you must have access to your application object (.o) files.

  • To allow restart of a checkpointed job on a different host than the host on which the job originally ran, both the original and the new hosts must:
    • Be binary compatible

    • Run the same dot version of the operating system for predictable results

    • Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)

    • Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file

    • Have access to all files open during job execution so that LSF can locate them using an absolute path name

Limitations

  • bmod cannot change the echkpnt and erestart executables associated with a job.

  • On 32-bit Linux, AIX, and HP platforms that use NFS (Network File System), the checkpoint directory (including path and file name) must be shorter than 1000 characters.

  • On 64-bit Linux platforms that use NFS, the checkpoint directory (including path and file name) must be shorter than 2000 characters.


Configuration to enable job checkpoint and restart

The job checkpoint and restart feature requires that a job be made checkpointable at the job or queue level. LSF users can make jobs checkpointable by submitting jobs using bsub -k and specifying a checkpoint directory. Queue administrators can make all jobs in a queue checkpointable by specifying a checkpoint directory for the queue.
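For example, the following command submits a hypothetical job my_job and makes it checkpointable, with checkpoint files written to /share/chkpnt every 30 minutes:

bsub -k "/share/chkpnt 30" my_job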

Configuration file: lsb.queues

Parameter and syntax: CHKPNT=chkpnt_dir [chkpnt_period]

Behavior:
  • All jobs submitted to the queue are checkpointable. LSF writes the checkpoint files, which contain job state information, to the checkpoint directory. The checkpoint directory can contain checkpoint files for multiple jobs.
    • The specified checkpoint directory must already exist. LSF will not create the checkpoint directory.

    • The user account that submits the job must have read and write permissions for the checkpoint directory.

    • For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.

  • If the queue administrator specifies a checkpoint period, in minutes, LSF creates a checkpoint file every chkpnt_period minutes during job execution.
    Note:

    There is no default value for the checkpoint period. You must specify a checkpoint period if you want to enable periodic checkpointing.

  • If a user specifies a checkpoint directory and checkpoint period at the job level with bsub -k, the job-level values override the queue-level values.

  • The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.
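For example, a minimal lsb.queues definition that makes all jobs in the queue checkpointable (the queue name and directory are illustrative):

Begin Queue
QUEUE_NAME = chkpnt_queue
CHKPNT     = /share/chkpnt 30
End Queue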

Configuration file: lsb.applications

Parameters and syntax: CHKPNT_DIR=chkpnt_dir, CHKPNT_PERIOD=chkpnt_period, CHKPNT_METHOD=chkpnt_method

Behavior:
  • All jobs submitted to the application profile are checkpointable. The checkpoint directory, period, and method behave as described for the queue-level and job-level settings; see Precedence of job, queue, application, and cluster-level checkpoint values for how the levels interact.


Configuration to enable kernel-level checkpoint and restart

Kernel-level checkpoint and restart is enabled by default. LSF users make a job checkpointable by either submitting a job using bsub -k and specifying a checkpoint directory or by submitting a job to a queue that defines a checkpoint directory for the CHKPNT parameter.

Configuration to enable user-level checkpoint and restart

To enable user-level checkpoint and restart, you must link your application object files to the LSF checkpoint libraries provided in LSF_LIBDIR. You do not have to change any code within your application. For instructions on how to link application files, see the Platform LSF Programmer’s Guide.

Configuration to enable application-level checkpoint and restart

Application-level checkpointing requires the presence of at least one echkpnt.application executable in the directory specified by the parameter LSF_SERVERDIR in lsf.conf. Each echkpnt.application must have a corresponding erestart.application.
Important:
The erestart.application executable must:
  • Have access to the command line used to submit or modify the job

  • Exit with a return value without running an application; the erestart interface runs the application to restart the job


Executable file: echkpnt
  • UNIX naming convention: LSF_SERVERDIR/echkpnt.application
  • Windows naming convention: LSF_SERVERDIR\echkpnt.application.exe or LSF_SERVERDIR\echkpnt.application.bat

Executable file: erestart
  • UNIX naming convention: LSF_SERVERDIR/erestart.application
  • Windows naming convention: LSF_SERVERDIR\erestart.application.exe or LSF_SERVERDIR\erestart.application.bat


Restriction:

The names echkpnt.default and erestart.default are reserved. Do not use these names for application-level checkpoint and restart executables.

Valid file names contain only alphanumeric characters, underscores (_), and hyphens (-).

For application-level checkpoint and restart, once the LSF_SERVERDIR contains one or more checkpoint and restart executables, users can specify the external checkpoint executable associated with each checkpointable job they submit. At restart, LSF invokes the corresponding external restart executable.
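For example, after a hypothetical pair of executables for an application named fluent is installed, listing LSF_SERVERDIR might show:

ls $LSF_SERVERDIR
echkpnt.default  echkpnt.fluent  erestart.default  erestart.fluent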

Requirements for application-level checkpoint and restart executables

  • The executables must be written in C or Fortran.

  • The directory/name combinations must be unique within the cluster. For example, you can write two different checkpoint executables with the name echkpnt.fluent and save them as LSF_SERVERDIR/echkpnt.fluent and my_execs/echkpnt.fluent. To run checkpoint and restart executables from a directory other than LSF_SERVERDIR, you must configure the parameter LSB_ECHKPNT_METHOD_DIR in lsf.conf.

  • Your executables must return the following values.
    • An echkpnt.application must return a value of 0 when checkpointing succeeds and a non-zero value when checkpointing fails.

    • The erestart interface provided with LSF restarts the job using a restart command that erestart.application writes to the file checkpoint_dir/job_ID/.restart_cmd. The return value indicates whether erestart.application successfully wrote the parameter definition LSB_RESTART_CMD=restart_command to that file.
      • A non-zero value indicates that erestart.application failed to write to the .restart_cmd file.

      • A return value of 0 indicates that erestart.application successfully wrote to the .restart_cmd file, or that the executable intentionally did not write to the file.

  • Your executables must recognize the syntax used by the echkpnt and erestart interfaces, which communicate with your executables by means of a common syntax.
    • echkpnt.application syntax:
      echkpnt [-c] [-f] [-k | -s] [-d checkpoint_dir] [-x] process_group_ID
      Restriction:

      The -k and -s options are mutually exclusive.

    • erestart.application syntax:
      erestart [-c] [-f] checkpoint_dir

    Options and variables:

    -c
      Copies all files in use by the checkpointed process to the checkpoint directory.
      Operating systems: some, such as SGI systems running IRIX and Altix.

    -f
      Forces a job to be checkpointed even under non-checkpointable conditions, which are specific to the checkpoint implementation used. This option could create checkpoint files that do not provide for successful restart.
      Operating systems: some, such as SGI systems running IRIX and Altix.

    -k
      Kills a job after successful checkpointing. If the checkpoint fails, the job continues to run.
      Operating systems: all operating systems that LSF supports.

    -s
      Stops a job after successful checkpointing. If the checkpoint fails, the job continues to run.
      Operating systems: some, such as SGI systems running IRIX and Altix.

    -d checkpoint_dir
      Specifies the checkpoint directory as a relative or absolute path.
      Operating systems: all operating systems that LSF supports.

    -x
      Identifies the cpr (checkpoint and restart) process as type HID, which specifies the set of processes to checkpoint as a process hierarchy (tree) rooted at the current PID.
      Operating systems: some, such as SGI systems running IRIX and Altix.

    process_group_ID
      The ID of the process or process group to checkpoint.
      Operating systems: all operating systems that LSF supports.
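To illustrate this syntax, LSF might invoke a hypothetical echkpnt.fluent and, later, erestart.fluent as follows (the checkpoint directory, job ID, and process group ID are invented):

echkpnt.fluent -c -s -d /share/chkpnt/123 4065
erestart.fluent -c /share/chkpnt/123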


Job checkpoint and restart behavior

LSF invokes the echkpnt interface when a job is:
  • Automatically checkpointed based on a configured checkpoint period

  • Manually checkpointed with bchkpnt

  • Migrated to a new host with bmig

After checkpointing, LSF invokes the erestart interface to restart the job. LSF also invokes the erestart interface when a user:
  • Manually restarts a job using brestart

  • Migrates the job to a new host using bmig

All checkpoint and restart executables run under the user account of the user who submits the job.

Note:

By default, LSF redirects the standard output and standard error of the checkpoint and restart executables to /dev/null and discards the data.

Checkpoint directory and files

LSF identifies checkpoint files by the checkpoint directory and job ID. For example:

bsub -k my_dir
Job <123> is submitted to default queue <default>

LSF writes the checkpoint file to my_dir/123.

LSF maintains all of the checkpoint files for a single job in one location. When a job restarts, LSF creates both a new subdirectory based on the new job ID and a symbolic link from the old to the new directory. For example, when job 123 restarts on a new host as job 456, LSF creates my_dir/456 and a symbolic link from my_dir/123 to my_dir/456.
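For example, after job 123 restarts as job 456, listing the checkpoint directory might show (output is illustrative):

ls -F my_dir
123@  456/

where 123 is now a symbolic link that points to the new directory 456.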

The file path of the checkpoint directory can contain up to 4000 characters for UNIX and Linux, or up to 255 characters for Windows, including the directory and file name.

Precedence of job, queue, application, and cluster-level checkpoint values

LSF handles checkpoint and restart values as follows:
  1. Checkpoint directory and checkpoint period: values specified at the job level override values specified for the queue, and values specified in an application profile override queue-level configuration.

    If checkpoint-related configuration is specified in the queue, in an application profile, and at the job level:
    • Application-level and job-level parameters are merged. If the same parameter is defined at both the job level and in the application profile, the job-level value overrides the application profile value.

    • The merged result of the job-level and application profile settings overrides the queue-level configuration.

  2. Checkpoint and restart executables: the checkpoint_method specified at the job level overrides both the application-level CHKPNT_METHOD and the cluster-level LSB_ECHKPNT_METHOD specified in lsf.conf or as an environment variable.

  3. Configuration parameters and environment variables: values specified as environment variables override the values specified in lsf.conf.


The following examples show how these precedence rules combine:

If the command line is: bsub -k "my_dir 240"
And: in lsb.queues, CHKPNT=other_dir 360
Then: LSF saves the checkpoint file to my_dir/job_ID every 240 minutes

If the command line is: bsub -k "my_dir method=fluent"
And: in lsf.conf, LSB_ECHKPNT_METHOD=myapp
Then: LSF invokes echkpnt.fluent at job checkpoint and erestart.fluent at job restart

If the command line is: bsub -k "my_dir"
And: in lsb.applications, CHKPNT_PERIOD=360
Then: LSF saves the checkpoint file to my_dir/job_ID every 360 minutes

If the command line is: bsub -k "240"
And: in lsb.applications, CHKPNT_DIR=app_dir and CHKPNT_PERIOD=360; in lsb.queues, CHKPNT=other_dir
Then: LSF saves the checkpoint file to app_dir/job_ID every 240 minutes

Configuration to modify job checkpoint and restart

There are configuration parameters that modify various aspects of job checkpoint and restart behavior by:
  • Specifying mandatory application-level checkpoint and restart executables that apply to all checkpointable batch jobs in the cluster

  • Specifying the directory that contains customized application-level checkpoint and restart executables

  • Saving standard output and standard error to files in the checkpoint directory

  • Automatically checkpointing jobs before suspending or terminating them

  • For Cray systems only, copying all open job files to the checkpoint directory

Configuration to specify mandatory application-level executables

You can specify mandatory checkpoint and restart executables by defining the parameter LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_METHOD="echkpnt_application"

Behavior:
  • The specified echkpnt runs for all batch jobs submitted to the cluster. At restart, the corresponding erestart runs.

  • For example, if LSB_ECHKPNT_METHOD=fluent, at checkpoint, LSF runs echkpnt.fluent and at restart, LSF runs erestart.fluent.

  • If an LSF user specifies a different echkpnt_application at the job level using bsub -k or bmod -k, the job-level value overrides the value in lsf.conf.

Configuration to specify the directory for application-level executables

By default, LSF looks for application-level checkpoint and restart executables in LSF_SERVERDIR. You can modify this behavior by specifying a different directory as an environment variable or in lsf.conf.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_METHOD_DIR=path

Behavior:
  • Specifies the absolute path to the directory that contains the echkpnt.application and erestart.application executables.

  • User accounts that run these executables must have the correct permissions for the LSB_ECHKPNT_METHOD_DIR directory.


Configuration to save standard output and standard error

By default, LSF redirects the standard output and standard error from checkpoint and restart executables to /dev/null and discards the data. You can modify this behavior by defining the parameter LSB_ECHKPNT_KEEP_OUTPUT as an environment variable or in lsf.conf.

Configuration file: lsf.conf

Parameter and syntax: LSB_ECHKPNT_KEEP_OUTPUT=Y | y

Behavior:
  • The stdout and stderr for echkpnt.application or echkpnt.default are redirected to echkpnt.out and echkpnt.err in checkpoint_dir/job_ID/.

  • The stdout and stderr for erestart.application or erestart.default are redirected to erestart.out and erestart.err in checkpoint_dir/job_ID/.
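A hypothetical lsf.conf excerpt that combines the three parameters described above (the method name and directory are invented):

LSB_ECHKPNT_METHOD=fluent
LSB_ECHKPNT_METHOD_DIR=/usr/share/lsf/my_execs
LSB_ECHKPNT_KEEP_OUTPUT=Y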


Configuration to checkpoint jobs before suspending or terminating them

LSF administrators can configure LSF at the queue level to checkpoint jobs before suspending or terminating them.

Configuration file: lsb.queues

Parameter and syntax: JOB_CONTROLS=SUSPEND[CHKPNT] TERMINATE[CHKPNT]

Behavior:
  • LSF checkpoints jobs before suspending or terminating them.

  • When suspending a job, LSF checkpoints the job and then stops it by sending the SIGSTOP signal.

  • When terminating a job, LSF checkpoints the job and then kills it.
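For example, extending the earlier queue sketch so that LSF checkpoints jobs before suspending or terminating them (the queue name and directory are illustrative):

Begin Queue
QUEUE_NAME   = chkpnt_queue
CHKPNT       = /share/chkpnt 30
JOB_CONTROLS = SUSPEND[CHKPNT] TERMINATE[CHKPNT]
End Queue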


Configuration to copy open job files to the checkpoint directory

For hosts that use the Cray operating system, LSF administrators can configure LSF at the host level to copy all open job files to the checkpoint directory every time the job is checkpointed.

Configuration file: lsb.hosts

Parameter and syntax (a CHKPNT column in the Host section):
  HOST_NAME     CHKPNT
  host_name     C

Behavior:
  • LSF copies all open job files to the checkpoint directory when a job is checkpointed.
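A minimal lsb.hosts sketch (the host name is invented):

Begin Host
HOST_NAME     CHKPNT
cray001       C
End Host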


Job checkpoint and restart commands

Commands for submission



bsub -k "checkpoint_dir [checkpoint_period] [method=echkpnt_application]"

  • Specifies a relative or absolute path for the checkpoint directory and makes the job checkpointable.

  • If the specified checkpoint directory does not already exist, LSF creates the checkpoint directory.

  • If a user specifies a checkpoint period (in minutes), LSF creates a checkpoint file every checkpoint_period minutes during job execution.

  • The command-line values for the checkpoint directory and checkpoint period override the values specified for the queue.

  • If a user specifies an echkpnt_application, LSF runs the corresponding restart executable when the job restarts. For example, for bsub -k "my_dir method=fluent" LSF runs echkpnt.fluent at job checkpoint and erestart.fluent at job restart.

  • The command-line value for echkpnt_application overrides the value specified by LSB_ECHKPNT_METHOD in lsf.conf or as an environment variable. Users can override LSB_ECHKPNT_METHOD and use the default checkpoint and restart executables by defining method=default.
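For example, a user can bypass a cluster-wide LSB_ECHKPNT_METHOD setting and fall back to the default executables for a single hypothetical job my_job:

bsub -k "my_dir method=default" my_job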


Commands to monitor



bacct -l

  • Displays accounting statistics for finished jobs, including termination reasons. TERM_CHKPNT indicates that a job was checkpointed and killed.

  • If JOB_CONTROLS is defined for a queue, LSF does not display the result of the action.

bhist -l

  • Displays the actions that LSF took on a completed job, including job checkpoint, restart, and migration to another host.

bjobs -l

  • Displays information about pending, running, and suspended jobs, including the checkpoint directory, the checkpoint period, and the checkpoint method (either application or default).


Commands to control



bmod -k "checkpoint_dir [checkpoint_period] [method=echkpnt_application]"

  • Resubmits a job and changes the checkpoint directory, checkpoint period, and the checkpoint and restart executables associated with the job.

bmod -kn

  • Dissociates the checkpoint directory from a job, which makes the job no longer checkpointable.

bchkpnt

  • Checkpoints the most recently submitted checkpointable job. Users can specify particular jobs to checkpoint by including various bchkpnt options.

bchkpnt -p checkpoint_period job_ID

  • Checkpoints a job immediately and changes the checkpoint period for the job.

bchkpnt -k job_ID

  • Checkpoints a job immediately and kills the job.

bchkpnt -p 0 job_ID

  • Checkpoints a job immediately and disables periodic checkpointing.

brestart

  • Restarts a checkpointed job on the first available host.

brestart -m

  • Restarts a checkpointed job on the specified host or host group.

bmig

  • Migrates one or more running jobs from one host to another. The jobs must be checkpointable or rerunnable.

  • Checkpoints, kills, and restarts one or more checkpointable jobs.
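For example, the following sequence checkpoints and kills job 123, restarts it from the checkpoint files under my_dir (LSF assigns a new job ID), and later migrates the restarted job (the job IDs, directory, and host name are illustrative):

bchkpnt -k 123
brestart my_dir 123
bmig -m hostB 456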


Commands to display configuration



bqueues -l

  • Displays information about queues configured in lsb.queues, including the values defined for checkpoint directory and checkpoint period.
    Note:

    The bqueues command displays the checkpoint period in seconds; the lsb.queues CHKPNT parameter defines the checkpoint period in minutes.

badmin showconf

  • Displays all configured parameters and their values set in lsf.conf or ego.conf that affect mbatchd and sbatchd.

    Use a text editor to view other parameters in the lsf.conf or ego.conf configuration files.

  • In a MultiCluster environment, badmin showconf only displays the parameters of daemons on the local cluster.