Job Checkpoint, Restart, and Migration
Job checkpoint and restart optimizes resource usage by enabling a non-interactive job to restart on a new host from the point at which the job stopped. Checkpointed jobs do not have to restart from the beginning. Job migration facilitates load balancing by enabling users to move a job from one host to another while taking advantage of job checkpoint and restart functionality.
Contents
- Checkpoint and restart options
- Checkpoint directory and files
- Checkpoint and restart executables
- Job restart
- Job migration
Checkpoint and restart options
You can implement job checkpoint and restart at one of the following levels.
- Kernel level: provided by your operating system, enabled by default
- User level: provided by special LSF libraries that you link to your application object files
- Application level: provided by your site-specific applications and supported by LSF through the use of application-specific echkpnt and erestart executables
note:
For a detailed description of the job checkpoint and restart feature and how to configure it, see the Platform LSF Configuration Reference.
Checkpoint directory and files
The job checkpoint and restart feature requires that a job be made checkpointable at the job, application profile, or queue level. LSF users can make a job checkpointable by submitting the job using bsub -k and specifying a checkpoint directory and, optionally, a checkpoint period, initial checkpoint period, and checkpoint method. Administrators can make all jobs in a queue or an application profile checkpointable by specifying a checkpoint directory for the queue or application profile.
Requirements
The following requirements apply to a checkpoint directory specified at the queue or application profile level:
- The specified checkpoint directory must already exist. LSF does not create the checkpoint directory.
- The user account that submits the job must have read and write permissions for the checkpoint directory.
- For the job to restart on another execution host, both the original and new hosts must have network connectivity to the checkpoint directory.
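The first two requirements above can be checked before a directory is put into a queue or application profile. The following is a minimal sketch in Python, not an LSF tool: the function name checkpoint_dir_problems is hypothetical, and it only tests the account and host that run the script, so it says nothing about connectivity from other execution hosts.

```python
import os
import tempfile


def checkpoint_dir_problems(path):
    """Return a list of reasons why `path` fails the directory requirements.

    Hypothetical helper: checks only the current user on the current host.
    """
    problems = []
    if not os.path.isdir(path):
        # LSF does not create the checkpoint directory, so it must already exist.
        problems.append("directory does not exist")
        return problems
    if not os.access(path, os.R_OK):
        problems.append("no read permission")
    if not os.access(path, os.W_OK):
        problems.append("no write permission")
    return problems


if __name__ == "__main__":
    # Demonstration with a temporary directory standing in for a checkpoint directory.
    with tempfile.TemporaryDirectory() as tmp:
        print(checkpoint_dir_problems(tmp))
        print(checkpoint_dir_problems(os.path.join(tmp, "missing")))
```

An empty list means the directory passed both checks for the account that ran the script.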
Behavior
Specifying a checkpoint directory at the queue level or in an application profile enables checkpointing.
- All jobs submitted to the queue or application profile are checkpointable. LSF writes the checkpoint files, which contain job state information, to the checkpoint directory. The checkpoint directory can contain checkpoint files for multiple jobs.
note:
LSF does not delete the checkpoint files; you must perform file maintenance manually.
- If the administrator specifies a checkpoint period, in minutes, LSF creates a checkpoint file every chkpnt_period during job execution.
- If the administrator specifies an initial checkpoint period in an application profile, in minutes, the first checkpoint does not happen until the initial period has elapsed. LSF then creates a checkpoint file every chkpnt_period after the initial checkpoint period, during job execution.
- If a user specifies a checkpoint directory, initial checkpoint period, checkpoint method, or checkpoint period at the job level with bsub -k, or modifies the job with bmod, the job-level values override the queue-level and application profile values.
The brestart command restarts checkpointed jobs that have stopped running.
Precedence of checkpointing options
If checkpoint-related configuration is specified in both the queue and an application profile, the application profile setting overrides the queue-level configuration.
If checkpoint-related configuration is specified in the queue, application profile, and at job level:
- Application-level and job-level parameters are merged. If the same parameter is defined at both job-level and in the application profile, the job-level value overrides the application profile value.
- The merged result of job-level and application profile settings overrides queue-level configuration.
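The precedence rules above can be pictured with a small Python sketch. This is an illustration only, not LSF code: the function name and the dictionary keys ("dir", "period") are hypothetical, and the sketch assumes one reading of the rules, namely that any merged job-level and application profile result replaces the queue-level configuration as a whole rather than parameter by parameter.

```python
def effective_checkpoint_settings(queue=None, app_profile=None, job=None):
    """Resolve checkpoint settings from three configuration levels.

    Each argument is a dict of checkpoint parameters (hypothetical keys) or None.
    """
    # Application profile and job-level parameters are merged;
    # the job-level value wins when both define the same parameter.
    merged = {**(app_profile or {}), **(job or {})}
    # Assumption: a non-empty merged result overrides the queue-level
    # configuration as a whole; otherwise the queue settings apply.
    return merged if merged else dict(queue or {})


if __name__ == "__main__":
    queue = {"dir": "qdir", "period": 240}
    app_profile = {"dir": "appdir", "period": 120}
    job = {"period": 60}
    print(effective_checkpoint_settings(queue, app_profile, job))
```

With the sample values, the directory comes from the application profile and the period from the job, and the queue values are not used.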
Checkpointing MultiCluster jobs
To enable checkpointing of MultiCluster jobs, define a checkpoint directory in both the send-jobs and receive-jobs queues (CHKPNT in lsb.queues), or in an application profile (CHKPNT_DIR, CHKPNT_PERIOD, CHKPNT_INITPERIOD, CHKPNT_METHOD in lsb.applications) of both the submission cluster and the execution cluster. LSF uses the directory specified in the execution cluster.
Checkpointing is not supported if a job runs on a leased host.
Checkpointing resizable jobs
After a checkpointable resizable job restarts (brestart), LSF restores the original job allocation request. LSF also restores the job-level autoresizable attribute and notification command if they are specified at job submission.
Example
The following example shows a queue configured for periodic checkpointing in lsb.queues:
Begin Queue
...
QUEUE_NAME=checkpoint
CHKPNT=mydir 240
DESCRIPTION=Automatically checkpoints jobs every 4 hours to mydir
...
End Queue
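To make the timing of the queue above concrete, the following Python sketch lists when periodic checkpoints would fall for a given run time. It is a simplified model, not LSF behavior: the function names are hypothetical, it ignores how long a checkpoint takes and any scheduling delay, and it treats the initial checkpoint period as the time of the first checkpoint. The second helper converts a period configured in minutes to seconds.

```python
def checkpoint_times(run_minutes, period, initial_period=None):
    """Minutes after job start at which periodic checkpoints would fall.

    Simplified model (assumption): checkpoints are instantaneous and on time.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    # Without an initial period, the first checkpoint falls one period in.
    t = initial_period if initial_period is not None else period
    times = []
    while t <= run_minutes:
        times.append(t)
        t += period
    return times


def minutes_to_seconds(minutes):
    """Convert a checkpoint period configured in minutes to seconds."""
    return minutes * 60


if __name__ == "__main__":
    print(checkpoint_times(1000, 240))                     # CHKPNT=mydir 240
    print(checkpoint_times(1000, 240, initial_period=60))  # with an initial period
    print(minutes_to_seconds(240))
```

For a job that runs 1000 minutes under the 240-minute period, the model yields four checkpoints.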
note:
The bqueues command displays the checkpoint period in seconds; the lsb.queues CHKPNT parameter defines the checkpoint period in minutes.
If the command bchkpnt -k 123 is used to checkpoint and kill job 123, you can restart the job using the brestart command as shown in the following example:
brestart -q priority mydir 123
Job <456> is submitted to queue <priority>
LSF assigns a new job ID of 456, submits the job to the queue named "priority," and restarts the job.
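Because a restarted job receives a new job ID, scripts that drive brestart usually need to capture that ID. The following Python sketch extracts it from the confirmation message. The message format is taken from the example above and may differ in other LSF versions; the function name parse_restart_output is hypothetical.

```python
import re

# Pattern based on the sample message "Job <456> is submitted to queue <priority>".
_SUBMITTED = re.compile(r"Job <(\d+)> is submitted to queue <([^>]+)>")


def parse_restart_output(text):
    """Return (new_job_id, queue_name) from brestart output, or None if absent."""
    match = _SUBMITTED.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    print(parse_restart_output("Job <456> is submitted to queue <priority>"))
```

A None result signals that the expected message was not found, which a calling script can treat as a failed restart.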
Once job 456 is running, you can change the checkpoint period using the bchkpnt command:
bchkpnt -p 360 456
Job <456> is being checkpointed
note:
For a detailed description of the commands used with the job checkpoint and restart feature, see the Platform LSF Configuration Reference.
Checkpoint and restart executables
LSF controls checkpointing and restart by means of interfaces named echkpnt and erestart. By default, when a user specifies a checkpoint directory using bsub -k or bmod -k, or submits a job to a queue that has a checkpoint directory specified, echkpnt sends checkpoint instructions to an executable named echkpnt.default.
For application-level job checkpoint and restart, you can specify customized checkpoint and restart executables for each application that you use. The optional parameter LSB_ECHKPNT_METHOD specifies a checkpoint executable used for all jobs in the cluster. An LSF user can override this value when submitting a job.
note:
For a detailed description of how to write and configure application-level checkpoint and restart executables, see the Platform LSF Configuration Reference.
Job restart
LSF can restart a checkpointed job on a host other than the original execution host using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job restarts, LSF performs the following actions:
- LSF resubmits the job to its original queue as a new job and assigns a new job ID.
- When a suitable host becomes available, LSF dispatches the job.
- LSF recreates the execution environment from the checkpoint file.
- LSF restarts the job from its last checkpoint.
You can restart a job manually from the command line using brestart, automatically through configuration, or by migrating the job to a different host using bmig.
Requirements
To allow restart of a checkpointed job on a different host than the host on which the job originally ran, both the original and the new hosts must:
- Be binary compatible
- Run the same dot version of the operating system for predictable results
- Have network connectivity and read/execute permissions to the checkpoint and restart executables (in LSF_SERVERDIR by default)
- Have network connectivity and read/write permissions to the checkpoint directory and the checkpoint file
- Have access to all files open during job execution so that LSF can locate them using an absolute path name
Job migration
Job migration is the process of moving a checkpointable or rerunnable job from one host to another. This facilitates load balancing by moving jobs from a heavily-loaded host to a lightly-loaded host.
You can initiate job migration manually on demand (bmig) or automatically. To initiate job migration automatically, you can configure a migration threshold at job submission, or at the host, queue, or application profile level.
note:
For a detailed description of the job migration feature and how to configure it, see the Platform LSF Configuration Reference.
Manual job migration
The bmig command migrates checkpointable or rerunnable jobs on demand. Jobs can be manually migrated by the job owner, queue administrator, and LSF administrator.
For example, to migrate a job with job ID 123 to the first available host:
bmig 123
Job <123> is being migrated
Automatic job migration
Automatic job migration assumes that if a job is system-suspended (SSUSP) for an extended period of time, the execution host is probably heavily loaded. Specifying a migration threshold at job submission (bsub -mig) or configuring an application profile-level, queue-level, or host-level migration threshold allows the job to progress and reduces the load on the host. You can use bmig at any time to override a configured migration threshold, or bmod -mig to change a job-level migration threshold.
For example, at the queue level, in lsb.queues:
Begin Queue
...
MIG=30   # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Queue
At the host level, in lsb.hosts:
Begin Host
HOST_NAME   r1m   pg   MIG   # Keywords
...
hostA       5.0   18   30
...
End Host
In an application profile, in lsb.applications:
Begin Application
...
MIG=30   # Migration threshold set to 30 mins
DESCRIPTION=Migrate suspended jobs after 30 mins
...
End Application
If you want to requeue migrated jobs instead of restarting or rerunning them, you can define the following parameters in lsf.conf:
- LSB_MIG2PEND=1 requeues a job with the original submission time and priority
- LSB_REQUEUE_TO_BOTTOM=1 requeues a job at the bottom of the queue, regardless of the submission time and priority
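The migration threshold used throughout this section can be summarized as a simple test on how long a job has been system-suspended. The Python sketch below is an illustration, not LSF logic: the function name is hypothetical, the comparison "at least the threshold" is an assumption, and the sketch does not model which level's threshold applies when several are configured.

```python
def should_migrate(state, suspended_minutes, threshold_minutes):
    """Return True if a job qualifies for automatic migration in this model.

    state: job state string such as "SSUSP" or "RUN".
    threshold_minutes: migration threshold in minutes, or None if not configured.
    """
    if threshold_minutes is None:
        # No threshold configured: automatic migration does not apply.
        return False
    # Assumption: the job qualifies once it has been system-suspended
    # for at least the threshold.
    return state == "SSUSP" and suspended_minutes >= threshold_minutes


if __name__ == "__main__":
    print(should_migrate("SSUSP", 45, 30))  # suspended longer than MIG=30
    print(should_migrate("SSUSP", 10, 30))  # not yet at the threshold
    print(should_migrate("RUN", 45, 30))    # job is not system-suspended
```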
Platform Computing Inc.
www.platform.com