Configuration to modify pre- and post-execution processing

Configuration parameters modify various aspects of pre- and post-execution processing behavior by:
  • Preventing a new job from starting until post-execution processing has finished

  • Controlling the length of time post-execution processing can run

  • Specifying a user account under which the pre- and post-execution commands run

  • Controlling how many times pre-execution retries

Configuration to modify when new jobs can start

When a job finishes, sbatchd reports a job finish status of DONE or EXIT to mbatchd. This causes LSF to release resources associated with the job, allowing new jobs to start on the execution host before post-execution processing from a previous job has finished.

In some cases, you might want to prevent the overlap of a new job with post-execution processing. Preventing a new job from starting prior to completion of post-execution processing can be configured at the application level or at the job level.

At the job level, the bsub -w option allows you to specify job dependencies; the keywords post_done and post_err cause LSF to wait for completion of post-execution processing before starting another job.
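As an illustration of the job-level approach, the following minimal Python sketch assembles a bsub argument list that uses the post_done or post_err dependency keyword. The helper name build_dependent_submit, the job ID 1234, and the script name are hypothetical; only the bsub -w option and the two keywords come from the text above.

```python
from typing import List


def build_dependent_submit(prev_job_id: int, command: List[str],
                           on_error: bool = False) -> List[str]:
    """Build a bsub argument list that waits for post-execution processing.

    Hypothetical helper: it only assembles arguments and does not call LSF.
    """
    # post_done waits for post-execution processing to finish successfully;
    # post_err waits for it to finish with an error.
    keyword = "post_err" if on_error else "post_done"
    return ["bsub", "-w", f"{keyword}({prev_job_id})", *command]


if __name__ == "__main__":
    # 1234 and ./next_step.sh are placeholder values for illustration.
    print(" ".join(build_dependent_submit(1234, ["./next_step.sh"])))
```

The list form can be passed to subprocess.run on a host where bsub is installed; passing the dependency expression as a single list element avoids shell quoting of the parentheses.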

At the application level:

File: lsb.applications, lsb.params

Parameter and syntax: JOB_INCLUDE_POSTPROC=Y

Description:

  • Enables completion of post-execution processing before LSF reports a job finish status of DONE or EXIT

  • Prevents a new job from starting on a host until post-execution processing is finished on that host


  • sbatchd sends both job finish status (DONE or EXIT) and post-execution processing status (POST_DONE or POST_ERR) to mbatchd at the same time

  • The job remains in the RUN state and holds its job slot until post-execution processing has finished

  • Job requeue happens (if required) after completion of post-execution processing, not when the job itself finishes

  • For job history and job accounting, the job CPU and run times include the post-execution processing CPU and run times

  • The job control commands bstop, bkill, and bresume have no effect during post-execution processing

  • If a host becomes unavailable during post-execution processing for a rerunnable job, mbatchd sees the job as still in the RUN state and reruns the job

  • LSF does not preempt jobs during post-execution processing

Configuration to modify the post-execution processing time

Controlling the length of time post-execution processing can run is configured at the application level.

File: lsb.applications, lsb.params

Parameter and syntax: JOB_POSTPROC_TIMEOUT=minutes

Description:

  • Specifies the length of time, in minutes, that post-execution processing can run.

  • The specified value must be greater than zero.

  • If post-execution processing takes longer than the specified value, sbatchd reports post-execution failure (a status of POST_ERR) and kills the job’s post-execution processes. On UNIX, sbatchd kills the entire process group; on Windows, it kills the parent process only.

  • If JOB_INCLUDE_POSTPROC=Y and sbatchd kills the post-execution process group, post-execution processing CPU time is set to zero, and the job’s CPU time does not include post-execution CPU time.
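The timeout behavior described above can be approximated with a short Python sketch. This is not how sbatchd is implemented; it is a POSIX-only illustration of the general technique (run a command in its own process group, wait up to a limit, then kill the group). The function name run_post_exec is hypothetical, and mapping a non-zero exit code to POST_ERR is an assumption made for the example.

```python
import os
import signal
import subprocess
import sys


def run_post_exec(command, timeout_minutes):
    """Run a command with a time limit and return 'POST_DONE' or 'POST_ERR'.

    Hypothetical helper that mimics the documented behavior; POSIX only.
    """
    if timeout_minutes <= 0:
        raise ValueError("timeout must be greater than zero")
    # start_new_session=True puts the child in a new process group whose
    # ID equals the child's PID, so the whole group can be signalled.
    proc = subprocess.Popen(command, start_new_session=True)
    try:
        returncode = proc.wait(timeout=timeout_minutes * 60)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the killed child
        return "POST_ERR"
    # Assumption for this sketch: a non-zero exit also counts as POST_ERR.
    return "POST_DONE" if returncode == 0 else "POST_ERR"


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_post_exec(sleeper, 0.005))  # 0.005 minutes = 0.3 seconds
```

The sketch accepts fractional minutes only to keep the demonstration fast.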


Configuration to modify the pre- and post-execution processing user account

Specifying a user account under which the pre- and post-execution commands run is configured at the system level. By default, both the pre- and post-execution commands run under the account of the user who submits the job.

File: lsf.sudoers

Parameter and syntax: LSB_PRE_POST_EXEC_USER=user_name

Description:

  • Specifies the user account under which pre- and post-execution commands run (UNIX only)

  • This parameter applies only to pre- and post-execution commands configured at the queue level; pre-execution commands defined at the application or job level run under the account of the user who submits the job

  • If the pre-execution or post-execution commands perform privileged operations that require root permissions on UNIX hosts, specify a value of root

  • You must edit the lsf.sudoers file on all UNIX hosts within the cluster and specify the same user account
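Because the same account must be specified on every UNIX host, a consistency check can be useful. The following Python sketch assumes that the lsf.sudoers contents have already been collected from each host as text, and that the file uses NAME=VALUE lines with # comments. The function names, host names, and the account name lsfadmin in the example are hypothetical.

```python
def read_pre_post_exec_user(sudoers_text):
    """Return the LSB_PRE_POST_EXEC_USER value from lsf.sudoers text, or None."""
    for line in sudoers_text.splitlines():
        # Assumption: '#' starts a comment and settings are NAME=VALUE.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "LSB_PRE_POST_EXEC_USER":
            return value.strip()
    return None


def hosts_with_mismatched_user(sudoers_by_host, expected_user):
    """Return the sorted host names whose configured account differs."""
    return sorted(
        host
        for host, text in sudoers_by_host.items()
        if read_pre_post_exec_user(text) != expected_user
    )


if __name__ == "__main__":
    # Hypothetical hosts and file contents for illustration.
    collected = {
        "hostA": "LSB_PRE_POST_EXEC_USER=lsfadmin\n",
        "hostB": "# no setting on this host\n",
    }
    print(hosts_with_mismatched_user(collected, "lsfadmin"))
```

A host with no setting is reported as a mismatch, since its commands would run under the submitting user's account instead.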


Configuration to control how many times pre-execution retries

By default, if job pre-execution fails, LSF retries the job automatically. The job remains in the queue and pre-execution is retried up to 5 times, to minimize any impact on performance and throughput.

Limiting the number of times LSF retries job pre-execution is configured cluster-wide (lsb.params), at the queue level (lsb.queues), and at the application level (lsb.applications). Pre-execution retry configured in lsb.applications overrides lsb.queues, and lsb.queues overrides lsb.params.
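The override order can be expressed as a small Python sketch. The function name effective_retry_limit and its arguments are hypothetical; each argument stands for the value of one retry parameter as set in the corresponding file, or None when that file does not set it.

```python
from typing import Optional


def effective_retry_limit(params: Optional[int] = None,
                          queue: Optional[int] = None,
                          application: Optional[int] = None) -> Optional[int]:
    """Pick the retry limit that applies, following the documented order.

    lsb.applications overrides lsb.queues, which overrides lsb.params.
    Returns None when no level sets a value (the product default applies).
    """
    for value in (application, queue, params):
        if value is not None:
            if value <= 0:
                raise ValueError("retry limit must be an integer greater than 0")
            return value
    return None


if __name__ == "__main__":
    # Queue-level value wins over the cluster-wide value here.
    print(effective_retry_limit(params=10, queue=3))
```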


The following parameters are listed by configuration file, each with its syntax and behavior.

lsb.params

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.

lsb.queues

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.

lsb.applications

LOCAL_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the local cluster.

  • Specify an integer greater than 0

    By default, the number of retries is unlimited.

MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

  • Specify an integer greater than 0

    By default, the number of retries is 5.

REMOTE_MAX_PREEXEC_RETRY=integer

  • Controls the maximum number of times to attempt the pre-execution command of a job on the remote cluster.

    Equivalent to MAX_PREEXEC_RETRY

  • Specify an integer greater than 0

    By default, the number of retries is 5.


When pre-execution retry is configured, if a job pre-execution fails and exits with a non-zero value, the number of pre-exec retries is set to 1. When the pre-exec retry limit is reached, the job is suspended with PSUSP status.

The number of times that pre-execution is retried includes queue-level, application-level, and job-level pre-execution command specifications. When pre-execution retry is configured, a job is suspended when either of the following sums is greater than the value of the pre-execution retry parameter:

  • The queue-level pre-exec retry count plus the application-level pre-exec retry count

  • The queue-level pre-exec retry count plus the job-level pre-exec retry count
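The suspension rule above can be written as a single predicate. The following Python sketch is only a restatement of that rule; the function name should_suspend and its parameter names are hypothetical, and the counts are assumed to be tracked elsewhere.

```python
def should_suspend(queue_retries: int, app_retries: int,
                   job_retries: int, retry_limit: int) -> bool:
    """Return True when the documented rule says the job is suspended.

    The job is suspended when queue-level plus application-level retries,
    or queue-level plus job-level retries, exceed the configured limit.
    """
    return (queue_retries + app_retries > retry_limit
            or queue_retries + job_retries > retry_limit)


if __name__ == "__main__":
    # With a limit of 5, three queue-level and three job-level retries
    # sum to 6, which exceeds the limit.
    print(should_suspend(queue_retries=3, app_retries=0,
                         job_retries=3, retry_limit=5))
```

Note that application-level and job-level counts are never added to each other in this rule; each is compared only in combination with the queue-level count.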

The pre-execution retry limit is recovered when LSF is restarted and reconfigured. LSF replays the pre-execution retry limit in the PRE_EXEC_START or JOB_STATUS events in lsb.events.