Resizable jobs

Enabling resizable jobs allows jobs to dynamically use the number of slots available at any given time or release slots that are no longer needed.

About resizable jobs

Resizable job

To optimize resource utilization, LSF allows job allocation to shrink and grow during the job run time.

Use resizable jobs for long-tailed jobs: jobs that use a large number of processors for a period, but then use a smaller number of processors toward the end of the job.

Without resizable jobs, a job’s slot allocation is static from the time the job is dispatched until it finishes. As a result, resources are wasted even if you use reservation and backfill, because estimated run times can be inaccurate. With resizable jobs, slots can be added to a job when needed, or released when no longer needed, during the job’s run time.

Autoresizable job

An autoresizable job is a resizable job with a minimum and maximum slot request, where LSF automatically schedules and allocates additional resources to satisfy the job maximum request as the job runs.

Use autoresizable jobs for jobs in which tasks are easily parallelized: Each step or task can be made to run on a separate processor to achieve a faster result. The more resources the job gets, the faster the job can run. Session Scheduler jobs are very good candidates.

For autoresizable jobs, LSF automatically recalculates the pending allocation requests. The maximum pending allocation request is the maximum number of requested slots minus the number of allocated slots. Because the job is running and its previous minimum request is already satisfied, LSF is able to allocate additional slots to the running job. For instance, if a job requests a minimum of 4 and a maximum of 32 slots, and LSF allocates 20 slots to the job initially, its active pending allocation request is for another 12 slots. After LSF assigns another 4 slots, the pending allocation request is 8 slots.
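The arithmetic in this example can be reproduced with a small shell function. This is only an illustrative sketch: the function name pending_request is hypothetical and is not part of LSF, and clamping the result at zero is an assumption made here for safety rather than documented behavior.

```shell
#!/bin/sh
# pending_request MAX_SLOTS ALLOCATED_SLOTS
# Prints the size of the pending allocation request for an autoresizable job:
# the maximum number of requested slots minus the slots already allocated.
# (Hypothetical helper for illustration; not an LSF command.)
pending_request() {
    max_slots="$1"
    allocated="$2"
    pending=$(( max_slots - allocated ))
    # Assumption: never report a negative request.
    if [ "$pending" -lt 0 ]; then
        pending=0
    fi
    echo "$pending"
}
```

For the job described above, pending_request 32 20 prints 12, and after four more slots are assigned, pending_request 32 24 prints 8.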

Default behavior (feature not enabled)

Figure 1. Long-tailed: wasted slots

With resizable jobs enabled

Figure 2. Long-tailed: releasing resources (shrink)
Figure 3. Adding resources (grow)

Pending allocation request

A pending allocation request is an additional resource request attached to a resizable job. Only running jobs can have pending allocation requests. At any given time, a job only has one allocation request.

LSF creates a new pending allocation request and schedules it after a job physically starts on the remote host (after LSF receives the JOB_EXECUTE event from sbatchd) or after a resize notification command completes successfully.

Resize notification command

A resize notification command is an executable that is invoked on the first execution host of a job in response to an allocation (grow or shrink) event. It can be used to inform the running application of the allocation change. Because applications are implemented in many different ways, each resizable application may have its own notification command, provided by the application developer.

The notification command runs under the same user ID, environment, home directory, and working directory as the actual job. The standard input, output, and error of the program are redirected to the NULL device. If the notification command is not in the user's normal execution path (the $PATH variable), the full path name of the command must be specified.

A notification command exits with one of the following values:

LSB_RESIZE_NOTIFY_OK=0 

LSB_RESIZE_NOTIFY_FAIL=1

LSF sets these environment variables in the notification command environment. LSB_RESIZE_NOTIFY_OK indicates that notification succeeded. For allocation grow and shrink events, LSF then updates the job allocation to reflect the new allocation.

LSB_RESIZE_NOTIFY_FAIL indicates that notification failed. For an allocation "grow" event, LSF reschedules the pending allocation request. For an allocation "shrink" event, LSF fails the allocation release request.

For a list of other environment variables that apply to the resize notification command, see the environment variables reference documentation in this guide.
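The exit-value convention described above might be used in a notification command along the following lines. This is a minimal sketch, not a script supplied with LSF. LSB_RESIZE_NOTIFY_OK and LSB_RESIZE_NOTIFY_FAIL come from the description above; the LSB_RESIZE_EVENT and LSB_RESIZE_HOSTS variables are assumed here to carry the event type and the affected hosts and should be checked against the environment variables reference. APP_RESIZE_LOG and the function name resize_notify are hypothetical.

```shell
#!/bin/sh
# Sketch of a resize notification command body (hypothetical, for illustration).
# Because standard output and error go to the NULL device, the event is
# recorded in a file that the application is assumed to read.
resize_notify() {
    # Fall back to the documented values when run outside LSF.
    ok="${LSB_RESIZE_NOTIFY_OK:-0}"
    fail="${LSB_RESIZE_NOTIFY_FAIL:-1}"
    # APP_RESIZE_LOG is a hypothetical, application-defined location.
    log="${APP_RESIZE_LOG:-$HOME/resize_events.log}"

    case "${LSB_RESIZE_EVENT:-}" in
        grow|shrink)
            # Assumed variables: event type and host list, logged verbatim.
            if printf '%s %s\n' "$LSB_RESIZE_EVENT" "${LSB_RESIZE_HOSTS:-}" >> "$log"; then
                return "$ok"
            fi
            return "$fail"
            ;;
        *)
            # Unknown or missing event type: report notification failure.
            return "$fail"
            ;;
    esac
}
```

A standalone notification script would call resize_notify and then exit with its return value, so that LSF sees the success or failure value as the exit status of the command.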

Configuration to enable resizable jobs

The resizable jobs feature is enabled by defining an application profile using the RESIZABLE_JOBS parameter in lsb.applications.


Configuration file

Parameter and syntax

Behavior

lsb.applications

RESIZABLE_JOBS=Y|N|auto

  • When RESIZABLE_JOBS=Y, jobs submitted to the application profile are resizable.

  • When RESIZABLE_JOBS=auto, jobs submitted to the application profile are automatically resizable.

  • To enable cluster-wide resizable behavior by default, define RESIZABLE_JOBS=Y in the default application profile.

RESIZE_NOTIFY_CMD=notify_cmd

RESIZE_NOTIFY_CMD specifies an application-level resize notification command. The resize notification command is invoked on the first execution host of a running resizable job when a resize event occurs.

LSF sets appropriate environment variables to indicate the event type before running the notification command.
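As an illustration, an application profile that enables resizable jobs and sets an application-level notification command might look like the following fragment of lsb.applications. The profile name, description, and command path are hypothetical values chosen for this sketch; only the RESIZABLE_JOBS and RESIZE_NOTIFY_CMD parameters come from the description above.

```text
# lsb.applications (sketch; NAME, DESCRIPTION, and the command path are hypothetical)
Begin Application
NAME              = resizable_app
DESCRIPTION       = Example profile for resizable jobs
RESIZABLE_JOBS    = Y
RESIZE_NOTIFY_CMD = /share/apps/bin/notify_resize.sh
End Application
```

Setting RESIZABLE_JOBS = auto instead would make jobs submitted to this profile autoresizable.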


Configuration to modify resizable job behavior

There is no configuration to modify resizable job behavior.

Resizable job commands

Commands for submission


Command

Description

bsub -app application_profile_name

Submits the job to the specified application profile configured for resizable jobs

bsub -app application_profile_name -rnc resize_notification_command

Submits the job to the specified application profile configured for resizable jobs, with the specified resize notification command. The job-level resize notification command overrides the application-level RESIZE_NOTIFY_CMD setting.

bsub -ar -app application_profile_name

Submits the job to the specified application profile configured for resizable jobs, as an autoresizable job. The job-level -ar option overrides the application-level RESIZABLE_JOBS setting. For example, if the application profile is not autoresizable, the job-level bsub -ar option makes the job autoresizable.
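The three submission forms can be sketched as small shell wrappers. The profile name resizable_app, the notification command path, and the wrapper function names are hypothetical. The -n option with a min,max slot range is the usual bsub way to request a range of slots; it is not shown in the table above and is included here as an assumption about how the minimum and maximum are requested.

```shell
#!/bin/sh
# Sketch only: profile name, paths, and function names are hypothetical.

# Submit a resizable job to a profile configured with RESIZABLE_JOBS=Y.
submit_resizable() {
    bsub -app resizable_app -n 8 "$@"
}

# Same, with a job-level notification command that overrides
# the application-level RESIZE_NOTIFY_CMD setting.
submit_with_notify() {
    bsub -app resizable_app -rnc /share/apps/bin/notify_resize.sh -n 8 "$@"
}

# Submit as an autoresizable job with a minimum of 4 and a maximum of 32 slots.
# The job-level -ar option overrides the profile's RESIZABLE_JOBS setting.
submit_autoresizable() {
    bsub -ar -app resizable_app -n 4,32 "$@"
}
```

For example, submit_autoresizable ./my_job submits ./my_job (a placeholder for the real job command) as an autoresizable job.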


Commands to monitor


Command

Description

bacct -l

  • Displays resize notification command.

  • Displays resize allocation changes.

bhist -l

  • Displays resize notification command.

  • Displays resize allocation changes.

  • Displays the job-level autoresizable attribute.

bjobs -l

  • Displays resize notification command.

  • Displays resize allocation changes.

  • Displays the job-level autoresizable attribute.

  • Displays pending resize allocation requests.


Commands to control


Command

Description

bmod -ar | -arn

Add or remove the job-level autoresizable attribute. bmod only updates the autoresizable attribute for pending jobs.

bmod -rnc resize_notification_cmd | -rncn

Modify or remove the resize notification command for a submitted job.

bresize release

Release allocated resources from a running resizable job.

  • Release all slots except one slot from the first execution node.

  • Release all hosts except the first execution node.

  • Release a list of hosts, with the option to specify slots to release on each host.

  • Specify a resize notification command to be invoked on the first execution host of the job.

Example:

bresize release "1*hostA 2*hostB hostC" 221

To release resources from a running job, the job must be submitted to an application profile configured as resizable.

  • By default, only cluster administrators, queue administrators, root and the job owner are allowed to run bresize to change job allocations.

  • User group administrators are allowed to run bresize to change the allocation of jobs within their user groups.

bresize cancel

Cancel a pending allocation request. If the job does not have an active pending request, the command fails with an error message.

bresize release -rnc resize_notification_cmd

Specify or remove a resize notification command. The resize notification command is invoked on the first execution host of the job. It applies only to this release request and overrides the corresponding resize notification settings defined at the application level (RESIZE_NOTIFY_CMD in lsb.applications) and at the job level (bsub -rnc notify_cmd).

If the resize notification command completes successfully, LSF considers the allocation release done and updates the job allocation. If the resize notification command fails, LSF does not update the job allocation.

The resize_notification_cmd specifies the name of the executable to be invoked on the first execution host when the job's allocation has been modified.

The resize notification command runs under the user account that submitted the job.

-rncn overrides the resize notification command at both the job level and the application level for this bresize request.

bresize release -c

By default, if the job has an active pending allocation request, LSF does not allow users to release resources. Use the bresize release -c command to cancel the active pending resource request when releasing slots from the existing allocation. By default, the command only releases slots.

If a job still has an active pending allocation request, but you do not want to allocate more resources to the job, use the bresize cancel command to cancel the allocation request.

Only the job owner, cluster administrators, queue administrators, user group administrators, and root are allowed to cancel pending resource allocation requests.
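The difference between the two commands can be sketched as two shell wrappers. The function names are hypothetical, and the argument order (host specification followed by job ID) follows the bresize release example shown earlier; treat this as an illustration rather than a complete reference for bresize syntax.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# release_slots JOB_ID HOST_SPEC
# Shrinks a running resizable job. The -c option cancels any active
# pending allocation request as part of the release.
release_slots() {
    bresize release -c "$2" "$1"
}

# stop_growing JOB_ID
# Keeps the current allocation but cancels the pending allocation request.
stop_growing() {
    bresize cancel "$1"
}
```

For example, release_slots 221 "1*hostA 2*hostB hostC" releases the listed slots from job 221 even if a pending allocation request is active, while stop_growing 221 leaves the allocation unchanged.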


Commands to display configuration


Command

Description

bapp

Displays the value of parameters defined in lsb.applications.