Running Parallel Jobs
Contents
- How LSF Runs Parallel Jobs
- Preparing Your Environment to Submit Parallel Jobs to LSF
- Submitting Parallel Jobs
- Starting Parallel Tasks with LSF Utilities
- Job Slot Limits For Parallel Jobs
- Specifying a Minimum and Maximum Number of Processors
- Specifying a First Execution Host
- Controlling Processor Allocation Across Hosts
- Controlling Job Locality using Compute Units
- Running Parallel Processes on Homogeneous Hosts
- Limiting the Number of Processors Allocated
- Reserving Processors
- Reserving Memory for Pending Parallel Jobs
- Backfill Scheduling: Allowing Jobs to Use Reserved Job Slots
- Parallel Fairshare
- How Deadline Constraint Scheduling Works For Parallel Jobs
- Optimized Preemption of Parallel Jobs
- Processor Binding for Parallel Jobs
- Making Job Allocations Resizable
How LSF Runs Parallel Jobs
When LSF runs a job, the LSB_HOSTS variable is set to the names of the hosts running the batch job. For a parallel batch job, LSB_HOSTS contains the complete list of hosts that LSF has allocated to that job.
LSF starts one controlling process for the parallel batch job on the first host in the host list. It is up to your parallel application to read the LSB_HOSTS environment variable to get the list of hosts, and start the parallel job components on all the other allocated hosts.
LSF provides a generic interface to parallel programming packages so that any parallel package can be supported by writing shell scripts or wrapper programs.
Preparing Your Environment to Submit Parallel Jobs to LSF
Getting the host list
Some applications can take this list of hosts directly as a command line parameter. For other applications, you may need to process the host list.
Example
The following example shows a /bin/sh script that processes all the hosts in the host list, including identifying the host where the job script is executing.

#!/bin/sh
# Process the list of host names in LSB_HOSTS
for host in $LSB_HOSTS ; do
    handle_host $host
done

Parallel job scripts
Each parallel programming package has different requirements for specifying and communicating with all the hosts used by a parallel job. LSF is not tailored to work with a specific parallel programming package. Instead, LSF provides a generic interface so that any parallel package can be supported by writing shell scripts or wrapper programs.
You can modify these scripts to support more parallel packages.
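For example, a wrapper script might convert the LSB_HOSTS list into the machine-file format that a package's launcher expects. The following is a minimal sketch under that assumption; mylauncher and its -machinefile option are hypothetical placeholders for your package's actual launcher:

#!/bin/sh
# Write one machine-file line per job slot that LSF allocated, then
# start the application through the package's own (hypothetical) launcher.
MACHINEFILE=machines.$LSB_JOBID
for host in $LSB_HOSTS ; do
    echo $host
done > $MACHINEFILE
mylauncher -machinefile $MACHINEFILE "$@"

A script like this can be submitted directly with bsub -n, or configured as a queue-level job starter as described in the next section.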
For more information, see Submitting Parallel Jobs.
Use a job starter
You can configure the script into your queue as a job starter, and then all users can submit parallel jobs without having to type the script name. See Queue-Level Job Starters for more information about job starters.
- To see if your queue already has a job starter defined, run bqueues -l.

Submitting Parallel Jobs
LSF can allocate more than one host or processor to run a job, and automatically keeps track of the job status while a parallel job is running.
Specify the number of processors
When submitting a parallel job that requires multiple processors, you can specify the exact number of processors to use.
- To submit a parallel job, use bsub -n and specify the number of processors the job requires.
- To submit jobs based on the number of available job slots instead of the number of processors, use PARALLEL_SCHED_BY_SLOT=Y in lsb.params.

For example:
bsub -n 4 myjob
submits myjob as a parallel job. The job is started when 4 job slots are available.
tip: When PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the resource requirement string keyword ncpus refers to the number of slots instead of the number of processors; however, lshosts output continues to show ncpus as defined by EGO_DEFINE_NCPUS in lsf.conf.

Starting Parallel Tasks with LSF Utilities
For simple parallel jobs you can use LSF utilities to start parts of the job on other hosts. Because LSF utilities handle signals transparently, LSF can suspend and resume all components of your job without additional programming.
Running parallel tasks with lsgrun
The simplest parallel job runs an identical copy of the executable on every host. The lsgrun command takes a list of host names and runs the specified task on each host. The lsgrun -p command specifies that the task should be run in parallel on each host.

Example
This example submits a job that uses lsgrun to run myjob on all the selected hosts in parallel:

bsub -n 10 'lsgrun -p -m "$LSB_HOSTS" myjob'
Job <3856> is submitted to default queue <normal>.

For more complicated jobs, you can write a shell script that runs lsrun in the background to start each component, as in the sketch below.
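A minimal sketch of such a script, where mycomponent stands in for one component of your application:

#!/bin/sh
# Start one copy of (the hypothetical) mycomponent per allocated host
# with lsrun in the background, then wait for all of them to finish.
for host in $LSB_HOSTS ; do
    lsrun -m $host mycomponent &
done
wait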
Running parallel tasks with the blaunch distributed application framework

Most MPI implementations and many distributed applications use rsh and ssh as their task-launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

Similar to the lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.
important: You cannot run blaunch directly from the command line. blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. It cannot be used as a standalone command. On success, blaunch exits with 0.

Windows: blaunch is supported on Windows 2000 or later, with the following exceptions:
- Only the following signals are supported: SIGKILL, SIGSTOP, SIGCONT.
- The -n option is not supported.
- CMD.EXE /C <user command line> is used as an intermediate command shell when -no-shell is not specified.
- CMD.EXE /C is not used when -no-shell is specified.
- Windows Vista User Account Control must be configured correctly to run jobs.
See Using Platform LSF HPC for more information about using the blaunch distributed application framework.

Submitting jobs with blaunch
Use bsub to call blaunch, or to invoke a job script that calls blaunch. The blaunch command assumes that bsub -n implies one remote task per job slot. The examples below show common submission forms, followed by a sketch of a job script.
- Submit a parallel job:
bsub -n 4 blaunch myjob

- Submit a parallel job to launch tasks on a specific host:

bsub -n 4 blaunch hostA myjob

- Submit a job with a host list:

bsub -n 4 blaunch -z "hostA hostB" myjob

- Submit a job with a host file:

bsub -n 4 blaunch -u ./hostfile myjob

- Submit a job to an application profile:

bsub -n 4 -app pjob blaunch myjob
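In practice, blaunch is usually invoked from a job script rather than typed on the bsub command line; a minimal sketch, where mytask and the script name are hypothetical:

#!/bin/sh
# myjobscript.sh -- runs on the first execution host as the controlling
# process. blaunch starts one instance of mytask per allocated job slot
# on the hosts LSF assigned to this job, and waits for them to finish.
blaunch mytask

Submitting the script with bsub -n 8 ./myjobscript.sh starts 8 remote tasks.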
Job Slot Limits For Parallel Jobs

A job slot is the basic unit of processor allocation in LSF. A sequential job uses one job slot. A parallel job that has N components (tasks) uses N job slots, which can span multiple hosts.

By default, running and suspended jobs count against the job slot limits for queues, users, hosts, and processors that they are associated with.
With processor reservation, job slots reserved by pending jobs also count against all job slot limits.
When backfilling occurs, the job slots used by backfill jobs count against the job slot limits for queues and users, but not for hosts or processors. This means that when a pending job and a running job occupy the same physical job slot on a host, both jobs count towards the queue limit, but only the pending job counts towards the host limit.
Specifying a Minimum and Maximum Number of Processors
By default, when scheduling a parallel job, the number of slots allocated on each host does not exceed the number of CPUs on that host, even if the host's MXJ is set greater than the number of CPUs. When submitting a parallel job, you can also specify a minimum and a maximum number of processors.
If you specify a maximum and minimum number of processors, the job starts as soon as the minimum number of processors is available, but it uses up to the maximum number of processors, depending on how many processors are available at the time. Once the job starts running, no more processors are allocated to it even though more may be available later on.
Jobs that request fewer processors than the minimum PROCLIMIT defined for the queue or application profile to which the job is submitted, or more processors than the maximum PROCLIMIT, are rejected. If the job requests minimum and maximum processors, the maximum requested cannot be less than the minimum PROCLIMIT, and the minimum requested cannot be more than the maximum PROCLIMIT.
If PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the job specifies a maximum and minimum number of job slots instead of processors. LSF ignores the number-of-CPUs constraint during parallel job scheduling and schedules based only on slots.

If PARALLEL_SCHED_BY_SLOT is not defined for a resizable job, individual allocation requests are constrained by the number of CPUs during scheduling. However, the final resizable job allocation may not agree. For example, an autoresizable job that requests 1 to 4 slots on a host with 2 CPUs and 4 slots will eventually use up to 4 slots.
Syntax
bsub -n min_proc[,max_proc]
Example
bsub -n 4,16 myjob
At most, 16 processors can be allocated to this job. If fewer than 16 processors are eligible to run the job, the job can still be started as long as the number of eligible processors is greater than or equal to 4.
Specifying a First Execution Host
In general, the first execution host satisfies certain resource requirements that might not be present on other available hosts.
By default, LSF selects the first execution host dynamically according to the resource availability and host load for a parallel job. Alternatively, you can specify one or more first execution host candidates so that LSF selects one of the candidates as the first execution host.
When a first execution host is specified to run the first task of a parallel application, LSF does not include the first execution host or host group in a job resize allocation request.
Specify a first execution host
To specify one or more hosts, host groups, or compute units as first execution host candidates, add the (!) symbol after the host name. You can specify first execution host candidates at job submission, or in the queue definition.
Job level
- Use the -m option of bsub:

bsub -n 32 -m "hostA! hostB hostgroup1! hostC" myjob

The scheduler selects either hostA or a host defined in hostgroup1 as the first execution host, based on the job's resource requirements and host availability.
- In a MultiCluster environment, insert the (!) symbol after the cluster name, as shown in the following example:
bsub -n 2 -m "host2@cluster2! host3@cluster2" my_parallel_job
Queue level
The queue-level specification of first execution host candidates applies to all jobs submitted to the queue.
- Specify the first execution host candidates in the list of hosts in the HOSTS parameter in lsb.queues:

HOSTS = hostA! hostB hostgroup1! hostC

Rules
Follow these guidelines when you specify first execution host candidates:
- If you specify a host group or compute unit, you must first define the host group or compute unit in the file lsb.hosts.
- Do not specify a dynamic host group as a first execution host.
- Do not specify "all," "allremote," or "others," or a host partition as a first execution host.
- Do not specify a preference (+) for a host identified by (!) as a first execution host candidate.
- For each parallel job, specify enough regular hosts to satisfy the CPU requirement for the job. Once LSF selects a first execution host for the current job, the other first execution host candidates:
  - Become unavailable to the current job
  - Remain available to other jobs as either regular or first execution hosts
- You cannot specify first execution host candidates when you use the brun command.
If the first execution host is incorrect at job submission, the job is rejected. If incorrect configurations exist at the queue level, warning messages are logged and displayed when LSF starts, restarts, or is reconfigured.
Job chunking
Specifying first execution host candidates affects job chunking. For example, the following jobs have different job requirements, and are not placed in the same job chunk:
bsub -n 2 -m "hostA! hostB hostC" myjob
bsub -n 2 -m "hostA hostB hostC" myjob
bsub -n 2 -m "hostA hostB! hostC" myjob
The requirements of each job in this example are:
- Job 1 must start on hostA
- Job 2 can start and run on hostA, hostB, or hostC
- Job 3 must start on hostB
For job chunking, all jobs must request the same hosts and the same first execution hosts (if specified). Jobs that specify a host preference must all specify the same preference.

Resource reservation
If you specify first execution host candidates at the job or queue level, LSF tries to reserve a job slot on the first execution host. If LSF cannot reserve a first execution host job slot, it does not reserve slots on any other hosts.
Compute units
If compute unit resource requirements are used, the compute unit containing the first execution host is given priority:

bsub -n 64 -m "hg! cu1 cu2 cu3 cu4" -R "cu[pref=config]" myjob

In this example, the first execution host is selected from the host group hg. Next in the job's allocation list are any appropriate hosts from the same compute unit as the first execution host. Finally, remaining hosts are grouped by compute unit, with compute unit groups appearing in the same order as in the ComputeUnit section of lsb.hosts.

Compound resource requirements
If compound resource requirements are being used, the resource requirements specific to the first execution host should appear first:
bsub -m "hostA! hg12" -R "1*{select[type==X86_64]rusage[licA=1]} + {select[type==any]}" myjob
In this example the first execution host must satisfy:
select[type==X86_64]rusage[licA=1]
Controlling Job Locality using Compute Units
Compute units are groups of hosts laid out by the LSF administrator and configured to mimic the network architecture, minimizing communications overhead for optimal placement of parallel jobs. Different granularities of compute units provide the flexibility to configure an extensive cluster accurately and run larger jobs over larger compute units.
Resource requirement keywords within the compute unit section can be used to allocate resources throughout compute units in a manner analogous to host resource allocation. Compute units then replace hosts as the basic unit of allocation for a job.
High performance computing clusters running large parallel jobs spread over many hosts benefit from using compute units. Communications bottlenecks within the network architecture of a large cluster can be isolated through careful configuration of compute units. By using compute units instead of hosts as the basic allocation unit, scheduling policies can be applied on a large scale.
tip:
Configure each individual host as a compute unit to use the compute unit features for host level job allocation.
[Figure: hosts grouped into enclosure compute units, which are in turn grouped into rack compute units]

As the figure indicates, two types of compute units have been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= enclosure! rack
The ! indicates the default compute unit type. The first type listed (enclosure) is the finest granularity and the only type of compute unit that can contain hosts and host groups. The coarser-granularity rack compute units can only contain enclosures.

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows (some lines omitted):

Begin ComputeUnit
NAME       MEMBER           CONDENSED TYPE
enclosure1 (host1[01-16])   Y         enclosure
...
enclosure8 (host8[01-16])   Y         enclosure
rack1      (enclosure[1-2]) Y         rack
rack2      (enclosure[3-4]) Y         rack
rack3      (enclosure[5-6]) Y         rack
rack4      (enclosure[7-8]) Y         rack
End ComputeUnit

This example defines 12 compute units, all of which have condensed output:
- enclosure1 through enclosure8 are the finest granularity, and each contains 16 hosts.
- rack1, rack2, rack3, and rack4 are the coarsest granularity, and each contains 2 enclosures.

Syntax
The cu string supports the following syntax:

cu[balance]
All compute units used for this job should contribute the same number of slots (to within one slot). Provides a balanced allocation over the fewest possible compute units.

cu[pref=config]
Compute units for this job are considered in the order they appear in the lsb.hosts configuration file. This is the default value.

cu[pref=minavail]
Compute units with the fewest available slots are considered first for this job. Useful for smaller jobs (both sequential and parallel) since this reduces fragmentation of compute units, leaving whole compute units free for larger jobs.

cu[pref=maxavail]
Compute units with the most available slots are considered first for this job.

cu[maxcus=number]
Maximum number of compute units the job can run across.

cu[usablecuslots=number]
All compute units used for this job should contribute the same minimum number of slots. At most, the final allocated compute unit can contribute fewer than number slots.

cu[type=cu_type]
Type of compute unit being used, where cu_type is one of the types defined by COMPUTE_UNIT_TYPES in lsb.params. The default is the compute unit type listed first in lsb.params.

cu[excl]
Compute units used exclusively for the job. Must be enabled by EXCLUSIVE in lsb.queues.

Continuing with the example shown above, assume lsb.queues contains the parameter definition EXCLUSIVE=CU[rack] and that the slots available for each compute unit are shown under MAX in the condensed display from bhosts, where HOST_NAME refers to the compute unit (enclosure7's MAX is shown as 64, consistent with rack4's MAX of 128 and the allocations in the examples below):

HOST_NAME   STATUS  JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV
enclosure1  ok      -     64   34     34   0      0      0
enclosure2  ok      -     64   54     54   0      0      0
enclosure3  ok      -     64   46     46   0      0      0
enclosure4  ok      -     64   44     44   0      0      0
enclosure5  ok      -     64   45     45   0      0      0
enclosure6  ok      -     64   44     44   0      0      0
enclosure7  ok      -     64   0      0    0      0      0
enclosure8  ok      -     64   0      0    0      0      0
rack1       ok      -     128  88     88   0      0      0
rack2       ok      -     128  90     90   0      0      0
rack3       ok      -     128  89     89   0      0      0
rack4       ok      -     128  0      0    0      0      0

Based on the 12 configured compute units, jobs can be submitted with a variety of compute unit requirements.
Using compute units
bsub -R "cu[]" -n 64 ./app
- This job is restricted to compute units of the default type enclosure. The default pref=config applies, with compute units considered in configuration order. The job runs on 30 slots in enclosure1, 10 slots in enclosure2, 18 slots in enclosure3, and 6 slots in enclosure4 for a total of 64 slots.
- Compute units can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job.
bsub -R "cu[pref=minavail]" -n 32 ./app
- This job is restricted to compute units of the default type enclosure in the order pref=minavail. Compute units with the fewest free slots are considered first. The job runs on 10 slots in enclosure2, 18 slots in enclosure3, and 4 slots in enclosure5 for a total of 32 slots.

bsub -R "cu[pref=maxavail]" -n 64 ./app
- This job is restricted to compute units of the default type enclosure in the order pref=maxavail. Compute units with the most free slots are considered first. The job runs on 64 slots in enclosure8.

Localized allocations
Jobs can be run over a limited number of compute units using the maxcus keyword.
bsub -R "cu[pref=maxavail:maxcus=1]" ./app
- This job is restricted to a single enclosure, and compute units with the most free slots are considered first. The job requirements are satisfied by enclosure8, which has 64 free slots.

bsub -n 64 -R "cu[maxcus=3]" ./app
- This job requires a total of 64 slots over 3 enclosures or fewer. Compute units are considered in configuration order. The job requirements are satisfied by the following allocation:
Balanced slot allocations
Balanced allocations split jobs evenly between compute units, which increases the efficiency of some applications.
bsub -n 80 -R "cu[balance:maxcus=4]" ./app
- This job requires a balanced allocation over the fewest possible compute units of type enclosure (the default type), with a total of 80 slots. Since none of the configured enclosures has 80 slots, 2 compute units with 40 slots each are used, satisfying the maxcus requirement to use 4 compute units or fewer.
- The keyword pref is not included, so the default order of pref=config is used. The job requirements are satisfied by 40 slots on both enclosure7 and enclosure8 for a total of 80 slots.

bsub -n 64 -R "cu[balance:type=rack:pref=maxavail]" ./app
- This job requires a balanced allocation over the fewest possible compute units of type rack, with a total of 64 slots. Compute units with the most free slots are considered first, in the order rack4, rack1, rack3, rack2. The job requirements are satisfied by rack4.

bsub -n "40,80" -R "cu[balance:type=rack:pref=minavail]" ./app
- This job requires a balanced allocation over compute units of type rack, with a range of 40 to 80 slots. Only the minimum number of slots is considered when a range is specified along with the keyword balance, so the job needs 40 slots. Compute units with the fewest free slots are considered first.
- Because balance uses the fewest possible compute units, racks with 40 or more free slots are considered first, namely rack1 and rack4. The rack with the fewest available slots is then selected, and all job requirements are satisfied by rack1.

Balanced host allocations
Using balance and ptile together within the requirement string results in a balanced host allocation over compute units, and the same number of slots from each host. The final host may provide fewer slots if required.
bsub -n 64 -R "cu[balance] span[ptile=4]" ./app
- This job requires a balanced allocation over the fewest possible compute units of type enclosure, with a total of 64 slots. Each host used must provide 4 slots. Since enclosure8 has 64 slots available over 16 hosts (4 slots per host), it satisfies the job requirements.
- Had enclosure8 not satisfied the requirements, other possible allocations in order of consideration (fewest compute units first) include:
Minimum slot allocations
Minimum slot allocations result in jobs spreading over fewer compute units, and ignoring compute units with few hosts available.
bsub -n 45 -R "cu[usablecuslots=10:pref=minavail]" ./app
- This job requires an allocation of at least 10 slots in each enclosure, except possibly the last one. Compute units with the fewest free slots are considered first. The requirements are satisfied by a slot allocation of:
bsub -n "1,140" -R "cu[usablecuslots=20]" ./app
- This job requires an allocation of at least 20 slots in each enclosure, except possibly the last one. Compute units are considered in configuration order, and as close to 140 slots as possible are allocated. The requirements are satisfied by an allocation of 140 slots, where only the last compute unit has fewer than 20 slots allocated, as follows:
Exclusive compute unit jobs
Because EXCLUSIVE=CU[rack] in lsb.queues, jobs may use compute units of type rack or the finer granularity type enclosure exclusively. Exclusive jobs lock all compute units they run in, even if not all slots are being used by the job. Running compute unit exclusive jobs minimizes communications slowdowns resulting from shared network bandwidth.
bsub -R "cu[excl:type=enclosure]" ./app
- This job requires exclusive use of an enclosure, with compute units considered in configuration order. The first enclosure not running any jobs is enclosure7.
- Using excl with usablecuslots, the job avoids compute units where a large portion of the hosts are unavailable.

bsub -n 90 -R "cu[excl:usablecuslots=12:type=enclosure]" ./app
- This job requires exclusive use of compute units, and will not use a compute unit if fewer than 12 slots are available. Compute units are considered in configuration order. In this case the job requirements are satisfied by 64 slots in enclosure7 and 26 slots in enclosure8.

bsub -R "cu[excl:type=rack]" ./app

- This job requires exclusive use of a rack, with compute units considered in configuration order. The only rack not running any jobs is rack4.

Reservation
Compute unit constraints such as the keywords maxcus, balance, and excl can result in inaccurately predicted start times from default LSF resource reservation. Time-based resource reservation provides a more accurate predicted start time for pending jobs. When calculating a time-based predicted start time for a job, LSF considers job scheduling constraints and requirements, including, for example, job topology and resource limits.

Host-level compute units
Configuring each individual host as a compute unit allows you to use the compute unit features for host-level job allocation. Consider an example where one type of compute unit has been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= host!

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows:

Begin ComputeUnit
NAME MEMBER TYPE
h1   host1  host
h2   host2  host
...
h50  host50 host
End ComputeUnit

Each configured compute unit of default type host contains a single host.

Ordering host allocations
Using the compute unit keyword pref, hosts can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job:
bsub -R "cu[]" ./app
- Compute units of default type host, each containing a single host, are considered in configuration order.

bsub -R "cu[pref=minavail]" ./app
- Compute units of default type host each contain a single host. Compute units with the fewest free slots are considered first.

bsub -n 20 -R "cu[pref=maxavail]" ./app
- Compute units of default type host each contain a single host. Compute units with the most free slots are considered first. A total of 20 slots are allocated for this job.

Limiting hosts in allocations
Using the compute unit keyword maxcus, the maximum number of hosts allocated to a job can be set:
bsub -n 12 -R "cu[pref=maxavail:maxcus=3]" ./app
- Compute units of default type host each contain a single host. Compute units with the most free slots are considered first. This job requires an allocation of 12 slots over at most 3 hosts.

Balanced slot allocations
Using the compute unit keyword balance, jobs can be evenly distributed over hosts:
bsub -n 9 -R "cu[balance]" ./app
- Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are:
bsub -n 9 -R "cu[balance:maxcus=3]" ./app
- Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are 1 host with 9 slots, 2 hosts with 4 and 5 slots, or 3 hosts with 3 slots each.

Minimum slot allocations
Using the compute unit keyword usablecuslots, hosts are only considered if they have a minimum number of slots free and usable for this job:
bsub -n 16 -R "cu[usablecuslots=4]" ./app
- Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available and not occupied by a running job are considered. Each host (except possibly the last host allocated) must contribute at least 4 slots to the job.

bsub -n 16 -R "rusage[mem=1000] cu[usablecuslots=4]" ./app
- Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available, not occupied by a running job, and with 1000 memory units are considered. A host with 10 slots and 2000 units of memory, for example, will only have 2 slots free that satisfy the memory requirements of this job.

Controlling Processor Allocation Across Hosts
Sometimes you need to control how the selected processors for a parallel job are distributed across the hosts in the cluster.
You can control this at the job level or at the queue level. The queue specification is ignored if your job specifies its own locality.
Specifying parallel job locality at the job level
By default, LSF allocates the required processors for the job from the available set of processors.
A parallel job may span multiple hosts, with a specifiable number of processes allocated to each host. A job may be scheduled on to a single multiprocessor host to take advantage of its efficient shared memory, or spread out on to multiple hosts to take advantage of their aggregate memory and swap space. Flexible spanning may also be used to achieve parallel I/O.
You can specify "select all the processors for this parallel batch job on the same host" or "do not choose more than n processors on one host" by using the span section in the resource requirement string (bsub -R or RES_REQ in the queue definition in lsb.queues).

If PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the span string is used to control the number of job slots instead of processors.

Syntax
The span string supports the following syntax:

span[hosts=1]
Indicates that all the processors allocated to this job must be on the same host.

span[ptile=value]
Indicates the number of processors on each host that should be allocated to the job, where value is one of the following:
- Default ptile value, specified by n processors. In the following example, the job requests 4 processors on each available host, regardless of how many processors the host has:
  span[ptile=4]
- Predefined ptile value, specified by '!'. The following example uses the predefined maximum job slot limit in lsb.hosts (MXJ per host type/model) as its value:
  span[ptile='!']
  tip: If the host or host type/model does not define MXJ, the default predefined ptile value is 1.
- Predefined ptile value with optional multiple ptile values, per host type or host model:
  - For host type, you must specify same[type] in the resource requirement. In the following example, the job requests 8 processors on a host of type HP or SGI, 2 processors on a host of type LINUX, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types:
    span[ptile='!',HP:8,SGI:8,LINUX:2] same[type]
  - For host model, you must specify same[model] in the resource requirement. In the following example, the job requests 4 processors on hosts of model PC1133, 2 processors on hosts of model PC233, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host models:
    span[ptile='!',PC1133:4,PC233:2] same[model]

span[hosts=-1]
Disables the span setting in the queue. LSF allocates the required processors for the job from the available set of processors.
Specifying multiple ptile values
In a span string with multiple ptile values, you must specify a predefined default value (ptile='!') and either host model or host type.

You can specify both type and model in the same section in the resource requirement string, but the ptile values must be the same type.

If you specify same[type:model], you cannot specify a predefined ptile value (!) in the span section.

restriction: Under bash 3.0, the exclamation mark (!) is not interpreted correctly by the shell. To use the predefined ptile value (ptile='!'), use the +H option to disable '!' style history substitution in bash (sh +H).

The following span strings are valid:

same[type:model] span[ptile=LINUX:2,SGI:4]
LINUX and SGI are both host types and can appear in the same span string.

same[type:model] span[ptile=PC233:2,PC1133:4]
PC233 and PC1133 are both host models and can appear in the same span string.

You cannot mix host model and host type in the same span string. The following span strings are not correct:

span[ptile='!',LINUX:2,PC1133:4] same[model]
span[ptile='!',LINUX:2,PC1133:4] same[type]
The LINUX host type and PC1133 host model cannot appear in the same span string.

Multiple ptile values for a host type
For host type, you must specify same[type] in the resource requirement. For example:

span[ptile='!',HP:8,SGI:8,LINUX:2] same[type]

The job requests 8 processors on a host of type HP or SGI, 2 processors on a host of type LINUX, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types.

Multiple ptile values for a host model
For host model, you must specify same[model] in the resource requirement. For example:

span[ptile='!',PC1133:4,PC233:2] same[model]

The job requests 4 processors on hosts of model PC1133, 2 processors on hosts of model PC233, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host models.

Examples
bsub -n 4 -R "span[hosts=1]" myjob
Runs the job on a host that has at least 4 processors currently eligible to run the 4 components of this job.
bsub -n 4 -R "span[ptile=2]" myjob
Runs the job on 2 hosts, using 2 processors on each host. Each host may have more than 2 processors available.
bsub -n 4 -R "span[ptile=3]" myjob
Runs the job on 2 hosts, using 3 processors on the first host and 1 processor on the second host.
bsub -n 4 -R "span[ptile=1]" myjob
Runs the job on 4 hosts, even though some of the 4 hosts may have more than one processor currently available.
bsub -n 4 -R "type==any same[type] span[ptile='!',LINUX:2,SGI:4]" myjob
Submits myjob to request 4 processors running on 2 hosts of type LINUX (2 processors per host), or a single host of type SGI, or, for other host types, the predefined maximum job slot limit in lsb.hosts (MXJ).

bsub -n 16 -R "type==any same[type] span[ptile='!',HP:8,SGI:8,LINUX:2]" myjob
Submits myjob to request 16 processors on 2 hosts of type HP or SGI (8 processors per host), or on 8 hosts of type LINUX (2 processors per host), or the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types.

bsub -n 4 -R "same[model] span[ptile='!',PC1133:4,PC233:2]" myjob
Submits myjob to request a single host of model PC1133 (4 processors), or 2 hosts of model PC233 (2 processors per host), or the predefined maximum job slot limit in lsb.hosts (MXJ) for other host models.

Specifying parallel job locality at the queue level
The queue may also define the locality for parallel jobs using the RES_REQ parameter.
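For example, a queue that restricts every job it dispatches to a single host might be defined along the following lines in lsb.queues (a sketch; the queue name and priority are arbitrary):

Begin Queue
QUEUE_NAME = smp_jobs
PRIORITY = 30
RES_REQ = "span[hosts=1]"
DESCRIPTION = All processors for a job allocated on one host
End Queue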
Running Parallel Processes on Homogeneous Hosts
Parallel jobs run on multiple hosts. If your cluster has heterogeneous hosts, some processes from a parallel job may, for example, run on Solaris and some on SGI IRIX. However, for performance reasons you may want all processes of a job to run on the same type of host instead of having some processes run on one type of host and others on another type of host.
You can use the same section in the resource requirement string to indicate to LSF that processes are to run on one type or model of host. You can also use a custom resource to define the criteria for homogeneous hosts.

Examples
Running all parallel processes on the same host type
bsub -n 4 -R"select[type==SGI6 || type==SOL7] same[type]" myjob
Allocate 4 processors on the same host type: either SGI IRIX or Solaris 7, but not both.
Running all parallel processes on the same host type and model
bsub -n 6 -R"select[type==any] same[type:model]" myjob
Allocate 6 processors on any host type or model as long as all the processors are on the same host type and model.
Running all parallel processes on hosts in the same high-speed connection group
bsub -n 12 -R "select[type==any && (hgconnect==hg1 || hgconnect==hg2 || hgconnect==hg3)] same[hgconnect:type]" myjob
For performance reasons, you want to have LSF allocate 12 processors on hosts in high-speed connection group hg1, hg2, or hg3, but not across hosts in hg1, hg2, or hg3 at the same time. You also want the hosts that are chosen to be of the same host type.

This example reflects a network in which network connections among hosts in the same group are high-speed, and network connections between host groups are low-speed.
In order to specify this, you create a custom resource hgconnect in lsf.shared.

Begin Resource
RESOURCENAME TYPE   INTERVAL INCREASING RELEASE DESCRIPTION
hgconnect    STRING ()       ()         ()      (OS release)
...
End Resource

In the lsf.cluster.cluster_name file, identify groups of hosts that share high-speed connections.

Begin ResourceMap
RESOURCENAME LOCATION
hgconnect    (hg1@[hostA hostB] hg2@[hostD hostE] hg3@[hostF hostG hostX])
End ResourceMap

If you want to specify the same resource requirement at the queue level, define a custom resource in lsf.shared as in the previous example, map hosts to high-speed connection groups in lsf.cluster.cluster_name, and define the following queue in lsb.queues:

Begin Queue
QUEUE_NAME = My_test
PRIORITY = 30
NICE = 20
RES_REQ = "select[mem > 1000 && type==any && (hgconnect==hg1 || hgconnect==hg2 || hgconnect==hg3)]same[hgconnect:type]"
DESCRIPTION = either hg1 or hg2 or hg3
End Queue

This example allocates processors on hosts that:
- Have more than 1000 MB in memory
- Are of the same host type
- Are in high-speed connection group hg1, hg2, or hg3
Limiting the Number of Processors Allocated
Use the PROCLIMIT parameter in lsb.queues or lsb.applications to limit the number of processors that can be allocated to a parallel job.
- Syntax
- How PROCLIMIT affects submission of parallel jobs
- Changing PROCLIMIT
- MultiCluster
- Resizable jobs
- Automatic queue selection
- Examples
Syntax
PROCLIMIT = [minimum_limit [default_limit]] maximum_limit

All limits must be positive numbers greater than or equal to 1 that satisfy the following relationship:

1 <= minimum <= default <= maximum

You can specify up to three limits in the PROCLIMIT parameter: a single value sets the maximum limit, two values set the minimum and maximum limits, and three values set the minimum, default, and maximum limits.
How PROCLIMIT affects submission of parallel jobs
The -n option of bsub specifies the number of processors to be used by a parallel job, subject to the processor limits of the queue or application profile.

Jobs that specify fewer processors than the minimum PROCLIMIT or more processors than the maximum PROCLIMIT are rejected.
If a default value for PROCLIMIT is specified, jobs submitted without specifying -n use the default number of processors. If the queue or application profile has only minimum and maximum values for PROCLIMIT, the number of processors is equal to the minimum value. If only a maximum value for PROCLIMIT is specified, or no PROCLIMIT is specified, the number of processors is equal to 1.
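As an illustration, given a queue configured with PROCLIMIT = 2 4 8 (values chosen only for this example):

bsub myjob        # no -n: the job is allocated the default of 4 processors
bsub -n 6 myjob   # allowed: 6 is within the 2 to 8 range
bsub -n 16 myjob  # rejected: 16 exceeds the maximum of 8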
Incorrect processor limits are ignored, and a warning message is displayed when LSF is reconfigured or restarted. A warning message is also logged to the mbatchd log file when LSF is started.

Changing PROCLIMIT

If you change the PROCLIMIT parameter, the new processor limit does not affect running jobs. Pending jobs with no processor requirements use the new default PROCLIMIT value. If a pending job does not satisfy the new processor limits, it remains in PEND state, and the pending reason changes to the following:

Job no longer satisfies PROCLIMIT configuration

If the PROCLIMIT specification is incorrect (for example, too many parameters), a reconfiguration error message is issued. Reconfiguration proceeds and the incorrect PROCLIMIT is ignored.
MultiCluster
Jobs forwarded to a remote cluster are subject to the processor limits of the remote queues. Any processor limits specified on the local cluster are not applied to the remote job.
Resizable jobs
Resizable job allocation requests obey the PROCLIMIT definition in both application profiles and queues. When the maximum job slot request is greater than the maximum slot definition in PROCLIMIT, LSF chooses the minimum value of both. For example, if a job asks for -n 1,4, but PROCLIMIT is defined as 2 2 3, the maximum slot request for the job is 3 rather than 4.

Automatic queue selection
When you submit a parallel job without specifying a queue name, LSF automatically selects the most suitable queue from the queues listed in the DEFAULT_QUEUE parameter in lsb.params or the LSB_DEFAULTQUEUE environment variable. Automatic queue selection takes into account any maximum and minimum PROCLIMIT values for the queues available for automatic selection.

If you specify -n min_proc,max_proc but do not specify a queue, the first queue that satisfies the processor requirements of the job is used. If no queue satisfies the processor requirements, the job is rejected.

Example
For example, queues with the following PROCLIMIT values are defined in lsb.queues:
- queueA with PROCLIMIT=1 1 1
- queueB with PROCLIMIT=2 2 2
- queueC with PROCLIMIT=4 4 4
- queueD with PROCLIMIT=8 8 8
- queueE with PROCLIMIT=16 16 16

In lsb.params:

DEFAULT_QUEUE=queueA queueB queueC queueD queueE
For the following jobs:
bsub -n 8 myjob
LSF automatically selects queueD to run myjob.
bsub -n 5 myjob
Job myjob fails because no default queue has the correct number of processors.

Examples
Maximum processor limit
PROCLIMIT is specified in the default queue in lsb.queues as:

PROCLIMIT = 3

The maximum number of processors that can be allocated for this queue is 3.
Minimum and maximum processor limits
PROCLIMIT is specified in lsb.queues as:

PROCLIMIT = 3 8

The minimum number of processors that can be allocated for this queue is 3, and the maximum is 8.
Minimum, default, and maximum processor limits
PROCLIMIT is specified in lsb.queues as:

PROCLIMIT = 4 6 9
- Minimum number of processors that can be allocated for this queue is 4
- Default number of processors for the queue is 6
- Maximum number of processors that can be allocated for this queue is 9
For example:

bsub myjob

Because a default number of processors is configured, the job myjob runs on 6 processors.
Reserving Processors
About processor reservation
When parallel jobs have to compete with sequential jobs for job slots, the slots that become available are likely to be taken immediately by a sequential job. Parallel jobs need multiple job slots to be available before they can be dispatched. If the cluster is always busy, a large parallel job could be pending indefinitely. The more processors a parallel job requires, the worse the problem is.
Processor reservation solves this problem by reserving job slots as they become available, until there are enough reserved job slots to run the parallel job.
You might want to configure processor reservation if your cluster has a lot of sequential jobs that compete for job slots with parallel jobs.
How processor reservation works
Processor reservation is disabled by default.
If processor reservation is enabled and a parallel job cannot be dispatched because there are not enough job slots to satisfy its minimum processor requirements, the job slots that are currently available are reserved and accumulated.
A reserved job slot is unavailable to any other job. To avoid deadlock situations in which the system reserves job slots for multiple parallel jobs and none of them can acquire sufficient resources to start, a parallel job gives up all its reserved job slots if it has not accumulated enough to start within a specified time. The reservation time starts from the time the first slot is reserved. When the reservation time expires, the job cannot reserve any slots for one scheduling cycle, but then the reservation process can begin again.
If you specify first execution host candidates at the job or queue level, LSF tries to reserve a job slot on the first execution host. If LSF cannot reserve a first execution host job slot, it does not reserve slots on any other hosts.
Configure processor reservation
- To enable processor reservation, set SLOT_RESERVE in lsb.queues and specify the reservation time (a job cannot hold any reserved slots after its reservation time expires).

Syntax

SLOT_RESERVE=MAX_RESERVE_TIME[n]

where n is an integer by which to multiply MBD_SLEEP_TIME. MBD_SLEEP_TIME is defined in lsb.params; the default value is 60 seconds.

Example

Begin Queue
.
PJOB_LIMIT=1
SLOT_RESERVE = MAX_RESERVE_TIME[5]
.
End Queue

In this example, if MBD_SLEEP_TIME is 60 seconds, a job can reserve job slots for 5 minutes. If MBD_SLEEP_TIME is 30 seconds, a job can reserve job slots for 5 * 30 = 150 seconds, or 2.5 minutes.
Viewing information about reserved job slots
Reserved slots can be displayed with the bjobs command. The number of reserved slots can be displayed with the bqueues, bhosts, bhpart, and busers commands. Look in the RSV column.

Reserving Memory for Pending Parallel Jobs
By default, the rusage string reserves resources for running jobs. Because resources are not reserved for pending jobs, some memory-intensive jobs could be pending indefinitely because smaller jobs take the resources immediately before the larger jobs can start running. The more memory a job requires, the worse the problem is.

Memory reservation for pending jobs solves this problem by reserving memory as it becomes available, until the total required memory specified on the rusage string is accumulated and the job can start. Use memory reservation for pending jobs if memory-intensive jobs often compete for memory with smaller jobs in your cluster.
string is accumulated and the job can start. Use memory reservation for pending jobs if memory-intensive jobs often compete for memory with smaller jobs in your cluster.Unlike slot reservation, which only applies to parallel jobs, memory reservation applies to both sequential and parallel jobs.
Configuring memory reservation for pending parallel jobs
Use the RESOURCE_RESERVE parameter in lsb.queues to reserve host memory for pending jobs, as described in Memory Reservation for Pending Jobs.

lsb.queues

- Set the RESOURCE_RESERVE parameter in a queue defined in lsb.queues.

The RESOURCE_RESERVE parameter overrides the SLOT_RESERVE parameter. If both RESOURCE_RESERVE and SLOT_RESERVE are defined in the same queue, job slot reservation and memory reservation are both enabled, and an error is displayed when the cluster is reconfigured. SLOT_RESERVE is ignored. Backfill on memory may still take place.

The following queue enables both memory reservation and backfill in the same queue:

Begin Queue
QUEUE_NAME = reservation_backfill
DESCRIPTION = For resource reservation and backfill
PRIORITY = 40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
BACKFILL = Y
End Queue

Enable per-slot memory reservation
By default, memory is reserved for parallel jobs on a per-host basis. For example, by default, the command:
bsub -n 4 -R "rusage[mem=500]" -q reservation myjob
requires the job to reserve 500 MB on each host where the job runs.
- To enable per-slot memory reservation, define RESOURCE_RESERVE_PER_SLOT=y in lsb.params. In this example, if per-slot reservation is enabled, the job must reserve 500 MB of memory for each job slot (4 * 500 = 2 GB) on the host in order to run.

Backfill Scheduling: Allowing Jobs to Use Reserved Job Slots
By default, a reserved job slot cannot be used by another job. To make better use of resources and improve performance of LSF, you can configure backfill scheduling.
About backfill scheduling
Backfill scheduling allows other jobs to use the reserved job slots, as long as the other jobs do not delay the start of another job. Backfilling, together with processor reservation, allows large parallel jobs to run while not underutilizing resources.
In a busy cluster, processor reservation helps to schedule large parallel jobs sooner. However, by default, reserved processors remain idle until the large job starts. This degrades the performance of LSF because the reserved resources are idle while jobs are waiting in the queue.
Backfill scheduling allows the reserved job slots to be used by small jobs that can run and finish before the large job starts. This improves the performance of LSF because it increases the utilization of resources.
How backfilling works
For backfill scheduling, LSF assumes that a job can run until its run limit expires. Backfill scheduling works most efficiently when all the jobs in the cluster have a run limit.
Since jobs with a shorter run limit have a better chance of being scheduled as backfill jobs, users who specify appropriate run limits in a backfill queue are rewarded with improved turnaround time.
Once the big parallel job has reserved sufficient job slots, LSF calculates the start time of the big job, based on the run limits of the jobs currently running in the reserved slots. LSF cannot backfill if the big job is waiting for a job that has no run limit defined.
If LSF can backfill the idle job slots, only jobs with run limits that expire before the start time of the big job are allowed to use the reserved job slots. LSF cannot backfill with a job that has no run limit.
Example
In this scenario, assume the cluster consists of a 4-CPU multiprocessor host.
- A sequential job (job1) with a run limit of 2 hours is submitted and gets started at 8:00 am (figure a).
- Shortly afterwards, a parallel job (job2) requiring all 4 CPUs is submitted. It cannot start right away because job1 is using one CPU, so it reserves the remaining 3 processors (figure b).
- At 8:30 am, another parallel job (job3) is submitted, requiring only two processors and with a run limit of 1 hour. Since job2 cannot start until 10:00 am (when job1 finishes), its reserved processors can be backfilled by job3 (figure c). Therefore job3 can complete before job2's start time, making use of the idle processors.
- Job3 finishes at 9:30 am and job1 at 10:00 am, allowing job2 to start shortly after 10:00 am. In this example, if job3's run limit were 2 hours, it would not be able to backfill job2's reserved slots, and would have to run after job2 finishes.

Limitations
- A job does not have an estimated start time immediately after mbatchd is reconfigured.

Backfilling and job slot limits
A backfill job borrows a job slot that is already taken by another job. The backfill job does not run at the same time as the job that reserved the job slot first. Backfilling can take place even if the job slot limits for a host or processor have been reached. Backfilling cannot take place if the job slot limits for users or queues have been reached.
Job resize allocation requests
Pending job resize allocation requests are supported by backfill policies. However, the run time of a pending allocation request is equal to the remaining run time of the running resizable job. For example, if the RUNLIMIT of a resizable job is 20 hours and 4 hours have already passed, the run time of the pending allocation request is 16 hours.
Configuring backfill scheduling
Backfill scheduling is enabled at the queue level. Only jobs in a backfill queue can backfill reserved job slots. If the backfill queue also allows processor reservation, then backfilling can occur among jobs within the same queue.
Configure a backfill queue
- To configure a backfill queue, define BACKFILL in lsb.queues.
- Specify Y to enable backfilling. To disable backfilling, specify N or blank space.
or blank space.Example
BACKFILL=Y
Enforcing run limits
Backfill scheduling requires all jobs to specify a duration. If you specify a run time limit using the command line bsub -W option or by defining the RUNLIMIT parameter in lsb.queues or lsb.applications, LSF uses that value as a hard limit and terminates jobs that exceed the specified duration. Alternatively, you can specify an estimated duration by defining the RUNTIME parameter in lsb.applications. LSF uses the RUNTIME estimate for scheduling purposes only, and does not terminate jobs that exceed the RUNTIME duration.

Backfill scheduling works most efficiently when all the jobs in a cluster have a run limit specified at the job level (bsub -W). You can use the external submission executable, esub, to make sure that all users specify a job-level run limit, as in the sketch below.
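The following is a minimal esub sketch, not the esub shipped with LSF: it rejects any job submitted without a run limit. LSF passes the submission options to esub in the file named by LSB_SUB_PARM_FILE, and a job is rejected when esub exits with the value of LSB_SUB_ABORT_VALUE. The LSB_SUB_RLIMIT_RUN parameter name is an assumption here; check the esub parameters documented for your LSF version.

#!/bin/sh
# esub sketch: require a job-level run limit (bsub -W) at submission time.
. $LSB_SUB_PARM_FILE                    # read submission options as shell variables

if [ -z "$LSB_SUB_RLIMIT_RUN" ]; then   # assumed name of the -W parameter
    echo "Please specify a run limit with bsub -W." 1>&2
    exit $LSB_SUB_ABORT_VALUE           # tells LSF to reject the job
fi
exit 0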
Otherwise, you can specify ceiling and default run limits at the queue level (RUNLIMIT in lsb.queues) or application level (RUNLIMIT in lsb.applications).

View information about job start time
- Use bjobs -l to view the estimated start time of a job.

Using backfill on memory
If BACKFILL is configured in a queue, and a run limit is specified with -W on bsub or with RUNLIMIT in the queue, backfill jobs can use the accumulated memory reserved by the other jobs, as long as the backfill job can finish before the predicted start time of the jobs with the reservation.

Unlike slot reservation, which only applies to parallel jobs, backfill on memory applies to sequential and parallel jobs.
The following queue enables both memory reservation and backfill on memory in the same queue:
Begin Queue
QUEUE_NAME = reservation_backfill
DESCRIPTION = For resource reservation and backfill
PRIORITY = 40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
BACKFILL = Y
End Queue

Examples of memory reservation and backfill on memory
lsb.queues

The following queues are defined in lsb.queues:

Begin Queue
QUEUE_NAME = reservation
DESCRIPTION = For resource reservation
PRIORITY=40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
End Queue

Begin Queue
QUEUE_NAME = backfill
DESCRIPTION = For backfill scheduling
PRIORITY = 30
BACKFILL = y
End Queue

lsb.params

Per-slot memory reservation is enabled by RESOURCE_RESERVE_PER_SLOT=y in lsb.params.

Assumptions
Assume one host in the cluster with 10 CPUs and 1 GB of free memory currently available.
Sequential jobs
Each of the following sequential jobs requires 400 MB of memory. The first three jobs run for 300 minutes.
Job 1:
bsub -W 300 -R "rusage[mem=400]" -q reservation myjob1
The job starts running, using 400 MB of memory and one job slot.
Job 2:
Submitting a second job with the same requirements gets the same result.
Job 3:
Submitting a third job with the same requirements reserves one job slot and all free memory, if the amount of free memory is between 20 MB and 200 MB (some free memory may be used by the operating system or other software).
Job 4:
bsub -W 400 -q backfill -R "rusage[mem=50]" myjob4
The job remains pending, since memory is reserved by job 3 and this job runs longer than job 1 and job 2.
Job 5:
bsub -W 100 -q backfill -R "rusage[mem=50]" myjob5
The job starts running. It uses one free slot and memory reserved by job 3. If the job does not finish in 100 minutes, it is killed by LSF automatically.
Job 6:
bsub -W 100 -q backfill -R "rusage[mem=300]" myjob6
The job keeps pending with no resource reservation because it cannot get enough memory from the memory reserved by job 3.
Job 7:
bsub -W 100 -q backfill myjob7
The job starts running. LSF assumes it does not require any memory and enough job slots are free.
Parallel jobs
Each process of a parallel job requires 100 MB of memory, and each parallel job needs 4 CPUs. The first two of the following parallel jobs run for 300 minutes.
Job 1:
bsub -W 300 -n 4 -R "rusage[mem=100]" -q reservation myJob1
The job starts running, using 4 slots and 400 MB of memory.
Job 2:
Submitting a second job with the same requirements gets the same result.
Job 3:
Submitting a third job with the same requirements reserves 2 slots and all 200 MB of available memory, assuming no other applications are running outside of LSF.
Job 4:
bsub -W 400 -q backfill -R "rusage[mem=50]" myJob4
The job remains pending, since all available memory is already reserved by job 3. It runs longer than job 1 and job 2, so no backfill happens.
Job 5:
bsub -W 100 -q backfill -R "rusage[mem=50]" myJob5
This job starts running. It can backfill the slot and memory reserved by job 3. If the job does not finish in 100 minutes, it is killed by LSF automatically.
Using interruptible backfill
Interruptible backfill scheduling can improve cluster utilization by allowing reserved job slots to be used by low priority small jobs that are terminated when the higher priority large jobs are about to start.
An interruptible backfill job:
- Starts as a regular job and is killed when it exceeds the queue runtime limit, or
- Is started for backfill whenever there is a backfill time slice longer than the specified minimal time, and killed before the slot-reservation job is about to start. This applies to compute-intensive serial or single-node parallel jobs that can run a long time, yet are able to checkpoint or resume from an arbitrary computation point.
[Figure: resource allocation diagram]
Job life cycle
- Jobs are submitted to a queue configured for interruptible backfill. The job runtime requirement is ignored.
- Job is scheduled as either regular job or backfill job.
- The queue runtime limit is applied to the regularly scheduled job.
- In the backfill phase, the job is considered to run on any reserved resource whose duration is longer than the minimal time slice configured for the queue. The job run limit is set in such a way that the job releases the resource before it is needed by the slot-reserving job.
- The job runs in a regular manner. It is killed upon reaching its runtime limit, and requeued for the next run. Requeueing must be explicitly configured in the queue.
Assumptions and limitations
- The interruptible backfill job delays the start of the slot-reserving job until the calculated start time, in the same way as a regular backfill job. The interruptible backfill job is killed when its run limit expires.
- Killing other running jobs prematurely does not affect the calculated run limit of an interruptible backfill job. Slot-reserving jobs do not start sooner.
- While the queue is checked for the consistency of interruptible backfill, backfill and runtime specifications, the requeue exit value clause is not verified, nor executed automatically. Configure requeue exit values according to your site policies.
- In LSF MultiCluster, bhist does not display interruptible backfill information for remote clusters.
- A migrated job belonging to an interruptible backfill queue is migrated as if LSB_MIG2PEND is set.
- Interruptible backfill is disabled for resizable jobs. A resizable job can be submitted into an interruptible backfill queue, but the job cannot be resized.
Configure an interruptible backfill queue
- Configure INTERRUPTIBLE_BACKFILL=seconds in the lowest priority queue in the cluster. There can be only one interruptible backfill queue in the cluster.
Specify the minimum number of seconds for a job to be considered for backfilling. This minimal time slice depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created if a site has jobs of distinctly different classes.
For example:
Begin Queue
QUEUE_NAME             = background
# REQUEUE_EXIT_VALUES (set to whatever needed)
DESCRIPTION            = Interruptible Backfill queue
BACKFILL               = Y
INTERRUPTIBLE_BACKFILL = 1
RUNLIMIT               = 10
PRIORITY               = 1
End Queue
Interruptible backfill is disabled if BACKFILL and RUNLIMIT are not configured in the queue.
RUNLIMIT corresponds to the maximum time slice for backfill, and should be configured so that the wait period for new jobs submitted to the queue is acceptable to users. A run time of 10 minutes is a common value.
You should configure REQUEUE_EXIT_VALUES for the queue so that resubmission is automatic. To terminate completely, jobs must exit with specific values:
- If jobs are checkpointable, use their checkpoint exit value.
- If jobs periodically save data on their own, use the SIGTERM exit value.
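For example, if your jobs exit with value 99 after saving a checkpoint (the exit value 99 is illustrative and site-specific), add the following line to the interruptible backfill queue definition in lsb.queues so that those jobs are requeued automatically:
REQUEUE_EXIT_VALUES = 99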
View the run limits for interruptible backfill jobs (bjobs and bhist)
- Use bjobs to display the run limit calculated based on the configured queue-level run limit.
For example, the interruptible backfill queue lazy configures RUNLIMIT=60:
bjobs -l 135
Job <135>, User <user1>, Project <default>, Status <RUN>, Queue <lazy>, Command <myjob>
Mon Nov 21 11:49:22: Submitted from host <hostA>, CWD <$HOME/HPC/jobs>;
RUNLIMIT
59.5 min of hostA
Mon Nov 21 11:49:26: Started on <hostA>, Execution Home </home/user1>, Execution CWD </home/user1/HPC/jobs>;
- Use bhist to display the job-level run limit, if specified.
For example, job 135 was submitted with a run limit of 3 hours:
bsub -n 1 -q lazy -W 3:0 myjob
Job <135> is submitted to queue <lazy>.
bhist displays the job-level run limit:
bhist -l 135
Job <135>, User <user1>, Project <default>, Command <myjob>
Mon Nov 21 11:49:22: Submitted from host <hostA>, to Queue <lazy>, CWD <$HOME/HPC/jobs>;
RUNLIMIT
180.0 min of hostA
Mon Nov 21 11:49:26: Dispatched to <hostA>;
Mon Nov 21 11:49:26: Starting (Pid 2746);
Mon Nov 21 11:49:27: Interruptible backfill runtime limit is 59.5 minutes;
Mon Nov 21 11:49:27: Running with execution home </home/user1>, Execution CWD ...
Displaying available slots for backfill jobs
The bslots command displays slots reserved for parallel jobs and advance reservations. The available slots are not currently used for running jobs, and can be used for backfill jobs. The available slots displayed by bslots are only a snapshot of the slots currently not in use by parallel jobs or advance reservations. They are not guaranteed to be available at job submission.
By default, bslots displays all available slots, and the available run time for those slots. When no reserved slots are available for backfill, bslots displays "No reserved slots available."
The backfill window calculation is based on the snapshot information (current running jobs, slot reservations, advance reservations) obtained from mbatchd.
The backfill window displayed can serve as a reference for submitting backfillable jobs. However, if you have specified extra resource requirements or special submission options, it does not ensure that submitted jobs are scheduled and dispatched successfully.
bslots -R only supports the select resource requirement string. Other resource requirement selections are not supported.
If the available backfill window has no run time limit, its length is displayed as UNLIMITED.
Examples
Display all available slots for backfill jobs:
bslots
SLOTS RUNTIME
1 UNLIMITED
3 1 hour 30 minutes
5 1 hour 0 minutes
7 45 minutes
15 40 minutes
18 30 minutes
20 20 minutes
Display available slots for backfill jobs requiring 15 slots or more:
bslots -n 15
SLOTS RUNTIME
15 40 minutes
18 30 minutes
20 20 minutes
Display available slots for backfill jobs requiring a run time of 30 minutes or more:
bslots -W 30
SLOTS RUNTIME
3 1 hour 30 minutes
5 1 hour 0 minutes
7 45 minutes
15 40 minutes
18 30 minutes
bslots -W 2:45
No reserved slots available.
bslots -n 15 -W 30
SLOTS RUNTIME
15 40 minutes
18 30 minutes
Display available slots for backfill jobs requiring a host with more than 500 MB of memory:
bslots -R "mem>500"
SLOTS RUNTIME
7 45 minutes
15 40 minutes
Display the host names with available slots for backfill jobs:
bslots -l
SLOTS: 15
RUNTIME: 40 minutes
HOSTS: 1*hostB 1*hostE 3*hostC ...
3*hostZ ... ...
SLOTS: 15
RUNTIME: 30 minutes
HOSTS: 2*hostA 1*hostB 3*hostC ...
1*hostX ... ...
Submitting backfill jobs according to available slots
- Use bslots to display job slots available for backfill jobs.
- Submit a job to a backfill queue, specifying a run time limit and a number of processors that are within the availability shown by bslots; see the example below.
Submitting a job according to the backfill slot availability shown by bslots does not guarantee that the job is backfilled successfully. The slots may not be available by the time the job is actually scheduled, or the job may not be dispatched because other resource requirements are not satisfied.
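For example, if bslots reports 15 slots available for 40 minutes, a job that fits inside that window might be submitted as follows (the queue name backfill and the job name myjob are illustrative):
bsub -q backfill -W 30 -n 15 myjob
Parallel Fairshare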
LSF can consider the number of CPUs when using fairshare scheduling with parallel jobs.
If the job is submitted with bsub -n, the following formula is used to calculate dynamic priority:
dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + run_time * number_CPUs * RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR + fairshare_adjustment(struct* shareAdjustPair) * FAIRSHARE_ADJUSTMENT_FACTOR)
where number_CPUs is the number of CPUs used by the job.
Configure parallel fairshare
To configure parallel fairshare so that the number of CPUs is considered when calculating dynamic priority for queue-level user-based fairshare:
note:
LSB_NCPU_ENFORCE does not apply to host-partition user-based fairshare. For host-partition user-based fairshare, the number of CPUs is automatically considered.
- Configure fairshare at the queue level as indicated in Fairshare Scheduling.
- To enable parallel fairshare, set the parameter LSB_NCPU_ENFORCE=1 in lsf.conf.
- To make your changes take effect, use the following commands to restart all LSF daemons:
#lsadmin reconfig
#lsadmin resrestart all
#badmin hrestart all
#badmin mbdrestart
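To see the effect on the dynamic priority formula above, assume number_shares = 100, cpu_time = 200 seconds, run_time = 600 seconds, a job using 4 CPUs and 4 job slots, CPU_TIME_FACTOR = 0.7, RUN_TIME_FACTOR = 0.7, RUN_JOB_FACTOR = 3, and FAIRSHARE_ADJUSTMENT_FACTOR = 0 (all values illustrative only):
dynamic priority = 100 / (200 * 0.7 + 600 * 4 * 0.7 + (1 + 4) * 3) = 100 / (140 + 1680 + 15) ≈ 0.054
Without parallel fairshare, the run_time term is not multiplied by number_CPUs (600 * 0.7 = 420 instead of 1680), giving 100 / 575 ≈ 0.174. Charging the job for all four CPUs lowers its dynamic priority more quickly.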
How Deadline Constraint Scheduling Works For Parallel Jobs
For information about deadline constraint scheduling, see Using Deadline Constraint Scheduling. Deadline constraint scheduling is enabled by default.
If deadline constraint scheduling is enabled and a parallel job has a CPU limit but no run limit, LSF considers the number of processors when calculating how long the job takes.
LSF assumes that the minimum number of processors are used, and that they are all the same speed as the candidate host. If the job cannot finish under these conditions, LSF does not place the job.
The formula is:
(deadline time - current time) > (CPU limit on candidate host / minimum number of processors)
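For example, for a job submitted with a CPU limit of 8 hours and a minimum of 4 processors (an illustrative submission such as bsub -n 4,8 -c 8:0 myjob), LSF estimates that the job needs 8 hours / 4 = 2 hours, and places the job only if more than 2 hours remain before the deadline.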
Optimized Preemption of Parallel Jobs
You can configure preemption for parallel jobs to reduce the number of jobs suspended in order to run a large parallel job.
When a high-priority parallel job preempts multiple low-priority parallel jobs, sometimes LSF preempts more low-priority jobs than are necessary to release sufficient job slots to start the high-priority job.
The PREEMPT_FOR parameter in lsb.params with the MINI_JOB keyword enables optimized preemption of parallel jobs, so LSF preempts fewer of the low-priority parallel jobs.
Enabling the feature improves efficiency only in cases where both the preemptive and preempted jobs are parallel jobs.
How optimized preemption works
When you run many parallel jobs in your cluster, and parallel jobs preempt other parallel jobs, you can enable a feature to optimize the preemption mechanism among parallel jobs.
By default, LSF can over-preempt parallel jobs, sometimes suspending more low-priority jobs than are necessary to release sufficient job slots to start the high-priority job. The optimized preemption mechanism reduces the number of jobs that are preempted.
The optimization applies only when both the preemptive and preempted jobs are parallel jobs; enabling or disabling this feature has no effect on the scheduling of jobs that require only a single processor.
Configure optimized preemption
- Use the PREEMPT_FOR parameter in lsb.params and specify the keyword MINI_JOB to configure optimized preemption at the cluster level.
If the parameter is already set, the MINI_JOB keyword can be used along with other keywords; the other keywords do not enable or disable the optimized preemption mechanism.
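For example, to enable optimized preemption, add the following line to lsb.params and run badmin reconfig:
PREEMPT_FOR = MINI_JOB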
Processor Binding for Parallel Jobs
See also Processor binding for LSF job processes.
By default, there is no processor binding.
For multi-host parallel jobs, LSF sets two environment variables ($LSB_BIND_JOB and $LSB_BIND_CPU_LIST) but does not attempt to bind the job to any host, even if you enable processor binding.
Resizable jobs
Adding slots to or removing slots from a resizable job triggers unbinding and rebinding of job processes. Rebinding does not guarantee that the processes can be bound to the same processors they were bound to previously.
If a multihost parallel job becomes a single-host parallel job after resizing, LSF does not bind it.
If a single-host parallel job or sequential job becomes a multihost parallel job after resizing, LSF does not bind it.
After unbinding and rebinding, the job CPU affinity is changed. LSF puts the new CPU list in the LSB_BIND_CPU_LIST environment variable and the binding method in the LSB_BIND_JOB environment variable. It is the responsibility of the notification command to tell the job that its CPU binding has changed, as in the sketch below.
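A minimal sketch of such a notification command (the script itself and the way it signals the application are illustrative; LSF provides only the two environment variables described above):
#!/bin/sh
# Hypothetical resize notification command, invoked on the first execution
# host after LSF rebinds the job.
echo "Binding method: $LSB_BIND_JOB"
echo "New CPU list: $LSB_BIND_CPU_LIST"
# Signal the application to re-read its CPU binding (application-specific).
exit 0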
Making Job Allocations Resizable
By default, if a job specifies a minimum and maximum slot request (bsub -n min,max), LSF makes a one-time allocation and schedules the job. You can make jobs resizable by submitting them to an application profile configured for resizable jobs. You can use the -ar option to submit autoresizable jobs, where LSF dispatches the job as long as the minimum slot request is satisfied. After the job successfully starts, LSF continues to schedule and allocate additional resources to satisfy the maximum slot request for the job.
For detailed information about the resizable job feature and how to configure it, see the Platform LSF Configuration Guide.
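For example, assuming an application profile named resizable is configured for resizable jobs in lsb.applications:
bsub -n 4,32 -ar -app resizable myjob
LSF dispatches the job once 4 slots are available, then continues to allocate additional slots, up to 32, as they become free.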
Job scheduling and dispatch
The JOB_ACCEPT_INTERVAL parameter in lsb.params or lsb.queues controls the number of seconds to wait after dispatching a job to a host before dispatching a second job to the same host. The parameter applies to all allocated hosts of a parallel job. For resizable job allocation requests, JOB_ACCEPT_INTERVAL applies to newly allocated hosts.
Some parallel jobs must release idle resources before the job completes. For instance, an embarrassingly parallel application typically has a "long tail": most of the short-running tasks have finished, leaving many nodes idle, while a few long-running tasks still occupy a small number of nodes. Ideally, the application should release these idle resources and let other jobs make use of them.
Traditional HPC applications have the same requirement when a long-running parallel portion finishes and the job waits for a serial portion to complete file staging. In this case, users would like to release every node except the first one for the serial portion of the run.
Release idle resources from the application:
To release resources from a running job, the job must be submitted to an application profile configured as resizable. Run bresize release to release allocated resources from a running resizable job (see the examples after this list):
- Release all slots except one slot from the first execution node
- Release all hosts except the first execution node
- Explicitly release a list of hosts, with a different number of slots for each
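For example, for a running resizable job with ID 221 (the job ID and host names are illustrative):
bresize release all 221
releases all slots except one on the first execution node, and
bresize release "1*hostA 2*hostB" 221
releases one slot on hostA and two slots on hostB.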
Only the job owner, cluster administrators, queue administrators, user group administrators, and root are allowed to release resources.
Specify a resize notification command
You can also use bresize release to specify a resize notification command to be invoked on the first execution host of the job. The resize notification command only applies to the release request, and overrides the corresponding resize notification parameters defined in either the application profile (RESIZE_NOTIFY_CMD in lsb.applications) or at the job level (bsub -rnc resize_notification_cmd).
If the resize notification command completes successfully, LSF considers the allocation release done and updates the job allocation. If the resize notification command fails, LSF does not update the job allocation.
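For example, to release all slots except one on the first execution host of job 221 and run a one-time notification script (the script path and job ID are illustrative):
bresize release -rnc /share/bin/resize_notify.sh all 221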
By default, if the job has an active pending allocation request, you cannot release the resources. You must run bresize release -c to cancel the active pending allocation request and release the resources.
Cancel an active pending request
If a job still has an active pending allocation request, but you do not want to allocate more resources to the job, use bresize cancel to cancel the allocation request.
Only the job owner, cluster administrators, queue administrators, user group administrators, and root are allowed to cancel pending resource allocation requests.
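For example, to cancel the pending allocation request of job 221 (an illustrative job ID):
bresize cancel 221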
Feature interactions
Chunk Jobs
Because candidate jobs for the chunk job feature are short-running sequential jobs, the resizable job feature does not support job chunking:
- Autoresizable jobs in a chunk queue or application profile cannot be chunked together
- bresize commands to resize job allocations do not apply to running chunk job members
brequeue
Jobs requeued with brequeue start from the beginning. After requeue, LSF restores the original allocation request for the job.
bswitch
bswitch can switch resizable jobs between queues regardless of job state (including the resizing state). Once the job is switched, the parameters of the new queue apply, including threshold configuration, run limit, CPU limit, and queue-level resource requirements.
User group administrators
User group administrators are allowed to issue bresize commands to release part of the resources from a job allocation (bresize release) or cancel an active pending allocation request (bresize cancel).
Requeue exit values
If job-level, application-level, or queue-level REQUEUE_EXIT_VALUES are defined, and the job exits with a defined exit code, LSF puts the requeued job back into PEND status. For resizable jobs, LSF schedules the requeued job according to its initial allocation request, regardless of any change in allocation size.
Automatic job rerun
A rerunnable job is rescheduled after the first running host becomes unreachable. Once the job is rerun, LSF schedules resizable jobs based on their initial allocation request.