
Running Parallel Jobs


How LSF Runs Parallel Jobs

When LSF runs a job, the LSB_HOSTS variable is set to the names of the hosts running the batch job. For a parallel batch job, LSB_HOSTS contains the complete list of hosts that LSF has allocated to that job.

LSF starts one controlling process for the parallel batch job on the first host in the host list. It is up to your parallel application to read the LSB_HOSTS environment variable to get the list of hosts, and start the parallel job components on all the other allocated hosts.

LSF provides a generic interface to parallel programming packages so that any parallel package can be supported by writing shell scripts or wrapper programs.

Preparing Your Environment to Submit Parallel Jobs to LSF

Getting the host list

Some applications can take this list of hosts directly as a command line parameter. For other applications, you may need to process the host list.

Example

The following example shows a /bin/sh script that processes all the hosts in the host list, including identifying the host where the job script is executing.

#!/bin/sh
# Process the list of host names in LSB_HOSTS

for host in $LSB_HOSTS ; do
    handle_host $host
done 
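
Many parallel packages expect the host list in a machine file rather than in an environment variable. The following sketch (my_mpi_launcher and its -machinefile option are placeholders for whatever launcher your package provides) writes one host name per line and passes the file to the launcher:

#!/bin/sh
# Hypothetical sketch: convert LSB_HOSTS into a machine file with one host
# name per line, then hand the file to the application's own launcher.
MACHINEFILE="machinefile.$LSB_JOBID"
for host in $LSB_HOSTS ; do
    echo "$host"
done > "$MACHINEFILE"

# my_mpi_launcher and -machinefile are placeholders for your package's launcher
my_mpi_launcher -machinefile "$MACHINEFILE" ./myjob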

Parallel job scripts

Each parallel programming package has different requirements for specifying and communicating with all the hosts used by a parallel job. LSF is not tailored to work with a specific parallel programming package. Instead, LSF provides a generic interface so that any parallel package can be supported by writing shell scripts or wrapper programs.

You can modify these scripts to support more parallel packages.

For more information, see Submitting Parallel Jobs.

Use a job starter

You can configure the script into your queue as a job starter, and then all users can submit parallel jobs without having to type the script name. See Queue-Level Job Starters for more information about job starters.

  1. To see if your queue already has a job starter defined, run bqueues -l.
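
For example, an lsb.queues sketch (the queue name and the script path are hypothetical) that configures a wrapper script as the job starter for a queue:

Begin Queue
QUEUE_NAME = parallel
PRIORITY = 40
JOB_STARTER = /usr/local/bin/parallel_starter
DESCRIPTION = Queue that wraps parallel jobs in a job starter script
End Queue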

Submitting Parallel Jobs

LSF can allocate more than one host or processor to run a job, and it automatically keeps track of the job status while the parallel job is running.

Specify the number of processors

When submitting a parallel job that requires multiple processors, you can specify the exact number of processors to use.

  1. To submit a parallel job, use bsub -n and specify the number of processors the job requires.
  2. To submit jobs based on the number of available job slots instead of the number of processors, use PARALLEL_SCHED_BY_SLOT=Y in lsb.params.
  3. For example:

    bsub -n 4 myjob 
     

    submits myjob as a parallel job. The job is started when 4 job slots are available.

    tip:  
    When PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the resource requirement string keyword ncpus refers to the number of slots instead of the number of processors; however, lshosts output continues to show ncpus as defined by EGO_DEFINE_NCPUS in lsf.conf.
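
As a sketch, scheduling by slot is enabled with a single entry in the Parameters section of lsb.params:

Begin Parameters
PARALLEL_SCHED_BY_SLOT = Y
End Parameters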

Starting Parallel Tasks with LSF Utilities

For simple parallel jobs you can use LSF utilities to start parts of the job on other hosts. Because LSF utilities handle signals transparently, LSF can suspend and resume all components of your job without additional programming.

Running parallel tasks with lsgrun

The simplest parallel job runs an identical copy of the executable on every host. The lsgrun command takes a list of host names and runs the specified task on each host. The lsgrun -p command specifies that the task should be run in parallel on each host.

Example

This example submits a job that uses lsgrun to run myjob on all the selected hosts in parallel:

bsub -n 10 'lsgrun -p -m "$LSB_HOSTS" myjob'
Job <3856> is submitted to default queue <normal>. 

For more complicated jobs, you can write a shell script that runs lsrun in the background to start each component.
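
A minimal sketch of such a script follows (my_component is a placeholder for the component your application runs on each host):

#!/bin/sh
# Hypothetical sketch: start one component of the job on each allocated host
# with lsrun, run the components in the background, and wait for all of them.
for host in $LSB_HOSTS ; do
    lsrun -m "$host" my_component &
done
wait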

Running parallel tasks with the blaunch distributed application framework

Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

Similar to the lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.

important:  
You cannot run blaunch directly from the command line.

blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. It cannot be used as a standalone command. On success blaunch exits with 0.

Windows: blaunch is supported on Windows 2000 or later, with some exceptions.

See Using Platform LSF HPC for more information about using the blaunch distributed application framework.

Submitting jobs with blaunch

Use bsub to call blaunch, or to invoke a job script that calls blaunch. The blaunch command assumes that bsub -n implies one remote task per job slot.
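
For example, the following submission (mytask is a placeholder for your task binary) runs one copy of mytask in each of the 4 allocated slots:

bsub -n 4 "blaunch mytask"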

Job Slot Limits For Parallel Jobs

A job slot is the basic unit of processor allocation in LSF. A sequential job uses one job slot. A parallel job that has N components (tasks) uses N job slots, which can span multiple hosts.

By default, running and suspended jobs count against the job slot limits for queues, users, hosts, and processors that they are associated with.

With processor reservation, job slots reserved by pending jobs also count against all job slot limits.

When backfilling occurs, the job slots used by backfill jobs count against the job slot limits for the queues and users, but not hosts or processors. This means that when a pending job and a running job occupy the same physical job slot on a host, both jobs count towards the queue limit, but only the pending job counts towards the host limit.

Specifying a Minimum and Maximum Number of Processors

By default, when scheduling a parallel job, the number of slots allocated on each host will not exceed the number of CPUs on that host, even if the host's MXJ is set greater than the number of CPUs. When you submit a parallel job, you can also specify a minimum and a maximum number of processors.

If you specify a maximum and minimum number of processors, the job starts as soon as the minimum number of processors is available, but it uses up to the maximum number of processors, depending on how many processors are available at the time. Once the job starts running, no more processors are allocated to it even though more may be available later on.

Jobs that request fewer processors than the minimum PROCLIMIT defined for the queue or application profile to which the job is submitted, or more processors than the maximum PROCLIMIT are rejected. If the job requests minimum and maximum processors, the maximum requested cannot be less than the minimum PROCLIMIT, and the minimum requested cannot be more than the maximum PROCLIMIT.

If PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the job specifies a maximum and minimum number of job slots instead of processors. LSF ignores the number of CPUs constraint during parallel job scheduling and only schedules based on slots.

If PARALLEL_SCHED_BY_SLOT is not defined for a resizable job, individual allocation requests are constrained by the number of CPUs during scheduling. However, the final resizable job allocation may not agree. For example, on a host with 2 CPUs and 4 slots, an autoresizable job that requests 1 to 4 slots can eventually use up to 4 slots.

Syntax

bsub -n min_proc[,max_proc] 

Example

bsub -n 4,16 myjob 

At most, 16 processors can be allocated to this job. If there are fewer than 16 processors eligible to run the job, the job can still be started as long as the number of eligible processors is greater than or equal to 4.

Specifying a First Execution Host

In general, the first execution host satisfies certain resource requirements that might not be present on other available hosts.

By default, LSF selects the first execution host dynamically according to the resource availability and host load for a parallel job. Alternatively, you can specify one or more first execution host candidates so that LSF selects one of the candidates as the first execution host.

When a first execution host is specified to run the first task of a parallel application, LSF does not include the first execution host or host group in a job resize allocation request.

Specify a first execution host

To specify one or more hosts, host groups, or compute units as first execution host candidates, add the (!) symbol after the host name. You can specify first execution host candidates at job submission, or in the queue definition.

Job level
  1. Use the -m option of bsub:
  2. bsub -n 32 -m "hostA! hostB hostgroup1! hostC" myjob

    The scheduler selects either hostA or a host defined in hostgroup1 as the first execution host, based on the job's resource requirements and host availability.

  3. In a MultiCluster environment, insert the (!) symbol after the cluster name, as shown in the following example:
  4. bsub -n 2 -m "host2@cluster2! host3@cluster2" my_parallel_job

Queue level

The queue-level specification of first execution host candidates applies to all jobs submitted to the queue.

  1. Specify the first execution host candidates in the list of hosts in the HOSTS parameter in lsb.queues:
  2. HOSTS = hostA! hostB hostgroup1! hostC 
    
Rules

Follow these guidelines when you specify first execution host candidates:

If the first execution host is incorrect at job submission, the job is rejected. If incorrect configurations exist on the queue level, warning messages are logged and displayed when LSF starts, restarts or is reconfigured.

Job chunking

Specifying first execution host candidates affects job chunking. For example, the following jobs have different job requirements, and are not placed in the same job chunk:

bsub -n 2 -m "hostA! hostB hostC" myjob 
bsub -n 2 -m "hostA hostB hostC" myjob 
bsub -n 2 -m "hostA hostB! hostC" myjob 

The requirements of each job in this example are: the first job must use hostA as its first execution host; the second job has no first execution host requirement; and the third job must use hostB as its first execution host.

For job chunking, all jobs must request the same hosts and the same first execution hosts (if specified). Jobs that specify a host preference must all specify the same preference.

Resource reservation

If you specify first execution host candidates at the job or queue level, LSF tries to reserve a job slot on the first execution host. If LSF cannot reserve a first execution host job slot, it does not reserve slots on any other hosts.

Compute units

If compute units resource requirements are used, the compute unit containing the first execution host is given priority:

bsub -n 64 -m "hg! cu1 cu2 cu3 cu4" -R "cu[pref=config]" myjob

In this example the first execution host is selected from the host group hg. Next in the job's allocation list are any appropriate hosts from the same compute unit as the first execution host. Finally remaining hosts are grouped by compute unit, with compute unit groups appearing in the same order as in the ComputeUnit section of lsb.hosts.

Compound resource requirements

If compound resource requirements are being used, the resource requirements specific to the first execution host should appear first:

bsub -m "hostA! hg12" -R "1*{select[type==X86_64]rusage[licA=1]} + 
{select[type==any]}" myjob 

In this example the first execution host must satisfy: select[type==X86_64]rusage[licA=1]

Controlling Job Locality using Compute Units

Compute units are groups of hosts laid out by the LSF administrator and configured to mimic the network architecture, minimizing communications overhead for optimal placement of parallel jobs. Different granularities of compute units provide the flexibility to configure an extensive cluster accurately and run larger jobs over larger compute units.

Resource requirement keywords within the compute unit section can be used to allocate resources across compute units in a manner analogous to host-level resource allocation. Compute units then replace hosts as the basic unit of allocation for a job.

High performance computing clusters running large parallel jobs spread over many hosts benefit from using compute units. Communications bottlenecks within the network architecture of a large cluster can be isolated through careful configuration of compute units. Using compute units instead of hosts as the basic allocation unit, scheduling policies can be applied on a large scale.

tip:  
Configure each individual host as a compute unit to use the compute unit features for host level job allocation.

In the following example, two types of compute units have been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= enclosure! rack 

! indicates the default compute unit type. The first type listed (enclosure) is the finest granularity and the only type of compute unit containing hosts and host groups. Coarser granularity rack compute units can only contain enclosures.

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows (some lines omitted):

Begin ComputeUnit  
NAME         MEMBER             CONDENSED TYPE 
enclosure1   (host1[01-16])     Y         enclosure 
... 
enclosure8   (host8[01-16])     Y         enclosure 
rack1        (enclosure[1-2])   Y         rack 
rack2        (enclosure[3-4])   Y         rack 
rack3        (enclosure[5-6])   Y         rack 
rack4        (enclosure[7-8])   Y         rack 
End ComputeUnit 

This example defines 12 compute units, all of which have condensed output:

- enclosure1 through enclosure8 are the finest granularity, and each contains 16 hosts.
- rack1, rack2, rack3, and rack4 are the coarsest granularity, and each contains 2 enclosures.
Syntax

The cu string supports the following syntax:

cu[balance]

All compute units used for this job should contribute the same number of slots (to within one slot). Provides a balanced allocation over the fewest possible compute units.

cu[pref=config]

Compute units for this job are considered in the order they appear in the lsb.hosts configuration file. This is the default value.

cu[pref=minavail]

Compute units with the fewest available slots are considered first for this job. Useful for smaller jobs (both sequential and parallel) since this reduces fragmentation of compute units, leaving whole compute units free for larger jobs.

cu[pref=maxavail]

Compute units with the most available slots are considered first for this job.

cu[maxcus=number]

Maximum number of compute units the job can run across.

cu[usablecuslots=number]

All compute units used for this job should contribute the same minimum number of slots. Only the final compute unit allocated can contribute fewer than number slots.

cu[type=cu_type]

Type of compute unit being used, where cu_type is one of the types defined by COMPUTE_UNIT_TYPES in lsb.params. The default is the compute unit type listed first in lsb.params.

cu[excl]

Compute units used exclusively for the job. Must be enabled by EXCLUSIVE in lsb.queues.

Continuing with the example shown above, assume lsb.queues contains the parameter definition EXCLUSIVE=CU[rack] and that the slots available for each compute unit are shown under MAX in the condensed display from bhosts, where HOST_NAME refers to the compute unit:

HOST_NAME    STATUS   JL/U  MAX  NJOBS  RUN  SSUSP  USUSP  RSV 
enclosure1   ok       -     64   34     34    0      0      0 
enclosure2   ok       -     64   54     54    0      0      0 
enclosure3   ok       -     64   46     46    0      0      0 
enclosure4   ok       -     64   44     44    0      0      0 
enclosure5   ok       -     64   45     45    0      0      0 
enclosure6   ok       -     64   44     44    0      0      0 
enclosure7   ok       -     64   0      0     0      0      0 
enclosure8   ok       -     64   0      0     0      0      0 
rack1        ok       -     128  88     88    0      0      0 
rack2        ok       -     128  90     90    0      0      0 
rack3        ok       -     128  89     89    0      0      0 
rack4        ok       -     128  0      0     0      0      0 

Based on the 12 configured compute units, jobs can be submitted with a variety of compute unit requirements.

Using compute units
  1. bsub -R "cu[]" -n 64 ./app
  2. This job is restricted to compute units of the default type enclosure. The default pref=config applies, with compute units considered in configuration order. The job runs on 30 slots in enclosure1, 10 slots in enclosure2, 18 slots in enclosure3, and 6 slots in enclosure4 for a total of 64 slots.
  3. Compute units can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job.
  4. bsub -R "cu[pref=minavail]" -n 32 ./app
  5. This job is restricted to compute units of the default type enclosure in the order pref=minavail. Compute units with the fewest free slots are considered first. The job runs on 10 slots in enclosure2, 18 slots in enclosure3, and 4 slots in enclosure5 for a total of 32 slots.
  6. bsub -R "cu[type=rack:pref=maxavail]" -n 64 ./app
  7. This job is restricted to compute units of type rack in the order pref=maxavail. Compute units with the most free slots are considered first. The job runs on 64 slots in rack4.
Localized allocations

Jobs can be run over a limited number of compute units using the maxcus keyword.

  1. bsub -R "cu[pref=maxavail:maxcus=1]" ./app
  2. This job is restricted to a single enclosure, and compute units with the most free slots are considered first. The job requirements are satisfied by enclosure8 which has 64 free slots.
  3. bsub -n 64 -R "cu[maxcus=3]" ./app
  4. This job requires a total of 64 slots over 3 enclosures or less. Compute units are considered in configuration order. The job requirements are satisfied by the following allocation:
  5. compute unit    number of slots 
     enclosure1      30 
     enclosure3      18 
     enclosure4      16 

Balanced slot allocations

Balanced allocations split jobs evenly between compute units, which increases the efficiency of some applications.

  1. bsub -n 80 -R "cu[balance:maxcus=4]" ./app
  2. This job requires a balanced allocation over the fewest possible compute units of type enclosure (the default type), with a total of 80 slots. Since none of the configured enclosures have 80 slots, 2 compute units with 40 slots each are used, satisfying the maxcus requirement to use 4 compute units or less.
  3. The keyword pref is not included so the default order of pref=config is used. The job requirements are satisfied by 40 slots on both enclosure7 and enclosure8 for a total of 80 slots.
  4. bsub -n 64 -R "cu[balance:type=rack:pref=maxavail]" ./app
  5. This job requires a balanced allocation over the fewest possible compute units of type rack, with a total of 64 slots. Compute units with the most free slots are considered first, in the order rack4, rack1, rack3, rack2. The job requirements are satisfied by rack4.
  6. bsub -n "40,80" -R "cu[balance:pref=minavail]" ./app
  7. This job requires a balanced allocation over compute units of type rack, with a range of 40 to 80 slots. Only the minimum number of slots is considered when a range is specified along with keyword balance, so the job needs 40 slots. Compute units with the fewest free slots are considered first.
  8. Because balance uses the fewest possible compute units, racks with 40 or more slots are considered first, namely rack1 and rack4. The rack with the fewest available slots is then selected, and all job requirements are satisfied by rack1.
Balanced host allocations

Using balance and ptile together within the requirement string results in a balanced host allocation over compute units, and the same number of slots from each host. The final host may provide fewer slots if required.

For example, one possible set of balanced host allocations for a job that spans 16 hosts is:

number of compute units   number of hosts per compute unit 
2                         8+8 
3                         5+5+6 
4                         4+4+4+4 
5                         3+3+3+3+4 

Minimum slot allocations

Minimum slot allocations result in jobs spreading over fewer compute units, and ignoring compute units with few hosts available.

  1. bsub -n 45 -R "cu[usablecuslots=10:pref=minavail]" ./app
  2. This job requires an allocation of at least 10 slots in each enclosure, except possibly the last one. Compute units with the fewest free slots are considered first. The requirements are satisfied by a slot allocation of:
  3. compute unit    number of slots 
     enclosure2      10 
     enclosure5      19 
     enclosure4      16 

  4. bsub -n "1,140" -R "cu[usablecuslots=20]" ./app
  5. This job requires an allocation of at least 20 slots in each enclosure, except possibly the last one. Compute units are considered in configuration order and as close to 140 slots are allocated as possible. The requirements are satisfied by an allocation of 140 slots, where only the last compute unit has fewer than 20 slots allocated as follows:
  6. compute unit    number of slots 
     enclosure1      30 
     enclosure4      20 
     enclosure6      20 
     enclosure7      64 
     enclosure2      6 

Exclusive compute unit jobs

Because EXCLUSIVE=CU[rack] in lsb.queues, jobs may use compute units of type rack or finer granularity type enclosure exclusively. Exclusive jobs lock all compute units they run in, even if not all slots are being used by the job. Running compute unit exclusive jobs minimizes communications slowdowns resulting from shared network bandwidth.

  1. bsub -R "cu[excl:type=enclosure]" ./app
  2. This job requires exclusive use of an enclosure with compute units considered in configuration order. The first enclosure not running any jobs is enclosure7.
  3. Using excl with usablecuslots, the job avoids compute units where a large portion of the hosts are unavailable.
  4. bsub -n 90 -R "cu[excl:usablecuslots=12:type=enclosure]" ./app
  5. This job requires exclusive use of compute units, and will not use a compute unit if fewer than 12 slots are available. Compute units are considered in configuration order. In this case the job requirements are satisfied by 64 slots in enclosure7 and 26 slots in enclosure8.
  6. bsub -R "cu[excl]" ./app
  7. This job requires exclusive use of a rack with compute units considered in configuration order. The only rack not running any jobs is rack4.
Reservation

Compute unit constraints such as the keywords maxcus, balance, and excl can result in inaccurately predicted start times from default LSF resource reservation. Time-based resource reservation provides a more accurate predicted start time for pending jobs. When calculating a job's time-based predicted start time, LSF considers job scheduling constraints and requirements, such as job topology and resource limits.

Host-level compute units

Configuring each individual host as a compute unit allows you to use the compute unit features for host-level job allocation. Consider an example where one type of compute unit has been defined in the parameter COMPUTE_UNIT_TYPES in lsb.params:

COMPUTE_UNIT_TYPES= host! 

The hosts have been grouped into compute units in the ComputeUnit section of lsb.hosts as follows:

Begin ComputeUnit  
NAME  MEMBER   TYPE 
h1    host1    host 
h2    host2    host 
... 
h50   host50   host 
End ComputeUnit 

Each configured compute unit of default type host contains a single host.

Ordering host allocations

Using the compute unit keyword pref, hosts can be considered in order of most free slots or fewest free slots, where free slots include any slots available and not occupied by a running job:

  1. bsub -R "cu[]" ./app
  2. Compute units of default type host, each containing a single host, are considered in configuration order.
  3. bsub -R "cu[pref=minavail]" ./app
  4. Compute units of default type host each contain a single host. Compute units with the fewest free slots are considered first.
  5. bsub -n 20 -R "cu[pref=maxavail]" ./app
  6. Compute units of default type host each contain a single host. Compute units with the most free slots are considered first. A total of 20 slots are allocated for this job.
Limiting hosts in allocations

Using the compute unit keyword maxcus, the maximum number of hosts allocated to a job can be set:
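
For example (a sketch; the slot count and the maxcus value are arbitrary), a job can be restricted to at most 3 hosts:

  1. bsub -n 12 -R "cu[maxcus=3]" ./app
  2. Compute units of default type host, each containing a single host, are considered in configuration order. The 12 slots are allocated across no more than 3 hosts.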

Balanced slot allocations

Using the compute unit keyword balance, jobs can be evenly distributed over hosts:

  1. bsub -n 9 -R "cu[balance]" ./app
  2. Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are:
  3. compute units   hosts   number of slots per host 
     1               1       9 
     2               2       4, 5 
     3               3       3, 3, 3 
     4               4       2, 2, 2, 3 
     5               5       2, 2, 2, 2, 1 
     6               6       2, 2, 2, 1, 1, 1 
     7               7       2, 2, 1, 1, 1, 1, 1 
     8               8       2, 1, 1, 1, 1, 1, 1, 1 
     9               9       1, 1, 1, 1, 1, 1, 1, 1, 1 

  4. bsub -n 9 -R "cu[balance:maxcus=3]" ./app
  5. Compute units of default type host, each containing a single host, are considered in configuration order. Possible balanced allocations are 1 host with 9 slots, 2 hosts with 4 and 5 slots, or 3 hosts with 3 slots each.
Minimum slot allocations

Using the compute unit keyword usablecuslots, hosts are only considered if they have a minimum number of slots free and usable for this job:

  1. bsub -n 16 -R "cu[usablecuslots=4]" ./app
  2. Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available and not occupied by a running job are considered. Each host (except possibly the last host allocated) must contribute at least 4 slots to the job.
  3. bsub -n 16 -R "rusage[mem=1000] cu[usablecuslots=4]" ./app
  4. Compute units of default type host, each containing a single host, are considered in configuration order. Only hosts with 4 or more slots available, not occupied by a running job, and with 1000 memory units are considered. A host with 10 slots and 2000 units of memory, for example, will only have 2 slots free that satisfy the memory requirements of this job.

Controlling Processor Allocation Across Hosts

Sometimes you need to control how the selected processors for a parallel job are distributed across the hosts in the cluster.

You can control this at the job level or at the queue level. The queue specification is ignored if your job specifies its own locality.

Specifying parallel job locality at the job level

By default, LSF allocates the required processors for the job from the available set of processors.

A parallel job may span multiple hosts, with a specifiable number of processes allocated to each host. A job may be scheduled onto a single multiprocessor host to take advantage of its efficient shared memory, or spread across multiple hosts to take advantage of their aggregate memory and swap space. Flexible spanning may also be used to achieve parallel I/O.

You are able to specify "select all the processors for this parallel batch job on the same host", or "do not choose more than n processors on one host" by using the span section in the resource requirement string (bsub -R or RES_REQ in the queue definition in lsb.queues).

If PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the span string is used to control the number of job slots instead of processors.

Syntax

The span string supports the following syntax:

span[hosts=1]

Indicates that all the processors allocated to this job must be on the same host.

span[ptile=value]

Indicates the number of processors on each host that should be allocated to the job, where value is either a positive integer or the predefined default value (ptile='!'), optionally followed by host type or host model specific values (see Specifying multiple ptile values).

span[hosts=-1]

Disables span setting in the queue. LSF allocates the required processors for the job from the available set of processors.

Specifying multiple ptile values

In a span string with multiple ptile values, you must specify a predefined default value (ptile='!') and either host model or host type.

You can specify both type and model in the same[] section of the resource requirement string, but the ptile values in the span section must be all host types or all host models; you cannot mix the two.

If you specify same[type:model], you cannot specify a predefined ptile value (!) in the span section.

restriction:  
Under bash 3.0, the exclamation mark (!) is not interpreted correctly by the shell. To use the predefined ptile value (ptile='!'), use the +H option to disable '!'-style history substitution in bash (sh +H).

The following span strings are valid:

same[type:model] span[ptile=LINUX:2,SGI:4] 

LINUX and SGI are both host types and can appear in the same span string.

same[type:model] span[ptile=PC233:2,PC1133:4] 

PC233 and PC1133 are both host models and can appear in the same span string.

You cannot mix host model and host type in the same span string. The following span strings are not correct:

span[ptile='!',LINUX:2,PC1133:4] same[model] 
span[ptile='!',LINUX:2,PC1133:4] same[type] 

The LINUX host type and PC1133 host model cannot appear in the same span string.

Multiple ptile values for a host type

For host type, you must specify same[type] in the resource requirement. For example:

span[ptile='!',HP:8,SGI:8,LINUX:2] same[type] 

The job requests 8 processors on a host of type HP or SGI, and 2 processors on a host of type LINUX, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types.

Multiple ptile values for a host model

For host model, you must specify same[model] in the resource requirement. For example:

span[ptile='!',PC1133:4,PC233:2] same[model] 

The job requests 4 processors on hosts of model PC1133, and 2 processors on hosts of model PC233, and the predefined maximum job slot limit in lsb.hosts (MXJ) for other host models.

Examples

bsub -n 4 -R "span[hosts=1]" myjob 

Runs the job on a host that has at least 4 processors currently eligible to run the 4 components of this job.

bsub -n 4 -R "span[ptile=2]" myjob 

Runs the job on 2 hosts, using 2 processors on each host. Each host may have more than 2 processors available.

bsub -n 4 -R "span[ptile=3]" myjob 

Runs the job on 2 hosts, using 3 processors on the first host and 1 processor on the second host.

bsub -n 4 -R "span[ptile=1]" myjob 

Runs the job on 4 hosts, even though some of the 4 hosts may have more than one processor currently available.

bsub -n 4 -R "type==any same[type] span[ptile='!',LINUX:2,SGI:4]" myjob 

Submits myjob to request 4 processors running on 2 hosts of type LINUX (2 processors per host), or a single host of type SGI, or for other host types, the predefined maximum job slot limit in lsb.hosts (MXJ).

bsub -n 16 -R "type==any same[type] span[ptile='!',HP:8,SGI:8,LINUX:2]" myjob 

Submits myjob to request 16 processors on 2 hosts of type HP or SGI (8 processors per hosts), or on 8 hosts of type LINUX (2 processors per host), or the predefined maximum job slot limit in lsb.hosts (MXJ) for other host types.

bsub -n 4 -R "same[model] span[ptile='!',PC1133:4,PC233:2]" myjob 

Submits myjob to request a single host of model PC1133 (4 processors), or 2 hosts of model PC233 (2 processors per host), or the predefined maximum job slot limit in lsb.hosts (MXJ) for other host models.

Specifying parallel job locality at the queue level

The queue may also define the locality for parallel jobs using the RES_REQ parameter.
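
For example, the following lsb.queues sketch (the queue name and ptile value are arbitrary) limits jobs in the queue to 2 processors per host:

Begin Queue
QUEUE_NAME = parallel
PRIORITY = 40
RES_REQ = "span[ptile=2]"
DESCRIPTION = Parallel jobs limited to 2 processors per host
End Queue

Jobs that specify their own span string at submission override this queue-level setting.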

Running Parallel Processes on Homogeneous Hosts

Parallel jobs run on multiple hosts. If your cluster has heterogeneous hosts, some processes from a parallel job may, for example, run on Solaris and some on SGI IRIX. However, for performance reasons you may want all processes of a job to run on the same type of host rather than having some processes run on one type of host and others on another.

You can use the same section in the resource requirement string to indicate to LSF that processes are to run on one type or model of host. You can also use a custom resource to define the criteria for homogeneous hosts.

Examples

Running all parallel processes on the same host type
bsub -n 4 -R"select[type==SGI6 || type==SOL7] same[type]" myjob 

Allocate 4 processors on the same host type: either SGI IRIX or Solaris 7, but not both.

Running all parallel processes on the same host type and model
bsub -n 6 -R"select[type==any] same[type:model]" myjob 

Allocate 6 processors on any host type or model as long as all the processors are on the same host type and model.

Running all parallel processes on hosts in the same high-speed connection group
bsub -n 12 -R "select[type==any && (hgconnect==hg1 || hgconnect==hg2 || hgconnect==hg3)] 
same[hgconnect:type]" myjob 

For performance reasons, you want to have LSF allocate 12 processors on hosts in high-speed connection group hg1, hg2, or hg3, but not across hosts in hg1, hg2 or hg3 at the same time. You also want hosts that are chosen to be of the same host type.

This example reflects a network in which network connections among hosts in the same group are high-speed, and network connections between host groups are low-speed.

In order to specify this, you create a custom resource hgconnect in lsf.shared.

Begin Resource
RESOURCENAME    TYPE    INTERVAL     INCREASING      RELEASE      DESCRIPTION
hgconnect       STRING        ()             ()           ()      (high-speed connection group)
...
End Resource 

In the lsf.cluster.cluster_name file, identify groups of hosts that share high-speed connections.

Begin ResourceMap
RESOURCENAME    LOCATION
hgconnect       (hg1@[hostA hostB] hg2@[hostD hostE] hg3@[hostF hostG hostX])
End ResourceMap 

If you want to specify the same resource requirement at the queue level, define a custom resource in lsf.shared as in the previous example, map hosts to high-speed connection groups in lsf.cluster.cluster_name, and define the following queue in lsb.queues:

Begin Queue 
QUEUE_NAME = My_test 
PRIORITY = 30 
NICE = 20 
RES_REQ = "select[mem > 1000 && type==any && (hgconnect==hg1 || 
hgconnect==hg2 || hgconnect=hg3)]same[hgconnect:type]" 
DESCRIPTION = either hg1 or hg2 or hg3
End Queue 

This example allocates processors on hosts that have more than 1000 MB of memory available, belong to the same high-speed connection group (hg1, hg2, or hg3), and are of the same host type.

Limiting the Number of Processors Allocated

Use the PROCLIMIT parameter in lsb.queues or lsb.applications to limit the number of processors that can be allocated to a parallel job.

Syntax

PROCLIMIT = [minimum_limit [default_limit]] maximum_limit

All limits must be positive numbers greater than or equal to 1 that satisfy the following relationship:

1 <= minimum <= default <= maximum

You can specify up to three limits in the PROCLIMIT parameter:

One limit

The specified limit is the maximum processor limit. The minimum and default limits are set to 1.

Two limits

The first is the minimum processor limit, and the second is the maximum. The default is set equal to the minimum. The minimum must be less than or equal to the maximum.

Three limits

The first is the minimum processor limit, the second is the default processor limit, and the third is the maximum. The minimum must be less than the default and the maximum.
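
For example, an lsb.queues sketch (the queue name is arbitrary) that enforces a minimum of 4, a default of 6, and a maximum of 9 processors:

Begin Queue
QUEUE_NAME = medium_parallel
PRIORITY = 35
PROCLIMIT = 4 6 9
DESCRIPTION = Parallel jobs using 4 to 9 processors
End Queue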

How PROCLIMIT affects submission of parallel jobs

The -n option of bsub specifies the number of processors to be used by a parallel job, subject to the processor limits of the queue or application profile.

Jobs that specify fewer processors than the minimum PROCLIMIT or more processors than the maximum PROCLIMIT are rejected.

If a default value for PROCLIMIT is specified, jobs submitted without specifying -n use the default number of processors. If the queue or application profile has only minimum and maximum values for PROCLIMIT, the number of processors is equal to the minimum value. If only a maximum value for PROCLIMIT is specified, or no PROCLIMIT is specified, the number of processors is equal to 1.

Incorrect processor limits are ignored, and a warning message is displayed when LSF is reconfigured or restarted. A warning message is also logged to the mbatchd log file when LSF is started.

Changing PROCLIMIT

If you change the PROCLIMIT parameter, the new processor limit does not affect running jobs. Pending jobs with no processor requirements use the new default PROCLIMIT value. If the pending job does not satisfy the new processor limits, it remains in PEND state, and the pending reason changes to the following:

Job no longer satisfies PROCLIMIT configuration 

If PROCLIMIT specification is incorrect (for example, too many parameters), a reconfiguration error message is issued. Reconfiguration proceeds and the incorrect PROCLIMIT is ignored.

MultiCluster

Jobs forwarded to a remote cluster are subject to the processor limits of the remote queues. Any processor limits specified on the local cluster are not applied to the remote job.

Resizable jobs

Resizable job allocation requests obey the PROCLIMIT definition in both application profiles and queues. When the maximum job slot request is greater than the maximum slot definition in PROCLIMIT, LSF chooses the minimum value of both. For example, if a job asks for -n 1,4, but PROCLIMIT is defined as 2 2 3, the maximum slot request for the job is 3 rather than 4.

Automatic queue selection

When you submit a parallel job without specifying a queue name, LSF automatically selects the most suitable queue from the queues listed in the DEFAULT_QUEUE parameter in lsb.params or the LSB_DEFAULTQUEUE environment variable. Automatic queue selection takes into account any maximum and minimum PROCLIMIT values for the queues available for automatic selection.

If you specify -n min_proc,max_proc, but do not specify a queue, the first queue that satisfies the processor requirements of the job is used. If no queue satisfies the processor requirements, the job is rejected.

Example

For example, queues with the following PROCLIMIT values are defined in lsb.queues:

    queueA with PROCLIMIT=1 1 1 
    queueB with PROCLIMIT=2 2 2 
    queueC with PROCLIMIT=4 4 4 
    queueD with PROCLIMIT=8 8 8 
    queueE with PROCLIMIT=16 16 16 

In lsb.params: DEFAULT_QUEUE=queueA queueB queueC queueD queueE

For the following jobs:

bsub -n 8 myjob

LSF automatically selects queueD to run myjob.

bsub -n 5 myjob

Job myjob fails because no default queue has the correct number of processors.

Examples

Maximum processor limit

PROCLIMIT is specified in the default queue in lsb.queues as:

PROCLIMIT = 3 

The maximum number of processors that can be allocated for this queue is 3.

bsub -n 2 myjob 

The job myjob runs on 2 processors.

bsub -n 4 myjob 

The job myjob is rejected from the queue because it requires more than the maximum number of processors configured for the queue (3).

bsub -n 2,3 myjob 

The job myjob runs on 2 or 3 processors.

bsub -n 2,5 myjob 

The job myjob runs on 2 or 3 processors, depending on how many slots are currently available on the host.

bsub myjob 

No default or minimum is configured, so the job myjob runs on 1 processor.

Minimum and maximum processor limits

PROCLIMIT is specified in lsb.queues as:

PROCLIMIT = 3 8 

The minimum number of processors that can be allocated for this queue is 3 and the maximum number of processors that can be allocated for this queue is 8.

bsub -n 5 myjob 

The job myjob runs on 5 processors.

bsub -n 2 myjob 

The job myjob is rejected from the queue because the number of processors requested is less than the minimum number of processors configured for the queue (3).

bsub -n 4,5 myjob 

The job myjob runs on 4 or 5 processors.

bsub -n 2,6 myjob 

The job myjob runs on 3 to 6 processors.

bsub -n 4,9 myjob 

The job myjob runs on 4 to 8 processors.

bsub myjob 

The default number of processors is equal to the minimum number (3). The job myjob runs on 3 processors.

Minimum, default, and maximum processor limits

PROCLIMIT is specified in lsb.queues as:

PROCLIMIT = 4 6 9 

bsub myjob 

Because a default number of processors is configured, the job myjob runs on 6 processors.

Reserving Processors

About processor reservation

When parallel jobs have to compete with sequential jobs for job slots, the slots that become available are likely to be taken immediately by a sequential job. Parallel jobs need multiple job slots to be available before they can be dispatched. If the cluster is always busy, a large parallel job could be pending indefinitely. The more processors a parallel job requires, the worse the problem is.

Processor reservation solves this problem by reserving job slots as they become available, until there are enough reserved job slots to run the parallel job.

You might want to configure processor reservation if your cluster has a lot of sequential jobs that compete for job slots with parallel jobs.

How processor reservation works

Processor reservation is disabled by default.

If processor reservation is enabled and a parallel job cannot be dispatched because there are not enough job slots to satisfy its minimum processor requirements, the job slots that are currently available are reserved and accumulated.

A reserved job slot is unavailable to any other job. To avoid deadlock situations in which the system reserves job slots for multiple parallel jobs and none of them can acquire sufficient resources to start, a parallel job gives up all its reserved job slots if it has not accumulated enough to start within a specified time. The reservation time starts from the time the first slot is reserved. When the reservation time expires, the job cannot reserve any slots for one scheduling cycle, but then the reservation process can begin again.

If you specify first execution host candidates at the job or queue level, LSF tries to reserve a job slot on the first execution host. If LSF cannot reserve a first execution host job slot, it does not reserve slots on any other hosts.

Configure processor reservation

  1. To enable processor reservation, set SLOT_RESERVE in lsb.queues and specify the reservation time (a job cannot hold any reserved slots after its reservation time expires).
Syntax

SLOT_RESERVE=MAX_RESERVE_TIME[n]

where n is an integer by which to multiply MBD_SLEEP_TIME. MBD_SLEEP_TIME is defined in lsb.params; the default value is 60 seconds.

Example
Begin Queue
.
PJOB_LIMIT=1
SLOT_RESERVE = MAX_RESERVE_TIME[5]
.
End Queue 

In this example, if MBD_SLEEP_TIME is 60 seconds, a job can reserve job slots for 5 minutes. If MBD_SLEEP_TIME is 30 seconds, a job can reserve job slots for 5 * 30 = 150 seconds, or 2.5 minutes.

Viewing information about reserved job slots

Reserved slots can be displayed with the bjobs command. The number of reserved slots can be displayed with the bqueues, bhosts, bhpart, and busers commands. Look in the RSV column.

Reserving Memory for Pending Parallel Jobs

By default, the rusage string reserves resources for running jobs. Because resources are not reserved for pending jobs, some memory-intensive jobs could be pending indefinitely because smaller jobs take the resources immediately before the larger jobs can start running. The more memory a job requires, the worse the problem is.

Memory reservation for pending jobs solves this problem by reserving memory as it becomes available, until the total required memory specified on the rusage string is accumulated and the job can start. Use memory reservation for pending jobs if memory-intensive jobs often compete for memory with smaller jobs in your cluster.

Unlike slot reservation, which only applies to parallel jobs, memory reservation applies to both sequential and parallel jobs.

Configuring memory reservation for pending parallel jobs

Use the RESOURCE_RESERVE parameter in lsb.queues to reserve host memory for pending jobs, as described in Memory Reservation for Pending Jobs.

lsb.queues
  1. Set the RESOURCE_RESERVE parameter in a queue defined in lsb.queues.
  2. The RESOURCE_RESERVE parameter overrides the SLOT_RESERVE parameter. If both RESOURCE_RESERVE and SLOT_RESERVE are defined in the same queue, job slot reservation and memory reservation are enabled and an error is displayed when the cluster is reconfigured. SLOT_RESERVE is ignored. Backfill on memory may still take place.

    The following queue enables both memory reservation and backfill in the same queue:

    Begin Queue
    QUEUE_NAME = reservation_backfill
    DESCRIPTION = For resource reservation and backfill
    PRIORITY = 40
    RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
    BACKFILL = Y
    End Queue 
    

Enable per-slot memory reservation

By default, memory is reserved for parallel jobs on a per-host basis. For example, by default, the command:

bsub -n 4 -R "rusage[mem=500]" -q reservation myjob 

requires the job to reserve 500 MB on each host where the job runs.

  1. To enable per-slot memory reservation, define RESOURCE_RESERVE_PER_SLOT=y in lsb.params. In this example, if per-slot reservation is enabled, the job must reserve 500 MB of memory for each job slot (4 * 500 = 2 GB) on the host in order to run.
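
As a sketch, the corresponding entry in lsb.params is a single line in the Parameters section:

Begin Parameters
RESOURCE_RESERVE_PER_SLOT = y
End Parameters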

Backfill Scheduling: Allowing Jobs to Use Reserved Job Slots

By default, a reserved job slot cannot be used by another job. To make better use of resources and improve performance of LSF, you can configure backfill scheduling.

About backfill scheduling

Backfill scheduling allows other jobs to use the reserved job slots, as long as the other jobs do not delay the start of another job. Backfilling, together with processor reservation, allows large parallel jobs to run while not underutilizing resources.

In a busy cluster, processor reservation helps to schedule large parallel jobs sooner. However, by default, reserved processors remain idle until the large job starts. This degrades the performance of LSF because the reserved resources are idle while jobs are waiting in the queue.

Backfill scheduling allows the reserved job slots to be used by small jobs that can run and finish before the large job starts. This improves the performance of LSF because it increases the utilization of resources.

How backfilling works

For backfill scheduling, LSF assumes that a job can run until its run limit expires. Backfill scheduling works most efficiently when all the jobs in the cluster have a run limit.

Since jobs with a shorter run limit have a better chance of being scheduled as backfill jobs, users who specify appropriate run limits in a backfill queue are rewarded with improved turnaround time.

Once the big parallel job has reserved sufficient job slots, LSF calculates the start time of the big job, based on the run limits of the jobs currently running in the reserved slots. LSF cannot backfill if the big job is waiting for a job that has no run limit defined.

If LSF can backfill the idle job slots, only jobs with run limits that expire before the start time of the big job are allowed to use the reserved job slots. LSF cannot backfill with a job that has no run limit.

Example

In this scenario, assume the cluster consists of a 4-CPU multiprocessor host.

  1. A sequential job (job1) with a run limit of 2 hours is submitted and gets started at 8:00 am.
  2. Shortly afterwards, a parallel job (job2) requiring all 4 CPUs is submitted. It cannot start right away because job1 is using one CPU, so it reserves the remaining 3 processors.
  3. At 8:30 am, another parallel job (job3) is submitted, requiring only two processors and with a run limit of 1 hour. Since job2 cannot start until 10:00 am (when job1 finishes), its reserved processors can be backfilled by job3. Therefore job3 can complete before job2's start time, making use of the idle processors.
  4. Job3 finishes at 9:30 am and job1 at 10:00 am, allowing job2 to start shortly after 10:00 am. In this example, if job3's run limit was 2 hours, it would not be able to backfill job2's reserved slots, and would have to run after job2 finishes.
Limitations
Backfilling and job slot limits

A backfill job borrows a job slot that is already taken by another job. The backfill job does not run at the same time as the job that reserved the job slot first. Backfilling can take place even if the job slot limits for a host or processor have been reached. Backfilling cannot take place if the job slot limits for users or queues have been reached.

Job resize allocation requests

Pending job resize allocation requests are supported by backfill policies. However, the run time of a pending resize request is equal to the remaining run time of the running resizable job. For example, if the RUNLIMIT of a resizable job is 20 hours and 4 hours have already passed, the run time of the pending resize request is 16 hours.

Configuring backfill scheduling

Backfill scheduling is enabled at the queue level. Only jobs in a backfill queue can backfill reserved job slots. If the backfill queue also allows processor reservation, then backfilling can occur among jobs within the same queue.

Configure a backfill queue
  1. To configure a backfill queue, define BACKFILL in lsb.queues.
  2. Specify Y to enable backfilling. To disable backfilling, specify N or blank space.
Example

BACKFILL=Y

Enforcing run limits

Backfill scheduling requires all jobs to specify a duration. If you specify a run time limit using the command line bsub -W option or by defining the RUNLIMIT parameter in lsb.queues or lsb.applications, LSF uses that value as a hard limit and terminates jobs that exceed the specified duration. Alternatively, you can specify an estimated duration by defining the RUNTIME parameter in lsb.applications. LSF uses the RUNTIME estimate for scheduling purposes only, and does not terminate jobs that exceed the RUNTIME duration.

Backfill scheduling works most efficiently when all the jobs in a cluster have a run limit specified at the job level (bsub -W). You can use the external submission executable, esub, to make sure that all users specify a job-level run limit.

Otherwise, you can specify ceiling and default run limits at the queue level (RUNLIMIT in lsb.queues) or application level (RUNLIMIT in lsb.applications).
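
For example, an lsb.queues sketch (the queue name and values are arbitrary; RUNLIMIT values are in minutes) that gives backfill jobs a 30-minute default and a 2-hour maximum run limit:

Begin Queue
QUEUE_NAME = short_backfill
PRIORITY = 30
BACKFILL = Y
RUNLIMIT = 30 120
DESCRIPTION = Backfill queue with default and maximum run limits
End Queue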

View information about job start time

  1. Use bjobs -l to view the estimated start time of a job.

Using backfill on memory

If BACKFILL is configured in a queue, and a run limit is specified with -W on bsub or with RUNLIMIT in the queue, backfill jobs can use the accumulated memory reserved by the other jobs, as long as the backfill job can finish before the predicted start time of the jobs with the reservation.

Unlike slot reservation, which only applies to parallel jobs, backfill on memory applies to sequential and parallel jobs.

The following queue enables both memory reservation and backfill on memory in the same queue:

Begin Queue
QUEUE_NAME = reservation_backfill
DESCRIPTION = For resource reservation and backfill
PRIORITY = 40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
BACKFILL = Y
End Queue 

Examples of memory reservation and backfill on memory

lsb.queues

The following queues are defined in lsb.queues:

Begin Queue
QUEUE_NAME = reservation
DESCRIPTION = For resource reservation
PRIORITY=40
RESOURCE_RESERVE = MAX_RESERVE_TIME[20]
End Queue

Begin Queue
QUEUE_NAME = backfill
DESCRIPTION = For backfill scheduling
PRIORITY = 30
BACKFILL = y
End Queue 
lsb.params

Per-slot memory reservation is enabled by RESOURCE_RESERVE_PER_SLOT=y in lsb.params.

Assumptions

Assume one host in the cluster with 10 CPUs and 1 GB of free memory currently available.

Sequential jobs

Each of the following sequential jobs requires 400 MB of memory. The first three jobs run for 300 minutes.

Job 1:

bsub -W 300 -R "rusage[mem=400]" -q reservation myjob1 

The job starts running, using 400 MB of memory and one job slot.

Job 2:

Submitting a second job with the same requirements gets the same result.

Job 3:

Submitting a third job with the same requirements reserves one job slot and reserves all free memory, if the amount of free memory is between 20 MB and 200 MB (some free memory may be used by the operating system or other software).

Job 4:

bsub -W 400 -q backfill -R "rusage[mem=50]" myjob4 

The job keeps pending, since memory is reserved by job 3 and it runs longer than job 1 and job 2.

Job 5:

bsub -W 100 -q backfill -R "rusage[mem=50]" myjob5 

The job starts running. It uses one free slot and memory reserved by job 3. If the job does not finish in 100 minutes, it is killed by LSF automatically.

Job 6:

bsub -W 100 -q backfill -R "rusage[mem=300]" myjob6 

The job keeps pending with no resource reservation because it cannot get enough memory from the memory reserved by job 3.

Job 7:

bsub -W 100 -q backfill myjob7 

The job starts running. LSF assumes it does not require any memory and enough job slots are free.

Parallel jobs

Each process of a parallel job requires 100 MB of memory, and each parallel job needs 4 CPUs. The first two of the following parallel jobs run for 300 minutes.

Job 1:

bsub -W 300 -n 4 -R "rusage[mem=100]" -q reservation myJob1 

The job starts running, using 4 slots and 400 MB of memory.

Job 2:

Submitting a second job with the same requirements gets the same result.

Job 3:

Submitting a third job with the same requirements reserves 2 slots, and reserves all 200 MB of available memory, assuming no other applications are running outside of LSF.

Job 4:

bsub -W 400 -q backfill -R "rusage[mem=50]" myJob4 

The job keeps pending since all available memory is already reserved by job 3. It runs longer than job 1 and job 2, so no backfill happens.

Job 5:

bsub -W 100 -q backfill -R "rusage[mem=50]" myJob5 

This job starts running. It can backfill the slot and memory reserved by job 3. If the job does not finish in 100 minutes, it is killed by LSF automatically.

Using interruptible backfill

Interruptible backfill scheduling can improve cluster utilization by allowing reserved job slots to be used by low priority small jobs that are terminated when the higher priority large jobs are about to start.

An interruptible backfill job either starts as a regular job and is killed when it exceeds the queue run time limit, or is started for backfill whenever there is a backfill time slice longer than the specified minimal time and is killed before the slot-reserving job is about to start.


Job life cycle
  1. Jobs are submitted to a queue configured for interruptible backfill. The job runtime requirement is ignored.
  2. Job is scheduled as either regular job or backfill job.
  3. The queue runtime limit is applied to the regularly scheduled job.
  4. In the backfill phase, the job is considered to run on any reserved resource whose duration is longer than the minimal time slice configured for the queue. The job run time limit is set in such a way that the job releases the resource before it is needed by the slot-reserving job.
  5. The job runs in a regular manner. It is killed upon reaching its runtime limit, and requeued for the next run. Requeueing must be explicitly configured in the queue.
Assumptions and limitations
Configure an interruptible backfill queue
  1. Configure INTERRUPTIBLE_BACKFILL=seconds in the lowest priority queue in the cluster. There can only be one interruptible backfill queue in the cluster.
  2. Specify the minimum number of seconds for the job to be considered for backfilling. This minimal time slice depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created if a site has jobs of distinctively different classes.

    For example:

    Begin Queue
    QUEUE_NAME   = background
    # REQUEUE_EXIT_VALUES (set to whatever needed)
    DESCRIPTION  = Interruptible Backfill queue
    BACKFILL = Y
    INTERRUPTIBLE_BACKFILL = 1
    RUNLIMIT = 10
    PRIORITY = 1
    End Queue 
     

    Interruptible backfill is disabled if BACKFILL and RUNLIMIT are not configured in the queue.

    The value of INTERRUPTIBLE_BACKFILL is the minimal time slice in seconds for a job to be considered for backfill. The value depends on the specific job properties; it must be longer than at least one useful iteration of the job. Multiple queues may be created for different classes of jobs.

    BACKFILL and RUNLIMIT must be configured in the queue.

    RUNLIMIT corresponds to a maximum time slice for backfill, and should be configured so that the wait period for the new jobs submitted to the queue is acceptable to users. 10 minutes of runtime is a common value.

    You should configure REQUEUE_EXIT_VALUES for the queue so that resubmission is automatic. In order to terminate completely, jobs must have specific exit values:

View the run limits for interruptible backfill jobs (bjobs and bhist)
  1. Use bjobs to display the run limit calculated based on the configured queue-level run limit.
  2. For example, the interruptible backfill queue lazy configures RUNLIMIT=60:

    bjobs -l 135 
    Job <135>, User <user1>, Project <default>, Status <RUN>, Queue <lazy>, Command 
                         <myjob> 
    Mon Nov 21 11:49:22: Submitted from host <hostA>, CWD <$HOME/H 
                         PC/jobs>; 
     RUNLIMIT 
     59.5 min of hostA 
    Mon Nov 21 11:49:26: Started on <hostA>, Execution Home </home 
                         /user1>, Execution CWD </home/user1/HPC/jobs>; 
    
  3. Use bhist to display job-level run limit if specified.
  4. For example, job 135 was submitted with a run limit of 3 hours:

    bsub -n 1 -q lazy -W 3:0 myjob 
    Job <135> is submitted to queue <lazy>. 
     

    bhist displays the job-level run limit:

    bhist -l 135 
    Job <135>, User <user1>, Project <default>, Command <myjob> 
    Mon Nov 21 11:49:22: Submitted from host <hostA>, to Queue <lazy>, 
                         CWD <$HOME/HPC/jobs>; 
     RUNLIMIT 
     180.0 min of hostA 
    Mon Nov 21 11:49:26: Dispatched to <hostA>; 
    Mon Nov 21 11:49:26: Starting (Pid 2746); 
    Mon Nov 21 11:49:27: Interruptible backfill runtime limit is 59.5 minutes; 
    Mon Nov 21 11:49:27: Running with execution home </home/user1>, Execution CWD ...

Displaying available slots for backfill jobs

The bslots command displays slots reserved for parallel jobs and advance reservations. The available slots are not currently used for running jobs, and can be used for backfill jobs. The available slots displayed by bslots are only a snapshot of the slots currently not in use by parallel jobs or advance reservations. They are not guaranteed to be available at job submission.

By default, bslots displays all available slots, and the available run time for those slots. When no reserved slots are available for backfill, bslots displays "No reserved slots available."

The backfill window calculation is based on the snapshot information (current running jobs, slot reservations, advance reservations) obtained from mbatchd.

The backfill window displayed can serve as a reference for submitting backfillable jobs. However, if you specify extra resource requirements or special submission options, the backfill window does not ensure that submitted jobs are scheduled and dispatched successfully.

bslots -R only supports the select resource requirement string. Other resource requirement selections are not supported.

If the available backfill window has no run time limit, its length is displayed as UNLIMITED.

Examples

Display all available slots for backfill jobs:

bslots

SLOTS  RUNTIME
1      UNLIMITED
3      1 hour 30 minutes
5      1 hour 0 minutes
7      45 minutes
15     40 minutes
18     30 minutes
20     20 minutes

Display available slots for backfill jobs requiring 15 slots or more:

bslots -n 15

SLOTS  RUNTIME
15     40 minutes
18     30 minutes
20     20 minutes

Display available slots for backfill jobs requiring a run time of 30 minutes or more:

bslots -W 30

SLOTS  RUNTIME
3      1 hour 30 minutes
5      1 hour 0 minutes
7      45 minutes
15     40 minutes
18     30 minutes

bslots -W 2:45

No reserved slots available.

bslots -n 15 -W 30

SLOTS  RUNTIME
15     40 minutes
18     30 minutes

Display available slots for backfill jobs requiring a host with more than 500 MB of memory:

bslots -R "mem>500"

SLOTS  RUNTIME
7      45 minutes
15     40 minutes

Display the host names with available slots for backfill jobs:

bslots -l

SLOTS:  15
RUNTIME:  40 minutes
HOSTS: 1*hostB 1*hostE 3*hostC ...
        3*hostZ ...     ...

SLOTS: 15
RUNTIME: 30 minutes
HOSTS: 2*hostA 1*hostB 3*hostC ...
        1*hostX ...     ...

Submitting backfill jobs according to available slots

  1. Use bslots to display job slots available for backfill jobs.
  2. Submit a job to a backfill queue. Specify a runtime limit and the number of processors required that are within the availability shown by bslots.

Submitting a job according to the backfill slot availability shown by bslots does not guarantee that the job is backfilled successfully. The slots may not be available by the time the job is actually scheduled, or the job might not be dispatched because other resource requirements are not satisfied.
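For example, a minimal sketch based on the bslots output shown above and the backfill queue used in the earlier examples (the run limit value is illustrative):

bsub -q backfill -n 15 -W 35 myjob 

The job requests 15 processors and a 35-minute run limit, which fits inside the 15-slot, 40-minute backfill window reported by bslots.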

Parallel Fairshare

LSF can consider the number of CPUs when using fairshare scheduling with parallel jobs.

If the job is submitted with bsub -n, the following formula is used to calculate dynamic priority:

dynamic priority = number_shares / (cpu_time * CPU_TIME_FACTOR + run_time * number_CPUs * RUN_TIME_FACTOR + (1 + job_slots) * RUN_JOB_FACTOR + fairshare_adjustment(struct* shareAdjustPair) * FAIRSHARE_ADJUSTMENT_FACTOR)

where number_CPUs is the number of CPUs used by the job.

Configure parallel fairshare

To configure parallel fairshare so that the number of CPUs is considered when calculating dynamic priority for queue-level user-based fairshare:

note:  
LSB_NCPU_ENFORCE does not apply to host-partition user-based fairshare. For host-partition user-based fairshare, the number of CPUs is automatically considered.
  1. Configure fairshare at the queue level as indicated in Fairshare Scheduling.
  2. To enable parallel fairshare, set the parameter LSB_NCPU_ENFORCE=1 in lsf.conf.
  3. To make your changes take effect, use the following commands to restart all LSF daemons:
  4. # lsadmin reconfig
    # lsadmin resrestart all
    # badmin hrestart all
    # badmin mbdrestart 
    
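For example, a minimal sketch of the two settings (the queue name and share assignments are illustrative only):

# lsb.queues: queue-level user-based fairshare 
Begin Queue
QUEUE_NAME  = parallel
FAIRSHARE   = USER_SHARES[[default, 1]]
DESCRIPTION = Queue with parallel fairshare
End Queue 

# lsf.conf: consider the number of CPUs for parallel jobs 
LSB_NCPU_ENFORCE=1 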

How Deadline Constraint Scheduling Works For Parallel Jobs

For information about deadline constraint scheduling, see Using Deadline Constraint Scheduling. Deadline constraint scheduling is enabled by default.

If deadline constraint scheduling is enabled and a parallel job has a CPU limit but no run limit, LSF considers the number of processors when calculating how long the job takes.

LSF assumes that the minimum number of processors are used, and that they are all the same speed as the candidate host. If the job cannot finish under these conditions, LSF does not place the job.

The formula is:

(deadline time - current time) > (CPU limit on candidate host / minimum number of processors)
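For example (the numbers are illustrative and CPU factor normalization is ignored for simplicity), consider a job submitted with a minimum of 4 processors and a CPU limit of 120 minutes:

bsub -n 4,8 -c 120 myjob 

LSF assumes the minimum of 4 processors is used, so it places the job only if more than 120 / 4 = 30 minutes remain before the deadline.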

Optimized Preemption of Parallel Jobs

You can configure preemption for parallel jobs to reduce the number of jobs suspended in order to run a large parallel job.

When a high-priority parallel job preempts multiple low-priority parallel jobs, sometimes LSF preempts more low-priority jobs than are necessary to release sufficient job slots to start the high-priority job.

The PREEMPT_FOR parameter in lsb.params with the MINI_JOB keyword enables the optimized preemption of parallel jobs, so LSF preempts fewer of the low-priority parallel jobs.

Enabling the feature only improves the efficiency in cases where both preemptive and preempted jobs are parallel jobs.

How optimized preemption works

When you run many parallel jobs in your cluster, and parallel jobs preempt other parallel jobs, you can enable a feature to optimize the preemption mechanism among parallel jobs.

By default, LSF can over-preempt parallel jobs. When a high-priority parallel job preempts multiple low-priority parallel jobs, sometimes LSF preempts more low-priority jobs than are necessary to release sufficient job slots to start the high-priority job. The optimized preemption mechanism reduces the number of jobs that are preempted.

Enabling the feature only improves the efficiency in cases where both preemptive and preempted jobs are parallel jobs. Enabling or disabling this feature has no effect on the scheduling of jobs that require only a single processor.

Configure optimized preemption

  1. Use the PREEMPT_FOR parameter in lsb.params and specify the keyword MINI_JOB to configure optimized preemption at the cluster level.
  2. If the parameter is already set, the MINI_JOB keyword can be used along with other keywords; the other keywords do not enable or disable the optimized preemption mechanism.
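For example, a minimal sketch of the cluster-level setting in lsb.params (preemptive and preemptable queues must still be configured separately in lsb.queues):

Begin Parameters
PREEMPT_FOR = MINI_JOB
End Parameters 

If PREEMPT_FOR is already set, append MINI_JOB to the existing keywords rather than replacing them.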

Processor Binding for Parallel Jobs

See also Processor binding for LSF job processes.

By default, there is no processor binding.

For multi-host parallel jobs, LSF sets two environment variables ($LSB_BIND_JOB and $LSB_BIND_CPU_LIST) but does not attempt to bind the job to any host, even if you enable processor binding.

Resizable jobs

Adding slots to or removing slots from a resizable job triggers unbinding and rebinding of job processes. Rebinding does not guarantee that the processes can be bound to the same processors they were bound to previously.

If a multi-host parallel job becomes a single-host parallel job after resizing, LSF does not bind it.

If a single-host parallel job or sequential job becomes a multi-host parallel job after resizing, LSF does not bind it.

After unbinding and rebinding, the job's CPU affinity is changed. LSF puts the new CPU list in the LSB_BIND_CPU_LIST environment variable and the binding method in the LSB_BIND_JOB environment variable. It is the responsibility of the notification command to tell the job that its CPU binding has changed.
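For example, a minimal sketch of how a notification command might report the new binding to the application (apply_cpu_binding is a placeholder for whatever mechanism your application supports, not part of LSF):

#!/bin/sh
# Read the binding information that LSF sets after unbinding and rebinding.
# LSB_BIND_JOB holds the binding method; LSB_BIND_CPU_LIST holds the new CPU list.
echo "Binding method: $LSB_BIND_JOB"
echo "New CPU list:   $LSB_BIND_CPU_LIST"

# Placeholder: forward the new CPU list to the application
# (for example through a signal, a control file, or an RPC).
# apply_cpu_binding "$LSB_BIND_CPU_LIST"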

Job Allocations that Grow and Shrink (Resizable)

A resizable job is a job whose slot allocation can change while the job is running.

For detailed information about the resizable job feature and how to configure it, see the Platform LSF Configuration Guide.

Overview

To optimize resource utilization, LSF allows a job's allocation to shrink and grow during the job run time.

Making jobs resizable is most useful for:

Without resizable jobs, LSF makes a one-time allocation and schedules the job. Because no adjustments to the resource allocations occur, resources are sometimes wasted.

About releasing idle resources from the application

Once a resizable job is running, you can release resources (also called shrinking the allocation) as needed. When specifying which resources to release, you can:

Autoresizable jobs (grow model)

An autoresizable job is a job that can be dispatched with one slot allocation but can automatically acquire more slots as they become available. In other words, the resource allocation can grow.

An autoresizable job is submitted with a minimum and maximum slot requirement (bsub -n "min, max"). LSF automatically schedules and allocates resources to satisfy the job's minimum and maximum request.

An autoresizable job, with its minimum and maximum resource request, is scheduled initially like a regular LSF job.

  1. A job starts running when its minimum slot requirement can be met.
  2. If the job does not get its maximum number of slots once it starts running, a resize request is created for the remaining resources.
  3. A pending resize request is scheduled along with pending jobs in the same queue, except that resize requests are given a higher priority.
  4. Resize requests for jobs in the same queue and with the same priority are scheduled on a first-come, first-served basis, according to the submission time of the original job.
  5. When some or all resources become available, LSF allocates the additional resources to satisfy the resize request. LSF runs a user-defined resize notification command on the original job's first execution host to communicate the resize to the application. LSF monitors the exit value of the notification command. On success, if the job still requires more slots, LSF creates a new resize request for the job. On failure, LSF deallocates the resources given in the resize and requeues the resize request.

LSF does not remove resources from autoresizable jobs if they do not need them. You must shrink the allocation manually or set your application to shrink the allocation when needed.

Resize notification command

The resize notification command is the means by which LSF can communicate with your application in the event of a resize.

Environment variables for the resize notification command describe whether the resize is shrink or grow as well as the allocation changes.

LSF monitors the exit code from the command and, if it fails, takes appropriate action: for a grow event, LSF releases the newly allocated slots and reschedules the pending request; for a shrink event, the slots are not released.

No notification command is included with LSF; it is a user-created executable.

Resizable job management

Enable shrinking an allocation for an application

Prerequisites: You must set up your application profile with RESIZABLE_JOBS=Y.

You must be the job owner, cluster administrator, queue administrator, user group administrator, or root to release resources.

You can set your application to release its resources when they are no longer used.

  1. Configure the application profile for resizable jobs.
  2. RESIZABLE_JOBS=Y 
    
  3. Set your application to release resources when they are no longer used:
  4. bresize release -c -rncn released_host_specification job_ID
    (or use the API lsb_resize_release). 
     

    -c cancels pending resize requests.

    -rncn releases resources without running a resize notification command.

    where released_host_specification is the specification (list or range of hosts and number of slots) of resources to be released and where job_ID is the ID of the job.

    Result: LSF releases slots when the application no longer needs them.
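    For example (the host list and job ID reuse the illustrative values from the examples below):

    bresize release -c -rncn "1*hostA 2*hostB" 221 

    This cancels any pending resize request for job 221 and releases 1 slot on hostA and 2 slots on hostB without running a resize notification command.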

Shrink an allocation manually

Prerequisites: You must set up your application profile with RESIZABLE_JOBS=Y.

You must be the job owner, cluster administrator, queue administrator, user group administrator, or root to release resources.

If you have not set up your application to automatically shrink an allocation, a user can also shrink an allocation.

  1. In the application profile, specify a value for RESIZE_NOTIFY_CMD.
  2. The resize notification command communicates the job's allocation changes between LSF and the application. The application tells LSF it wants to release slots. LSF can then communicate with the application to release resources that are no longer needed, shrinking the allocation.

    A script is one possible solution and can inform an application of the resizing by:

  3. When resources are no longer used, release resources:
  4. bresize release -c released_host_specification job_ID

    where released_host_specification is the specification (list or range of hosts and number of slots) of resources to be released and where job_ID is the ID of the job.

    Example: bresize release -c "1*hostA 2*hostB hostC" 221

    LSF releases 1 slot on hostA, 2 slots on hostB, and all slots on hostC for job 221.

If the resize notification command completes successfully, LSF considers the allocation release done and updates the job allocation. If the shrink request fails, LSF does not update the job allocation.

Autoresizable job management

Autoresizable jobs can have resources released or added.

Submit an autoresizable job

  1. Run bsub -n 4,10 -ar.
  2. LSF dispatches the job (as long as the minimum slot request is satisfied).

    After the job successfully starts, LSF continues to schedule and allocate additional resources to satisfy the maximum slot request for the job.

  3. (Optional, as required) Release resources that are no longer needed.
  4. bresize release released_host_specification job_ID

    where released_host_specification is the specification (list or range of hosts and number of slots) of resources to be released.

    Example: bresize release "1*hostA 2*hostB hostC" 221

    LSF releases 1 slot on hostA, 2 slots on hostB, and all slots on hostC for job 221.

    Result: The resize notification command runs on the first execution host.

Check pending resize requests

A resize request pends until the job's maximum slot request has been allocated or the job finishes (or the resize request is canceled).

  1. Run bjobs -l job_id.

Cancel an active pending request

Prerequisites: Only the job owner, cluster administrators, queue administrators, user group administrators, and root can cancel pending resource allocation requests.

An allocation request must be pending.

If a job still has an active pending resize request, but you do not want to allocate more resources to the job, you can cancel it.

By default, if the job has an active pending resize request, you cannot release the resources. You must cancel the request first.

  1. Run bresize cancel job_ID.
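    For example, to cancel the pending resize request for job 221:

    bresize cancel 221 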

Specify a resize notification command manually

You can specify a resize notification command at job submission that is different from the one set up in the application profile.

  1. On job submission, run bsub -rnc resize_notification_cmd.
  2. The job submission command overrides the application profile setting.

  3. Ensure the resize notification command checks any environment variables for resizing.
  4. For example, LSB_RESIZE_EVENT indicates why the notification command was called (grow or shrink) and LSB_RESIZE_HOSTS lists slots and hosts. Use LSB_JOBID to determine which job is affected.

The command you specified runs on the first execution host of the resized job.

LSF monitors the exit code from the command and takes appropriate action when the command returns an exit code corresponding to resize failure.
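For example (the script path is an assumption; point -rnc at your own notification script, such as the one shown in the next section):

bsub -ar -n 4,10 -rnc /share/lsf/scripts/resize_notify.sh myjob 

The job is autoresizable, and the specified script overrides any RESIZE_NOTIFY_CMD configured in the application profile.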

Script for resizing

#!/bin/sh 
# The purpose of this script is to inform  
# an application of a resize event.   
# 
# You can identify the application by:  
# 
#   1. LSF job ID ($LSB_JOBID), or  
#   2. pid ($LS_JOBPID). 
  
# handle the 'grow' event 
if [ $LSB_RESIZE_EVENT = "grow" ]; then 
  
    # Inform the application that it can use  
    # additional slots as specified in  
    # $LSB_RESIZE_HOSTS.  
    # 
    # Exit with $LSB_RESIZE_NOTIFY_FAIL if  
    # the application fails to resize. 
    # 
    # If the application cannot use any 
    # additional resources, you may want  
    # to run `bresize cancel $LSB_JOBID' 
    # before exit. 
  
    exit $LSB_RESIZE_NOTIFY_OK 
fi 
  
# handle the 'shrink' event 
if [ $LSB_RESIZE_EVENT = "shrink" ]; then 
  
    # Instruct the application to release the 
    # slots specified in $LSB_RESIZE_HOSTS. 
    # 
    # Exit with $LSB_RESIZE_NOTIFY_FAIL if  
    # the resources cannot be released. 
     
    exit $LSB_RESIZE_NOTIFY_OK 
fi 
  
# unknown event -- should not happen 
exit $LSB_RESIZE_NOTIFY_FAIL 

Feature interactions

Resource usage

When a job grows or shrinks, its resource reservation (for example memory or shared resources) changes proportionately.

Limits

When a resize occurs, slots are added to a job's allocation only if the job does not violate any resource limits placed on it.

Job scheduling and dispatch

The JOB_ACCEPT_INTERVAL parameter in lsb.params or lsb.queues controls the number of seconds to wait after dispatching a job to a host before dispatching a second job to the same host. The parameter applies to all allocated hosts of a parallel job. For resizable job allocation requests, JOB_ACCEPT_INTERVAL applies to newly allocated hosts.
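For example, a minimal sketch of the lsb.params setting (the value is illustrative):

Begin Parameters
JOB_ACCEPT_INTERVAL = 1
End Parameters 

With this setting, a host that is newly added to a resizable job's allocation is also subject to the accept interval before another job is dispatched to it.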

Chunk jobs

Because candidate jobs for the chunk job feature are short-running sequential jobs, the resizable job feature does not support job chunking:

brequeue

Jobs requeued with brequeue start from the beginning. After requeue, LSF restores the original allocation request for the job.

blaunch

Parallel tasks running through blaunch can be resizable.

bswitch

bswitch can switch resizable jobs between queues regardless of job state (including the job's resizing state). Once the job is switched, the parameters of the new queue apply, including threshold configuration, run limit, CPU limit, queue-level resource requirements, and so on.

User group administrators

User group administrators are allowed to issue bresize commands to release part of the resources from a job allocation (bresize release) or to cancel an active pending resize request (bresize cancel).

Requeue exit values

If job-level, application-level, or queue-level REQUEUE_EXIT_VALUES are defined and the job exits with one of the defined exit codes, LSF puts the requeued job back into PEND status. For resizable jobs, LSF schedules the requeued job according to the initial allocation request, regardless of any change to the job's allocation size.
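For example, a minimal sketch of a queue-level setting (the queue name and exit values are illustrative only):

Begin Queue
QUEUE_NAME          = normal
REQUEUE_EXIT_VALUES = 99 100
End Queue 

A job that exits with 99 or 100 is requeued; a resizable job is then rescheduled with its initial allocation request.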

Automatic job rerun

A rerunnable job is rescheduled after the first running host becomes unreachable. Once the job is rerun, LSF schedules resizable jobs based on their initial allocation request.

Compute units

Autoresizable jobs cannot have compute unit requirements.

Compound resource requirements

Resizable jobs cannot have compound resource requirements.

