Knowledge Center Contents Previous Next Index |
About Platform LSF
Contents
Learn about Platform LSF
Before using Platform LSF for the first time, you should download and read LSF Version 7 Release Notes for the latest information about what's new in the current release and other important information.
Cluster Concepts
![]()
Clusters, jobs, and queues
Cluster
A group of computers (hosts) running LSF that work together as a single unit, combining computing power and sharing workload and resources. A cluster provides a single-system image for disparate computing resources.
Hosts can be grouped into clusters in a number of ways. A cluster could contain:
- All the hosts in a single administrative group
- All the hosts on one file server or sub-network
- Hosts that perform similar functions
Commands:
lshosts
- View static resource information about hosts in the clusterbhosts
- View resource and job information about server hosts in the clusterlsid
- View the cluster namelsclusters
- View cluster status and sizeConfiguration:
- Define hosts in your cluster in
lsf.cluster.
cluster_name
tip:
The name of your cluster should be unique. It should not be the same as any host or queue.Job
A unit of work run in the LSF system. A job is a command submitted to LSF for execution. LSF schedules, controls, and tracks the job according to configured policies.
Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power.
Commands:
bjobs
- View jobs in the systembsub
- Submit jobsJob slot
A job slot is a bucket into which a single unit of work is assigned in the LSF system.
If hosts are configured with a number of job slots, you can dispatch jobs from queues until all the job slots are filled.
Commands:
bhosts
- View job slot limits for hosts and host groupsbqueues
- View job slot limits for queuesbusers
- View job slot limits for users and user groupsConfiguration:
- Define job slot limits in
lsb.resources
.Job states
LSF jobs have the following states:
- PEND - Waiting in a queue for scheduling and dispatch
- RUN - Dispatched to a host and running
- DONE - Finished normally with zero exit value
- EXIT - Finished with non-zero exit value
- PSUSP - Suspended while pending
- USUSP - Suspended by user
- SSUSP - Suspended by the LSF system
- POST_DONE - Post-processing completed without errors
- POST_ERR - Post-processing completed with errors
- UNKWN -
mbatchd
has lost contact withsbatchd
on the host on which the job runs- WAIT-For jobs submitted to a chunk job queue, members of a chunk job that are waiting to run
- ZOMBI-A job becomes ZOMBI if the execution host is unreachable when a non-rerunnable job is killed or a rerunnable job is requeued
Queue
A clusterwide container for jobs. All jobs wait in queues until they are scheduled and dispatched to hosts.
Queues do not correspond to individual hosts; each queue can use all server hosts in the cluster, or a configured subset of the server hosts.
When you submit a job to a queue, you do not need to specify an execution host. LSF dispatches the job to the best available execution host in the cluster to run that job.
Queues implement different job scheduling and control policies.
Commands:
bqueues
- View available queuesbsub
-q
- Submit a job to a specific queuebparams
- View default queuesConfiguration:
- Define queues in
lsb.queues
tip:
The names of your queues should be unique. They should not be the same as the cluster name or any host in the cluster.First-come, first-served (FCFS) scheduling
The default type of scheduling in LSF. Jobs are considered for dispatch based on their order in the queue.
Hosts
Host
An individual computer in the cluster.
Each host may have more than 1 processor. Multiprocessor hosts are used to run parallel jobs. A multiprocessor host with a single process queue is considered a single machine, while a box full of processors that each have their own process queue is treated as a group of separate machines.
Commands:
lsload
- View load on hostslshosts
- View configuration information about hosts in the cluster including number of CPUS, model, type, and whether the host is a client or serverbhosts
- View batch server hosts in the cluster
tip:
The names of your hosts should be unique. They should not be the same as the cluster name or any queue defined for the cluster.Submission host
The host where jobs are submitted to the cluster.
Jobs are submitted using the
bsub
command or from an application that uses the LSF API.Client hosts and server hosts can act as submission hosts.
Commands:
bsub
- Submit a jobbjobs
- View jobs that are submittedExecution host
The host where a job runs. Can be the same as the submission host. All execution hosts are server hosts.
Commands:
bjobs
- View where a job runsServer host
Hosts that are capable of submitting and executing jobs. A server host runs
sbatchd
to execute server requests and apply local policies.An LSF cluster may consist of static and dynamic hosts. Dynamic host configuration allows you to add and remove hosts without manual reconfiguration.
By default, all configuration changes made to LSF are static. To add or remove hosts within the cluster, you must manually change the configuration and restart all master candidates.
Commands:
lshosts
- View hosts that are servers (server=Yes
)Configuration:
- Server hosts are defined in the
lsf.cluster.
cluster_name
file by setting the value ofserver
to 1Client host
Hosts that are only capable of submitting jobs to the cluster. Client hosts run LSF commands and act only as submission hosts. Client hosts do not execute jobs or run LSF daemons.
Commands:
lshosts
- View hosts that are clients (server=No
)Configuration:
- Client hosts are defined in the
lsf.cluster.
cluster_name
file by setting the value ofserver
to 0Floating client host
In LSF, you can have both client hosts and floating client hosts. The difference is in the type of license purchased.
If you purchased a regular (fixed) client license, LSF client hosts are static. The client hosts must be listed in
lsf.cluster.
cluster_name
. The license is fixed to the hosts specified inlsf.cluster.
cluster_name
and whenever client hosts change, you must update it with the new host list.If you purchased a floating client license, LSF floating client hosts are dynamic. They are not listed in
lsf.cluster.
cluster_name
. Since LSF does not take into account the host name but the number of floating licenses, clients can change dynamically and licenses will be distributed to clients that request to use LSF.When you submit a job from any unlicensed host, and if there are any floating licenses free, the host will check out a license and submit your job to LSF. However, once a host checks out a floating client license, it keeps that license for the rest of the day, until midnight. A host that becomes a floating client behaves like a fixed client all day, then at 12 midnight it releases the license. At that time, the host turns back into a normal, unlicensed host, and the floating client license becomes available to any other host that needs it.
Master host
Where the master LIM and
mbatchd
run. An LSF server host that acts as the overall coordinator for that cluster. Each cluster has one master host to do all job scheduling and dispatch. If the master host goes down, another master candidate LSF server in the cluster becomes the master host.All LSF daemons run on the master host. The LIM on the master host is the master LIM.
Commands:
lsid
- View the master host nameConfiguration:
- The master host is defined along with other candidate master hosts by LSF_MASTER_LIST in
lsf.conf
.LSF daemons
LSF daemon Role mbatchd Job requests and dispatch mbschd Job scheduling sbatchdres Job execution lim Host information pim Job process information elim Collect and track custom dynamic load indices
mbatchd
Master Batch Daemon running on the master host. Started by
sbatchd
. Responsible for the overall state of jobs in the system.Receives job submission, and information query requests. Manages jobs held in queues. Dispatches jobs to hosts as determined by
mbschd
.Commands:
badmin mbdrestart
- Startssbatchd
Configuration:
- Port number defined in
lsf.conf
.mbschd
Master Batch Scheduler Daemon running on the master host. Works with
mbatchd
. Started bymbatchd
.Makes scheduling decisions based on job requirements and policies.
sbatchd
Slave Batch Daemon running on each server host. Receives the request to run the job from
mbatchd
and manages local execution of the job. Responsible for enforcing local policies and maintaining the state of jobs on the host.
sbatchd
forks a childsbatchd
for every job. The childsbatchd
runs an instance ofres
to create the execution environment in which the job runs. The childsbatchd
exits when the job is complete.Commands:
badmin hstartup
- Startssbatchd
badmin hshutdown
- Shuts downsbatchd
badmin hrestart
- Restartssbatchd
Configuration:
- Port number defined in
lsf.conf
res
Remote Execution Server (RES) running on each server host. Accepts remote execution requests to provide transparent and secure remote execution of jobs and tasks.
Commands:
lsadmin resstartup
- Starts reslsadmin resshutdown
- Shuts down reslsadmin resrestart
- Restarts resConfiguration:
- Port number defined in
lsf.conf
lim
Load Information Manager (LIM) running on each server host. Collects host load and configuration information and forwards it to the master LIM running on the master host. Reports the information displayed by
lsload
andlshosts
.Static indices are reported when the LIM starts up or when the number of CPUs (
ncpus
) change. Static indices are:
- Number of CPUs (
ncpus
)- Number of disks (
ndisks
)- Total available memory (
maxmem
)- Total available swap (
maxswp
)- Total available temp (
maxtmp
)- The number of physical processors per CPU configured on a host (
nprocs
).- The number of cores per processor configured on a host (
ncores
).- The number of threads per core configured on a host (
nthreads
).Dynamic indices for host load collected at regular intervals are:
- Hosts status (
status
)- 15 second, 1 minute, and 15 minute run queue lengths (
r15s
,r1m
, andr15m
)- CPU utilization (
ut
)- Paging rate (
pg
)- Number of login sessions (
ls
)- Interactive idle time (
it
)- Available swap space (
swp
)- Available memory (
mem
)- Available temp space (
tmp
)- Disk IO rate (
io
)Commands:
lsadmin limstartup
- Starts LIMlsadmin limshutdown
- Shuts down LIMlsadmin limrestart
- Restarts LIMlsload
- View dynamic load valueslshosts
- View static host load valuesConfiguration:
- Port number defined in
lsf.conf
.Master LIM
The LIM running on the master host. Receives load information from the LIMs running on hosts in the cluster.
Forwards load information to
mbatchd
, which forwards this information tombschd
to support scheduling decisions. If the master LIM becomes unavailable, a LIM on another host automatically takes over.Commands:
lsadmin limstartup
- Starts LIMlsadmin limshutdown
- Shuts down LIMlsadmin limrestart
- Restarts LIMlsload
- View dynamic load valueslshosts
- View static host load valuesConfiguration:
- Port number defined in
lsf.conf
.ELIM
External LIM (ELIM) is a site-definable executable that collects and tracks custom dynamic load indices. An ELIM can be a shell script or a compiled binary program, which returns the values of the dynamic resources you define. The ELIM executable must be named
elim.
anything
and located in LSF_SERVERDIR.pim
Process Information Manager (PIM) running on each server host. Started by LIM, which periodically checks on PIM and restarts it if it dies.
Collects information about job processes running on the host such as CPU and memory used by the job, and reports the information to
sbatchd
.Commands:
bjobs
- View job informationBatch jobs and tasks
You can either run jobs through the batch system where jobs are held in queues, or you can interactively run tasks without going through the batch system, such as tests for example.
Job
A unit of work run in the LSF system. A job is a command submitted to LSF for execution, using the
bsub
command. LSF schedules, controls, and tracks the job according to configured policies.Jobs can be complex problems, simulation scenarios, extensive calculations, anything that needs compute power.
Commands:
bjobs
- View jobs in the systembsub
- Submit jobsInteractive batch job
A batch job that allows you to interact with the application and still take advantage of LSF scheduling policies and fault tolerance. All input and output are through the terminal that you used to type the job submission command.
When you submit an interactive job, a message is displayed while the job is awaiting scheduling. A new job cannot be submitted until the interactive job is completed or terminated.
The
bsub
command stops display of output from the shell until the job completes, and no mail is sent to you by default. UseCtrl-C
at any time to terminate the job.Commands:
bsub -I
- Submit an interactive jobInteractive task
A command that is not submitted to a batch queue and scheduled by LSF, but is dispatched immediately. LSF locates the resources needed by the task and chooses the best host among the candidate hosts that has the required resources and is lightly loaded. Each command can be a single process, or it can be a group of cooperating processes.
Tasks are run without using the batch processing features of LSF but still with the advantage of resource requirements and selection of the best host to run the task based on load.
Commands:
lsrun
- Submit an interactive tasklsgrun
- Submit an interactive task to a group of hosts- See also LSF utilities such as
ch
,lsacct
,lsacctmrg
,lslogin
,lsplace
,lsload
,lsloadadj
,lseligible
,lsmon
,lstcsh
Local task
An application or command that does not make sense to run remotely. For example, the
ls
command on UNIX.Commands:
lsltasks
- View and add tasksConfiguration:
lsf.task
- Configure systemwide resource requirements for tasks
lsf.task.cluster
- Configure clusterwide resource requirements for tasks.lsftask
- Configure user-specific tasksRemote task
An application or command that can be run on another machine in the cluster.
Commands:
lsrtasks
- View and add tasksConfiguration:
lsf.task
- Configure systemwide resource requirements for taskslsf.task.cluster
- Configure clusterwide resource requirements for tasks.lsftask
- Configure user-specific tasksHost types and host models
Hosts in LSF are characterized by host type and host model.
The following example host type X86_64, with host models Opteron240, Opteron840, Intel_EM64T, Intel_IA64, etc.
![]()
Host type
The combination of operating system and host CPU architecture.
All computers that run the same operating system on the same computer architecture are of the same type - in other words, binary-compatible with each other.
Each host type usually requires a different set of LSF binary files.
Commands:
lsinfo -t
- View all host types defined inlsf.shared
Configuration:
- Defined in
lsf.shared
- Mapped to hosts in
lsf.cluster.
cluster_name
Host model
The host type of the computer, which determines the CPU speed scaling factor applied in load and placement calculations.
The CPU factor is taken into consideration when jobs are being dispatched.
Commands:
lsinfo -m
- View a list of currently running modelslsinfo -M
- View all models defined inlsf.shared
Configuration:
- Defined in
lsf.shared
- Mapped to hosts in
lsf.cluster.
cluster_name
Users and administrators
LSF user
A user account that has permission to submit jobs to the LSF cluster.
LSF administrator
In general, you must be an LSF administrator to perform operations that will affect other LSF users. Each cluster has one primary LSF administrator, specified during LSF installation. You can also configure additional administrators at the cluster level and at the queue level.
Primary LSF administrator
The first cluster administrator specified during installation and first administrator listed in
lsf.cluster.
cluster_name
. The primary LSF administrator account owns the configuration and log files. The primary LSF administrator has permission to perform clusterwide operations, change configuration files, reconfigure the cluster, and control jobs submitted by all users.Cluster administrator
May be specified during LSF installation or configured after installation. Cluster administrators can perform administrative operations on all jobs and queues in the cluster. Cluster administrators have the same cluster-wide operational privileges as the primary LSF administrator except that they do not necessarily have permission to change LSF configuration files.
For example, a cluster administrator can create an LSF host group, submit a job to any queue, or terminate another user's job.
Queue administrator
An LSF administrator user account that has administrative permissions limited to a specified queue. For example, an LSF queue administrator can perform administrative operations on the specified queue, or on jobs running in the specified queue, but cannot change LSF configuration or operate on LSF daemons.
Resources
Resource usage
The LSF system uses built-in and configured resources to track resource availability and usage. Jobs are scheduled according to the resources available on individual hosts.
Jobs submitted through the LSF system will have the resources they use monitored while they are running. This information is used to enforce resource limits and load thresholds as well as fairshare scheduling.
LSF collects information such as:
- Total CPU time consumed by all processes in the job
- Total resident memory usage in KB of all currently running processes in a job
- Total virtual memory usage in KB of all currently running processes in a job
- Currently active process group ID in a job
- Currently active processes in a job
On UNIX, job-level resource usage is collected through PIM.
Commands:
lsinfo
: View the resources available in your clusterbjobs -l
: View current resource usage of a jobConfiguration:
SBD_SLEEP_TIME
inlsb.params
: Configures how often resource usage information is sampled by PIM, collected bysbatchd
, and sent tombatchd
LSF_PIM_LINUX_ENHANCE
inlsf.conf
: Enables the enhanced PIM, obtaining exact memory usage of processes using shared memory on Linux operating systems. By default memory shared between jobs may be counted more than once.Load indices
Load indices measure the availability of dynamic, non-shared resources on hosts in the cluster. Load indices built into the LIM are updated at fixed time intervals.
Commands:
lsload -l
- View all load indicesbhosts -l
- View load levels on a hostExternal load indices
Defined and configured by the LSF administrator and collected by an External Load Information Manager (ELIM) program. The ELIM also updates LIM when new values are received.
Commands:
lsinfo
- View external load indicesStatic resources
Built-in resources that represent host information that does not change over time, such as the maximum RAM available to user processes or the number of processors in a machine. Most static resources are determined by the LIM at start-up time.
Static resources can be used to select appropriate hosts for particular jobs based on binary architecture, relative CPU speed, and system configuration.
Load thresholds
Two types of load thresholds can be configured by your LSF administrator to schedule jobs in queues. Each load threshold specifies a load index value:
loadSched
determines the load condition for dispatching pending jobs. If a host's load is beyond any definedloadSched
, a job will not be started on the host. This threshold is also used as the condition for resuming suspended jobs.loadStop
determines when running jobs should be suspended.To schedule a job on a host, the load levels on that host must satisfy both the thresholds configured for that host and the thresholds for the queue from which the job is being dispatched.
The value of a load index may either increase or decrease with load, depending on the meaning of the specific load index. Therefore, when comparing the host load conditions with the threshold values, you need to use either greater than (>) or less than (<), depending on the load index.
Commands:
bhosts-l
- View suspending conditions for hostsbqueues -l
- View suspending conditions for queuesbjobs -l
- View suspending conditions for a particular job and the scheduling thresholds that control when a job is resumedConfiguration:
lsb.hosts
- Configure thresholds for hostslsb.queues
- Configure thresholds for queuesRuntime resource usage limits
Limit the use of resources while a job is running. Jobs that consume more than the specified amount of a resource are signalled.
Configuration:
lsb.queues
- Configure resource usage limits for queuesHard and soft limits
Resource limits specified at the queue level are hard limits while those specified with job submission are soft limits. See
setrlimit(2)
man page for concepts of hard and soft limits.Resource allocation limits
Restrict the amount of a given resource that must be available during job scheduling for different classes of jobs to start, and which resource consumers the limits apply to. If all of the resource has been consumed, no more jobs can be started until some of the resource is released.
Configuration:
lsb.resources
- Configure queue-level resource allocation limits for hosts, users, queues, and projectsResource requirements (bsub -R)
Restrict which hosts the job can run on. Hosts that match the resource requirements are the candidate hosts. When LSF schedules a job, it collects the load index values of all the candidate hosts and compares them to the scheduling conditions. Jobs are only dispatched to a host if all load values are within the scheduling thresholds.
Commands:
bsub-R
- Specify resource requirement string for a jobConfiguration:
lsb.queues
- Configure resource requirements for queuesJob Life Cycle
1 Submit a job
You submit a job from an LSF client or server with the
bsub
command.If you do not specify a queue when submitting the job, the job is submitted to the default queue.
Jobs are held in a queue waiting to be scheduled and have the PEND state. The job is held in a job file in the
LSF_SHAREDIR/
cluster_name
/logdir/info/
directory, or in one of its subdirectories if MAX_INFO_DIRS is defined inlsb.params
.Job ID
LSF assigns each job a unique job ID when you submit the job.
Job name
You can also assign a name to the job with the
-J
option ofbsub
. Unlike the job ID, the job name is not necessarily unique.2 Schedule job
mbatchd
looks at jobs in the queue and sends the jobs for scheduling tombschd
at a preset time interval (defined by the parameter JOB_SCHEDULING_INTERVAL inlsb.params
).mbschd
evaluates jobs and makes scheduling decisions based on:
- Job priority
- Scheduling policies
- Available resources
mbschd
selects the best hosts where the job can run and sends its decisions back tombatchd
.- Resource information is collected at preset time intervals by the master LIM from LIMs on server hosts. The master LIM communicates this information to
mbatchd
, which in turn communicates it tombschd
to support scheduling decisions.3 Dispatch job
As soon as
mbatchd
receives scheduling decisions, it immediately dispatches the jobs to hosts.4 Run job
sbatchd
handles job execution. It:
- Receives the request from
mbatchd
- Creates a child
sbatchd
for the job- Creates the execution environment
- Starts the job using
res
- The execution environment is copied from the submission host to the execution host and includes the following:
- Environment variables needed by the job
- Working directory where the job begins running
- Other system-dependent environment settings, for example:
- On UNIX, resource limits and
umask
- On Windows, desktop and Windows root directory
- The job runs under the user account that submitted the job and has the status RUN.
5 Return output
When a job is completed, it is assigned the DONE status if the job was completed without any problems. The job is assigned the EXIT status if errors prevented the job from completing.
sbatchd
communicates job information including errors and output tombatchd
.6 Send email to client
mbatchd
returns the job output, job error, and job information to the submission host through email. Use the-o
and-e
options ofbsub
to send job output and errors to a file.Job report
A job report is sent by email to the LSF job owner and includes:
- Job information such as:
- CPU use
- Memory use
- Name of the account that submitted the job
- Job output
- Errors
Platform Computing Inc.
www.platform.com |
Knowledge Center Contents Previous Next Index |