What Is Platform LSF HPC?
Platform LSF™ HPC ("LSF HPC") is the distributed workload management solution for maximizing the performance of High Performance Computing (HPC) clusters.
Platform LSF HPC is fully integrated with Platform LSF, the industry standard workload management software product, to provide load sharing in a distributed system and batch scheduling for compute-intensive jobs. Platform LSF HPC provides support for:
- Dynamic resource discovery and allocation (resource reservation) for parallel batch job execution
- Full job-level control of the distributed processes to ensure that no process becomes unmanaged. This reduces the possibility of one parallel job causing severe disruption to an organization's computer service
- The standard MPI interface
- Full integration with Platform LSF, providing heterogeneous resource-based batch job scheduling including job-level resource usage enforcement
Advanced HPC scheduling policies
Platform LSF HPC enhances the job management capability of your cluster through advanced scheduling policies such as:
- Policy-based job preemption
- Advance reservation
- Memory and processor reservation
- Memory and processor backfill
- Cluster-wide resource allocation limits
- User and project-based fairshare scheduling
- Topology-aware scheduling
Agents run on every node to collect resource information such as processor load, memory availability, interconnect states, and other host-specific as well as cluster-wide resources. These agents coordinate to create a single system image of the cluster.
Supports advanced HPC scheduling policies that match user demand with resource supply.
Control sequential and parallel jobs (terminate, suspend, resume, send signals) running on the same host and across hosts. Configure and monitor job-level and system-wide CPU, memory, swap, and other runtime resource usage limits.
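The job control actions listed above (terminate, suspend, resume, send signals) correspond to the standard LSF commands bkill, bstop, and bresume. The following is a minimal sketch, not part of the product: the function name job_control_command and the action keywords are hypothetical, and the function only prints the command it would run so that it can be tried outside a cluster.

```shell
#!/bin/sh
# Sketch: map a job control action to the LSF command that performs it.
# job_control_command and the action names are hypothetical helpers.
# The function only prints the command (dry run); it does not execute it.
job_control_command() {
    action=$1
    job_id=$2
    case "$action" in
        terminate) echo "bkill $job_id" ;;
        suspend)   echo "bstop $job_id" ;;
        resume)    echo "bresume $job_id" ;;
        signal)    echo "bkill -s $3 $job_id" ;;   # $3: signal name or number
        *)         echo "unknown action: $action" >&2; return 1 ;;
    esac
}
```

For example, job_control_command suspend 1234 prints the bstop command for job 1234. Printing rather than executing keeps the sketch safe to experiment with; on a real cluster the printed command can be run as is.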
Application integration support
Packaged application integrations and tailored HPC configurations make Platform LSF HPC ideal for Industrial Manufacturing, Life Sciences, Government and Research sites using large-scale modeling and simulation parallel applications involving large amounts of data. Platform LSF HPC helps Computer-Aided Engineering (CAE) users reduce the cost of manufacturing, and increase engineer productivity and the quality of results.
Platform LSF HPC is integrated to work out of the box with many HPC applications, such as LSTC LS-Dyna, FLUENT, ANSYS, MSC Nastran, Gaussian, Lion Bioscience SRS, and NCBI BLAST.
Parallel application support
Platform LSF HPC supports jobs using the following parallel job launchers:
The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the high performance switch.
The LSF HPC integration for IBM High-Performance Switch (HPS) systems provides support for submitting POE jobs from AIX hosts to run on IBM HPS hosts.
Platform LSF HPC provides the ability to start parallel jobs that use OpenMP to communicate between processes on shared-memory machines and MPI to communicate across networked and non-shared memory machines.
Parallel Virtual Machine (PVM) is a parallel programming system distributed by Oak Ridge National Laboratory. PVM programs are controlled by the PVM hosts file, which contains host names and other information.
The Message Passing Interface (MPI) is a portable library that supports parallel programming. LSF HPC supports several MPI implementations, including MPICH, a joint implementation of MPI by Argonne National Laboratory and Mississippi State University. LSF HPC also supports MPICH-P4, MPICH-GM, LAM/MPI, Intel® MPI, and IBM Message Passing Library (MPL) communication protocols, as well as SGI and HP-UX vendor MPI integrations.
blaunch distributed application framework
Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.
Similar to the LSF lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.
blaunch supports the same core command line options as rsh and ssh. All other rsh and ssh options are silently ignored.
You cannot run blaunch directly from the LSF command line.
blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. It cannot be used as a standalone command. On success, blaunch exits with 0.
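Because blaunch can only reach hosts in the current job allocation, a job script typically loops over the allocated hosts. The following is a minimal sketch under stated assumptions: it relies on the LSF job environment variables LSB_JOBID and LSB_HOSTS, and it uses the plain "blaunch host_name command" form. The function name launch_on_allocation and the LAUNCHER override are hypothetical and exist only so the loop can be dry-run with echo outside a cluster.

```shell
#!/bin/sh
# Sketch: run one command on every host of the current LSF job allocation.
# LAUNCHER is a hypothetical override (not an LSF setting) so the loop can be
# dry-run with "echo" outside a cluster; it defaults to blaunch.
launch_on_allocation() {
    # blaunch only works inside an LSF job, so refuse to continue otherwise.
    if [ -z "${LSB_JOBID:-}" ]; then
        echo "not inside an LSF job; blaunch cannot be used here" >&2
        return 1
    fi
    # LSB_HOSTS holds one entry per allocated slot, so a host may repeat.
    for host in ${LSB_HOSTS:-}; do
        "${LAUNCHER:-blaunch}" "$host" "$@" || return 1
    done
}
```

Inside a job script submitted with bsub, calling launch_on_allocation hostname would start hostname once per allocated slot. The tasks run one after another here to keep the sketch simple; a real launcher would usually start them concurrently.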
blaunch is supported on Windows 2000 or later with the following exceptions:
- Only the following signals are supported: SIGKILL, SIGSTOP, SIGCONT.
- The -n option is not supported.
- CMD.exe /C <user command line> is used as an intermediate command shell when -no-shell is not specified.
- CMD.exe /C is not used when -no-shell is specified.
See blaunch Distributed Application Framework for more information.
PAM
The Parallel Application Manager (PAM) is the point of control for LSF HPC. PAM is fully integrated with LSF HPC. PAM interfaces the user application with LSF. For all parallel application processes (tasks), PAM:
- Monitors and forwards control signals to parallel tasks
- Monitors resource usage while the user application is running
- Passes job-level resource limits to sbatchd for enforcement
- Collects resource usage information and exit status upon termination
See pam Command Reference for more information about PAM.
Resizable jobs
Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot be resized.
Resource requirements
Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot have compound resource requirements.
Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot have resource requirements with compute unit strings (cu[...]).
When compound resource requirements are used at any level, an esub can create job-level resource requirements which overwrite most application-level and queue-level resource requirements. -R merge rules are explained in detail in Administering Platform LSF.
LSF HPC Components
LSF HPC takes full advantage of the resources of LSF for resource selection and batch job process invocation and control.
Batch jobs are submitted to LSF using the bsub command.
Master Batch Daemon (MBD) is the policy center for LSF. It maintains information about batch jobs, hosts, users, and queues. All of this information is used in scheduling batch jobs to hosts.
Load Information Manager is a daemon process running on each execution host. LIM monitors the load on its host and exchanges this information with the master LIM.
For batch submission, the master LIM provides this information to mbatchd. The master LIM resides on one execution host and collects information from the LIMs on all other hosts in the cluster. If the master LIM becomes unavailable, another host will automatically take over.
mpirun.lsf reads the environment variable LSF_PJL_TYPE and generates the appropriate command line to invoke the PJL. The esub programs provided in LSF_SERVERDIR set this variable to the proper type.
Slave Batch Daemons (SBDs) are batch job execution agents residing on the execution hosts. sbatchd receives jobs from mbatchd in the form of a job specification and starts RES to run the job according to the specification. sbatchd reports the batch job status to mbatchd whenever the job state changes.
The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.
Parallel Application Manager (PAM) is the point of control for LSF HPC. PAM is fully integrated with LSF HPC. PAM interfaces the user application with the LSF HPC system.
Remote Execution Servers (RES) reside on each execution host. RES manages all remote tasks and forwards signals, standard I/O, resource consumption data, and parallel job information between PAM and the tasks.
Parallel Job Launcher (PJL) is any executable script or binary capable of starting parallel tasks on all hosts assigned for a parallel job (for example, mpirun, poe, prun).
TaskStarter is an executable responsible for starting a task on the local host and reporting the process ID and host name to the PAM. TaskStarter is located in LSF_BINDIR.
A task is an individual process of a parallel application.
The first execution host is the host name at the top of the execution host list as determined by LSF. It starts PAM.
The execution hosts are the most suitable hosts to execute the batch job as determined by LSF.
LSF HPC provides a generic esub to handle job submission requirements of your applications. Use the -a option of bsub to specify the application you are running through LSF HPC.
For example, to submit a job to LAM/MPI:
bsub -a lammpi bsub_options mpirun.lsf myjob
The method name lammpi uses the esub for LAM/MPI jobs (LSF_SERVERDIR/esub.lammpi), which sets the environment variable LSF_PJL_TYPE=lammpi. The job launcher, mpirun.lsf, reads the environment variable LSF_PJL_TYPE=lammpi and generates the appropriate command line to invoke LAM/MPI as the PJL to start the job.
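The hand-off described above, where an esub sets LSF_PJL_TYPE and the job launcher dispatches on it, can be pictured with a small sketch. This is not the real mpirun.lsf: the function name pjl_command_line is hypothetical, the launcher names in the table are illustrative assumptions, and the actual command line generated by LSF HPC is more involved.

```shell
#!/bin/sh
# Sketch only: NOT the real mpirun.lsf. It illustrates dispatching on the
# LSF_PJL_TYPE variable that an esub is described as setting.
# The type-to-launcher mapping below is an illustrative assumption.
pjl_command_line() {
    case "${LSF_PJL_TYPE:-}" in
        lammpi) echo "mpirun $*" ;;
        poe)    echo "poe $*" ;;
        "")     echo "LSF_PJL_TYPE is not set; was the job submitted with bsub -a?" >&2
                return 1 ;;
        *)      echo "unsupported LSF_PJL_TYPE: $LSF_PJL_TYPE" >&2
                return 1 ;;
    esac
}
```

With LSF_PJL_TYPE=lammpi in the environment, pjl_command_line myjob prints a LAM/MPI style mpirun invocation; with the variable unset it fails, which mirrors why the -a option is needed at submission time.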
Date Modified: August 20, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.