
[ Platform Documentation ] [ Title ] [ Contents ] [ Previous ] [ Next ] [ Index ]



About Platform LSF HPC




What Is Platform LSF HPC?

Platform LSF™ HPC ("LSF HPC") is the distributed workload management solution for maximizing the performance of High Performance Computing (HPC) clusters.

Platform LSF HPC is fully integrated with Platform LSF, the industry standard workload management software product, to provide load sharing in a distributed system and batch scheduling for compute-intensive jobs. Platform LSF HPC provides support for:

Advanced HPC scheduling policies

Platform LSF HPC enhances the job management capability of your cluster through advanced scheduling policies.

LSF daemons

Run on every node to collect resource information such as processor load, memory availability, interconnect states, and other host-specific as well as cluster-wide resources. These agents coordinate to create a single system image of the cluster.

HPC workload scheduler

Supports advanced HPC scheduling policies that match user demand with resource supply.

Job-level runtime resource management

Control sequential and parallel jobs (terminate, suspend, resume, send signals) running on the same host and across hosts. Configure and monitor job-level and system-wide CPU, memory, swap, and other runtime resource usage limits.
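As a sketch, a submission that applies job-level runtime limits might look like the following. The flags (-W run limit, -c CPU time limit, -M per-process memory limit) are standard bsub options; the job name, slot count, and limit values are hypothetical, and memory limit units depend on cluster configuration.

```shell
# Hypothetical example: apply job-level runtime limits at submission time.
# -W : run (wall-clock) limit in minutes
# -c : CPU time limit in minutes
# -M : per-process memory limit (units depend on cluster configuration)
cmd='bsub -n 8 -W 60 -c 120 -M 1048576 ./my_sim'
echo "$cmd"
```

System-wide defaults and ceilings for the same limits can also be configured per queue, so administrators can cap what individual submissions may request.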

Application integration support

Packaged application integrations and tailored HPC configurations make Platform LSF HPC ideal for Industrial Manufacturing, Life Sciences, Government and Research sites using large-scale modeling and simulation parallel applications involving large amounts of data. Platform LSF HPC helps Computer-Aided Engineering (CAE) users reduce the cost of manufacturing, and increase engineer productivity and the quality of results.

Platform LSF HPC is integrated to work out of the box with many HPC applications, such as LSTC LS-Dyna, FLUENT, ANSYS, MSC Nastran, Gaussian, Lion Bioscience SRS, and NCBI BLAST.

Parallel application support

Platform LSF HPC supports jobs using the following parallel job launchers:

POE

The IBM Parallel Operating Environment (POE) interfaces with the Resource Manager to allow users to run parallel jobs requiring dedicated access to the high performance switch.

The LSF HPC integration for IBM High-Performance Switch (HPS) systems provides support for submitting POE jobs from AIX hosts to run on IBM HPS hosts.

OpenMP

Platform LSF HPC provides the ability to start parallel jobs that use OpenMP to communicate between processes on shared-memory machines and MPI to communicate across networked and non-shared memory machines.

PVM

Parallel Virtual Machine (PVM) is a parallel programming system distributed by Oak Ridge National Laboratory. PVM programs are controlled by the PVM hosts file, which contains host names and other information.
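As a sketch (host names and paths here are hypothetical; the option keywords follow the standard PVM hostfile format), a PVM hosts file might look like:

```
# PVM hosts file: one line per host, with optional per-host settings
hostA
hostB ep=/usr/local/pvm3/bin   # ep= : search path for task executables
hostC wd=/scratch/pvmwork      # wd= : working directory for tasks
```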

MPI

The Message Passing Interface (MPI) is a portable library that supports parallel programming. LSF HPC supports several MPI implementations, including MPICH, a joint implementation of MPI by Argonne National Laboratory and Mississippi State University. LSF HPC also supports MPICH-P4, MPICH-GM, LAM/MPI, Intel® MPI, IBM Message Passing Library (MPL) communication protocols, as well as SGI and HP-UX vendor MPI integrations.

blaunch distributed application framework

Most MPI implementations and many distributed applications use rsh and ssh as their task launching mechanism. The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

Similar to the LSF lsrun command, blaunch transparently connects directly to the RES/SBD on the remote host, creates and tracks the remote tasks, and provides the connection back to LSF. There is no need to insert pam or taskstarter into the rsh or ssh calling sequence, or to configure any wrapper scripts.

blaunch supports the same core command line options as rsh and ssh.

All other rsh and ssh options are silently ignored.

Important:


You cannot run blaunch directly from the LSF command line.

blaunch only works within an LSF job; it can only be used to launch tasks on remote hosts that are part of a job allocation. It cannot be used as a standalone command. On success, blaunch exits with 0.
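As an illustrative sketch (the script name and slot count are hypothetical), a job script that uses blaunch to start one task per allocated host might look like:

```shell
# Hypothetical job script: run one instance of 'hostname' on every host
# in the job's allocation via blaunch. This only works inside an LSF job,
# where LSF sets LSB_HOSTS to the list of allocated hosts.
cat > launch_tasks.sh <<'EOF'
#!/bin/sh
blaunch -z "$LSB_HOSTS" hostname
EOF
chmod +x launch_tasks.sh
# Submitted (on a real cluster) as:  bsub -n 4 ./launch_tasks.sh
```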

Windows

blaunch is supported on Windows 2000 or later with the following exceptions:

See blaunch Distributed Application Framework for more information.

PAM

The Parallel Application Manager (PAM) is the point of control for LSF HPC. PAM is fully integrated with LSF HPC. PAM interfaces the user application with LSF. For all parallel application processes (tasks), PAM:

See pam Command Reference for more information about PAM.

Resizable jobs

Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot be resized.

Resource requirements

Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot have compound resource requirements.

Jobs running in HPC system integrations (psets, cpusets, RMS, etc.) cannot have resource requirements with compute unit strings (cu[...]).

When compound resource requirements are used at any level, an esub can create job-level resource requirements, which overwrite most application-level and queue-level resource requirements. -R merge rules are explained in detail in Administering Platform LSF.
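For reference, on clusters where they are supported, the two restricted forms look like the following (job names, slot counts, and values are hypothetical):

```shell
# Hypothetical examples of the restricted syntax; neither form can be
# used with the HPC system integrations (psets, cpusets, RMS, etc.).
compound='bsub -n 9 -R "1*{select[type==any]} + 8*{rusage[mem=512]}" ./myjob'
cu_string='bsub -n 16 -R "cu[type=switch:maxcus=1]" ./myjob'
echo "$compound"
echo "$cu_string"
```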



LSF HPC Components

LSF HPC takes full advantage of the resources of LSF for resource selection and batch job process invocation and control.

User requests

Batch job submission to LSF using the bsub command.

mbatchd

Master Batch Daemon (MBD) is the policy center for LSF. It maintains information about batch jobs, hosts, users, and queues. All of this information is used in scheduling batch jobs to hosts.

LIM

Load Information Manager is a daemon process running on each execution host. LIM monitors the load on its host and exchanges this information with the master LIM.

For batch submission, the master LIM provides this information to mbatchd.

The master LIM resides on one execution host and collects information from the LIMs on all other hosts in the cluster. If the master LIM becomes unavailable, another host will automatically take over.

mpirun.lsf

Reads the environment variable LSF_PJL_TYPE, and generates the appropriate command line to invoke the PJL. The esub programs provided in LSF_SERVERDIR set this variable to the proper type.
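As an illustrative sketch (the mapping below is hypothetical and greatly simplified; the real mpirun.lsf logic is more involved), the dispatch on LSF_PJL_TYPE works along these lines:

```shell
# Simplified, hypothetical sketch of how a launcher wrapper might pick
# a PJL command based on LSF_PJL_TYPE (set by the esub at submission).
pick_pjl() {
    case "$1" in
        lammpi)   echo "mpirun" ;;        # LAM/MPI launcher
        poe)      echo "poe" ;;           # IBM Parallel Operating Environment
        *)        echo "unknown PJL type: $1" >&2; return 1 ;;
    esac
}

LSF_PJL_TYPE=lammpi
pick_pjl "$LSF_PJL_TYPE"
```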

sbatchd

Slave Batch Daemons (SBDs) are batch job execution agents residing on the execution hosts. sbatchd receives jobs from mbatchd in the form of a job specification and starts RES to run the job according to the specification. sbatchd reports the batch job status to mbatchd whenever the job state changes.

blaunch

The blaunch command provides a drop-in replacement for rsh and ssh as a transparent method for launching parallel and distributed applications within LSF.

PAM

Parallel Application Manager is the point of control for LSF HPC. PAM is fully integrated with LSF HPC. PAM interfaces the user application with the LSF HPC system.

RES

Remote Execution Servers reside on each execution host. RES manages all remote tasks and forwards signals, standard I/O, resource consumption data, and parallel job information between PAM and the tasks.

PJL

Parallel Job Launcher is any executable script or binary capable of starting parallel tasks on all hosts assigned for a parallel job (for example, mpirun, poe, prun).

TS

TaskStarter is an executable responsible for starting a task on the local host and reporting the process ID and host name to the PAM. TaskStarter is located in LSF_BINDIR.

Application task

The individual process of a parallel application

First execution host

The host name at the top of the execution host list as determined by LSF. Starts PAM.

Execution hosts

The most suitable hosts to execute the batch job as determined by LSF

esub.pjl_type

LSF HPC provides a generic esub to handle job submission requirements of your applications. Use the -a option of bsub to specify the application you are running through LSF HPC.

For example, to submit a job to LAM/MPI:

bsub -a lammpi bsub_options mpirun.lsf myjob

The method name lammpi uses the esub for LAM/MPI jobs (LSF_SERVERDIR/esub.lammpi), which sets the environment variable LSF_PJL_TYPE=lammpi. The job launcher, mpirun.lsf, reads the environment variable LSF_PJL_TYPE=lammpi and generates the appropriate command line to invoke LAM/MPI as the PJL to start the job.
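The same pattern applies to the other integrations. For example (the job name and slot count are hypothetical), an MPICH-GM submission might look like:

```shell
# Hypothetical submission line for an MPICH-GM job: the esub for the
# mpich_gm method sets LSF_PJL_TYPE=mpich_gm, and mpirun.lsf then
# invokes the matching PJL to start the job.
cmd='bsub -a mpich_gm -n 16 mpirun.lsf ./myjob'
echo "$cmd"
```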





      Date Modified: March 13, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.