This chapter shows how to use LSBLIB to access the services provided by LSF Batch and other LSF products. Because LSF Batch is built on top of LSF Base, LSBLIB relies on services provided by LSLIB. However, you only need to link your program with LSBLIB to use LSBLIB functions, because the LSBLIB header file (lsbatch.h) already includes the LSLIB header file (lsf.h). All other LSF products (such as Platform Parallel and Platform Make) rely on services provided by LSBLIB.
LSF Batch and Platform JobScheduler services are provided by mbatchd, except for the services for processing event and job log files, which do not involve any daemons. LSBLIB is shared by both LSF Batch and Platform JobScheduler. The functions described for LSF Batch in this chapter also apply to other LSF products, unless explicitly indicated otherwise.
- Initializing LSF Batch Applications
- Getting Information about LSF Batch Queues
- Getting Information about LSF Batch Hosts
- Job Submission and Modification
- Getting Information about Batch Jobs
- Job Manipulation
- Processing LSF Batch Log Files
Initializing LSF Batch Applications
lsb_init() function
Before accessing any of the LSF Batch services, an application must initialize LSBLIB by calling lsb_init().
lsb_init() has the following parameter:

    char *appName

On success, lsb_init() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
The parameter appName is the name of the application. Use appName to log detailed messages about the transactions inside LSLIB for debugging purposes. The messages are logged only if LSB_CMD_LOG_MASK is defined as LOG_DEBUG1.
Messages are logged in LSF_LOGDIR/appName. If appName is NULL, the log file is LSF_LOGDIR/bcmd.
Here is an example of code showing the usage of this function:

    /* Include <lsf/lsbatch.h> when using this function */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbsub: lsb_init() failed");
        exit(-1);
    }

The function lsb_perror(char *usrMsg) prints a batch LSF error message on stderr. The user message usrMsg is printed, followed by a colon (:) and the batch error message corresponding to lsberrno.
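The log file naming rule for lsb_init() can be sketched as a small helper. This is only an illustration of the rule stated in this section: demo_log_path() is a hypothetical function, not part of LSBLIB, and the directory passed in stands in for the value of LSF_LOGDIR.

```c
#include <stdio.h>

/* Hypothetical helper, not part of LSBLIB: builds the debug log path
 * LSF_LOGDIR/appName, falling back to LSF_LOGDIR/bcmd when appName is NULL.
 * Returns 0 on success, -1 if the buffer is too small. */
int demo_log_path(char *buf, size_t size, const char *logdir, const char *appName)
{
    /* A NULL application name selects the default file name "bcmd". */
    const char *name = (appName != NULL) ? appName : "bcmd";
    int n = snprintf(buf, size, "%s/%s", logdir, name);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

For example, calling demo_log_path() with a NULL appName and the directory "/tmp/lsf" fills the buffer with "/tmp/lsf/bcmd".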
Getting Information about LSF Batch Queues
LSF Batch queues hold jobs in LSF Batch, subject to scheduling policies and limits on resource usage.
lsb_queueinfo() gets information about the queues in LSF Batch. This includes:
- Queue name
- Parameters
- Statistics
- Status
- Resource limits
- Scheduling policies and parameters
- Users and hosts associated with the queue.
The example program in this section uses lsb_queueinfo() to get the queue information:

    struct queueInfoEnt *lsb_queueinfo(queues, numQueues, hostname, username, options)

lsb_queueinfo() has the following parameters:

    char **queues;     Array containing names of queues of interest
    int *numQueues;    Number of queues
    char *hostname;    Select queues that use hostname as a batch server host
    char *username;    Select queues enabled for user
    int options;       Reserved for future use; supply 0

To get information on all queues, set *numQueues to 0. If *numQueues is 1 and queues is NULL, information on the default system queue is returned.
If hostname is not NULL, then all queues using host hostname as a batch server host are returned. If username is not NULL, then all queues that allow user username to submit jobs are returned.
On success, lsb_queueinfo() returns an array containing a queueInfoEnt structure (see below) for each queue of interest and sets *numQueues to the size of the array. On failure, lsb_queueinfo() returns NULL and sets lsberrno to indicate the error.
The queueInfoEnt structure is defined in lsbatch.h as

    struct queueInfoEnt {
        char *queue;                           Name of the queue
        char *description;                     Description of the queue
        int priority;                          Priority of the queue
        short nice;                            Nice value at which jobs in the queue run
        char *userList;                        Users allowed to submit jobs to the queue
        char *hostList;                        Hosts that can run jobs in the queue
        int nIdx;                              Size of the loadSched and loadStop arrays
        float *loadSched;                      Load thresholds that control scheduling of jobs from the queue
        float *loadStop;                       Load thresholds that control suspension of jobs from the queue
        int userJobLimit;                      Number of unfinished jobs a user can dispatch from the queue
        int procJobLimit;                      Number of unfinished jobs the queue can dispatch to a processor
        char *windows;                         Queue run window
        int rLimits[LSF_RLIM_NLIMITS];         Per-process resource limits for jobs
        char *hostSpec;                        Obsolete. Use defaultHostSpec instead
        int qAttrib;                           Attributes of the queue
        int qStatus;                           Status of the queue
        int maxJobs;                           Job slot limit of the queue
        int numJobs;                           Total number of job slots required by all jobs
        int numPEND;                           Number of job slots needed by pending jobs
        int numRUN;                            Number of job slots used by running jobs
        int numSSUSP;                          Number of job slots used by system suspended jobs
        int numUSUSP;                          Number of job slots used by user suspended jobs
        int mig;                               Queue migration threshold in minutes
        int schedDelay;                        Schedule delay for new jobs
        int acceptIntvl;                       Minimum interval between two jobs dispatched to the same host
        char *windowsD;                        Queue dispatch window
        char *nqsQueues;                       Blank-separated list of NQS queue specifiers
        char *userShares;                      Blank-separated list of user shares
        char *defaultHostSpec;                 Value of DEFAULT_HOST_SPEC for the queue in lsb.queues
        int procLimit;                         Maximum number of job slots a job can take
        char *admins;                          Queue level administrators
        char *preCmd;                          Queue level pre-exec command
        char *postCmd;                         Queue's post-exec command
        char *requeueEValues;                  Queue's requeue exit status
        int hostJobLimit;                      Per host job slot limit
        char *resReq;                          Queue level resource requirement
        int numRESERVE;                        Reserved job slots for pending jobs
        int slotHoldTime;                      Time period for reserving job slots
        char *sndJobsTo;                       Remote queues to forward jobs to
        char *rcvJobsFrom;                     Remote queues which can forward to me
        char *resumeCond;                      Conditions to resume jobs
        char *stopCond;                        Conditions to suspend jobs
        char *jobStarter;                      Queue level job starter
        char *suspendActCmd;                   Action commands for SUSPEND
        char *resumeActCmd;                    Action commands for RESUME
        char *terminateActCmd;                 Action commands for TERMINATE
        int sigMap[LSB_SIG_NUM];               Configurable signal mapping
        char *preemption;                      Preemption policy
        int maxRschedTime;                     Time period for remote cluster to schedule job
        struct shareAcctInfoEnt *shareAccts;   Array of shareAcctInfoEnt
        char *chkpntDir;                       chkpnt directory
        int chkpntPeriod;                      chkpnt period
        int imptJobBklg;                       Number of important jobs kept in the queue
        int defLimits[LSF_RLIM_NLIMITS];       LSF resource limits (soft)
        int chunkJobSize;                      Maximum number of jobs in one chunk
    };

The variable nIdx is the number of load threshold values for job scheduling. This is the total number of load indices returned by LIM. The parameters sndJobsTo, rcvJobsFrom, and maxRschedTime are used with LSF MultiCluster. The variable chunkJobSize must be larger than 1.
For a complete description of the fields in the queueInfoEnt structure, see the lsb_queueinfo() man page.
Include lsbatch.h in every application that uses LSBLIB functions. lsf.h does not have to be explicitly included in your program because lsbatch.h includes lsf.h.
Like the data structures returned by LSLIB functions, the data structures returned by an LSBLIB function are dynamically allocated inside LSBLIB and are automatically freed the next time the same function is called. Do not attempt to free the space allocated by LSBLIB. To keep this information across calls, make your own copy of the data structure.
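Making your own copy means a deep copy: string fields must be duplicated too, because the pointers inside the returned structure refer to memory owned by the library. The sketch below assumes a hypothetical struct demoQueueInfo holding only three of the queueInfoEnt fields; the helper names are illustrative and are not LSBLIB functions.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of struct queueInfoEnt, for illustration only. */
struct demoQueueInfo {
    char *queue;
    char *description;
    int   priority;
};

/* Duplicate a string with malloc(); a NULL input yields NULL. */
static char *demo_strdup(const char *s)
{
    char *copy;

    if (s == NULL)
        return NULL;
    copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Release a copy made by demo_copy_queue(). Never call this on a
 * structure returned by the library itself. */
void demo_free_queue(struct demoQueueInfo *q)
{
    if (q == NULL)
        return;
    free(q->queue);
    free(q->description);
    free(q);
}

/* Deep-copy src so that the copy survives later library calls.
 * Returns NULL if memory runs out. */
struct demoQueueInfo *demo_copy_queue(const struct demoQueueInfo *src)
{
    struct demoQueueInfo *dst = calloc(1, sizeof(*dst));

    if (dst == NULL)
        return NULL;
    dst->priority = src->priority;
    dst->queue = demo_strdup(src->queue);
    dst->description = demo_strdup(src->description);
    if ((src->queue != NULL && dst->queue == NULL) ||
        (src->description != NULL && dst->description == NULL)) {
        demo_free_queue(dst);
        return NULL;
    }
    return dst;
}
```

The copy is owned by the application, so it is released with demo_free_queue() rather than left to the library.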
The program below takes a queue name as the first argument and displays information about the named queue.

    /******************************************************
     * LSBLIB -- Examples
     *
     * simbqueues
     * Display information about a specific queue in the
     * cluster.
     * (The queue name is given as the command line argument)
     * It is similar to the command "bqueues QUEUE_NAME".
     ******************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <lsf/lsbatch.h>

    int main(int argc, char *argv[])
    {
        struct queueInfoEnt *qInfo;
        char *queues;         /* take the command line argument as the queue name */
        int numQueues = 1;    /* only 1 queue name in the array queues */
        char *host = NULL;    /* all queues are of interest */
        char *user = NULL;    /* all queues are of interest */
        int options = 0;

        /* check if input is in the right format: "./simbqueues QUEUENAME" */
        if (argc != 2) {
            printf("Usage: %s queue_name\n", argv[0]);
            exit(-1);
        }
        queues = argv[1];

        /* initialize LSBLIB and get the configuration environment */
        if (lsb_init(argv[0]) < 0) {
            lsb_perror("simbqueues: lsb_init() failed");
            exit(-1);
        }

        /* get queue information about the specified queue */
        qInfo = lsb_queueinfo(&queues, &numQueues, host, user, options);
        if (qInfo == NULL) {
            lsb_perror("simbqueues: lsb_queueinfo() failed");
            exit(-1);
        }

        /* display the queue information (name, description, priority,
           nice value, max num of jobs, num of PEND, RUN, SUSP and TOTAL jobs) */
        printf("Information about %s queue:\n", queues);
        printf("Description: %s\n", qInfo[0].description);
        printf("Priority: %d Nice: %d \n", qInfo[0].priority, qInfo[0].nice);
        printf("Maximum number of job slots:");
        if (qInfo[0].maxJobs < INFINIT_INT)
            printf("%5d\n", qInfo[0].maxJobs);
        else
            printf("%5s\n", "unlimited");
        printf("Job slot statistics: PEND(%d) RUN(%d) SUSP(%d) TOTAL(%d).\n",
               qInfo[0].numPEND, qInfo[0].numRUN,
               qInfo[0].numSSUSP + qInfo[0].numUSUSP, qInfo[0].numJobs);
        exit(0);
    } /* main */

In the above program, INFINIT_INT is defined in lsf.h and is used to indicate that there is no limit set for maxJobs. This applies to all Platform LSF API function calls. Platform LSF supplies INFINIT_INT automatically whenever the value for the variable is either invalid (not available) or infinity. This value should be checked for all variables that are optional. For example, if you display the loadSched/loadStop values, an INFINIT_INT indicates that the threshold is not configured and is ignored.
Similarly, lsb_perror() prints error messages regarding function call failure. You can check lsberrno if you want to take different actions for different errors.
The above program will produce output similar to the following:

    Information about normal queue:
    Description: For normal low priority jobs
    Priority: 25 Nice: 20
    Maximum number of job slots:   40
    Job slot statistics: PEND(5) RUN(12) SUSP(1) TOTAL(18).
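Because the "no limit" sentinel has to be checked for every optional value, it can help to centralize the check in one formatting helper. The following is a minimal sketch: DEMO_INFINIT_INT is a stand-in defined here so the example compiles without lsf.h (a real program would use INFINIT_INT), and demo_format_limit() is a hypothetical name.

```c
#include <limits.h>
#include <stdio.h>

/* Stand-in for INFINIT_INT; the real constant comes from lsf.h. */
#define DEMO_INFINIT_INT INT_MAX

/* Format a limit for display: the number itself, or unlimitedText
 * when the value is the "no limit" sentinel. Returns buf. */
const char *demo_format_limit(char *buf, size_t size, int value,
                              const char *unlimitedText)
{
    if (value < DEMO_INFINIT_INT)
        snprintf(buf, size, "%d", value);
    else
        snprintf(buf, size, "%s", unlimitedText);
    return buf;
}
```

A caller can pass "unlimited" for queue-level limits or "-" for table columns, matching the two display styles used by the example programs in this chapter.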
Getting Information about LSF Batch Hosts
LSF Batch execution hosts execute jobs in the LSF Batch system.
LSBLIB provides lsb_hostinfo() to get information about the server hosts in LSF Batch. This includes configured static and dynamic information. Examples of host information include: host name, status, job limits and statistics, dispatch windows, and scheduling parameters.
The example program in this section uses lsb_hostinfo():

    struct hostInfoEnt *lsb_hostinfo(hosts, numHosts)

lsb_hostinfo() gets information about LSF Batch server hosts. On success, it returns an array of hostInfoEnt structures which hold the host information and sets *numHosts to the size of the array. On failure, lsb_hostinfo() returns NULL and sets lsberrno to indicate the error.
lsb_hostinfo() has the following parameters:

    char **hosts;     Array of names of hosts of interest
    int *numHosts;    Number of names in hosts

To get information on all hosts, set *numHosts to 0. When lsb_hostinfo() returns successfully, *numHosts is set to the actual number of hostInfoEnt structures returned.
If *numHosts is 1 and hosts is NULL, lsb_hostinfo() returns information on the local host.
The hostInfoEnt structure is defined in lsbatch.h as

    struct hostInfoEnt {
        char *host;          Name of the host
        int hStatus;         Status of host (see below)
        int busySched;       Reason host will not schedule jobs
        int busyStop;        Reason host has suspended jobs
        float cpuFactor;     Host CPU factor, as returned by LIM
        int nIdx;            Size of the loadSched and loadStop arrays, as returned from LIM
        float *load;         Load LSF Batch used for scheduling batch jobs
        float *loadSched;    Load thresholds that control scheduling of jobs on host
        float *loadStop;     Load thresholds that control suspension of jobs on host
        char *windows;       Host dispatch window
        int userJobLimit;    Maximum number of jobs a user can run on host
        int maxJobs;         Maximum number of jobs that host can process concurrently
        int numJobs;         Number of jobs running or suspended on host
        int numRUN;          Number of jobs running on host
        int numSSUSP;        Number of jobs suspended by sbatchd on host
        int numUSUSP;        Number of jobs suspended by a user on host
        int mig;             Migration threshold for jobs on host
        int attr;            Host attributes
    #define H_ATTR_CHKPNTABLE  0x1
    #define H_ATTR_CHKPNT_COPY 0x2
        float *realLoad;     Load mbatchd obtained from LIM
        int numRESERVE;      Number of slots reserved for pending jobs
        int chkSig;          Variable is obsolete
    };

There are differences between the host information returned by ls_gethostinfo() and the host information returned by lsb_hostinfo(). ls_gethostinfo() returns general information about the hosts, whereas lsb_hostinfo() returns LSF Batch specific information about hosts.
For a complete description of the fields in the hostInfoEnt structure, see the lsb_hostinfo(3) man page.
The following example takes a host name as an argument and displays information about the named host. It is a simplified version of the LSF Batch bhosts command.

    /******************************************************
     * LSBLIB -- Examples
     *
     * simbhosts
     * Display information about the batch server host with
     * the given name in the cluster.
     ******************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <lsf/lsbatch.h>

    int main(int argc, char *argv[])
    {
        struct hostInfoEnt *hInfo;    /* array holding all host info entries */
        char *hostname;               /* given host name */
        int numHosts = 1;             /* number of hosts of interest */

        /* check if input is in the right format: "./simbhosts HOSTNAME" */
        if (argc != 2) {
            printf("Usage: %s hostname\n", argv[0]);
            exit(-1);
        }
        hostname = argv[1];

        /* initialize LSBLIB and get the configuration environment */
        if (lsb_init(argv[0]) < 0) {
            lsb_perror("simbhosts: lsb_init() failed");
            exit(-1);
        }

        hInfo = lsb_hostinfo(&hostname, &numHosts);    /* get host info */
        if (hInfo == NULL) {
            lsb_perror("simbhosts: lsb_hostinfo() failed");
            exit(-1);
        }

        /* display the host information (name, status, job limit,
           num of RUN/SSUSP/USUSP jobs) */
        printf("HOST_NAME          STATUS   JL/U  NJOBS  RUN SSUSP USUSP\n");
        printf("%-18.18s", hInfo->host);
        if (hInfo->hStatus & HOST_STAT_UNLICENSED)
            printf(" %-9s", "unlicensed");
        else if (hInfo->hStatus & HOST_STAT_UNAVAIL)
            printf(" %-9s", "unavail");
        else if (hInfo->hStatus & HOST_STAT_UNREACH)
            printf(" %-9s", "unreach");
        else if (hInfo->hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND |
                                   HOST_STAT_DISABLED | HOST_STAT_LOCKED |
                                   HOST_STAT_FULL | HOST_STAT_NO_LIM))
            printf(" %-9s", "closed");
        else
            printf(" %-9s", "ok");

        if (hInfo->userJobLimit < INFINIT_INT)
            printf("%4d", hInfo->userJobLimit);
        else
            printf("%4s", "-");
        printf("%7d %4d %4d %4d\n", hInfo->numJobs, hInfo->numRUN,
               hInfo->numSSUSP, hInfo->numUSUSP);
        exit(0);
    } /* main */

The example output from the above program follows:

    % a.out hostB
    HOST_NAME          STATUS   JL/U  NJOBS  RUN SSUSP USUSP
    hostB              ok          -      2    1    1    0

hStatus is the status of the host. It is the bitwise inclusive OR of some of the following constants defined in lsbatch.h:
If none of the above holds, hStatus is set to HOST_STAT_OK to indicate that the host is ready to accept and run jobs.
The constant INFINIT_INT defined in lsf.h is used to indicate that there is no limit set for userJobLimit.
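Since hStatus is a bit mask, several conditions can hold at once, and a display routine has to decide which one wins. The sketch below follows the precedence used by the simbhosts example (unlicensed, then unavail, then unreach, then closed, otherwise ok). The DEMO_STAT_* values are stand-ins chosen for this illustration; a real program must use the HOST_STAT_* constants from lsbatch.h.

```c
/* Stand-in bit values, for illustration only. Real programs must use
 * the HOST_STAT_* constants from lsbatch.h. */
#define DEMO_STAT_OK         0x000
#define DEMO_STAT_BUSY       0x001
#define DEMO_STAT_WIND       0x002
#define DEMO_STAT_DISABLED   0x004
#define DEMO_STAT_LOCKED     0x008
#define DEMO_STAT_FULL       0x010
#define DEMO_STAT_UNREACH    0x020
#define DEMO_STAT_UNAVAIL    0x040
#define DEMO_STAT_UNLICENSED 0x080
#define DEMO_STAT_NO_LIM     0x100

/* Map a status bit mask to a single display word. The most severe
 * condition is tested first, so it takes precedence when several
 * bits are set at the same time. */
const char *demo_host_status(int hStatus)
{
    if (hStatus & DEMO_STAT_UNLICENSED)
        return "unlicensed";
    if (hStatus & DEMO_STAT_UNAVAIL)
        return "unavail";
    if (hStatus & DEMO_STAT_UNREACH)
        return "unreach";
    if (hStatus & (DEMO_STAT_BUSY | DEMO_STAT_WIND | DEMO_STAT_DISABLED |
                   DEMO_STAT_LOCKED | DEMO_STAT_FULL | DEMO_STAT_NO_LIM))
        return "closed";
    return "ok";
}
```

Returning a string instead of printing directly keeps the decoding logic separate from the column layout.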
Job Submission and Modification
Job submission and modification are the most common operations in LSF Batch. A user can submit jobs to the system and then modify them if they have not yet started.
LSBLIB provides lsb_submit() for job submission and lsb_modify() for job modification.

    LS_LONG_INT lsb_submit(jobSubReq, jobSubReply)
    LS_LONG_INT lsb_modify(jobSubReq, jobSubReply, jobId)

On success, these calls return the job ID. On failure, they return -1 and set lsberrno to indicate the error. lsb_submit() is similar to lsb_modify(), except that lsb_modify() modifies the parameters of an already submitted job.
Both of these functions use the same parameters:

    struct submit *jobSubReq;           Job specifications
    struct submitReply *jobSubReply;    Results of job submission
    LS_LONG_INT jobId;                  ID of the job to modify (lsb_modify() only)

The submit structure is defined in lsbatch.h as:

    struct submit {
        int options;                      Indicates which optional fields are present
        int options2;                     Indicates which additional fields are present
        char *jobName;                    Job name (optional)
        char *queue;                      Submit the job to this queue (optional)
        int numAskedHosts;                Size of askedHosts (optional)
        char **askedHosts;                Array of names of candidate hosts (optional)
        char *resReq;                     Resource requirements of the job (optional)
        int rLimits[LSF_RLIM_NLIMITS];    Limits on system resource use by all of the job's processes
        char *hostSpec;                   Host model used for scaling rLimits (optional)
        int numProcessors;                Initial number of processors needed by the job
        char *dependCond;                 Job dependency condition (optional)
        char *timeEvent;                  Time event string for scheduled repetitive jobs (optional)
        time_t beginTime;                 Dispatch the job on or after beginTime
        time_t termTime;                  Job termination deadline
        int sigValue;                     This variable is obsolete
        char *inFile;                     Path name of the job's standard input file (optional)
        char *outFile;                    Path name of the job's standard output file (optional)
        char *errFile;                    Path name of the job's standard error output file (optional)
        char *command;                    Command line of the job
        char *newCommand;                 New command for bmod (optional)
        time_t chkpntPeriod;              Job is checkpointable with this period (optional)
        char *chkpntDir;                  Directory for this job's chk directory (optional)
        int nxf;                          Size of xf (optional)
        struct xFile *xf;                 Array of file transfer specifications (optional)
        char *preExecCmd;                 Job's pre-execution command (optional)
        char *mailUser;                   User e-mail address to which the job's output is mailed (optional)
        int delOptions;                   Bits to be removed from options (lsb_modify() only)
        char *projectName;                Name of the job's project (optional)
        int maxNumProcessors;             Requested maximum number of job slots for the job
        char *loginShell;                 Login shell to be used to re-initialize environment
        char *exceptList;                 Lists the exception handlers
        int userPriority;                 Job priority (optional)
    };

For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.
The submitReply structure is defined in lsbatch.h as

    struct submitReply {
        char *queue;              Queue name the job was submitted to
        LS_LONG_INT badJobId;     dependCond contains badJobId but there is no such job
        char *badJobName;         dependCond contains badJobName but there is no such job
        int badReqIndx;           Index of a host or resource limit that caused an error
    };

The last three variables in the submitReply structure are only used when lsb_submit() or lsb_modify() fails.
For a complete description of the fields in the submitReply structure, see the lsb_submit(3) man page.
To submit a new job, fill out this data structure and then call lsb_submit(). The delOptions variable is ignored by LSF Batch for lsb_submit().
The example job submission program below takes the job command line as an argument and submits the job to LSF Batch. For simplicity, it is assumed that the job command does not have arguments.

    /******************************************************
     * LSBLIB -- Examples
     *
     * simple bsub
     * This program submits a batch job to LSF.
     * It is the equivalent of using the "bsub" command without
     * any options.
     ******************************************************/
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lsf/lsbatch.h>
    #include "combine_arg.h"    /* To use the function "combine_arg" to combine
                                   arguments on the command line, include its
                                   header file "combine_arg.h". */

    int main(int argc, char **argv)
    {
        struct submit req;           /* job specifications */
        struct submitReply reply;    /* results of job submission */
        LS_LONG_INT jobId;           /* job ID of submitted job */
        int i;

        memset(&req, 0, sizeof(req));    /* initializes req */

        /* initialize LSBLIB and get the configuration environment */
        if (lsb_init(argv[0]) < 0) {
            lsb_perror("simbsub: lsb_init() failed");
            exit(-1);
        }

        /* check if input is in the right format: "./simbsub COMMAND ARGUMENTS" */
        if (argc < 2) {
            fprintf(stderr, "Usage: simbsub command\n");
            exit(-1);
        }

        /* options and options2 are bitwise inclusive OR of some of the SUB_* flags */
        req.options = 0;
        req.options2 = 0;

        for (i = 0; i < LSF_RLIM_NLIMITS; i++)    /* resource limits are initialized to default */
            req.rLimits[i] = DEFAULT_RLIMIT;

        req.beginTime = 0;           /* specific date and time to dispatch the job */
        req.termTime = 0;            /* specifies job termination deadline */
        req.numProcessors = 1;       /* initial number of processors needed by a (parallel) job */
        req.maxNumProcessors = 1;    /* max num of processors required to run the (parallel) job */
        req.command = combine_arg(argc, argv);    /* command line of job */

        printf("----------------------------------------------\n");

        jobId = lsb_submit(&req, &reply);    /* submit the job with specifications */
        if (jobId < 0) {                     /* if job submission fails, lsb_submit returns -1 */
            switch (lsberrno) {              /* and sets lsberrno to indicate the error */
            case LSBE_QUEUE_USE:
            case LSBE_QUEUE_CLOSED:
                lsb_perror(reply.queue);
                exit(-1);
            default:
                lsb_perror(NULL);
                exit(-1);
            }
        }
        exit(0);
    } /* main */

The above program will produce output similar to the following:

    Job <5602> is submitted to default queue <default>.

Sample program explanations
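The sample program calls combine_arg(), whose source is not shown in this chapter. A minimal sketch of what such a helper might look like follows. This is a hypothetical implementation written for illustration, under the assumption that the helper simply joins the command line arguments after the program name with single spaces.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical implementation; the original combine_arg.h is not shown.
 * Joins argv[1] .. argv[argc-1] with single spaces into a malloc()ed
 * string. Returns NULL if there are no arguments or memory runs out.
 * The caller owns the returned string. */
char *combine_arg(int argc, char **argv)
{
    size_t len = 0;
    char *cmd;
    int i;

    if (argc < 2)
        return NULL;

    /* Each argument needs its own length plus one byte, used either
     * for the separating space or for the terminating '\0'. */
    for (i = 1; i < argc; i++)
        len += strlen(argv[i]) + 1;

    cmd = malloc(len);
    if (cmd == NULL)
        return NULL;

    cmd[0] = '\0';
    for (i = 1; i < argc; i++) {
        strcat(cmd, argv[i]);
        if (i < argc - 1)
            strcat(cmd, " ");
    }
    return cmd;
}
```

With the arguments "simbsub", "sleep", and "30", this sketch yields the command string "sleep 30". Arguments that contain spaces would need quoting, which is left out here.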

    req.options = 0;
    req.options2 = 0;

The options and options2 fields of the submit structure are the bitwise inclusive OR of some of the SUB_* flags defined in lsbatch.h. These flags serve two purposes.
Some flags indicate which of the optional fields of the submit structure are present. Those that are not present have default values.
Other flags indicate submission options. For a description of these flags, see lsb_submit(3).
Since options indicate which of the optional fields are meaningful, the programmer does not need to initialize the fields that are not chosen by options. All parameters that are not optional must be initialized properly.
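The pairing of an optional field with its flag bit can be sketched as follows. The struct and the DEMO_SUB_* values are stand-ins defined only for this illustration so that it compiles without lsbatch.h; a real program would set the fields of struct submit and use the SUB_* flags from the header.

```c
#include <stddef.h>

/* Stand-in flag values and structure, for illustration only. */
#define DEMO_SUB_JOB_NAME 0x01
#define DEMO_SUB_QUEUE    0x02
#define DEMO_SUB_OUT_FILE 0x04

struct demoSubmit {
    int   options;     /* which optional fields below are meaningful */
    char *jobName;
    char *queue;
    char *outFile;
};

/* Set the optional queue field and raise its flag in one step, so the
 * field and the flag cannot get out of sync. */
void demo_set_queue(struct demoSubmit *req, char *queue)
{
    req->queue = queue;
    req->options |= DEMO_SUB_QUEUE;
}

/* Return the requested queue, or NULL when the flag is not set, in
 * which case the field is ignored and the default queue applies. */
const char *demo_effective_queue(const struct demoSubmit *req)
{
    return (req->options & DEMO_SUB_QUEUE) ? req->queue : NULL;
}
```

Wrapping each optional field in a small setter like this is one way to avoid setting a field while forgetting its flag, which would cause the field to be silently ignored.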

    req.numProcessors = 1;       /* initial number of processors needed by a (parallel) job */
    req.maxNumProcessors = 1;    /* max number of processors required to run the (parallel) job */

numProcessors and maxNumProcessors are initialized to ensure only one processor is requested. They are defined in order to synchronize the job specification in lsb_submit() to the default used by bsub.
If the resReq field of the submit structure is NULL, then LSBLIB will try to obtain resource requirements for a command from the remote task list (see Getting Task Resource Requirements). If the task does not appear in the remote task list, then NULL is passed to LSF Batch. mbatchd uses the default resource requirements with the DFT_FROMTYPE option bit set when making an LSLIB call for host selection from LIM. See Handling Default Resource Requirements for more information about default resource requirements.

    for (i = 0; i < LSF_RLIM_NLIMITS; i++)    /* resource limits are initialized to default */
        req.rLimits[i] = DEFAULT_RLIMIT;

The default resource limit (DEFAULT_RLIMIT), defined in lsf.h, means that no resource limit is set.
The constants used to index the rLimits array of the submit structure are defined in lsf.h. The resource limits currently supported by LSF Batch are listed below.
The hostSpec field of the submit structure specifies the host model to use for scaling rLimits[LSF_RLIMIT_CPU] and rLimits[LSF_RLIMIT_RUN] (see lsb_queueinfo(3)). If hostSpec is NULL, the local host's model is assumed.

    req.beginTime = 0;    /* specific date and time to dispatch the job */
    req.termTime = 0;     /* specifies job termination deadline */

If the beginTime field of the submit structure is 0, the job is started as soon as possible.
A USR2 signal is sent if the job is running at termTime. If the job does not terminate within 10 minutes after being sent this signal, it is killed. If the termTime field of the submit structure is 0, the job is allowed to run until it reaches a resource limit.
The example below checks the value of lsberrno when lsb_submit() fails:

    if (jobId < 0) {             /* if job submission fails, lsb_submit returns -1 */
        switch (lsberrno) {      /* and sets lsberrno to indicate the error */
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }

Different actions are taken depending on the type of the error. All possible error numbers are defined in lsbatch.h. For example, error number LSBE_QUEUE_USE indicates that the user is not authorized to use the queue. The error number LSBE_QUEUE_CLOSED indicates that the queue is closed.
Since a queue name was not specified for the job, the job is submitted to the default queue. The queue field of the submitReply structure contains the name of the queue to which the job was submitted.
The above program will produce output similar to the following:

    Job <5602> is submitted to default queue <default>.

The output from the job is mailed to the user because the program did not specify a file name for the outFile parameter in the submit structure.
The program assumes that uniform user names and user ID spaces exist among all the hosts in the cluster. That is, a job submitted by a given user will run under the same user's account on the execution host. For situations where non-uniform user names and user ID spaces exist, account mapping must be used to determine the account used to run a job.
If you are familiar with the bsub command, it may help to know how the fields in the submit structure relate to the bsub command options. This is provided in the following table.
* indicates a bitwise OR mask for options2.
** indicates -1 means undefined
Even if not all of the options are used, all optional string fields must be initialized to the empty string. For a complete description of the fields in the submit structure, see the lsb_submit(3) man page.
To modify an already submitted job, fill out a new submit structure to override existing parameters, and use delOptions to remove option bits that were previously specified for the job. Modifying a submitted job is like re-submitting the job, so a similar program with minor changes can be used to modify an existing job. One additional parameter that must be specified for job modification is the job ID.
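One way to picture the effect of delOptions is as a bit-mask update: bits named in delOptions are cleared from the job's existing option bits, and bits named in options are then set. The function below is a sketch of that mental model only. The exact combination rule applied by mbatchd is an assumption here and is not taken from lsbatch.h or the man pages.

```c
/* Assumed combination rule, for illustration only: clear the bits in
 * delOptions from the job's current option bits, then add the bits
 * supplied in options. */
int demo_apply_modify(int current, int options, int delOptions)
{
    return (current & ~delOptions) | options;
}
```

For example, with current bits 0x03, clearing 0x01 and adding 0x04 leaves 0x06 under this model.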
All applications that call lsb_submit() and lsb_modify() are subject to the authentication constraints described in Authentication.
Getting Information about Batch Jobs
LSBLIB provides functions to get status information about batch jobs. Since there could be many thousands of jobs in the LSF Batch system, getting all of this information in one message could use a lot of memory space. LSBLIB allows the application to open a stream connection and then read the job records one by one. This ensures that the memory space needed is always the size of one job record.
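The open, read one record at a time, close pattern can be sketched without a cluster by simulating the stream in memory. Everything in the block below is a stand-in written for this illustration: the demo_open(), demo_read(), and demo_close() functions only mimic the shape of a record stream with a "remaining records" counter and are not LSBLIB calls.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in job record and in-memory "connection", for illustration only. */
struct demoJob {
    long        id;
    const char *user;
};

static const struct demoJob *demo_records;   /* simulated result set */
static int demo_total;                       /* records in the stream */
static int demo_next;                        /* index of next record  */

/* Open the stream; returns the number of matching records or -1. */
int demo_open(const struct demoJob *records, int count)
{
    if (count < 0)
        return -1;
    demo_records = records;
    demo_total = count;
    demo_next = 0;
    return count;
}

/* Return the next record and store the number still unread in *more.
 * Returns NULL when no record is left. */
const struct demoJob *demo_read(int *more)
{
    if (demo_next >= demo_total)
        return NULL;
    if (more != NULL)
        *more = demo_total - demo_next - 1;
    return &demo_records[demo_next++];
}

/* Close the stream. */
void demo_close(void)
{
    demo_records = NULL;
    demo_total = 0;
    demo_next = 0;
}

/* Count the records that belong to user, holding only one record at
 * a time. Returns the count, or -1 on error. */
int demo_count_for_user(const struct demoJob *records, int count,
                        const char *user)
{
    int more = 0;
    int matches = 0;
    int total = demo_open(records, count);

    if (total < 0)
        return -1;
    if (total > 0) {
        do {
            const struct demoJob *job = demo_read(&more);
            if (job == NULL) {
                demo_close();
                return -1;
            }
            if (strcmp(job->user, user) == 0)
                matches++;
        } while (more > 0);
    }
    demo_close();
    return matches;
}
```

Checking the count returned by the open call before the first read avoids treating an empty result set as a read error.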
LSF Batch Job ID
The LSF version 4.1 API supports 64-bit batch job IDs. The LSF Batch job ID is stored in a 64-bit integer. It consists of two parts: the base ID and the array index.
The base ID is stored in the lower 32 bits. The array index is stored in the top 32 bits. The top 32 bits are only used when the underlying job is an array job.
For the LSF Version 3.x API, the job ID is stored in a 32-bit integer. The base ID is stored in the lower 20 bits and the array index in the top 12 bits.
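The two layouts described above can be sketched with plain bit operations. The DEMO_* and DEMO3_* macros below are stand-ins written for this illustration so that the block compiles without lsbatch.h; real programs should use the macros that LSBLIB provides.

```c
#include <stdint.h>

/* 64-bit layout: base ID in the lower 32 bits, array index in the
 * top 32 bits. Stand-in macros, for illustration only. */
#define DEMO_JOBID(base, idx) \
    (((uint64_t)(uint32_t)(idx) << 32) | (uint32_t)(base))
#define DEMO_ARRAY_IDX(id)  ((uint32_t)((uint64_t)(id) >> 32))
#define DEMO_BASE_ID(id)    ((uint32_t)((uint64_t)(id) & 0xFFFFFFFFu))

/* 32-bit layout of the 3.x API: base ID in the lower 20 bits, array
 * index in the top 12 bits. */
#define DEMO3_JOBID(base, idx) \
    ((((uint32_t)(idx) & 0xFFFu) << 20) | ((uint32_t)(base) & 0xFFFFFu))
#define DEMO3_ARRAY_IDX(id) (((uint32_t)(id) >> 20) & 0xFFFu)
#define DEMO3_BASE_ID(id)   ((uint32_t)(id) & 0xFFFFFu)
```

For a job that is not an array job, the array index part is 0, so the 64-bit job ID is numerically equal to the base ID.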
LSBLIB provides the following C macros (defined in lsbatch.h) for manipulating job IDs:

    LSB_JOBID(base_ID, array_index)    Yield an LSF Batch job ID
    LSB_ARRAY_IDX(job_ID)              Yield the array index part of the job ID
    LSB_ARRAY_JOBID(job_ID)            Yield the base ID part of the job ID

The function calls used to get job information are:

    int lsb_openjobinfo(jobId, jobName, user, queue, host, options);
    struct jobInfoEnt *lsb_readjobinfo(more);
    void lsb_closejobinfo(void);

These functions are used to open a job information connection with mbatchd, read job records, and then close the job information connection.
lsb_openjobinfo()
lsb_openjobinfo() takes the following arguments:

    LS_LONG_INT jobId;    Select job with the given job ID
    char *jobName;        Select job(s) with the given job name
    char *user;           Select job(s) submitted by the named user or user group
    char *queue;          Select job(s) submitted to the named queue
    char *host;           Select job(s) that are dispatched to the named host
    int options;          Selection flags constructed from the bits defined in lsbatch.h

The options parameter contains additional job selection flags defined in lsbatch.h. These are:
If options is 0, then the default is CUR_JOB.
lsb_openjobinfo() returns the total number of matching job records in the connection. On failure, it returns -1 and sets lsberrno to indicate the error.
lsb_readjobinfo()
lsb_readjobinfo() takes one argument:

    int *more;    If not NULL, contains the remaining number of jobs unread

Either this parameter or the return value from lsb_openjobinfo() can be used to keep track of the number of job records that can be returned from the connection. This parameter is updated each time lsb_readjobinfo() is called.
The jobInfoEnt structure returned by lsb_readjobinfo() is defined in lsbatch.h as:

    struct jobInfoEnt {
        LS_LONG_INT jobId;            job ID
        char *user;                   submission user
        int status;                   job status
    /* possible values for the status field */
    #define JOB_STAT_PEND   0x01      job is pending
    #define JOB_STAT_PSUSP  0x02      job is held
    #define JOB_STAT_RUN    0x04      job is running
    #define JOB_STAT_SSUSP  0x08      job is suspended by LSF Batch system
    #define JOB_STAT_USUSP  0x10      job is suspended by user
    #define JOB_STAT_EXIT   0x20      job exited
    #define JOB_STAT_DONE   0x40      job is completed successfully
    #define JOB_STAT_PDONE  0x80      post job process done successfully
    #define JOB_STAT_PERROR 0x100     post job process error
    #define JOB_STAT_WAIT   0x200     chunk job waiting its execution turn
    #define JOB_STAT_UNKWN  0x1000    unknown status
        int *reasonTb;                pending or suspending reasons
        int numReasons;               length of reasonTb vector
        int reasons;                  reserved for future use
        int subreasons;               reserved for future use
        int jobPid;                   process ID of the job
        time_t submitTime;            time when the job is submitted
        time_t reserveTime;           time when job slots are reserved
        time_t startTime;             time when job is actually started
        time_t predictedStartTime;    job's predicted start time
        time_t endTime;               time when the job finishes
        time_t lastEvent;             last time event
        time_t nextEvent;             next time event
        int duration;                 duration time (minutes)
        float cpuTime;                CPU time consumed by the job
        int umask;                    file mode creation mask for the job
        char *cwd;                    current working directory where job is submitted
        char *subHomeDir;             submitting user's home directory
        char *fromHost;               host from which the job is submitted
        char **exHosts;               host(s) on which the job executes
        int numExHosts;               number of execution hosts
        float cpuFactor;              CPU factor of the first execution host
        int nIdx;                     number of load indices in the loadSched and loadStop vector
        float *loadSched;             stop scheduling new jobs if this threshold is exceeded
        float *loadStop;              stop jobs if this threshold is exceeded
        struct submit submit;         job submission parameters
        int exitStatus;               exit status
        int execUid;                  user ID under which the job is running
        char *execHome;               home directory of the user denoted by execUid
        char *execCwd;                current working directory where job is running
        char *execUsername;           user name corresponding to execUid
        time_t jRusageUpdateTime;     last time job's resource usage is updated
        struct jRusage runRusage;     last updated job's resource usage
        int jType;                    job type
    /* possible values for the jType field */
    #define JGRP_NODE_JOB   1         this structure stores a normal batch job
    #define JGRP_NODE_GROUP 2         this structure stores a job group
    #define JGRP_NODE_ARRAY 3         this structure stores a job array
        char *parentGroup;            for job group use
        char *jName;                  if jType is JGRP_NODE_GROUP, then it is the job group name. Otherwise, it is the job's name
        int counter[NUM_JGRP_COUNTERS];
    /* index into the counter array, only used for job array */
    #define JGRP_COUNT_NJOBS  0       total jobs in the array
    #define JGRP_COUNT_PEND   1       number of pending jobs in the array
    #define JGRP_COUNT_NPSUSP 2       number of held jobs in the array
    #define JGRP_COUNT_NRUN   3       number of running jobs in the array
    #define JGRP_COUNT_NSSUSP 4       number of jobs suspended by the system in the array
    #define JGRP_COUNT_NUSUSP 5       number of jobs suspended by the user in the array
    #define JGRP_COUNT_NEXIT  6       number of exited jobs in the array
    #define JGRP_COUNT_NDONE  7       number of successfully completed jobs
        u_short port;                 service port of the job
        int jobPriority;              job dynamic priority
        int numExternalMsg;           number of external message(s) in the job
        struct jobExternalMsgReply **externalMsg;
    };
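The status field is a bit mask, so more than one JOB_STAT_* bit can be set (for example, a finished job may also carry the post-processing bit). A display routine therefore tests bits rather than comparing for equality. The sketch below copies the bit values from the listing above into DEMO_* stand-ins so that it compiles without lsbatch.h; the function name and the order in which the bits are tested are choices made for this illustration.

```c
/* Bit values copied from the jobInfoEnt listing; stand-in names so the
 * example compiles without lsbatch.h. */
#define DEMO_JOB_STAT_PEND  0x01
#define DEMO_JOB_STAT_PSUSP 0x02
#define DEMO_JOB_STAT_RUN   0x04
#define DEMO_JOB_STAT_SSUSP 0x08
#define DEMO_JOB_STAT_USUSP 0x10
#define DEMO_JOB_STAT_EXIT  0x20
#define DEMO_JOB_STAT_DONE  0x40

/* Map a job status bit mask to a short display word. Finished states
 * are tested first; anything unrecognized is reported as "UNKWN". */
const char *demo_job_status(int status)
{
    if (status & DEMO_JOB_STAT_DONE)
        return "DONE";
    if (status & DEMO_JOB_STAT_EXIT)
        return "EXIT";
    if (status & DEMO_JOB_STAT_RUN)
        return "RUN";
    if (status & DEMO_JOB_STAT_SSUSP)
        return "SSUSP";
    if (status & DEMO_JOB_STAT_USUSP)
        return "USUSP";
    if (status & DEMO_JOB_STAT_PSUSP)
        return "PSUSP";
    if (status & DEMO_JOB_STAT_PEND)
        return "PEND";
    return "UNKWN";
}
```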
jobInfoEnt can store a job array as well as a non-array batch job, depending on the value of the jType field, which can be either JGRP_NODE_JOB or JGRP_NODE_ARRAY.
lsb_closejobinfo()
Call lsb_closejobinfo() after receiving all job records in the connection.
Below is an example of a simplified bjobs command. This program displays all pending jobs belonging to all users.
/******************************************************
 * LSBLIB -- Examples
 *
 * simple bjobs
 * Submit command as an lsbatch job with no options set
 * and retrieve the job info
 * It is similar to the "bjobs" command with no options.
 ******************************************************/
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <string.h>   /* memset() */
#include <time.h>     /* ctime() */
#include <lsf/lsbatch.h>
#include "submit_cmd.h"

int main(int argc, char **argv)
{
    /* variables for simulating submission */
    struct submit req;            /* job specifications */
    struct submitReply reply;     /* results of job submission */
    int jobId;                    /* job ID of submitted job */

    /* variables for simulating bjobs command */
    int options = PEND_JOB;       /* the status of the jobs whose info is returned */
    char *user = "all";           /* match jobs for all users */
    struct jobInfoEnt *job;       /* detailed job info */
    int more;                     /* number of remaining jobs unread */

    memset(&req, 0, sizeof(req)); /* initializes req */

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("simbjobs: lsb_init() failed");
        exit(-1);
    }

    /* check if input is in the right format:
     * "./simbjobs COMMAND ARGUMENTS" */
    if (argc < 2) {
        fprintf(stderr, "Usage: simbjobs command\n");
        exit(-1);
    }

    jobId = submit_cmd(&req, &reply, argc, argv);  /* submit a job */
    if (jobId < 0) {           /* if job submission fails, lsb_submit returns -1 */
        switch (lsberrno) {    /* and sets lsberrno to indicate the error */
        case LSBE_QUEUE_USE:
        case LSBE_QUEUE_CLOSED:
            lsb_perror(reply.queue);
            exit(-1);
        default:
            lsb_perror(NULL);
            exit(-1);
        }
    }

    /* get the total number of pending jobs; exit on failure */
    if (lsb_openjobinfo(0, NULL, user, NULL, NULL, options) < 0) {
        lsb_perror("lsb_openjobinfo");
        exit(-1);
    }

    /* display all pending jobs */
    printf("All pending jobs submitted by all users:\n");
    for (;;) {
        job = lsb_readjobinfo(&more);   /* get the job details */
        if (job == NULL) {
            lsb_perror("lsb_readjobinfo");
            exit(-1);
        }
        printf("%s", ctime(&job->submitTime));           /* submission time of job */
        printf("Job <%s> ", lsb_jobid2str(job->jobId));  /* job ID */
        printf("of user <%s>, ", job->user);             /* user that submits the job */
        printf("submitted from host <%s>\n", job->fromHost);  /* name of submission host */

        /* stop when there are no remaining jobs to display */
        if (!more)
            break;
    }

    /* when finished displaying the job info, close the connection to mbatchd */
    lsb_closejobinfo();
    exit(0);
}
The above program will produce output similar to the following:
All pending jobs submitted by all users:
Mon Mar 1 10:34:04 EST 1996
Job <123> of user <john>, submitted from host <orange>
Mon Mar 1 11:12:11 EST 1996
Job <126> of user <john>, submitted from host <orange>
Mon Mar 1 14:11:34 EST 1996
Job <163> of user <ken>, submitted from host <apple>
Mon Mar 1 15:00:56 EST 1996
Job <199> of user <tim>, submitted from host <pear>
Use lsb_pendreason() to print out the reasons why the job is still pending. See lsb_pendreason(3) for details.
Job Manipulation
After a job has been submitted, users can manipulate it in different ways: it can be suspended, resumed, killed, or sent an arbitrary signal.
All applications that manipulate jobs are subject to authentication provisions described in Authentication.
Sending a signal to a job
Users can send signals to submitted jobs. If the job has not been started, you can send KILL, TERM, INT, and STOP signals. These signals cause the job to be cancelled (KILL, TERM, INT) or suspended (STOP). If the job has already started, then any signal can be sent to the job.
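The rule for jobs that have not started yet can be written down as a small lookup. This is a minimal sketch: the enum and the function pending_signal_effect() are hypothetical names introduced here for illustration and are not part of LSBLIB. Treating every other signal as "not allowed" is an assumption read off the list of signals above.

```c
#include <signal.h>

/* Hypothetical names for this sketch; they are not LSBLIB identifiers. */
enum pending_effect {
    EFFECT_CANCEL,       /* the job is cancelled */
    EFFECT_SUSPEND,      /* the job is suspended */
    EFFECT_NOT_ALLOWED   /* assumed: signal cannot be sent before the job starts */
};

/* Map a signal number to its effect on a job that has not been started. */
enum pending_effect pending_signal_effect(int sig)
{
    switch (sig) {
    case SIGKILL:
    case SIGTERM:
    case SIGINT:
        return EFFECT_CANCEL;
    case SIGSTOP:
        return EFFECT_SUSPEND;
    default:
        return EFFECT_NOT_ALLOWED;
    }
}
```

A front end could use such a check to warn the user before passing the signal on for a pending job.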
lsb_signaljob() sends a signal to a job:
int lsb_signaljob(jobId, sigValue);
LS_LONG_INT jobId;    /* Select job with the given job Id */
int sigValue;         /* Signal sent to the job */
The jobId and sigValue parameters are self-explanatory.
The following example takes a job ID as the argument and sends a SIGSTOP signal to the job.
/******************************************************
 * LSBLIB -- Examples
 *
 * simple bstop
 * The program takes a job ID as the argument and sends a
 * SIGSTOP signal to the job
 ******************************************************/
#include <stdio.h>
#include <lsf/lsbatch.h>
#include <stdlib.h>
#include <signal.h>

int main(int argc, char **argv)
{
    /* check if input is in the right format: "simbstop JOBID" */
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* send the SIGSTOP signal and check if lsb_signaljob() runs successfully */
    if (lsb_signaljob(atoi(argv[1]), SIGSTOP) < 0) {
        lsb_perror("lsb_signaljob");
        exit(-1);
    }

    printf("Job %s is signaled\n", argv[1]);
    exit(0);
}
On success, lsb_signaljob() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
Switching a job to a different queue
A job can be switched to a different queue after submission. This can be done even after the job has already started.
Use lsb_switchjob() to switch a job from one queue to another:
int lsb_switchjob(jobId, queue);
LS_LONG_INT jobId;    /* Select job with the given job Id */
char *queue;          /* Name of the new queue for the job */
Below is an example program that switches a specified job to a new queue.
/******************************************************
 * LSBLIB -- Examples
 *
 * simple bswitch
 * The program switches a specified job to a new queue.
 ******************************************************/
#include <stdio.h>
#include <lsf/lsbatch.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* check if the input is in the right format: "./simbswitch JOBID QUEUENAME" */
    if (argc != 3) {
        printf("Usage: %s jobId new_queue\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* switch the job to the new queue and check for success */
    if (lsb_switchjob(atoi(argv[1]), argv[2]) < 0) {
        lsb_perror("lsb_switchjob");
        exit(-1);
    }

    printf("Job %s is switched to new queue <%s>\n", argv[1], argv[2]);
    exit(0);
}
On success, lsb_switchjob() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
Forcing a job to run
After a job is submitted to the LSF Batch system, it remains pending until LSF Batch runs it (for details on the factors that govern when and where a job starts to run, see Administering Platform LSF).
A job can be forced to run on a specified list of hosts immediately using the following LSBLIB function:
int lsb_runjob(struct runJobRequest *runReq)
lsb_runjob() takes the runJobRequest structure, which is defined in lsbatch.h:
struct runJobRequest {
    LS_LONG_INT jobId;        /* Job ID of the job to start */
    int    numHosts;          /* Number of hosts to run the job on */
    char   **hostname;        /* Host names where jobs run */
#define RUNJOB_OPT_NORMAL     0x01
#define RUNJOB_OPT_NOSTOP     0x02
#define RUNJOB_OPT_PENDONLY   0x04   /* Pending jobs only, no finished jobs */
#define RUNJOB_OPT_FROM_BEGIN 0x08   /* Checkpoint jobs only, from beginning */
#define RUNJOB_OPT_FREE       0x10   /* brun to use free CPUs only */
    int    options;           /* Run job request options */
    int    *slots;            /* Number of slots per host */
};
To force a job to run, the job must have been submitted and be in either PEND or FINISHED state. Only the LSF administrator or the owner of the job can start the job. lsb_runjob() restarts a job in FINISHED status.
A job can be run without any scheduling constraints such as job slot limits. If the job is started with the options field set to 0 or RUNJOB_OPT_NORMAL, the job is still subject to being stopped by load conditions.
To override this for a started job, use RUNJOB_OPT_NOSTOP and the job will not be stopped due to the above-mentioned load conditions. However, all of LSBLIB's job manipulation APIs can still be applied to the job.
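The choice between RUNJOB_OPT_NORMAL and RUNJOB_OPT_NOSTOP can be sketched as two small helpers. Both function names are hypothetical and exist only for this illustration. The option values are copied from the runJobRequest listing so that the sketch compiles without lsbatch.h, and the sketch assumes the options field is tested as a bit mask.

```c
/* Option values copied from the runJobRequest listing in this section;
 * a real program gets them from <lsf/lsbatch.h>. */
#define RUNJOB_OPT_NORMAL 0x01
#define RUNJOB_OPT_NOSTOP 0x02

/* Hypothetical helper: pick the options value for a forced run.
 * Pass nonzero to exempt the job from being stopped by load conditions. */
int forced_run_options(int exempt_from_load_stop)
{
    return exempt_from_load_stop ? RUNJOB_OPT_NOSTOP : RUNJOB_OPT_NORMAL;
}

/* Hypothetical helper: return 1 if a job forced to run with these options
 * can still be stopped by load conditions, 0 otherwise. */
int may_be_stopped_by_load(int options)
{
    return (options & RUNJOB_OPT_NOSTOP) == 0;
}
```

The value returned by forced_run_options() would be assigned to the options field of runJobRequest before calling lsb_runjob().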
The following is an example program that runs a specified job on a host that has no batch job running.
/******************************************************
 * LSBLIB -- Examples
 *
 * simple brun
 * The program takes a job ID as the argument and runs that
 * job on a vacant host
 ******************************************************/
#include <stdio.h>
#include <string.h>   /* memset() */
#include <lsf/lsbatch.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    struct hostInfoEnt *hInfo;       /* host information */
    int numHosts = 0;                /* number of hosts */
    int i;
    struct runJobRequest runJobReq;  /* specification for the job to be run */

    /* check if the input is in the right format: "./simbrun JOBID" */
    if (argc != 2) {
        printf("Usage: %s jobId\n", argv[0]);
        exit(-1);
    }

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* get host information */
    hInfo = lsb_hostinfo(NULL, &numHosts);
    if (hInfo == NULL) {
        lsb_perror("lsb_hostinfo");
        exit(-1);
    }

    /* find a vacant host */
    for (i = 0; i < numHosts; i++) {
        /* skip hosts that are not available to run jobs */
        if (hInfo[i].hStatus & (HOST_STAT_BUSY | HOST_STAT_WIND
                                | HOST_STAT_DISABLED | HOST_STAT_LOCKED
                                | HOST_STAT_FULL | HOST_STAT_NO_LIM
                                | HOST_STAT_UNLICENSED | HOST_STAT_UNAVAIL
                                | HOST_STAT_UNREACH))
            continue;

        /* found a vacant host */
        if (hInfo[i].numJobs == 0)
            break;
    }

    /* return error message when there is no vacant host found */
    if (i == numHosts) {
        fprintf(stderr, "Cannot find vacant host to run job < %s >\n", argv[1]);
        exit(-1);
    }

    /* define the specifications for the job to be run
       (The job can be stopped due to load conditions) */
    memset(&runJobReq, 0, sizeof(runJobReq));  /* clear fields that are not set below */
    runJobReq.jobId = atoi(argv[1]);
    runJobReq.options = 0;
    runJobReq.numHosts = 1;
    runJobReq.hostname = (char **)malloc(sizeof(char *));
    if (runJobReq.hostname == NULL) {
        perror("malloc");
        exit(-1);
    }
    runJobReq.hostname[0] = hInfo[i].host;

    /* run the job and check for success */
    if (lsb_runjob(&runJobReq) < 0) {
        lsb_perror("lsb_runjob");
        exit(-1);
    }

    exit(0);
}
On success, lsb_runjob() returns 0. On failure, it returns -1 and sets lsberrno to indicate the error.
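The examples in this section convert the job ID argument with atoi(), which returns an int and cannot report malformed input, while the functions take an LS_LONG_INT. A more defensive conversion could look like the sketch below. The helper name parse_job_id() is hypothetical, the sketch assumes that LS_LONG_INT can hold a long long, and rejecting non-positive values is a choice made for this illustration.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not an LSBLIB function): parse a decimal job ID
 * from a command-line argument.
 * Returns 0 and stores the value in *jobId on success. Returns -1 if the
 * text is empty, contains trailing characters, is out of range, or is not
 * a positive number.
 */
int parse_job_id(const char *text, long long *jobId)
{
    char *end;
    long long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtoll(text, &end, 10);
    if (errno == ERANGE || *end != '\0' || value <= 0)
        return -1;

    *jobId = value;
    return 0;
}
```

A program would call parse_job_id(argv[1], &id), print a usage message when it returns -1, and otherwise pass id to lsb_signaljob(), lsb_switchjob(), or the jobId field of runJobRequest.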
Processing LSF Batch Log Files
LSF Batch saves a lot of valuable information about the system and jobs. Such information is logged by mbatchd in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your Platform LSF cluster.
mbatchd logs such information for several purposes.
- Some of the events serve as the backup of mbatchd's memory. In case mbatchd crashes, all critical information from the event file can then be used by the newly started mbatchd to restore the current state of LSF Batch.
- The events can be used to produce historical information about the LSF Batch system and user jobs.
- Such information can be used to produce accounting or statistic reports.
The lsb.events file contains critical user job information. Never use your program to modify lsb.events. Writing into this file may cause the loss of user jobs.
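The directory layout described at the start of this section can be assembled into a full path before the file is opened read-only. The following is a minimal sketch: log_file_path() is a hypothetical helper, and the caller is assumed to have already obtained the LSB_SHAREDIR value and the cluster name, for example from its own configuration.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not an LSBLIB function): build the path
 * "<sharedir>/<cluster>/logdir/<file>" into buf.
 * Returns 0 on success, or -1 if the result does not fit in buf.
 */
int log_file_path(char *buf, size_t size, const char *sharedir,
                  const char *cluster, const char *file)
{
    int n = snprintf(buf, size, "%s/%s/logdir/%s", sharedir, cluster, file);

    if (n < 0 || (size_t)n >= size)
        return -1;
    return 0;
}
```

The resulting path would be passed to fopen() with mode "r", so that the program can never write into lsb.events.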
LSBLIB provides a function to read information from these files into a well-defined data structure:
struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE *log_fp;     /* File handle for either an event log file or job log file */
int *lineNum;     /* Line number of the next event record */
The parameter log_fp is returned by a successful fopen() call. On a successful return, the content of lineNum is modified to indicate the line number of the next event record in the log file. This value can then be used to report the line number when an error occurs while reading the log file. It should be initialized to 0 before lsb_geteventrec() is called for the first time.
lsb_geteventrec() returns the following data structure:
struct eventRec {
    char   version[MAX_VERSION_LEN];   /* Version number of the mbatchd */
    int    type;                       /* Type of the event */
    time_t eventTime;                  /* Event time stamp */
    union  eventLog eventLog;          /* Event data */
};
The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
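Because the type field selects which union member is valid, a program should test type before it reads anything from eventLog. The sketch below shows that pattern with reduced stand-in types. Every name prefixed with sk or SK_ is hypothetical, the numeric event values are arbitrary, and the real structures in lsbatch.h carry many more members.

```c
/* Reduced stand-ins for illustration only; they are NOT the definitions
 * from lsbatch.h, and the numeric values are arbitrary. */
enum { SK_EVENT_JOB_NEW = 1, SK_EVENT_JOB_START = 2 };

struct skJobNewLog   { int jobId; const char *jobName; };
struct skJobStartLog { int jobId; int numExHosts; };

struct skEventRec {
    int type;                            /* selects the valid union member */
    union {
        struct skJobNewLog   jobNewLog;      /* valid for SK_EVENT_JOB_NEW */
        struct skJobStartLog jobStartLog;    /* valid for SK_EVENT_JOB_START */
    } eventLog;
};

/* Return the job ID carried by a record, or -1 for unhandled event types. */
int record_job_id(const struct skEventRec *rec)
{
    switch (rec->type) {
    case SK_EVENT_JOB_NEW:
        return rec->eventLog.jobNewLog.jobId;
    case SK_EVENT_JOB_START:
        return rec->eventLog.jobStartLog.jobId;
    default:
        return -1;   /* do not read a member that the type does not select */
    }
}
```

Since the library frees the previous record before returning the next one, any value extracted this way that is needed later should be copied out before the next call.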
lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.
Events are logged by mbatchd for different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed below.
* Available only if the Platform JobScheduler component is enabled.
The lsb.acct file uses only EVENT_JOB_FINISH. The lsb.events file uses all other event types. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).
Each event type corresponds to a different data structure in the union:
union eventLog {
    struct jobNewLog jobNewLog;                   /* EVENT_JOB_NEW */
    struct jobStartLog jobStartLog;               /* EVENT_JOB_START */
    struct jobStatusLog jobStatusLog;             /* EVENT_JOB_STATUS */
    struct jobSwitchLog jobSwitchLog;             /* EVENT_JOB_SWITCH */
    struct jobMoveLog jobMoveLog;                 /* EVENT_JOB_MOVE */
    struct queueCtrlLog queueCtrlLog;             /* EVENT_QUEUE_CTRL */
    struct hostCtrlLog hostCtrlLog;               /* EVENT_HOST_CTRL */
    struct mbdStartLog mbdStartLog;               /* EVENT_MBD_START */
    struct mbdDieLog mbdDieLog;                   /* EVENT_MBD_DIE */
    struct unfulfillLog unfulfillLog;             /* EVENT_MBD_UNFULFILL */
    struct jobFinishLog jobFinishLog;             /* EVENT_JOB_FINISH */
    struct loadIndexLog loadIndexLog;             /* EVENT_LOAD_INDEX */
    struct migLog migLog;                         /* EVENT_MIG */
    struct calendarLog calendarLog;               /* Shared by all calendar events */
    struct jobForceRequestLog jobForceRequestLog; /* EVENT_JOB_FORCE */
    struct jobForwardLog jobForwardLog;           /* EVENT_JOB_FORWARD */
    struct jobAcceptLog jobAcceptLog;             /* EVENT_JOB_ACCEPT */
    struct statusAckLog statusAckLog;             /* EVENT_STATUS_ACK */
    struct signalLog signalLog;                   /* EVENT_JOB_SIGNAL */
    struct jobExecuteLog jobExecuteLog;           /* EVENT_JOB_EXECUTE */
    struct jobRequeueLog jobRequeueLog;           /* EVENT_JOB_REQUEUE */
    struct sigactLog sigactLog;                   /* EVENT_JOB_SIGACT */
    struct jobStartAcceptLog jobStartAcceptLog;   /* EVENT_JOB_START_ACCEPT */
    struct jobMsgLog jobMsgLog;                   /* EVENT_JOB_MSG */
    struct jobMsgAckLog jobMsgAckLog;             /* EVENT_JOB_MSG_ACK */
    struct chkpntLog chkpntLog;                   /* EVENT_CHKPNT */
    struct jobOccupyReqLog jobOccupyReqLog;       /* EVENT_JOB_OCCUPY_REQ */
    struct jobVacatedLog jobVacatedLog;           /* EVENT_JOB_VACATED */
    struct jobCleanLog jobCleanLog;               /* EVENT_JOB_CLEAN */
    struct jobExceptionLog jobExceptionLog;       /* EVENT_JOB_EXCEPTION */
    struct jgrpNewLog jgrpNewLog;                 /* EVENT_JGRP_ADD */
    struct jgrpCtrlLog jgrpCtrlLog;               /* EVENT_JGRP_CTR */
    struct logSwitchLog logSwitchLog;             /* EVENT_LOG_SWITCH */
    struct jobModLog jobModLog;                   /* EVENT_JOB_MODIFY */
    struct jgrpStatusLog jgrpStatusLog;           /* EVENT_JGRP_STATUS */
    struct jobAttrSetLog jobAttrSetLog;           /* EVENT_JOB_ATTR_SET */
    struct jobExternalMsgLog jobExternalMsgLog;   /* EVENT_JOB_EXT_MSG */
    struct jobChunkLog jobChunkLog;               /* EVENT_JOB_CHUNK */
    struct sbdUnreportedStatusLog sbdUnreportedStatusLog;  /* EVENT_SBD_UNREPORTED_STATUS */
};
The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
Below is an example program that takes a job name as its argument and displays a chronological history of all jobs matching the job name. This program assumes that the lsb.events file is in /local/lsf/mnt/work/cluster1/logdir.
/******************************************************
 * LSBLIB -- Examples
 *
 * get event record
 * The program takes a job name as the argument and returns
 * the information of the job with this given name
 ******************************************************/
#include <stdio.h>
#include <stdlib.h>   /* exit() */
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024           /* most job IDs this example keeps track of */

static int matched[MAX_MATCHED];   /* IDs of jobs submitted with the given name */
static int numMatched = 0;

/* return 1 if jobId belongs to a job submitted with the given name */
static int isMatched(int jobId)
{
    int i;

    for (i = 0; i < numMatched; i++)
        if (matched[i] == jobId)
            return 1;
    return 0;
}

int main(int argc, char **argv)
{
    char *eventFile = "/local/lsf/mnt/work/cluster1/logdir/lsb.events"; /* location of lsb.events */
    FILE *fp;                        /* file handle for lsb.events */
    struct eventRec *record;         /* pointer to the return struct of lsb_geteventrec() */
    int lineNum = 0;                 /* line number of next event */
    char *jobName;                   /* specified job name */
    int i;
    struct jobNewLog *newJob;        /* new job event record */
    struct jobStartLog *startJob;    /* start job event record */
    struct jobStatusLog *statusJob;  /* job status change event record */

    /* check if the input is in the right format: "./geteventrec JOBNAME" */
    if (argc != 2) {
        printf("Usage: %s job name\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* open the file for read */
    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    /* get events and print out the information of the event records
       with the given job name in different format */
    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                exit(0);
            lsb_perror("lsb_geteventrec");
            exit(-1);
        }

        /* check the event type first: only the union member that
           matches record->type holds valid data */
        switch (record->type) {
        case EVENT_JOB_NEW:
            newJob = &(record->eventLog.jobNewLog);
            /* find the record with the given job name */
            if (newJob->jobName == NULL || strcmp(newJob->jobName, jobName) != 0)
                continue;
            /* remember the job ID so that later events can be matched */
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%sJob <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   ctime(&record->eventTime), newJob->jobId,
                   newJob->userName, newJob->fromHost, newJob->queue);
            continue;
        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(startJob->jobId))
                continue;
            printf("%sJob <%d> started on ",
                   ctime(&record->eventTime), startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            continue;
        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(statusJob->jobId))
                continue;
            printf("%sJob <%d> status changed to: ",
                   ctime(&record->eventTime), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                continue;
            case JOB_STAT_RUN:
                printf("running\n");
                continue;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                continue;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                continue;
            case JOB_STAT_EXIT:
                printf("exited\n");
                continue;
            case JOB_STAT_DONE:
                printf("done\n");
                continue;
            default:
                printf("\nError: unknown job status %d\n", statusJob->jStatus);
                continue;
            }
        default:
            /* Only display a few selected event types */
            continue;
        }
    }
    exit(0);
}
In the above program, events that are of no interest are skipped. The job status codes are defined in
lsbatch.h. The lsb.acct file stores job accounting information and can be processed similarly. Since there is currently only one event type (EVENT_JOB_FINISH) in lsb.acct, processing is simpler than in the above example.
Date Modified: March 13, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.