LSF batch event files

LSF batch records information about the system and its jobs. mbatchd logs this information in the files lsb.events and lsb.acct under the directory $LSB_SHAREDIR/your_cluster/logdir, where LSB_SHAREDIR is defined in the lsf.conf file and your_cluster is the name of your Platform LSF cluster.
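
As an illustration, the path to either log file can be assembled from those two values. This is a minimal sketch in standard C: the helper name build_log_path is hypothetical and not part of LSBLIB, and the code assumes you have already obtained the value of LSB_SHAREDIR (for example, by reading lsf.conf) and the cluster name.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of LSBLIB): write
 * <sharedir>/<cluster>/logdir/<file> into buf.
 * Returns 0 on success, -1 if an argument is missing or buf is too small.
 */
static int build_log_path(char *buf, size_t size, const char *sharedir,
                          const char *cluster, const char *file)
{
    int n;

    if (buf == NULL || size == 0 || sharedir == NULL ||
        cluster == NULL || file == NULL)
        return -1;

    n = snprintf(buf, size, "%s/%s/logdir/%s", sharedir, cluster, file);
    if (n < 0 || (size_t)n >= size)
        return -1;              /* formatting error or truncated path */
    return 0;
}
```

For example, with LSB_SHAREDIR set to /local/lsf/work and a cluster named cluster1, calling build_log_path(buf, sizeof buf, "/local/lsf/work", "cluster1", "lsb.events") fills buf with /local/lsf/work/cluster1/logdir/lsb.events.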

mbatchd logs such information for several purposes.

  • Some of the events serve as a backup of mbatchd’s memory. If mbatchd crashes, the newly started mbatchd uses the critical information in the event file to restore the current state of LSF batch.

  • The events can be used to produce historical information about the LSF batch system and user jobs.

  • Such information can be used to produce accounting or statistical reports.

CAUTION:

The lsb.events file contains critical user job information. Never use your program to modify lsb.events. Writing into this file may cause the loss of user jobs.

lsb_geteventrec()

LSBLIB provides a function to read information from these files into a well-defined data structure:

struct eventRec *lsb_geteventrec(log_fp, lineNum)
FILE  *log_fp;      File handle for either an event log file or job log file
int   *lineNum;     Line number of the next event record

The log_fp parameter is the file pointer returned by a successful fopen() call. On a successful return, lsb_geteventrec() updates the value that lineNum points to so that it holds the line number of the next event record in the log file. You can use this value to report the line number when an error occurs while reading the log file. Initialize the value to 0 before calling lsb_geteventrec() for the first time.

eventRec Structure

lsb_geteventrec() returns the following data structure:

struct eventRec {
    char version[MAX_VERSION_LEN];    Version number of the mbatchd
    int type;                         Type of the event
    time_t eventTime;                 Event time stamp
    union eventLog eventLog;          Event data
};

The event type is used to determine the structure of the data in eventLog. LSBLIB remembers the storage allocated for the previously returned data structure and automatically frees it before returning the next event record.
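
Because the storage behind a returned record is released on the next call, any field you need later has to be copied while that record is still current. The following is a minimal sketch of the copy step in standard C. The helper name save_field is hypothetical and is not an LSBLIB function.

```c
#include <stdio.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy a string field out of the current event record
 * into caller-owned storage so that it stays valid after the next
 * lsb_geteventrec() call.
 * Returns 0 on success. Returns -1 if src is NULL (dst is set to an empty
 * string) or if the value does not fit (dst holds a truncated,
 * NUL-terminated copy).
 */
static int save_field(char *dst, size_t size, const char *src)
{
    int n;

    if (dst == NULL || size == 0)
        return -1;
    if (src == NULL) {
        dst[0] = '\0';
        return -1;
    }
    n = snprintf(dst, size, "%s", src);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

A typical use would be to call save_field(savedName, sizeof savedName, record->eventLog.jobNewLog.jobName) while handling an EVENT_JOB_NEW record, and to refer to savedName, not to the record, after the next read.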

lsb_geteventrec() returns NULL and sets lsberrno to LSBE_EOF when there are no more records in the event file.

Events are logged by mbatchd for different purposes. There are job-related events and system-related events. Applications can choose to process certain events and ignore other events. For example, the bhist command processes job-related events only. The currently available event types are listed below.

Event Type                      Description

EVENT_JOB_NEW                   Submit new job
EVENT_JOB_START                 mbatchd is trying to start a job
EVENT_JOB_STATUS                Job status change event
EVENT_JOB_SWITCH                Job switched to another queue
EVENT_JOB_MOVE                  Move a pending job’s position within a queue
EVENT_QUEUE_CTRL                Queue status changed by Platform LSF
                                administrator (bqc operation)
EVENT_HOST_CTRL                 Host status changed by Platform LSF
                                administrator (bhc operation)
EVENT_MBD_START                 New mbatchd start event
EVENT_MBD_DIE                   Log parameters before mbatchd dies
EVENT_MBD_UNFULFILL             mbatchd has an action to be fulfilled
EVENT_JOB_FINISH                Job has finished (logged in lsb.acct only)
EVENT_LOAD_INDEX                Complete list of load index names
EVENT_MIG                       Job has migrated
EVENT_PRE_EXEC_START            The pre-execution command started
EVENT_JOB_ROUTE                 The job has been routed to NQS
EVENT_JOB_MODIFY                The job’s parameters have been modified
EVENT_JOB_SIGNAL                Signal/delete a job
EVENT_CAL_NEW                   Add new calendar to the system *
EVENT_CAL_MODIFY                Calendar modified *
EVENT_CAL_DELETE                Calendar deleted *
EVENT_JOB_FORCE                 Forcing a job to start on specified hosts
                                (brun operation)
EVENT_JOB_FORWARD               Job forwarded to another cluster
EVENT_JOB_ACCEPT                Job from a remote cluster dispatched
EVENT_STATUS_ACK                Job status successfully sent to submission
                                cluster
EVENT_JOB_EXECUTE               Job started successfully on the execution host
EVENT_JOB_MSG                   Send a message to a job
EVENT_JOB_MSG_ACK               The message has been delivered.
EVENT_JOB_REQUEUE               Job is requeued
EVENT_JOB_OCCUPY_REQ            Submission mbatchd logs this after sending an
                                occupy request to execution mbatchd
EVENT_JOB_VACATED               Submission mbatchd logs this event after all
                                execution mbatchds have vacated the occupied
                                hosts for the job.
EVENT_JOB_SIGACT                A signal action on a job has been initiated
                                or finished
EVENT_JOB_START_ACCEPT          Job accepted by sbatchd
EVENT_SBD_JOB_STATUS            sbatchd’s new job status
EVENT_CAL_UNDELETE              Undeleted a calendar in the system
EVENT_JOB_CLEAN                 Job is cleaned out of the core
EVENT_JOB_EXCEPTION             Job exception was detected
EVENT_JGRP_ADD                  Adding a new job group
EVENT_JGRP_MOD                  Modifying a job group
EVENT_JGRP_CNT                  Controlling a job group
EVENT_LOG_SWITCH                Switching the event file lsb.events
EVENT_JOB_MODIFY2               Job modification request
EVENT_JGRP_STATUS               Log job group status
EVENT_JOB_ATTR_SET              Job attributes have been set
EVENT_JOB_EXT_MSG               Send an external message to a job
EVENT_JOB_ATTA_DATA             Update data status of a message for a job
EVENT_JOB_CHUNK                 Insert one job to a chunk
EVENT_SBD_UNREPORTED_STATUS     Save unreported sbatchd status
EVENT_ADRSV_FINISH              An advanced reservation expired.
EVENT_HGHOST_CTRL               Dynamic host group control changes.
EVENT_CPUPROFILE_STATUS         Save current CPU allocation on service
                                partition.
EVENT_DATA_LOGGING              Write a data logging file.
EVENT_JOB_RUN_RUSAGE            Write job rusage to lsb.stream.
EVENT_END_OF_STREAM             Close stream and open new stream.
EVENT_SLA_RECOMPUTE             Re-evaluate SLA goal.
EVENT_METRIC_LOG                Write performance metrics to lsb.stream.
EVENT_TASK_FINISH               Write a task finish log to ssched.acct.
EVENT_JOB_RESIZE_NOTIFY_START   Job resize allocation made.
EVENT_JOB_RESIZE_NOTIFY_ACCEPT  Job resize notification action initialized.
EVENT_JOB_RESIZE_NOTIFY_DONE    Job resize notification action completed.
EVENT_JOB_RESIZE_RELEASE        Job resize release request received.
EVENT_JOB_RESIZE_CANCEL         Job resize cancel request received.
EVENT_JOB_RESIZE                Job resize event for lsb.acct.

* Available only if the Platform JobScheduler component is enabled.

Tip:

The lsb.acct file uses only EVENT_JOB_FINISH. The lsb.events file uses all other event types. For detailed formats of these log files, see lsb.events(5) and lsb.acct(5).

eventLog Union

Each event type corresponds to a different data structure in the union:

union eventLog {
    struct jobNewLog jobNewLog;                            EVENT_JOB_NEW
    struct jobStartLog jobStartLog;                        EVENT_JOB_START
    struct jobStatusLog jobStatusLog;                      EVENT_JOB_STATUS
    struct jobSwitchLog jobSwitchLog;                      EVENT_JOB_SWITCH
    struct jobMoveLog jobMoveLog;                          EVENT_JOB_MOVE
    struct queueCtrlLog queueCtrlLog;                      EVENT_QUEUE_CTRL
    struct hostCtrlLog hostCtrlLog;                        EVENT_HOST_CTRL
    struct mbdStartLog mbdStartLog;                        EVENT_MBD_START
    struct mbdDieLog mbdDieLog;                            EVENT_MBD_DIE
    struct unfulfillLog unfulfillLog;                      EVENT_MBD_UNFULFILL
    struct jobFinishLog jobFinishLog;                      EVENT_JOB_FINISH
    struct loadIndexLog loadIndexLog;                      EVENT_LOAD_INDEX
    struct migLog migLog;                                  EVENT_MIG
    struct calendarLog calendarLog;                        Shared by all calendar events
    struct jobForceRequestLog jobForceRequestLog;          EVENT_JOB_FORCE
    struct jobForwardLog jobForwardLog;                    EVENT_JOB_FORWARD
    struct jobAcceptLog jobAcceptLog;                      EVENT_JOB_ACCEPT
    struct statusAckLog statusAckLog;                      EVENT_STATUS_ACK
    struct signalLog signalLog;                            EVENT_JOB_SIGNAL
    struct jobExecuteLog jobExecuteLog;                    EVENT_JOB_EXECUTE
    struct jobRequeueLog jobRequeueLog;                    EVENT_JOB_REQUEUE
    struct sigactLog sigactLog;                            EVENT_JOB_SIGACT
    struct jobStartAcceptLog jobStartAcceptLog;            EVENT_JOB_START_ACCEPT
    struct jobMsgLog jobMsgLog;                            EVENT_JOB_MSG
    struct jobMsgAckLog jobMsgAckLog;                      EVENT_JOB_MSG_ACK
    struct chkpntLog chkpntLog;                            EVENT_CHKPNT
    struct jobOccupyReqLog jobOccupyReqLog;                EVENT_JOB_OCCUPY_REQ
    struct jobVacatedLog jobVacatedLog;                    EVENT_JOB_VACATED
    struct jobCleanLog jobCleanLog;                        EVENT_JOB_CLEAN
    struct jobExceptionLog jobExceptionLog;                EVENT_JOB_EXCEPTION
    struct jgrpNewLog jgrpNewLog;                          EVENT_JGRP_ADD
    struct jgrpCtrlLog jgrpCtrlLog;                        EVENT_JGRP_CTR
    struct logSwitchLog logSwitchLog;                      EVENT_LOG_SWITCH
    struct jobModLog jobModLog;                            EVENT_JOB_MODIFY
    struct jgrpStatusLog jgrpStatusLog;                    EVENT_JGRP_STATUS
    struct jobAttrSetLog jobAttrSetLog;                    EVENT_JOB_ATTR_SET
    struct jobExternalMsgLog jobExternalMsgLog;            EVENT_JOB_EXT_MSG
    struct jobChunkLog jobChunkLog;                        EVENT_JOB_CHUNK
    struct sbdUnreportedStatusLog sbdUnreportedStatusLog;  EVENT_SBD_UNREPORTED_STATUS
    struct rsvFinishLog rsvFinishLog;
    struct hgCtrlLog hgCtrlLog;
    struct cpuProfileLog cpuProfileLog;
    struct dataLoggingLog dataLoggingLog;
    struct jobRunRusageLog jobRunRusageLog;
    struct eventEOSLog eventEOSLog;
    struct slaLog slaLog;
    struct perfmonLog perfmonLog;
    struct taskFinishLog taskFinishLog;
    struct jobResizeNotifyStartLog jobResizeNotifyStartLog;
    struct jobResizeNotifyAcceptLog jobResizeNotifyAcceptLog;
    struct jobResizeNotifyDoneLog jobResizeNotifyDoneLog;
    struct jobResizeReleaseLog jobResizeReleaseLog;
    struct jobResizeCancelLog jobResizeCancelLog;
    struct jobResizeLog jobResizeLog;
};

The detailed data structures in the above union are defined in lsbatch.h and described in lsb_geteventrec(3).
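
The relationship between the event type and the union member can be illustrated with a small tagged-union sketch. The types and constants below (demoRec, demoLog, DEMO_JOB_NEW, and so on) are simplified stand-ins invented for this illustration. They are not the definitions from lsbatch.h, and they only mirror the pattern of reading the type first and then accessing the matching member.

```c
#include <stdio.h>

/* Simplified stand-ins for illustration only, NOT the lsbatch.h types */
enum { DEMO_JOB_NEW = 1, DEMO_JOB_START = 2 };

struct demoNewLog   { int jobId; char jobName[32]; };
struct demoStartLog { int jobId; int numExHosts; };

union demoLog {
    struct demoNewLog   newLog;     /* valid when type is DEMO_JOB_NEW */
    struct demoStartLog startLog;   /* valid when type is DEMO_JOB_START */
};

struct demoRec {
    int type;                       /* selects the valid union member */
    union demoLog log;
};

/*
 * Return the job ID carried by a record, or -1 for event types that this
 * sketch does not handle. Only the member that matches rec->type is read.
 */
static int demo_job_id(const struct demoRec *rec)
{
    switch (rec->type) {
    case DEMO_JOB_NEW:
        return rec->log.newLog.jobId;
    case DEMO_JOB_START:
        return rec->log.startLog.jobId;
    default:
        return -1;
    }
}
```

Reading a member that does not match the type, for example treating every record as a new-job record, interprets the stored bytes with the wrong layout, so code that processes real event records should always switch on the type field first.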

Example

The following example program takes a job name as its argument and displays a chronological history of all jobs with that name. The program assumes that the lsb.events file is in /local/lsf/work/cluster1/logdir.

/******************************************************
 * LSBLIB -- Examples
 *
 * get event record
 * The program takes a job name as the argument and returns
 * the information of the jobs with this given name
 ******************************************************/

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <lsf/lsbatch.h>

#define MAX_MATCHED 1024    /* most matching jobs this example tracks */

/* return 1 if jobId was recorded as a job with the requested name */
static int isMatched(const int *ids, int numIds, int jobId)
{
    int i;

    for (i = 0; i < numIds; i++)
        if (ids[i] == jobId)
            return 1;
    return 0;
}

int main(int argc, char **argv)
{
    /* location of lsb.events */
    char *eventFile = "/local/lsf/work/cluster1/logdir/lsb.events";
    FILE *fp;                       /* file handle for lsb.events */
    struct eventRec *record;        /* record returned by lsb_geteventrec() */
    int  lineNum = 0;               /* line number of next event */
    char *jobName;                  /* specified job name */
    int  matched[MAX_MATCHED];      /* IDs of jobs with the given name */
    int  numMatched = 0;
    int  i;
    struct jobNewLog *newJob;       /* new job event record */
    struct jobStartLog *startJob;   /* start job event record */
    struct jobStatusLog *statusJob; /* job status change event record */

    /* check the command line: "./geteventrec JOBNAME" */
    if (argc != 2) {
        printf("Usage: %s job_name\n", argv[0]);
        exit(-1);
    }
    jobName = argv[1];

    /* initialize LSBLIB and get the configuration environment */
    if (lsb_init(argv[0]) < 0) {
        lsb_perror("lsb_init");
        exit(-1);
    }

    /* open the file for reading only; never write to lsb.events */
    fp = fopen(eventFile, "r");
    if (fp == NULL) {
        perror(eventFile);
        exit(-1);
    }

    /* read events in order and print those that belong to matching jobs */
    for (;;) {
        record = lsb_geteventrec(fp, &lineNum);
        if (record == NULL) {
            if (lsberrno == LSBE_EOF)
                break;              /* no more records */
            lsb_perror("lsb_geteventrec");
            fprintf(stderr, "error near line %d of %s\n", lineNum, eventFile);
            exit(-1);
        }

        /* the event type selects the valid member of the eventLog union */
        switch (record->type) {
        case EVENT_JOB_NEW:
            /* the job name is read from the new job record; remember the
               job ID so that later events for the same job are recognized */
            newJob = &(record->eventLog.jobNewLog);
            if (newJob->jobName == NULL ||
                strcmp(newJob->jobName, jobName) != 0)
                break;
            if (numMatched < MAX_MATCHED)
                matched[numMatched++] = newJob->jobId;
            printf("%sJob <%d> submitted by <%s> from <%s> to <%s> queue\n",
                   ctime(&record->eventTime), newJob->jobId,
                   newJob->userName, newJob->fromHost, newJob->queue);
            break;
        case EVENT_JOB_START:
            startJob = &(record->eventLog.jobStartLog);
            if (!isMatched(matched, numMatched, startJob->jobId))
                break;
            printf("%sJob <%d> started on ", ctime(&record->eventTime),
                   startJob->jobId);
            for (i = 0; i < startJob->numExHosts; i++)
                printf("<%s> ", startJob->execHosts[i]);
            printf("\n");
            break;
        case EVENT_JOB_STATUS:
            statusJob = &(record->eventLog.jobStatusLog);
            if (!isMatched(matched, numMatched, statusJob->jobId))
                break;
            printf("%sJob <%d> status changed to: ",
                   ctime(&record->eventTime), statusJob->jobId);
            switch (statusJob->jStatus) {
            case JOB_STAT_PEND:
                printf("pending\n");
                break;
            case JOB_STAT_RUN:
                printf("running\n");
                break;
            case JOB_STAT_SSUSP:
            case JOB_STAT_USUSP:
            case JOB_STAT_PSUSP:
                printf("suspended\n");
                break;
            case JOB_STAT_UNKWN:
                printf("unknown (sbatchd unreachable)\n");
                break;
            case JOB_STAT_EXIT:
                printf("exited\n");
                break;
            case JOB_STAT_DONE:
                printf("done\n");
                break;
            default:
                printf("\nError: unknown job status %d\n",
                       statusJob->jStatus);
                break;
            }
            break;
        default:
            /* only a few selected event types are displayed */
            break;
        }
    }
    fclose(fp);
    exit(0);
}

Tip:

The above program skips events that are of no interest. The job status codes are defined in lsbatch.h. The lsb.acct file, which stores job accounting information, can be processed in a similar way. Because lsb.acct currently contains only one event type (EVENT_JOB_FINISH), processing it is simpler than in the above example.