Error and Event Logging
Contents
- System Directories and Log Files
- Managing Error Logs
- System Event Log
- Duplicate Logging of Event Logs
- LSF Job Termination Reason Logging
- Understanding LSF job exit codes
System Directories and Log Files
LSF uses directories for temporary work files, log files, transaction files, and spooling.
LSF keeps track of all jobs in the system by maintaining a transaction log in the work subtree. The LSF log files are found in the directory LSB_SHAREDIR/cluster_name/logdir. The following files maintain the state of the LSF system:
lsb.events
LSF uses the lsb.events file to keep track of the state of all jobs. Each job is a transaction from job submission to job completion. The LSF system keeps track of everything associated with the job in the lsb.events file.
lsb.events.n
The events file is automatically trimmed and old job events are stored in lsb.events.n files. When mbatchd starts, it refers only to the lsb.events file, not the lsb.events.n files. The bhist command can refer to these files.
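For example, to make bhist search the archived event files as well as the current lsb.events file (a usage sketch; the job ID 1234 is illustrative):
# Search all event log files (lsb.events and every lsb.events.n):
bhist -n 0 -l 1234
# Search only the two most recent event log files:
bhist -n 2 -l 1234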
Job script files in the info directory
When a user issues a bsub command from a shell prompt, LSF collects all of the commands issued on the bsub line and spools the data to mbatchd, which saves the bsub command script in the info directory (or in one of its subdirectories if MAX_INFO_DIRS is defined in lsb.params) for use at dispatch time or if the job is rerun. The info directory is managed by LSF and should not be modified by anyone.
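For example, to spread the job script files across subdirectories of the info directory, you might set the following in lsb.params (a sketch; the value 500 is illustrative) and then run badmin reconfig:
Begin Parameters
MAX_INFO_DIRS = 500
End Parameters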
Log directory permissions and ownership
Ensure that the LSF_LOGDIR directory is writable by root. The LSF administrator must own LSF_LOGDIR; a minimal sketch follows.
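One way to set this up, assuming LSF_LOGDIR is /var/log/lsf and the administrator account is lsfadmin (both names are illustrative); run as root:
chown lsfadmin /var/log/lsf
chmod 755 /var/log/lsf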
Log levels and descriptions
For the log levels that the LSF daemons use, see the description of LSF_LOG_MASK under Managing Error Logs, below.
Support for UNICOS accounting
In Cray UNICOS environments, LSF writes to the Network Queuing System (NQS) accounting data file, nqacct, on the execution host. This lets you track LSF jobs and other jobs together, through NQS.
Support for IRIX Comprehensive System Accounting (CSA)
The IRIX 6.5.9 Comprehensive System Accounting facility (CSA) writes an accounting record for each process in the pacct file, which is usually located in the /var/adm/acct/day directory. IRIX system administrators then use the csabuild command to organize and present the records on a job-by-job basis.
The LSF_ENABLE_CSA parameter in lsf.conf enables LSF to write job events to the pacct file for processing through CSA. For LSF job accounting, records are written to pacct at the start and end of each LSF job.
See the Platform LSF Configuration Reference for more information about the LSF_ENABLE_CSA parameter. See the IRIX 6.5.9 resource administration documentation for information about CSA.
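For example, you might enable CSA job accounting with the following line in lsf.conf (the value Y is an assumption here; check the Platform LSF Configuration Reference for the exact syntax) and then reconfigure the cluster:
LSF_ENABLE_CSA=Y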
Managing Error Logs
Error logs maintain important information about LSF operations. When you see any abnormal behavior in LSF, you should first check the appropriate error logs to find out the cause of the problem.
LSF log files grow over time. These files should occasionally be cleared, either by hand or using automatic scripts.
Daemon error logs
LSF log files are reopened each time a message is logged, so if you rename or remove a daemon log file, the daemons will automatically create a new log file.
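This makes simple rotation safe. A minimal sketch, assuming the logs live in /var/log/lsf on a host named hostA (both names are illustrative):
# Rename the sbatchd log; the daemon recreates it the next time it logs a message
mv /var/log/lsf/sbatchd.log.hostA /var/log/lsf/sbatchd.log.hostA.old
gzip /var/log/lsf/sbatchd.log.hostA.old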
The LSF daemons log messages when they detect problems or unusual situations.
The daemons can be configured to put these messages into files.
The error log file names for the LSF system daemons are:
- res.log.host_name
- sbatchd.log.host_name
- mbatchd.log.host_name
- mbschd.log.host_name
LSF daemons log error messages in different levels so that you can choose to log all messages, or only log messages that are deemed critical. Message logging for LSF daemons (except LIM) is controlled by the parameter LSF_LOG_MASK in lsf.conf. Possible values for this parameter can be any log priority symbol that is defined in /usr/include/sys/syslog.h. The default value for LSF_LOG_MASK is LOG_WARNING.
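For example, to log only errors and more severe messages, you might set the following in lsf.conf (and, for LIM in LSF Version 7, the equivalent EGO_LOG_MASK in ego.conf, as noted below):
LSF_LOG_MASK=LOG_ERR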
important:
LSF_LOG_MASK in lsf.conf no longer specifies LIM logging level in LSF Version 7. For LIM, you must use EGO_LOG_MASK in ego.conf to control message logging for LIM. The default value for EGO_LOG_MASK is LOG_WARNING.
Set the log files owner
Prerequisites: You must be the cluster administrator.
You can set the log files owner for the LSF daemons (not including mbschd). The default owner is the LSF administrator.
restriction:
Applies to UNIX hosts only.
restriction:
This change only takes effect for daemons that are running as root.
- Edit lsf.conf and add the parameter LSF_LOGFILE_OWNER.
- Specify a user account name to set the owner of the log files.
- Shut down the LSF daemon or daemons you want to set the log file owner for: run lsfshutdown on the host.
- Delete or move any existing log files.
important:
If you do not clear out the existing log files, the file ownership does not change.
- Restart the LSF daemons that you shut down: run lsfstartup on the host. A hedged end-to-end sketch of this procedure follows.
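A sketch of the whole procedure (the account name lsfuser, log directory, and host name are illustrative):
# 1. In lsf.conf, add: LSF_LOGFILE_OWNER=lsfuser
# 2. Shut down the daemons on the host:
lsfshutdown
# 3. Clear out the existing log files so the ownership can change:
mv /var/log/lsf/sbatchd.log.hostA /var/log/lsf/sbatchd.log.hostA.old
# 4. Restart the daemons:
lsfstartup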
View the number of file descriptors remaining
Prerequisites: The performance monitoring (perfmon) metrics must be enabled, or you must set LC_PERFM to debug.
The mbatchd daemon can open a large number of log files in a short period when you submit a large number of jobs to LSF. You can view the remaining file descriptors at any time.
restriction:
Applies to UNIX hosts only.
- Run badmin perfmon view. The free, used, and total numbers of file descriptors are displayed.
On 64-bit AIX 5 hosts, if the file descriptor limit has never been changed, the maximum value displayed is 9223372036854775797.
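For example, to enable the performance metrics and then check the file descriptor counters, you can run:
badmin perfmon start
badmin perfmon view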
Error logging
If the optional LSF_LOGDIR parameter is defined in lsf.conf, error messages from LSF servers are logged to files in this directory.
If LSF_LOGDIR is defined, but the daemons cannot write to files there, the error log files are created in /tmp.
If LSF_LOGDIR is not defined, errors are logged to the system error logs (syslog) using the LOG_DAEMON facility. syslog messages are highly configurable, and the default configuration varies widely from system to system. Start by looking for the file /etc/syslog.conf, and read the man pages for syslog(3) and syslogd(1).
If the error log is managed by syslog, it is probably already being automatically cleared.
If LSF daemons cannot find lsf.conf when they start, they will not find the definition of LSF_LOGDIR. In this case, error messages go to syslog. If you cannot find any error messages in the log files, they are likely in the syslog.
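For example, in lsf.conf (the directory path is illustrative):
LSF_LOGDIR=/var/log/lsf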
System Event Log
The LSF daemons keep an event log in the lsb.events file. The mbatchd daemon uses this information to recover from server failures, host reboots, and mbatchd restarts. The lsb.events file is also used by the bhist command to display detailed information about the execution history of batch jobs, and by the badmin command to display the operational history of hosts, queues, and daemons.
By default, mbatchd automatically backs up and rewrites the lsb.events file after every 1000 batch job completions. This value is controlled by the MAX_JOB_NUM parameter in the lsb.params file. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1. LSF never deletes these files. If disk storage is a concern, the LSF administrator should arrange to archive or remove old lsb.events.n files periodically.
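For example, to switch the event log after every 2000 job completions instead of 1000, you might set the following in lsb.params (the value is illustrative) and run badmin reconfig:
Begin Parameters
MAX_JOB_NUM = 2000
End Parameters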
caution:
Do not remove or modify the current lsb.events file. Removing or modifying the lsb.events file could cause batch jobs to be lost.
Duplicate Logging of Event Logs
To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs to use as a backup.
If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.
How duplicate logging works
By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.
When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:
- Creates backup copies of lsb.events
- Reduces the load on the central file server
- Increases the load on the LSF master host
Failure of file server
If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which directly read LSB_SHAREDIR, will not work.
When the file server recovers, the current log files are replicated to LSB_SHAREDIR.
.Failure of first master host
If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. Then, a new master host is selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. There is no duplication by the second or any subsequent LSF master hosts.
When the first master host becomes available after a failure, it will update the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continue operations as before.
If the first master host does not recover, LSF will continue to use the files in LSB_SHAREDIR, but there is no more duplication of the log files.
Simultaneous failure of both hosts
If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
Network partitioning
We assume that network partitioning does not cause a cluster to split into two independent clusters, each simultaneously running mbatchd.
This may happen given certain network topologies and failure modes. For example, connectivity is lost between the first master, M1, and both the file server and the secondary master, M2. Both M1 and M2 will run the mbatchd service, with M1 logging events to LSB_LOCALDIR and M2 logging to LSB_SHAREDIR. When connectivity is restored, the changes made by M2 to LSB_SHAREDIR will be lost when M1 updates LSB_SHAREDIR from its copy in LSB_LOCALDIR.
The archived event files are only available on LSB_LOCALDIR, so in the case of network partitioning, commands such as bhist cannot access these files. As a precaution, you should periodically copy the archived files from LSB_LOCALDIR to LSB_SHAREDIR; a hedged sketch of one way to automate this follows.
Setting an event update interval
If NFS traffic is too high and you want to reduce network traffic, use EVENT_UPDATE_INTERVAL in lsb.params to specify how often to back up the data and synchronize the LSB_SHAREDIR and LSB_LOCALDIR directories.
The directories are always synchronized when data is logged to the files, or when mbatchd is started on the first LSF master host.
Automatic archiving and duplicate logging
Event logs
Archived event logs, lsb.events.n, are not replicated to LSB_SHAREDIR. If LSF starts a new event log while the file server containing LSB_SHAREDIR is down, you might notice a gap in the historical data in LSB_SHAREDIR.
Configure duplicate logging
To enable duplicate logging, set LSB_LOCALDIR in lsf.conf to a directory on the first master host (the first host configured in lsf.cluster.cluster_name) that will be used to store the primary copies of lsb.events. This directory should only exist on the first master host.
- Edit lsf.conf and set LSB_LOCALDIR to a local directory that exists only on the first master host.
- Use the commands lsadmin reconfig and badmin mbdrestart to make the changes take effect. A configuration sketch follows.
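A hedged configuration sketch (the directory path is illustrative); EVENT_UPDATE_INTERVAL is optional and described above:
# In lsf.conf on the cluster:
LSB_LOCALDIR=/local/lsf/work
# Optionally, in lsb.params, synchronize every 300 seconds:
# EVENT_UPDATE_INTERVAL = 300
# Then make the changes take effect:
lsadmin reconfig
badmin mbdrestart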
LSF Job Termination Reason Logging
When a job finishes, LSF reports the last job termination action it took against the job and logs it into lsb.acct.
If a running job exits because of node failure, LSF sets the correct exit information in lsb.acct, lsb.events, and the job output file. Jobs terminated by a signal from LSF, the operating system, or an application have the signal logged as the LSF exit code. Exit codes are not the same as the termination actions.
View logged job exit information (bacct -l)
- Use bacct -l to view job exit information logged to lsb.acct:
bacct -l 7265
Accounting information about jobs that are:
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
------------------------------------------------------------------------------
Job <7265>, User <lsfadmin>, Project <default>, Status <EXIT>, Queue <normal>, Command <srun sleep 100000>
Thu Sep 16 15:22:09: Submitted from host <hostA>, CWD <$HOME>;
Thu Sep 16 15:22:20: Dispatched to 4 Hosts/Processors <4*hostA>;
Thu Sep 16 15:22:20: slurm_id=21793;ncpus=4;slurm_alloc=n[13-14];
Thu Sep 16 15:23:21: Completed <exit>; TERM_RUNLIMIT: job killed after reaching LSF run time limit.
Accounting information about this job:
  Share group charged </lsfadmin>
  CPU_T  WAIT  TURNAROUND  STATUS  HOG_FACTOR  MEM  SWAP
   0.04    11          72    exit      0.0006   0K    0K
------------------------------------------------------------------------------
SUMMARY: ( time unit: second )
  Total number of done jobs: 0
  Total number of exited jobs: 1
  Total CPU time consumed: 0.0
  Average CPU time consumed: 0.0
  Maximum CPU time of a job: 0.0
  Minimum CPU time of a job: 0.0
  Total wait time in queues: 11.0
  Average wait time in queue: 11.0
  Maximum wait time in queue: 11.0
  Minimum wait time in queue: 11.0
  Average turnaround time: 72 (seconds/job)
  Maximum turnaround time: 72
  Minimum turnaround time: 72
  Average hog factor of a job: 0.00 ( cpu time / turnaround time )
  Maximum hog factor of a job: 0.00
  Minimum hog factor of a job: 0.00
Termination reasons displayed by bacct
When LSF detects that a job is terminated, bacct -l displays one of the following termination reasons:
tip:
The integer values logged to the JOB_FINISH event in lsb.acct and the termination reason keywords are mapped in lsbatch.h.
Restrictions
- If a queue-level JOB_CONTROL is configured, LSF cannot determine the result of the action. The logged termination reason only reflects what LSF believes the reason to be.
- LSF cannot be guaranteed to catch any external signals sent directly to the job.
- In MultiCluster, a brequeue request sent from the submission cluster is translated to TERM_OWNER or TERM_ADMIN in the remote execution cluster. The termination reason in the email notification sent from the execution cluster, as well as in lsb.acct, is set to TERM_OWNER or TERM_ADMIN.
Example output of bacct and bhist
Understanding LSF job exit codes
Exit codes are generated by LSF when jobs end due to signals received instead of exiting normally. LSF collects exit codes via the wait3() system call on UNIX platforms. The LSF exit code is a result of the system exit values. Exit codes less than 128 relate to application exit values, while exit codes greater than 128 relate to system signal exit values (LSF adds 128 to system values). Use bhist to see the exit code for your job.
How or why the job may have been signaled, or exited with a certain exit code, can be application and/or system specific. The application or system logs might be able to give a better description of the problem.
tip:
Termination signals are operating system dependent, so signal 5 may not be SIGTRAP and 11 may not be SIGSEGV on all UNIX and Linux systems. You need to pay attention to the execution host type in order to correctly translate the exit value if the job has been signaled.
Application exit values
The most common cause of abnormal LSF job termination is the application's own exit values. If your application had an explicit exit value less than 128, bjobs and bhist display the actual exit code of the application; for example, Exited with exit code 3. You would have to refer to the application code for the meaning of exit code 3.
It is possible for a job to explicitly exit with an exit code greater than 128, which can be confused with the corresponding system signal. Make sure that applications you write do not use exit codes greater than 128.
System signal exit values
Jobs terminated with a system signal are returned by LSF as exit codes greater than 128 such that exit_code-128=signal_value. For example, exit code 133 means that the job was terminated with signal 5 (SIGTRAP on most systems, 133-128=5). A job with exit code 130 was terminated with signal 2 (SIGINT on most systems, 130-128 = 2).
Some operating systems define exit values as 0-255. As a result, negative exit values or values greater than 255 may have a wrap-around effect on that range. The most common example of this is a program that exits -1, which is seen as exit code 255 in LSF.
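For example, on most Linux systems you can translate an LSF exit code back to a signal name on the execution host (a sketch; 133 is an illustrative exit code):
# Subtract 128 from the exit code, then look up the signal number:
kill -l $((133 - 128))
# prints TRAP on most Linux systems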
bhist and bjobs output
In most cases, bjobs and bhist show the application exit value (128 + signal). In some cases, bjobs and bhist show the actual signal value.
If LSF sends catchable signals to the job, it displays the exit value. For example, if you run bkill jobID to kill the job, LSF passes SIGINT, which causes the job to exit with exit code 130 (SIGINT is 2 on most systems, 128+2 = 130).
If LSF sends uncatchable signals to the job, then the entire process group for the job exits with the corresponding signal. For example, if you run bkill -s SEGV jobID to kill the job, bjobs and bhist show:
Exited by signal 7
Example
The following example shows a job that exited with exit code 139, which means that the job was terminated with signal 11 (SIGSEGV on most systems, 139-128=11). This means that the application had a core dump.
bjobs -l 2012
Job <2012>, User , Project , Status , Queue , Command
Fri Dec 27 22:47:28: Submitted from host , CWD <$HOME>;
Fri Dec 27 22:47:37: Started on , Execution Home , Execution CWD ;
Fri Dec 27 22:48:02: Exited with exit code 139. The CPU time used is 0.2 seconds.

SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut    pg    io   ls   it   tmp   swp   mem
 loadSched   -     -     -     -     -    -    -    -     -     -     -
 loadStop    -     -     -     -     -    -    -    -     -     -     -

           cpuspeed   bandwidth
 loadSched     -          -
 loadStop      -          -