Troubleshooting and Error Messages
Contents
- Shared File Access
- Common LSF Problems
- Error Messages
- Setting Daemon Message Log to Debug Level
- Setting Daemon Timing Levels
Shared File Access
A frequent problem with LSF is non-accessible files due to a non-uniform file space. If a task is run on a remote host where a file it requires cannot be accessed using the same name, an error results. Almost all interactive LSF commands fail if the user's current working directory cannot be found on the remote host.
Shared files on UNIX
If you are running NFS, rearranging the NFS mount table may solve the problem. If your system is running the automount server, LSF tries to map the filenames, and in most cases it succeeds. If shared mounts are used, the mapping may break for those files. In such cases, specific measures need to be taken to get around it.
The automount maps must be managed through NIS. When LSF tries to map filenames, it assumes that automounted file systems are mounted under the /tmp_mnt directory.
Shared files on Windows
- To share files among Windows machines, set up a share on the server and access it from the client. You can access files on the share either by specifying a UNC path (\\server\share\path) or connecting the share to a local drive name and using a drive:\path syntax. Using UNC is recommended because drive mappings may be different across machines, while UNC allows you to unambiguously refer to a file on the network.
Shared files across UNIX and Windows
For file sharing across UNIX and Windows, you require a third party NFS product on Windows to export directories from Windows to UNIX.
Common LSF Problems
This section lists some other common problems with the LIM, RES, mbatchd, sbatchd, and interactive applications.
Most problems are due to incorrect installation or configuration. Check the error log files; often the log message points directly to the problem.
LIM dies quietly
- Run the following command to check for errors in the LIM configuration files.
lsadmin ckconfig -v
This displays most configuration errors. If this does not report any errors, check in the LIM error log.
LIM unavailable
Sometimes the LIM is up, but executing the lsload command prints the following error message:
Communication time out.
If the LIM has just been started, this is normal, because the LIM needs time to get initialized by reading configuration files and contacting other LIMs. If the LIM does not become available within one or two minutes, check the LIM error log for the host you are working on.
To prevent communication timeouts when starting or restarting the local LIM, define the parameter LSF_SERVER_HOSTS in the lsf.conf file. The client will contact the LIM on one of the LSF_SERVER_HOSTS and execute the command, provided that at least one of the hosts defined in the list has a LIM that is up and running.
When the local LIM is running but there is no master LIM in the cluster, LSF applications display the following message:
Cannot locate master LIM now, try later.
- Check the LIM error logs on the first few hosts listed in the Host section of the lsf.cluster.cluster_name file. If LSF_MASTER_LIST is defined in lsf.conf, check the LIM error logs on the hosts listed in this parameter instead.
Master LIM is down
Sometimes the master LIM is up, but executing the lsload or lshosts command prints the following error message:
Master LIM is down; try later
If the /etc/hosts file on the host where the master LIM is running is configured with the host name assigned to the loopback IP address (127.0.0.1), LSF client LIMs cannot contact the master LIM. When the master LIM starts up, it sets its official host name and IP address to the loopback address. Client requests then receive 127.0.0.1 as the master LIM address, so each client tries to connect to itself instead of the master.
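The loopback condition described above can be checked mechanically. The following is a minimal sketch, not an LSF tool: the function names loopback_names and master_on_loopback are hypothetical, and the code assumes a hosts file in the usual "address name [alias ...]" layout with # comments.

```python
import ipaddress


def loopback_names(hosts_text):
    """Return the set of host names that a hosts file maps to a loopback address."""
    names = set()
    for raw in hosts_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        try:
            address = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # first field is not an IP address; ignore the line
        if address.is_loopback:  # 127.0.0.0/8 for IPv4, ::1 for IPv6
            names.update(fields[1:])
    return names


def master_on_loopback(hosts_text, hostname):
    """True if hostname (hypothetical master host name) resolves to loopback."""
    return hostname in loopback_names(hosts_text)
```

To check a real host, pass the contents of /etc/hosts and the master host name; a True result indicates the misconfiguration described above.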
- Check the IP configuration of your master LIM in /etc/hosts. The following example incorrectly sets the master LIM IP address to the loopback address:
127.0.0.1 localhost myhostname
The following example correctly sets the master LIM IP address:
127.0.0.1 localhost
192.168.123.123 myhostname
For a master LIM running on a host that uses an IPv6 address, the loopback address is
::1
The following example correctly sets the master LIM IP address using an IPv6 address:
::1 localhost ipv6-localhost ipv6-loopback
fe00::0 ipv6-localnet
ff00::0 ipv6-mcastprefix
ff02::1 ipv6-allnodes
ff02::2 ipv6-allrouters
ff02::3 ipv6-allhosts
RES does not start
- Check the RES error log.
User permission denied
If remote execution fails with the following error message, the remote host could not securely determine the user ID of the user requesting remote execution.
User permission denied.
- Check the RES error log on the remote host; this usually contains a more detailed error message.
- If you are not using an identification daemon (LSF_AUTH is not defined in the lsf.conf file), then all applications that do remote executions must be owned by root with the setuid bit set. This can be done as follows.
chmod 4755 filename
- If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
- If you are using an identification daemon (defined in the lsf.conf file by LSF_AUTH), inetd must be configured to run the daemon. The identification daemon must not be run directly.
- If LSF_USE_HOSTEQUIV is defined in the lsf.conf file, check if /etc/hosts.equiv or HOME/.rhosts on the destination host has the client host name in it. Inconsistent host names in a name server with /etc/hosts and /etc/hosts.equiv can also cause this problem.
- On SGI hosts running a name server, you can try the following command to tell the host name lookup code to search the /etc/hosts file before calling the name server.
setenv HOSTRESORDER "local,nis,bind"
- For Windows hosts, users must register and update their Windows passwords using the lspasswd command. Passwords must be between 3 and 31 characters long.
For Windows password authentication in a non-shared file system environment, you must define the parameter LSF_MASTER_LIST in lsf.conf so that jobs will run with correct permissions. If you do not define this parameter, LSF assumes that the cluster uses a shared file system environment.
Non-uniform file name space
A command may fail with the following error message due to a non-uniform file name space.
chdir(...) failed: no such file or directory
You are trying to execute a command remotely, where either your current working directory does not exist on the remote host, or your current working directory is mapped to a different name on the remote host.
If your current working directory does not exist on a remote host, you should not execute commands remotely on that host.
On UNIX
If the directory exists, but is mapped to a different name on the remote host, you have to create symbolic links to make them consistent.
LSF can resolve most, but not all, problems using automount. The automount maps must be managed through NIS. Follow the instructions in your Release Notes for obtaining technical support if you are running automount and LSF is not able to locate directories on remote hosts.
Batch daemons die quietly
- First, check the sbatchd and mbatchd error logs. Try running the following command to check the configuration.
badmin ckconfig
This reports most errors. You should also check if there is any email in the LSF administrator's mailbox. If the mbatchd is running but the sbatchd dies on some hosts, it may be because mbatchd has not been configured to use those hosts.
See Host not used by LSF.
sbatchd starts but mbatchd does not
- Check whether LIM is running. You can test this by running the lsid command. If LIM is not running properly, follow the suggestions in this chapter to fix the LIM first. It is possible that mbatchd is temporarily unavailable because the master LIM is temporarily unknown, causing the following error message.
sbatchd: unknown service
- Check whether services are registered properly. See Registering Service Ports for information about registering LSF services.
Detached processes
LSF uses process groups to keep track of all the processes of a job. When a job is launched, the application runs under the job-RES (or root) process group.
If an application creates a new process group, and its PPID still belongs to the job, the PIM can track this new process group as part of the job.
However, if the application forks a child, the child creates a new process group, and the parent exits immediately, the child's process group is orphaned and cannot be tracked.
Any process that daemonizes itself will almost certainly be lost (its child processes are orphaned) because it changes its process group right after detaching.
The only reliable way to not lose track of a process is to prevent it from using a new process group.
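The operating-system behavior behind this can be observed directly. The sketch below is an illustration only and does not involve LSF or the PIM: it assumes a POSIX host, and the helper name child_process_group is hypothetical. It starts a short-lived child and reports the process group the child ends up in, with and without a new session.

```python
import os
import subprocess
import sys


def child_process_group(new_session):
    """Start a short-lived child and return the process group ID it runs in."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"],
        # When True, the child calls setsid() before exec, which also
        # places it in a brand-new process group.
        start_new_session=new_session,
    )
    try:
        return os.getpgid(child.pid)
    finally:
        child.kill()
        child.wait()
```

A child started with new_session=False stays in the caller's process group, which is the case the section recommends. A child started with new_session=True reports a different group, which corresponds to the situation where tracking can be lost once the parent exits.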
Host not used by LSF
If you configure a list of server hosts in the Host section of the lsb.hosts file, mbatchd allows sbatchd to run only on the hosts listed. If you try to configure an unknown host in the HostGroup or HostPartition sections of the lsb.hosts file, or as a HOSTS definition for a queue in the lsb.queues file, mbatchd logs the following message.
mbatchd on host: LSB_CONFDIR/cluster1/configdir/file(line #): Host hostname is not used by lsbatch; ignored
If you start sbatchd on a host that is not known by mbatchd, mbatchd rejects the sbatchd. The sbatchd logs the following message and exits.
This host is not used by lsbatch system.
Both of these errors are most often caused by not running the following commands, in order, after adding a host to the configuration.
lsadmin reconfig
badmin reconfig
You must run both of these before starting the daemons on the new host.
UNKNOWN host type or model
Viewing UNKNOWN host type or model
- Run lshosts. A model or type UNKNOWN indicates the host is down or the LIM on the host is down. You need to take immediate action. For example:
lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
hostA UNKNOWN Ultra2 20.2 2 256M 710M Yes ()
Fixing UNKNOWN matched host type or matched model
- Start the host.
- Run lsadmin limstartup to start LIM on the host.
For example:
lsadmin limstartup hostA
Starting up LIM on <hostA> .... done
or, if EGO is enabled in the LSF cluster, you can also run:
egosh ego start lim hostA
Starting up LIM on <hostA> .... done
You can specify more than one host name to start up LIM on multiple hosts. If you do not specify a host name, LIM is started up on the host from which the command is submitted.
On UNIX, in order to start up LIM remotely, you must be root or listed in lsf.sudoers (or ego.sudoers if EGO is enabled in the LSF cluster) and be able to run the rsh command across all hosts without entering a password.
- Wait a few seconds, then run lshosts again. You should now be able to see a specific model or type for the host or DEFAULT. If you see DEFAULT, it means that automatic detection of host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
DEFAULT host type or model
Viewing DEFAULT host type or model
If you see DEFAULT in lim -t, it means that automatic detection of host type or model has failed, and the host type configured in lsf.shared cannot be found. LSF will work on the host, but a DEFAULT model may be inefficient because of incorrect CPU factors. A DEFAULT type may also cause binary incompatibility because a job from a DEFAULT host type can be migrated to another DEFAULT host type.
- Run lshosts. If Model or Type are displayed as DEFAULT when you use lshosts and automatic host model and type detection is enabled, you can leave it as is or change it. For example:
lshosts
HOST_NAME type model cpuf ncpus maxmem maxswp server RESOURCES
hostA DEFAULT DEFAULT 1 2 256M 710M Yes ()
If model is DEFAULT, LSF will work correctly but the host will have a CPU factor of 1, which may not make efficient use of the host model.
If type is DEFAULT, there may be binary incompatibility. For example, there are 2 hosts, one is Solaris, the other is HP. If both hosts are set to type DEFAULT, it means jobs running on the Solaris host can be migrated to the HP host and vice-versa.
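When a cluster has many hosts, scanning lshosts output by eye for UNKNOWN or DEFAULT entries is error-prone. The following is a minimal sketch of a filter, assuming the column layout shown in the examples in this chapter (host name, type, and model as the first three whitespace-separated columns). The function name hosts_needing_attention is hypothetical and not part of LSF.

```python
def hosts_needing_attention(lshosts_output):
    """Map host name -> (type, model) for rows reporting UNKNOWN or DEFAULT."""
    flagged = {}
    for line in lshosts_output.splitlines():
        fields = line.split()
        # Skip blank lines, short lines, and the HOST_NAME header row.
        if len(fields) < 3 or fields[0] == "HOST_NAME":
            continue
        host, host_type, model = fields[0], fields[1], fields[2]
        if {host_type, model} & {"UNKNOWN", "DEFAULT"}:
            flagged[host] = (host_type, model)
    return flagged
```

The function only reads text that has already been captured from lshosts; how that text is obtained (for example, by running the command and saving its output) is left to the caller.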
Fixing DEFAULT matched host type or matched model
- Run lim -t on the host whose type is DEFAULT:
lim -t
Host Type : LINUX86
Host Architecture : SUNWUltra2_200_sparcv9
Physical Processors : 2
Cores per Processor : 4
Threads per Core: : 2
Matched Type: DEFAULT
Matched Architecture: DEFAULT
Matched Model: DEFAULT
CPU Factor : 60.0
Note the value of Host Type and Host Architecture.
- Edit lsf.shared.
- In the HostType section, enter a new host type. Use the host type name detected with lim -t. For example:
Begin HostType
TYPENAME
DEFAULT
CRAYJ
LINUX86
...
End HostType
- In the HostModel section, enter the new host model with architecture and CPU factor. Use the architecture detected with lim -t. Add the host model to the end of the host model list. The limit for host model entries is 127. Lines commented out with # are not counted in the 127-line limit. For example:
Begin HostModel
MODELNAME CPUFACTOR ARCHITECTURE # keyword
Ultra2 20 SUNWUltra2_200_sparcv9
End HostModel
- Save changes to lsf.shared.
- Run lsadmin reconfig to reconfigure LIM.
- Wait a few seconds, and run lim -t again to check the type and model of the host.
Error Messages
The following error messages are logged by the LSF daemons, or displayed by the following commands.
lsadmin ckconfig
badmin ckconfig
General errors
The messages listed in this section may be generated by any LSF daemon.
can't open file: error
The daemon could not open the named file for the reason given by error. This error is usually caused by incorrect file permissions or missing files. All directories in the path to the configuration files must have execute (x) permission for the LSF administrator, and the actual files must have read (r) permission. Missing files could be caused by incorrect path names in the lsf.conf file, running LSF daemons on a host where the configuration files have not been installed, or having a symbolic link pointing to a nonexistent file or directory.
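The permission and missing-file conditions listed above can be checked from a script. This is a sketch under stated assumptions: the function name first_config_path_problem is hypothetical, and os.access tests the permissions of the user running the script, so it should be run as the LSF administrator to be meaningful.

```python
import os


def first_config_path_problem(path):
    """Return a description of the first problem found, or None if none is found."""
    path = os.path.abspath(path)
    # Collect every ancestor directory, ending at the file system root.
    ancestors = []
    current = os.path.dirname(path)
    while True:
        ancestors.append(current)
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    # Walk from the root down: each directory needs execute (search) permission.
    for directory in reversed(ancestors):
        if not os.path.isdir(directory):
            return "missing directory: " + directory
        if not os.access(directory, os.X_OK):
            return "no execute permission: " + directory
    if os.path.islink(path) and not os.path.exists(path):
        return "dangling symbolic link: " + path
    if not os.path.exists(path):
        return "missing file: " + path
    if not os.access(path, os.R_OK):
        return "no read permission: " + path
    return None
```

The check stops at the first problem because a directory without execute permission hides everything beneath it, so later results would not be reliable.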
file(line): malloc failed
Memory allocation failed. Either the host does not have enough available memory or swap space, or there is an internal error in the daemon. Check the program load and available swap space on the host; if the swap space is full, you must add more swap space or run fewer (or smaller) programs on that host.
auth_user: getservbyname(ident/tcp) failed: error; ident must be registered in services
LSF_AUTH=ident is defined in the lsf.conf file, but the ident/tcp service is not defined in the services database. Add ident/tcp to the services database, or remove LSF_AUTH from the lsf.conf file and setuid root those LSF binaries that require authentication.
auth_user: operation(<host>/<port>) failed: error
LSF_AUTH=ident is defined in the lsf.conf file, but the LSF daemon failed to contact the identd daemon on host. Check that identd is defined in inetd.conf and the identd daemon is running on host.
auth_user: Authentication data format error (rbuf=<data>) from <host>/<port>
auth_user: Authentication port mismatch (...) from <host>/<port>
LSF_AUTH=ident is defined in the lsf.conf file, but there is a protocol error between LSF and the ident daemon on host. Make sure the ident daemon on the host is configured correctly.
userok: Request from bad port (<port_number>), denied
LSF_AUTH is not defined, and the LSF daemon received a request that originates from a non-privileged port. The request is not serviced.
Set the LSF binaries to be owned by root with the setuid bit set, or define LSF_AUTH=ident and set up an ident server on all hosts in the cluster. If the binaries are on an NFS-mounted file system, make sure that the file system is not mounted with the nosuid flag.
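Whether a binary meets the ownership and setuid requirement can be verified without reading ls output. The following is a minimal sketch; the function names has_setuid_bit and is_setuid_root are hypothetical, and the check does not detect a nosuid mount option, which must still be verified separately.

```python
import os
import stat


def has_setuid_bit(mode):
    """True if the permission bits in mode include the setuid bit (the 4 in 4755)."""
    return bool(mode & stat.S_ISUID)


def is_setuid_root(path):
    """True if the file at path is owned by root (uid 0) and has the setuid bit set."""
    info = os.stat(path)
    return info.st_uid == 0 and has_setuid_bit(info.st_mode)
```

A binary prepared with chmod 4755 and owned by root should make is_setuid_root return True.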
userok: Forged username suspected from <host>/<port>: <claimed_user>/<actual_user>
The service request claimed to come from user claimed_user but ident authentication returned that the user was actually actual_user. The request was not serviced.
userok: ruserok(<host>,<uid>) failed
LSF_USE_HOSTEQUIV is defined in the lsf.conf file, but host has not been set up as an equivalent host (see /etc/hosts.equiv), and user uid has not set up a .rhosts file.
init_AcceptSock: RES service(res) not registered, exiting
init_AcceptSock: res/tcp: unknown service, exiting
initSock: LIM service not registered.
initSock: Service lim/udp is unknown. Read LSF Guide for help
get_ports: <serv> service not registered
The LSF services are not registered. See Registering Service Ports for information about configuring LSF services.
init_AcceptSock: Can't bind daemon socket to port <port>: error, exiting
init_ServSock: Could not bind socket to port <port>: error
These error messages can occur if you try to start a second LSF daemon (for example, RES is already running, and you execute RES again). If this is the case, and you want to start the new daemon, kill the running daemon or use the lsadmin or badmin commands to shut down or restart the daemon.
Configuration errors
The messages listed in this section are caused by problems in the LSF configuration files. General errors are listed first, and then errors from specific files.
file(line): Section name expected after Begin; ignoring section
file(line): Invalid section name name; ignoring section
The keyword begin at the specified line is not followed by a section name, or is followed by an unrecognized section name.
file(line): section section: Premature EOF
The end of file was reached before reading the end section line for the named section.
file(line): keyword line format error for section section; Ignore this section
The first line of the section should contain a list of keywords. This error is printed when the keyword line is incorrect or contains an unrecognized keyword.
file(line): values do not match keys for section section; Ignoring line
The number of fields on a line in a configuration section does not match the number of keywords. This may be caused by not putting () in a column to represent the default value.
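The structural rules behind these messages (a section name after Begin, a keyword line, one value per keyword, and a closing End line) can be illustrated with a small checker. This is a simplified sketch, not the parser LSF uses. It assumes that fields contain no embedded whitespace, that # starts a comment, and that Begin and End are matched without regard to case. The function name check_section and the message strings are hypothetical.

```python
def check_section(text):
    """Return a list of (line_number, message) problems for one Begin/End section."""
    rows = []
    for number, raw in enumerate(text.splitlines(), start=1):
        fields = raw.split("#", 1)[0].split()  # strip comments, split on whitespace
        if fields:
            rows.append((number, fields))
    if not rows:
        return []
    number, first = rows[0]
    if first[0].lower() != "begin" or len(first) != 2:
        return [(number, "section name expected after Begin")]
    name = first[1].lower()
    problems = []
    keywords = None
    for number, fields in rows[1:]:
        if [field.lower() for field in fields] == ["end", name]:
            return problems
        if keywords is None:
            keywords = fields  # the first line of the section lists the keywords
        elif len(fields) != len(keywords):
            problems.append((number, "values do not match keys"))
    # The closing End line was never seen.
    problems.append((rows[-1][0], "premature EOF"))
    return problems
```

A line that leaves a column empty instead of writing () has fewer fields than keywords, which this sketch reports as "values do not match keys".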
file: HostModel section missing or invalid
file: Resource section missing or invalid
file: HostType section missing or invalid
The HostModel, Resource, or HostType section in the lsf.shared file is either missing or contains an unrecoverable error.
file(line): Name name reserved or previously defined. Ignoring index
The name assigned to an external load index must not be the same as any built-in or previously defined resource or load index.
file(line): Duplicate clustername name in section cluster. Ignoring current line
A cluster name is defined twice in the same lsf.shared file. The second definition is ignored.
file(line): Bad cpuFactor for host model model. Ignoring line
The CPU factor declared for the named host model in the lsf.shared file is not a valid number.
file(line): Too many host models, ignoring model name
You can declare a maximum of 127 host models in the lsf.shared file.
file(line): Resource name name too long in section resource. Should be less than 40 characters. Ignoring line
The maximum length of a resource name is 39 characters. Choose a shorter name for the resource.
file(line): Resource name name reserved or previously defined. Ignoring line.
You have attempted to define a resource name that is reserved by LSF or already defined in the lsf.shared file. Choose another name for the resource.
file(line): illegal character in resource name: name, section resource. Line ignored.
Resource names must begin with a letter in the set [a-zA-Z], followed by letters, digits or underscores [a-zA-Z0-9_].
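The length and character rules for resource names can be expressed as a short validator. This is a sketch based only on the two rules stated above (at most 39 characters; a letter followed by letters, digits, or underscores). It does not know which names LSF reserves or which are already defined, and the names resource_name_problem and MAX_RESOURCE_NAME_LENGTH are hypothetical.

```python
import re

MAX_RESOURCE_NAME_LENGTH = 39
# A letter first, then any number of letters, digits, or underscores.
_RESOURCE_NAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")


def resource_name_problem(name):
    """Return 'too long', 'illegal character', or None if the name passes both rules."""
    if len(name) > MAX_RESOURCE_NAME_LENGTH:
        return "too long"
    if not _RESOURCE_NAME.fullmatch(name):
        return "illegal character"
    return None
```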
LIM messages
The following messages are logged by the LIM:
main: LIM cannot run without licenses, exiting
The LSF software license key is not found or has expired. Check that FLEXnet is set up correctly, or contact your LSF technical support.
main: Received request from unlicensed host <host>/<port>
LIM refuses to service requests from hosts that do not have licenses. Either your LSF license has expired, or you have configured LSF on more hosts than your license key allows.
initLicense: Trying to get license for LIM from source <LSF_CONFDIR/license.dat>
getLicense: Can't get software license for LIM from license file <LSF_CONFDIR/license.dat>: feature not yet available.
Your LSF license is not yet valid. Check whether the system clock is correct.
findHostbyAddr/<proc>: Host <host>/<port> is unknown by <myhostname>
function: Gethostbyaddr_(<host>/<port>) failed: error
main: Request from unknown host <host>/<port>: error
function: Received request from non-LSF host <host>/<port>
The daemon does not recognize host. The request is not serviced. These messages can occur if host was added to the configuration files, but not all the daemons have been reconfigured to read the new information. If the problem still occurs after reconfiguring all the daemons, check whether the host is a multi-addressed host.
See Host Naming for information about working with multi-addressed hosts.
rcvLoadVector: Sender (<host>/<port>) may have different config?
MasterRegister: Sender (host) may have different config?
LIM detected inconsistent configuration information with the sending LIM. Run the following command so that all the LIMs have the same configuration information.
lsadmin reconfig
Note any hosts that failed to be contacted.
rcvLoadVector: Got load from client-only host <host>/<port>. Kill LIM on <host>/<port>
A LIM is running on a client host. Run the following command, or go to the client host and kill the LIM daemon.
lsadmin limshutdown host
saveIndx: Unknown index name <name> from ELIM
LIM received an external load index name that is not defined in the lsf.shared file. If name is defined in lsf.shared, reconfigure the LIM. Otherwise, add name to the lsf.shared file and reconfigure all the LIMs.
saveIndx: ELIM over-riding value of index <name>
This is a warning message. The ELIM sent a value for one of the built-in index names. LIM uses the value from ELIM in place of the value obtained from the kernel.
getusr: Protocol error numIndx not read (cc=num): error
getusr: Protocol error on index number (cc=num): error
Protocol error between ELIM and LIM.
RES messages
These messages are logged by the RES.
doacceptconn: getpwnam(<username>@<host>/<port>) failed: error
doacceptconn: User <username> has uid <uid1> on client host <host>/<port>, uid <uid2> on RES host; assume bad user
authRequest: username/uid <userName>/<uid>@<host>/<port> does not exist
authRequest: Submitter's name <clname>@<clhost> is different from name <lname> on this host
RES assumes that a user has the same userID and username on all the LSF hosts. These messages occur if this assumption is violated. If the user is allowed to use LSF for interactive remote execution, make sure the user's account has the same userID and username on all LSF hosts.
doacceptconn: root remote execution permission denied
authRequest: root job submission rejected
Root tried to execute or submit a job but LSF_ROOT_REX is not defined in the lsf.conf file.
resControl: operation permission denied, uid = <uid>
The user with user ID uid is not allowed to make RES control requests. Only the LSF manager, or root if LSF_ROOT_REX is defined in lsf.conf, can make RES control requests.
resControl: access(respath, X_OK): error
The RES received a reboot request, but failed to find the file respath to re-execute itself. Make sure respath contains the RES binary, and it has execution permission.
mbatchd and sbatchd messages
The following messages are logged by the mbatchd and sbatchd daemons:
renewJob: Job <jobId>: rename(<from>,<to>) failed: error
mbatchd failed in trying to re-submit a rerunnable job. Check that the file from exists and that the LSF administrator can rename the file. If from is in an AFS directory, check that the LSF administrator's token processing is properly set up.
See the document "Installing LSF on AFS" on the Platform Web site for more information about installing on AFS.
logJobInfo_: fopen(<logdir/info/jobfile>) failed: error
logJobInfo_: write <logdir/info/jobfile> <data> failed: error
logJobInfo_: seek <logdir/info/jobfile> failed: error
logJobInfo_: write <logdir/info/jobfile> xdrpos <pos> failed: error
logJobInfo_: write <logdir/info/jobfile> xdr buf len <len> failed: error
logJobInfo_: close(<logdir/info/jobfile>) failed: error
rmLogJobInfo: Job <jobId>: can't unlink(<logdir/info/jobfile>): error
rmLogJobInfo_: Job <jobId>: can't stat(<logdir/info/jobfile>): error
readLogJobInfo: Job <jobId> can't open(<logdir/info/jobfile>): error
start_job: Job <jobId>: readLogJobInfo failed: error
readLogJobInfo: Job <jobId>: can't read(<logdir/info/jobfile>) size size: error
initLog: mkdir(<logdir/info>) failed: error
<fname>: fopen(<logdir/file> failed: error
getElogLock: Can't open existing lock file <logdir/file>: error
getElogLock: Error in opening lock file <logdir/file>: error
releaseElogLock: unlink(<logdir/lockfile>) failed: error
touchElogLock: Failed to open lock file <logdir/file>: error
touchElogLock: close <logdir/file> failed: error
mbatchd failed to create, remove, read, or write the log directory or a file in the log directory, for the reason given in error. Check that the LSF administrator has read, write, and execute permissions on the logdir directory.
If logdir is on AFS, check that the instructions in the document "Installing LSF on AFS" on the Platform Web site have been followed. Use the fs ls command to verify that the LSF administrator owns logdir and that the directory has the correct ACL.
replay_newjob: File <logfile> at line <line>: Queue <queue> not found, saving to queue <lost_and_found>
replay_switchjob: File <logfile> at line <line>: Destination queue <queue> not found, switching to queue <lost_and_found>
When mbatchd was reconfigured, jobs were found in queue but that queue is no longer in the configuration.
replay_startjob: JobId <jobId>: exec host <host> not found, saving to host <lost_and_found>
When mbatchd was reconfigured, the event log contained jobs dispatched to host, but that host is no longer configured to be used by LSF.
do_restartReq: Failed to get hData of host <host_name>/<host_addr>
mbatchd received a request from sbatchd on host host_name, but that host is not known to mbatchd. Either the configuration file has been changed but mbatchd has not been reconfigured to pick up the new configuration, or host_name is a client host but the sbatchd daemon is running on that host. Run the following command to reconfigure the mbatchd or kill the sbatchd daemon on host_name.
badmin reconfig
LSF command messages
LSF daemon (LIM) not responding ... still trying
During LIM restart, LSF commands will fail and display this error message. User programs linked to the LIM API will also fail for the same reason. This message is displayed when LIM running on the master host list or server host list is restarted after configuration changes, such as adding new resources, binary upgrade, and so on.
Use LSF_LIM_API_NTRIES in lsf.conf or as an environment variable to define how many times LSF commands will retry to communicate with the LIM API while LIM is not available. LSF_LIM_API_NTRIES is ignored by LSF and EGO daemons and all EGO commands.
When LSB_API_VERBOSE=Y in lsf.conf, LSF batch commands will display the not responding retry error message to stderr when LIM is not available.
When LSB_API_VERBOSE=N in lsf.conf, LSF batch commands will not display the retry error message when LIM is not available.
Batch command client messages
LSF displays error messages when a batch command cannot communicate with mbatchd. The following table provides a list of possible error reasons and the associated error message output.
EGO command messages
You cannot run the egosh command because the administrator has chosen not to enable EGO in lsf.conf: LSF_ENABLE_EGO=N.
If EGO is disabled, the egosh command cannot find ego.conf or cannot contact vemkd (not started).
Setting Daemon Message Log to Debug Level
The message log level for LSF daemons is set in lsf.conf with the parameter LSF_LOG_MASK. To include debugging messages, set LSF_LOG_MASK to one of:
- LOG_DEBUG
- LOG_DEBUG1
- LOG_DEBUG2
- LOG_DEBUG3
By default, LSF_LOG_MASK=LOG_WARNING and these debugging messages are not displayed.
The debugging log classes for LSF daemons are set in lsf.conf with the parameters LSB_DEBUG_CMD, LSB_DEBUG_MBD, LSB_DEBUG_SBD, LSB_DEBUG_SCH, LSF_DEBUG_LIM, LSF_DEBUG_RES.
The location of log files is specified with the parameter LSF_LOGDIR in lsf.conf.
You can use the lsadmin and badmin commands to temporarily change the class, log file, or message log level for specific daemons such as LIM, RES, mbatchd, sbatchd, and mbschd without changing lsf.conf.
How the message log level takes effect
The message log level you set will only be in effect from the time you set it until you turn it off or the daemon stops running, whichever is sooner. If the daemon is restarted, its message log level is reset back to the value of LSF_LOG_MASK and the log file is stored in the directory specified by LSF_LOGDIR.
Limitations
When debug or timing level is set for RES with lsadmin resdebug, or lsadmin restime, the debug level only affects root RES. The root RES is the RES that runs under the root user ID.
Application RESs always use lsf.conf to set the debug environment. Application RESs are the RESs that have been created by sbatchd to service jobs and run under the ID of the user who submitted the job.
This means that any RES that has been launched automatically by the LSF system will not be affected by temporary debug or timing settings. The application RES will retain settings specified in lsf.conf.
Debug commands for daemons
The following commands set temporary message log level options for LIM, RES, mbatchd, sbatchd, and mbschd.
lsadmin limdebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
lsadmin resdebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
badmin mbddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o]
badmin sbddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o] [host_name]
badmin schddebug [-c class_name] [-l debug_level] [-f logfile_name] [-o]
For a detailed description of lsadmin and badmin, see the Platform LSF Command Reference.
Examples
lsadmin limdebug -c "LC_MULTI LC_PIM" -f myfile hostA hostB
Log additional messages for the LIM daemon running on hostA and hostB, related to MultiCluster and PIM. Create log files in the LSF_LOGDIR directory with the name myfile.lim.log.hostA, and myfile.lim.log.hostB. The debug level is the default value, LOG_DEBUG level in parameter LSF_LOG_MASK.
lsadmin limdebug -o hostA hostB
Turn off temporary debug settings for LIM on hostA and hostB and reset them to the daemon starting state. The message log level is reset back to the value of LSF_LOG_MASK and classes are reset to the value of LSF_DEBUG_RES, LSF_DEBUG_LIM, LSB_DEBUG_MBD, LSB_DEBUG_SBD, and LSB_DEBUG_SCH. The log file is reset to the LSF system log file in the directory specified by LSF_LOGDIR in the format daemon_name.log.host_name.
badmin sbddebug -o
Turn off temporary debug settings for sbatchd on the local host (host from which the command was submitted) and reset them to the daemon starting state. The message log level is reset back to the value of LSF_LOG_MASK and classes are reset to the value of LSF_DEBUG_RES, LSF_DEBUG_LIM, LSB_DEBUG_MBD, LSB_DEBUG_SBD, and LSB_DEBUG_SCH. The log file is reset to the LSF system log file in the directory specified by LSF_LOGDIR in the format daemon_name.log.host_name.
badmin mbddebug -l 1
Log messages for mbatchd running on the local host and set the log message level to LOG_DEBUG1. This command must be submitted from the host on which mbatchd is running because host_name cannot be specified with mbddebug.
badmin sbddebug -f hostB/myfolder/myfile hostA
Log messages for sbatchd running on hostA, to the directory myfolder on the server hostB, with the file name myfile.sbatchd.log.hostA. The debug level is the default value, LOG_DEBUG level in parameter LSF_LOG_MASK.
badmin schddebug -l 2
Log messages for mbschd running on the local host and set the log message level to LOG_DEBUG2. This command must be submitted from the host on which mbatchd is running because host_name cannot be specified with schddebug.
badmin schddebug -l 1 -c "LC_PERFM"
badmin schdtime -l 2
Activate the LSF scheduling debug feature.
Log performance messages for mbschd running on the local host and set the log message level to LOG_DEBUG1. Set the timing level for mbschd to include two levels of timing information.
lsadmin resdebug -o hostA
Turn off temporary debug settings for RES on hostA and reset them to the daemon starting state. The message log level is reset back to the value of LSF_LOG_MASK and classes are reset to the value of LSF_DEBUG_RES, LSF_DEBUG_LIM, LSB_DEBUG_MBD, LSB_DEBUG_SBD, and LSB_DEBUG_SCH. The log file is reset to the LSF system log file in the directory specified by LSF_LOGDIR in the format daemon_name.log.host_name.
For timing level examples, see Setting Daemon Timing Levels.
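The log file naming used in the debug examples can be sketched as a small shell helper. This is only an illustration inferred from the file names shown in those examples; lsf_debug_logname is a hypothetical function, not an LSF command, and the daemon labels passed to it are assumed to match the ones that appear in the example file names (lim, sbatchd, and so on).

```shell
# Hypothetical helper (not part of LSF): predict the log file name used by
# a temporary debug setting, following the names shown in the examples.
#   $1 = daemon name as it appears in the file name (for example lim, sbatchd)
#   $2 = host name
#   $3 = optional logfile_name given with -f
lsf_debug_logname() {
    daemon=$1
    host=$2
    prefix=${3:-}
    if [ -n "$prefix" ]; then
        # With -f, only the last path component becomes part of the file name.
        printf '%s.%s.log.%s\n' "${prefix##*/}" "$daemon" "$host"
    else
        # Without -f, the default format is daemon_name.log.host_name.
        printf '%s.log.%s\n' "$daemon" "$host"
    fi
}

lsf_debug_logname lim hostA myfile                      # myfile.lim.log.hostA
lsf_debug_logname sbatchd hostA hostB/myfolder/myfile   # myfile.sbatchd.log.hostA
lsf_debug_logname res hostA                             # res.log.hostA
```

The helper only predicts the file name; the directory still comes from LSF_LOGDIR or from the path given with -f.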
Setting Daemon Timing Levels
The timing log level for LSF daemons is set in lsf.conf with the parameters LSB_TIME_CMD, LSB_TIME_MBD, LSB_TIME_SBD, LSB_TIME_SCH, LSF_TIME_LIM, and LSF_TIME_RES.
The location of log files is specified with the parameter LSF_LOGDIR in lsf.conf. Timing is included in the same log files as messages.
To change the timing log level, you need to stop any running daemons, change lsf.conf, and then restart the daemons.
It is useful to track timing to evaluate the performance of the LSF system. You can use the lsadmin and badmin commands to temporarily change the timing log level for specific daemons such as LIM, RES, mbatchd, sbatchd, and mbschd without changing lsf.conf.
LSF_TIME_RES is not supported on Windows.
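Before changing a timing level temporarily, it can help to see which of these parameters are already set persistently. The following is a minimal sketch, assuming lsf.conf uses plain NAME=value lines; show_timing_params is a hypothetical helper, and the sample file contents and values are made up for illustration.

```shell
# Hypothetical helper: list the timing parameters set in an lsf.conf file.
#   $1 = path to lsf.conf (supplied by the caller)
# Lines that are commented out are not matched.
show_timing_params() {
    grep -E '^[[:space:]]*(LSB_TIME_(CMD|MBD|SBD|SCH)|LSF_TIME_(LIM|RES))=' "$1"
}

# Demonstration against a throwaway sample file (illustrative values only).
sample_conf=$(mktemp)
cat > "$sample_conf" <<'EOF'
LSF_LOGDIR=/path/to/log
LSB_TIME_MBD=1
# LSB_TIME_SBD=1
LSF_TIME_LIM=2
EOF
show_timing_params "$sample_conf"
rm -f "$sample_conf"
```

On a real cluster the argument would be the path to the site's own lsf.conf.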
How the timing log level takes effect
The timing log level you set stays in effect only until you turn it off or the daemon stops running, whichever is sooner. If the daemon is restarted, its timing log level is reset back to the value of the corresponding parameter for the daemon (LSB_TIME_MBD, LSB_TIME_SBD, LSF_TIME_LIM, LSF_TIME_RES). Timing log messages are stored in the same file as other log messages in the directory specified with the parameter LSF_LOGDIR in lsf.conf.
Limitations
When the debug or timing level is set for RES with lsadmin resdebug or lsadmin restime, the setting affects only the root RES. The root RES is the RES that runs under the root user ID.
An application RES always uses lsf.conf to set the debug environment. An application RES is a RES that has been created by sbatchd to service jobs and that runs under the ID of the user who submitted the job.
This means that any RES that has been launched automatically by the LSF system is not affected by temporary debug or timing settings. The application RES retains the settings specified in lsf.conf.
Timing level commands for daemons
The total execution time of a function in the LSF system is recorded to evaluate response time of jobs submitted locally or remotely.
The following commands set temporary timing options for LIM, RES, mbatchd, sbatchd, and mbschd.
lsadmin limtime [-l timing_level] [-f logfile_name] [-o] [host_name]
lsadmin restime [-l timing_level] [-f logfile_name] [-o] [host_name]
badmin mbdtime [-l timing_level] [-f logfile_name] [-o]
badmin sbdtime [-l timing_level] [-f logfile_name] [-o] [host_name]
badmin schdtime [-l timing_level] [-f logfile_name] [-o]
For debug level examples, see Setting Daemon Message Log to Debug Level.
For a detailed description of lsadmin and badmin, see the Platform LSF Command Reference.
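The synopsis above can be turned into a small dry-run helper that prints, but does not execute, the command for a given daemon. This is a sketch only: timing_cmd is a hypothetical function, the short daemon labels (lim, res, mbd, sbd, schd) are an assumption taken from the subcommand names, and only the -l option and a single optional host_name from the synopsis are covered.

```shell
# Hypothetical helper (not part of LSF): print the command that would set a
# temporary timing level, following the synopsis above. Nothing is executed.
#   $1 = daemon label: lim | res | mbd | sbd | schd
#   $2 = timing_level
#   $3 = optional host_name (the synopsis allows it only for lim, res, sbd)
timing_cmd() {
    daemon=$1
    level=$2
    host=${3:-}
    case $daemon in
        lim|res|sbd) ;;
        mbd|schd)
            if [ -n "$host" ]; then
                echo "the synopsis for ${daemon}time takes no host_name" >&2
                return 1
            fi ;;
        *)
            echo "unknown daemon: $daemon" >&2
            return 1 ;;
    esac
    # LIM and RES are controlled with lsadmin, the batch daemons with badmin.
    case $daemon in
        lim|res) tool=lsadmin ;;
        *)       tool=badmin ;;
    esac
    cmd="$tool ${daemon}time -l $level"
    if [ -n "$host" ]; then
        cmd="$cmd $host"
    fi
    echo "$cmd"
}

timing_cmd schd 2        # badmin schdtime -l 2
timing_cmd lim 1 hostA   # lsadmin limtime -l 1 hostA
```

Printing the command first makes it easy to review the target daemon and host before running it on a live cluster.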
Platform Computing Inc.
www.platform.com