External load indices

External load indices report the values of dynamic external resources. A dynamic external resource is a customer-defined resource with a numeric value that changes over time, such as the space available in a directory. Use the external load indices feature to make the values of dynamic external resources available to LSF, or to override the values reported for an LSF built-in load index.

About external load indices

LSF bases job scheduling and host selection decisions on the resources available within your cluster. A resource is a characteristic of a host (such as available memory) or a cluster (such as the number of shared software licenses) that LSF uses to make job scheduling and host selection decisions.

A static resource has a value that does not change, such as a host’s maximum swap space. A dynamic resource has a numeric value that changes over time, such as a host’s currently available swap space. Load indices supply the values of dynamic resources to a host’s load information manager (LIM), which periodically collects those values.

LSF has a number of built-in load indices that measure the values of dynamic, host-based resources (resources that exist on a single host)—for example, CPU, memory, disk space, and I/O. You can also define shared resources (resources that hosts in your cluster share, such as floating software licenses) and make these values available to LSF to use for job scheduling decisions.

If you have specific workload or resource requirements at your site, the LSF administrator can define external resources. You can use both built-in and external resources for LSF job scheduling and host selection.

To supply the LIM with the values of dynamic external resources, either host-based or shared, the LSF administrator writes a site-specific executable called an external load information manager (elim) executable. The LSF administrator programs the elim to define external load indices, populate those indices with the values of dynamic external resources, and return the indices and their values to stdout. An elim can be as simple as a small script, or as complicated as a sophisticated C program.
Note:

LSF does not include a default elim; you should write your own executable to meet the requirements of your site.

The following illustrations show the benefits of using the external load indices feature. In these examples, jobs require the use of floating software licenses.

[Figure: Default behavior (feature not enabled)]

[Figure: With external load indices enabled]

Scope


Applicability

Details

Operating system

  • UNIX

  • Windows

  • A mix of UNIX and Windows hosts

Dependencies

  • UNIX and Windows user accounts must be valid on all hosts in the cluster and must have the correct permissions to successfully run jobs.

  • All elim executables run under the same user account as the load information manager (LIM)—by default, the LSF administrator (lsfadmin) account.

  • External dynamic resources (host-based or shared) must be defined in lsf.shared.


Configuration to enable external load indices

To enable the use of external load indices, you must:
  • Define the dynamic external resources in lsf.shared. By default, these resources are host-based (local to each host) until the LSF administrator configures a resource-to-host mapping in the ResourceMap section of lsf.cluster.cluster_name. The presence of the dynamic external resource in lsf.shared and lsf.cluster.cluster_name triggers LSF to start the elim executables.

  • Map the external resources to hosts in your cluster in lsf.cluster.cluster_name.
    Important:

    You must run the command lsadmin reconfig followed by badmin mbdrestart to apply changes.

  • Create one or more elim executables in the directory specified by the parameter LSF_SERVERDIR. LSF does not include a default elim; you should write your own executable to meet the requirements of your site. The section Create an elim executable provides guidelines for writing an elim.

Define a dynamic external resource

To define a dynamic external resource for which elim collects an external load index value, define the following parameters in the Resource section of lsf.shared:

Configuration file

Parameter and syntax

Description

lsf.shared

RESOURCENAME

resource_name

  • Specifies the name of the external resource.

TYPE

Numeric

  • Specifies the type of external resource: Numeric resources have numeric values.

  • Specify Numeric for all dynamic resources.

INTERVAL

seconds

  • Specifies the interval for data collection by an elim.

  • For numeric resources, defining an interval identifies the resource as a dynamic resource with a corresponding external load index.
    Important:

    You must specify an interval: LSF treats a numeric resource with no interval as a static resource and, therefore, does not collect load index values for that resource.

INCREASING

Y | N

  • Specifies whether a larger value indicates a greater load.
    • Y—a larger value indicates a greater load. For example, if you define an external load index for the number of shared software licenses in use, the larger the value, the heavier the load.

    • N—a larger value indicates a lighter load. For example, if you define an external load index for the number of shared software licenses currently available, the larger the value, the lighter the load, and the more licenses are available.

RELEASE

Y | N

  • For shared resources only, specifies whether LSF releases the resource when a job that uses the resource is suspended.
    • Y—Releases the resource when a job is suspended.

    • N—Holds the resource when a job is suspended.

DESCRIPTION

description

  • Brief description of the resource. Enter a description that enables you to easily identify the type and purpose of the resource.

  • The lsinfo command and the ls_info() API call return the contents of the DESCRIPTION parameter.
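
As an illustration of the parameters above, a Resource section might look like the following sketch. The resource names usr_tmp and licenses, the interval values, and the descriptions are hypothetical; the column layout and the use of () for an unspecified field are assumptions based on the usual lsf.shared format, so check them against the lsf.shared reference for your LSF version.

```text
Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  RELEASE  DESCRIPTION
usr_tmp       Numeric  60        N           ()       (Space available in /usr/tmp in MB)
licenses      Numeric  30        N           Y        (Floating licenses currently available)
End Resource
```

Both resources have an INTERVAL, so LSF treats them as dynamic. INCREASING is N for both because a larger value (more free space, more free licenses) indicates a lighter load.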


Map an external resource

Once external resources are defined in lsf.shared, they must be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name.


Configuration file

Parameter and syntax

Default behavior

lsf.cluster.cluster_name

RESOURCENAME

resource_name

  • Specifies the name of the external resource as defined in the Resource section of lsf.shared.

LOCATION
  • ([all]) | ([all ~host_name])

  • Maps the resource to the master host only; all hosts share a single instance of the dynamic external resource.

  • To prevent specific hosts from accessing the resource, use the not operator (~) and specify one or more host names. All other hosts can access the resource.

  • [default]

  • Maps the resource to all hosts in the cluster; every host has an instance of the dynamic external resource.

  • If you use the default keyword for any external resource, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. For information about how to control which elim executables run on each host, see the section How LSF determines which hosts should run an elim executable.

  • ([host_name]) | ([host_name] [host_name])

  • Maps the resource to one or more specific hosts.

  • To specify sets of hosts that share a dynamic external resource, enclose each set in square brackets ([ ]) and use a space to separate each host name.
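
The three LOCATION forms can be sketched together in one ResourceMap section. The resource and host names below are hypothetical, and each resource is assumed to be defined in the Resource section of lsf.shared.

```text
Begin ResourceMap
RESOURCENAME  LOCATION
licenses      ([all])
usr_tmp       [default]
scratch       ([hostA hostB] [hostC hostD])
End ResourceMap
```

In this sketch, licenses is a single instance shared by all hosts, usr_tmp has one instance per host, and scratch has two instances, each shared by one set of hosts.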


Create an elim executable

You can write one or more elim executables. The load index names defined in your elim executables must be the same as the external resource names defined in the lsf.shared configuration file.

All elim executables must:
  • Be located in LSF_SERVERDIR and follow these naming conventions:

    Operating system    Naming convention

    UNIX                LSF_SERVERDIR/elim.application

    Windows             LSF_SERVERDIR\elim.application.exe
                        or
                        LSF_SERVERDIR\elim.application.bat


    Restriction:

    The name elim.user is reserved for backward compatibility. Do not use the name elim.user for your application-specific elim.

    Note:

    LSF invokes any elim that follows this naming convention. Move backup copies out of LSF_SERVERDIR or choose a name that does not follow the convention. For example, use elim_backup instead of elim.backup.

  • Exit upon receipt of a SIGTERM signal from the load information manager (LIM).

  • Periodically output a load update string to stdout in the format number_indices index_name index_value [index_name index_value …] where

    Value

    Defines

    number_indices

    • The number of external load indices collected by the elim.

    index_name

    • The name of the external load index.

    index_value

    • The external load index value returned by your elim.


For example, the string

3 tmp2 47.5 nio 344.0 licenses 5

reports three indices: tmp2, nio, and licenses, with values 47.5, 344.0, and 5, respectively.
    • The load update string must report values between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.

    • The elim should ensure that the entire load update string is written successfully to stdout. Program the elim to exit if it fails to write the load update string to stdout.
      • If the elim executable is a C program, check the return value of printf(3s).

      • If the elim executable is a shell script, check the return code of /bin/echo(1).

    • If the elim executable is implemented as a C program, use setbuf(3) during initialization to send unbuffered output to stdout.

    • Each LIM sends updated load information to the master LIM every 15 seconds; the elim executable should write the load update string at most once every 15 seconds. If the external load index values rarely change, program the elim to report the new values only when a change is detected.
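
The requirements above can be sketched as a small shell skeleton. This is a minimal illustration, not a supplied elim: the function names format_update, collect_value, and run_elim are hypothetical, and collect_value returns a fixed placeholder value that you would replace with a real measurement.

```shell
#!/bin/sh
# Sketch of an elim skeleton. All function names here are illustrative.

# Build "number_indices name value [name value ...]" from name/value pairs.
format_update() {
    count=$(( $# / 2 ))
    line="$count"
    while [ "$#" -ge 2 ]; do
        line="$line $1 $2"
        shift 2
    done
    # The function's exit status is the status of this write to stdout.
    printf '%s\n' "$line"
}

# Hypothetical collector: replace with a real measurement for your site.
collect_value() {
    echo 5
}

# Main loop: report once per interval and exit if the write to stdout fails.
run_elim() {
    interval=${1:-15}
    trap 'exit 0' TERM              # exit when the LIM sends SIGTERM
    while :; do
        format_update licenses "$(collect_value)" || exit 1
        # Sleep in the background so that wait is interrupted by SIGTERM
        # and the trap runs promptly.
        sleep "$interval" &
        wait $!
    done
}
```

A real elim would end with a call such as run_elim 15, which never returns. The skeleton leaves the call out so that the helper functions can be sourced and checked on their own.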

If you map any external resource as default in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. If LSF_SERVERDIR contains more than one elim executable, you should include a header that checks whether the elim is programmed to report values for the resources expected on the host. For detailed information about using a checking header, see the section How environment variables determine elim hosts.

Overriding built-in load indices

An elim executable can be used to override the value of a built-in load index. For example, if your site stores temporary files in the /usr/tmp directory, you might want to monitor the amount of space available in that directory. An elim can report the space available in the /usr/tmp directory as the value for the tmp built-in load index. However, the value reported by an elim must be less than the maximum size of /usr/tmp.

To override a built-in load index value, you must:
  • Write an elim executable that periodically measures the value of the dynamic external resource and writes the numeric value to standard output. The external load index must correspond to a numeric, dynamic external resource as defined by TYPE and INTERVAL in lsf.shared.

  • Configure an external resource in lsf.shared and map the resource in lsf.cluster.cluster_name, even though you are overriding a built-in load index. Use a name other than the built-in load index, for example, mytmp rather than tmp.

  • Program your elim to output the formal name of the built-in index (for example, r1m, it, ls, or swp), not the resource name alias (cpu, idle, login, or swap). For example, an elim that collects the value of the external resource mytmp reports the value as tmp (the built-in load index) in the load update string: 1 tmp 20.
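
The /usr/tmp example can be sketched in shell as follows. The function names avail_mb and report_tmp are hypothetical, and the sketch assumes that the tmp index is reported in MB and that df supports the POSIX -P and -k options.

```shell
#!/bin/sh
# Print the space available in a directory, in whole MB.
# df -Pk prints one header line, then a line whose fourth field is the
# available space in 1024-byte blocks.
avail_mb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 }'
}

# Emit a load update string that overrides the built-in tmp index.
# The index name is tmp (the built-in index), not the external resource
# name (for example, mytmp) that is defined in lsf.shared.
report_tmp() {
    dir=${1:-/usr/tmp}
    mb=$(avail_mb "$dir")
    [ -n "$mb" ] || return 1        # df failed or produced no data line
    echo "1 tmp $mb"
}
```

An elim built on this sketch would call report_tmp inside its reporting loop and exit if the function fails.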

Setting up an ELIM to support JSDL

To support the use of Job Submission Description Language (JSDL) files at job submission, LSF collects the following load indices:

Attribute name                Attribute type    Resource name

OperatingSystemName           string            osname

OperatingSystemVersion        string            osver

CPUArchitectureName           string            cpuarch

IndividualCPUSpeed            int64             cpuspeed

IndividualNetworkBandwidth    int64             bandwidth (this is the maximum bandwidth)


The file elim.jsdl is automatically configured to collect these resources. To enable the use of elim.jsdl, uncomment the lines for these resources in the ResourceMap section of the file lsf.cluster.cluster_name.

Example of an elim executable

See the section How environment variables determine elim hosts for an example of a simple elim script.

You can find additional elim examples in the LSF_MISC/examples directory. The elim.c file is an elim written in C. You can modify this example to collect the external load indices required at your site.

External load indices behavior

How LSF manages multiple elim executables

The LSF administrator can write one elim executable to collect multiple external load indices, or the LSF administrator can divide external load index collection among multiple elim executables. On each host, the load information manager (LIM) starts a master elim (MELIM), which manages all elim executables on the host and reports the external load index values to the LIM. Specifically, the MELIM:
  • Starts elim executables on the host. The LIM checks the ResourceMap section LOCATION settings (default, all, or host list) and directs the MELIM to start elim executables on the corresponding hosts.
    Note:

    If the ResourceMap section contains even one resource mapped as default, and if there are multiple elim executables in LSF_SERVERDIR, the MELIM starts all of the elim executables in LSF_SERVERDIR on all hosts in the cluster. Not all of the elim executables continue to run, however. Those that use a checking header could exit with ELIM_ABORT_VALUE if they are not programmed to report values for the resources listed in LSF_RESOURCES.

  • Restarts an elim if the elim exits. To prevent system-wide problems in case of a fatal error in the elim, the maximum restart frequency is once every 90 seconds. The MELIM does not restart any elim that exits with ELIM_ABORT_VALUE.

  • Collects the load information reported by the elim executables.

  • Checks the syntax of load update strings before sending the information to the LIM.

  • Merges the load reports from each elim and sends the merged load information to the LIM. If there is more than one value reported for a single resource, the MELIM reports the latest value.

  • Logs its activities and data into the log file LSF_LOGDIR/melim.log.host_name.

  • Increases system reliability by buffering output from multiple elim executables; failure of one elim does not affect other elim executables running on the same host.
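
The syntax check and merge steps can be modeled roughly in shell. This is only a sketch of the behavior described above, not the MELIM implementation: the function name merge_updates is hypothetical, and the syntax check is assumed to be no more than a comparison of number_indices with the number of name/value pairs.

```shell
#!/bin/sh
# Read load update strings on stdin, one per line in arrival order, and
# print one merged load update string. A later value for the same index
# replaces an earlier one.
merge_updates() {
    awk '
        # Skip lines whose count does not match the number of pairs.
        $1 != (NF - 1) / 2 { next }
        {
            for (i = 2; i + 1 <= NF; i += 2) {
                if (!($i in value)) order[++n] = $i
                value[$i] = $(i + 1)
            }
        }
        END {
            out = n + 0
            for (k = 1; k <= n; k++) out = out " " order[k] " " value[order[k]]
            print out
        }
    '
}
```

For example, feeding the three lines "2 tmp2 47.5 nio 344.0", "1 licenses 5", and "1 nio 300.0" to merge_updates produces a single string in which nio has the later value, 300.0.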

How LSF determines which hosts should run an elim executable

LSF provides configuration options to ensure that your elim executables run only when they can report the resource values expected on a host. This maximizes system performance and simplifies the implementation of external load indices. To control which hosts run elim executables:
  • Map external resource names to locations in lsf.cluster.cluster_name. This step is required.

  • Optionally, use the environment variables LSF_RESOURCES, LSF_MASTER, and ELIM_ABORT_VALUE in your elim executables.

How resource mapping determines elim hosts

The following table shows how the resource mapping defined in lsf.cluster.cluster_name determines the hosts on which your elim executables start.

If the specified LOCATION is …

Then the elim executables start on …

  • ([all]) | ([all ~host_name])

  • The master host, because all hosts in the cluster (except those identified by the not operator [~]) share a single instance of the external resource.

  • [default]

  • Every host in the cluster, because the default setting identifies the external resource as host-based.

  • If you use the default keyword for any external resource, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. For information about how to program an elim to exit when it cannot collect information about resources on a host, see How environment variables determine elim hosts.

  • ([host_name]) | ([host_name] [host_name])

  • On the specified hosts.

  • If you specify a set of hosts, the elim executables start on the first host in the list. For example, if the LOCATION in the ResourceMap section of lsf.cluster.cluster_name is ([hostA hostB hostC] [hostD hostE hostF]):
    • LSF starts the elim executables on hostA and hostD to report values for the resources shared by that set of hosts.

    • If the host reporting the external load index values becomes unavailable, LSF starts the elim executables on the next available host in the list. In this example, if hostA becomes unavailable, LSF starts the elim executables on hostB.

    • If hostA becomes available again, LSF starts the elim executables on hostA and shuts down the elim executables on hostB.
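
The host selection rule for a set of hosts can be sketched as a small shell function. This only models the rule described above (first available host in list order); the function name pick_elim_host is hypothetical, and LSF performs this selection internally.

```shell
#!/bin/sh
# Print the first host in a set that is not in the unavailable list.
# $1: space-separated host set, in the order given in LOCATION
# $2: space-separated list of unavailable hosts (may be empty)
pick_elim_host() {
    for host in $1; do
        case " $2 " in
            *" $host "*) ;;                 # unavailable, try the next host
            *) echo "$host"; return 0 ;;
        esac
    done
    return 1                                # no host in the set is available
}
```

With the set "hostA hostB hostC", the function prints hostA when no host is unavailable and hostB when hostA is unavailable. When hostA returns, the same call prints hostA again, which matches the described behavior of moving the elim executables back.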


How environment variables determine elim hosts

If you use the default keyword for any external resource in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. You can control the hosts on which your elim executables run by using the environment variables LSF_MASTER, LSF_RESOURCES, and ELIM_ABORT_VALUE. These environment variables provide a way to ensure that elim executables run only when they are programmed to report the values for resources expected on a host.

  • LSF_MASTER—You can program your elim to check the value of the LSF_MASTER environment variable. The value is Y on the master host and N on all other hosts. An elim executable can use this parameter to check the host on which the elim is currently running.

  • LSF_RESOURCES—When the LIM starts an MELIM on a host, the LIM checks the resource mapping defined in the ResourceMap section of lsf.cluster.cluster_name. Based on the mapping (default, all, or a host list), the LIM sets LSF_RESOURCES to the list of resources expected on the host. Use LSF_RESOURCES in a checking header to verify that an elim is programmed to collect values for at least one of the resources listed in LSF_RESOURCES.

  • ELIM_ABORT_VALUE—An elim should exit with ELIM_ABORT_VALUE if the elim is not programmed to collect values for at least one of the resources listed in LSF_RESOURCES. The MELIM does not restart an elim that exits with ELIM_ABORT_VALUE.

The following sample code shows how to use a header to verify that an elim is programmed to collect load indices for the resources expected on the host. If the elim is not programmed to report on the requested resources, the elim does not need to run on the host.
#!/bin/sh
# List the resources that this elim can report to the LIM.
my_resource="myrsc"

# Do the check only when the LIM has defined $LSF_RESOURCES.
if [ -n "$LSF_RESOURCES" ]; then
    # Check whether the resource this elim reports is listed in $LSF_RESOURCES.
    res_ok=`echo " $LSF_RESOURCES " | /bin/grep " $my_resource "`
    # Exit with $ELIM_ABORT_VALUE if the elim cannot report on at least
    # one resource listed in $LSF_RESOURCES. grep prints nothing when
    # there is no match, so test for an empty string.
    if [ -z "$res_ok" ]; then
        exit $ELIM_ABORT_VALUE
    fi
fi

while :; do
    # Set the value for resource "myrsc".
    val="1"
    # Create an output string in the format:
    # number_indices index1_name index1_value...
    reportStr="1 $my_resource $val"
    echo "$reportStr"
    # Wait for 30 seconds before reporting again.
    sleep 30
done
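
For an elim that reports more than one resource, or that should run only on the master host, the same checks can be written as reusable functions. This is a sketch under the assumption, taken from the sample above, that LSF_RESOURCES is a space-separated list; the function names elim_should_run and on_master_host are hypothetical.

```shell
#!/bin/sh
# Return 0 if at least one of the given resource names appears in
# LSF_RESOURCES, or if LSF_RESOURCES is not set (no check is possible).
elim_should_run() {
    [ -n "${LSF_RESOURCES:-}" ] || return 0
    for res in "$@"; do
        case " $LSF_RESOURCES " in
            *" $res "*) return 0 ;;
        esac
    done
    return 1
}

# Return 0 only when the LIM has set LSF_MASTER to Y on this host.
on_master_host() {
    [ "${LSF_MASTER:-N}" = "Y" ]
}
```

A checking header built on these functions could read: elim_should_run myrsc1 myrsc2 || exit "$ELIM_ABORT_VALUE". An elim for a resource shared by all hosts could add a similar line that uses on_master_host.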

Configuration to modify external load indices


Configuration file

Parameter and syntax

Behavior

lsf.cluster.cluster_name

Parameters section

ELIMARGS=cmd_line_args

  • Specifies the command-line arguments required by an elim on startup.

ELIM_POLL_INTERVAL=seconds

  • Specifies the frequency with which the LIM samples external load index information from the MELIM.

LSF_ELIM_BLOCKTIME=seconds

  • UNIX only. Specifies how long the MELIM waits before restarting an elim that fails to send a complete load update string.

  • The MELIM does not restart an elim that exits with ELIM_ABORT_VALUE.

LSF_ELIM_DEBUG=y

  • UNIX only. Used for debugging; logs all load information received from elim executables to the MELIM log file (melim.log.host_name).

LSF_ELIM_RESTARTS=integer

  • UNIX only. Limits the number of times an elim can be restarted.

  • You must also define either LSF_ELIM_DEBUG or LSF_ELIM_BLOCKTIME.

  • Defining this parameter prevents an ongoing restart loop in the case of a faulty elim.
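
As a sketch, these parameters might appear together in the Parameters section as follows. All values are illustrative only, and the ELIMARGS value is a hypothetical argument string for a site-specific elim; check valid ranges and defaults in the lsf.cluster.cluster_name reference for your LSF version.

```text
Begin Parameters
ELIMARGS=-d /usr/tmp
ELIM_POLL_INTERVAL=2
LSF_ELIM_BLOCKTIME=4
LSF_ELIM_RESTARTS=3
End Parameters
```

LSF_ELIM_RESTARTS is paired with LSF_ELIM_BLOCKTIME here because the restart limit requires either LSF_ELIM_DEBUG or LSF_ELIM_BLOCKTIME to be defined.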


External load indices commands

Commands to submit workload


Command

Description

bsub -R "res_req" [-R "res_req"] …

  • Runs the job on a host that meets the specified resource requirements.

  • If you specify a value for a dynamic external resource in the resource requirements string, LSF uses the most recent values provided by your elim executables for host selection.

  • For example:
    • Define a dynamic external resource called "usr_tmp" that represents the space available in the /usr/tmp directory.

    • Write an elim executable to report the value of usr_tmp to LSF.

    • To run the job on hosts that have more than 15 MB available in the /usr/tmp directory, run the command bsub -R "usr_tmp > 15" myjob

    • LSF uses the external load index value for usr_tmp to locate a host with more than 15 MB available in the /usr/tmp directory.


Commands to monitor


Command

Description

lsload

  • Displays load information for all hosts in the cluster on a per host basis.

lsload -R "res_req"

  • Displays load information for specific resources.


Commands to control


Command

Description

lsadmin reconfig followed by

badmin mbdrestart

  • Applies changes when you modify lsf.shared or lsf.cluster.cluster_name.


Commands to display configuration


Command

Description

lsinfo

  • Displays configuration information for all resources, including the external resources defined in lsf.shared.

lsinfo -l

  • Displays detailed configuration information for external resources.

lsinfo resource_name

  • Displays configuration information for the specified resources.

bhosts -s

  • Displays information about numeric shared resources, including which hosts share each resource.

bhosts -s shared_resource_name

  • Displays configuration information for the specified resources.