External load indices report the values of dynamic external resources. A dynamic external resource is a customer-defined resource with a numeric value that changes over time, such as the space available in a directory. Use the external load indices feature to make the values of dynamic external resources available to LSF, or to override the values reported for an LSF built-in load index.
LSF bases job scheduling and host selection decisions on the resources available within your cluster. A resource is a characteristic of a host (such as available memory) or of a cluster (such as the number of shared software licenses).
A static resource has a value that does not change, such as a host’s maximum swap space. A dynamic resource has a numeric value that changes over time, such as a host’s currently available swap space. Load indices supply the values of dynamic resources to a host’s load information manager (LIM), which periodically collects those values.
LSF has a number of built-in load indices that measure the values of dynamic, host-based resources (resources that exist on a single host)—for example, CPU, memory, disk space, and I/O. You can also define shared resources (resources that hosts in your cluster share, such as floating software licenses) and make these values available to LSF to use for job scheduling decisions.
If you have specific workload or resource requirements at your site, the LSF administrator can define external resources. You can use both built-in and external resources for LSF job scheduling and host selection.
The following illustrations show the benefits of using the external load indices feature. In these examples, jobs require the use of floating software licenses.
Define the dynamic external resources in lsf.shared. By default, these resources are host-based (local to each host) until the LSF administrator configures a resource-to-host-mapping in the ResourceMap section of lsf.cluster.cluster_name. The presence of the dynamic external resource in lsf.shared and lsf.cluster.cluster_name triggers LSF to start the elim executables.
Create one or more elim executables in the directory specified by the parameter LSF_SERVERDIR. LSF does not include a default elim; you should write your own executable to meet the requirements of your site. The section Create an elim executable provides guidelines for writing an elim.
Once external resources are defined in lsf.shared, they must be mapped to hosts in the ResourceMap section of lsf.cluster.cluster_name.
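As a rough illustration of the two configuration steps, the fragments below sketch how a dynamic external resource might be declared and mapped. The resource names (licenses, mytmp), the 30-second interval, and the descriptions are assumptions chosen for this example, not values taken from your cluster.

```text
# lsf.shared (sketch): declare two numeric, dynamic external resources
Begin Resource
RESOURCENAME  TYPE     INTERVAL  INCREASING  DESCRIPTION
licenses      Numeric  30        N           (Floating software licenses)
mytmp         Numeric  30        N           (Space available in /usr/tmp)
End Resource

# lsf.cluster.cluster_name (sketch): map each resource to hosts
Begin ResourceMap
RESOURCENAME  LOCATION
licenses      [all]
mytmp         [default]
End ResourceMap
```

In this sketch, [all] describes a single value shared by every host, while [default] describes a separate value on each host. Keep in mind that mapping any resource as default causes every elim in LSF_SERVERDIR to start on every host.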
You can write one or more elim executables. The load index names defined in your elim executables must be the same as the external resource names defined in the lsf.shared configuration file.
3 tmp2 47.5 nio 344.0 licenses 5
The load update string must report values between -INFINIT_LOAD and INFINIT_LOAD as defined in the lsf.h header file.
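The following shell function is a minimal sketch of how an elim might assemble a load update string from name and value pairs so that the leading count always matches the number of indices reported. The function name load_update_string is hypothetical and not part of LSF.

```shell
#!/bin/sh
# Sketch only: load_update_string is a hypothetical helper, not an LSF command.
# Builds "number_indices index1_name index1_value ..." from name/value pairs.
load_update_string() {
    # Arguments must come in pairs, so an odd count is a caller error.
    if [ $(( $# % 2 )) -ne 0 ]; then
        echo "load_update_string: expected name/value pairs" >&2
        return 1
    fi
    # The index count is half the number of arguments.
    printf '%s' "$(( $# / 2 ))"
    while [ $# -gt 0 ]; do
        printf ' %s %s' "$1" "$2"
        shift 2
    done
    printf '\n'
}

# Prints: 3 tmp2 47.5 nio 344.0 licenses 5
load_update_string tmp2 47.5 nio 344.0 licenses 5
```

Deriving the count from the arguments avoids the common mistake of adding an index to the string without updating the leading number.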
If the elim executable is implemented as a C program, use setbuf(3) during initialization to send unbuffered output to stdout.
Each LIM sends updated load information to the master LIM every 15 seconds; the elim executable should write the load update string at most once every 15 seconds. If the external load index values rarely change, program the elim to report the new values only when a change is detected.
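One way to report only on change is to remember the last value written and compare each new measurement against it. The sketch below assumes a resource named licenses and uses a fixed list of sample values in place of a real measurement; both are illustrative assumptions.

```shell
#!/bin/sh
# Sketch only: "licenses" is an illustrative resource name and the sample
# values below stand in for a real measurement.

last_value=""

# Print a load update string only when the value differs from the last one reported.
report_if_changed() {
    if [ "$1" != "$last_value" ]; then
        echo "1 licenses $1"
        last_value=$1
    fi
}

# Demonstration with fixed samples; a real elim would measure the value in an
# endless loop and sleep at least 15 seconds between measurements.
for sample in 5 5 4 4 4 7; do
    report_if_changed "$sample"
done
```

With these samples the script writes three lines (for the values 5, 4, and 7) and stays silent for the repeated values.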
If you map any external resource as default in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. If LSF_SERVERDIR contains more than one elim executable, you should include a header that checks whether the elim is programmed to report values for the resources expected on the host. For detailed information about using a checking header, see the section How environment variables determine elim hosts.
An elim executable can be used to override the value of a built-in load index. For example, if your site stores temporary files in the /usr/tmp directory, you might want to monitor the amount of space available in that directory. An elim can report the space available in the /usr/tmp directory as the value for the tmp built-in load index. However, the value reported by an elim must be less than the maximum size of /usr/tmp.
Write an elim executable that periodically measures the value of the dynamic external resource and writes the numeric value to standard output. The external load index must correspond to a numeric, dynamic external resource as defined by TYPE and INTERVAL in lsf.shared.
Configure an external resource in lsf.shared and map the resource in lsf.cluster.cluster_name, even though you are overriding a built-in load index. Use a name other than the built-in load index, for example, mytmp rather than tmp.
Program your elim to output the formal name of the built-in index (for example, r1m, it, ls, or swp), not the resource name alias (cpu, idle, login, or swap). For example, an elim that collects the value of the external resource mytmp reports the value as tmp (the built-in load index) in the load update string: 1 tmp 20.
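The steps above can be sketched in shell as follows. The helper names free_mb and tmp_update are hypothetical, and the sketch assumes that the tmp index is expressed in megabytes and that df -P prints the available space in its fourth column, as the POSIX output format specifies.

```shell
#!/bin/sh
# Sketch only: free_mb and tmp_update are hypothetical helper names.

# Print the space available in the file system holding directory $1, in MB.
# Fails (non-zero status, no output) if df cannot report on the directory.
free_mb() {
    df -Pk "$1" 2>/dev/null |
        awk 'NR == 2 { print int($4 / 1024); found = 1 } END { exit !found }'
}

# Print a load update string that reports the value under the built-in name tmp.
tmp_update() {
    mb=$(free_mb "$1") || return 1
    echo "1 tmp $mb"
}

# Emit one update for /usr/tmp if that directory exists on this host.
if [ -d /usr/tmp ]; then
    tmp_update /usr/tmp
fi
```

A production elim would call tmp_update inside its reporting loop rather than once. Note that the string names tmp, not mytmp, even though mytmp is the resource configured in lsf.shared.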
The file elim.jsdl is automatically configured to collect these resources. To enable the use of elim.jsdl, uncomment the lines for these resources in the ResourceMap section of the file lsf.cluster.cluster_name.
See the section How environment variables determine elim hosts for an example of a simple elim script.
You can find additional elim examples in the LSF_MISC/examples directory. The elim.c file is an elim written in C. You can modify this example to collect the external load indices required at your site.
If the ResourceMap section contains even one resource mapped as default, and if there are multiple elim executables in LSF_SERVERDIR, the MELIM starts all of the elim executables in LSF_SERVERDIR on all hosts in the cluster. Not all of the elim executables continue to run, however. Those that use a checking header could exit with ELIM_ABORT_VALUE if they are not programmed to report values for the resources listed in LSF_RESOURCES.
Restarts an elim if the elim exits. To prevent system-wide problems in case of a fatal error in the elim, the maximum restart frequency is once every 90 seconds. The MELIM does not restart any elim that exits with ELIM_ABORT_VALUE.
Collects the load information reported by the elim executables.
Checks the syntax of load update strings before sending the information to the LIM.
Merges the load reports from each elim and sends the merged load information to the LIM. If there is more than one value reported for a single resource, the MELIM reports the latest value.
Logs its activities and data into the log file LSF_LOGDIR/melim.log.host_name.
Increases system reliability by buffering output from multiple elim executables; failure of one elim does not affect other elim executables running on the same host.
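Because the MELIM checks the syntax of each load update string, it can help to check an elim's output yourself before deploying it. The function below is a sketch of such a check under simple assumptions (an integer count followed by that many name and value pairs, with plain decimal values). It is not the MELIM's actual validation logic, and valid_load_update is a hypothetical name.

```shell
#!/bin/sh
# Sketch only: valid_load_update is a hypothetical helper and does not
# reproduce the checks that the MELIM itself performs.

# Return 0 if $1 looks like "N name1 value1 ... nameN valueN" with numeric values.
valid_load_update() {
    echo "$1" | awk '
        {
            # The first field must be an integer count that matches the number of pairs.
            if ($1 !~ /^[0-9]+$/ || NF != 2 * $1 + 1) exit 1
            # Every value (fields 3, 5, 7, ...) must be a plain decimal number.
            for (i = 3; i <= NF; i += 2)
                if ($i !~ /^-?[0-9]+(\.[0-9]+)?$/) exit 1
        }'
}

if valid_load_update "3 tmp2 47.5 nio 344.0 licenses 5"; then
    echo "load update string looks well formed"
fi
```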
If you use the default keyword for any external resource in lsf.cluster.cluster_name, all elim executables in LSF_SERVERDIR run on all hosts in the cluster. You can control the hosts on which your elim executables run by using the environment variables LSF_MASTER, LSF_RESOURCES, and ELIM_ABORT_VALUE. These environment variables provide a way to ensure that elim executables run only when they are programmed to report the values for resources expected on a host.
LSF_RESOURCES—When the LIM starts an MELIM on a host, the LIM checks the resource mapping defined in the ResourceMap section of lsf.cluster.cluster_name. Based on the mapping (default, all, or a host list), the LIM sets LSF_RESOURCES to the list of resources expected on the host. Use LSF_RESOURCES in a checking header to verify that an elim is programmed to collect values for at least one of the resources listed in LSF_RESOURCES.
ELIM_ABORT_VALUE—An elim should exit with ELIM_ABORT_VALUE if the elim is not programmed to collect values for at least one of the resources listed in LSF_RESOURCES. The MELIM does not restart an elim that exits with ELIM_ABORT_VALUE.
#!/bin/sh
# list the resources that the elim can report to lim
my_resource="myrsc"

# do the check when $LSF_RESOURCES is defined by lim
if [ -n "$LSF_RESOURCES" ]; then
    # check if the resources elim can report are listed in $LSF_RESOURCES
    res_ok=`echo " $LSF_RESOURCES " | /bin/grep " $my_resource "`
    # exit with $ELIM_ABORT_VALUE if the elim cannot report on at least
    # one resource listed in $LSF_RESOURCES
    if [ "$res_ok" = "" ]; then
        exit $ELIM_ABORT_VALUE
    fi
fi

while [ 1 ]; do
    # set the value for resource "myrsc"
    val="1"
    # create an output string in the format:
    # number_indices index1_name index1_value...
    reportStr="1 $my_resource $val"
    echo "$reportStr"
    # wait for 30 seconds before reporting again
    sleep 30
done
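The checking header above tests a single resource. For an elim that reports several resources, the check might be generalized as in the following sketch. The resource names, the function name reports_any, and the fallback exit status of 1 (used only when ELIM_ABORT_VALUE is not set, for example when running the script by hand) are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: the resource names and reports_any are illustrative.
my_resources="myrsc licenses"

# Return 0 if at least one of our resources appears in the space-separated
# list passed as $1 (normally "$LSF_RESOURCES").
reports_any() {
    for res in $my_resources; do
        case " $1 " in
            *" $res "*) return 0 ;;
        esac
    done
    return 1
}

# Checking header: apply the check only when LSF_RESOURCES has been set.
# The fallback status 1 is an assumption for runs outside LSF.
if [ -n "${LSF_RESOURCES:-}" ] && ! reports_any "$LSF_RESOURCES"; then
    exit "${ELIM_ABORT_VALUE:-1}"
fi
```

Padding both the list and the resource name with spaces, as the original header does, prevents a partial match such as myrsc matching myrsc2.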