- Getting Load Information on Selected Load Indices
- Writing a Parallel Application
- Discovering Why a Job Is Suspended
- Reading lsf.conf Parameters
- Signal Handling in Windows
Getting Load Information on Selected Load Indices
Getting Dynamic Load Information shows how to get load information from the LIM. Depending on the size of your LSF cluster and how often the ls_load() function is called, returning load information for all the hosts can produce unnecessary overhead.
LSLIB provides the ls_loadinfo() call, which allows an application to specify a selected number of load indices and get only those load indices that are of interest to the application.
Getting a list of all load index names
Since LSF allows a site to install an ELIM to collect additional load indices, the names and the total number of load indices are often dynamic and have to be found out at run time unless the application is only using the built-in load indices.
Below is an example routine that returns a list of all available load index names and the total number of load indices.

    #include <stdio.h>
    #include <lsf/lsf.h>

    char **getIndexList(int *listsize)
    {
        /* static, so that the result of ls_info() is kept between calls */
        static struct lsInfo *lsInfo = NULL;
        static char *nameList[268];
        int i;

        if (lsInfo == NULL) {
            /* only need to call ls_info() the first time */
            lsInfo = ls_info();
            if (lsInfo == NULL)
                return (NULL);
        }

        if (listsize != NULL)
            *listsize = lsInfo->numIndx;

        for (i = 0; i < lsInfo->numIndx; i++)
            nameList[i] = lsInfo->resTable[i].name;

        return (nameList);
    }

The above code fragment returns a list of load index names currently installed in the LSF cluster. The content of listsize is set to the total number of load indices. If ls_info() fails, the function returns NULL. The data structure returned by ls_info() contains all the load index names before any other resource names. The load index names start with the 11 built-in load indices followed by site external load indices (through ELIM).
Displaying selected load indices
By providing a list of load index names to an LSLIB function, you can get the load information about the specified load indices.
The following example shows how you can display the values of the external load indices. This program uses ls_loadinfo():

    struct hostLoad *ls_loadinfo(resreq, numhosts, options, fromhost, hostlist, listsize, namelist)

The parameters for this routine are:

    char *resreq;       Resource requirement
    int *numhosts;      Return parameter, number of hosts returned
    int options;        Host and load selection options
    char *fromhost;     Used only if DFT_FROMTYPE is set in options
    char **hostlist;    A list of candidate hosts for selection
    int listsize;       Number of hosts in hostlist
    char ***namelist;   Input/output parameter -- load index name list

ls_loadinfo() is similar to ls_load() except that ls_loadinfo() allows an application to supply both a list of load indices and a list of candidate hosts. If both namelist and hostlist are NULL, it operates in the same way as the ls_load() function.
The parameter namelist allows an application to specify a list of load indices of interest. The function then returns only the specified load indices. On return, this parameter is modified to point to another name list that contains the same set of load index names. The names in this list may be in a different order, which reflects the mapping between the index names and the actual load values returned in the hostLoad array:

    #include <stdio.h>
    #include <stdlib.h>
    #include <lsf/lsf.h>
    /* include the header file with the getIndexList function here */

    int main(void)
    {
        struct hostLoad *load;
        char **loadNames;
        int numIndx;
        int numUsrIndx;
        int nHosts;
        int i;
        int j;

        loadNames = getIndexList(&numIndx);
        if (loadNames == NULL) {
            ls_perror("Unable to get load index names\n");
            exit(-1);
        }

        numUsrIndx = numIndx - 11;  /* total number of site defined indices */
        if (numUsrIndx == 0) {
            printf("No external load indices defined\n");
            exit(-1);
        }

        loadNames += 11;  /* skip the 11 built-in load index names */

        load = ls_loadinfo(NULL, &nHosts, 0, NULL, NULL, 0, &loadNames);
        if (load == NULL) {
            ls_perror("ls_loadinfo");
            exit(-1);
        }

        printf("Report on external load indices\n");
        for (i = 0; i < nHosts; i++) {
            printf("Host %s:\n", load[i].hostName);
            for (j = 0; j < numUsrIndx; j++)
                printf("index name: %s, value %5.0f\n",
                       loadNames[j], load[i].li[j]);
        }
        return 0;
    }

The above program uses the getIndexList() function described in the previous example program to get a list of all available load index names. Sample output from the above program follows:

    Report on external load indices
    Host hostA:
    index name: usr_tmp, value 87
    index name: num_licenses, value 1
    Host hostD:
    index name: usr_tmp, value 18
    index name: num_licenses, value 2

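Because the name list returned through namelist labels the values in the li array position by position, looking up one index by name is a simple linear search. The following is a minimal sketch of that idea, not part of LSLIB: find_load_value() is a hypothetical helper, and it assumes the load values are available as an array of float in the same order as the names.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return the load value that corresponds to indexName, or fallback if the
 * name is not in the list. names[j] labels values[j], in the same way that
 * the returned name list labels the entries of the li array.
 * find_load_value() is a hypothetical helper, not an LSLIB function.
 */
float find_load_value(char **names, const float *values, int count,
                      const char *indexName, float fallback)
{
    int j;

    for (j = 0; j < count; j++) {
        if (strcmp(names[j], indexName) == 0)
            return values[j];
    }
    return fallback;
}
```

In the example program, a call such as find_load_value(loadNames, load[i].li, numUsrIndx, "num_licenses", -1.0f) would pick out a single external index for host i instead of printing all of them.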
Writing a Parallel Application
LSF provides job placement and remote execution support for parallel applications. A master LIM's host selection or placement service can return an array of good hosts for an application. The application can then use remote execution service provided by RES to run tasks on these hosts concurrently.
This section contains samples of how to write a parallel application using LSLIB.
ls_rtask() function
Running a task remotely discusses the use of ls_rexecv() for remote execution. You can also use ls_rtask() for remote execution. ls_rtask() and ls_rexecv() differ in how the server host behaves.
ls_rexecv() is useful when the server host does not need to do anything but wait for the remote task to finish. After initiating the remote task, ls_rexecv() replaces the current program with the Network I/O Server (NIOS) by calling execv(). The NIOS then handles the rest of the work on the server host: delivering input/output between the local terminal and the remote task, and exiting with the same status as the remote task. ls_rexecv() can be considered the remote execution version of the UNIX execv() system call.
ls_rtask() provides more flexibility if the server host has to do other things after the remote task is initiated. For example, the application may want to start more than one task on several hosts. Unlike ls_rexecv(), ls_rtask() returns immediately after the remote task is started. The syntax of ls_rtask() is:

    int ls_rtask(host, argv, options)

The parameters are:

    char *host;     Name of the remote host to start task on
    char **argv;    Program name and arguments
    int options;    Remote execution options

The options parameter is similar to that of the ls_rexecv() function.
ls_rtask() returns the task ID of the remote task, which the application uses to differentiate multiple outstanding remote tasks. When a remote task finishes, the status of the remote task is sent back to the NIOS running on the local host, which then notifies the application by issuing a SIGUSR1 signal. The application can then call ls_rwait() to collect the status of the remote task. ls_rwait() behaves in much the same way as the wait(2) system call. Consider ls_rtask() as a combination of remote fork() and execv().
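To make the fork(), execv(), and wait(2) analogy concrete, here is a minimal local sketch that uses only standard UNIX calls. It makes no LSLIB calls, and run_and_wait() is a hypothetical helper. In the remote case, ls_rtask() takes the place of the fork() plus execv() step, and ls_rwait() takes the place of waitpid().

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Local analogy only: start argv[0] in a child process and wait for it.
 * Returns the exit status of the child, or -1 if the child could not be
 * started or did not exit normally.
 */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;              /* fork failed */
    if (pid == 0) {
        execv(argv[0], argv);   /* returns only if the exec failed */
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The difference that matters for a parallel application is that the parent is free to start further children before it waits, which is what ls_rtask() allows and ls_rexecv() does not.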
Applications calling ls_rtask() must set up a signal handler for the SIGUSR1 signal, or the application could be killed by SIGUSR1.
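One common way to meet this requirement is a handler that only records that a notification arrived and leaves the real work to the main program. The sketch below uses standard C signal handling only; the function and variable names are hypothetical, and the place where a real application would call ls_rwait() is indicated in the note after the code rather than implemented.

```c
#include <signal.h>

/* Set by the handler; checked and cleared by the main program. */
static volatile sig_atomic_t taskStatusReady = 0;

/* Only record the notification; do the real work outside the handler. */
static void on_sigusr1(int signo)
{
    (void) signo;
    taskStatusReady = 1;
}

/* Install the handler. Returns 0 on success, -1 on failure. */
int install_sigusr1_handler(void)
{
    return (signal(SIGUSR1, on_sigusr1) == SIG_ERR) ? -1 : 0;
}

/* Return 1 once if a notification arrived since the last call, else 0. */
int take_task_notification(void)
{
    if (taskStatusReady) {
        taskStatusReady = 0;
        return 1;
    }
    return 0;
}
```

An application would call install_sigusr1_handler() before its first ls_rtask() call, and call ls_rwait() from its main loop whenever take_task_notification() returns 1.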
You need to be careful if your application handles SIGTSTP, SIGTTIN, or SIGTTOU. If the handlers for these signals are SIG_DFL, the ls_rtask() function automatically installs a handler for them to properly coordinate with the NIOS when these signals are received. If you intend to handle these signals yourself instead of using the default set by LSLIB, you need to call the low-level LSLIB function ls_stoprex() before the end of your signal handler.
Running tasks on many machines
Below is an example program that uses ls_rtask() to run rm -f /tmp/core on user specified hosts.

    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <lsf/lsf.h>

    int main(int argc, char **argv)
    {
        char *command[4];
        int numHosts;
        int i;
        int tid;

        if (argc <= 1) {
            printf("Usage: %s host1 [host2 ...]\n", argv[0]);
            exit(-1);
        }

        numHosts = argc - 1;
        command[0] = "rm";
        command[1] = "-f";
        command[2] = "/tmp/core";
        command[3] = NULL;

        if (ls_initrex(numHosts, 0) < 0) {
            ls_perror("ls_initrex");
            exit(-1);
        }

        signal(SIGUSR1, SIG_IGN);

        /* Run command on the specified hosts */
        for (i = 1; i <= numHosts; i++) {
            if ((tid = ls_rtask(argv[i], command, 0)) < 0) {
                fprintf(stderr, "ls_rtask failed for host %s: %s\n",
                        argv[i], ls_sysmsg());
                exit(-1);
            }
            printf("Task %d started on %s\n", tid, argv[i]);
        }

        /* Collect the status of each remote task as it finishes */
        while (numHosts) {
            LS_WAIT_T status;

            tid = ls_rwait(&status, 0, NULL);
            if (tid < 0) {
                ls_perror("ls_rwait");
                exit(-1);
            }
            printf("Task %d finished\n", tid);
            numHosts--;
        }
        exit(0);
    }

The above program sets the signal handler for SIGUSR1 to SIG_IGN. This causes the signal to be ignored. It uses ls_rwait() to poll the status of remote tasks. Alternatively, you could install a signal handler that calls ls_rwait().
Use the task ID to perform an operation on the task. For example, you can send a signal to a remote task explicitly by calling ls_rkill().
To run the task on remote hosts one after another instead of concurrently, call ls_rwait() right after ls_rtask().
Also note the use of ls_sysmsg() instead of ls_perror(), which does not allow a flexible printing format.
The above example program produces output similar to the following:

    % a.out hostD hostA hostB
    Task 1 started on hostD
    Task 2 started on hostA
    Task 3 started on hostB
    Task 1 finished
    Task 3 finished
    Task 2 finished

Remote tasks are run concurrently, so the order in which tasks finish is not necessarily the same as the order in which tasks are started.
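Since tasks can finish in any order, an application that wants to report results per host needs to remember which task ID was started where. The following is a minimal bookkeeping sketch under that assumption. The table, its fixed size, and both helper functions are hypothetical; no LSLIB calls are made.

```c
#include <stddef.h>

#define MAX_TASKS 64

/* Hypothetical bookkeeping entry: the host a task ID was started on. */
struct task_entry {
    int tid;
    const char *host;
};

static struct task_entry taskTable[MAX_TASKS];
static int numTasks = 0;

/* Remember the host for a task ID. Returns 0, or -1 if the table is full. */
int remember_task(int tid, const char *host)
{
    if (numTasks >= MAX_TASKS)
        return -1;
    taskTable[numTasks].tid = tid;
    taskTable[numTasks].host = host;
    numTasks++;
    return 0;
}

/* Return the host recorded for tid, or NULL if the task ID is unknown. */
const char *host_of_task(int tid)
{
    int i;

    for (i = 0; i < numTasks; i++) {
        if (taskTable[i].tid == tid)
            return taskTable[i].host;
    }
    return NULL;
}
```

In the example program, remember_task(tid, argv[i]) would follow each successful ls_rtask() call, and host_of_task(tid) would be used after ls_rwait() returns to print the host name next to the finished task.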
Discovering Why a Job Is Suspended
Getting Information about Batch Jobs shows how to get information about submitted jobs. It is often useful to know why a job is in a certain status. LSBLIB provides a function to print such information. This section describes a routine that prints out why a job is suspended.
When lsb_readjobinfo() reads a record of a suspended job, pass the variables reasons and subreasons contained in the returned struct jobInfoEnt to lsb_suspreason(). This gets the reason text explaining why the job is suspended:

    char *lsb_suspreason(reasons, subReasons, ld);

where reasons and subReasons are integer reason flags as returned by the lsb_readjobinfo() function, and ld is a pointer to the following data structure:

    struct loadIndexLog {
        int nIdx;       Number of load indices configured for the LSF cluster
        char **name;    List of the load index names
    };

Use the following initialization and code fragment after lsb_readjobinfo() is called:

    /* initialization */
    struct loadIndexLog *indices =
        (struct loadIndexLog *) malloc(sizeof(struct loadIndexLog));
    char *suspreason;

    /* get the list of all load index names */
    indices->name = getIndexList(&indices->nIdx);

    /* get and print out the suspended reason */
    suspreason = lsb_suspreason(job->reasons, job->subreasons, indices);
    printf("%s\n", suspreason);

What if the Job is Pending
Use lsb_pendreason() to write a program that prints out the reason why a job is in pending status.

    char *lsb_pendreason(int numReasons, int *rsTb, struct jobInfoHead *jInfoH, struct loadIndexLog *ld, int clusterId)

rsTb is a reason table in which each entry contains one pending reason. numReasons is an integer representing the number of reasons in the table.
struct jobInfoHead is returned by the lsb_openjobinfo_a() function. It is defined as follows:

    struct jobInfoHead {
        int numJobs;            Number of jobs
        LS_LONG_INT *jobIds;    Job IDs
        int numHosts;           Number of hosts
        char **hostNames;       Name of hosts
    };

ld is the same struct as used in the above lsb_suspreason() function call.
This program is similar to the above program for displaying the suspending reason, with one difference: use lsb_openjobinfo_a() instead of lsb_openjobinfo() to open the job information connection, because the struct jobInfoHead that it returns is needed as one of the arguments when calling lsb_pendreason().

    struct jobInfoHead *lsb_openjobinfo_a(jobId, jobName, user, queue, host, options);

For information on using lsb_openjobinfo_a(), see the discussion on lsb_openjobinfo() in Getting Information about Batch Jobs.
The following initialization and code fragment show how to display the pending reason using lsb_pendreason():

    /* initialization */
    char *pendreason;
    struct loadIndexLog *indices =
        (struct loadIndexLog *) malloc(sizeof(struct loadIndexLog));
    struct jobInfoHead *jInfoH;

    /* open the job information connection with mbatchd */
    jInfoH = lsb_openjobinfo_a(0, NULL, user, NULL, NULL, options);

    /* gets the total number of pending jobs, exits on failure */
    if (jInfoH == NULL) {
        lsb_perror("lsb_openjobinfo_a");
        exit(-1);
    }

    /* get the list of all load index names */
    indices->name = getIndexList(&indices->nIdx);

    /* get and print out the pending reasons */
    pendreason = lsb_pendreason(job->numReasons, job->reasonTb, jInfoH, indices);
    printf("%s\n", pendreason);

Use ls_loadinfo() to get the list of all load index names. For more information, see Displaying selected load indices.
Reading lsf.conf Parameters
You can refer to the contents of the lsf.conf file or even define your own site specific variables in the lsf.conf file.
The lsf.conf file follows the Bourne shell syntax. It can be sourced by a shell script and set into your environment before starting your C program. Use these variables as environment variables in your program.
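As a minimal sketch of the environment-variable approach, assuming a wrapper script has sourced lsf.conf and exported the variables before starting the program, a value can be read with the standard getenv() call. conf_value() is a hypothetical helper, not an LSLIB function.

```c
#include <stdlib.h>

/*
 * Return the value of an lsf.conf variable that was exported into the
 * environment, or fallback if it is not set. conf_value() is a hypothetical
 * helper built on the standard getenv() call.
 */
const char *conf_value(const char *name, const char *fallback)
{
    const char *value = getenv(name);

    return (value != NULL) ? value : fallback;
}
```

For example, conf_value("MY_PARAM1", "not set") returns the exported value of MY_PARAM1, or the string "not set" when the variable is absent from the environment.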
ls_readconfenv() reads the lsf.conf variables in your C program:

    int ls_readconfenv(paramList, confPath)

where confPath is the directory in which the lsf.conf file is stored. paramList is an array of the following data structure:

    struct config_param {
        char *paramName;    Name of the parameter, input
        char *paramValue;   Value of the parameter, output
    }

ls_readconfenv() reads the values of the parameters that are defined in lsf.conf and whose names match those in the paramList array. Each resulting value is saved into the paramValue variable of the array element with the matching paramName. If a particular parameter mentioned in the paramList is not defined in lsf.conf, then on return its value is left NULL.
The following example program reads the variables LSF_CONFDIR, MY_PARAM1, and MY_PARAM2 in the lsf.conf file and displays them on screen. Note that LSF_CONFDIR is a standard LSF parameter, while the other two parameters are site specific. The example program below assumes lsf.conf is in the /etc directory.

    #include <stdio.h>
    #include <stdlib.h>
    #include <lsf/lsf.h>

    struct config_param myParams[] =
    {
    #define LSF_CONFDIR 0
        {"LSF_CONFDIR", NULL},
    #define MY_PARAM1 1
        {"MY_PARAM1", NULL},
    #define MY_PARAM2 2
        {"MY_PARAM2", NULL},
        {NULL, NULL}
    };

    int main(void)
    {
        if (ls_readconfenv(myParams, "/etc") < 0) {
            ls_perror("ls_readconfenv");
            exit(-1);
        }

        if (myParams[LSF_CONFDIR].paramValue == NULL)
            printf("LSF_CONFDIR is not defined in /etc/lsf.conf\n");
        else
            printf("LSF_CONFDIR=%s\n", myParams[LSF_CONFDIR].paramValue);

        if (myParams[MY_PARAM1].paramValue == NULL)
            printf("MY_PARAM1 is not defined in /etc/lsf.conf\n");
        else
            printf("MY_PARAM1=%s\n", myParams[MY_PARAM1].paramValue);

        if (myParams[MY_PARAM2].paramValue == NULL)
            printf("MY_PARAM2 is not defined\n");
        else
            printf("MY_PARAM2=%s\n", myParams[MY_PARAM2].paramValue);

        exit(0);
    }

The paramValue member of each config_param element must be initialized to NULL. ls_readconfenv() then modifies paramValue to point to a result string if a matching paramName is found in the lsf.conf file. End the array with a NULL paramName.
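The table convention used here (a NULL-terminated array, values that start as NULL and are filled in only for names that are found) can be illustrated with a small stand-alone sketch. This is not how ls_readconfenv() is implemented: struct param_entry and match_line() are hypothetical, and the sketch handles only plain NAME=value lines, ignoring quoting and the rest of the Bourne shell syntax.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in that mirrors the layout of struct config_param. */
struct param_entry {
    const char *name;   /* input: parameter name, NULL ends the table */
    char *value;        /* output: copy of the value, NULL if not found */
};

/*
 * Store the value of one "NAME=value" line in the matching table entry.
 * Returns 1 if an entry matched, 0 if none did, -1 if allocation failed.
 */
int match_line(struct param_entry *table, const char *line)
{
    const char *eq = strchr(line, '=');
    size_t nameLen, valLen;
    int i;

    if (eq == NULL)
        return 0;                       /* not an assignment */
    nameLen = (size_t) (eq - line);

    for (i = 0; table[i].name != NULL; i++) {
        if (strlen(table[i].name) == nameLen &&
            strncmp(table[i].name, line, nameLen) == 0) {
            valLen = strlen(eq + 1);
            free(table[i].value);       /* a later line replaces an earlier one */
            table[i].value = malloc(valLen + 1);
            if (table[i].value == NULL)
                return -1;
            memcpy(table[i].value, eq + 1, valLen + 1);
            return 1;
        }
    }
    return 0;                           /* name is not in the table */
}
```

Entries whose names never appear keep their initial NULL value, which is the condition the example program tests before printing each parameter.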
Signal Handling in Windows
LSF uses the UNIX signal mechanism to perform job control. For example, the bkill command in UNIX normally results in the signals SIGINT, SIGTERM, and SIGKILL being sent to the target job. Signal handling code that exists in UNIX applications allows processes to shut down in stages. In the past, the Windows equivalent to the bkill command was TerminateProcess(). It terminates the process immediately and does not allow the process to release shared resources the way bkill does.
LSF version 3.2 has been modified to provide signal notification through the Windows message queue. LSF now includes messages corresponding to common UNIX signals. This means that a customized Windows application can process these messages.
For example, the bkill command now sends the SIGINT and SIGTERM signals to Windows applications as job control messages. An LSF-aware Windows application can interpret these messages and shut down neatly.
To write a Windows application that takes advantage of this feature, register the specific signal messages that the application handles. Then modify the message loop to check each message before dispatching it. Take the appropriate action if the message is a job control message.
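The check that the message loop performs before dispatching can be sketched as a small decision function. The sketch is portable C and calls no Windows API: classify_message() and the jc_action values are hypothetical, and the two job control identifiers are assumed to be the values obtained by registering the "SIGINT" and "SIGTERM" messages, as the examples in this section do.

```c
/* What the message loop should do with one message. */
enum jc_action {
    JC_DISPATCH,    /* ordinary message: translate and dispatch as usual */
    JC_INTERRUPT,   /* job control message corresponding to SIGINT */
    JC_TERMINATE    /* job control message corresponding to SIGTERM */
};

/*
 * Compare a message identifier against the registered job control
 * identifiers. Anything that is not a job control message is left for
 * normal dispatching.
 */
enum jc_action classify_message(unsigned int messageId,
                                unsigned int sigIntId,
                                unsigned int sigTermId)
{
    if (messageId == sigTermId)
        return JC_TERMINATE;
    if (messageId == sigIntId)
        return JC_INTERRUPT;
    return JC_DISPATCH;
}
```

A message loop would call this once per retrieved message and only translate and dispatch the message when the result is JC_DISPATCH.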
The following examples show sample code that might help you to write your own applications.
Job control in a Windows application
This example program shows how a Windows application can receive a Windows job control notification from the LSF system.
Catching the notification messages involves:
- Registering the Windows messages for the signals that you want to receive (in this case, SIGTERM).
- Looking for the messages you want to catch in your GetMessage loop.
Do not use DispatchMessage() to dispatch the message, since it is addressed to the thread, not the window. This program displays information in its main window, and waits for SIGTERM. Once SIGTERM is received, it posts a quit message and exits. A real program could do some cleanup when the SIGTERM message is received.

    /* WINJCNTL.C */
    #include <windows.h>
    #include <stdio.h>
    #include <string.h>

    #define BUFSIZE 512

    static UINT msgSigTerm;
    static int xpos;
    static int pid_ypos;
    static int tid_ypos;
    static int msg_ypos;
    static int pid_buf_len;
    static int tid_buf_len;
    static int msg_buf_len;
    static char pid_buf[BUFSIZE];
    static char tid_buf[BUFSIZE];
    static char msg_buf[BUFSIZE];

    LRESULT WINAPI MainWndProc(HWND hWnd, UINT msg, WPARAM wParam, LPARAM lParam)
    {
        HDC hDC;
        PAINTSTRUCT ps;
        TEXTMETRIC tm;

        switch (msg) {
        case WM_CREATE:
            /* Work out where each line of text goes */
            hDC = GetDC(hWnd);
            GetTextMetrics(hDC, &tm);
            ReleaseDC(hWnd, hDC);
            xpos = 0;
            pid_ypos = 0;
            tid_ypos = pid_ypos + tm.tmHeight;
            msg_ypos = tid_ypos + tm.tmHeight;
            break;
        case WM_PAINT:
            hDC = BeginPaint(hWnd, &ps);
            TextOut(hDC, xpos, pid_ypos, pid_buf, pid_buf_len);
            TextOut(hDC, xpos, tid_ypos, tid_buf, tid_buf_len);
            TextOut(hDC, xpos, msg_ypos, msg_buf, msg_buf_len);
            EndPaint(hWnd, &ps);
            break;
        case WM_DESTROY:
            PostQuitMessage(0);
            break;
        default:
            return DefWindowProc(hWnd, msg, wParam, lParam);
        }
        return 0;
    }

    int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance,
                       LPSTR lpCmdLine, int nCmdShow)
    {
        ATOM rc;
        WNDCLASS wc;
        HWND hWnd;
        MSG msg;

        /* Create and register a windows class */
        if (hPrevInstance == NULL) {
            wc.style = CS_OWNDC | CS_VREDRAW | CS_HREDRAW;
            wc.lpfnWndProc = MainWndProc;
            wc.cbClsExtra = 0;
            wc.cbWndExtra = 0;
            wc.hInstance = hInstance;
            wc.hIcon = LoadIcon(NULL, IDI_APPLICATION);
            wc.hCursor = LoadCursor(NULL, IDC_ARROW);
            wc.hbrBackground = (HBRUSH) (COLOR_WINDOW + 1);
            wc.lpszMenuName = NULL;
            /* must match the class name passed to CreateWindow below */
            wc.lpszClassName = "WinJCntlClass";
            rc = RegisterClass(&wc);
        }

        /* Register the message we want to catch */
        msgSigTerm = RegisterWindowMessage("SIGTERM");

        /* Format some output for the main window */
        sprintf(pid_buf, "My process ID is: %lu",
                (unsigned long) GetCurrentProcessId());
        pid_buf_len = (int) strlen(pid_buf);
        sprintf(tid_buf, "My thread ID is: %lu",
                (unsigned long) GetCurrentThreadId());
        tid_buf_len = (int) strlen(tid_buf);
        sprintf(msg_buf, "Message ID is: %u", msgSigTerm);
        msg_buf_len = (int) strlen(msg_buf);

        /* Create the main window */
        hWnd = CreateWindow("WinJCntlClass",
                            "Windows Job Control Demo App",
                            WS_OVERLAPPEDWINDOW,
                            0, 0,
                            CW_USEDEFAULT, CW_USEDEFAULT,
                            NULL, NULL, hInstance, NULL);
        ShowWindow(hWnd, nCmdShow);

        /* Enter the message loop, waiting for msgSigTerm.
           When we get it, just post a quit message */
        while (GetMessage(&msg, NULL, 0, 0)) {
            if (msg.message == msgSigTerm) {
                PostQuitMessage(0);
            } else {
                TranslateMessage(&msg);
                DispatchMessage(&msg);
            }
        }
        return (int) msg.wParam;
    }

Job control in a console application
This example program shows how a console application can receive a Windows job control notification from the LSF system.
Catching the notification messages involves:
- Registering the Windows messages for the signals that you want to receive (in this case, SIGINT and SIGTERM).
- Creating a message queue by calling PeekMessage (this is how Microsoft suggests console applications should create message queues).
- Entering a GetMessage loop and looking for the messages you want to catch.
Do not DispatchMessage here, since you do not have a window to dispatch to.
This program sits in the message loop. It waits for SIGINT and SIGTERM, and displays messages when those signals are received. A real application would do clean-up and exit if it received either of these signals.

    /* CONJCNTL.C */
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        DWORD pid = GetCurrentProcessId();
        DWORD tid = GetCurrentThreadId();
        UINT msgSigInt = RegisterWindowMessage("SIGINT");
        UINT msgSigTerm = RegisterWindowMessage("SIGTERM");
        MSG msg;

        /* Make a message queue -- this is the method suggested by MS */
        PeekMessage(&msg, NULL, WM_USER, WM_USER, PM_NOREMOVE);

        printf("My process id: %lu\n", (unsigned long) pid);
        printf("My thread id: %lu\n", (unsigned long) tid);
        printf("SIGINT message id: %u\n", msgSigInt);
        printf("SIGTERM message id: %u\n", msgSigTerm);
        printf("Entering loop...\n");
        fflush(stdout);

        /* No DispatchMessage here: there is no window to dispatch to */
        while (GetMessage(&msg, NULL, 0, 0)) {
            printf("Received message: %u\n", msg.message);
            if (msg.message == msgSigInt) {
                printf("SIGINT received, continuing.\n");
            } else if (msg.message == msgSigTerm) {
                printf("SIGTERM received, continuing.\n");
            }
            fflush(stdout);
        }

        printf("Exiting.\n");
        fflush(stdout);
        return EXIT_SUCCESS;
    }

Date Modified: March 13, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.