Learn more about Platform products at http://www.platform.com




Advanced Programming Topics




Getting Load Information on Selected Load Indices

Getting Dynamic Load Information shows how to get load information from the LIM. Depending on the size of your LSF cluster and the frequency at which the ls_load() function is called, returning load information of all the hosts can produce unnecessary overhead.

LSLIB provides the ls_loadinfo() call, which lets an application specify a set of load indices and get only the indices that are of interest to it.

Getting a list of all load index names

Since LSF allows a site to install an ELIM to collect additional load indices, the names and the total number of load indices are often dynamic and must be determined at run time, unless the application uses only the built-in load indices.

Example

Below is an example routine that returns a list of all available load index names and the total number of load indices.

#include <stddef.h>
#include <lsf/lsf.h>

char **getIndexList(int *listsize)
{
    /* static, so the ls_info() result is still available on later calls */
    static struct lsInfo *lsInfo = NULL;
    static char *nameList[268];
    int i;

    if (lsInfo == NULL) {
        /* only need to query the LIM when called for the first time */
        lsInfo = ls_info();
        if (lsInfo == NULL)
            return (NULL);
    }
    if (listsize != NULL)
        *listsize = lsInfo->numIndx;
    for (i = 0; i < lsInfo->numIndx; i++)
        nameList[i] = lsInfo->resTable[i].name;
    return (nameList);
}

The above code fragment returns a list of the load index names currently installed in the LSF cluster. The content of listsize is set to the total number of load indices. If ls_info() fails, the function returns NULL. The data structure returned by ls_info() contains all the load index names before any other resource names. The load index names start with the 11 built-in load indices, followed by the site's external load indices (collected through ELIM).
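
Because the built-in indices always come first, the list can be split with simple pointer arithmetic. The following is a minimal sketch that does not call LSLIB: NUM_BUILTIN_INDICES and externalIndexNames() are hypothetical names introduced here for illustration, and the count of 11 is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 11 built-in load indices precede any ELIM-defined ones,
 * as described in the text above. */
#define NUM_BUILTIN_INDICES 11

/*
 * Hypothetical helper: given the full name list and its size (for
 * example, the values produced by getIndexList()), return a pointer to
 * the first external index name and store the number of external
 * indices in *numExternal. Returns NULL if there are none.
 */
char **externalIndexNames(char **names, int total, int *numExternal)
{
    assert(names != NULL && numExternal != NULL);

    if (total <= NUM_BUILTIN_INDICES) {
        *numExternal = 0;
        return NULL;
    }
    *numExternal = total - NUM_BUILTIN_INDICES;
    return names + NUM_BUILTIN_INDICES;
}
```

The returned pointer refers into the original list, so it stays valid only as long as that list does.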

Displaying selected load indices

By providing a list of load index names to an LSLIB function, you can get the load information about the specified load indices.

ls_loadinfo()

The following example shows how you can display the values of the external load indices. This program uses ls_loadinfo():

struct hostLoad *ls_loadinfo(resreq, numhosts, options, 
                            fromhost, hostlist, listsize,
                            namelist)

The parameters for this routine are:

char *resreq;              Resource requirement
int *numhosts;             Return parameter, number of hosts returned 
int options;               Host and load selection options 
char *fromhost;            Used only if DFT_FROMTYPE is set in options 
char **hostlist;           A list of candidate hosts for selection 
int listsize;              Number of hosts in hostlist 
char ***namelist;          Input/output parameter -- load index name list 

ls_loadinfo() is similar to ls_load() except that ls_loadinfo() allows an application to supply both a list of load indices and a list of candidate hosts. If both namelist and hostlist are NULL, it operates in the same way as ls_load().

The namelist parameter lets an application specify a list of load indices of interest. The function then returns only the specified load indices. On return, this parameter is modified to point to another name list that contains the same set of load index names, possibly in a different order, to reflect the mapping between index names and the actual load values returned in the hostLoad array:

Example

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsf.h>

/* include the header file that declares getIndexList() here */

int main(void)
{
    struct hostLoad *load;
    char **loadNames;
    int numIndx;
    int numUsrIndx;
    int nHosts;
    int i;
    int j;

    loadNames = getIndexList(&numIndx);
    if (loadNames == NULL) {
        ls_perror("ls_info");
        exit(-1);
    }

    numUsrIndx = numIndx - 11;  /* this is the total num of
                                site defined indices*/
    if (numUsrIndx == 0) {
        printf("No external load indices defined\n");
        exit(-1);
    }

    loadNames += 11;  /* skip the 11 built-in load index names */
    
    load = ls_loadinfo(NULL, &nHosts, 0, NULL, NULL, 0, 
                      &loadNames);
    if (load == NULL) {
        ls_perror("ls_loadinfo");
        exit(-1);
    }

    printf("Report on external load indices\n");

    for (i=0; i<nHosts; i++) {
        printf("Host %s:\n", load[i].hostName);
        for (j=0; j<numUsrIndx; j++) 
            printf("index name: %s, value %5.0f\n", 
                   loadNames[j], load[i].li[j]);
    }
}

Example output

The above program uses the getIndexList() function described in the previous example program to get a list of all available load index names. Sample output from the above program follows:

Report on external load indices
Host hostA:
        index name: usr_tmp, value 87 
        index name: num_licenses, value 1
Host hostD:
        index name: usr_tmp, value 18
        index name: num_licenses, value 2
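
Because the name list modified by ls_loadinfo() and each host's li array share the same order, a single value can be looked up by name. The following is a minimal sketch, not part of LSLIB: findIndexValue() is a hypothetical helper, and it assumes the load values are floats, as the %5.0f format in the program above suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: look up one load value by index name.
 * names - name list as modified by ls_loadinfo() (same order as li)
 * li    - load values for one host, for example load[i].li
 * n     - number of indices in both arrays
 * Returns 1 and stores the value in *value if found, 0 otherwise.
 */
int findIndexValue(char **names, const float *li, int n,
                   const char *wanted, float *value)
{
    int j;

    assert(names != NULL && li != NULL && wanted != NULL && value != NULL);

    for (j = 0; j < n; j++) {
        if (strcmp(names[j], wanted) == 0) {
            *value = li[j];
            return 1;
        }
    }
    return 0;
}
```

With the variables from the program above, a call might look like findIndexValue(loadNames, load[i].li, numUsrIndx, "usr_tmp", &value).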



Writing a Parallel Application

LSF provides job placement and remote execution support for parallel applications. A master LIM's host selection or placement service can return an array of good hosts for an application. The application can then use the remote execution service provided by RES to run tasks on these hosts concurrently.

This section contains samples of how to write a parallel application using LSLIB.

ls_rtask() function

Running a task remotely discusses the use of ls_rexecv() for remote execution. You can also use ls_rtask() for remote execution. ls_rtask() and ls_rexecv() differ in how the server host behaves.

ls_rexecv() is useful when the server host does not need to do anything but wait for the remote task to finish. After initiating the remote task, ls_rexecv() replaces the current program with the Network I/O Server (NIOS) by calling execv(). The NIOS then handles the rest of the work on the server host: delivering input/output between local terminal and remote task and exiting with the same status as the remote task. ls_rexecv() is considered to be the remote execution version of the UNIX execv() system call.

ls_rtask()

ls_rtask() provides more flexibility if the server host has to do other things after the remote task is initiated. For example, the application may want to start more than one task on several hosts. Unlike ls_rexecv(), ls_rtask() returns immediately after the remote task is started. The syntax of ls_rtask() is:

int ls_rtask(host, argv, options)

The parameters are:

char *host;                Name of the remote host to start task on
char **argv;               Program name and arguments
int  options;               Remote execution options

options parameter

The options parameter is similar to that of the ls_rexecv() function. ls_rtask() returns the task ID of the remote task which is used by the application to differentiate multiple outstanding remote tasks. When a remote task finishes, the status of the remote task is sent back to the NIOS running on the local host, which then notifies the application by issuing a SIGUSR1 signal. The application can then call ls_rwait() to collect the status of the remote task. The ls_rwait() behaves in much the same way as the wait(2) system call. Consider ls_rtask() as a combination of remote fork() and execv().
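
The fork()/execv()/wait() analogy can be made concrete with a purely local sketch. The code below does not use LSLIB and runs everything on the local host; startLocalTask() and waitLocalTask() are hypothetical names that play the roles of ls_rtask() and ls_rwait() in the analogy. It uses execvp() rather than execv() so that the program is found through PATH.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Local analogy only: start argv[0] as a child process and return
 * immediately with its process ID, or -1 if fork() fails.
 */
pid_t startLocalTask(char *const argv[])
{
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);

    pid = fork();
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);   /* reached only if execvp() fails */
    }
    return pid;
}

/*
 * Block until the given child finishes and return its exit code,
 * or -1 if it did not exit normally.
 */
int waitLocalTask(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

As with ls_rtask(), the caller gets an identifier back right away and can start further tasks before collecting any status.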

Note


Applications calling ls_rtask() must set up a signal handler for the SIGUSR1 signal, or the application could be killed by SIGUSR1.

You need to be careful if your application handles SIGTSTP, SIGTTIN, or SIGTTOU. If handlers for these signals are SIG_DFL, the ls_rtask() function automatically installs a handler for them to properly coordinate with the NIOS when these signals are received. If you intend to handle these signals by yourself instead of using the default set by LSLIB, you need to use the low level LSLIB function ls_stoprex() before the end of your signal handler.

Running tasks on many machines

Example

Below is an example program that uses ls_rtask() to run rm -f /tmp/core on user specified hosts.

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <lsf/lsf.h>

int main (int argc, char **argv)
{
    char *command[4];
    int numHosts;
    int i;
    int tid;

    if (argc <= 1) {
        printf("Usage: %s host1 [host2 ...]\n",argv[0]);
        exit(-1);
    }

    numHosts = argc - 1;
    command[0] = "rm";
    command[1] = "-f";
    command[2] = "/tmp/core";
    command[3] = NULL;

    if (ls_initrex(numHosts, 0) < 0) {
        ls_perror("ls_initrex");
        exit(-1);
    }

    signal(SIGUSR1, SIG_IGN);

    /* Run command on the specified hosts */
    for (i=1; i<=numHosts; i++) {
        if ((tid = ls_rtask(argv[i], command, 0)) < 0) {
            fprintf(stderr, "ls_rtask failed for host %s: %s\n",
                    argv[i], ls_sysmsg());
            exit(-1);
        }
        printf("Task %d started on %s\n", tid, argv[i]);
    }

    while (numHosts) {
        LS_WAIT_T status;

        tid = ls_rwait(&status, 0, NULL);
        if (tid < 0) {
            ls_perror("ls_rwait");
            exit(-1);
        }
        
        printf("Task %d finished\n", tid);
        numHosts--;
    }

    exit(0);
} 

The above program sets the signal handler for SIGUSR1 to SIG_IGN, which causes the signal to be ignored. It then calls ls_rwait() in a loop to collect the status of each remote task. Alternatively, you could install a SIGUSR1 handler that calls ls_rwait().
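
As a variation on the handler approach, a SIGUSR1 handler can simply record that a status is waiting and leave the ls_rwait() call to the main loop. The following is a minimal sketch that does not call LSLIB; installSigusr1Handler(), consumePendingStatus(), and taskStatusPending are hypothetical names introduced for illustration.

```c
#include <signal.h>

/* Set by the handler; volatile sig_atomic_t can be written safely
 * from inside a signal handler. */
static volatile sig_atomic_t taskStatusPending = 0;

static void onSigusr1(int signo)
{
    (void) signo;
    taskStatusPending = 1;
}

/* Install the handler; returns 0 on success, -1 on failure. */
int installSigusr1Handler(void)
{
    return signal(SIGUSR1, onSigusr1) == SIG_ERR ? -1 : 0;
}

/*
 * Returns 1 (and clears the flag) if SIGUSR1 arrived since the last
 * call, 0 otherwise. A main loop would call ls_rwait() when this
 * returns 1.
 */
int consumePendingStatus(void)
{
    if (!taskStatusPending)
        return 0;
    taskStatusPending = 0;
    return 1;
}
```

Several signals that arrive close together set the flag only once, so a caller that sees the flag set should keep collecting statuses until none are left.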

Use the task ID to perform an operation on the task. For example, you can send a signal to a remote task explicitly by calling ls_rkill().

To run the task on remote hosts one after another instead of concurrently, call ls_rwait() right after ls_rtask().

Also note the use of ls_sysmsg() instead of ls_perror(). ls_sysmsg() returns the error message as a string that you can format yourself, whereas ls_perror() prints it in a fixed format.

Example output

The above example program produces output similar to the following:

% a.out hostD hostA hostB
Task 1 started on hostD
Task 2 started on hostA
Task 3 started on hostB
Task 1 finished
Task 3 finished
Task 2 finished

Remote tasks are run concurrently, so the order in which tasks finish is not necessarily the same as the order in which tasks are started.
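
Since tasks can finish in any order, an application usually needs some bookkeeping to map a returned task ID back to the host it was started on. The following is a minimal sketch of such a table; it does not call LSLIB, and struct taskEntry, rememberTask(), forgetTask(), and MAX_TASKS are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 64   /* arbitrary capacity for this sketch */

/* Hypothetical bookkeeping record: one outstanding remote task. */
struct taskEntry {
    int tid;            /* task ID, as returned by ls_rtask() */
    const char *host;   /* host the task was started on */
};

static struct taskEntry tasks[MAX_TASKS];
static int numTasks = 0;

/* Remember a started task; returns 0 on success, -1 if the table is full. */
int rememberTask(int tid, const char *host)
{
    assert(host != NULL);
    if (numTasks >= MAX_TASKS)
        return -1;
    tasks[numTasks].tid = tid;
    tasks[numTasks].host = host;
    numTasks++;
    return 0;
}

/*
 * Remove a finished task and return its host name, or NULL if the
 * task ID is unknown. Call this with the value returned by ls_rwait().
 */
const char *forgetTask(int tid)
{
    int i;

    for (i = 0; i < numTasks; i++) {
        if (tasks[i].tid == tid) {
            const char *host = tasks[i].host;
            tasks[i] = tasks[numTasks - 1];   /* fill the gap */
            numTasks--;
            return host;
        }
    }
    return NULL;
}
```

In the example program, rememberTask(tid, argv[i]) would follow each successful ls_rtask() call, and forgetTask(tid) would follow each ls_rwait() call so that the completion message can name the host.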



Discovering Why a Job Is Suspended

Getting Information about Batch Jobs shows how to get information about submitted jobs. It is often useful to know why a job is in a certain state. LSBLIB provides a function that returns this information. This section describes a routine that prints why a job is suspended.

lsb_suspreason()

When lsb_readjobinfo() reads a record of a suspended job, the reasons and subreasons fields of the returned struct jobInfoEnt can be passed to lsb_suspreason(). This gets the reason text explaining why the job is suspended:

char *lsb_suspreason(reasons, subReasons, ld);

where reasons and subReasons are integer reason flags as returned by a lsb_readjobinfo() function while ld is a pointer to the following data structure:

struct loadIndexLog {
    int  nIdx;             Number of load indices configured for the
                               LSF cluster
    char **name;           List of the load index names
};

Run the following initialization and code fragment after calling lsb_readjobinfo().

/* initialization */
struct loadIndexLog *indices = (struct loadIndexLog *)
    malloc(sizeof(struct loadIndexLog));
char *suspreason;

/* get the list of all load index names */
indices->name = getIndexList(&indices->nIdx);

/* get and print out the suspended reason */
suspreason = lsb_suspreason(job->reasons, job->subreasons, indices);
printf("%s\n", suspreason);
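
The fragment above does not check the result of malloc(). The following is a minimal sketch of a checked allocation; it does not use the LSF headers. struct demoIndexLog is a stand-in that mirrors the two fields of struct loadIndexLog shown earlier, and makeIndexLog() is a hypothetical helper.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in with the same fields as struct loadIndexLog above, so the
 * sketch compiles without the LSF headers. */
struct demoIndexLog {
    int nIdx;
    char **name;
};

/*
 * Hypothetical helper: allocate an index log that refers to an existing
 * name list. Returns NULL if allocation fails. The caller releases the
 * result with free(); the name list itself is neither copied nor owned.
 */
struct demoIndexLog *makeIndexLog(char **names, int count)
{
    struct demoIndexLog *ld;

    assert(names != NULL && count >= 0);

    ld = malloc(sizeof(*ld));
    if (ld == NULL)
        return NULL;
    ld->nIdx = count;
    ld->name = names;
    return ld;
}
```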



What if the Job is Pending

lsb_pendreason()

Use lsb_pendreason() to write a program to print out the reason why a job is in pending status.

char *lsb_pendreason (int numReasons, int *rsTb, 
                      struct jobInfoHead *jInfoH,
                      struct loadIndexLog *ld, int clusterId)

jobInfoHead structure

struct jobInfoHead is returned by the lsb_openjobinfo_a() function. It is defined as follows:

struct jobInfoHead { 
    int  numJobs;         Number of jobs
    LS_LONG_INT *jobIds;  Job IDs
    int  numHosts;        Number of hosts
    char **hostNames;     Name of hosts
}; 

ld is the same struct as used in the above lsb_suspreason() function call.

This program is similar to the one above for displaying the suspending reason, with one difference: use lsb_openjobinfo_a() instead of lsb_openjobinfo() to open the job information connection, because lsb_pendreason() needs the struct jobInfoHead that lsb_openjobinfo_a() returns.

struct jobInfoHead *lsb_openjobinfo_a(jobId, jobName, user,
                                      queue, host, options);

For information on using lsb_openjobinfo_a(), see the discussion on lsb_openjobinfo() in Getting Information about Batch Jobs.

The following initialization and code fragment show how to display the pending reason using lsb_pendreason():

/* initialization */
char *pendreason;
struct loadIndexLog *indices = (struct loadIndexLog *)
    malloc(sizeof(struct loadIndexLog));
struct jobInfoHead *jInfoH;

/* open the job information connection with mbatchd */
jInfoH = lsb_openjobinfo_a(0, NULL, user, NULL, NULL, options);

/* exit if the connection cannot be opened */
if (jInfoH == NULL) {
    lsb_perror("lsb_openjobinfo_a");
    exit(-1);
}

/* get the list of all load index names */
indices->name = getIndexList(&indices->nIdx);

/* get and print out the pending reasons; clusterId is the fifth
   argument in the signature above and is assumed to be set by the
   caller */
pendreason = lsb_pendreason(job->numReasons, job->reasonTb, jInfoH,
                            indices, clusterId);
printf("%s\n", pendreason);

Note


Use ls_loadinfo() to get the list of all load index names. For more information, see Displaying selected load indices.



Reading lsf.conf Parameters

You can refer to the contents of the lsf.conf file or even define your own site specific variables in the lsf.conf file.

The lsf.conf file follows the Bourne shell syntax. It can be sourced by a shell script and set into your environment before starting your C program. Use these variables as environment variables in your program.
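
Once lsf.conf has been sourced into the environment, a program can read the values with the standard getenv() call. The following is a minimal sketch; envOrDefault() is a hypothetical helper, not an LSLIB function.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: return the value of an environment variable,
 * or fallback if the variable is unset or empty.
 */
const char *envOrDefault(const char *name, const char *fallback)
{
    const char *value = getenv(name);

    if (value == NULL || value[0] == '\0')
        return fallback;
    return value;
}
```

For example, envOrDefault("LSF_CONFDIR", "/etc") returns the sourced value when there is one; the "/etc" fallback here is only a placeholder.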

ls_readconfenv()

ls_readconfenv() reads the lsf.conf variables in your C program:

int ls_readconfenv(paramList, confPath)

where confPath is the directory in which the lsf.conf file is stored. paramList is an array of the following data structure:

struct config_param {
    char *paramName;       Name of the parameter, input
    char *paramValue;      Value of the parameter, output
};

ls_readconfenv() reads the values of the parameters defined in lsf.conf and matches the names described in the paramList array. Each resulting value is saved into the paramValue variable of the array element matching paramName. If a particular parameter mentioned in the paramList is not defined in lsf.conf, then on return its value is left NULL.

Example

The following example program reads the variables LSF_CONFDIR, MY_PARAM1, and MY_PARAM2 from the lsf.conf file and displays them on screen. LSF_CONFDIR is a standard LSF parameter, while the other two parameters are site specific. The example assumes that lsf.conf is in the /etc directory.

#include <stdio.h>
#include <stdlib.h>
#include <lsf/lsf.h>

struct config_param myParams[] =
{
#define LSF_CONFDIR                  0
     {"LSF_CONFDIR", NULL},
#define MY_PARAM1                    1
     {"MY_PARAM1", NULL},
#define MY_PARAM2                    2
     {"MY_PARAM2", NULL},
     {NULL, NULL}
};

int main(void)
{
    if (ls_readconfenv(myParams, "/etc") < 0) {
        ls_perror("ls_readconfenv");
        exit(-1);
    }

    if (myParams[LSF_CONFDIR].paramValue == NULL)
        printf("LSF_CONFDIR is not defined in /etc/lsf.conf\n");
    else
        printf("LSF_CONFDIR=%s\n", myParams[LSF_CONFDIR].paramValue);

    if (myParams[MY_PARAM1].paramValue == NULL)
        printf("MY_PARAM1 is not defined in /etc/lsf.conf\n");
    else
        printf("MY_PARAM1=%s\n", myParams[MY_PARAM1].paramValue);

    if (myParams[MY_PARAM2].paramValue == NULL)
        printf("MY_PARAM2 is not defined\n");
    else
        printf("MY_PARAM2=%s\n", myParams[MY_PARAM2].paramValue);

    exit(0);
}

The paramValue field in each config_param structure must be initialized to NULL. ls_readconfenv() then modifies paramValue to point to a result string if a matching paramName is found in the lsf.conf file. The array must end with an entry whose paramName is NULL.
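
The terminating NULL paramName also makes it easy to walk the array in a loop instead of testing each entry by hand. The following is a minimal sketch that does not call LSLIB: struct demo_param is a stand-in that mirrors the two fields of struct config_param, and printParams() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in with the same two fields as struct config_param above, so
 * that the sketch compiles without <lsf/lsf.h>. */
struct demo_param {
    char *paramName;
    char *paramValue;
};

/*
 * Print every entry up to the terminating NULL paramName and return
 * the number of entries whose paramValue is set.
 */
int printParams(const struct demo_param *params, FILE *out)
{
    int defined = 0;

    assert(params != NULL && out != NULL);

    for (; params->paramName != NULL; params++) {
        if (params->paramValue == NULL) {
            fprintf(out, "%s is not defined\n", params->paramName);
        } else {
            fprintf(out, "%s=%s\n", params->paramName, params->paramValue);
            defined++;
        }
    }
    return defined;
}
```

A loop like this keeps working when parameters are added to the array, as long as the final entry still has a NULL paramName.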



Signal Handling in Windows

LSF uses the UNIX signal mechanism to perform job control. For example, the bkill command in UNIX normally results in the signals SIGINT, SIGTERM, and SIGKILL being sent to the target job. Signal handling code that exists in UNIX applications allows processes to shut down in stages. In the past, the Windows equivalent to the bkill command was TerminateProcess(). It terminates the process immediately and does not allow the process to release shared resources the way bkill does.

LSF version 3.2 has been modified to provide signal notification through the Windows message queue. LSF now includes messages corresponding to common UNIX signals. This means that a customized Windows application can process these messages.

For example, the bkill command now sends the SIGINT and SIGTERM signals to Windows applications as job control messages. An LSF-aware Windows application can interpret these messages and shut down neatly.

To write a Windows application that takes advantage of this feature, register the specific signal messages that the application handles. Then modify the message loop to check each message before dispatching it. Take the appropriate action if the message is a job control message.

The following examples show sample code that might help you to write your own applications.

Job control in a Windows application

Example

This example program shows how a Windows application can receive a Windows job control notification from the LSF system.

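Catching the notification messages involves registering the Windows message for each signal you want to receive (in this case, SIGTERM) and checking for that message in the GetMessage() loop.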

Note

Do not use DispatchMessage() to dispatch the message, since it is addressed to the thread, not the window.

This program displays information in its main window, and waits for SIGTERM. Once SIGTERM is received, it posts a quit message and exits. A real program could do some cleanup when the SIGTERM message is received.

/* WINJCNTL.C */
#include <windows.h>
#include <stdio.h>
#include <string.h>
#define BUFSIZE 512
static UINT msgSigTerm;
static int xpos;
static int pid_ypos;
static int tid_ypos;
static int msg_ypos;
static int pid_buf_len;
static int tid_buf_len;
static int msg_buf_len;
static char pid_buf[BUFSIZE];
static char tid_buf[BUFSIZE];
static char msg_buf[BUFSIZE];

LRESULT WINAPI MainWndProc(HWND hWnd, UINT msg, WPARAM wParam, 
LPARAM lParam)
{
    HDC hDC;
    PAINTSTRUCT ps;
    TEXTMETRIC tm;
    switch (msg) {
        case WM_CREATE:
            hDC = GetDC(hWnd);
            GetTextMetrics(hDC, &tm);
            ReleaseDC(hWnd, hDC);
            xpos = 0;
            pid_ypos = 0;
            tid_ypos = pid_ypos + tm.tmHeight;
            msg_ypos = tid_ypos + tm.tmHeight;
            break;

        case WM_PAINT:

            hDC = BeginPaint(hWnd, &ps);
            TextOut(hDC, xpos, pid_ypos, pid_buf, 
pid_buf_len);
            TextOut(hDC, xpos, tid_ypos, tid_buf, 
tid_buf_len);
            TextOut(hDC, xpos, msg_ypos, msg_buf, 
msg_buf_len);
            EndPaint(hWnd, &ps);
            break;

        case WM_DESTROY:
            PostQuitMessage(0);
            break;

        default:
            return DefWindowProc(hWnd, msg, wParam, lParam);
    }
    return 0;
}

int WINAPI WinMain(HINSTANCE hInstance, HINSTANCE 
hPrevInstance,
LPSTR lpCmdLine, int nCmdShow)

{
    ATOM rc;
    WNDCLASS wc;
    HWND hWnd;
    MSG msg;

/* Create and register a windows class */
    if (hPrevInstance == NULL) {
            wc.style = CS_OWNDC | CS_VREDRAW | CS_HREDRAW;
            wc.lpfnWndProc = MainWndProc;
            wc.cbClsExtra = 0;
            wc.cbWndExtra = 0;
            wc.hInstance = hInstance;
            wc.hIcon = LoadIcon(NULL, IDI_APPLICATION);
            wc.hCursor = LoadCursor(NULL, IDC_ARROW);
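            wc.hbrBackground = (HBRUSH) (COLOR_WINDOW + 1);
            wc.lpszMenuName = NULL;
            /* must match the class name passed to CreateWindow() */
            wc.lpszClassName = "WinJCntlClass";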

            rc = RegisterClass(&wc);
    }

/* Register the message we want to catch */
    msgSigTerm = RegisterWindowMessage("SIGTERM");

/* Format some output for the main window */
    sprintf(pid_buf, "My process ID is: %lu",
            (unsigned long) GetCurrentProcessId());
    pid_buf_len = (int) strlen(pid_buf);
    sprintf(tid_buf, "My thread ID is: %lu",
            (unsigned long) GetCurrentThreadId());
    tid_buf_len = (int) strlen(tid_buf);
    sprintf(msg_buf, "Message ID is: %u", msgSigTerm);
    msg_buf_len = (int) strlen(msg_buf);

/* Create the main window */
    hWnd = CreateWindow("WinJCntlClass",
            "Windows Job Control Demo App",
            WS_OVERLAPPEDWINDOW,
            0,
            0,
            CW_USEDEFAULT,
            CW_USEDEFAULT,
            NULL,
            NULL,
            hInstance,
            NULL);
    ShowWindow(hWnd, nCmdShow);

/* Enter the message loop, waiting for msgSigTerm. When we get
it, just post a quit message */

    while (GetMessage(&msg, NULL, 0, 0)) {
        if (msg.message == msgSigTerm) {
            PostQuitMessage(0);
        } else {
            TranslateMessage(&msg);
            DispatchMessage(&msg);
        }
    }
    return msg.wParam;
}

Job control in a console application

Example

This example program shows how a console application can receive a Windows job control notification from the LSF system.

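Catching the notification messages involves registering the Windows messages for the signals you want to receive (in this case, SIGINT and SIGTERM), creating a message queue for the thread, and checking for those messages in the GetMessage() loop.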

Note


Do not call DispatchMessage() here, since you do not have a window to dispatch to.

This program sits in the message loop. It is waiting for SIGINT and SIGTERM, and displays messages when those signals are received. A real application would do clean-up and exit if it received either of these signals.

/* CONJCNTL.C */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    DWORD pid = GetCurrentProcessId();
    DWORD tid = GetCurrentThreadId();
    UINT msgSigInt = RegisterWindowMessage("SIGINT");
    UINT msgSigTerm = RegisterWindowMessage("SIGTERM");
    MSG msg;

/* Make a message queue -- this is the method suggested by MS */

    PeekMessage(&msg, NULL, WM_USER, WM_USER, PM_NOREMOVE);
    printf("My process id: %d\n", pid);
    printf("My thread id: %d\n", tid);
    printf("SIGINT message id: %d\n", msgSigInt);
    printf("SIGTERM message id: %d\n", msgSigTerm);
    printf("Entering loop...\n");
    fflush(stdout);

    while (GetMessage(&msg, NULL, 0, 0)) {
        printf("Received message: %d\n", msg.message);
        if (msg.message == msgSigInt) {
            printf("SIGINT received, continuing.\n");
        } else if (msg.message == msgSigTerm) {
            printf("SIGTERM received, continuing.\n");
        }
        fflush(stdout);
    }

    printf("Exiting.\n");
    fflush(stdout);
    return EXIT_SUCCESS;
}





      Date Modified: March 13, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.