Learn more about Platform products at http://www.platform.com




Using Platform LSF HPC with the Etnus TotalView Debugger




How LSF HPC Works with TotalView

Platform LSF HPC is integrated with the Etnus TotalView® multiprocess debugger. You should already be familiar with using TotalView software and debugging parallel applications.

Debugging LSF HPC jobs with TotalView

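Etnus TotalView is a source-level and machine-level debugger for analyzing, debugging, and tuning multiprocessor or multithreaded programs. LSF HPC works with TotalView in two ways: you can start a job and TotalView together through LSF HPC, or you can start TotalView yourself and attach it to an LSF HPC job that is already running.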

Once your job is running and its processes are attached to TotalView, you can debug your program as you normally would.

For more information

See the TotalView Users Guide for information about using TotalView.

Installing LSF HPC for TotalView

lsfinstall installs the application-specific esub program esub.tvpoe for debugging POE jobs in TotalView. It behaves like esub.poe and runs the poejob script, but it also sets the appropriate TotalView options and environment variables for POE jobs.

lsfinstall also configures the hpc_ibm_tv queue in lsb.queues for debugging POE jobs. The queue is not rerunnable, does not allow interactive batch jobs (bsub -I), and specifies the following TERMINATE_WHEN action:

TERMINATE_WHEN=LOAD PREEMPT WINDOW

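lsfinstall installs application-specific esub programs for using TotalView with LSF HPC, including esub.tvpoe for POE jobs, esub.tvlammpi for LAM/MPI jobs, and esub.tvmpich_gm for MPICH-GM jobs. Each is installed in LSF_SERVERDIR and is selected with the corresponding bsub -a method name (tvpoe, tvlammpi, or tvmpich_gm).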

Environment variables for TotalView

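On the submission host, make sure that the directory containing the TotalView binary is in your $PATH environment variable and that $DISPLAY is set to console_name:0.0.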

Setting TotalView preferences

Before running and debugging jobs with TotalView, you should set the following options in your $HOME/.preferences.tvd file:

Limitations

While your job is running and you are using TotalView to debug it, you cannot use LSF HPC job control commands:



Running Jobs for TotalView Debugging

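You can submit jobs for TotalView debugging in two ways: start the job and TotalView together through LSF HPC, or run TotalView and attach it to an LSF HPC job.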


You must add the directory containing the TotalView binary to the $PATH environment variable on the submission host, and set the $DISPLAY environment variable to console_name:0.0.
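The two submission-host requirements above can be checked before you submit. The following is a minimal sketch for a POSIX shell; the function name check_tv_submit_env is hypothetical and is not part of LSF HPC or TotalView.

```shell
#!/bin/sh
# Hypothetical helper: verify the submission-host settings that
# TotalView debugging through LSF HPC depends on.
check_tv_submit_env() {
    rc=0
    # The TotalView binary must be found through $PATH.
    if ! command -v totalview >/dev/null 2>&1; then
        echo "check_tv_submit_env: totalview not found in PATH" >&2
        rc=1
    fi
    # $DISPLAY must name the console where TotalView windows appear,
    # for example console_name:0.0.
    if [ -z "${DISPLAY:-}" ]; then
        echo "check_tv_submit_env: DISPLAY is not set" >&2
        rc=1
    fi
    return "$rc"
}
```

For example, after adding the TotalView bin directory to $PATH and setting DISPLAY=console_name:0.0, run check_tv_submit_env before bsub; a nonzero exit status means that one of the settings is missing.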

Compiling your program for debugging

Before submitting your job to LSF HPC for debugging in TotalView, compile your source code with the -g compiler option. This option generates the appropriate debugging information in the symbol table.

Any multiprocess program that calls fork(), vfork(), or execve() should be linked with the dbfork library.

See your compiler documentation and the TotalView Users Guide for more information about compiling programs for debugging.
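As a minimal sketch, assuming the mpicc compiler wrapper and a hypothetical TV_LIBDIR variable that names the directory containing the dbfork library, the following POSIX shell function prints a compile command instead of running it. The function name is hypothetical, and the exact dbfork link options vary by platform, so check them against the TotalView Users Guide.

```shell
#!/bin/sh
# Hypothetical helper: print a compile command for a program that will be
# debugged in TotalView. CC defaults to mpicc (an assumption; use your
# site's compiler). TV_LIBDIR (hypothetical) must name the directory that
# contains the dbfork library when "dbfork" is given as the third argument.
tv_compile_cmd() {
    src=$1
    out=$2
    # -g generates debugging information in the symbol table.
    cmd="${CC:-mpicc} -g -o $out $src"
    if [ "${3:-}" = "dbfork" ]; then
        if [ -z "${TV_LIBDIR:-}" ]; then
            echo "tv_compile_cmd: set TV_LIBDIR to the TotalView library directory" >&2
            return 1
        fi
        # Link with dbfork for programs that call fork(), vfork(), or execve().
        cmd="$cmd -L$TV_LIBDIR -ldbfork"
    fi
    printf '%s\n' "$cmd"
}
```

For example, with CC unset, tv_compile_cmd myjob.c myjob prints mpicc -g -o myjob myjob.c. Printing the command lets you review it before running it.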

Starting a job and TotalView together through LSF HPC

Syntax

bsub -a tvapplication [bsub_options] mpirun.lsf job [job_options] [-tvopt tv_options]

-a tvapplication

Specifies the application you want to run through LSF HPC and debug in TotalView.

-tvopt tv_options

Specifies options to be passed to TotalView. Use any valid TotalView command option, except -a (LSF uses this option internally). See the TotalView Users Guide for information about TotalView command options and setting up parallel debugging sessions.
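The syntax above can also be assembled from a script. The following is a minimal sketch for a POSIX shell that prints a bsub command line for a given method name, process count, job, and TotalView options, and refuses -a in the TotalView options because LSF uses that option internally. The function name tv_bsub_cmd is hypothetical, and for brevity it does not handle additional bsub_options or job_options.

```shell
#!/bin/sh
# Hypothetical helper: build (but do not run) a command line of the form
#   bsub -a tvapplication -n N mpirun.lsf job [-tvopt tv_options]
tv_bsub_cmd() {
    method=$1        # esub method name, for example tvpoe or tvlammpi
    nproc=$2         # number of processors for bsub -n
    job=$3           # program started by mpirun.lsf
    tvopts=${4:-}    # TotalView options; may be empty
    # LSF uses the TotalView -a option internally, so reject it here.
    for opt in $tvopts; do
        if [ "$opt" = "-a" ]; then
            echo "tv_bsub_cmd: -a cannot be passed to TotalView" >&2
            return 1
        fi
    done
    cmd="bsub -a $method -n $nproc mpirun.lsf $job"
    if [ -n "$tvopts" ]; then
        cmd="$cmd -tvopt $tvopts"
    fi
    printf '%s\n' "$cmd"
}
```

For example, tv_bsub_cmd tvpoe 2 myjob -no_ask_on_dlopen prints bsub -a tvpoe -n 2 mpirun.lsf myjob -tvopt -no_ask_on_dlopen, which is the POE submission shown in the example below.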

Example

To submit a POE job and run TotalView:

% bsub -a tvpoe -n 2 mpirun.lsf myjob -tvopt -no_ask_on_dlopen

The method name tvpoe uses the special esub for debugging POE jobs with TotalView (LSF_SERVERDIR/esub.tvpoe). -no_ask_on_dlopen is a TotalView option that tells TotalView not to prompt about stopping processes that use the dlopen system call.

To submit a LAM/MPI job and run TotalView:

% bsub -a tvlammpi -n 2 mpirun.lsf myjob -tvopt -no_ask_on_dlopen

The method name tvlammpi uses the special esub for debugging LAM/MPI jobs with TotalView (LSF_SERVERDIR/esub.tvlammpi). -no_ask_on_dlopen is a TotalView option that tells TotalView not to prompt about stopping processes that use the dlopen system call.

When the TotalView Root window opens:

  1. TotalView automatically acquires the pam process and a Process window opens.
  2. Click Go in the Process window to start debugging the process.


    Depending on your TotalView preferences, you may see the Stop Before Going Parallel dialog. Click Yes. To change this behavior, use the Parallel page of the File > Preferences dialog to change the setting of the When a job goes parallel or calls exec() radio buttons.

    The process starts running and stops at the first breakpoint you set.


    For MPICH-GM jobs, TotalView stops at two breakpoints: one in pam and one in MPI_Init(). Click Go to continue debugging.

  3. Debug your job as you would normally in TotalView.

    When you are finished debugging your program, choose File > Exit to exit TotalView, and click Yes in the Exit dialog. As TotalView exits, it kills the pam process. After a few moments, LSF HPC detects that PAM has exited, and your job finishes with the status Done successfully.

Running TotalView and attaching to an LSF HPC job

Syntax

bsub -a application [bsub_options] mpirun.lsf job [job_options]

-a application

Specifies the application you want to run through LSF HPC and debug in TotalView.

See the TotalView Users Guide for information about attaching jobs in TotalView and setting up parallel debugging sessions.

Example

  1. Submit a job.

    For example:

    % bsub -a poe -n 2 mpirun.lsf myjob
    

    The method name poe uses the esub for running POE jobs (LSF_SERVERDIR/esub.poe).

    % bsub -a mpich_gm -n 2 mpirun.lsf myjob
    

    The method name mpich_gm uses the special esub for running MPICH-GM jobs (LSF_SERVERDIR/esub.mpich_gm).

  2. Start TotalView on the execution host.


    For TotalView to load PAM, LSF_BINDIR must be in the $PATH environment variable on the execution host. Alternatively, use File > Search Path... in TotalView to set the path to LSF_BINDIR.

    The TotalView Root window opens, and pam appears in the Unattached page of the TotalView Root window.

  3. Double-click pam as the process to attach.

    A Process window opens. Your jobs move from the Unattached page to the Attached page.


    You should see all of your job processes in the Unattached page. You can select any process to attach, but to attach all parallel tasks on the local and remote hosts, you must attach to pam.

  4. Click Go in the Process window.
  5. Debug your job as you would normally in TotalView.

    When you are finished debugging your program, choose File > Exit to exit TotalView, and click Yes in the Exit dialog. As TotalView exits, it kills the pam process. After a few moments, LSF HPC detects that PAM has exited, and your job finishes with the status Done successfully.
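Step 2 of the attach procedure requires LSF_BINDIR in $PATH on the execution host. As a minimal sketch for a POSIX shell, assuming the LSF_BINDIR environment variable holds the LSF binary directory on that host, the following hypothetical function prepends it to $PATH only if it is not already there. Run it in the shell from which you start TotalView.

```shell
#!/bin/sh
# Hypothetical helper for the execution host: make sure LSF_BINDIR is in
# $PATH so that TotalView can find and load pam.
add_lsf_bindir_to_path() {
    if [ -z "${LSF_BINDIR:-}" ]; then
        echo "add_lsf_bindir_to_path: LSF_BINDIR is not set" >&2
        return 1
    fi
    # Wrap PATH in colons so that whole entries are matched, not substrings.
    case ":$PATH:" in
        *":$LSF_BINDIR:"*) ;;   # already present; leave PATH unchanged
        *) PATH="$LSF_BINDIR:$PATH"; export PATH ;;
    esac
}
```

Because the function changes PATH in the current shell, call it directly (or from a file you source) rather than in a subshell.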

Viewing source code while debugging

Use View > Lookup Function to view the source code of your application while debugging. Enter main in the Name field and click OK. TotalView finds the source code for the main() function and displays it in the Source Pane.

See the TotalView Users Guide for information about displaying source code.



Controlling and Monitoring Jobs Being Debugged in TotalView

Controlling jobs

While your job is running and you are using TotalView to debug it, you cannot use LSF HPC job control commands:

Monitoring jobs

Use bjobs to see the resource usage of jobs running under TotalView:

% bsub -n 2 -a tvmpich_gm mpirun.lsf ./cpi -tvopt -no_ask_on_dlopen
Job <365> is submitted to queue <hpc_linux>.
% bjobs -l 365

Job <365>, User <user1>, Project <default>, Status <DONE>, Queue <hpc_linux>,
                     Command <totalview pam   -no_ask_on_dlopen -a -g 1 -tv gmmpirun_wrapper  ./cpi>
Fri Oct 11 15:46:47: Submitted from host <hostA>, CWD <$HOME>, 2 Processors
                     Requested, Requested Resources <select[ (gm_ports > 0) ] rusage[gm_ports=1:duration=10]>;
Fri Oct 11 15:46:58: Started on 2 Hosts/Processors <hostA> <hostB>,
                     Execution Home </home/user1>, Execution CWD </home/user1>;
Fri Oct 11 15:53:07: Done successfully. The CPU time used is 69.7 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -
 loadStop    -     -     -     -       -     -    -     -     -      -      -

          adapter_windows
 loadSched      -           -               -
 loadStop       -           -               -

% bsub -a tvpoe -n 4 mpirun.lsf $JOB
Job <341> is submitted to queue <hpc_ibm>.
% bjobs -l 341
Job <341>, User <user1>, Project <default>, Status <DONE>, Queue <hpc_ibm>, Com
                     mand <totalview pam   -a -g 1 -tv poejob 
/home/user1/cpi.poe >
Wed Oct 16 09:59:42: Submitted from host <hostA>, CWD </home/user1, 
                     4 Processors Requested;
Wed Oct 16 09:59:53: Started on 4 Hosts/Processors <hostA> 
                     <hostA> <hostA> <q
                     ataix05.lsf.platform.com>, Execution Home </home/user1>, E
                     xecution CWD </home/user1>;
Wed Oct 16 10:01:19: Done successfully. The CPU time used is 97.0 seconds.

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

          lammpi_load adapter_windows 
 loadSched      -           -               -  
 loadStop       -           -   
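To monitor a job that you submit from a script, you need the job ID that bsub reports. The following minimal sketch for a POSIX shell extracts it from a confirmation line such as Job <365> is submitted to queue <hpc_linux>. so that it can be passed to bjobs -l. The function name job_id_from_bsub is hypothetical, and the sketch assumes bsub prints its confirmation in that format.

```shell
#!/bin/sh
# Hypothetical helper: read bsub output on stdin and print the job ID from
# lines of the form "Job <N> is submitted ...". Other lines are ignored.
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Usage sketch (commented out because it requires an LSF cluster):
# jobid=$(bsub -n 2 -a tvmpich_gm mpirun.lsf ./cpi -tvopt -no_ask_on_dlopen | job_id_from_bsub)
# [ -n "$jobid" ] && bjobs -l "$jobid"
```

The [ -n "$jobid" ] guard matters because the pipeline succeeds even when no job ID is found.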



      Date Modified: August 20, 2009
Platform Computing: www.platform.com

Platform Support: support@platform.com
Platform Information Development: doc@platform.com

Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.