- How LSF HPC Works with TotalView
- Running Jobs for TotalView Debugging
- Controlling and Monitoring Jobs Being Debugged in TotalView
How LSF HPC Works with TotalView
Platform LSF HPC is integrated with the Etnus TotalView® multiprocess debugger. You should already be familiar with using TotalView software and debugging parallel applications.
Debugging LSF HPC jobs with TotalView
Etnus TotalView is a source-level and machine-level debugger for analyzing, debugging, and tuning multiprocessor or multithreaded programs. LSF HPC works with TotalView in two ways:
- Use LSF HPC to start TotalView together with your job
- Start TotalView separately, submit your job through LSF HPC and attach the processes of your job to TotalView for debugging
Once your job is running and its processes are attached to TotalView, you can debug your program as you normally would.
See the TotalView Users Guide for information about using TotalView.
Installing LSF HPC for TotalView
lsfinstall installs the application-specific esub program esub.tvpoe for debugging POE jobs in TotalView. It behaves like esub.poe and runs the poejob script, but it also sets the appropriate TotalView options and environment variables for POE jobs.
lsfinstall also configures the hpc_ibm_tv queue for debugging POE jobs in lsb.queues. The queue is not rerunnable, does not allow interactive batch jobs (bsub -I), and specifies the following TERMINATE_WHEN action:
TERMINATE_WHEN=LOAD PREEMPT WINDOW
lsfinstall installs the following application-specific esub programs to use TotalView with LSF HPC:
- esub.tvlammpi: for debugging LAM/MPI jobs in TotalView; behaves like esub.lammpi, but also sets the appropriate TotalView options and environment variables for LAM/MPI jobs, and sends the job to the hpc_linux_tv queue
- esub.tvmpich_gm: for debugging MPICH-GM jobs in TotalView; behaves like esub.mpich_gm, but also sets the appropriate TotalView options and environment variables for MPICH-GM jobs, and sends the job to the hpc_linux_tv queue
lsfinstall also configures the hpc_linux_tv queue for debugging LAM/MPI and MPICH-GM jobs in lsb.queues. The queue is not rerunnable, does not allow interactive batch jobs (bsub -I), and specifies the following TERMINATE_WHEN action:
TERMINATE_WHEN=LOAD PREEMPT WINDOW
Environment variables for TotalView
On the submission host, make sure that:
- The path to the TotalView binary is in your $PATH environment variable
- $DISPLAY is set to console_name:0.0
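Both conditions can be checked on the submission host before a job is submitted. The following is a minimal sketch, not something shipped with LSF HPC or TotalView: the function name check_tv_env is hypothetical, and it assumes the TotalView binary is named totalview.

```shell
# Hypothetical preflight check (not part of LSF HPC or TotalView).
# Succeeds only if an executable named "totalview" is found on $PATH
# and $DISPLAY is set to a non-empty value.
check_tv_env() {
    if ! command -v totalview >/dev/null 2>&1; then
        echo "totalview not found in PATH" >&2
        return 1
    fi
    if [ -z "${DISPLAY:-}" ]; then
        echo "DISPLAY is not set" >&2
        return 1
    fi
    return 0
}
```

A submission could then be guarded with, for example, check_tv_env && bsub -a tvpoe -n 2 mpirun.lsf myjob. The check only confirms that $DISPLAY is non-empty; it does not verify that the display is reachable.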
Setting TotalView preferences
Before running and debugging jobs with TotalView, you should set the following options in your $HOME/.preferences.tvd file:
- dset ignore_control_c {false} to allow TotalView to respond to <CTRL-C>
- dset ask_on_dlopen {false} to tell TotalView not to prompt about stopping processes that use the dlopen system call
Limitations
While your job is running and you are using TotalView to debug it, you cannot use LSF HPC job control commands:
- bchkpnt and bmig are not supported
- Default TotalView signal processing prevents bstop and bresume from suspending and resuming jobs, and bkill from terminating jobs
- brequeue causes TotalView to display all jobs in error status. Click Go and the jobs will rerun.
- Load thresholds and host dispatch windows do not affect jobs running in TotalView
- Preemption is not visible to TotalView
- Rerunning jobs within TotalView is not supported
Running Jobs for TotalView Debugging
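You can submit jobs for debugging in two ways:
- Start a job and TotalView together through LSF HPC
- Start TotalView separately and attach an LSF HPC job to it
You must set the path to the TotalView binary in the $PATH environment variable on the submission host, and the $DISPLAY environment variable to console_name:0.0.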
Compiling your program for debugging
Before submitting your job in LSF HPC for debugging in TotalView, compile your source code with the -g compiler option. This option generates the appropriate debugging information in the symbol table.
Any multiprocess programs that call fork(), vfork(), or execve() should be linked to the dbfork library.
See your compiler documentation and the TotalView Users Guide for more information about compiling programs for debugging.
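As an illustration, the compile step could be wrapped as follows. This is a sketch under stated assumptions: the helper name tv_compile_cmd, the compiler name mpicc, and the library directory argument are all hypothetical, and -ldbfork simply follows the usual -l convention for linking a library named dbfork. Check the TotalView Users Guide for the exact library name and location on your platform.

```shell
# Hypothetical helper: prints (does not run) a compile command line that
# includes -g, and optionally links dbfork from the given directory.
# Usage: tv_compile_cmd COMPILER SOURCE OUTPUT [DBFORK_LIBDIR]
tv_compile_cmd() {
    if [ "$#" -lt 3 ]; then
        echo "usage: tv_compile_cmd COMPILER SOURCE OUTPUT [DBFORK_LIBDIR]" >&2
        return 1
    fi
    compiler=$1
    source_file=$2
    output=$3
    dbfork_dir=${4:-}
    cmd="$compiler -g -o $output $source_file"
    # Only programs that call fork(), vfork(), or execve() need dbfork.
    if [ -n "$dbfork_dir" ]; then
        cmd="$cmd -L$dbfork_dir -ldbfork"
    fi
    printf '%s\n' "$cmd"
}
```

For example, tv_compile_cmd mpicc myjob.c myjob prints a plain -g compile line, and passing a fourth argument appends the dbfork link options.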
Starting a job and TotalView together through LSF HPC
bsub -a tvapplication [bsub_options] mpirun.lsf job [job_options] [-tvopt tv_options]
-a tvapplication: Specifies the application you want to run through LSF HPC and debug in TotalView.
-tvopt tv_options: Specifies options to be passed to TotalView. Use any valid TotalView command option, except -a (LSF uses this option internally). See the TotalView Users Guide for information about TotalView command options and setting up parallel debugging sessions.
To submit a POE job and run TotalView:
% bsub -a tvpoe -n 2 mpirun.lsf myjob -tvopt -no_ask_on_dlopen
The method name tvpoe uses the special esub for debugging POE jobs with TotalView (LSF_SERVERDIR/esub.tvpoe). -no_ask_on_dlopen is a TotalView option that tells TotalView not to prompt about stopping processes that use the dlopen system call.
To submit a LAM/MPI job and run TotalView:
% bsub -a tvlammpi -n 2 mpirun.lsf myjob -tvopt -no_ask_on_dlopen
The method name tvlammpi uses the special esub for debugging LAM/MPI jobs with TotalView (LSF_SERVERDIR/esub.tvlammpi). -no_ask_on_dlopen is a TotalView option that tells TotalView not to prompt about stopping processes that use the dlopen system call.
When the TotalView Root window opens:
- TotalView automatically acquires the pam process and a Process window opens.
- Click Go in the Process window to start debugging the process.
Depending on your TotalView preferences, you may see the Stop Before Going Parallel dialog. Click Yes. Use the Parallel page on the File > Preferences dialog to change the setting of the When a job goes parallel or calls exec() radio buttons.
The process starts running and stops at the first breakpoint you set.
For MPICH-GM jobs, TotalView stops at two breakpoints: one in pam, and one in MPI_Init(). Click Go to continue debugging.
- Debug your job as you would normally in TotalView.
When you are finished debugging your program, choose File > Exit to exit TotalView, and click Yes in the Exit dialog. As TotalView exits, it kills the pam process. In a few moments, LSF HPC detects that PAM has exited and your job exits as Done successfully.
Running TotalView and attaching an LSF HPC job
bsub -a application [bsub_options] mpirun.lsf job [job_options]
-a application: Specifies the application you want to run through LSF HPC and debug in TotalView.
See the TotalView Users Guide for information about attaching jobs in TotalView and setting up parallel debugging sessions.
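The two submission styles described in this section differ only in the method name passed to -a (tvapplication versus application) and in the optional -tvopt arguments. The sketch below makes that difference explicit by printing the bsub command line for each style. The helper name tv_bsub_cmd and its mode keywords together and attach are hypothetical; the helper prints the command rather than running it.

```shell
# Hypothetical helper: prints the bsub command line for a debugging session.
# Usage: tv_bsub_cmd MODE APPLICATION NPROCS JOB [TV_OPTIONS...]
#   MODE=together  uses -a tvAPPLICATION and appends TV_OPTIONS after -tvopt
#   MODE=attach    uses -a APPLICATION; TotalView is started separately
tv_bsub_cmd() {
    if [ "$#" -lt 4 ]; then
        echo "usage: tv_bsub_cmd MODE APPLICATION NPROCS JOB [TV_OPTIONS...]" >&2
        return 1
    fi
    mode=$1
    app=$2
    nprocs=$3
    job=$4
    shift 4
    case "$mode" in
        together)
            cmd="bsub -a tv$app -n $nprocs mpirun.lsf $job"
            # Remaining arguments, if any, are passed to TotalView.
            if [ "$#" -gt 0 ]; then
                cmd="$cmd -tvopt $*"
            fi
            ;;
        attach)
            cmd="bsub -a $app -n $nprocs mpirun.lsf $job"
            ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$cmd"
}
```

For example, tv_bsub_cmd together poe 2 myjob -no_ask_on_dlopen prints the tvpoe submission line, while tv_bsub_cmd attach poe 2 myjob prints the plain poe submission line.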
- Submit a job.
For example:
% bsub -a poe -n 2 mpirun.lsf myjob
The method name poe uses the esub for running POE jobs (LSF_SERVERDIR/esub.poe).
% bsub -a mpich_gm -n 2 mpirun.lsf myjob
The method name mpich_gm uses the special esub for running MPICH-GM jobs (LSF_SERVERDIR/esub.mpich_gm).
- Start TotalView on the execution host.
For TotalView to load PAM, LSF_BINDIR must be in the $PATH environment variable on the execution host. Otherwise, use File > Search Path... in TotalView to set the path to LSF_BINDIR.
The TotalView Root window opens, and pam appears in the Unattached page of the TotalView Root window.
- Double-click pam as the process to attach.
A Process window opens. Your jobs move from the Unattached page to the Attached page.
You should see all of your job processes in the Unattached page; you can select any process to attach, but to attach all parallel tasks on the local and remote hosts, you must attach to pam.
- Click Go in the Process window.
- Debug your job as you would normally in TotalView.
When you are finished debugging your program, choose File > Exit to exit TotalView, and click Yes in the Exit dialog. As TotalView exits, it kills the pam process. In a few moments, LSF HPC detects that PAM has exited and your job exits as Done successfully.
Viewing source code while debugging
Use View > Lookup Function to view the source code of your application while debugging. Enter main in the Name field and click OK. TotalView finds the source code for the main() function and displays it in the Source Pane.
See the TotalView Users Guide for information about displaying source code.
Controlling and Monitoring Jobs Being Debugged in TotalView
Controlling jobs
While your job is running and you are using TotalView to debug it, you cannot use LSF HPC job control commands:
- bchkpnt and bmig are not supported
- Default TotalView signal processing prevents bstop and bresume from suspending and resuming jobs, and bkill from terminating jobs
- brequeue causes TotalView to display all jobs in error status. Click Go and the jobs will rerun.
- Job rerun within TotalView is not supported. Do not submit jobs for debugging to a rerunnable queue.
Monitoring jobs
Use bjobs to see the resource usage of jobs running under TotalView:
% bsub -n 2 -a tvmpich_gm mpirun.lsf ./cpi -tvopt -no_ask_on_dlopen
Job <365> is submitted to queue <hpc_linux>.
% bjobs -l 365
Job <365>, User <user1>, Project <default>, Status <DONE>, Queue <hpc_linux>, Command <totalview pam -no_ask_on_dlopen -a -g 1 -tv gmmpirun_wrapper ./cpi>
Fri Oct 11 15:46:47: Submitted from host <hostA>, CWD <$HOME>, 2 Processors Requested, Requested Resources <select[ (gm_ports > 0) ] rusage[gm_ports=1:duration=10]>;
Fri Oct 11 15:46:58: Started on 2 Hosts/Processors <hostA> <hostB>, Execution Home </home/user1>, Execution CWD </home/user1>;
Fri Oct 11 15:53:07: Done successfully. The CPU time used is 69.7 seconds.
SCHEDULING PARAMETERS:
          r15s r1m r15m ut pg io ls it tmp swp mem
loadSched -    -   -    -  -  -  -  -  -   -   -
loadStop  -    -   -    -  -  -  -  -  -   -   -
adapter_windows
loadSched - - -
loadStop - - -
% bsub -a tvpoe -n 4 mpirun.lsf $JOB
Job <341> is submitted to queue <hpc_ibm>.
% bjobs -l 341
Job <341>, User <user1>, Project <default>, Status <DONE>, Queue <hpc_ibm>, Command <totalview pam -a -g 1 -tv poejob /home/user1/cpi.poe >
Wed Oct 16 09:59:42: Submitted from host <hostA>, CWD </home/user1>, 4 Processors Requested;
Wed Oct 16 09:59:53: Started on 4 Hosts/Processors <hostA> <hostA> <hostA> <qataix05.lsf.platform.com>, Execution Home </home/user1>, Execution CWD </home/user1>;
Wed Oct 16 10:01:19: Done successfully. The CPU time used is 97.0 seconds.
SCHEDULING PARAMETERS:
          r15s r1m r15m ut pg io ls it tmp swp mem
loadSched -    -   -    -  -  -  -  -  -   -   -
loadStop  -    -   -    -  -  -  -  -  -   -   -
lammpi_load adapter_windows
loadSched - - -
loadStop - -
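If only the CPU time is of interest, it can be pulled out of the bjobs -l output with a small filter. This is a minimal sketch: the function name bjobs_cpu_time is hypothetical, and it relies only on the sentence "The CPU time used is N seconds." as it appears in the sample output above. Because bjobs -l can wrap long lines, the sentence may be split across lines in real output, in which case this filter prints nothing.

```shell
# Hypothetical helper: reads "bjobs -l" output on stdin and prints the
# CPU time in seconds for each line that contains the sentence
# "The CPU time used is N seconds." Prints nothing if no line matches.
bjobs_cpu_time() {
    sed -n 's/.*The CPU time used is \([0-9][0-9.]*\) seconds.*/\1/p'
}
```

A typical invocation would be bjobs -l 365 | bjobs_cpu_time.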
Date Modified: August 20, 2009
Platform Computing: www.platform.com
Platform Support: support@platform.com
Platform Information Development: doc@platform.com
Copyright © 1994-2009 Platform Computing Corporation. All rights reserved.