
Achieving Performance and Scalability


Optimizing Performance in Large Sites

As your site grows, you must tune your LSF cluster to support a large number of hosts and an increased workload.

This chapter discusses how to efficiently tune querying, scheduling, and event logging in a large cluster that scales to 5000 hosts and 100,000 jobs at any one time.

To optimize performance for a cluster of this size, you must tune both the operating system and LSF itself, as described in the sections that follow.

What's new in LSF performance?

LSF provides parameters for tuning your cluster, which you will learn about in this chapter. However, before you calculate the values to use for tuning your cluster, consider the enhancements that have been made to the general performance of LSF daemons, job dispatching, and event replay.

LIM startup time, in particular, is significantly improved by these performance enhancements.

Tuning UNIX for Large Clusters

The following hardware and software specifications are requirements for a large cluster that supports 5,000 hosts and 100,000 jobs at any one time.


Hardware recommendation

LSF master host:

Software requirement

To meet the performance requirements of a large cluster, increase the file descriptor limit of the operating system.

The file descriptor limit of most operating systems used to be fixed, with a limit of 1024 open files. Some operating systems, such as Linux and AIX, have removed this limit, allowing you to increase the number of file descriptors.

Increase the file descriptor limit
  1. To achieve efficient performance in LSF, follow the instructions in your operating system documentation to increase the number of file descriptors on the LSF master host.

    tip:  
    To optimize your configuration, set your file descriptor limit to a value at least as high as the number of hosts in your cluster.

    The following is an example configuration. The instructions vary for different operating systems, kernels, and shells. You may have already configured the host to use the maximum number of file descriptors allowed by the operating system. On some operating systems, the limit is configured dynamically.

    In this example, the cluster contains 5000 hosts and the master host runs Linux with kernel version 2.4.

  2. Log in to the LSF master host as the root user.
  3. Add the following line to your /etc/rc.d/rc.local startup script:

    echo -n "5120" > /proc/sys/fs/file-max 

  4. Restart the operating system to apply the changes.
  5. In the bash shell, instruct the operating system to use the new file limits:

    # ulimit -n unlimited 

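To confirm the new limits before starting LSF, you can check them from a shell on the master host. The following is a minimal sketch for the Linux example above; the commands are standard Linux and bash, not LSF-specific:

cat /proc/sys/fs/file-max     # system-wide maximum number of open files
ulimit -n                     # per-process limit in the current bash shell
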
Tuning LSF for Large Clusters

To enable and sustain large clusters, you need to tune LSF for efficient querying, dispatching, and event log management.


Managing scheduling performance

For fast job dispatching in a large cluster, configure the following parameters:

LSB_MAX_JOB_DISPATCH_PER_SESSION in lsf.conf

The maximum number of jobs the scheduler can dispatch in one scheduling session

Some operating systems, such as Linux and AIX, let you increase the number of file descriptors that can be allocated on the master host. You do not need to limit the number of file descriptors to 1024 if you want fast job dispatching. To take advantage of the greater number of file descriptors, you must set LSB_MAX_JOB_DISPATCH_PER_SESSION to a value greater than 300.

Set LSB_MAX_JOB_DISPATCH_PER_SESSION to one-half the value of MAX_SBD_CONNS. This setting configures mbatchd to dispatch jobs at a high rate while maintaining the processing speed of other mbatchd tasks.

MAX_SBD_CONNS in lsb.params

The maximum number of open file connections between mbatchd and sbatchd.

Specify a value equal to the number of hosts in your cluster plus a buffer. For example, if your cluster includes 4000 hosts, set:

MAX_SBD_CONNS=4100

LSF_SERVER_HOSTS in lsf.conf

Defining this parameter is highly recommended for large clusters because it decreases the load on the master LIM. It forces the client sbatchd to contact the local LIM for host status and load information. The client sbatchd contacts the master LIM or a LIM on one of the LSF_SERVER_HOSTS only if it cannot find the information locally.

Enable fast job dispatch
  1. Log in to the LSF master host as the root user.
  2. Increase the system-wide file descriptor limit of your operating system if you have not already done so.
  3. In lsb.params, set MAX_SBD_CONNS equal to the number of hosts in the cluster plus a buffer.
  4. In lsf.conf, set LSB_MAX_JOB_DISPATCH_PER_SESSION to a value greater than 300 and less than or equal to one-half the value of MAX_SBD_CONNS.

    For example, for a cluster with 4000 hosts:

    LSB_MAX_JOB_DISPATCH_PER_SESSION=2050 
    MAX_SBD_CONNS=4100 

  5. In lsf.conf, define LSF_SERVER_HOSTS to decrease the load on the master LIM.
  6. In the shell you used to increase the file descriptor limit, shut down the LSF batch daemons on the master host:

    badmin hshutdown

  7. Run badmin mbdrestart to restart the LSF batch daemons on the master host.
  8. Run badmin hrestart all to restart every sbatchd in the cluster.

    note:  
    When you shut down the batch daemons on the master host, all LSF services are temporarily unavailable, but existing jobs are not affected. When mbatchd is later started by sbatchd, its previous status is restored and job scheduling continues.
Enable continuous scheduling
  1. To enable the scheduler to run continuously, define the parameter JOB_SCHEDULING_INTERVAL=0 in lsb.params.

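As a summary of the preceding steps, the following sketch shows how the related settings might look for a 4000-host cluster. The host names hostA and hostB in LSF_SERVER_HOSTS are placeholders; substitute your own server hosts and adjust the values to your cluster size:

# lsb.params
Begin Parameters
MAX_SBD_CONNS=4100
JOB_SCHEDULING_INTERVAL=0
End Parameters

# lsf.conf
LSB_MAX_JOB_DISPATCH_PER_SESSION=2050
LSF_SERVER_HOSTS="hostA hostB"

After editing the files, apply the changes with badmin hshutdown, badmin mbdrestart, and badmin hrestart all, as described in the procedure above.
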
Limiting the number of batch queries

In large clusters, the volume of job queries can grow very quickly. If your site experiences heavy job query traffic, you can tune LSF to limit the number of job queries that mbatchd handles at one time. This decreases the load on the master host.

If a job information query is sent after the limit has been reached, an error message is displayed and mbatchd keeps retrying at one-second intervals. If the number of job queries later drops below the limit, mbatchd handles the query.

You define the maximum number of concurrent job queries that mbatchd handles with the parameter MAX_CONCURRENT_JOB_QUERY in lsb.params:

Syntax
MAX_CONCURRENT_JOB_QUERY=max_query 

Where:

max_query 

Specifies the maximum number of job queries that can be handled by mbatchd. Valid values are positive integers between 1 and 100. The default value is unlimited.

Examples
MAX_CONCURRENT_JOB_QUERY=20 

Specifies that no more than 20 queries can be handled by mbatchd.

MAX_CONCURRENT_JOB_QUERY=101 

Incorrect value. The default value will be used. An unlimited number of job queries will be handled by mbatchd.

Improving the speed of host status updates

To improve the speed with which mbatchd obtains and reports host status, configure the parameter LSB_SYNC_HOST_STAT_LIM in the file lsb.params. This also improves the speed with which LSF reschedules jobs: the sooner LSF knows that a host has become unavailable, the sooner LSF reschedules any rerunnable jobs executing on that host.

For example, during maintenance operations, the cluster administrator might need to shut down half of the hosts at once. LSF can quickly update the host status and reschedule any rerunnable jobs that were running on the unavailable hosts.

When you define this parameter, mbatchd periodically obtains the host status from the master LIM, and then verifies the status by polling each sbatchd at an interval defined by the parameters MBD_SLEEP_TIME and LSB_MAX_PROBE_SBD.

Managing your users' ability to move jobs in a queue

Setting JOB_POSITION_CONTROL_BY_ADMIN=Y allows an LSF administrator to control whether users can use btop and bbot to move jobs to the top and bottom of queues. When set, only the LSF administrator (including any queue administrators) can use bbot and btop to move jobs within a queue. A user attempting to use bbot or btop receives the error "User permission denied."

remember:  
You must be an LSF administrator to set this parameter.

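For example, a minimal sketch of this setting, assuming it is defined in the Parameters section of lsb.params:

# lsb.params
Begin Parameters
JOB_POSITION_CONTROL_BY_ADMIN=Y
End Parameters

Run badmin reconfig to apply the change.
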
Managing the number of pending reasons

For efficient, scalable management of pending reasons, use CONDENSE_PENDING_REASONS=Y in lsb.params to condense all the host-based pending reasons into one generic pending reason.

If a job has no other main pending reason, bjobs -p or bjobs -l will display the following:

Individual host based reasons 

If you condense host-based pending reasons, but require a full pending reason list, you can run the following command:

badmin diagnose <job_ID>

remember:  
You must be an LSF administrator or a queue administrator to run this command.

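For example, the following sketch condenses the pending reasons and then retrieves the full list for one job; the job ID 1234 is a placeholder:

# lsb.params
Begin Parameters
CONDENSE_PENDING_REASONS=Y
End Parameters

# Retrieve the full host-based pending reason list for job 1234
badmin diagnose 1234
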
Achieving efficient event switching

Periodic switching of the event file can weaken the performance of mbatchd, which automatically backs up and rewrites the events file after every 1000 batch job completions. The old lsb.events file is moved to lsb.events.1, and each old lsb.events.n file is moved to lsb.events.n+1.

Change the frequency of event switching with two parameters in lsb.params: MAX_JOB_NUM and MIN_SWITCH_PERIOD.

The two parameters work together. Specify the MIN_SWITCH_PERIOD value in seconds.

For example:

MAX_JOB_NUM=1000 
MIN_SWITCH_PERIOD=7200 

This instructs mbatchd to check every two hours whether the events file has logged 1000 batch job completions. Together, the two parameters control how frequently the events file is switched.

tip:  
For large clusters, set MIN_SWITCH_PERIOD to a value equal to or greater than 600 seconds. This causes mbatchd to fork a child process that handles event switching, thereby reducing the load on mbatchd. mbatchd terminates the child process and appends the delta events to the new events file after the MIN_SWITCH_PERIOD has elapsed. If you define a value less than 600 seconds, mbatchd does not fork a child process for event switching.

Automatic load updating

Periodically, the LIM daemons exchange load information. In large clusters, let LSF adjust the frequency of this exchange automatically, based on the current load.

important:  
For automatic tuning of the load exchange interval, make sure the parameter EXINTERVAL in the lsf.cluster.cluster_name file is not defined. Do not configure your cluster to exchange load information at specific intervals.

Managing the I/O performance of the info directory

In large clusters, users submit large numbers of jobs. Because each job generally has a job file, this results in a large number of job files stored in the LSB_SHAREDIR/cluster_name/logdir/info directory at any time. When the total size of the job files reaches a certain point, you will notice a significant delay when performing I/O operations in the info directory.

This delay is caused by a limit in the total size of files that can reside in a file server directory. This limit is dependent on the file system implementation. A high load on the file server delays the master batch daemon operations, and therefore slows down the overall cluster throughput.

You can prevent this delay by creating and using subdirectories under the parent directory. Each new subdirectory is subject to the file size limit, but the parent directory is not subject to the total file size of its subdirectories. Since the total file size of the info directory is divided among its subdirectories, your cluster can process more job operations before reaching the total size limit of the job files.

If your cluster has a lot of jobs resulting in a large info directory, you can tune your cluster by enabling LSF to create subdirectories in the info directory. Use MAX_INFO_DIRS in lsb.params to create the subdirectories and enable mbatchd to distribute the job files evenly throughout the subdirectories.

Syntax
MAX_INFO_DIRS=num_subdirs 

Where num_subdirs specifies the number of subdirectories that you want to create under the LSB_SHAREDIR/cluster_name/logdir/info directory. Valid values are positive integers between 1 and 1024. By default, MAX_INFO_DIRS is not defined.

Run badmin reconfig to create and use the subdirectories.

Duplicate event logging
note:  
If you enabled duplicate event logging, you must run badmin mbdrestart instead of badmin reconfig to restart mbatchd.

Run bparams -l to display the value of the MAX_INFO_DIRS parameter.

Example
MAX_INFO_DIRS=10 

mbatchd creates ten subdirectories from LSB_SHAREDIR/cluster_name/logdir/info/0 to LSB_SHAREDIR/cluster_name/logdir/info/9.

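Putting these steps together, a minimal sketch for a cluster that does not use duplicate event logging:

# lsb.params
Begin Parameters
MAX_INFO_DIRS=10
End Parameters

# Create the subdirectories, then verify the setting
badmin reconfig
bparams -l
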
Processor binding for LSF job processes

See also Processor Binding for Parallel Jobs.

Rapid progress in processor manufacturing technology has made it inexpensive to deploy LSF on hosts with multicore and multithreaded processors. However, the default soft affinity policy enforced by the operating system scheduler may not give optimal job performance. For example, the operating system scheduler may place all job processes on the same processor or core, leading to poor performance. Frequent process migration as the operating system schedules and reschedules work between cores can also invalidate the cache and drive up the cache miss rate.

Processor binding for LSF job processes takes advantage of the power of multiple processors and multiple cores to provide hard processor binding functionality for sequential LSF jobs and parallel jobs that run on a single host.

restriction:  
Processor binding is supported on hosts running Linux with kernel version 2.6 or higher.

For multi-host parallel jobs, LSF sets two environment variables ($LSB_BIND_JOB and $LSB_BIND_CPU_LIST) but does not attempt to bind the job to any host.

When processor binding for LSF job processes is enabled on supported hosts, job processes of an LSF job are bound to a processor according to the binding policy of the host. When an LSF job is completed (exited or done successfully) or suspended, the corresponding processes are unbound from the processor.

When a suspended LSF job is resumed, the corresponding processes are bound again to a processor. The process is not guaranteed to be bound to the same processor it was bound to before the job was suspended.

The processor binding affects the whole job process group. All job processes forked from the root job process (the job RES) are bound to the same processor.

Processor binding for LSF job processes does not bind daemon processes.

If processor binding is enabled, but the execution hosts do not support processor affinity, the configuration has no effect on the running processes. Processor binding has no effect on a single-processor host.

Processor, core, and thread-based CPU binding

By default, the number of CPUs on a host represents the number of physical processors the machine has. For LSF hosts with multiple processors, cores, and threads, the cluster administrator can define ncpus to count one of the following: physical processors (procs), cores (cores), or threads (threads).

Globally, this definition is controlled by the parameter EGO_DEFINE_NCPUS in lsf.conf or ego.conf. The default behavior for ncpus is to consider only the number of physical processors (EGO_DEFINE_NCPUS=procs).

tip:  
When PARALLEL_SCHED_BY_SLOT=Y in lsb.params, the resource requirement string keyword ncpus refers to the number of slots instead of the number of processors; however, lshosts output continues to show ncpus as defined by EGO_DEFINE_NCPUS in lsf.conf.

Binding job processes randomly to multiple processors, cores, or threads may affect job performance. Processor binding, configured with LSF_BIND_JOB in lsf.conf or BIND_JOB in lsb.applications, detects the EGO_DEFINE_NCPUS policy and binds the job processes by processor, core, or thread (PCT) accordingly.

For example, if a host's PCT policy is set to processor level (EGO_DEFINE_NCPUS=procs) and the binding option is set to BALANCE, the first job process is bound to the first physical processor, the second job process is bound to the second physical processor, and so on.

If the host's PCT policy is set to core level (EGO_DEFINE_NCPUS=cores) and the binding option is set to BALANCE, the first job process is bound to the first core on the first physical processor, the second job process is bound to the first core on the second physical processor, the third job process is bound to the second core on the first physical processor, and so on.

If the host's PCT policy is set to thread level (EGO_DEFINE_NCPUS=threads) and the binding option is set to BALANCE, the first job process is bound to the first thread on the first physical processor, the second job process is bound to the first thread on the second physical processor, the third job process is bound to the second thread on the first physical processor, and so on.

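For example, to make ncpus, and therefore the binding examples above, operate at the core level, you could set the following in lsf.conf. This is a sketch based on the values described above; reconfigure LIM (for example, with lsadmin reconfig) and check lshosts afterward to confirm the reported ncpus value:

# lsf.conf: count cores rather than physical processors
EGO_DEFINE_NCPUS=cores
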
BIND_JOB=BALANCE

The BIND_JOB=BALANCE option instructs LSF to bind the job based on the load of the available processors, cores, or threads. For each slot, LSF chooses the least loaded processor, core, or thread at the configured PCT level, so that job processes are spread evenly across the host.

If there is a single 2-processor quad-core host and you submit a parallel job with -n 2 -R"span[hosts=1]" when the PCT level is core, the job is bound to the first core on the first processor and the first core on the second processor.

After you submit another three jobs with -n 2 -R"span[hosts=1]", each of the eight cores has one bound job process.

If PARALLEL_SCHED_BY_SLOT=Y is set in lsb.params, the job specifies a maximum and minimum number of job slots instead of processors. If the MXJ value is set to 16 for this host (there are 16 job slots on this host), LSF can dispatch more jobs to it. Another job submitted to this host is bound to the first core on the first processor and the first core on the second processor.

BIND_JOB=PACK

The BIND_JOB=PACK option instructs LSF to try to pack all the processes onto a single processor. If this cannot be done, LSF tries to use as few processors as possible. Email is sent to you after job dispatch and when the job finishes. If no processors, cores, or threads are free (when the PCT level is processor, core, or thread level, respectively), LSF tries to use the BALANCE policy for the new job.

LSF depends on the order of processor IDs to pack jobs to a single processor.

If the PCT level is processor (the default value after installation), there is no difference between BALANCE and PACK.

This option binds jobs to a single processor where it makes sense, but does not oversubscribe the processors, cores, or threads. The other processors are used when they are needed. For instance, when the PCT level is core level, if you have a single 4-processor quad-core host and four sequential jobs are already bound to the first processor, the fifth through eighth sequential jobs are bound to the second processor.

If you submit three single-host parallel jobs with -n 2 -R"span[hosts=1]" when the PCT level is core level, the first job is bound to the first and second cores of the first processor, and the second job is bound to the third and fourth cores of the first processor. Binding the third job to the first processor would oversubscribe its cores, so the third job is bound to the first and second cores of the second processor.

After the first two jobs (JOB1 and JOB2) finish, if you submit another single-host parallel job with -n 2 -R"span[hosts=1]", the job is bound to the third and fourth cores of the second processor.

BIND_JOB=ANY

BIND_JOB=ANY binds the job to the first N available processors, cores, or threads with no regard for locality. If the PCT level is core, LSF binds the job to the first N available cores, regardless of whether they are on the same processor. LSF arranges the order based on the APIC ID.

If the PCT level is processor (the default value after installation), there is no difference between ANY and BALANCE.

For example, consider a single 2-processor quad-core host where the following table shows the relationship between the APIC ID and the logical processor and core IDs:

APIC ID    Processor ID    Core ID 
0          0               0 
1          0               1 
2          0               2 
3          0               3 
4          1               0 
5          1               1 
6          1               2 
7          1               3 

If the PCT level is core level and you submit two jobs to this host with -n 3 -R "span[hosts=1]", the first job is bound to the first, second, and third cores of the first physical processor, and the second job is bound to the fourth core of the first physical processor and the first and second cores of the second physical processor.

BIND_JOB=USER

BIND_JOB=USER binds the job to the value of $LSB_USER_BIND_JOB as specified in the user submission environment. This allows the administrator to delegate binding decisions to the actual user. The value must be one of Y, N, NONE, BALANCE, PACK, or ANY. Any other value is treated as ANY.

BIND_JOB=USER_CPU_LIST

BIND_JOB=USER_CPU_LIST binds the job to the explicit logical CPUs specified in the environment variable $LSB_USER_BIND_CPU_LIST. LSF does not check that the value is valid for the execution host(s); it is the user's responsibility to specify a correct CPU list for the hosts they select.

The correct format of $LSB_USER_BIND_CPU_LIST is a comma-separated list of CPU IDs, which can include ranges. For example, 0,5,7,9-11.

If the value's format is not correct or there is no such environment variable, jobs are not bound to any processor.

If the format is correct but cannot be mapped to any logical CPU, the binding fails. If it can be mapped to some CPUs, the job is bound to the mapped CPUs. For example, on a two-processor quad-core host with logical CPU IDs 0-7:

  1. If user1 specifies 9,10 in $LSB_USER_BIND_CPU_LIST, the job is not bound to any CPUs.
  2. If user2 specifies 1,2,9 in $LSB_USER_BIND_CPU_LIST, the job is bound to CPUs 1 and 2.

If the value's format is not correct or it does not apply to the execution host, the related information is added to the email sent to users after job dispatch and when the job finishes.

If the user specifies a minimum and a maximum number of processors for a single-host parallel job, LSF may allocate any number of processors between these two numbers to the job. In this case, LSF binds the job according to the CPU list specified by the user.

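For example, a user might set the CPU list in the submission environment before submitting the job. This is a sketch that assumes the cluster or application profile is already configured with the USER_CPU_LIST binding option; myjob is a placeholder command:

# Bind the job processes to logical CPUs 0, 5, 7, and 9 through 11
export LSB_USER_BIND_CPU_LIST="0,5,7,9-11"
bsub myjob
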
BIND_JOB=NONE

BIND_JOB=NONE is functionally equivalent to the former BIND_JOB=N, where processor binding is disabled.

Feature interactions
Enable processor binding for LSF job processes

LSF supports the following binding options for sequential jobs and parallel jobs that run on a single host: BALANCE, PACK, ANY, USER, USER_CPU_LIST, and NONE, as described in the previous sections.

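The following configuration sketch shows the two places where a binding option can be set: LSF_BIND_JOB in lsf.conf applies cluster-wide, and BIND_JOB in an lsb.applications application profile applies to jobs submitted to that profile. The profile name app1 is a placeholder:

# lsf.conf: cluster-wide binding policy
LSF_BIND_JOB=BALANCE

# lsb.applications: per-application binding policy
Begin Application
NAME = app1
DESCRIPTION = Jobs in this profile are packed onto as few processors as possible
BIND_JOB = PACK
End Application

Jobs submitted with bsub -app app1 use the profile's binding option.
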
Increasing the job ID limit

By default, LSF assigns job IDs up to 6 digits. This means that no more than 999999 jobs can be in the system at once. The job ID limit is the highest job ID that LSF will ever assign, and also the maximum number of jobs in the system.

LSF assigns job IDs in sequence. When the job ID limit is reached, the count rolls over, so the next job submitted gets job ID "1". If the original job 1 remains in the system, LSF skips that number and assigns job ID "2", or the next available job ID. If you have so many jobs in the system that the low job IDs are still in use when the maximum job ID is assigned, jobs with sequential numbers could have different submission times.

Increase the maximum job ID

You cannot lower the job ID limit, but you can raise it to 10 digits. This allows longer-term job accounting and analysis, lets you have more jobs in the system, and makes the job ID numbers roll over less often.

Use MAX_JOBID in lsb.params to specify any integer from 999999 to 2147483646 (for practical purposes, you can use any 10-digit integer less than this value).

Increase the job ID display length

By default, bjobs and bhist display job IDs with a maximum length of 7 characters. Job IDs greater than 9999999 are truncated on the left.

Use LSB_JOBID_DISP_LENGTH in lsf.conf to increase the width of the JOBID column in bjobs and bhist display. When LSB_JOBID_DISP_LENGTH=10, the width of the JOBID column in bjobs and bhist increases to 10 characters.

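For example, a sketch that raises the job ID limit to a 10-digit value and widens the bjobs and bhist display to match:

# lsb.params
Begin Parameters
MAX_JOBID=2000000000
End Parameters

# lsf.conf
LSB_JOBID_DISP_LENGTH=10
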
Monitoring Performance Metrics in Real Time

Enable metric collection

Set SCHED_METRIC_ENABLE=Y in lsb.params to enable performance metric collection.

Start performance metric collection dynamically:

badmin perfmon start sample_period

Optionally, you can set a sampling period, in seconds. If no sample period is specified, the default sample period set in SCHED_METRIC_SAMPLE_PERIOD in lsb.params is used.

Stop sampling:

badmin perfmon stop

SCHED_METRIC_ENABLE and SCHED_METRIC_SAMPLE_PERIOD can be specified independently. That is, you can specify SCHED_METRIC_SAMPLE_PERIOD without specifying SCHED_METRIC_ENABLE. In this case, when you turn on the feature dynamically (using badmin perfmon start), the sampling period value defined in SCHED_METRIC_SAMPLE_PERIOD is used.

badmin perfmon start and badmin perfmon stop override the configuration setting in lsb.params. Even if SCHED_METRIC_ENABLE is set, if you run badmin perfmon start, performance metric collection is started. If you run badmin perfmon stop, performance metric collection is stopped.

Tune the metric sampling period

Set SCHED_METRIC_SAMPLE_PERIOD in lsb.params to specify an initial cluster-wide performance metric sampling period.

Set a new sampling period in seconds:

badmin perfmon setperiod sample_period

Collecting and recording performance metric data may affect the performance of LSF. Smaller sampling periods will result in the lsb.streams file growing faster.

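For example, a typical interactive session might look like the following sketch, which starts collection with a 60-second sample period, checks the results, lengthens the period to reduce overhead, and then stops collection (badmin perfmon view is described in the next section):

badmin perfmon start 60
badmin perfmon view
badmin perfmon setperiod 120
badmin perfmon stop
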
Display current performance

Run badmin perfmon view to view real-time performance metric information. The metrics collected and recorded in each sample period appear in the output, as in the following example:

badmin perfmon view

Performance monitor start time:  Fri Jan 19 15:07:54 
End time of last sample period:  Fri Jan 19 15:25:55 
Sample period :                  60 Seconds 
------------------------------------------------------------------ 
Metrics                          Last    Max     Min     Avg     Total 
------------------------------------------------------------------ 
Total queries                      0      25       0       8      159 
Jobs information queries           0      13       0       2       46 
Hosts information queries          0       0       0       0        0 
Queue information queries          0       0       0       0        0 
Job submission requests            0      10       0       0       10 
Jobs submitted                     0     100       0       5      100 
Jobs dispatched                    0       0       0       0        0 
Jobs completed                     0      13       0       5      100 
Jobs sent to remote cluster        0      12       0       5      100 
Jobs accepted from remote cluster  0       0       0       0        0 
------------------------------------------------------------------ 
File Descriptor Metrics                Free     Used    Total 
------------------------------------------------------------------ 
MBD file descriptor usage               800      424     1024 

Performance metric information is calculated at the end of each sampling period. Running badmin perfmon view before the end of the current sampling period displays metric data collected from the sampling start time to the end of the last completed sample period.

If no metrics have been collected because the first sampling period has not yet ended, badmin perfmon view displays:

badmin perfmon view 
Performance monitor start time:  Thu Jan 25 22:11:12 
End time of last sample period:  Thu Jan 25 22:11:12 
Sample period :                  120 Seconds 
------------------------------------------------------------------ 
No performance metric data available. Please wait until first sample 
period ends. 
badmin perfmon output
Sample Period

Current sample period

Performance monitor start time

The start time of sampling

End time of last sample period

The end time of last sampling period

Metric

The name of the metric

Total

This is the accumulated metric counter value for each metric, counted from the performance monitor start time to the end time of the last sample period.

Last Period

Last sampling value of the metric. It is calculated per sampling period and is represented as the metric value per period, normalized to the length of the sample period.

Max

Maximum sampling value of the metric. It is re-evaluated in each sampling period by comparing Max and Last Period. It is represented as the metric value per period.

Min

Minimum sampling value of the metric. It is re-evaluated in each sampling period by comparing Min and Last Period. It is represented as the metric value per period.

Avg

Average sampling value of the metric. It is recalculated in each sampling period and is represented as the metric value per period, normalized to the length of the sample period.

Reconfiguring your cluster with performance metric sampling enabled

badmin mbdrestart

If performance metric sampling is enabled dynamically with badmin perfmon start, you must enable it again after running badmin mbdrestart. If performance metric sampling is enabled through the configuration (SCHED_METRIC_ENABLE=Y), the start time is reset to the point at which mbatchd is restarted.

badmin reconfig

If SCHED_METRIC_ENABLE and SCHED_METRIC_SAMPLE_PERIOD parameters are changed, badmin reconfig is the same as badmin mbdrestart.

Performance metric logging in lsb.streams

By default, collected metrics are written to lsb.streams. However, performance metric collection can still be turned on even if ENABLE_EVENT_STREAM=N is defined; in this case, no metric data is logged.

Job arrays

Only one submission request is counted. Element jobs are counted for jobs submitted, jobs dispatched, and jobs completed.

Job rerun

Job rerun occurs when an execution host becomes unavailable while a job is running. The job is returned to its original queue and dispatched again when a suitable host is available. In this case, LSF counts one submission request, one job submitted, n jobs dispatched, and n jobs completed, where n is the number of times the job runs before it finishes successfully.

Job requeue

Requeued jobs may be dispatched, run, and exit repeatedly because of errors. Because the job data always remains in memory, LSF counts only one job submission request and one job submitted, but counts each dispatch as a job dispatched.

For jobs completed, if a job is requeued with brequeue, LSF counts two jobs completed, because requeuing a job first kills the job and later puts it back into the pending list. If the job is automatically requeued, LSF counts one job completed when the job finishes successfully.

Job replay

When job replay is finished, replayed jobs are not counted as job submission requests or jobs submitted, but they are counted as jobs dispatched and jobs completed.

