Set SCHED_METRIC_ENABLE=Y in lsb.params to enable performance metric collection.
Start performance metric collection dynamically:
badmin perfmon start [sample_period]
Optionally, you can set a sampling period, in seconds. If no sample period is specified, the default sample period set in SCHED_METRIC_SAMPLE_PERIOD in lsb.params is used.
SCHED_METRIC_ENABLE and SCHED_METRIC_SAMPLE_PERIOD can be specified independently. That is, you can specify SCHED_METRIC_SAMPLE_PERIOD without specifying SCHED_METRIC_ENABLE. In this case, when you turn on the feature dynamically (using badmin perfmon start), the sampling period value defined in SCHED_METRIC_SAMPLE_PERIOD is used.
badmin perfmon start and badmin perfmon stop override the configuration setting in lsb.params. Regardless of the value of SCHED_METRIC_ENABLE, running badmin perfmon start starts performance metric collection, and running badmin perfmon stop stops it.
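The interaction between the lsb.params settings and the dynamic commands can be sketched as a small state model. This is an illustration only, not LSF code: the names PerfmonState, state_from_params, perfmon_start, and perfmon_stop are hypothetical, and treating an unset SCHED_METRIC_ENABLE as "not collecting" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PerfmonState:
    collecting: bool              # is metric collection currently running?
    sample_period: Optional[int]  # seconds; None if nothing has set it


def state_from_params(params: dict) -> PerfmonState:
    """Initial state derived from lsb.params-style settings (hypothetical helper)."""
    period = params.get("SCHED_METRIC_SAMPLE_PERIOD")
    return PerfmonState(
        # Assumption: an unset SCHED_METRIC_ENABLE means collection is off.
        collecting=params.get("SCHED_METRIC_ENABLE", "N").upper() == "Y",
        sample_period=int(period) if period is not None else None,
    )


def perfmon_start(state: PerfmonState, sample_period: Optional[int] = None) -> PerfmonState:
    """Model of `badmin perfmon start [sample_period]`."""
    # An explicit argument wins; otherwise the configured period is kept.
    period = sample_period if sample_period is not None else state.sample_period
    return PerfmonState(collecting=True, sample_period=period)


def perfmon_stop(state: PerfmonState) -> PerfmonState:
    """Model of `badmin perfmon stop`: collection stops whatever lsb.params says."""
    return PerfmonState(collecting=False, sample_period=state.sample_period)


# SCHED_METRIC_SAMPLE_PERIOD is set, SCHED_METRIC_ENABLE is not.
state = state_from_params({"SCHED_METRIC_SAMPLE_PERIOD": "120"})
started = perfmon_start(state)
# SCHED_METRIC_ENABLE=Y in the configuration, then stopped dynamically.
stopped = perfmon_stop(state_from_params({"SCHED_METRIC_ENABLE": "Y"}))
```

In this model, started is collecting with a 120-second period taken from the configuration, and stopped is not collecting even though the configuration enables the feature.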
Set SCHED_METRIC_SAMPLE_PERIOD in lsb.params to specify an initial cluster-wide performance metric sampling period.
Set a new sampling period in seconds:
badmin perfmon setperiod sample_period
Collecting and recording performance metric data may affect the performance of LSF. Smaller sampling periods will result in the lsb.streams file growing faster.
Performance monitor start time: Fri Jan 19 15:07:54
End time of last sample period: Fri Jan 19 15:25:55
Sample period : 60 Seconds
------------------------------------------------------------------
Metrics                             Last   Max   Min   Avg   Total
------------------------------------------------------------------
Total queries                          0    25     0     8     159
Jobs information queries               0    13     0     2      46
Hosts information queries              0     0     0     0       0
Queue information queries              0     0     0     0       0
Job submission requests                0    10     0     0      10
Jobs submitted                         0   100     0     5     100
Jobs dispatched                        0     0     0     0       0
Jobs completed                         0    13     0     5     100
Jobs sent to remote cluster            0    12     0     5     100
Jobs accepted from remote cluster      0     0     0     0       0
------------------------------------------------------------------
File Descriptor Metrics             Free  Used   Total
------------------------------------------------------------------
MBD file descriptor usage            800   424    1024
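For scripted monitoring, output of this shape can be turned into a dictionary. The following is a minimal sketch, assuming each metric row is a label followed by five whitespace-separated integers in the order Last, Max, Min, Avg, Total. The names SAMPLE, ROW, COLUMNS, and parse_metrics are hypothetical, and the sample text is abbreviated from the output above.

```python
import re

# Abbreviated copy of the sample output; a real script would capture the
# text printed by `badmin perfmon view` instead.
SAMPLE = """\
Performance monitor start time: Fri Jan 19 15:07:54
End time of last sample period: Fri Jan 19 15:25:55
Sample period : 60 Seconds
------------------------------------------------------------------
Metrics                             Last   Max   Min   Avg   Total
------------------------------------------------------------------
Total queries                          0    25     0     8     159
Jobs information queries               0    13     0     2      46
Jobs submitted                         0   100     0     5     100
------------------------------------------------------------------
File Descriptor Metrics             Free  Used   Total
------------------------------------------------------------------
MBD file descriptor usage            800   424    1024
"""

# Assumed row layout: a label with no digits, then exactly five integers.
ROW = re.compile(r"^(?P<name>\D+?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")
COLUMNS = ("Last", "Max", "Min", "Avg", "Total")


def parse_metrics(text: str) -> dict:
    """Return {metric name: {column: value}} for every five-column row."""
    metrics = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            values = [int(v) for v in match.groups()[1:]]
            metrics[match.group("name")] = dict(zip(COLUMNS, values))
    return metrics


metrics = parse_metrics(SAMPLE)
```

Header lines, separator lines, and the three-column file descriptor row do not match the five-integer pattern, so they are skipped rather than misread.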
Performance metrics information is calculated at the end of each sampling period. Running badmin perfmon view before the end of the sampling period displays metric data collected from the sampling start time to the end of the last sample period.
If no metrics have been collected because the first sampling period has not yet ended, badmin perfmon view displays:
Total: The accumulated counter value for each metric. It is counted from the Performance monitor start time to the End time of last sample period.
Last Period: The last sampling value of the metric. It is calculated per sampling period. It is represented as the metric value per period, and normalized by the following formula.
Max: The maximum sampling value of the metric. It is re-evaluated in each sampling period by comparing Max and Last Period. It is represented as the metric value per period.
Min: The minimum sampling value of the metric. It is re-evaluated in each sampling period by comparing Min and Last Period. It is represented as the metric value per period.
Avg: The average sampling value of the metric. It is recalculated in each sampling period. It is represented as the metric value per period, and normalized by the following formula.
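The normalization formulas themselves do not appear in this text. As a sanity check, the sample output shown earlier is consistent with the average being the total scaled to one sample period over the elapsed monitoring time, with the fraction dropped. The sketch below tests that reading. The formula, the truncation, and the name avg_per_period are assumptions inferred from the sample numbers, not a statement of how LSF computes the value.

```python
from datetime import timedelta

# Times of day taken from the sample output (same day, Fri Jan 19).
start = timedelta(hours=15, minutes=7, seconds=54)   # Performance monitor start time
end = timedelta(hours=15, minutes=25, seconds=55)    # End time of last sample period
elapsed_seconds = int((end - start).total_seconds())
SAMPLE_PERIOD = 60  # "Sample period : 60 Seconds"


def avg_per_period(total: int, elapsed: int, sample_period: int) -> int:
    """Assumed formula: Avg = Total / elapsed seconds * sample period, truncated."""
    return total * sample_period // elapsed


# (Total, Avg) pairs copied from the sample output.
sample_rows = {
    "Total queries": (159, 8),
    "Jobs information queries": (46, 2),
    "Job submission requests": (10, 0),
    "Jobs submitted": (100, 5),
    "Jobs completed": (100, 5),
}

# True for each row where the assumed formula reproduces the displayed Avg.
matches = {
    name: avg_per_period(total, elapsed_seconds, SAMPLE_PERIOD) == avg
    for name, (total, avg) in sample_rows.items()
}
```

For example, Total queries gives 159 * 60 / 1081 = 8.8, which is displayed as 8.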
Job rerun occurs when an execution host becomes unavailable while a job is running. The job is first returned to its original queue and is dispatched again when a suitable host is available. In this case, LSF counts only one submission request and one job submitted, but counts n jobs dispatched and n jobs completed (n represents the number of times the job reruns before it finishes successfully).
A requeued job may repeatedly be dispatched, run, and exit because of certain errors. Because the job data remains in memory, LSF counts only one job submission request and one job submitted, but counts more than one job dispatched.
For jobs completed, if a job is requeued with brequeue, LSF counts two jobs completed, because requeuing a job first kills the job and later puts the job into the pending list. If the job is automatically requeued, LSF counts one job completed when the job finishes successfully.
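The counting rules for rerun and requeued jobs can be summarized in a small model. This is a sketch for reasoning about the counters, not LSF code: JobCounters and expected_counters are hypothetical names, and extending the brequeue rule from "two jobs completed" for one requeue to "one completion per dispatch" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobCounters:
    submission_requests: int
    submitted: int
    dispatched: int
    completed: int


def expected_counters(scenario: str, dispatches: int) -> JobCounters:
    """Counters attributed to one job that was dispatched `dispatches` times.

    scenario is "rerun", "brequeue", or "auto_requeue"; the automatic
    requeue case assumes the job eventually finishes successfully.
    """
    if dispatches < 1:
        raise ValueError("a job must be dispatched at least once")
    if scenario in ("rerun", "brequeue"):
        # Rerun: n dispatched and n completed. brequeue: each kill counts
        # as a completion (assumed to generalize beyond a single requeue).
        completed = dispatches
    elif scenario == "auto_requeue":
        # Only the final, successful finish counts as a completion.
        completed = 1
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    # The job data stays in memory, so it is submitted exactly once.
    return JobCounters(1, 1, dispatches, completed)


manual = expected_counters("brequeue", 2)          # requeued once with brequeue
automatic = expected_counters("auto_requeue", 3)   # requeued twice automatically
```

In both cases the submission counters stay at one; only the dispatched and completed counters differ between the scenarios.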