Platform Computing Corp.

Working with Queues

Queue States

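Queue states, displayed by bqueues, describe the ability of a queue to accept and start batch jobs. A state combines two parts: Open or Closed (whether the queue accepts new jobs) and Active or Inactive (whether jobs in the queue can be started), for example Open:Active.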

Queue state can be changed by an LSF administrator or root.

Queues can also be activated and inactivated by run windows and dispatch windows (configured in lsb.queues, displayed by bqueues -l).

bqueues -l displays Inact_Adm when the queue has been explicitly inactivated by an administrator (badmin qinact), and Inact_Win when it has been inactivated by a run or dispatch window.
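
As an illustration of how the two parts of a queue state can be read by a script, the following minimal Python sketch splits a STATUS value into its accept and start components. The function name describe_status and the assumption that the detailed values appear in the form Open:Inact_Win are illustrative, not taken from LSF itself.

```python
def describe_status(status):
    """Split a STATUS value such as 'Open:Active' into two flags.

    Assumes the value has the form <Open|Closed>:<Active|Inactive|Inact_Adm|Inact_Win>.
    """
    accept, dispatch = status.split(":", 1)
    return {
        "accepts_jobs": accept == "Open",
        "starts_jobs": dispatch == "Active",
        # Inact_Adm and Inact_Win say why the queue is inactive.
        "reason": {"Inact_Adm": "administrator", "Inact_Win": "window"}.get(dispatch),
    }


print(describe_status("Open:Inact_Win"))
```

A queue that is Closed:Active rejects new submissions but can still start jobs that are already in it.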

Viewing Queue Information

The bqueues command displays information about queues. The bqueues -l option also gives current statistics about the jobs in a particular queue, such as the total number of jobs in the queue, the number of jobs running, suspended, and so on.

    To view the...                     Run...
    Available queues                   bqueues
    Queue status                       bqueues
    Detailed queue information         bqueues -l
    State change history of a queue    badmin qhist
    Queue administrators               bqueues -l for the queue

In addition to the procedures listed here, see the bqueues(1) man page for more details.

View available queues and queue status

  1. Run bqueues. You can view the current status of a particular queue or all queues. The bqueues command also displays available queues in the cluster.
    bqueues
    QUEUE_NAME   PRIO  STATUS        MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
    interactive  400   Open:Active   -   -    -    -    2      0     2    0
    priority     43    Open:Active   -   -    -    -    16     4     11   1
    night        40    Open:Inactive -   -    -    -    4      4     0    0
    short        35    Open:Active   -   -    -    -    6      1     5    0
    license      33    Open:Active   -   -    -    -    0      0     0    0
    normal       30    Open:Active   -   -    -    -    0      0     0    0
    idle         20    Open:Active   -   -    -    -    6      3     1    2 
     

    A dash (-) in any entry means that the column does not apply to the row. In this example, no queues have per-queue, per-user, per-processor, or per-host job limits configured, so the MAX, JL/U, JL/P, and JL/H entries are shown as a dash.
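
If you want to process this listing in a script, the following Python sketch shows one way to do it. It parses a pasted copy of part of the sample output above; the function name parse_bqueues is hypothetical, and the sketch assumes the simple whitespace-separated layout shown here.

```python
SAMPLE = """\
QUEUE_NAME   PRIO  STATUS        MAX JL/U JL/P JL/H NJOBS  PEND  RUN  SUSP
interactive  400   Open:Active   -   -    -    -    2      0     2    0
priority     43    Open:Active   -   -    -    -    16     4     11   1
night        40    Open:Inactive -   -    -    -    4      4     0    0
"""


def parse_bqueues(text):
    """Return one dict per queue row, keyed by the column headings."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        # A dash means the limit is not configured; store it as None.
        rows.append({k: (None if v == "-" else v) for k, v in row.items()})
    return rows


queues = parse_bqueues(SAMPLE)
inactive = [q["QUEUE_NAME"] for q in queues if q["STATUS"].endswith("Inactive")]
pending_total = sum(int(q["PEND"]) for q in queues)
print(inactive, pending_total)
```

On a live cluster, the text would come from running bqueues (for example through Python's subprocess module) instead of the SAMPLE string.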

Job slots required by parallel jobs
Important: A parallel job with N components requires N job slots.

View detailed queue information

  1. To see the complete status and configuration for each queue, run bqueues -l.
  2. Specify queue names to select specific queues. The following example displays details for the queue normal.

    bqueues -l normal
    QUEUE: normal
      --For normal low priority jobs, running only if hosts are lightly loaded. This is 
    the default queue.
    PARAMETERS/STATISTICS
    PRIO NICE  STATUS      MAX JL/U JL/P NJOBS  PEND  RUN SSUSP USUSP
    40   20    Open:Active 100 50   11   1      1     0   0     0
    Migration threshold is 30 min.
    
    CPULIMIT           RUNLIMIT
    20 min of IBM350   342800 min of IBM350
    
    FILELIMIT  DATALIMIT  STACKLIMIT  CORELIMIT  MEMLIMIT  PROCLIMIT
    20000 K    20000 K    2048 K      20000 K    5000 K    3
    
    SCHEDULING PARAMETERS
               r15s  r1m  r15m  ut   pg   io   ls  it  tmp  swp  mem
    loadSched  -     0.7  1.0   0.2  4.0  50   -   -   -    -    -
    loadStop   -     1.5  2.5   -    8.0  240  -   -   -    -    - 
               cpuspeed    bandwidth
    loadSched      -           -
    loadStop       -           -
    
    SCHEDULING POLICIES:  FAIRSHARE  PREEMPTIVE PREEMPTABLE EXCLUSIVE
    USER_SHARES:  [groupA, 70] [groupB, 15]  [default, 1]
    
    DEFAULT HOST SPECIFICATION : IBM350
    
    RUN_WINDOWS:  2:40-23:00 23:30-1:30
    DISPATCH_WINDOWS:  1:00-23:50
    
    USERS: groupA/ groupB/ user5
    HOSTS:  hostA, hostD, hostB
    ADMINISTRATORS:  user7
    PRE_EXEC: /tmp/apex_pre.x > /tmp/preexec.log 2>&1
    POST_EXEC:  /tmp/apex_post.x > /tmp/postexec.log 2>&1
    REQUEUE_EXIT_VALUES:  45 
    

View the state change history of a queue

  1. Run badmin qhist to display the times when queues are opened, closed, activated, and inactivated.
    badmin qhist
    Wed Mar 31 09:03:14: Queue <normal> closed by user or 
    administrator <root>. 
    Wed Mar 31 09:03:29: Queue <normal> opened by user or 
    administrator <root>. 
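
The history lines above follow a regular pattern, so they can be turned into structured records. The Python sketch below is a minimal example that works on a pasted copy of the sample output; the regular expression is an assumption based only on the two lines shown and may need adjusting for other message variants.

```python
import re

SAMPLE = """\
Wed Mar 31 09:03:14: Queue <normal> closed by user or
administrator <root>.
Wed Mar 31 09:03:29: Queue <normal> opened by user or
administrator <root>.
"""

# timestamp, queue name, action, and the user named in angle brackets
PATTERN = re.compile(
    r"(\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}): Queue <([^>]+)> (\w+) by [^<]*<([^>]+)>"
)


def parse_qhist(text):
    """Return (timestamp, queue, action, user) tuples from qhist output."""
    joined = " ".join(text.split())  # undo line wrapping
    return [m.groups() for m in PATTERN.finditer(joined)]


events = parse_qhist(SAMPLE)
for event in events:
    print(event)
```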
    

View queue administrators

  1. Run bqueues -l for the queue.
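
The administrators appear on the ADMINISTRATORS line of the detailed output. As a small sketch, the following Python function (queue_admins is a hypothetical name) extracts that line from text copied out of the bqueues -l example shown earlier in this section.

```python
import re

SAMPLE = """\
USERS: groupA/ groupB/ user5
HOSTS:  hostA, hostD, hostB
ADMINISTRATORS:  user7
PRE_EXEC: /tmp/apex_pre.x > /tmp/preexec.log 2>&1
"""


def queue_admins(bqueues_l_output):
    """Return the names on the ADMINISTRATORS line, or [] if there is none."""
    match = re.search(r"^\s*ADMINISTRATORS:\s*(.*)$", bqueues_l_output, re.MULTILINE)
    return match.group(1).split() if match else []


print(queue_admins(SAMPLE))
```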

View exception status for queues (bqueues)

  1. Use bqueues to display the configured threshold for job exceptions and the current number of jobs in the queue in each exception state.
    For example, queue normal configures a JOB_IDLE threshold of 0.10, a JOB_OVERRUN threshold of 5 minutes, and a JOB_UNDERRUN threshold of 2 minutes. The following bqueues command shows no overrun jobs, one job that finished in less than 2 minutes (underrun), and one job that triggered an idle exception (idle factor less than 0.10):

    bqueues -l normal
    
    QUEUE: normal
      -- For normal low priority jobs, running only if hosts are lightly loaded.  This 
    is the default queue.
    
    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV 
     30   20  Open:Active       -    -    -    -     0     0     0     0     0    0
    
     STACKLIMIT MEMLIMIT
       2048 K     5000 K
    
    SCHEDULING PARAMETERS
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -  
     loadStop    -     -     -     -       -     -    -     -     -      -      -   
                    cpuspeed    bandwidth
     loadSched          -            -
     loadStop           -            -
     
    JOB EXCEPTION PARAMETERS 
                 OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
     Threshold         5         2          0.10
          Jobs         0         1             1
    
    USERS:  all users
    HOSTS:  all allremote 
    CHUNK_JOB_SIZE: 3 
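
To monitor exception counts from a script, the JOB EXCEPTION PARAMETERS table can be parsed directly. The following Python sketch works on a pasted copy of the table above; parse_exceptions is a hypothetical helper and assumes the three-column layout shown in this example.

```python
SAMPLE = """\
JOB EXCEPTION PARAMETERS
             OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
 Threshold         5         2          0.10
      Jobs         0         1             1
"""


def parse_exceptions(text):
    """Map each exception type to its configured threshold and job count."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    start = next(i for i, r in enumerate(rows) if r == ["JOB", "EXCEPTION", "PARAMETERS"])
    names = [col.split("(")[0].lower() for col in rows[start + 1]]
    thresholds = rows[start + 2][1:]  # skip the "Threshold" label
    counts = rows[start + 3][1:]      # skip the "Jobs" label
    return {
        name: {
            # A dash means no threshold is configured for this exception.
            "threshold": None if t == "-" else float(t),
            "jobs": None if c == "-" else int(c),
        }
        for name, t, c in zip(names, thresholds, counts)
    }


result = parse_exceptions(SAMPLE)
flagged = [name for name, v in result.items() if v["jobs"]]
print(flagged)
```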
    

Control Queues

Queues are controlled by an LSF Administrator or root issuing a command or through configured dispatch and run windows.

Close a queue

  1. Run badmin qclose:
    badmin qclose normal
    Queue <normal> is closed 
     

    When a user tries to submit a job to a closed queue, the following message is displayed:

    bsub -q normal ...
    normal: Queue has been closed

Open a queue

  1. Run badmin qopen:
    badmin qopen normal
    Queue <normal> is opened 
    

Inactivate a queue

  1. Run badmin qinact:
    badmin qinact normal
    Queue <normal> is inactivated 
    

Activate a queue

  1. Run badmin qact:
    badmin qact normal
    Queue <normal> is activated 
    

Log a comment when controlling a queue

  1. Use the -C option of badmin queue commands qclose, qopen, qact, and qinact to log an administrator comment in lsb.events.
    badmin qclose -C "change configuration" normal

    The comment text change configuration is recorded in lsb.events.

    A new event record is recorded for each queue event. For example:

    badmin qclose -C "add user" normal

    followed by

    badmin qclose -C "add user user1" normal

    will generate records in lsb.events:

    "QUEUE_CTRL" "7.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"
    "QUEUE_CTRL" "7.0" 1050082380 1 "normal" 32185 "lsfadmin" "add user user1"
  2. Use badmin hist or badmin qhist to display administrator comments for closing and opening queues.
    badmin qhist
    Fri Apr  4 10:50:36: Queue <normal> closed by administrator 
    <lsfadmin> change configuration. 
     

    bqueues -l also displays the comment text:

    bqueues -l normal

    QUEUE: normal
      -- For normal low priority jobs, running only if hosts are lightly loaded.  This
    is the default queue.

    PARAMETERS/STATISTICS
    PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV
     30   20  Closed:Active     -    -    -    -     0     0     0     0     0    0
    Interval for a host to accept two jobs is 0 seconds

    THREADLIMIT
        7

    SCHEDULING PARAMETERS
               r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
     loadSched   -     -     -     -       -     -    -     -     -      -      -
     loadStop    -     -     -     -       -     -    -     -     -      -      -

                    cpuspeed    bandwidth
     loadSched          -            -
     loadStop           -            -

    JOB EXCEPTION PARAMETERS
                 OVERRUN(min) UNDERRUN(min) IDLE(cputime/runtime)
     Threshold         -         2          -
          Jobs         -         0          -

    USERS:  all users
    HOSTS:  all
    RES_REQ:  select[type==any]

    ADMIN ACTION COMMENT: "change configuration"
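
If you need to audit administrator comments, the QUEUE_CTRL records written to lsb.events can be read with a short script. The Python sketch below assumes each record is a line of space-separated fields in which strings are double-quoted and the version is a quoted field of its own. The field names in FIELDS are labels chosen for this example from the values visible in the records; they are not confirmed by this document, and the meaning of the numeric code after the timestamp is left uninterpreted.

```python
import shlex
from datetime import datetime, timezone

RECORD = '"QUEUE_CTRL" "7.0" 1050082373 1 "normal" 32185 "lsfadmin" "add user"'

# Hypothetical labels for the eight fields, in order of appearance.
FIELDS = ["event", "version", "time", "op_code", "queue", "user_id", "user_name", "comment"]


def parse_queue_ctrl(line):
    """Split one QUEUE_CTRL record into a dict and decode its timestamp."""
    rec = dict(zip(FIELDS, shlex.split(line)))
    # The third field looks like seconds since the UNIX epoch (assumption).
    rec["time"] = datetime.fromtimestamp(int(rec["time"]), tz=timezone.utc)
    return rec


rec = parse_queue_ctrl(RECORD)
print(rec["queue"], rec["user_name"], rec["comment"], rec["time"].isoformat())
```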

Configure Dispatch Windows

A dispatch window specifies one or more time periods during which batch jobs are dispatched to run on hosts. Jobs are not dispatched outside of configured windows. Dispatch windows do not affect job submission or running jobs, which are allowed to run until completion. By default, queues are always Active; you must explicitly configure dispatch windows in the queue to specify a time when the queue is Inactive.

To configure a dispatch window:

  1. Edit lsb.queues.
  2. Create a DISPATCH_WINDOW keyword for the queue and specify one or more time windows.
    Begin Queue
    QUEUE_NAME   = queue1
    PRIORITY     = 45
    DISPATCH_WINDOW = 4:30-12:00
    End Queue

  3. Reconfigure the cluster:
    1. Run lsadmin reconfig.
    2. Run badmin reconfig.
  4. Run bqueues -l to display the dispatch windows.
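
To reason about when a queue with time windows is Active, it can help to model the windows in a few lines of code. The Python sketch below handles only the hour:minute-hour:minute form used in this section, including windows that wrap past midnight such as 23:30-1:30. Treating the start time as inside the window and the end time as outside is an assumption made for this example, and the function names are hypothetical.

```python
def _minutes(hhmm):
    """Convert 'H:MM' to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def in_window(window, now):
    """True if `now` ('H:MM') falls inside `window` ('H:MM-H:MM')."""
    start, end = (_minutes(part) for part in window.split("-"))
    t = _minutes(now)
    if start <= end:
        return start <= t < end
    # The window wraps past midnight, for example 23:30-1:30.
    return t >= start or t < end


def queue_active(windows, now):
    """True if `now` is inside any of the space-separated windows."""
    return any(in_window(w, now) for w in windows.split())


print(queue_active("4:30-12:00", "5:00"))
print(queue_active("2:40-23:00 23:30-1:30", "23:15"))
```

The same check applies to run windows, which use the same time window notation in the examples in this section.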

Configure Run Windows

A run window specifies one or more time periods during which jobs dispatched from a queue are allowed to run. When a run window closes, running jobs are suspended, and pending jobs remain pending. The suspended jobs are resumed when the window opens again. By default, queues are always Active and jobs can run until completion. You must explicitly configure run windows in the queue to specify a time when the queue is Inactive.

To configure a run window:

  1. Edit lsb.queues.
  2. Create a RUN_WINDOW keyword for the queue and specify one or more time windows.
    Begin Queue
    QUEUE_NAME   = queue1
    PRIORITY     = 45
    RUN_WINDOW   = 4:30-12:00
    End Queue

  3. Reconfigure the cluster:
    1. Run lsadmin reconfig.
    2. Run badmin reconfig.
  4. Run bqueues -l to display the run windows.

Add and Remove Queues

Add a queue

  1. Log in as the LSF administrator on any host in the cluster.
  2. Edit lsb.queues to add the new queue definition.
    You can copy another queue definition from this file as a starting point; remember to change the QUEUE_NAME of the copied queue.

  3. Save the changes to lsb.queues.
  4. Run badmin reconfig to reconfigure mbatchd.
    Adding a queue does not affect pending or running jobs.
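
Because a copied queue definition is easy to leave with its old name, a quick check for duplicate QUEUE_NAME values can be useful before reconfiguring. The Python sketch below is an informal aid only. It assumes the simple Begin Queue / End Queue layout with KEY = value lines and # comments, and the helper names are hypothetical.

```python
from collections import Counter

SAMPLE = """\
Begin Queue
QUEUE_NAME   = queue1
PRIORITY     = 45
End Queue

Begin Queue
QUEUE_NAME   = queue1   # copied, name not changed yet
PRIORITY     = 30
End Queue

Begin Queue
QUEUE_NAME   = night
End Queue
"""


def queue_names(lsb_queues_text):
    """Return every QUEUE_NAME value found inside a Queue section."""
    names, in_queue = [], False
    for raw in lsb_queues_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.split() == ["Begin", "Queue"]:
            in_queue = True
        elif line.split() == ["End", "Queue"]:
            in_queue = False
        elif in_queue and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "QUEUE_NAME":
                names.append(value)
    return names


def duplicate_names(lsb_queues_text):
    """Return queue names that are defined more than once."""
    return [n for n, count in Counter(queue_names(lsb_queues_text)).items() if count > 1]


print(duplicate_names(SAMPLE))
```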

Remove a queue

Important: Before removing a queue, make sure there are no jobs in that queue.

If there are jobs in the queue, move pending and running jobs to another queue, then remove the queue. If you remove a queue that has jobs in it, the jobs are temporarily moved to a queue named lost_and_found. Jobs in the lost_and_found queue remain pending until the user or the LSF administrator uses the bswitch command to switch the jobs into an existing queue. Jobs in other queues are not affected.

  1. Log in as the LSF administrator on any host in the cluster.
  2. Close the queue to prevent any new jobs from being submitted.
    badmin qclose night
    Queue <night> is closed

  3. Move all pending and running jobs into another queue.
    Below, the bswitch -q night argument chooses jobs from the night queue, and the job ID number 0 specifies that all jobs should be switched:

    bjobs -u all -q night
    JOBID USER  STAT  QUEUE FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    5308  user5 RUN   night hostA       hostD       job5       Nov 21 18:16
    5310  user5 PEND  night hostA       hostC       job10      Nov 21 18:17

    bswitch -q night idle 0
    Job <5308> is switched to queue <idle>
    Job <5310> is switched to queue <idle>

  4. Edit lsb.queues and remove or comment out the definition for the queue being removed.
  5. Save the changes to lsb.queues.
  6. Run badmin reconfig to reconfigure mbatchd.
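
The command-line part of this procedure can be written down as a small plan before anything is run. The Python sketch below only builds and prints the commands used in the steps above; it does not execute them, and the manual edit of lsb.queues still has to happen before the final reconfiguration. The function name removal_plan is hypothetical.

```python
import shlex


def removal_plan(queue, target):
    """Return the commands, in order, for emptying and removing `queue`."""
    return [
        ["badmin", "qclose", queue],            # stop new submissions
        ["bswitch", "-q", queue, target, "0"],  # job ID 0 switches all jobs
        # ... edit lsb.queues by hand here to remove the queue definition ...
        ["badmin", "reconfig"],                 # reconfigure mbatchd
    ]


for argv in removal_plan("night", "idle"):
    print(shlex.join(argv))
```

Printing the plan first makes it easy to review the queue names before running each command by hand.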

Manage Queues

Restrict host use by queues

You may want a host to be used only to run jobs submitted to specific queues. For example, if you just added a host for a specific department such as engineering, you may only want jobs submitted to the queues engineering1 and engineering2 to be able to run on the host.

  1. Log on as root or the LSF administrator on any host in the cluster.
  2. Edit lsb.queues, and add the host to the HOSTS parameter of specific queues.
    Begin Queue
    QUEUE_NAME = queue1
    ...
    HOSTS=mynewhost hostA hostB
    ...
    End Queue

  3. Save the changes to lsb.queues.
  4. Use badmin ckconfig to check the new queue definition. If any errors are reported, fix the problem and check the configuration again.
  5. Run badmin reconfig to reconfigure mbatchd.
    If you add a host to a queue, the new host will not be recognized by jobs that were submitted before you reconfigured. If you want the new host to be recognized, you must use the command badmin mbdrestart.
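
To confirm which queues name a given host, you can scan lsb.queues for HOSTS lines. The following Python sketch is a rough aid, not a replacement for badmin ckconfig. It assumes plain space-separated host names in HOSTS and reports only queues that list the host explicitly; queues with no HOSTS line, host groups, and other HOSTS syntax are not handled. The sample configuration and the function name queues_for_host are hypothetical.

```python
SAMPLE = """\
Begin Queue
QUEUE_NAME = engineering1
HOSTS=mynewhost hostA hostB
End Queue

Begin Queue
QUEUE_NAME = engineering2
HOSTS = mynewhost
End Queue

Begin Queue
QUEUE_NAME = night
HOSTS = hostC
End Queue
"""


def queues_for_host(lsb_queues_text, host):
    """Return the names of queues whose HOSTS line lists `host`."""
    result, name, hosts = [], None, []
    for raw in lsb_queues_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.split() == ["Begin", "Queue"]:
            name, hosts = None, []
        elif line.split() == ["End", "Queue"]:
            if name and host in hosts:
                result.append(name)
        elif "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "QUEUE_NAME":
                name = value
            elif key == "HOSTS":
                hosts = value.split()
    return result


print(queues_for_host(SAMPLE, "mynewhost"))
```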

Add queue administrators

Queue administrators are optionally configured after installation. They have limited privileges; they can perform administrative operations (open, close, activate, inactivate) on the specified queue, or on jobs running in the specified queue. Queue administrators cannot modify configuration files, or operate on LSF daemons or on queues they are not configured to administer.

To switch a job from one queue to another, you must have administrator privileges for both queues.

  1. In the lsb.queues file, between Begin Queue and End Queue for the appropriate queue, specify the ADMINISTRATORS parameter, followed by the list of administrators for that queue. Separate the administrator names with a space. You can specify user names and group names.
    Begin Queue
    ADMINISTRATORS = User1 GroupA
    End Queue 
    

Handling Job Exceptions in Queues

You can configure queues so that LSF detects exceptional conditions while jobs are running, and takes appropriate action automatically. You can customize which exceptions are detected, and the corresponding actions. By default, LSF does not detect any exceptions.

Job exceptions LSF can detect

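If you configure job exception handling in your queues, LSF detects the following job exceptions: job underrun (a job exits sooner than expected), job overrun (a job runs longer than expected), and idle job (a job uses less CPU time than expected for its run time).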

Configuring job exception handling (lsb.queues)

You can configure your queues to detect job exceptions. Use the following parameters:

JOB_IDLE

Specify a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.

JOB_OVERRUN

Specify a threshold for job overrun. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.

JOB_UNDERRUN

Specify a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.

Example

The following queue defines thresholds for all types of job exceptions:

Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN  = 5
JOB_IDLE     = 0.10
...
End Queue 

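For this queue, a job that exits in less than 2 minutes triggers an underrun exception, a job that runs longer than 5 minutes triggers an overrun exception, and a job whose idle factor (CPU time/runtime) is less than 0.10 triggers an idle exception.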
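
The three thresholds can be expressed as simple comparisons. The Python sketch below uses the values from the example queue as defaults and is meant only to illustrate the rules described for JOB_UNDERRUN, JOB_OVERRUN, and JOB_IDLE. It is not how LSF implements the checks, and applying the idle test only to jobs that are still running is an assumption of this example.

```python
def job_exceptions(run_minutes, cpu_minutes, finished,
                   underrun=2, overrun=5, idle=0.10):
    """Return the exception names a job would match under the given thresholds."""
    found = []
    # Underrun: the job exited before the configured number of minutes.
    if finished and run_minutes < underrun:
        found.append("underrun")
    # Overrun: the job has run longer than the configured run time.
    if run_minutes > overrun:
        found.append("overrun")
    # Idle: the idle factor (CPU time / run time) is below the threshold.
    if not finished and run_minutes > 0 and cpu_minutes / run_minutes < idle:
        found.append("idle")
    return found


print(job_exceptions(1, 0.9, finished=True))
print(job_exceptions(10, 8, finished=False))
print(job_exceptions(4, 0.2, finished=False))
```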

Configuring thresholds for job exception handling

By default, LSF checks for job exceptions every 1 minute. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.

Tuning
Tip: Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.

Platform Computing Inc.
www.platform.com