Job forwarding model overview

In the job forwarding model, a cluster that is short of resources sends jobs to a cluster that has resources to spare. The execution cluster returns job status, pending reason, and resource usage to the submission cluster. When the job is done, the exit code is also returned to the submission cluster.

Tracking

bhosts

By default, bhosts shows information about hosts and resources that are available to the local cluster and information about jobs that are scheduled by the local cluster.

bjobs

The bjobs command shows all jobs associated with hosts in the cluster, including MultiCluster jobs. Jobs from remote clusters can be identified by the FROM_HOST column, which shows the remote cluster name and the submission or consumer cluster job ID in the format host_name@remote_cluster_name:remote_job_ID.
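The FROM_HOST format described above can be split into its three parts with a small parser. This is a minimal sketch, not part of LSF: the function name parse_from_host, the RemoteOrigin type, and the sample value hostA@cluster1:1234 are all hypothetical, and the sketch assumes that host and cluster names contain no "@", ":", or whitespace.

```python
import re
from typing import NamedTuple, Optional


class RemoteOrigin(NamedTuple):
    """Parts of a FROM_HOST value for a job that came from a remote cluster."""
    host: str
    cluster: str
    job_id: int


# host_name@remote_cluster_name:remote_job_ID
# Assumption: names contain no "@", ":", or whitespace.
_FROM_HOST_RE = re.compile(
    r"^(?P<host>[^@:\s]+)@(?P<cluster>[^@:\s]+):(?P<job_id>\d+)$"
)


def parse_from_host(value: str) -> Optional[RemoteOrigin]:
    """Return the remote origin, or None if the value is a plain local host name."""
    match = _FROM_HOST_RE.match(value.strip())
    if match is None:
        return None
    return RemoteOrigin(match["host"], match["cluster"], int(match["job_id"]))


if __name__ == "__main__":
    # Hypothetical values, not taken from real bjobs output.
    print(parse_from_host("hostA@cluster1:1234"))
    print(parse_from_host("hostB"))
```

A value that does not match the pattern is treated as a local host name, so the function returns None instead of raising an error.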

If the MultiCluster job is running under the job forwarding model, the QUEUE column shows a local queue, but if the MultiCluster job is running under the resource leasing model, the name of the remote queue is shown in the format queue_name@remote_cluster_name.
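The QUEUE column rule above can be sketched as a small helper. This is an illustration only: split_queue and describe_queue are hypothetical names, and the queue and cluster names in the usage lines are made up. A value without "@" only tells you that the queue is local; it does not by itself prove that the job was forwarded.

```python
from typing import Optional, Tuple


def split_queue(value: str) -> Tuple[str, Optional[str]]:
    """Split a QUEUE value into (queue_name, remote_cluster_name or None)."""
    queue, sep, cluster = value.strip().partition("@")
    return queue, (cluster if sep else None)


def describe_queue(value: str) -> str:
    """Describe a QUEUE value using the rule from the text above."""
    queue, cluster = split_queue(value)
    if cluster is None:
        # Local queue: a local job, or a job under the job forwarding model.
        return f"local queue {queue}"
    # queue_name@remote_cluster_name: resource leasing model.
    return f"remote queue {queue} in cluster {cluster} (resource leasing model)"


if __name__ == "__main__":
    # Hypothetical queue and cluster names.
    print(describe_queue("normal"))
    print(describe_queue("lease_q@cluster2"))
```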

Use -w or -l to prevent the MultiCluster information from being truncated.

bclusters

The bclusters command displays remote resource provider and consumer information, resource flow information, and the connection status between the local and remote clusters.

Use -app to view available application profiles in remote clusters.

Information related to the job forwarding model is displayed under the heading Job Forwarding Information.

  • LOCAL_QUEUE: Name of a local MultiCluster send-jobs or receive-jobs queue.

  • JOB_FLOW: Indicates direction of job flow.

    • send

      The local queue is a MultiCluster send-jobs queue (SNDJOBS_TO is defined in the local queue).

    • recv

      The local queue is a MultiCluster receive-jobs queue (RCVJOBS_FROM is defined in the local queue).

  • REMOTE: For send-jobs queues, shows the name of the receive-jobs queue in a remote cluster.

    For receive-jobs queues, always “-”.

  • CLUSTER: For send-jobs queues, shows the name of the remote cluster containing the receive-jobs queue.

    For receive-jobs queues, shows the name of the remote cluster that can send jobs to the local queue.

  • STATUS: Indicates the connection status between the local queue and remote queue.

    • ok

      The two clusters can exchange information and the system is properly configured.

    • disc

      Communication between the two clusters has not been established. This could occur because there are no jobs waiting to be dispatched, or because the remote master cannot be located.

    • reject

      The remote queue rejects jobs from the send-jobs queue. The local queue and remote queue are connected and the clusters communicate, but the queue-level configuration is not correct. For example, the send-jobs queue in the submission cluster points to a receive-jobs queue that does not exist in the remote cluster.

      If the job is rejected, it returns to the submission cluster.
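The fields described above can be modeled as a small record, with the STATUS values mapped to their meaning. This is a minimal sketch under stated assumptions: each data row is assumed to hold exactly five whitespace-separated fields in the order LOCAL_QUEUE, JOB_FLOW, REMOTE, CLUSTER, STATUS, and the names ForwardingRow, parse_row, and summarize, as well as the sample rows, are hypothetical.

```python
from dataclasses import dataclass

# Meanings paraphrased from the STATUS descriptions above.
STATUS_MEANING = {
    "ok": "clusters can exchange information and the system is properly configured",
    "disc": "communication between the two clusters has not been established",
    "reject": "the remote queue rejects jobs; check the queue-level configuration",
}


@dataclass(frozen=True)
class ForwardingRow:
    local_queue: str
    job_flow: str  # "send" or "recv"
    remote: str    # "-" for receive-jobs queues
    cluster: str
    status: str


def parse_row(line: str) -> ForwardingRow:
    """Parse one data row; assumes five whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return ForwardingRow(*fields)


def summarize(row: ForwardingRow) -> str:
    """Build a one-line, human-readable summary of a row."""
    meaning = STATUS_MEANING.get(row.status, "unknown status")
    if row.job_flow == "send":
        return f"{row.local_queue} sends jobs to {row.remote}@{row.cluster}: {meaning}"
    return f"{row.local_queue} receives jobs from {row.cluster}: {meaning}"


if __name__ == "__main__":
    # Hypothetical rows, not taken from real bclusters output.
    for sample in ("Queue1 send Queue2 cluster2 ok", "Queue3 recv - cluster2 disc"):
        print(summarize(parse_row(sample)))
```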

For example, to see how bclusters -app reports application profiles, consider the following application profile configurations:

  • On the submission cluster (Cluster1) in the lsb.applications file:

    Begin Application
    NAME         = fluent
    DESCRIPTION  = FLUENT Version 6.2
    CPULIMIT     = 180/bp860-10      # 3 hours of bp860-10
    FILELIMIT    = 20000
    DATALIMIT    = 20000          # jobs data segment limit
    CORELIMIT    = 20000
    PROCLIMIT    = 5              # job processor limit
    PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
    POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hi"
    REQUEUE_EXIT_VALUES = 55 34 78
    End Application
    Begin Application
    NAME         = catia
    DESCRIPTION  = CATIA V5
    CPULIMIT     = 24:0/bp860-10     # 24 hours of bp860-10
    FILELIMIT    = 20000
    DATALIMIT    = 20000           # jobs data segment limit
    CORELIMIT    = 20000
    PROCLIMIT    = 5              # job processor limit
    PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
    POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hi"
    REQUEUE_EXIT_VALUES = 55 34 78
    End Application
    Begin Application
    NAME         = djob
    DESCRIPTION  = distributed jobs
    FILELIMIT    = 20000
    DATALIMIT    = 2000000          # jobs data segment limit
    RTASK_GONE_ACTION="KILLJOB_TASKEXIT IGNORE_TASKCRASH"
    DJOB_ENV_SCRIPT   = /lsf/djobs/proj_1/djob_env
    DJOB_RU_INTERVAL  = 300
    DJOB_HB_INTERVAL  = 30
    DJOB_COMMFAIL_ACTION="KILL_TASKS"
    End Application

  • On the execution cluster (Cluster2) in the lsb.applications file:

    Begin Application
    NAME         = dyna
    DESCRIPTION  = ANSYS LS-DYNA
    CPULIMIT     = 8:0/amd64dcore   # 8 hours of host model amd64dcore
    FILELIMIT    = 20000
    DATALIMIT    = 20000          # jobs data segment limit
    CORELIMIT    = 20000
    PROCLIMIT    = 5              # job processor limit
    PRE_EXEC     = /usr/local/lsf/misc/testq_pre >> /tmp/pre.out
    POST_EXEC    = /usr/local/lsf/misc/testq_post |grep -v "Hi"
    REQUEUE_EXIT_VALUES = 55 255 78
    End Application
    Begin Application
    NAME         = default
    DESCRIPTION  = global defaults
    CORELIMIT    = 0              # No core files
    STACKLIMIT   = 200000         # Give large default
    RERUNNABLE   = Y              #
    RES_REQ      = order[mem:ut]  # change the default ordering method
    End Application
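To cross-check profile names and descriptions against what the remote cluster reports, the Begin Application and End Application blocks shown above can be read with a short script. This is a simplified sketch, not the LSF parser: parse_applications is a hypothetical name, everything after "#" on a line is treated as a comment (which would break a value that legitimately contains "#"), and surrounding double quotes are stripped from values.

```python
from typing import Dict, List


def parse_applications(text: str) -> List[Dict[str, str]]:
    """Return one dict of KEY -> value per Begin/End Application block."""
    profiles: List[Dict[str, str]] = []
    current = None
    for raw in text.splitlines():
        # Simplification: treat everything after "#" as a comment.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        lowered = line.lower()
        if lowered == "begin application":
            current = {}
        elif lowered == "end application":
            if current is not None:
                profiles.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip().strip('"')
    return profiles


if __name__ == "__main__":
    sample = """
    Begin Application
    NAME         = dyna
    DESCRIPTION  = ANSYS LS-DYNA
    CPULIMIT     = 8:0/amd64dcore   # 8 hours of host model amd64dcore
    End Application
    """
    for profile in parse_applications(sample):
        print(profile["NAME"], "-", profile.get("DESCRIPTION", ""))
```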

Verify that MultiCluster is enabled:

lsclusters
CLUSTER_NAME   STATUS   MASTER_HOST               ADMIN    HOSTS  SERVERS
cluster1       ok       master_c1                 admin    1      1
cluster2       ok       master_c2                 admin    2      2
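The same check can be scripted against captured lsclusters output. The sketch below is an assumption-laden illustration: clusters_not_ok is a hypothetical helper, it assumes the column layout shown above with no spaces inside any field, and it works on text you have already captured rather than invoking the command itself.

```python
from typing import List


def clusters_not_ok(output: str) -> List[str]:
    """Return the names of clusters whose STATUS column is not "ok"."""
    lines = [line for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    name_col = header.index("CLUSTER_NAME")
    status_col = header.index("STATUS")
    bad = []
    for line in lines[1:]:
        fields = line.split()
        if fields[status_col] != "ok":
            bad.append(fields[name_col])
    return bad


if __name__ == "__main__":
    # Output copied from the example above.
    sample = """\
CLUSTER_NAME   STATUS   MASTER_HOST               ADMIN    HOSTS  SERVERS
cluster1       ok       master_c1                 admin    1      1
cluster2       ok       master_c2                 admin    2      2
"""
    print(clusters_not_ok(sample))
```

An empty result means every listed cluster reported ok.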

View available applications on remote clusters from the submission cluster (Cluster1):

bclusters -app
REMOTE_CLUSTER  APP_NAME        DESCRIPTION
cluster2        dyna            ANSYS LS-DYNA
cluster2        default         global defaults

View available applications on remote clusters from the execution cluster (Cluster2):

bclusters -app
REMOTE_CLUSTER  APP_NAME        DESCRIPTION
cluster1        catia           CATIA V5
cluster1        fluent          FLUENT Version 6.2
cluster1        djob            distributed jobs
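The bclusters -app listings above can be turned into a lookup table keyed by cluster. This is a minimal sketch: parse_bclusters_app is a hypothetical helper, it assumes the first non-blank line is the header, and it assumes cluster and application names contain no whitespace while DESCRIPTION may contain spaces.

```python
from typing import Dict


def parse_bclusters_app(output: str) -> Dict[str, Dict[str, str]]:
    """Map remote cluster name -> {application name: description}."""
    lines = [line for line in output.splitlines() if line.strip()]
    apps: Dict[str, Dict[str, str]] = {}
    for line in lines[1:]:  # skip the header row
        parts = line.split(None, 2)  # the description may contain spaces
        if len(parts) < 2:
            continue
        cluster, app = parts[0], parts[1]
        description = parts[2].strip() if len(parts) > 2 else ""
        apps.setdefault(cluster, {})[app] = description
    return apps


if __name__ == "__main__":
    # Output copied from the Cluster2 example above.
    sample = """\
REMOTE_CLUSTER  APP_NAME        DESCRIPTION
cluster1        catia           CATIA V5
cluster1        fluent          FLUENT Version 6.2
cluster1        djob            distributed jobs
"""
    print(sorted(parse_bclusters_app(sample)["cluster1"]))
```

With a table like this, a submission script could check that a profile name exists on the remote cluster before submitting a job that requests it.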