Scheduler Interaction

Introduction

Jobs are submitted to ARC resources through a job queuing system, or “scheduler”. Because jobs pass through a queue, they may not run immediately; each job waits until the resources that it requires are available. The queuing system thus keeps the compute servers from being overloaded and allocates dedicated resources to each running job, which allows each job to run optimally once it leaves the queue.

Submission Script

Jobs are submitted with submission scripts that describe what resources the job requires and what the system should do once the job runs. Example submission scripts are provided in the documentation for each system and can be used as templates for getting started. Note that jobs can also be started interactively, which can be very useful during testing and debugging.

The resource request includes:

  • Queue (denoted by #PBS -q). Indicates the queue to which the job should be submitted. Different queues are intended for different use cases (e.g., production, development, visualization) and therefore have different usage limits. The queue parameters are described in the documentation for each system.
  • Walltime (denoted by #PBS -l walltime). This is the longest time that you expect your job to run. For example, if your job starts at 5:00pm on Wednesday and you expect it to finish at 5:00pm on Thursday, the walltime would be 24:00:00. Note that if your job exceeds the walltime requested at submission, the scheduler will kill it. So it is important to be conservative (i.e., to err on the high side) with the walltime that you include in your submission script.
  • Hardware (denoted by #PBS -l nodes, #PBS -l proc, etc.). This is the hardware that you want to reserve for your job. The types and quantity of available hardware, how to request them, and the limits for each user are described in the documentation for each system.
  • Account (denoted by #PBS -A). Indicates the allocation account to which you want to charge the job. (Only applies to some systems – see system documentation.)
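The advice to err on the high side with walltime can be sketched as a small helper that pads a runtime estimate by a safety margin and prints it in the HH:MM:SS form used by #PBS -l walltime. The function name and the 20% margin are illustrative assumptions, not ARC recommendations.

```shell
#!/bin/bash
# Hypothetical helper: print a walltime string (HH:MM:SS) for an estimated
# runtime in whole hours, padded by a margin given as a whole-number percentage.
walltime_with_margin() {
    local hours=$1 margin_pct=$2
    local seconds=$(( hours * 3600 * (100 + margin_pct) / 100 ))
    printf '%02d:%02d:%02d\n' \
        $(( seconds / 3600 )) $(( seconds % 3600 / 60 )) $(( seconds % 60 ))
}

# A 20-hour estimate with a 20% margin gives a 24-hour request.
walltime_with_margin 20 20   # prints 24:00:00
```

The printed value would then be used in the submission script as #PBS -l walltime=24:00:00, subject to the walltime limit of the chosen queue.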

The submission script should also specify what should happen when the job runs:

  • Software Modules. Use module commands to add the software modules that your job will need to run.
  • Run. Finally, you need to specify what commands you want to run to execute your computation. This can be execution of your own program or a call to a software package.
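Putting the pieces above together, a minimal submission script might look like the following sketch. The queue name, account name, module name, and program name are placeholders, not values taken from any ARC system; the documentation for each system gives the real ones. The check for the module command is only there so that the sketch also runs outside the cluster.

```shell
#!/bin/bash
# Resource requests. The scheduler reads the #PBS lines; to bash they are comments.
# Queue (placeholder name):
#PBS -q normal_q
# Walltime (24 hours):
#PBS -l walltime=24:00:00
# Hardware (one node, four cores per node):
#PBS -l nodes=1:ppn=4
# Account (placeholder name; only needed on some systems):
#PBS -A youraccount

# The scheduler sets PBS_O_WORKDIR to the directory the job was submitted from.
# Fall back to the current directory when run outside a job.
cd "${PBS_O_WORKDIR:-$PWD}" || exit 1

# Software modules (placeholder module name).
if command -v module >/dev/null 2>&1; then
    module purge
    module load gcc
fi

# Run. A real script would call your program here, for example:
#   ./my_program input.dat     (hypothetical program and input file)
job_message="Job running on $(uname -n) in $(pwd)"
echo "$job_message"
```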

Job Management

Job submission is done via a job launch script. Example submission scripts are provided in the Examples section of the documentation for each system.
To submit your job to the queuing system, use the command qsub. For example, if your script is in “JobScript.sh”, the command would be:

qsub ./JobScript.sh

This will return a job identifier of the form:

176618.master.cluster

Here 176618 is the job number. Once a job is submitted to a queue, it will wait until requested resources are available within that queue, and will then run if eligible. Eligibility to run is influenced by the resource policies in effect for the queue.
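When submitting from a script, it can help to capture the identifier that qsub prints and extract the job number from it. The sketch below uses a sample value in place of a live qsub call so that it runs without a scheduler; the variable names are illustrative.

```shell
#!/bin/bash
# In a real session the identifier would come from the scheduler:
#   job_id=$(qsub ./JobScript.sh)
# A sample value is used here so the sketch runs anywhere.
job_id="176618.master.cluster"

# Remove everything from the first "." onward, leaving the job number.
job_num="${job_id%%.*}"
echo "Job number: ${job_num}"
```

The job number stored in job_num can then be passed to the job management commands described in this section.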

To check a job’s status, use the checkjob command:

checkjob -v 176618

To check resource usage on the nodes available to a running job, use:

jobload 176618

To check the status of more than one job or the queues in general, use showq. Examples include:

showq -r           #View all running jobs
showq -u username  #View only a given user's jobs

If your job has not started and you are unsure why, the FAQ provides some common explanations.

To remove a job from the queue, or stop a running job, use the command qdel. For job number 176618, the command would be:

qdel 176618

Output

When your job has finished running, anything it wrote to stdout or stderr will be placed in two files named after the submission script, with the suffixes .o and .e followed by the job number. These two files will be in the directory from which you submitted the job. For example, for a job submitted from JobScript.sh with job ID 176618, the output would be in:

JobScript.sh.o176618  #Output will be here
JobScript.sh.e176618  #Any errors will be here
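The naming pattern above can be reproduced in a short script, which is handy for checking a finished job's error file automatically. This is a sketch; the variable names are illustrative, and the script simply reports that no error output was found if the files do not exist.

```shell
#!/bin/bash
script_name="JobScript.sh"
job_num="176618"

# Build the output file names from the script name and job number.
out_file="${script_name}.o${job_num}"
err_file="${script_name}.e${job_num}"
echo "stdout: ${out_file}"
echo "stderr: ${err_file}"

# -s is true only if the file exists and is not empty.
if [ -s "$err_file" ]; then
    echo "Errors were reported:"
    cat "$err_file"
else
    echo "No error output found in ${err_file}"
fi
```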

Any files that the job writes to permanent storage locations will simply remain in those locations. Files written to locations only available during the life of the job (e.g., TMPFS or TMPDIR) will be removed once the job completes, so any such files that you want to keep must be copied or moved to a permanent location at the end of the submission script.
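The end of a submission script that works in temporary storage might therefore look like the sketch below. The results directory and file name are hypothetical, and the fallbacks to /tmp and the current directory are only there so that the sketch runs outside a job; inside a job, TMPDIR and PBS_O_WORKDIR are set by the scheduler.

```shell
#!/bin/bash
# Create a scratch directory under the job's temporary storage.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/job_scratch.XXXXXX") || exit 1

# Hypothetical permanent location: a results directory under the
# directory the job was submitted from.
results_dir="${PBS_O_WORKDIR:-$PWD}/results"
mkdir -p "$results_dir"

# Stand-in for the real computation, which would write into $scratch.
echo "example result" > "$scratch/output.dat"

# Copy results to permanent storage before the job ends, then clean up.
cp "$scratch/output.dat" "$results_dir/"
rm -rf "$scratch"
echo "Results saved to ${results_dir}"
```

Copying before cleaning up means the results survive even if removal of the scratch directory fails.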