SGI Queuing System
Jobs are submitted to the ARC SGI Altix servers through a job queueing
system. Jobs submitted this way do not run immediately; they wait until
CPU resources are available. The queueing system keeps the compute
servers from being overloaded and improves CPU and memory utilization
across running jobs. As a result, each job, especially a CPU-bound one,
runs more efficiently once it leaves the queue.
A queueing system, similar to the queueing system used for System X, has been installed on Cauldron and Inferno2.
This web page provides an introduction to the queuing system and outlines the steps required
to begin using this system.
A tutorial to assist you in learning how to use the queuing system on the VT-ARC SGI systems
is available for download in two formats.
Two tutorial videos are also available:
- The Hello World video is a basic introduction to queuing for the SGI machines. It describes the queuing submission script and how to get information about job status.
- The Do Sums video shows more advanced queuing submission scripts.
Using the Queueing System
To use the queueing system, you will be logging into one of two new
head nodes to access files, compile code, and submit jobs to a
queue. Direct logins to cauldron.arc.vt.edu and inferno2.arc.vt.edu
will no longer be allowed. The two head nodes that are available for
you to use are:
- charon1.arc.vt.edu
- charon2.arc.vt.edu
The head nodes are small systems with 2 CPUs and 4 GB of memory, so
running applications directly on the head nodes is prohibited.
Direct logins to inferno.arc.vt.edu will still be allowed. It is to
be used for applications requiring interactive sessions (such as
GUIs) and for debugging. This is a relatively small server: please
do not run large jobs on inferno, and try to limit debug jobs on
inferno to 2 to 4 processors.
In order to use the queuing system you need to have a job submission
script. There is an example submission script in /apps/doc. You can
also view the queuing tutorial videos for more information on queuing
scripts.
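As a rough illustration of what a submission script looks like, here is a minimal sketch. It is not the example from /apps/doc, which remains the authoritative reference. The job name hello_world, the ncpus and walltime values, and the choice of inferno2_q are assumptions made for illustration; the `#PBS` directives are standard Torque options, but the exact resource syntax expected on these systems should be checked against the site example.

```shell
#!/bin/bash
# Hypothetical job name; Torque uses it to label the job and its output files.
#PBS -N hello_world
# Queue to submit to (inferno2_q or cauldron_q).
#PBS -q inferno2_q
# Assumed resource request: 4 CPUs for at most 10 minutes of wall-clock time.
#PBS -l ncpus=4
#PBS -l walltime=00:10:00

# PBS_O_WORKDIR is set by Torque to the directory qsub was run from.
# Fall back to the current directory so the script also runs outside the queue.
WORKDIR="${PBS_O_WORKDIR:-$PWD}"
cd "$WORKDIR" || exit 1

# PBS_JOBID is set by Torque when the job runs under the scheduler.
echo "Job ${PBS_JOBID:-none} running on $(uname -n) in $WORKDIR"

# Replace this comment with the command that starts your application.
```

Lines beginning with `#PBS` are read by qsub as options and ignored by the shell, so the same file can be run by hand as a quick syntax check before it is submitted.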
Selecting A Queue
There will be two queues available for job submission: the inferno2_q
for running jobs on inferno2, and the cauldron_q for running jobs on
cauldron. The inferno2_q will be configured to serve smaller jobs,
while the cauldron_q will be configured to serve larger jobs. The
number of CPUs and amount of memory you need for a job, as well as
your total number of jobs, will guide your selection of a queue.
Here are the characteristics of each queue that will help you select
a queue to run on:
The inferno2_q:
- Offers 4 GB of memory per CPU
- Has a soft limit* of 2 jobs per user
- Has a hard limit** of 8 jobs per user
- Has a MAXIMUM limit of 16 CPUs per USER
The cauldron_q:
- Offers 5 GB of memory per CPU
- Has a soft limit* of 1 job per user
- Has a hard limit** of 1 job per user
- Requires a MINIMUM of 10 CPUS per JOB
* Soft limit sets the number of jobs submitted by a user that will be
able to run concurrently if any other user below that limit has jobs
waiting in queue.
** Hard limit sets the maximum number of jobs submitted by a user that
can run concurrently.
Note: We will monitor and tune the job and CPU limits as needed in
order to maximize utilization.
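The CPU-count rules above can be turned into a small helper. This is only a sketch: the function name suggest_queue is hypothetical, it considers CPU count alone (not memory per CPU or the per-user job limits), and the thresholds are the ones listed above, which may be tuned over time.

```shell
# Suggest a queue from the number of CPUs a single job needs.
# Thresholds taken from the queue descriptions above:
#   cauldron_q requires at least 10 CPUs per job,
#   inferno2_q allows at most 16 CPUs per user.
suggest_queue() {
    ncpus="$1"
    if [ "$ncpus" -lt 10 ]; then
        # Too small for cauldron_q.
        echo "inferno2_q"
    elif [ "$ncpus" -le 16 ]; then
        # Fits either queue; the choice depends on memory and job count.
        echo "inferno2_q or cauldron_q"
    else
        # Exceeds the inferno2_q per-user CPU limit.
        echo "cauldron_q"
    fi
}

suggest_queue 4
suggest_queue 12
suggest_queue 32
```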
Once a job is submitted to a queue, it waits until the requested CPU
resources are available within that queue, and then runs if it is
eligible under the job limits listed above. On the cauldron_q, the job
limits specify that a user cannot have more than 2 jobs running at once,
and a user will only be able to have a second job running if no other
user with no jobs running is waiting in the queue. On the inferno2_q,
if user A has two 4-CPU jobs running, a third 4-CPU job can run for
user A only when no other user below the soft limit has an eligible job
waiting to run. Furthermore, user A cannot have a fourth 4-CPU job
running concurrently with the previous three, because that would exceed
the per-user CPU limit.
The total amount of time requested for a job also affects its
eligibility to run, as shorter jobs tend to get priority over longer
jobs. You will learn to tune your requested job times to be more
competitive as you gain experience with your code or application.
Submitting Jobs to the Queuing System
The queuing system is Torque/Moab, so it will feel familiar if you have used either of them before.
Jobs are submitted by passing a job launch script to the command qsub. You can find example submission scripts in the /apps/doc directory.
To submit your job to the queuing system, run:
qsub ./JobScript.sh
This returns a job identifier of the form xxxxx.queue.tcf-int.vt.edu.
The number before .queue.tcf-int.vt.edu is your job_number.
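If you submit from a script, the job_number can be split off with ordinary shell parameter expansion. The sketch below uses a made-up value, 12345, in place of real qsub output; on the head nodes the first line would instead be job_id=$(qsub ./JobScript.sh).

```shell
# Hypothetical value standing in for the string printed by qsub.
job_id="12345.queue.tcf-int.vt.edu"

# Remove everything from the first "." onward, leaving only the number.
job_number="${job_id%%.*}"

echo "$job_number"
```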
If you need to remove your job from the queue, use qdel:
qdel <job_number>
To see status information about your job, you can use:
- showstart <job_number> reports the expected start and finish times.
- qstat -f <job_number> reports general information about the job.
When your job has finished running, any output to stdout or stderr
will be placed in the files <job_name>.o<job_number> and <job_name>.e<job_number>. These two files will be in the directory
from which you submitted the job.
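Under Torque's default naming, the output files are prefixed with the job name, which is the script's file name unless a name is set with the -N option. The helper below is a small sketch of that convention; the function name job_output_files and the sample values are hypothetical.

```shell
# Print the stdout and stderr file names Torque would use by default
# for a given job name and job number.
job_output_files() {
    job_name="$1"
    job_number="$2"
    echo "${job_name}.o${job_number}"   # captured stdout
    echo "${job_name}.e${job_number}"   # captured stderr
}

# Example with made-up values: a job submitted as ./JobScript.sh with number 12345.
job_output_files JobScript.sh 12345
```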
To find information about your queued or running jobs, use the
command showq. This shows all of the running jobs across System X,
cauldron, and inferno2. To view only cauldron jobs, use
showq -p CAULDRON; to view only inferno2 jobs, use
showq -p INFERNO2. For detailed information on your job, use
qstat -f <job_number> or checkjob -v <job_number>.
If you have a job sitting in the queue that you think should be able to run, use the command checkjob -v <job_number> to see the reason the job is not running, as shown at the bottom of the output.
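The status commands mentioned on this page can be gathered into one helper for convenience. This is a sketch only: the function name inspect_job and the RUN variable are hypothetical conveniences, not part of Torque or Moab. Setting RUN=echo prints the commands instead of running them, which is useful for previewing them on a machine without the scheduler tools.

```shell
# Run the three per-job status commands for one job number.
# If RUN is set (for example RUN=echo), it is prefixed to each command.
inspect_job() {
    job_number="$1"
    if [ -z "$job_number" ]; then
        echo "usage: inspect_job <job_number>" >&2
        return 1
    fi
    ${RUN:-} showstart "$job_number"    # expected start and finish times
    ${RUN:-} qstat -f "$job_number"     # general information about the job
    ${RUN:-} checkjob -v "$job_number"  # reason a queued job is not running
}

# Preview the commands for a made-up job number without running them.
RUN=echo inspect_job 12345
```

On the head nodes, calling inspect_job with a real job number and RUN unset runs the three commands in turn.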
If you have any questions, concerns, or comments, please let us know at arc@vt.edu.