Virginia Tech
Advanced Research Computing

SGI Queuing System

Jobs are submitted to the ARC SGI Altix servers through a job queuing system. A submitted job does not necessarily run immediately; it waits until the CPU resources it requests are available. The queuing system thus keeps the compute servers from being overloaded and improves CPU and memory utilization across running jobs. As a result, each job runs faster once it leaves the queue, especially if it is CPU bound.

A queuing system, similar to the one used for System X, has been installed on Cauldron and Inferno2. This web page introduces the queuing system and outlines the steps required to begin using it.

A tutorial to assist you in learning how to use the queuing system on the VT-ARC SGI Systems is available for download in two formats:

  • PowerPoint
  • PDF

Two tutorial videos are also available:

  1. The Hello World video is a basic introduction to queuing for the SGI machines. It describes the queuing submission script and how to get information about job status.

  2. The Do Sums video shows more advanced queuing submission scripts.

Using the Queuing System

To use the queuing system, you log in to one of two new head nodes to access files, compile code, and submit jobs to a queue. Direct logins to cauldron.arc.vt.edu and inferno2.arc.vt.edu will no longer be allowed. The two head nodes available to you are:

  • charon1.arc.vt.edu
  • charon2.arc.vt.edu

The head nodes are small systems with 2 CPUs and 4 GB of memory. Running applications directly on the head nodes is therefore prohibited.

Direct logins to inferno.arc.vt.edu will still be allowed. This server is intended for applications requiring interactive sessions (such as GUIs) and for debugging. It is a relatively small server: please do not run large jobs on inferno, and try to limit debug jobs there to 2 to 4 processors.

To use the queuing system you need a job submission script. An example submission script is available in /apps/doc. You can also view the queuing tutorial videos for more information on queuing scripts.
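For orientation, a submission script is an ordinary shell script whose leading #PBS comment lines carry requests to Torque. The sketch below is not the script shipped in /apps/doc. The job name, the walltime value, and especially the ncpus resource keyword are assumptions made for illustration, so check the site example for the exact resource syntax used on these machines.

```shell
#!/bin/sh
# Hypothetical Hello World job script (a sketch, not the /apps/doc example).

# Job name (assumed).
#PBS -N hello_world
# Queue to submit to; inferno2_q is one of the two queues on this page.
#PBS -q inferno2_q
# Requested run time as hh:mm:ss (assumed value).
#PBS -l walltime=00:05:00
# CPU request; the ncpus keyword is an assumption about this site's setup.
#PBS -l ncpus=4

# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory when the script runs outside the queue.
cd "${PBS_O_WORKDIR:-.}" || exit 1

# The actual work of the job: here, a single line written to stdout.
MSG="Hello World from $(uname -n)"
echo "$MSG"
```

Because the #PBS lines are plain comments to the shell, the same file can be run by hand for a quick check before it is submitted.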

Selecting A Queue

There will be two queues available for job submission: the inferno2_q for running jobs on inferno2, and the cauldron_q for running jobs on cauldron. The inferno2_q will be configured to serve smaller jobs, while the cauldron_q will be configured to serve larger jobs. The number of CPUs and amount of memory you need for a job, as well as your total number of jobs, will guide your selection of a queue.

Here are the characteristics of each queue that will help you select a queue to run on:

The inferno2_q:

  • Offers 4 GB of memory per CPU
  • Has a soft limit* of 2 jobs per user
  • Has a hard limit** of 8 jobs per user
  • Has a MAXIMUM limit of 16 CPUs per USER

The cauldron_q:

  • Offers 5 GB of memory per CPU
  • Has a soft limit* of 1 job per user
  • Has a hard limit** of 1 job per user
  • Requires a MINIMUM of 10 CPUs per JOB

* The soft limit is the number of jobs submitted by a user that can run concurrently while any other user below that limit has jobs waiting in the queue.

** The hard limit is the maximum number of jobs submitted by a user that can run concurrently.

Note:  We will monitor and tune the job and CPU limits as needed in order to maximize utilization.
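The CPU thresholds above can be turned into a small helper that suggests a queue for a given CPU count. This is only a sketch: the function name pick_queue is hypothetical, the numbers are copied from the limits listed on this page and may be retuned, and the function does not consider memory per CPU, the number of jobs you already have, or how busy each machine is.

```shell
#!/bin/sh
# pick_queue NCPUS: print the queue(s) whose CPU limits admit a job of NCPUS CPUs.
# Hypothetical helper; thresholds come from the limits listed on this page.
pick_queue() {
    ncpus=$1
    if [ "$ncpus" -lt 10 ]; then
        # Below the cauldron_q minimum of 10 CPUs per job.
        echo "inferno2_q"
    elif [ "$ncpus" -gt 16 ]; then
        # Above the inferno2_q maximum of 16 CPUs per user.
        echo "cauldron_q"
    else
        # 10 to 16 CPUs satisfies both CPU limits; the choice is left to the user.
        echo "inferno2_q cauldron_q"
    fi
}
```

For a small job, the result can be passed straight to qsub, for example qsub -q "$(pick_queue 4)" ./JobScript.sh. When both queues are printed, choose between them by memory per CPU and by how many jobs you plan to run.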

Once a job is submitted to a queue, it waits until the requested CPU resources are available within that queue, and then runs if it is eligible according to the job limits listed above. On the cauldron_q, the job limits specify that a user cannot have more than 2 jobs running at once, and a user can have a second job running only if no other user with no running jobs is waiting in the queue. On the inferno2_q, the job limits specify that if user A has two 4-CPU jobs running, a third 4-CPU job can run for user A only if no other user below the soft limit has an eligible job waiting to run. Furthermore, an inferno2_q user cannot have a fourth 4-CPU job running concurrently with the previous three because that would exceed the user CPU limit.
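The interaction between the soft and hard job limits can be sketched as a short eligibility check. The function name may_start and its arguments are hypothetical, and the limits are passed in as arguments because they may be retuned. The sketch models only the per-user job limits as defined in the footnotes; it ignores the per-user CPU limit, requested run time, and everything else the scheduler weighs.

```shell
#!/bin/sh
# may_start RUNNING SOFT HARD OTHERS_WAITING
#   RUNNING         number of jobs this user already has running
#   SOFT, HARD      per-user job limits of the queue
#   OTHERS_WAITING  1 if another user below the soft limit has a job waiting, else 0
# Returns 0 (true) when one more job may start under the job limits alone.
may_start() {
    running=$1
    soft=$2
    hard=$3
    others_waiting=$4
    # The hard limit is never exceeded.
    [ "$running" -lt "$hard" ] || return 1
    # Below the soft limit, the job is eligible regardless of other users.
    [ "$running" -lt "$soft" ] && return 0
    # Between the soft and hard limits, yield to waiting users below the soft limit.
    [ "$others_waiting" -eq 0 ]
}
```

With a soft limit of 2 and a hard limit of 8, may_start 2 2 8 1 fails while may_start 2 2 8 0 succeeds: a user with two running jobs gets a third only when no user below the soft limit is waiting.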

The total amount of time requested for a job also affects its eligibility to run, as shorter jobs tend to get priority over longer jobs. You will learn to tune your requested job times to be more competitive as you gain experience with your code or application.

Submitting Jobs to the Queuing System

The queuing system is Torque/Moab, so it will feel familiar if you have used either before.

Jobs are submitted by passing a job launch script to the command qsub. You can find example submission scripts in the /apps/doc directory.

To submit your job to the queuing system use the command qsub:

     qsub ./JobScript.sh

This will return your job name of the form xxxxx.queue.tcf-int.vt.edu. The number before the .queue.tcf-int.vt.edu is your job_number.
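Since qdel, showstart, and the other commands on this page take the job_number, it can be convenient to strip it from the job name in a script. A minimal sketch follows; the function name job_number_of is hypothetical and the job name in the example is made up.

```shell
#!/bin/sh
# job_number_of JOB_NAME: print the part of a qsub job name before the first dot.
# Hypothetical helper; uses only POSIX parameter expansion.
job_number_of() {
    echo "${1%%.*}"
}

# Example with a made-up job name of the documented form:
job_number_of "12345.queue.tcf-int.vt.edu"    # prints 12345
```

In a script, the two steps can be combined as JOB=$(job_number_of "$(qsub ./JobScript.sh)"), which keeps the number in a variable for later commands.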

If you need to remove your job from the queue, use qdel:

    qdel <job_number>

To see status information about your job, you can use:

    showstart <job_number>   expected start and finish times.
    qstat -f <job_number>    general information about the job.
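The full qstat -f listing is long, so a script that only needs to know whether a job is queued or running can filter out a single attribute. The sketch below assumes the listing contains a line of the form "job_state = R", which is how Torque normally prints it; verify the format on the head nodes before relying on it. The function name job_state is hypothetical.

```shell
#!/bin/sh
# job_state: read "qstat -f" output on stdin and print the value of job_state.
# Hypothetical helper; assumes an attribute line such as "    job_state = R".
job_state() {
    sed -n 's/^[[:space:]]*job_state = //p'
}
```

Typical use is qstat -f <job_number> | job_state, which prints a one-letter code such as Q for a queued job or R for a running one.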

When your job has finished running, any output to stdout or stderr will be placed in files whose names end in .o<job_number> and .e<job_number>, respectively. These two files will be in the directory from which you submitted the job.

To find information about your queued or running jobs, use the command showq. This shows all of the running jobs across System X, cauldron, and inferno2. To view only cauldron jobs, use showq -p CAULDRON; to view only inferno2 jobs, use showq -p INFERNO2. For detailed information on your job, use qstat -f <job_number> or checkjob -v <job_number>.

If you have a job sitting in the queue that you think should be able to run, use the command checkjob -v <job_number> to see the reason the job is not running, as shown at the bottom of the output.

If you have any questions, concerns, or comments, please let us know at arc@vt.edu.


VT-ARC is a Unit within the Office of the Vice President of Information Technology
© 2007-2008 Virginia Polytechnic Institute and State University
Page Last Updated: July 11th, 2008