New SGI Accounts
Three SGI Systems are available as components of VT-ARC:
inferno.arc.vt.edu
inferno2.arc.vt.edu
cauldron.arc.vt.edu
Inferno2 and Cauldron are accessed through a queuing system from the head nodes charon1.arc.vt.edu or charon2.arc.vt.edu; see details below.
Direct logins to an SGI Altix interactive node, inferno.arc.vt.edu, are available. It is to be used for applications requiring interactive sessions (such as GUIs) and for debug purposes. This is a relatively small server: please do not run large jobs on inferno; try to limit debug jobs on inferno to 2 - 4 processors.
An ssh-2 or later client running on your local system is required to log on to these systems. Mac OS X, Linux,
and most other UNIX systems have a built-in ssh client. Enter one of the following commands to log on to
charon1, charon2, or inferno:
ssh charon1.arc.vt.edu
or
ssh charon2.arc.vt.edu
or
ssh inferno.arc.vt.edu
Notes:
- The VT ARC systems require use of an ssh-2 or later client; in some Unix implementations, you will be required to use "ssh2" instead of "ssh" above.
- If the id you are using on your local system is different from your id on the VT ARC systems, precede the VT ARC hostname with your VT ARC account name followed by the @ sign. For example, if your account name were my_acct, you could log onto inferno using the following command:
ssh my_acct@inferno.arc.vt.edu
- If you are logging on from a wireless or off-campus location, you should do so using a VPN connection; see: http://www.computing.vt.edu/internet_and_web/internet_access/vpn.html
- If you are using an MS Windows (2000, XP, 2003, or Vista) Desktop, you can download the latest "SSHSecureShell" Client (currently SSHSecureShellClient-3.2.9.exe) from http://ftp.ssh.com/pub/ssh/ or the putty ssh client from http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html.
Queuing System
Jobs are submitted to the ARC SGI Altix servers through a job queueing system. Jobs submitted this way do not run immediately; they wait until CPU resources are available. The queueing system thus keeps the compute servers from being overloaded and improves CPU and memory utilization across running jobs, so that each job, especially a CPU-bound one, can run efficiently once it leaves the queue.
To use the queueing system, you will be logging into one of two SGI Altix head nodes to access files, compile code, and submit jobs to a queue. Direct logins to the SGI Altix compute servers, inferno2 and cauldron, are not allowed. The two head nodes that are available for you to use are:
charon1.arc.vt.edu
charon2.arc.vt.edu
The head nodes are small systems with 2 CPUs and 4 GB of memory. Running applications directly on the head nodes is therefore prohibited.
Selecting A Queue
There will be two queues available for job submission: the inferno2_q
for running jobs on inferno2, and the cauldron_q for running jobs on
cauldron. The inferno2_q will be configured to serve smaller jobs,
while the cauldron_q will be configured to serve larger jobs. The
number of CPUs and amount of memory you need for a job, as well as
your total number of jobs, will guide your selection of a queue.
Here are the characteristics of each queue that will help you select
a queue to run on:
The inferno2_q:
- Offers 4 GB of memory per CPU
- Has a soft limit* of 2 jobs per user
- Has a hard limit** of 8 jobs per user
- Has a MAXIMUM limit of 16 CPUs per USER
The cauldron_q:
- Offers 5 GB of memory per CPU
- Has a soft limit* of 1 job per user
- Has a hard limit** of 1 job per user
- Requires a MINIMUM of 10 CPUs per JOB
* Soft limit sets the number of jobs submitted by a user that will be
able to run concurrently if any other user below that limit has jobs
waiting in queue.
** Hard limit sets the maximum number of jobs submitted by a user that
can run concurrently.
Note: We will monitor and tune the job and CPU limits as needed in
order to maximize utilization.
Once a job is submitted to a queue, it will wait until requested CPU
resources are available within that queue, and will then run if eligible
according to the job limits listed above. On the cauldron_q, the job
limits specify that a user cannot have more than 2 jobs running at once,
and a user will only be able to have a second job running if no other
user is waiting in the queue with no jobs running. On the inferno2_q,
the job limits specify that if user A has two 4-CPU jobs running, a third
4-CPU job can run for user A only when no other user below the job soft
limit has an eligible job waiting to run. Furthermore, an inferno2_q
user cannot have a fourth 4-CPU job running concurrently with their
previous three jobs because they would then exceed the user CPU limit.
The total amount of time requested for a job also affects its
eligibility to run, as shorter jobs tend to get priority over longer
jobs. You will learn to tune your requested job times to be more
competitive as you gain experience with your code or application.
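As a rough illustration of the CPU limits listed above, the following sketch suggests a queue from the number of CPUs a single job needs. The function name suggest_queue is hypothetical and is not a command installed on the systems; the thresholds (a minimum of 10 CPUs per job on the cauldron_q, a maximum of 16 CPUs per user on the inferno2_q) are copied from this page and may change as the limits are tuned. It does not account for the per-user job limits or for memory needs.

```shell
# Hypothetical helper: suggest a queue for one job from its CPU count.
# Thresholds are taken from the limits listed on this page.
suggest_queue() {
    case "$1" in
        ''|*[!0-9]*) echo "usage: suggest_queue <ncpus>" >&2; return 2 ;;
    esac
    if [ "$1" -lt 1 ]; then
        echo "usage: suggest_queue <ncpus>" >&2
        return 2
    elif [ "$1" -lt 10 ]; then
        # Below the cauldron_q minimum of 10 CPUs per job.
        echo "inferno2_q"
    elif [ "$1" -le 16 ]; then
        # Meets the cauldron_q minimum and fits the inferno2_q per-user limit.
        echo "inferno2_q or cauldron_q"
    else
        # Above the inferno2_q limit of 16 CPUs per user.
        echo "cauldron_q"
    fi
}

suggest_queue 4
```

When a job fits either queue, the memory per CPU and the number of jobs you plan to run at once are the remaining factors to weigh.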
Submitting Jobs to the Queuing System
The queuing system is Torque/Moab; if you have used either before, this system will be familiar.
Jobs are submitted by passing a job launch script to the command qsub. You can find example submission scripts in the /apps/doc directory.
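For orientation, a minimal job launch script might look like the sketch below. It is not one of the examples from /apps/doc, and every value in it is an assumption: the job name, the walltime, the CPU count, and in particular the form of the CPU request (-l ncpus=4), which may differ from what these systems expect. The commented-out ./my_app line is a placeholder for your own program.

```shell
#!/bin/sh
#PBS -N example_job
#PBS -q inferno2_q
#PBS -l ncpus=4
#PBS -l walltime=01:00:00
# All directive values above are placeholders; check the site examples
# for the resource request syntax used on these systems.

# Torque sets PBS_O_WORKDIR to the directory the job was submitted from.
# Fall back to the current directory when the script is run outside the queue.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job ${PBS_JOBID:-(not queued)} running on $(uname -n) in $workdir"

# Placeholder for the real work:
# ./my_app
```

The #PBS lines are read by qsub as submission options and are ordinary comments to the shell, so the same file can be run directly for a quick syntax check.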
To submit your job to the queuing system use the command qsub:
qsub ./JobScript.sh
This will return your job name of the form xxxxx.queue.tcf-int.vt.edu.
The number before the .queue.tcf-int.vt.edu is your job_number.
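If you want the job_number in a script, it can be cut from the job name with a shell parameter expansion. This is a small sketch; the function name job_number_of is hypothetical and the job name shown is an example value only.

```shell
# Hypothetical helper: print the part of a job name before the first dot.
job_number_of() {
    printf '%s\n' "${1%%.*}"
}

# Example value only; qsub prints the real job name when you submit.
job_name="12345.queue.tcf-int.vt.edu"
job_number=$(job_number_of "$job_name")
echo "$job_number"
```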
If you need to remove your job from the queue, use qdel:
qdel <job_number>
To see status information about your job, you can use:
showstart <job_number>: reports the expected start and finish times.
qstat -f <job_number>: reports general information about the job.
When your job has finished running, any output to stdout or stderr
will be placed in the files .o<job_number> and .e<job_number>. These two files will be in the directory that
you submitted the job from.
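To pick out the output files of one job in a directory that holds many, a small helper like the following can be used. It is a sketch: the name list_job_output is hypothetical, and it assumes the file names end in .o<job_number> and .e<job_number>, possibly with a prefix in front.

```shell
# Hypothetical helper: list the stdout/stderr files of one job in the
# current directory, matching names that end in .o<job_number> or .e<job_number>.
list_job_output() {
    for f in *".o$1" ".o$1" *".e$1" ".e$1"; do
        # An unmatched pattern stays literal, so keep only real files.
        [ -f "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Run it from the directory you submitted the job from, for example list_job_output 12345.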
To find information about your queued or running jobs, use the
command showq. This shows all of the running jobs across System X,
cauldron, and inferno2. To view only cauldron jobs, use
showq -p CAULDRON; to view only inferno2 jobs, use
showq -p INFERNO2. For detailed information on your job, use
qstat -f <job_number> or checkjob -v <job_number>.
If you have a job sitting in the queue that you think should be able to run, use the command checkjob -v <job_number> to see the reason the job is not running, as shown at the bottom of the output.
For more information about the queueing system see:
http://www.arc.vt.edu/arc/sgi/queuing.php
Instructional videos and presentation slides are available on that page.
If you have questions, comments, or concerns, please let us know at arc@vt.edu.
Thank you.