System X Usage
System Maintenance
An optional maintenance window is scheduled from Thursday at noon until
Friday at noon. By noon on Wednesday, an email is sent to all System X
users stating whether the optional maintenance window will take place.
Application Development/Porting Consultants
Application-related consulting is available from the Laboratory for Advanced
Scientific Computing and Applications (LASCA). LASCA Graduate Research
Assistants (GRAs) are available to work with System X users in the areas
of parallel algorithms, parallel applications development and porting,
compilation and runtime issues, and performance measurement and tuning.
LASCA faculty members are also available to discuss broader issues and
potential collaborations. Consultation is done on an individual basis
with no fee for Virginia Tech faculty, staff, and students. To contact
a LASCA consultant, please email lasca@cs.vt.edu.
System X Queuing System
The queueing policy is First-In First-Out (FIFO) with backfill and
per-user job limits. Jobs are scheduled in the order they appear in the
queue, but a job further back is started early if it can be squeezed
into resources that would otherwise sit idle.
The soft job limit is 2 jobs and the hard job limit is 20 jobs. After a
user reaches the soft limit, the queueing system skips that user's jobs
in the first scheduling pass. If no other jobs can be scheduled, the
user's remaining jobs are then evaluated for scheduling.
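The two-pass behavior can be illustrated with a small shell sketch. This is not the scheduler's actual logic: the function name order_by_soft_limit is hypothetical, it counts a user's jobs by their position in the queue, and it ignores backfill and the hard limit entirely.

```shell
# order_by_soft_limit [LIMIT]
# Reads "jobid user" lines in FIFO order on stdin and prints job IDs in the
# order a two-pass scan would consider them: first each user's jobs up to
# the soft limit, then the jobs that were skipped in the first pass.
# Assumption: the limit is applied by queue position, for illustration only.
order_by_soft_limit() {
  limit=${1:-2}
  awk -v limit="$limit" '
    {
      count[$2]++
      if (count[$2] <= limit) first = first $1 "\n"
      else                    second = second $1 "\n"
    }
    END { printf "%s%s", first, second }
  '
}
```

For a queue holding three jobs from one user followed by one job from a second user, the second user's job is considered before the first user's third job.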
Requirements for Jobs
- All jobs must be submitted through the queueing system.
- Each job submitted must include details about CPUs required, estimated
runtime, and accounting information.
- The CPU time consumed by each job must be allocated to the user before
the job is run.
To view your allocations, run "mybalance".
A "hat" is similar to a bank account that holds an allocation of CPU
time. Users may have multiple hats, but only one hat can be used for any
given job.
| Hat | CPU Time Remaining |
|---|---|
| test 1001 | 300h 45m 40s |
| dept 2012 | 1200h 31m 10s |
The queueing system assigns whole nodes rather than individual CPUs.
Even if a job specifies that it will use only one CPU, it is assigned an
entire node and charged for both CPUs in that node. For example, using
one node for 2 hours consumes 4 CPU hours of an allocation.
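The charging rule can be checked with a little shell arithmetic. This is a minimal sketch: the function names are hypothetical, it works in whole hours only, and it assumes the charge is simply nodes times CPUs per node times hours, as in the example above.

```shell
# Each System X node counts as 2 CPUs, per the text above.
CPUS_PER_NODE=2

# nodes_for_cpus CPUS: whole nodes needed to supply CPUS, rounded up.
nodes_for_cpus() {
  echo $(( ($1 + CPUS_PER_NODE - 1) / CPUS_PER_NODE ))
}

# cpu_hours_charged NODES HOURS: CPU hours debited from the hat.
cpu_hours_charged() {
  echo $(( $1 * CPUS_PER_NODE * $2 ))
}
```

A job that needs only one CPU still occupies one node, so cpu_hours_charged "$(nodes_for_cpus 1)" 2 prints 4.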
jobhistory Command
This command lets you view completed jobs and how much time they
consumed. It also displays queued jobs and how much time each has
encumbered. A total of all time consumed and encumbered is displayed at
the bottom of the output.
USAGE: jobhistory [-t] [-h cap] [-a | -u user] [-s sdate] [-e edate]
-t, --totals Will only show the totals, and not each individual job
-h, --hat Specifies the hat from which you would like to check your job history
-s, --start Specifies a start date (in MM/DD/YY or MM-DD-YY format)
-e, --end Specifies an end date (in MM/DD/YY or MM-DD-YY format)
-a, --all If you are the Principal Investigator of the specified hat,
this will check the job history of every member of that hat
-u, --user Another Principal Investigator only flag, this will check the given user's job history
NOTE: The -a and -u flags do not work together
The flags narrow the output. The -s <sdate> and -e <edate> flags
restrict the listing to jobs between sdate and edate. Dates can be in
YYYY-M-D or M/D/Y format (the - and / separators are interchangeable as
long as they are used consistently within a date). You can use -s and -e
together, just one of them, or neither. The -h <cap> option lets those
with multiple hats view only the specified hat. The -t option shows only
the totals and omits the individual jobs.
If you are a Principal Investigator of a hat, two more options are
available to you. The -a flag shows the job history of all members of
the hats for which you are the Principal Investigator. The -u <user>
flag shows the job history of a single user in a hat for which you are
the Principal Investigator.
All flags can be combined, with the exception of -a and -u. If both of
those flags are given, the command fails and prints usage instructions.
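As an illustration of combining the flags, the two wrappers below use only the options listed in the usage text. The function names and the hat name "dept" in the usage note are hypothetical.

```shell
# hat_totals HAT START END
# Totals only, for one hat, restricted to a date range.
hat_totals() {
  jobhistory -t -h "$1" -s "$2" -e "$3"
}

# member_history HAT USER
# Job history of one member of a hat (Principal Investigators only).
# -u is never combined with -a here, since the two flags conflict.
member_history() {
  jobhistory -h "$1" -u "$2"
}
```

For example, hat_totals dept 01/01/24 01/31/24 would run jobhistory -t -h dept -s 01/01/24 -e 01/31/24.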
mybalance Command
The mybalance command shows how many hours you have in your hat(s).
USAGE: mybalance [-h hrs] [-n num]
-h Specify hrs to see the number of CPUs you have credit to run a job on for hrs hours.
-n Specify num to see the length of time you can run a job on num CPUs.
Specifying flags gives you this information in addition to your hat
balance(s).
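The arithmetic behind the two flags can be sketched as follows. This is an assumption about what mybalance reports, not its actual implementation: the function names are hypothetical, the balance is treated as a plain pool of CPU minutes, and the fact that jobs are charged by whole nodes is ignored.

```shell
# to_minutes HOURS MINUTES: convert a balance to CPU minutes.
to_minutes() {
  echo $(( $1 * 60 + $2 ))
}

# cpus_for_hours BALANCE_MINUTES HRS
# How many CPUs the balance covers for HRS hours (rounded down), as with -h.
cpus_for_hours() {
  echo $(( $1 / ($2 * 60) ))
}

# time_on_cpus BALANCE_MINUTES NUM
# How long the balance lasts on NUM CPUs (rounded down), as with -n.
time_on_cpus() {
  per_cpu=$(( $1 / $2 ))
  echo "$(( per_cpu / 60 ))h $(( per_cpu % 60 ))m"
}
```

With a balance of 1200 hours 31 minutes, cpus_for_hours "$(to_minutes 1200 31)" 24 prints 50, and time_on_cpus "$(to_minutes 1200 31)" 16 prints 75h 1m.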
Running a Parallel Job
Use the following steps to submit a parallel job for processing:
- Copy /nfs/docs/qsub-example.sh to your directory.
- Edit qsub-example.sh and change the walltime, number of nodes, hat
name, and executable as needed.
- Rename qsub-example.sh to something related to your parallel job.
Example: small-molecular-solution.sh
- If you are running a binary from your home directory (or any other
location not in $PATH), remember to prepend "./" to the command.
- Submit the job using qsub. Example: qsub small-molecular-solution.sh
- Check the state of the job according to the scheduling system. The
scheduling system evaluates the state of the queue every 30 seconds to
a minute. To check the status of scheduled jobs, run "showq".
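The copy, rename, and submit steps above can be wrapped in two small helpers. This is only a sketch: the function names and the QSUB_TEMPLATE variable are hypothetical conveniences, and editing the script (walltime, nodes, hat name, executable) remains a manual step between the two calls.

```shell
# Template location from the steps above; can be overridden.
QSUB_TEMPLATE=${QSUB_TEMPLATE:-/nfs/docs/qsub-example.sh}

# new_job_script NAME
# Copy the template to NAME.sh in the current directory (copy and rename in
# one step), refusing to overwrite an existing script. Prints the new name.
new_job_script() {
  script="$1.sh"
  if [ -e "$script" ]; then
    echo "error: $script already exists" >&2
    return 1
  fi
  cp "$QSUB_TEMPLATE" "$script" && echo "$script"
}

# submit_job SCRIPT
# Submit an edited script, then list the queue.
submit_job() {
  qsub "$1" && showq
}
```

A typical session would be new_job_script small-molecular-solution, then editing the resulting file, then submit_job small-molecular-solution.sh.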
A brief list of commands (see the Torque and Moab documentation for
additional commands):
| Command | Description |
|---|---|
| qsub | Submit a job to the queue |
| qdel | Delete a job from the queue |
| showq | Show the status of submitted jobs |
| checkjob | Check the status of a particular job |
| showstart | Report on start and finish dates for a job |
If your job does not appear in the showq output, wait 30 seconds and
try again. If it is still not listed, it may have already run. Check
for job-output-JOBID and script-name.* files.
If your job is listed in the showq output as "Deferred", run checkjob
JOBID to determine why it was deferred. Common reasons include
specifying the wrong hat and not having enough CPU hours in the hat.
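These checks can be scripted. The sketch below assumes that showq prints one line per job containing the job ID and its state as separate words; that layout is an assumption, and the function names are hypothetical.

```shell
# job_state JOBID
# Print "missing", "deferred", or "listed" based on the showq output.
job_state() {
  line=$(showq | grep -w -- "$1" || true)
  if [ -z "$line" ]; then
    echo missing
  elif printf '%s\n' "$line" | grep -q 'Deferred'; then
    echo deferred
  else
    echo listed
  fi
}

# check_on_job JOBID [WAIT_SECONDS]
# Retry once after a pause if the job is not listed, then act on its state.
check_on_job() {
  jobid=$1
  wait_secs=${2:-30}
  state=$(job_state "$jobid")
  if [ "$state" = missing ]; then
    sleep "$wait_secs"
    state=$(job_state "$jobid")
  fi
  case $state in
    missing)  echo "$jobid not in queue; look for job-output-$jobid and script-name.* files" ;;
    deferred) checkjob "$jobid" ;;
    listed)   echo "$jobid is in the queue" ;;
  esac
}
```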
If your job finishes and generates output that is unexpected or appears
to indicate a system-level error, please forward your job-output-JOBID
and script-name.* files to the System X Support listserv.
For additional information on running parallel jobs, see:
Using MPI on System X.
Debugging Cluster
Users have access to an eight-node, sixteen-processor debug cluster. This cluster is NOT for performance testing and should only be used to debug code execution errors.
NOTE: The debug cluster reboots daily at 5:30 AM to purge any residual processes.
Make sure that your debug jobs are not running at that time. We impose
no restrictions on the debug cluster, but hope that users will not use
it for production runs.
To launch a debug job, run:
dbgrun [-printhostname] [-verbose] -np N debug1 ... debug8 a.out [args]
where N is the number of nodes
or
dbgrun [-printhostname] [-verbose] -np N -hostfile hf a.out [args]
where N is the number of nodes and hf is a file listing the nodes to use (debug1-debug8)
And, of course
dbgrun --help
will give out pertinent information to the memory challenged.
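The hostfile for the second form can be generated rather than typed by hand. This sketch assumes the hostfile lists one node name per line, which the text above does not confirm; the function name is hypothetical.

```shell
# make_debug_hostfile [N]
# Print the names of the first N debug nodes (default 8), one per line.
make_debug_hostfile() {
  n=${1:-8}
  if [ "$n" -lt 1 ] || [ "$n" -gt 8 ]; then
    echo "error: node count must be between 1 and 8" >&2
    return 1
  fi
  i=1
  while [ "$i" -le "$n" ]; do
    echo "debug$i"
    i=$((i + 1))
  done
}
```

For example, make_debug_hostfile 4 > hf followed by dbgrun -np 4 -hostfile hf a.out would target debug1 through debug4.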