Virginia Tech
Advanced Research Computing


System X Usage



System Maintenance

An optional maintenance window is scheduled from Thursday at noon until Friday at noon. By noon on Wednesday, email is sent to all System X users stating whether the maintenance window will take place.

Application Development/Porting Consultants

Application-related consulting is available from the Laboratory for Advanced Scientific Computing and Applications (LASCA). LASCA Graduate Research Assistants (GRAs) are available to work with System X users in the areas of parallel algorithms, parallel application development and porting, compilation and runtime issues, and performance measurement and tuning. LASCA faculty members are also available to discuss broader issues and potential collaborations. Consultation is provided on an individual basis at no charge to Virginia Tech faculty, staff, and students. To contact a LASCA consultant, please email lasca@cs.vt.edu.

System X Queuing System

The queuing policy is First-In First-Out with backfill and per-user job limits. Jobs are scheduled in the order they appear in the queue, but a job that can be squeezed into idle nodes will be started ahead of its turn. The soft job limit is 2 jobs and the hard job limit is 20 jobs. After a user reaches the soft limit, the queuing system skips that user's jobs in the first pass of scheduling; if no other jobs can be scheduled, the user's remaining jobs are then evaluated for scheduling.
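
The two-pass behavior of the soft limit can be illustrated with a small sketch. This is not the scheduler's actual implementation: it ignores backfill, node availability, and the hard limit, and it assumes the soft limit counts each user's jobs picked up in the first pass. The function name and the sample job IDs and user names are hypothetical.

```python
from collections import Counter

SOFT_LIMIT = 2  # soft per-user job limit stated in the text


def scheduling_order(queue, soft_limit=SOFT_LIMIT):
    """Return job IDs in the order a two-pass scheduler would consider them.

    queue is a list of (job_id, user) tuples in submission (FIFO) order.
    Assumption: the soft limit counts each user's jobs already picked up
    in the first pass; the real scheduler may count differently.
    """
    picked = Counter()
    first_pass, second_pass = [], []
    for job_id, user in queue:
        if picked[user] < soft_limit:
            picked[user] += 1
            first_pass.append(job_id)   # considered in FIFO order
        else:
            second_pass.append(job_id)  # skipped until the second pass
    return first_pass + second_pass


# Hypothetical queue: alice has three jobs ahead of bob's single job.
queue = [(101, "alice"), (102, "alice"), (103, "alice"), (104, "bob")]
print(scheduling_order(queue))  # [101, 102, 104, 103]
```

In this sketch, alice's third job is only considered after bob's job, even though it was submitted earlier.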

Requirements for Jobs

  • All jobs must be submitted through the queueing system.
  • Each job submitted must include details about CPUs required, estimated runtime and accounting information.
  • The CPU time consumed by each job must be allocated to the user before the job is run.
To view your allocations, run "mybalance".

A "hat" is similar to a bank account. Users may have multiple hats, but only one hat at a time can be used for a job.

Hat          CPUTime Remaining
test 1001    300h 45m 40s
dept 2012    1200h 31m 10s

The queuing system assigns whole nodes rather than individual CPUs. Even a job that specifies only one CPU is assigned an entire node, and both CPUs in that node are counted. Using one node for 2 hours therefore consumes 4 CPU hours of an allocation.
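
The node-based charging can be worked through as a short calculation. This is a minimal sketch, assuming a request is rounded up to whole nodes of two CPUs each; the function name is hypothetical and the accounting system's exact rounding is not documented here.

```python
import math

CPUS_PER_NODE = 2  # from the text: both CPUs in a node are counted


def cpu_hours_charged(cpus_requested, wall_hours):
    """CPU hours charged when a request is rounded up to whole nodes."""
    nodes = math.ceil(cpus_requested / CPUS_PER_NODE)
    return nodes * CPUS_PER_NODE * wall_hours


print(cpu_hours_charged(1, 2))  # 4, the one-node, 2-hour example from the text
print(cpu_hours_charged(3, 1))  # 4, since three CPUs need two whole nodes
```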

jobhistory Command

The jobhistory command shows jobs that have completed and how much time each consumed. It also displays queued jobs and how much time each has encumbered. A total of all time consumed and encumbered is displayed at the bottom of the output.

USAGE:  jobhistory [-t] [-h cap] [-a | -u user] [-s sdate] [-e edate]

-t, --totals Shows only the totals, not each individual job
-h, --hat Specifies the hat whose job history you want to check
-s, --start Specifies a start date (in MM/DD/YY or MM-DD-YY format)
-e, --end Specifies an end date (in MM/DD/YY or MM-DD-YY format)
-a, --all If you are the Principal Investigator of the specified hat, checks the job history of every member of that hat
-u, --user Principal Investigator only: checks the specified user's job history
NOTE: The -a and -u flags do not work together.

The various flags allow output to be streamlined. The -s <sdate> and -e <edate> options restrict the output to jobs between sdate and edate; the dates can be in YYYY-M-D or M/D/Y format (the - and / separators make no difference so long as they are consistent within the string). You can use both -s and -e, just one, or neither. The -h <cap> option lets those with multiple hats view only the specified hat. The -t option shows only the totaled output and not the individual jobs.

If you are the Principal Investigator of a hat, two more options are available to you. The -a flag lets you view the job history of all members of the hats for which you are the Principal Investigator. The -u <user> flag lets you specify a user in one of those hats and view that user's job history.

All flags will work together with the exception of -a and -u. If both of those flags are used the command will fail and output usage instructions.
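
The flag rules above can be mirrored in a small parser sketch. This is only an illustration of how the documented options interact, not the real jobhistory implementation; it uses Python's argparse, and the sample hat name and date are hypothetical.

```python
import argparse


def build_parser():
    """Parser mirroring the documented jobhistory flags (illustration only)."""
    # add_help=False frees -h, which jobhistory uses for the hat name.
    parser = argparse.ArgumentParser(prog="jobhistory", add_help=False)
    parser.add_argument("-t", "--totals", action="store_true")
    parser.add_argument("-h", "--hat", metavar="cap")
    parser.add_argument("-s", "--start", metavar="sdate")
    parser.add_argument("-e", "--end", metavar="edate")
    # -a and -u cannot be combined; argparse prints usage and exits if they are.
    who = parser.add_mutually_exclusive_group()
    who.add_argument("-a", "--all", action="store_true")
    who.add_argument("-u", "--user")
    return parser


args = build_parser().parse_args(["-t", "-h", "dept", "-s", "01/01/08"])
print(args.totals, args.hat, args.start)  # True dept 01/01/08
```

Passing both -a and -u to this sketch makes argparse print a usage message and exit, which parallels the failure described above.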

mybalance Command

Mybalance allows you to see how many hours you have in your hat(s).

USAGE:  mybalance [-h hrs] [-n num]

-h Specify hrs to see the number of CPUs on which you have credit to run a job for hrs hours.
-n Specify num to see the length of time for which you can run a job on num CPUs.
Specifying flags will give you information in addition to your hat balance(s).
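
The arithmetic behind the -h and -n options can be sketched as follows. This is an approximation under stated assumptions: the balance is a string shaped like "300h 45m 40s", CPU counts are rounded down to whole CPUs, and node-level rounding is ignored. The function names are hypothetical, and the real mybalance command may round differently.

```python
import re


def parse_balance(text):
    """Convert a balance such as '300h 45m 40s' into fractional CPU hours."""
    match = re.fullmatch(r"\s*(\d+)h\s*(\d+)m\s*(\d+)s\s*", text)
    if match is None:
        raise ValueError(f"unrecognized balance: {text!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours + minutes / 60 + seconds / 3600


def cpus_for_hours(balance_hours, hrs):
    """Whole CPUs the balance can sustain for hrs hours (like -h hrs)."""
    return int(balance_hours // hrs)


def hours_for_cpus(balance_hours, num):
    """Hours a job on num CPUs can run on the balance (like -n num)."""
    return balance_hours / num


balance = parse_balance("300h 45m 40s")
print(cpus_for_hours(balance, 10))  # 30
print(hours_for_cpus(balance, 4))   # a little over 75 hours
```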

Running a Parallel Job

Use the following steps to submit a parallel job for processing:

  • Copy /nfs/docs/qsub-example.sh to your directory
  • Edit qsub-example.sh and change walltime, number of nodes, hat name, and executable as needed.
  • Rename qsub-example.sh to something related to your parallel job. Example: small-molecular-solution.sh
  • If you are running a binary from your home directory (or from anywhere outside $PATH), remember to prepend "./" to the command
  • Submit the job using qsub. Example: qsub small-molecular-solution.sh
  • Check the state of the job according to the scheduling system. The scheduling system evaluates the state of the queue every 30 seconds to a minute. To check the status of the scheduled jobs run "showq".

A brief list of commands follows; see the Torque and Moab documentation for additional commands:

qsub        Submit a job to the queue
qdel        Delete a job from the queue
showq       Show the status of submitted jobs
checkjob    Check the status of a particular job
showstart   Report on start and finish dates for a job

If your job does not appear in the showq output, wait 30 seconds and try again. If it is still not listed, it may have already been executed. Check for job-output-JOBID and script-name.* files.

If your job is listed in the showq output as "Deferred", run checkjob JOBID to determine why it was deferred. Common reasons include specifying the wrong hat and not having enough CPU hours in the hat.

If your job finishes and generates output that was unexpected or appears to indicate a system-level error, please forward your job-output-JOBID and script-name.* files to the System X Support listserv.

For additional information on running parallel jobs, see: Using MPI on System X.

Debugging Cluster

Users have access to an eight-node, sixteen-processor debug cluster. This cluster is NOT for performance testing and should only be used to debug code execution errors.

NOTE: The debug cluster reboots daily at 5:30 AM to purge any residual processes. Avoid starting debug jobs that would still be running at that time. We place no restrictions on the debug cluster, but hope that users will not use it for production runs.

To launch a debug job, run:

dbgrun [-printhostname] [-verbose] -np N debug1 ... debug8 a.out [args]
where N is number of nodes

or
dbgrun [-printhostname] [-verbose] -np N -hostfile hf a.out [args]
where N is number of nodes and hf is a file with the nodes named (debug1-debug8)

And, of course
dbgrun --help
will give out pertinent information to the memory challenged.
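
Both dbgrun forms shown above can be assembled with a short helper. This is a sketch under assumptions that the text does not confirm: that N equals the number of nodes listed, and that a hostfile holds one node name per line. The function names and the file name input.dat are hypothetical.

```python
DEBUG_NODES = [f"debug{i}" for i in range(1, 9)]  # debug1 ... debug8


def hostfile_text(hosts):
    """Assumed hostfile layout: one node name per line."""
    return "\n".join(hosts) + "\n"


def dbgrun_command(hosts, program, args=(), hostfile=None, verbose=False):
    """Build a dbgrun argument list for the given debug nodes."""
    unknown = [host for host in hosts if host not in DEBUG_NODES]
    if not hosts or unknown:
        raise ValueError(f"expected nodes from debug1-debug8, got {hosts!r}")
    cmd = ["dbgrun"]
    if verbose:
        cmd.append("-verbose")
    cmd += ["-np", str(len(hosts))]      # assumption: N = number of nodes listed
    if hostfile is not None:
        cmd += ["-hostfile", hostfile]   # nodes come from the file instead
    else:
        cmd += list(hosts)               # nodes named on the command line
    cmd.append(program)
    cmd += list(args)
    return cmd


print(" ".join(dbgrun_command(["debug1", "debug2"], "./a.out", ["input.dat"])))
# dbgrun -np 2 debug1 debug2 ./a.out input.dat
```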





VT-ARC is a Unit within the Office of the Vice President of Information Technology
© 2007-2008 Virginia Polytechnic Institute and State University
Page Last Updated: March 25th, 2008