Virginia Tech
Advanced Research Computing


System X User Support

For New Users
       Hardware Details
       System Maintenance
       Compiling
       Application Development/Porting
       Requirements for Jobs
       Running a Parallel Job
       Debugging Cluster
Tested Applications
Custom User Commands
Hello World Example
Allocation Requests
Obtaining an Account on System X
User Support Requests


For New Users

Hardware Details

Compute Nodes: 1100 Apple Xserve G5 cluster nodes with the following specifications:

  • Dual 2.3 GHz PowerPC 970FX processors
  • 4 GB ECC DDR400 (PC3200) RAM
  • 80 GB S-ATA hard disk drive
  • One Mellanox Cougar InfiniBand 4x HCA*
* HCA added from third party and not a build-to-order option

Compile Nodes: 3 Apple Xserve G5 nodes with the following specifications:
  • Dual 2.3 GHz PowerPC 970FX processors
  • 4 GB ECC DDR400 (PC3200) RAM
  • 3x250 GB S-ATA hard disk drive

System Maintenance

An optional maintenance window is scheduled from Thursday at noon until Friday at noon. By noon on Wednesday, an email is sent to all System X users stating whether or not the window will be used.

Compiling

  • MPICH-1.2.5 with modifications is the supported MPI library.
  • Compiler wrappers are available in /nfs/compilers/mpich-1.2.5/bin/. Each wrapper invokes the underlying compiler shown:
        mpiCC  --> g++
        mpicc  --> gcc
        mpicxx --> g++
        mpixlc --> xlc
        mpif77 --> xlf -qextname -L/usr/mellanox/lib
        mpif90 --> f90 -qextname -L/usr/mellanox/lib
  • MPICH has known bugs; please read the KnownBugs file at /nfs/storage1/docs/compilers/mpich-1.2.5/KnownBugs
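
As an illustration of how the wrappers listed above might be chosen, the following sketch prints (rather than runs) a compile command for a source file based on its extension. The function name compile_command, the extension-to-wrapper mapping, and the file name hello.c are assumptions made for this example; only the wrapper names and the bin directory come from the list above.

```shell
#!/bin/bash
# Directory holding the MPI compiler wrappers (from the list above).
MPICH_BIN=/nfs/compilers/mpich-1.2.5/bin

# compile_command SOURCE
# Prints a compile command for SOURCE. The extension-to-wrapper mapping is
# an illustrative assumption, not site policy.
compile_command() {
    local src="$1" wrapper
    case "$src" in
        *.c)               wrapper=mpicc  ;;  # gcc backend (mpixlc selects xlc)
        *.cc|*.cpp|*.cxx)  wrapper=mpicxx ;;  # g++ backend
        *.f)               wrapper=mpif77 ;;
        *.f90)             wrapper=mpif90 ;;
        *) echo "unknown source type: $src" >&2; return 1 ;;
    esac
    # Name the executable after the source file, minus its extension.
    echo "$MPICH_BIN/$wrapper -o ${src%.*} $src"
}

compile_command hello.c
```

Printing the command first makes it easy to check which wrapper will be used before compiling on a compile node.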

Application Development/Porting Consultants

Application-related consulting is available from the Laboratory for Advanced Scientific Computing and Applications (LASCA). LASCA Graduate Research Assistants (GRAs) are available to work with System X users in the areas of parallel algorithms, parallel applications development and porting, compilation and runtime issues, and performance measurement and tuning. LASCA faculty members are also available to discuss broader issues and potential collaborations. Consultation is done on an individual basis with no fee for Virginia Tech faculty, staff, and students. To contact a LASCA consultant, please email lasca@cs.vt.edu.

Requirements for Jobs

  • All jobs must be submitted through the queueing system.
  • Each job submitted must include details about CPUs required, estimated runtime and accounting information.
  • The CPU time consumed by each job must be allocated to the user before the job is run.
A "hat" is similar to a bank account. Users may have multiple hats, but only one hat at a time can be used for a job. To view your allocations, run "mybalance":

 Hat   CPU   Time Remaining
 ---------------------------
 test  1001  300h  45m 40s
 dept  2012  1200h 31m 10s

The queueing system assigns whole nodes rather than individual CPUs, so a job that requests only one CPU is still assigned an entire node and is charged for both CPUs in that node. For example, using one node for 2 hours consumes 4 CPU hours of an allocation.
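
The charging rule can be written as a one-line calculation. The sketch below assumes two CPUs per node (the dual-processor nodes described under Hardware Details) and whole hours; the function name charged_cpu_hours is hypothetical and is not a System X command.

```shell
#!/bin/bash
# CPUs charged per node, regardless of how many the job actually uses.
CPUS_PER_NODE=2

# charged_cpu_hours NODES HOURS
# Prints the CPU hours deducted from a hat for a job of that size.
charged_cpu_hours() {
    local nodes="$1" hours="$2"
    echo $(( nodes * CPUS_PER_NODE * hours ))
}

charged_cpu_hours 1 2    # one node for 2 hours: prints 4
```

Comparing this figure with the time remaining shown by "mybalance" before submitting helps avoid a deferred job.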

System X Queueing Policy

The queueing policy is First-In First-Out with backfill and per-user job limits. Jobs are scheduled in the order they appear in the queue, but a later job is started out of order if it can be squeezed in. The soft job limit is 2 jobs and the hard job limit is 20 jobs. Once a user reaches the soft limit, the queueing system skips that user's remaining jobs in the first pass of scheduling. If no other jobs can be scheduled, those skipped jobs are then evaluated for scheduling.
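
The two-pass effect of the soft limit can be sketched as follows. This is a simplified illustration, not the scheduler's actual logic: it ignores backfill, node availability, the hard limit, and jobs that are already running, and it assumes every job considered in the first pass is started. The function name schedule_order and the "user:job" labels are hypothetical.

```shell
#!/bin/bash
SOFT_LIMIT=2

# schedule_order USER:JOB ...
# Takes queued jobs in arrival order and prints them in the order they
# would be considered under the simplified two-pass rule.
schedule_order() {
    local job user count taken="" skipped=""
    for job in "$@"; do
        user="${job%%:*}"
        # How many of this user's jobs were already accepted in pass one?
        count=$(printf '%s\n' $taken | grep -c -x -F "$user" || true)
        if [ "$count" -lt "$SOFT_LIMIT" ]; then
            taken="$taken $user"
            echo "$job"
        else
            skipped="$skipped $job"
        fi
    done
    # Pass two: jobs skipped because their owner reached the soft limit.
    for job in $skipped; do
        echo "$job"
    done
}

schedule_order alice:a1 alice:a2 alice:a3 bob:b1
```

In this run the third job from alice is considered only after the job from bob, even though it was queued earlier.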

Running a Parallel Job

  • Copy /nfs/docs/qsub-example.sh to your directory
  • Edit qsub-example.sh and change walltime, number of nodes, hat name, and executable as needed.
  • Rename qsub-example.sh to something related to your parallel job. Example: small-molecular-solution.sh
  • If you are running a binary from your home directory (or from any location not in $PATH), remember to prepend "./" to the command
  • Submit the job using qsub. Example: qsub small-molecular-solution.sh
  • Check the state of the job by running "showq", which lists the status of scheduled jobs. The scheduling system evaluates the state of the queue every 30 seconds to a minute.
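
For orientation, the sketch below prints what a job script with these fields might look like. It is not a copy of /nfs/docs/qsub-example.sh: the use of the Torque -A option for the hat, the ppn=2 setting, the process count of two per node, and the mpirun launch line are all assumptions, and the function name make_job_script and the executable name my_solver are hypothetical. Take the real directives from the example script.

```shell
#!/bin/bash
# make_job_script NODES WALLTIME HAT EXECUTABLE
# Prints a Torque-style job script. Directive choices are assumptions;
# consult /nfs/docs/qsub-example.sh for the supported form.
make_job_script() {
    local nodes="$1" walltime="$2" hat="$3" exe="$4"
    cat <<EOF
#!/bin/bash
#PBS -l nodes=${nodes}:ppn=2
#PBS -l walltime=${walltime}
#PBS -A ${hat}
cd \$PBS_O_WORKDIR
mpirun -np $(( nodes * 2 )) -machinefile \$PBS_NODEFILE ./${exe}
EOF
}

# Write a script for 4 nodes, 1 hour, charged to the hat named "test".
make_job_script 4 01:00:00 test my_solver > small-molecular-solution.sh
```

The generated file would then be submitted with qsub small-molecular-solution.sh. Note the "./" in front of the executable, as required for binaries outside $PATH.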

A brief list of commands follows; see the Torque and Moab documentation for additional commands:

 qsub       Submit a job to the queue
 qdel       Delete a job from the queue
 showq      Show the status of submitted jobs
 checkjob   Check the status of a particular job
 showstart  Report on start and finish dates for a job

If your job does not appear in the showq output, wait 30 seconds and try again. If it is still not listed, it may have already run. Check for job-output-JOBID and script-name.* files.

If your job is listed in the showq output as "Deferred", run checkjob JOBID to determine why it was deferred. Common reasons include specifying the wrong hat and not having enough CPU hours in the hat.

If your job finishes with unexpected output, or with output that appears to indicate a system-level error, please forward your job-output-JOBID and script-name.* files to the System X Support listserv.

Debugging Cluster

Users have access to an eight-node, sixteen-processor debug cluster. This cluster is NOT for performance testing and should be used only to debug code execution errors.

NOTE: The debug cluster reboots daily at 5:30 AM to purge any residual processes. Avoid starting debug jobs that would still be running at that time. We impose no restrictions on the debug cluster, but ask that users not use it for production runs.

To launch a debug job, run:

dbgrun [-printhostname] [-verbose] -np N debug1 ... debug8 a.out [args]
where N is the number of nodes

or
dbgrun [-printhostname] [-verbose] -np N -hostfile hf a.out [args]
where N is the number of nodes and hf is a file listing the node names (debug1 through debug8)

And, of course,
dbgrun --help
prints usage information.
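
A host file for the second form can be generated rather than typed by hand. The sketch below assumes that dbgrun expects one node name per line, which the text above does not specify; the function name make_debug_hostfile is hypothetical.

```shell
#!/bin/bash
# make_debug_hostfile N
# Prints the names debug1 through debugN, one per line (assumed layout).
make_debug_hostfile() {
    local n="$1" i=1
    while [ "$i" -le "$n" ]; do
        echo "debug$i"
        i=$(( i + 1 ))
    done
}

# Write a host file covering all eight debug nodes.
make_debug_hostfile 8 > hf
```

The resulting file would then be passed as: dbgrun -np 8 -hostfile hf a.out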



Tested Applications



This is a list of community codes known to run on System X:

Application               Domain

AMBER                     molecular dynamics
ARPREC                    high-precision numerical methods
ARPS                      weather modeling
CHARMM                    molecular dynamics
FASTEST                   fluid dynamics
GAMESS                    quantum chemistry
Global Arrays             shared memory programming interface
LAMMPS                    molecular dynamics
METIS/ParMETIS            sparse matrix suite
mpiBLAST                  mpiBLAST segments the BLAST database and distributes it across cluster nodes
NWChem                    molecular dynamics
PETSc                     partial differential equation suite
ScaLAPACK                 dense and band matrix software
Unified Parallel C (UPC)  C programming extensions
VASP *                    molecular dynamics
VecLib                    (BLAS, LAPACK, FFT, DSP)
WRF                       weather modeling

* currently being ported



VT-ARC is a Unit within the Office of the Vice President of Information Technology
© 2007-2008 Virginia Polytechnic Institute and State University
Page Last Updated: July 5th, 2007