Virginia Tech
Advanced Research Computing

Using MPI on System X

Introduction

This document describes the "mechanics" of using an MPI program on Virginia Tech's System X. It assumes that an MPI program has already been written.

The implementation of MPI used on System X is MPICH, currently version 1.2.5. MPICH was developed at Argonne National Laboratory, which maintains the MPICH web site at http://www-unix.mcs.anl.gov/mpi/mpich1/. For a list of available compilers, see System X Compilers.

Refer to MPI Overview for a very brief overview of MPI.

This document will simply walk you through a typical series of steps that take an MPI program on your "home" machine, run it on System X, and bring the results back home.

At the end of this document are instructions on how to actually carry out these steps, using sample files available on the web.

Some Assumptions

For this simple introduction, we'll make a number of assumptions.

(One file, simple name): We'll assume you have a program, already written, which uses MPI, that the program consists of a single file, written in C, and that this file is called program.c.

(No input/Only standard output): We'll also assume for now that the program needs no input, and that the output of the program is entirely directed to the standard output device. In other words, the executing program does not read from or write to any auxiliary files.

(Source code on home machine): We'll assume this source code file is sitting in the source_code subdirectory on your home machine home_mac.

Our goal then is to transfer the file to System X, compile it, run it, and retrieve the output.


Transferring Source Code to System X

System X comprises 1100 nodes, each containing two processors. Most of these nodes are compute nodes, which are used exclusively for computation. But a few nodes, known as compile nodes, are set aside for interactive use, allowing users to create file directories, store files, compile them, submit jobs, and so on.

The System X compile nodes we are interested in have the following host names:

  • sysx1.arc.vt.edu
  • sysx2.arc.vt.edu
  • sysx3.arc.vt.edu

To compile your program, you will need to transfer the source code of your MPI program to one of these nodes. This can be done with the secure FTP program sftp. Here is a typical session, which suggests how you might transfer the file. We are assuming here that you have already set up a subdirectory on System X called work_directory.

home_mac: sftp sysx1.arc.vt.edu

sysx1: Password for user: xxxxx
sysx1: cd work_directory
sysx1: lcd source_code
sysx1: put program.c
sysx1: ls
sysx1: program.c
sysx1: quit

home_mac:

Note that the commands cd, pwd and ls are carried out on the remote machine (sysx1 in this case) while the corresponding commands lcd, lpwd and lls will be carried out on the local machine (home_mac in this example). The put command moves files from the local to the remote machine, while the get command brings files from the remote machine to the local one. If multiple files are to be transferred, the mget and mput commands can be used.
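The same transfer can be scripted instead of typed interactively. The sketch below only generates the list of sftp commands. It assumes an OpenSSH-style sftp client, whose -b option reads commands from a batch file, and it assumes that non-interactive (key-based) login to the compile node is available; this document confirms neither for System X. The function name make_upload_batch and the file name upload.txt are hypothetical.

```shell
# Print the sftp commands that upload one file.
# Arguments: remote directory, local directory, file name.
make_upload_batch() {
    printf 'cd %s\n'  "$1"
    printf 'lcd %s\n' "$2"
    printf 'put %s\n' "$3"
    printf 'ls\n'
    printf 'quit\n'
}

# Example (not run here; assumes key-based login is set up):
#   make_upload_batch work_directory source_code program.c > upload.txt
#   sftp -b upload.txt sysx1.arc.vt.edu
```

Replacing put with get in the generated list gives the corresponding download.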


Compiling the Source Code into an Executable

Once the source code file has been transferred to one of the compile nodes, you can log in to the compile node and compile your file. Since there is a single file server shared by all the compile nodes, you can log in to any one of the compile nodes you like, and you will see the same set of files.

To log in interactively, we use the Secure Shell program, ssh.

home_mac: ssh sysx2.arc.vt.edu

sysx2: Password for user: xxxxx
sysx2: cd work_directory
sysx2: mpicc program.c
sysx2: mv a.out program

Note that you must use the mpicc compiler to compile a C program that invokes the MPI library. If the compile command fails because the mpicc command cannot be found, you may need to invoke it with the full path name:


/nfs/compilers/mpich-1.2.5/bin/mpicc program.c

If the compilation fails, you will need to revise your program. You can either edit the program on your home machine and transfer it again, or make the changes directly on the System X copy.

In our example, we assume the compilation was successful. We allowed the compiler to assign the default name of a.out to the executable program it created, and then we renamed it to program. We're now ready to submit the program for execution, so we stay logged in.
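The fallback to the full path name, and the rename step, can be folded into a small helper. This is a minimal sketch: the function name find_compiler is hypothetical, and the -o option shown in the usage comment is the usual compiler option for naming the executable, which mpicc wrappers normally pass through to the underlying compiler.

```shell
# Print the first usable compiler: the command name if it is on PATH,
# otherwise the fallback path if that file is executable.
# Returns non-zero when neither is available.
find_compiler() {
    name=$1
    fallback=$2
    if command -v "$name" >/dev/null 2>&1; then
        command -v "$name"
    elif [ -x "$fallback" ]; then
        printf '%s\n' "$fallback"
    else
        return 1
    fi
}

# Example (not run here):
#   MPICC=$(find_compiler mpicc /nfs/compilers/mpich-1.2.5/bin/mpicc) &&
#       "$MPICC" -o program program.c
```

Naming the output with -o produces the executable program directly, so the separate mv a.out program step is not needed.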

Submitting a Script to Run the Executable

Once the executable program has been created, you need a shell script to run the program in parallel. This script specifies the number of processors to be used, the time limit, and so on. An example of such a shell script, with explanatory comments, is available in the System X file system as /nfs/docs/qsub-example.sh, or through the link qsub-example.sh.

Here is a simplified shell script for our example, which we will call program.sh.


#!/bin/bash
#
#PBS -lwalltime=00:00:30
#PBS -lnodes=2:ppn=2
#PBS -W group_list=???
#PBS -q production_q
#PBS -A $$$
#
NUM_NODES=`/bin/cat $PBS_NODEFILE | /usr/bin/wc -l | /usr/bin/sed "s/ //g"`
cd $PBS_O_WORKDIR
export PATH=/nfs/software/bin:$PATH
jmdrun -printhostname -np $NUM_NODES -hostfile $PBS_NODEFILE \
./program &> program_output.txt
exit;
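The NUM_NODES line in program.sh counts the lines of the file named by $PBS_NODEFILE. Under PBS-style schedulers this file usually lists one host name per requested processor slot, so a request of nodes=2:ppn=2 would produce four lines. That layout, and the host names in the mock file below, are assumptions made for illustration; this document does not state them.

```shell
# Count the entries in a node file, stripping the padding that some
# versions of wc add (the job script uses sed for the same purpose).
count_slots() {
    wc -l < "$1" | tr -d ' '
}

# Mock node file for nodes=2:ppn=2 (host names are made up).
nodefile=$(mktemp)
printf 'node001\nnode001\nnode002\nnode002\n' > "$nodefile"
count_slots "$nodefile"
rm -f "$nodefile"
```

The resulting count is the value that program.sh passes to jmdrun through its -np option.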

Replace the "???" field in program.sh with your group name. To find your group, log in to one of the System X compile nodes and type

     groups
Ignore the "staff" group in the output; use the other group that is listed as the value of the "???" field in the shell script.

Also replace the "$$$" field by your "hat", that is, the account to which your computer work is to be billed. The hat value was assigned when your project was approved and set up for System X.
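Both substitutions can be made with sed rather than by hand. The sketch below is one way to do it, under the assumption that the group and account names are plain identifiers. The function names and the sample values cfd_group and my_hat are hypothetical.

```shell
# Print the first group in a space-separated list that is not "staff".
pick_group() {
    for g in $1; do
        if [ "$g" != "staff" ]; then
            printf '%s\n' "$g"
            return 0
        fi
    done
    return 1
}

# Read a job script on standard input and write it to standard output
# with the "???" and "$$$" placeholders replaced.
# Arguments: group name, account ("hat").
# Assumes neither value contains "|", "&" or a backslash.
fill_job_script() {
    sed -e "s|group_list=[?][?][?]|group_list=$1|" \
        -e "s|^#PBS -A [\$][\$][\$]|#PBS -A $2|"
}

# Example (not run here):
#   fill_job_script "$(pick_group "$(groups)")" my_hat < program.sh > program_filled.sh
```

The bracket expressions keep sed from treating "?" and "$" as special characters, so only the literal placeholders are matched.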

We'll assume that the shell script program.sh is stored in work_directory, the subdirectory that also contains program.c and the executable program. To run the job, we "submit" the shell script to the queueing system: move to work_directory and issue the command:

     qsub program.sh

The qsub command asks the queueing system to schedule your job to run. The immediate response from the queueing system is a message that assigns a job number. The job number can be used to check on the progress of your job, and it will also be used as part of the name of the log files created when your job is done.

For example, the response to your qsub command might be

     40316.queue.tcf-int.vt.edu
in which case your job number is 40316.
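If the submission is scripted, the job number can be taken from the qsub response by dropping everything from the first period onward. A minimal sketch, with the hypothetical function name job_number:

```shell
# Print the numeric job number from a qsub response such as
# "40316.queue.tcf-int.vt.edu".
job_number() {
    printf '%s\n' "${1%%.*}"
}

# Example (not run here):
#   jobid=$(job_number "$(qsub program.sh)")
```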

Although our example job is small (only 30 seconds on 4 processors) and should run quickly, it is always possible to check on the status of all the jobs you have in the queue, by issuing the command

     showq | grep YOUR_NAME
which might show you:
   JOBID    USERNAME   STATE  PROCS  REMAINING  STARTTIME
   -------  --------   ------ ------ ---------  -----------
   40316    your_name    Idle      4  00:00:30  Mon Oct 15 14:06:00

You can also use the convenient command

     qstat -u YOUR_NAME
whose output format is a little different:
                                                      Req'd  Req'd   Elap
Job ID          Username Queue Jobname SessID NDS TSK Memory Time  S Time
--------------- -------- ----- ------- ------ --- --- ------ ----- - ----
40316.queue.tcf YOUR_NAM prodt program   --     2   1   --   00:00 Q --
      
This command gives you information about the number of nodes requested, the amount of time and memory requested, and so on. The "S" (for "status") field lists a value of "Q", which means the job has been queued but has not started to run. (Note that the output under each heading is truncated if it is long.)
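The status letter can also be extracted by a script. The sketch below assumes the column layout of the qstat output shown above, in which "S" is the tenth whitespace-separated field; other qstat versions may arrange the columns differently. The function name job_state is hypothetical.

```shell
# Read "qstat -u" output on standard input and print the S column
# (field 10 in the layout shown above) for the given job number.
job_state() {
    awk -v id="$1" 'index($1, id ".") == 1 { print $10 }'
}

# Example (not run here):
#   qstat -u YOUR_NAME | job_state 40316
```

Matching on the job number followed by a period avoids picking up a different job whose number merely begins with the same digits.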



Retrieving the Output File

The shell script directs the output of the executable to the file program_output.txt. When this file appears in your directory, the program has begun to execute, but you cannot assume it has finished. The program is done (and all the commands in the shell script have completed) when the standard output and standard error log files appear in the directory.

For our example, these files would have the names

     program.sh.o40316
     program.sh.e40316
because they hold the standard output ("o") and standard error ("e") of the run of program.sh, which was assigned the job number 40316. If you redirected the output of your executable program to a file (as we did), these log files typically won't contain anything of interest. However, if your job ran out of time or had a run time error, for instance, that information would be stored in the standard error log file.
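Since the log files signal completion, a script can wait for them to appear. This is a minimal sketch that relies only on the naming rule described above; the function name job_done and the ten-second polling interval are arbitrary choices.

```shell
# Succeed only when both log files for the given script and job number exist.
# Arguments: script name (optionally with a directory prefix), job number.
job_done() {
    [ -e "$1.o$2" ] && [ -e "$1.e$2" ]
}

# Example (not run here): poll until the job finishes, then show any errors.
#   until job_done program.sh 40316; do sleep 10; done
#   cat program.sh.e40316
```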

Assuming the job executed satisfactorily, you can examine the results or pull them back to your home machine using the sftp program:

     home_mac: sftp sysx3.arc.vt.edu

     sysx3: Password for user: xxxxx
     sysx3: cd work_directory
     sysx3: lcd source_code
     sysx3: get program_output.txt
     sysx3: lls
     sysx3: program.c  program_output.txt
     sysx3: quit

     home_mac:



Using Sample Files for Experimentation

Sample files are available, so that you can try out the procedures for file transfer, compilation, job submission, and output file recovery.

  • Copy the appropriate source code file (choose your favorite language) to your home machine.
    • program.c, C program;
    • program.cxx, C++ program; (an extension of ".C" or ".cpp" can also be used for C++ programs).
    • program.f, FORTRAN77 program;
    • program.f90, FORTRAN90 program;

  • Copy the shell script program.sh to your home machine.

  • Edit program.sh by inserting the necessary group and account fields at the beginning of the file.

  • Transfer the source code file and the shell script to your directory on System X.

  • Compile the source code using the appropriate compiler, and rename the executable to program.

  • Submit the shell script program.sh.

  • Retrieve the output file program_output.txt to your home machine and compare it to the sample results.
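For the compile step in the list above, the choice of compiler can be made from the file extension. The wrapper names below are the ones MPICH conventionally provides (mpicc, mpiCC, mpif77, mpif90); whether all of them are installed on System X is an assumption, and the System X Compilers page is the authoritative list. The function name compiler_for is hypothetical.

```shell
# Print the MPI compiler wrapper that matches a source file's extension.
# Returns non-zero for an unrecognized extension.
compiler_for() {
    case "$1" in
        *.c)              echo mpicc  ;;
        *.cxx|*.C|*.cpp)  echo mpiCC  ;;
        *.f)              echo mpif77 ;;
        *.f90)            echo mpif90 ;;
        *)                return 1    ;;
    esac
}

# Example (not run here):
#   "$(compiler_for program.f90)" -o program program.f90
```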



VT-ARC is a Unit within the Office of the Vice President of Information Technology
© 2007-2008 Virginia Polytechnic Institute and State University
Website Feedback   -   Page Last Updated:  January 16th, 2008