R

Introduction

R refers to both the R language and the software environment that supports it. Like MATLAB and Java, R is an interpreted language, although, like both of those languages, R code can be compiled or can interact with compiled modules written in the "big three" compiled mathematical languages (C, C++, and Fortran) to improve efficiency. The R language is semantically similar to its predecessor S, although its implementation resembles the (also interpreted) functional language Scheme.

Being designed primarily for statistical analysis, R contains many intrinsic functions for common operations in statistics, including (but not limited to) linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering, and smoothing. Additional add-on packages are available to expand this functionality. These add-on packages can be installed in a user's home directory and used with the centrally installed R or, if widely used, centrally installed by staff. If a package is not present when a user attempts to load it, R will automatically download it from the CRAN website and install it locally in the user's home directory.

Many R packages support parallel computing. Three are provided in the central R installations: snow, Rmpi, and pbdR. Each package is described in the sections below.

Interface

There are, broadly speaking, two distinct ways to use the R environment. The first is interactive, where commands are entered at the prompt and executed upon entry. The second is to develop a script of sequential commands and functions (usually with the .R extension) and submit it for processing with R CMD BATCH <path-to-script>. (To find the path to the current directory, use the pwd command.) Note that you must use the sink("<path-to-output>") function to direct your output to a file, and sink() to close the output stream. User-defined functions can be written in separate .R files and loaded into the script via the source() function.
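
For example, a minimal batch script (with illustrative file and variable names) might look like the following; submitting it with R CMD BATCH writes its results to the file named in sink():

      # analysis.R -- a minimal batch script (file names are illustrative)
      # source("helpers.R")          # optionally load user-defined functions from a separate .R file
      sink("analysis_output.txt")    # direct subsequent output to a file
      x <- rnorm(100)                # generate some sample data
      print(summary(x))              # written to analysis_output.txt rather than the terminal
      sink()                         # close the output stream
      # Submit from the shell with:  R CMD BATCH analysis.R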

The most efficient way to develop code in R is to write a script locally (testing correctness on simple cases in the interactive environment) and then submit that script for processing on the cluster. A host of commercial and freely available graphical user interfaces also exist to aid R development. To run the interactive R environment on the cluster, simply run the following commands:
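
    module load R
    R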

Once R is running, information about individual functions can be obtained with the help(function_name) command. To exit the R interactive environment, use the q() command. To avoid overloading the login nodes, it is best to download and install R on your local machine for interactive development; see the main R page for downloads for all major operating systems.

Snow

The snow package (an acronym for "Simple Network of Workstations", never capitalized) is designed to be accessible to the casual R programmer and consists of a series of commands that can be run on a cluster. Basic parallel operations, such as applying a function to a vector, are parallelized with a short set of simple calls. After a cluster is created with cl <- makeCluster(...), functions can be applied to vectors of arguments with calls such as clusterApply() and clusterApplyLB(). stopCluster(cl) eliminates the cluster and cleans up any resources tied up in maintaining it. A list of other common snow commands, and how to use them, is available from Simon Fraser University.
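
A minimal sketch of this pattern (the cluster size and the squaring function below are purely illustrative) might look like:

      # Minimal snow sketch: apply a function over a vector using a 4-worker cluster.
      library(snow)
      cl <- makeCluster(4, type = "SOCK")                 # socket cluster; use type = "MPI" under OpenMPI
      squares <- clusterApply(cl, 1:8, function(x) x^2)   # one element of 1:8 per task, spread over workers
      print(unlist(squares))
      stopCluster(cl)                                     # release the workers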

On the clusters, snow is available through the base R installation as long as the OpenMPI module is loaded, as in the following line (where gcc can be replaced by intel):

    module load gcc openmpi R
  

One of the principal uses of snow over Rmpi is to conveniently execute many similar operations in parallel. For example, running Monte Carlo algorithms, which rely on pseudorandom number generation, is a common task in statistics. To avoid every processor using the same random number seed, and to avoid investing in full Rmpi programming for the task, many researchers simply submit a large batch of serial Monte Carlo programs by hand or with a simple script. The different start times of the programs yield different random seeds, but the process is imprecise, messy, tedious, and difficult for both users and support staff to manage. Using snow to execute all of these programs in parallel instead requires the insertion of just a few lines of code. For example, if monte_carlo is the name of a Monte Carlo function to be executed in batch, the following code demonstrates how to do so. For ease of testing, a monte_carlo dice-rolling script has been provided, and it is assumed to be in the current directory when the snow script is run.

Monte-Carlo function: Dice rolling example.

      # Monte Carlo dice-rolling function
      # This simple function takes in an integer x and rolls that many dice, each
      #   with numsides sides, outputting the sum of the rolls.
      # To use this with the snow script below, copy and paste this section into
      #   "monte_carlo.R", or download the script in the Examples section.
      monte_carlo <- function(x, numsides=6){
        streamname <- .lec.GetStreams()
        dice <- .lec.uniform.int(streamname[1], n=x, a=1, b=numsides)
        outp <- sum(dice)
        return(outp)
      }

Snow script: Rolling the dice.

      # snow script:  Running a Monte Carlo script in parallel.
      # You may copy and paste this script directly, or download the script in the Examples section.
      # Load the snow and random number package.
      library(snow)
      # This example uses the already installed L'Ecuyer RNG package, rlecuyer.
      library(rlecuyer)
      # Set up our input/output
      source("./monte_carlo.R")
      sink("./monte_carlo_output.txt")
      # Create the cluster to execute the Monte Carlo function.
      # For the sake of this example, assume 2 processors have been acquired.  If you wish
      #   to run on a larger number of processors, edit the "2" below.
      cl <- makeCluster(2, type = "MPI")
      # Generate a seed for the pseudorandom number generator, unique to each
      # processor in the cluster.
      clusterSetupRNG(cl, type = "RNGstream")
      # Choose whichever of the following blocks best fits your own needs.
      # BLOCK 1
      # Set up the input to our Monte Carlo function.
      # Input is identical across the batch, only RNG seed has changed.
      # For this example, both workers will roll one die.
      input <- 1
      output <- clusterCall(cl, monte_carlo, input)

      # Output will show the results of two rolls of a six-sided die, generated
      #  with a different random number seed.

      #BLOCK 2

      # Input is different for each processor
      # input <- array(1:2)  # Set up array of inputs, with each entry
      # input[1] <- 1        #   corresponding to one processor.
      # input[2] <- 3
      # parameters <- array(1:2)  # Set up a second set of per-worker inputs.  These
      # parameters[1] <- 2        #   will be passed to monte_carlo as its
      # parameters[2] <- 6        #   second argument (the number of sides).

      # output <- clusterMap(cl, monte_carlo, input, parameters)
      # Output should show the result of a coin flip and the roll of three
      #   six-sided dice, each generated on a different processor.

      # Print the results.

      print(output)

      # Clean up the cluster and release the relevant resources.
      stopCluster(cl)
      sink()
    

The full documentation for the snow package is available on CRAN.

Rmpi

Rmpi provides an interface to MPI for R using a master-slave paradigm. It allows more advanced parallelism beyond the embarrassingly parallel applications supported by the snow package.

On the clusters, Rmpi is available through the base R installation as long as the OpenMPI module is loaded, as in the following line (where gcc can be replaced by intel):

    module load gcc openmpi R
  

Once the module is loaded, most standard MPI functionality is available through Rmpi; see the MPI page for details on the underlying mechanics. However, the semantics of the calls used by Rmpi are unique to R. The following sample "Hello World" program (from Acadia University) demonstrates this difference:

      # Load the R MPI package if it is not already loaded.
      if (!is.loaded("mpi_initialize")) {
       library("Rmpi")
       }
       
      # Spawn as many slaves as possible
      mpi.spawn.Rslaves()
       
      # In case R exits unexpectedly, have it automatically clean up
      # resources taken up by Rmpi (slaves, memory, etc...)
      .Last <- function(){
        if (is.loaded("mpi_initialize")){
          if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
          }
          print("Please use mpi.quit() to quit R")
          .Call("mpi_finalize")
        }
      }
       
      # Tell all slaves to return a message identifying themselves
      mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
       
      # Tell all slaves to close down, and exit the program
      mpi.close.Rslaves()
      mpi.quit()
       
      # Since mpi_finalize is called in mpi.quit(), no more mpi commands
      # should be invoked here.  Thus, detach the Rmpi library.
      detach("Rmpi")

A few simple differences between standard MPI and Rmpi stand out immediately. First, Rmpi uses an intrinsic master/slave paradigm in its implementation. Second, the mpi_finalize call is replaced by the mpi.quit function when exiting R normally, but not when R closes unexpectedly; since Rmpi rests on an existing MPI implementation, mpi_finalize must still be called (via the .Last function above) to clean up the environment after an unexpected abort. Similarly, the mpi_initialize function is called implicitly when the Rmpi package is loaded (hence the form of the conditional used to load the package).
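
As a brief illustrative sketch (the function and data below are not part of any installed example), objects can be pushed to the slaves with mpi.bcast.Robj2slave() and then evaluated remotely with mpi.remote.exec():

      # Illustrative sketch: push a function and data to the slaves, then call the function remotely.
      library(Rmpi)
      mpi.spawn.Rslaves(nslaves = 2)
      sq <- function(x) x^2                     # an illustrative function
      n  <- 5                                   # illustrative data
      mpi.bcast.Robj2slave(sq)                  # copy the function to every slave
      mpi.bcast.Robj2slave(n)                   # copy the data to every slave
      mpi.remote.exec(sq(n + mpi.comm.rank()))  # each slave evaluates using its own rank
      mpi.close.Rslaves()
      mpi.quit()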

More information, including an excellent Rmpi tutorial, can be obtained from Acadia University, and a deeper understanding of the underlying MPI mechanics can be obtained from the MPI page.

pbdR

The Programming with Big Data in R (pbdR) package is probably the R package that is most targeted toward high-performance computing users. It was first released in Fall 2012 and consists of a number of components:

  1. MPI: pbdMPI (SPMD-style MPI interface) and pbdPROF (Parallel Profiler) packages
  2. Distributed Linear Algebra and Statistics: pbdSLAP, pbdBASE, and pbdDMAT packages
  3. Interface to NetCDF4 file formats: pbdNCDF4 package

On the clusters, all of these components are available in a single pbdR module built against the gcc compiler and the OpenMPI MPI stack. Because of the pbdNCDF4 package, the pbdR module also depends on the NetCDF and HDF5 modules, so it can be loaded as follows:

      module load gcc openmpi R hdf5 netcdf pbdr
    

A pbdMPI program can then be run by invoking Rscript under mpirun, for example:

      mpirun -np $PBS_NP Rscript mcpi_pbdr.r
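
The referenced mcpi_pbdr.r is a separate Monte Carlo example; as an illustration of what a minimal pbdMPI script looks like, a hello-world sketch follows (every rank runs the same script, SPMD style):

      # Minimal pbdMPI sketch (illustrative; not the mcpi_pbdr.r script referenced above).
      library(pbdMPI)
      init()                                    # initialize MPI; every rank executes this script
      msg <- paste("Hello from rank", comm.rank(), "of", comm.size())
      comm.print(msg, all.rank = TRUE)          # print the message from every rank
      finalize()                                # shut down MPI cleanly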
    

Generating Figures

Starting with version 3, the central R installations enable plotting via the cairo interface. This allows a remote job to generate figures automatically and print them to a file. Using the interface requires three basic steps:

  1. Call a device such as png() or jpeg() with the flag type="cairo".
  2. Generate the figure as normal.
  3. Issue dev.off() to close the device and print the results of the plot to the file.

The following example generates a plot in PNG format (mh.draws is assumed to be a previously computed vector of samples):

    png('mh.draws.png',type="cairo")
    plot(mh.draws, main = "Sample by Iteration", xlab = "Iteration", ylab = "Samples", type="p", pch=".")
    dev.off()

cairo can also be set as the default for all devices (so that the type flag does not have to be specified each time) using the command:

    options(bitmapType = "cairo")

Adding Packages to R

As noted above, one of the strengths of R is the huge number of third-party packages available to add or improve functionality. The parallel computing packages described above are centrally installed, but users may also customize R by installing their own packages in their home directory. Fortunately, R makes this very straightforward. Here is an example using the scatterplot3d package:

  1. Start on the login node of a cluster. Create a folder where you want to install your R packages. For example, you might use the folder R/lib in your Home directory: mkdir -p $HOME/R/lib
  2. Open R. (You will need to load the R module using module load R and then type R.) You will get an R command prompt.
  3. Install the desired package using R's install.packages command. This will automatically download the required files and install them. Make sure that the lib option matches the folder that you created in Step 1:
          install.packages("scatterplot3d", lib="~/R/lib")
  4. R will print messages about its attempt to install the package. You should see a message at the end indicating that the package has been installed successfully (or, at least, the lack of an error message). If you exit R, you will see that files have been created in the folder that you created in Step 1 and indicated in the install.packages command in the previous step.
  5. To load the package in R and start using it, use the library command, this time with the lib.loc option:
          library("scatterplot3d",lib.loc="~/R/lib")

An alternative approach to using R's install.packages command is to manually download the source file and install it from the shell (not R) command prompt using R CMD INSTALL, as follows:

    R CMD INSTALL -l ~/R/lib scatterplot3d_0.3-35.tar.gz

Folders can be permanently added to R's library path (so that you need not specify the lib.loc option in the library command) by adding them to the R_LIBS environment variable. This can be done by adding the following line to the .Renviron file in your Home directory:

      R_LIBS="~/R/lib"
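
With R_LIBS set this way, a package installed in ~/R/lib (such as scatterplot3d above) can be attached without the lib.loc argument:

      library("scatterplot3d")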

Examples

The following examples run different kinds of parallel R jobs.

  Parallel R Type    R Functions                                       Submission Script (for Ithaca)
  snow               snow_example.R (also requires monte_carlo.R)      snow_qsub.sh
  Rmpi               rmpi_example.R                                    rmpi_qsub.sh

Resources

General

Parallel Computing