R refers to both the R language and the system environment that supports it. Like MATLAB and Java, R is an interpreted language, although, like both of those languages, R can be compiled or can interact with compiled modules from the “big three” of compiled mathematical languages (C, C++, and FORTRAN descendants) to improve efficiency. The R language is semantically similar to its predecessor S, although in implementation it is similar to the (also interpreted) functional language Scheme.
Being designed primarily for statistical analysis, R contains many intrinsic functions for common operations in statistics, including (but not limited to) linear and generalized linear models, nonlinear regression models, time series analysis, classical parametric and nonparametric tests, clustering and smoothing. Add-on packages are available to expand this functionality. These packages can be installed in a user’s home directory and used with the centrally installed R or, if widely used, installed centrally by ARC staff. If a package is not present when the user attempts to load it, R will automatically download it from the CRAN website and install it locally in the user’s home directory.
There are many packages in R for parallel computing. Three are provided in ARC’s R installations: snow, Rmpi, and pbdR. Each package is described in sections below.
There are, broadly speaking, two distinct ways to use the R environment. The first is interactively, where commands are entered at the prompt and each command is executed upon entry. The second is to write a script of several sequential commands and functions (usually with the .R extension) and submit this script for processing. The command to execute an R script is R CMD BATCH <path-to-script>. To find the path to the current directory, use the pwd command. By default, R CMD BATCH writes output to a .Rout file named after the script; to send output elsewhere, call sink("<path-to-output>") to set the destination and sink() to close the output stream. User-defined functions can be written in separate .R files and loaded into the script via the source() function.
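As a rough illustration of how source() and sink() fit together in a batch script, the sketch below writes a small helper file, loads it, and redirects printed output to a file. The file names (helpers.R, results.txt) and the square() function are hypothetical, and temporary paths are used only so the example is self-contained.

```r
# Hypothetical helper file standing in for a separate .R file of functions.
helper_path <- file.path(tempdir(), "helpers.R")
writeLines("square <- function(x) x^2", helper_path)

# Load the user-defined function into this script.
source(helper_path)

# Redirect printed output to a file, then restore the default output stream.
out_path <- file.path(tempdir(), "results.txt")
sink(out_path)
print(square(1:5))
sink()

# Read the file back to confirm what was written.
cat(readLines(out_path), sep = "\n")
```

Saved as a .R file, a script along these lines could be submitted with R CMD BATCH in the same way as any other script.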
The most efficient way to develop code in R is to develop a script locally (testing correctness with simple cases in the interactive environment) and to submit this script for processing on an ARC system. A host of commercial and freely available graphical user interfaces also exist to aid development in R. To run the interactive R environment on an ARC system, run the following commands:
- module load R: Load the R module
- R: Start the R interactive environment
Once R is running, information about individual functions can be obtained using the help(function_name) command. To exit the R interactive environment, use the q() command. In order to avoid overloading the login nodes, it is desirable to download and install R to your local machine for interactive development. See the main R page for downloads for all major operating systems.
Many R operations, such as dot products, matrix-vector multiplication, and matrix-matrix multiplication, use Basic Linear Algebra Subprograms (BLAS). To ensure that R runs as fast as possible, the R modules on ARC’s systems are built with BLAS libraries that have been optimized for the hardware. Starting with version 3.2.2, R is built with MKL for Intel compilers and OpenBLAS for gcc compilers. The plot below compares the run-times (lower is better) of the R 2.5 Benchmark for the base R installation and ARC’s builds with optimized BLAS. These tests were run on NewRiver (Intel Haswell CPU) with R 3.2.2.
Note that BLAS operations will run in parallel by default, potentially dramatically speeding up large matrix operations. However, this may be less than ideal for some parallel applications (such as running Rmpi or pbdMPI with one MPI process on each core). In these cases, a user may want to change the number of threads used by R’s BLAS operations. This can be done using environment variables. (OpenBLAS was chosen over ATLAS, which was used for earlier R versions, because it allows the number of threads used for BLAS operations to be set at run-time.) For example, to make R use only one thread (core) for BLAS operations, the following command can be issued prior to opening R:
export OPENBLAS_NUM_THREADS=1 #Use for R built against gcc/openblas
export MKL_NUM_THREADS=1 #Use for R built against intel/mkl
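From inside R, one way to confirm which thread settings the session inherited is to read the same environment variables back and time a BLAS-heavy operation. The following is a minimal sketch; the matrix size is an arbitrary assumption, and the timing will vary with the hardware and the BLAS library in use.

```r
# Report the BLAS thread settings this R session inherited from the shell.
# Variables that were never exported are reported as NA.
blas_vars <- c("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
settings <- Sys.getenv(blas_vars, unset = NA)
print(settings)

# Time a matrix-matrix multiplication, which is dispatched to BLAS.
n <- 500                              # arbitrary size chosen for illustration
a <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(b <- a %*% a)[["elapsed"]]
cat("Elapsed seconds for", n, "x", n, "multiply:", elapsed, "\n")
```

Running this once with the variable set to 1 and once with it unset gives a quick sense of how much threaded BLAS contributes for a given problem size.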
There are a number of packages that enable parallelism in R, each designed for different applications. These include:
- snow, snowfall: For “embarrassingly parallel” computations, such as executing many similar operations in parallel, as in Monte Carlo simulations.
- Rmpi: Provides an interface (wrapper) to MPI APIs.
- pbdR: The Programming with Big Data in R (pbdR) package is probably the R package most targeted toward high-performance computing users. It was first released in Fall 2012 and consists of a number of components:
  - MPI: pbdMPI (SPMD-style MPI interface)
  - Distributed Linear Algebra and Statistics: pbdSLAP, pbdBASE, and pbdDMAT packages
  - Interface to NetCDF4 file formats: pbdNCDF4 package
  - Profiling: pbdPROF (Parallel Profiler) package
Beginning with R 3.2, ARC centrally installs a number of these parallel computing packages in an R-parallel module built against OpenMPI. The version number of this R-parallel package is the same as the R version for which it is built (so R-parallel 3.2.0 is for R/3.2.0). For example, to load R 3.2.0 and its parallel packages, one might use:
module purge
module load gcc openblas R/3.2.0
module load openmpi hdf5 netcdf R-parallel/3.2.0
Example scripts using some of these packages are provided in the Examples section.
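To give a feel for the “embarrassingly parallel” pattern that snow targets, here is a minimal local sketch of a Monte Carlo estimate of pi. Note the substitution: it uses the parallel package that ships with base R, whose cluster functions are derived from snow, rather than snow itself. The worker count, seed, and sample sizes are illustrative assumptions.

```r
library(parallel)  # ships with R; its cluster functions derive from snow

# Count random points in the unit square that fall inside the quarter circle.
count_hits <- function(n) {
  x <- runif(n)
  y <- runif(n)
  sum(x^2 + y^2 <= 1)
}

cl <- makeCluster(2)             # two local worker processes (assumed count)
clusterSetRNGStream(cl, 42)      # reproducible, independent random streams
draws <- rep(50000, 4)           # four independent tasks of equal size
hits <- parLapply(cl, draws, count_hits)
stopCluster(cl)                  # always release the workers when done

# Combine the per-task counts into a single estimate.
pi_estimate <- 4 * sum(unlist(hits)) / sum(draws)
print(pi_estimate)
```

Each task is independent of the others, so no communication between workers is needed until the results are combined at the end.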
Starting with version 3.2, ARC’s R installations enable plotting via two interfaces:
- cairo. This allows a remote job to automatically generate figures and print them to a file.
- X11. This allows a user to interactively create and view plots. This interface cannot be used on remote jobs or when a user is not logged in with X11 forwarding enabled.
X11 is the default interface set by R. Using the cairo interface requires three basic steps:
- Call a device such as png() or jpeg() with the flag type="cairo".
- Generate the figure as normal.
- Issue dev.off() to close the device and print the results of the plot to the file.
The following example generates a plot in PNG format:
png('mh.draws.png', type="cairo")
plot(mh.draws, main = "Sample by Iteration", xlab = "Iteration", ylab = "Samples", type="p", pch=".")
dev.off()
The Metropolis-Hastings snow code below provides an example of offline figure generation using cairo.
cairo can also be set as the default for all devices (so that the type flag doesn’t have to be used each time) using the command:
options(bitmapType = "cairo")
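The three steps can also be tried end to end with stand-in data. In the sketch below, the random-walk data and the output path are hypothetical (the mh.draws object from the example above is not defined here), and the capabilities() check guards against R builds without cairo support.

```r
# Stand-in data: a simple random walk used in place of real sampler output.
set.seed(1)
draws <- cumsum(rnorm(1000))

out_file <- file.path(tempdir(), "draws.png")  # hypothetical output path

if (capabilities("cairo")) {
  png(out_file, type = "cairo")   # Step 1: open a cairo-backed PNG device
  plot(draws, main = "Sample by Iteration",
       xlab = "Iteration", ylab = "Samples",
       type = "p", pch = ".")     # Step 2: generate the figure as normal
  dev.off()                       # Step 3: close the device, writing the file
} else {
  message("This R build lacks cairo support; no figure was written.")
}
```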
As noted above, one of the strengths of R is the huge number of third-party packages that are available to add or improve functionality. ARC centrally installs parallel computing packages (described above), but users may also customize R by installing their own packages in their home directory. Fortunately, R makes this very straightforward to do. Here is an example using the scatterplot3d package:
- Start on the login node of an ARC cluster. Create a folder where you want to install your R packages. For example, you might use the folder R/lib in your Home directory: mkdir -p $HOME/R/lib
- Open R. (You will need to load the R module using module load R and then type R.) You will get an R command prompt.
- Install the desired package using R’s install.packages command. This will automatically download the required files and install them. Make sure that the lib option matches the folder that you created in Step 1:
- R will print messages about its attempt to install the package. You should see a message at the end indicating that the package has been installed successfully (or, at least, the lack of an error message). If you exit R, you will see that files have been created in the folder that you created in Step 1 and indicated in the install.packages command in the previous step.
- To load the package in R and start using it, use the library command, this time with the lib.loc option:
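The commands for Steps 3 and 5 do not appear above. A minimal sketch of what they might look like is given here, wrapped in a helper so that the install only happens when the package is missing. The helper name install_and_load is hypothetical (it is not part of R), and the default ~/R/lib path and the CRAN mirror URL are assumptions.

```r
# Install a package into a personal library (if missing) and load it from there.
# install_and_load is a hypothetical helper, not an R built-in.
install_and_load <- function(pkg, lib_dir = "~/R/lib",
                             repos = "https://cloud.r-project.org") {
  lib_dir <- path.expand(lib_dir)
  dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

  # Step 3: install into the personal library only when not already there.
  if (!pkg %in% rownames(installed.packages(lib.loc = lib_dir))) {
    install.packages(pkg, lib = lib_dir, repos = repos)
  }

  # Step 5: load the package from the personal library.
  library(pkg, lib.loc = lib_dir, character.only = TRUE)
}

# Usage (requires network access to a CRAN mirror):
# install_and_load("scatterplot3d")
```

Typed directly at the R prompt, the equivalent calls would be install.packages("scatterplot3d", lib = "~/R/lib") followed by library(scatterplot3d, lib.loc = "~/R/lib").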
An alternative approach to using R’s install.packages command is to manually download the source file and install it from the shell (not R) command prompt using R CMD INSTALL, as follows:
R CMD INSTALL -l ~/R/lib scatterplot3d_0.3-35.tar.gz
Folders can be permanently added to R’s library path (so that you need not specify the lib.loc option in the library command) by adding them to the R_LIBS environment variable. This can be done by adding the following line to the .Renviron file in your Home directory:
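The line itself is not shown here; assuming the ~/R/lib folder used earlier, an entry of the form R_LIBS=~/R/lib would be the likely candidate. The same effect can be had for a single session from within R using .libPaths(), as in this sketch, which uses a temporary folder as a stand-in for ~/R/lib so that it runs anywhere.

```r
# Stand-in for ~/R/lib so the example does not touch the home directory.
lib_dir <- file.path(tempdir(), "R", "lib")
dir.create(lib_dir, recursive = TRUE, showWarnings = FALSE)

# Prepend the folder to the library search path for this session only.
.libPaths(c(lib_dir, .libPaths()))

# Folders are searched in the order listed, so packages installed in
# lib_dir now take precedence over the central installation.
print(.libPaths())
```

Once a folder is on the search path, library(scatterplot3d) works without the lib.loc option.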
The following examples run different kinds of parallel R jobs.
|Parallel R Type|Description|R Functions|Sample Submission Script|
|---|---|---|---|
|Rmpi|Round Robin Message Passing|messages_rmpi.r|messages_rmpi.qsub|
- Download R for your local machine
- The Virginia Tech Statistics Department’s Laboratory for Interdisciplinary Statistical Analysis (LISA) offers some excellent short courses on R. Many of the courses are recorded and available online.
- The R Inferno provides an in-depth look at best practices for programming in R.
- The Art of R Programming: A Tour of Statistical Software Design is a good book-length reference on the R language.
- ARC offered classes on Parallel R in September-October 2015. Slides and sample codes used in these classes are available here:
- Rmpi Full Documentation
- Snow Full Documentation
- A quick list of some simple snow commands
- A more in-depth look at snow
- The website for the pbdR (Programming with Big Data in R) project includes downloads, tutorials, and demos.