Using the Intel MIC Cards on BlueRidge

Overview

BlueRidge is a 408-node Cray CS-300 cluster, of which 130 nodes (br001-br130) are equipped with two Intel Many Integrated Core (MIC) cards each. Intel MIC cards, also known as Xeon Phi, provide GPU-style acceleration for highly parallel tasks but offer tighter integration with the host CPUs and greater compatibility with existing CPU programming paradigms (C/C++, Fortran, etc.) than traditional GPUs. Each MIC card on BlueRidge has 60 cores running at 1.05 GHz. This page describes how to use the Xeon MIC cards. The main BlueRidge page is here.

The table below compares the specifications for BlueRidge’s CPUs and MICs:

Specification          CPU (2 per node)                     MIC (2 per node*)
Model                  Intel Xeon E5-2670 (Sandy Bridge)    Intel Xeon Phi 5110P
Cores                  8                                    60
Clock Speed            2.60 GHz                             1.05 GHz
Memory                 64 GB                                8 GB
L1 Cache               32 KB (per core)                     32 KB (per core)
L2 Cache               512 KB (per core)                    512 KB (per core)
L3 Cache               20 MB (shared)                       N/A
Vector Unit            256-bit (4 DPFP)                     512-bit (8 DPFP)
Theoretical Peak (DP)  166 GFlop/s                          1,011 GFlop/s

* Two MICs are available on each of BlueRidge nodes 1-130.
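
The theoretical peak figures can be checked with a quick back-of-the-envelope calculation: peak double-precision throughput is roughly cores × clock × vector lanes × 2 flops per lane per cycle (a multiply and an add each cycle). Using the nominal clock speeds (the Xeon Phi 5110P's is 1.053 GHz, which the table rounds to 1.05 GHz):

    % Rough check of the peak double-precision rates quoted in the table
    \begin{align*}
    \text{CPU:}\quad & 8~\text{cores} \times 2.6~\text{GHz} \times 4~\text{lanes} \times 2 \approx 166~\text{GFlop/s} \\
    \text{MIC:}\quad & 60~\text{cores} \times 1.053~\text{GHz} \times 8~\text{lanes} \times 2 \approx 1{,}011~\text{GFlop/s}
    \end{align*}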

Xeon MIC cards can be used in a number of different ways. They run a version of Linux, so users can log into them and run tasks directly on them; this is known as “native” execution. Users can also run jobs on the CPUs on a node (the “host”) and then use Intel’s Math Kernel Library (MKL) or compiler directives to push highly parallel portions of the job to the MIC cards; this is known as “offload”. Finally, MIC cards can run MPI processes directly, either on their own or alongside processes on the host CPUs; on BlueRidge this is supported with Intel MPI (see MPI on the MIC and Symmetric Mode below). The sections below provide in-depth examples of each of the ways that the BlueRidge MIC cards can be used.

Native Jobs

This section describes how to log into a Xeon MIC card on a BlueRidge node and run a parallel program on it. The example code to be used is helloflops3, a C code taken from Intel Xeon Phi Coprocessor High Performance Programming by Jim Jeffers and James Reinders (see the Resources section below for the full reference). This is a simple single-precision scalar multiplication and vector addition (“SAXPY”) program that uses OpenMP to obtain the parallelism necessary to maximize the MIC card’s computing throughput. To follow this example, you will need to download this program (or one similar to it) to one of your directories on BlueRidge.
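
For reference, the heart of such a program looks roughly like the sketch below. This is a minimal sketch in the spirit of helloflops3, not the book’s actual code; the array size, repeat count, and output format are illustrative choices:

    /* Minimal sketch in the spirit of helloflops3.c (not the book's actual
     * code): repeatedly compute fa = a*fa + fb with OpenMP threads.
     * Array size and repeat count are illustrative. */
    #include <stdio.h>
    #include <omp.h>

    #define ARRAY_SIZE (1024*1024)
    #define LOOP_COUNT 10000

    static float fa[ARRAY_SIZE] __attribute__((aligned(64)));
    static float fb[ARRAY_SIZE] __attribute__((aligned(64)));

    int main(void)
    {
        float a = 1.1f;
        double t0, secs, gflops;
        int i;

        for (i = 0; i < ARRAY_SIZE; i++) {
            fa[i] = (float)i + 0.1f;
            fb[i] = (float)i + 0.2f;
        }

        t0 = omp_get_wtime();

        /* Each thread repeatedly sweeps its own contiguous chunk of the arrays. */
        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();
            int tid      = omp_get_thread_num();
            int chunk    = ARRAY_SIZE / nthreads;
            int start    = tid * chunk;
            int j, rep;

            if (tid == 0)
                printf("Using %d threads...\n", nthreads);

            for (rep = 0; rep < LOOP_COUNT; rep++)
                for (j = start; j < start + chunk; j++)
                    fa[j] = a * fa[j] + fb[j];   /* 2 flops per element */
        }

        secs   = omp_get_wtime() - t0;
        gflops = 1e-9 * 2.0 * ARRAY_SIZE * LOOP_COUNT;
        printf("Gflops = %10.3f, Secs = %10.3f, GFlops per sec = %10.3f\n",
               gflops, secs, gflops / secs);
        return 0;
    }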

Getting Started

First, we gain access to a MIC-enabled node by submitting an interactive job. Interactive jobs are the easiest way to test MIC code. To request an interactive job, enter a command like the following, where yourallocation is replaced by the name of your allocation account.

      [jkrometi@brlogin2 ~]$ qsub -I -l walltime=2:00:00 -l nodes=1:ppn=16:mic -q normal_q -W group_list=blueridge -A yourallocation

The job may take a few moments to start. When it does, you will receive a prompt on one of the MIC-enabled nodes (CPUs). (Note, though, that if BlueRidge is very busy and a MIC-enabled node is not available at the time you request one, you will have to wait for a node to become available. This may take a long time.)

      qsub: waiting for job 37122.master.cluster to start
      qsub: job 37122.master.cluster ready

      [jkrometi@br002 ~]$

Compiling for the MIC

Once the job starts, change to the directory where you saved helloflops3.c. Make sure that you have the Intel compiler module loaded using module load intel. Then compile the code with an icc command like the following:

      [jkrometi@br002 jeffers]$ icc -mmic -openmp -O3 helloflops3.c -o helloflops3

The -openmp flag tells the compiler that this is an OpenMP program. The -mmic flag tells the compiler to compile the program for the MIC card.

Running on the MIC

Once the program is finished compiling, we need to log into one of the MIC cards to run it. Each MIC-enabled BlueRidge node has access to two MIC cards, called mic0 and mic1. To log into one of them, simply type:

      [jkrometi@br002 jeffers]$ ssh mic0

You may get a warning message; then you will see the command prompt on the MIC card ($). Now we can change directories and run the program that we compiled earlier:

      ~ $ cd mic/jeffers
      ~/mic/jeffers $ ./helloflops3
      Initializing
      Starting Compute
      Using 240 threads...
      Gflops =   6144.000, Secs =      3.066, GFlops per sec =   2003.945

The default number of threads is 240 because the MIC cards on BlueRidge have 60 cores and each core offers four hardware threads. In this example, we immediately see the computational power of the MIC: the program sustained roughly 2 trillion single-precision floating point operations per second on a single MIC card. A comparison of program performance for other numbers of threads is provided below.

Finishing Up

To exit the MIC card, we simply type exit:

      ~/mic/jeffers $ exit
      Connection to mic0 closed.
      [jkrometi@br002 jeffers]$

We are now back on the host (CPU). To end our interactive job (before the walltime expires), type exit again:

      [jkrometi@br002 jeffers]$ exit
      logout

      qsub: job 37122.master.cluster completed

Performance Comparison

Here we will compare how performance changes when running with different numbers of threads and with the threads placed on different cores of the MIC. With one thread, we obtain roughly 16 Gflops (billion floating point operations per second). This is roughly half of the theoretical peak performance of a single core on the MIC, because the MIC is designed to run at least two threads on each core; a single thread can issue instructions only on every other clock cycle and therefore does computations only half of the time. With two threads we obtain more or less the theoretical maximum performance (KMP_AFFINITY=compact forces the two threads onto a single core):

      ~/mic/jeffers $ OMP_NUM_THREADS=2 KMP_AFFINITY=compact ./helloflops3
      Initializing
      Starting Compute
      Using 2 threads...
      Gflops =     51.200, Secs =      1.520, GFlops per sec =     33.683

Each MIC core can handle up to four threads. However, for this application, performance does not improve when we put four threads onto a core:

      ~/mic/jeffers $ OMP_NUM_THREADS=4 KMP_AFFINITY=compact ./helloflops3
      Initializing
      Starting Compute
      Using 4 threads...
      Gflops =    102.400, Secs =      3.046, GFlops per sec =     33.614

However, once we get beyond four threads, the threads spread out onto more than one core and the performance increases accordingly:

      ~/mic/jeffers $ OMP_NUM_THREADS=8 KMP_AFFINITY=compact ./helloflops3
      Initializing
      Starting Compute
      Using 8 threads...
      Gflops =    204.800, Secs =      3.047, GFlops per sec =     67.203

Similarly, if we use KMP_AFFINITY=scatter to split four threads across four cores, we get roughly the single thread performance of 16 Gflops on each:

      ~/mic/jeffers $ OMP_NUM_THREADS=4 KMP_AFFINITY=scatter ./helloflops3
      Initializing
      Starting Compute
      Using 4 threads...
      Gflops =    102.400, Secs =      1.519, GFlops per sec =     67.392

In general, to maximize performance, jobs on the MIC card should spread threads across all of the cores and use multiple threads per core. For this reason, KMP_AFFINITY=scatter or KMP_AFFINITY=balanced (or something similar) will typically yield higher performance for MIC programs.

Offloading with Directives

This section describes how to use compiler directives (similar to OpenMP pragmas) to offload sections of a program running on the CPU (“host”) to its MIC cards. The example code used below is omp_hello_offload, a C OpenMP program that runs “Hello World” on a MIC card. The code is no different from a standard OpenMP Hello World program, except for the addition of a single line:

    #pragma offload target(mic)

This directive simply tells the compiler to move the region inside the following {} braces to the MIC card.
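
For orientation, the whole program might look roughly like the sketch below. This is a hedged reconstruction rather than the actual omp_hello_offload.c; in particular, the exact output format is an assumption:

    /* Sketch of an OpenMP "Hello World" whose parallel region is offloaded
     * to a MIC card (illustrative; may differ from omp_hello_offload.c). */
    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        /* Everything inside the braces below executes on the MIC card;
         * the rest of the program runs on the host CPU. */
        #pragma offload target(mic)
        {
            #pragma omp parallel
            {
                int tid = omp_get_thread_num();
                if (tid == 0)
                    printf("Number of threads = %d\n", omp_get_num_threads());
                printf("Hello World from thread = %d\n", tid);
            }
        }
        return 0;
    }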

Getting Started

First, we gain access to a MIC-enabled node by submitting an interactive job. Interactive jobs are the easiest way to test MIC code. To request an interactive job, enter a command like the following, where yourallocation is replaced by the name of your allocation account.

      [jkrometi@brlogin2 ~]$ qsub -I -l walltime=2:00:00 -l nodes=1:ppn=16:mic -q normal_q -W group_list=blueridge -A yourallocation

The job may take a few moments to start. When it does, you will receive a prompt on one of the MIC-enabled nodes (CPUs). (Note, though, that if BlueRidge is very busy and a MIC-enabled node is not available at the time you request one, you will have to wait for a node to become available. This may take a long time.)

      qsub: waiting for job 37122.master.cluster to start
      qsub: job 37122.master.cluster ready

      [jkrometi@br002 ~]$

Compiling an Offload Job

Once the job starts, change to the directory where you saved omp_hello_offload.c. Make sure that you have the Intel compiler module loaded using module load intel. Then compile the code with an icc command like the following:

      [jkrometi@br002 omp]$ icc -openmp -O3 omp_hello_offload.c -o omphw.offload

The -openmp flag tells the compiler that this is an OpenMP program. The -O3 flag tells the compiler to use aggressive optimization. Note that we did not include the -mmic flag because this program will be running on the host, not on the MIC card. Note also that we did not have to explicitly tell the compiler to use offloading – offloading is the default behavior when a #pragma offload is encountered. By contrast, we could have disabled offloading using the -no-offload flag.

Running an Offload Job

Once the program is finished compiling, we can simply run it on the host and the offload portions will automatically be sent to a MIC card. Before we do so, though, we have to do two things:

  1. Load the mkl and mic modules using the command: module load mkl mic
  2. Set environment variables to define how we want the offloading to be done. The offloading uses environment variables beginning with a prefix that is set in the variable MIC_ENV_PREFIX (the default is MIC_). For example, the OMP_NUM_THREADS variable controls how many OpenMP threads will be used by default on the host (CPU), so MIC_OMP_NUM_THREADS (i.e. OMP_NUM_THREADS + a MIC_ prefix) will control how many threads will be used by default on the MIC card.

So to run our omphw.offload program, we might enter commands like the following:

      [jkrometi@br002 omp]$ module load mkl mic
      [jkrometi@br002 omp]$ module list
      Currently Loaded Modules:
       1) intel/13.1    2) mkl/11    3) mic/1.0
      [jkrometi@br002 omp]$ export MIC_ENV_PREFIX=MIC
      [jkrometi@br002 omp]$ export MIC_OMP_NUM_THREADS=2
      [jkrometi@br002 omp]$ ./omphw.offload
      Hello World from thread = 0
      Number of threads = 2
      Hello World from thread = 1
      [jkrometi@br002 omp]$ export MIC_OMP_NUM_THREADS=4
      [jkrometi@br002 omp]$ ./omphw.offload
      Hello World from thread = 0
      Number of threads = 4
      Hello World from thread = 1
      Hello World from thread = 2
      Hello World from thread = 3

Note that all of these commands are entered on the host (CPU) – we never need to log into the MIC cards themselves. In the first case, we set MIC_OMP_NUM_THREADS=2 and see that the program does indeed run with two threads. In the second, we set MIC_OMP_NUM_THREADS=4 and see that the program then runs with four threads.

Running an Offload Job in a Non-interactive Session

Putting all of the above together, it is fairly simple to run an offload job as part of a batch (rather than interactive) job by entering the following commands in a standard BlueRidge submission script:

      #Load modules
      module load intel mkl mic
      #Compile
      icc -openmp -O3 omp_hello_offload.c -o omphw.offload
      #Run
      export MIC_ENV_PREFIX=MIC
      export MIC_OMP_NUM_THREADS=4
      ./omphw.offload

(Note that in many cases the program will already have been compiled and will not need to be compiled again as part of the submission script.)

Automatic Offloading with MKL

This section describes how Intel’s Math Kernel Library (MKL) can automatically offload certain kinds of tasks to a MIC card when it determines that doing so will be computationally advantageous. Automatic Offload (AO) chooses the fastest mode for executing the routine: only on the CPU, only on the MIC card, or combined CPU and MIC execution. The example code used below is matmul_mkl, a C program that does matrix multiplication using MKL. The key line in the program is this call to the CBLAS double-precision matrix multiplication routine provided as part of MKL:

    	cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, size, size, size, 1.0, A, size, B, size, 0.0, C, size);

The program takes a -s <size> flag that specifies the size (number of rows/columns) of the matrices to be multiplied.
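
For orientation, the surrounding program might look roughly like the following. This is a minimal sketch, not the actual matmul_mkl.c: it hard-codes the matrix size rather than parsing -s, runs a single multiplication rather than several timed iterations, and uses mkl_malloc for the arrays:

    /* Minimal sketch of an MKL dgemm test (illustrative; matmul_mkl.c itself
     * parses -s, times multiple iterations, and reports Gflops). */
    #include <stdio.h>
    #include <mkl.h>

    int main(void)
    {
        int size = 8192;
        size_t n = (size_t)size * (size_t)size;
        size_t i;
        double *A = (double *)mkl_malloc(n * sizeof(double), 64);
        double *B = (double *)mkl_malloc(n * sizeof(double), 64);
        double *C = (double *)mkl_malloc(n * sizeof(double), 64);

        for (i = 0; i < n; i++) {
            A[i] = 1.0;
            B[i] = 2.0;
            C[i] = 0.0;
        }

        /* C = 1.0*A*B + 0.0*C.  With MKL_MIC_ENABLE=1 set in the environment,
         * MKL may split this call between the host CPUs and the MIC cards. */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    size, size, size, 1.0, A, size, B, size, 0.0, C, size);

        printf("C[0] = %f\n", C[0]);

        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
        return 0;
    }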

Compiling an Automatic Offload Job

Make sure that you have the Intel compiler module loaded using module load intel. Then compiling this program to use automatic offloading is no different from building it to run on a normal CPU:

    	[jkrometi@br002 mic]$ icc -std=c99 -O3 -mkl matmul_mkl.c -o mm_mkl

The -mkl flag tells the compiler to link against MKL, the -O3 flag enables aggressive optimization, and -std=c99 is needed because of the way the arrays are declared (this is unrelated to the MIC).

Running an Automatic Offload Job

Once the program is compiled, the MKL_MIC_ENABLE environment variable controls whether or not it will be run with automatic offloading enabled. Before running, we also need to make sure that the mkl and mic modules are loaded. So to run this program with automatic offloading, we can use the commands:

    	[jkrometi@br002 mic]$ module load mkl mic
      [jkrometi@br002 mic]$ MKL_MIC_ENABLE=1 ./mm_mkl -s 8192

The -s 8192 flag says that the matrices each have 8,192 rows/columns. The program will provide some output about performance. Here it yielded more than 400 GFlops (billion floating point operations per second):

      ./mm_mkl
      #iters: 10
      size: 8192
      runtime:
      max: 3.22035
      min: 2.25827
      avg: 2.51784
      Gflops: 436.688

By contrast, to run it without automatic offloading, we can use:

    	[jkrometi@br002 mic]$ module load mkl mic
      [jkrometi@br002 mic]$ MKL_MIC_ENABLE=0 ./mm_mkl -s 8192
      ./mm_mkl
      #iters: 10
      size: 8192
      runtime:
      max: 4.13207
      min: 4.03368
      avg: 4.0651
      Gflops: 270.476

If offloading is enabled, we can also use the OFFLOAD_REPORT environment variable to get a report of what is being offloaded:

      [jkrometi@br002 mic]$ MKL_MIC_ENABLE=1 OFFLOAD_REPORT=2 ./mm_mkl -s 8192
      ./mm_mkl
      [MKL] [MIC --] [AO Function]    DGEMM
      [MKL] [MIC --] [AO DGEMM Workdivision]  0.24 0.38 0.38
      [MKL] [MIC 00] [AO DGEMM CPU Time]      6.331041 seconds
      [MKL] [MIC 00] [AO DGEMM MIC Time]      1.119346 seconds
      [MKL] [MIC 00] [AO DGEMM CPU->MIC Data] 749207552 bytes
      [MKL] [MIC 00] [AO DGEMM MIC->CPU Data] 849346560 bytes
      [MKL] [MIC 01] [AO DGEMM CPU Time]      6.331041 seconds
      [MKL] [MIC 01] [AO DGEMM MIC Time]      1.114404 seconds
      [MKL] [MIC 01] [AO DGEMM CPU->MIC Data] 749207552 bytes
      [MKL] [MIC 01] [AO DGEMM MIC->CPU Data] 849346560 bytes
      (etc etc)

Note, though, that offload does not occur for smaller matrices, as MKL makes the determination that the performance benefit of running the computation on the MIC card is not worth the cost of transferring the data to the MIC card. For example, for matrices with 1,024 rows/columns nothing is reported by OFFLOAD_REPORT:

      [jkrometi@br002 mic]$ MKL_MIC_ENABLE=1 OFFLOAD_REPORT=2 ./mm_mkl -s 1024
      ./mm_mkl
      #iters: 10
      size: 1024
      runtime:
      max: 0.0099734
      min: 0.00894235
      avg: 0.00955424
      Gflops: 224.768

Note that the commands for compiling and running above can either be entered by hand as part of an interactive job or run as part of a batch job by entering them in the standard BlueRidge submission script.

Offloading to Multiple MICs

This example describes how to use more than one MIC at a time by running one MPI process per MIC and then offloading tasks from each process to the corresponding MIC. (For more on basic offloading, see Offloading with Directives above.) The program structure is as follows: multiple nodes, two MPI processes per node (because there are two MICs on each node), and each MPI process offloading onto its MIC and spawning multiple threads there.

The example code used below is mpi_hello_offload, a C program that does a hybrid MPI/OpenMP “Hello World”. It will in essence do “Hello World” twice: once to report rank and host (node) for each MPI process and once to report thread ID and MIC name for each thread on each MIC.
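
The core of such a program might look roughly like the sketch below. This is illustrative rather than the actual mpi_hello_offload.c; in particular, computing the rank on the node as rank % 2 (valid when ranks are placed two per node in order, as with -ppn 2 below) and calling gethostname() inside the offload region are choices made for this sketch:

    /* Sketch of a hybrid MPI + offload "Hello World": each MPI rank offloads
     * an OpenMP region to one of the two MIC cards on its node.
     * (Illustrative; details may differ from mpi_hello_offload.c.) */
    #include <stdio.h>
    #include <unistd.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs, namelen, card;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Get_processor_name(host, &namelen);

        /* With two ranks per node (-ppn 2), rank % 2 gives the rank on the
         * node, which we use to pick mic0 or mic1. */
        card = rank % 2;
        printf("Process rank %d of %d is on host %s. Rank on node is %d.\n",
               rank, nprocs, host, card);

        /* target(mic:card) offloads this region to MIC number card. */
        #pragma offload target(mic:card)
        {
            char michost[128];
            gethostname(michost, sizeof(michost));  /* e.g. br015-mic0 */

            #pragma omp parallel
            printf("Hello World from thread %d on host %s\n",
                   omp_get_thread_num(), michost);
        }

        MPI_Finalize();
        return 0;
    }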

The examples of compiling and running below were done with a two-node interactive job (better for debugging) started with this command:

      [jkrometi@brlogin2 ~]$ qsub -I -l walltime=2:00:00 -l nodes=2:ppn=16:mic -q normal_q -W group_list=blueridge -A yourallocation

All of the commands below can also be executed in a batch job (better for production runs) as described below.

Compiling to Offload to Multiple MICs

This example will use the mvapich2 MPI stack. We start by making sure that we have both the intel and mvapich2 modules loaded. (Note also that we do not have either the mkl or mic modules loaded.)

      [jkrometi@br015 mpi]$ module purge
      [jkrometi@br015 mpi]$ module load intel mvapich2
      [jkrometi@br015 mpi]$ module list
      Currently Loaded Modules:
        1) intel/13.1    2) mvapich2/1.9a2

Then we issue the mpicc command to compile the program:

      [jkrometi@br015 mpi]$ mpicc -openmp mpi_hello_offload.c -o mpihw.offload

The executable is now in the file mpihw.offload.

Running to Offload to Multiple MICs

Before running, we need to make sure that the mkl and mic modules are loaded. We can also set the MIC_OMP_NUM_THREADS environment variable to control how many threads are used on each MIC. So to run this program with two threads on each MIC, we can use the commands:

    	[jkrometi@br015 mpi]$ module load mkl mic
      [jkrometi@br015 mpi]$ export MIC_OMP_NUM_THREADS=2

Then we can use mpiexec to run the program. Here we are trying to run on two nodes, so we use the -np 4 flag to indicate that we want four MPI processes in total and the -ppn 2 flag to indicate that we want two processes on each node:

      [jkrometi@br015 mpi]$ mpiexec -np 4 -ppn 2 ./mpihw.offload
      Process rank 0 of 4 is on host br015. Rank on node is 0.
      Process rank 2 of 4 is on host br019. Rank on node is 0.
      Process rank 1 of 4 is on host br015. Rank on node is 1.
      Process rank 3 of 4 is on host br019. Rank on node is 1.
      Hello World from thread 0 on host br019-mic0
      Hello World from thread 1 on host br019-mic0
      Hello World from thread 0 on host br015-mic1
      Hello World from thread 1 on host br015-mic1
      Hello World from thread 0 on host br015-mic0
      Hello World from thread 1 on host br015-mic0
      Hello World from thread 0 on host br019-mic1
      Hello World from thread 1 on host br019-mic1

In the first four lines of output, the program reports the rank and host for each MPI process (two on host node br015 and two on br019). The remaining eight lines are reported from the MICs – we see that two threads are running on each of br015-mic0, br015-mic1, br019-mic0, and br019-mic1.

Running to Offload to Multiple MICs with a Batch Job

The above commands (run on node br015) were run inside of an interactive job on nodes br015 and br019. However, all of the above commands could also be run within a batch job. An example qsub script is br_mpihw_offload.qsub. To submit it, we use the qsub command:

      [jkrometi@brlogin2 mpi]$ qsub mpihw_offload.qsub
      48251.master.cluster

In this case, our job number is 48251. We can then check the progress of our jobs. Here it is running (state R):

      [jkrometi@brlogin2 mpi]$ qstat -u jkrometi

      bradmin2.arc.vt.edu.mgt:
                                                                                        Req'd    Req'd       Elap
      Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
      ----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
      48251.master.cluster    jkrometi    normal_q mpihw_offload.qs      0     2     32    --   00:10:00 R  00:00:01

Here it is complete (state C):

      [jkrometi@brlogin2 mpi]$ qstat -u jkrometi

      bradmin2.arc.vt.edu.mgt:
                                                                                        Req'd    Req'd       Elap
      Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
      ----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
      48251.master.cluster    jkrometi    normal_q mpihw_offload.qs 130448     2     32    --   00:10:00 C       --

Then we can confirm that we got the expected output by looking at the output file (in this case, mpihw_offload.qsub.o48251):

      [jkrometi@brlogin2 mpi]$ cat mpihw_offload.qsub.o48251
      Process rank 0 of 4 is on host br022. Rank on node is 0.
      Process rank 2 of 4 is on host br020. Rank on node is 0.
      Process rank 1 of 4 is on host br022. Rank on node is 1.
      Process rank 3 of 4 is on host br020. Rank on node is 1.
      Hello World from thread 0 on host br022-mic0
      Hello World from thread 1 on host br022-mic0
      Hello World from thread 0 on host br020-mic1
      Hello World from thread 1 on host br020-mic1
      Hello World from thread 0 on host br020-mic0
      Hello World from thread 1 on host br020-mic0
      Hello World from thread 0 on host br022-mic1
      Hello World from thread 1 on host br022-mic1

MPI on the MIC and Symmetric Mode

This section describes how to run MPI programs on the MIC, across multiple MICs, and across both hosts (CPUs) and MICs (known as “symmetric mode”). This is currently only supported using Intel MPI. The example code used below is montecarlo.c, a C MPI program provided by Intel that uses a Monte Carlo method to calculate the value of pi.

Note: While many MPI programs are run with one process per core on a CPU, this is not recommended for programs running on a MIC. This is because MIC coprocessors have a much smaller amount of memory per core (8 GB for 60 cores on a BlueRidge MIC vs. 64 GB for the 16 CPU cores on a BlueRidge node). Thus it becomes more critical to write hybrid (e.g. MPI-OpenMP) code, where OpenMP is used by the processes running on the MIC to leverage its full multicore potential.
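
To make the note above concrete, the sketch below shows the kind of hybrid MPI/OpenMP structure it recommends, using a Monte Carlo estimate of pi that is similar in spirit to, but not the same as, Intel’s montecarlo.c. The sample count, random number generator, and output format are all illustrative:

    /* Sketch of a hybrid MPI + OpenMP Monte Carlo estimate of pi: count
     * random points in the unit cube that fall inside the unit sphere octant.
     * (Illustrative only; Intel's montecarlo.c differs in its details.) */
    #define _POSIX_C_SOURCE 200112L   /* for rand_r() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv)
    {
        long long local_in = 0, local_n = 0, total_in = 0, total_n = 0;
        long long samples_per_rank = 100000000LL;   /* illustrative */
        int rank, nprocs;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Each MPI rank spawns OpenMP threads; on the MIC this is what uses
         * the many cores without needing one MPI process per core. */
        #pragma omp parallel reduction(+:local_in,local_n)
        {
            unsigned int seed = 1234u + 1000u * rank + omp_get_thread_num();
            long long i, n = samples_per_rank / omp_get_num_threads();

            for (i = 0; i < n; i++) {
                double x = (double)rand_r(&seed) / RAND_MAX;
                double y = (double)rand_r(&seed) / RAND_MAX;
                double z = (double)rand_r(&seed) / RAND_MAX;
                if (x * x + y * y + z * z <= 1.0)
                    local_in++;
            }
            local_n += n;
        }

        MPI_Reduce(&local_in, &total_in, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        MPI_Reduce(&local_n,  &total_n,  1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        /* One octant of the unit sphere fills pi/6 of the unit cube. */
        if (rank == 0)
            printf("pi ~= %.8f (from %lld points)\n",
                   6.0 * (double)total_in / (double)total_n, total_n);

        MPI_Finalize();
        return 0;
    }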

Getting Started

First, we gain access to a MIC-enabled node by submitting an interactive job. Interactive jobs are the easiest way to test MIC code. To request an interactive job, enter a command like the following, where yourallocation is replaced by the name of your allocation account.

      [jkrometi@brlogin2 ~]$ qsub -I -l walltime=2:00:00 -l nodes=1:ppn=16:mic -q normal_q -W group_list=blueridge -A yourallocation

The job may take a few moments to start. When it does, you will receive a prompt on one of the MIC-enabled nodes (CPUs). (Note, though, that if BlueRidge is very busy and a MIC-enabled node is not available at the time you request one, you will have to wait for a node to become available. This may take a long time.)

      qsub: waiting for job 68470.master.cluster to start
      qsub: job 68470.master.cluster ready

      [jkrometi@br107 ~]$

Compiling

Once the job starts, change to the directory where you saved montecarlo.c. Make sure that you have the modules for the Intel compiler and Intel MPI stack loaded (and that these are the only modules loaded):

      [jkrometi@br107 mc]$ module purge; module load intel impi

Now we are ready to compile. The CPU and MIC are different architectures, so we need separate executables for each. First we compile the CPU version:

      [jkrometi@br107 mc]$ mpiicc montecarlo.c -o montecarlo

The CPU executable is now in the file montecarlo. The syntax above should be relatively familiar, though we are using the mpiicc wrapper instead of mpicc. Now we compile the MIC executable:

      [jkrometi@br107 mc]$ mpiicc -mmic montecarlo.c -o montecarlo.mic

Here we have used the same syntax, except that we added the -mmic flag since the executable is to run on the MIC (the same flag is used for compiling native programs above). We have also named the executable montecarlo.mic. This is important since the .mic extension matches the one in the I_MPI_MIC_POSTFIX environment variable set by the impi module. This means that when we run in symmetric mode, MPI will automatically use montecarlo for the CPU processes and montecarlo.mic for the MIC processes.

Running

Once the program has been compiled for both the CPU and MIC, it can be run on one or more CPUs, one or more MICs, or some combination thereof (“symmetric mode”). Some examples include:

  • Run on one or more CPUs (this is more or less the normal syntax):
          mpirun -np 16 ./montecarlo
  • Run using 4 processes on MIC 0:
          mpirun -n 4 -host mic0 ./montecarlo
  • Run using 4 processes on each MIC on a node:
          mpirun -n 4 -host mic0 ./montecarlo : -n 4 -host mic1 ./montecarlo
  • Run in symmetric mode using 16 processes on the host and 4 processes on MIC 0:
          mpirun -n 16 -host br107 ./montecarlo : -n 4 -host mic0 ./montecarlo

This syntax can quickly become complicated for larger runs. A cleaner way is to use a machinefile to specify each host and an associated number of processes. For a run on nodes br107 and br096 using 16 processes on each CPU and 4 processes on each MIC, this might look like the following:

    br107:16
    br107-mic0:4
    br107-mic1:4
    br096:16
    br096-mic0:4
    br096-mic1:4

Then to run using this machinefile, simply run:

    mpirun -machinefile hosts_file $( pwd )/montecarlo

Here we’ve assumed that the machinefile is called hosts_file. The $( pwd ) syntax is used to provide the full path to our executable; if the full path is not provided, MPI fails to find the executable on the MIC. To automate this process, the symhosts.sh script can be called from within a running BlueRidge MIC job to generate a machinefile for the nodes assigned to that job, given a number of processes for each CPU and each MIC. For example, the above machinefile could be generated by running:

    ./symhosts.sh --ppn 16 --ppmic 4 --file hosts_file

Running MPI Programs within a Batch Job

The above commands (run on node br107) were run inside of an interactive job on nodes br107 and br096. However, all of the above commands could also be run within a batch job. An example qsub script is br_symmetric.qsub (also requires montecarlo and symhosts.sh). To submit it, we use the qsub command:

      [jkrometi@brlogin2 mc]$ qsub br_symmetric.qsub 
      68750.master.cluster

In this case, our job number is 68750. We can then check the progress of our jobs. Here it is running (state R):

      [jkrometi@brlogin2 mpi]$ qstat -u jkrometi

      bradmin2.arc.vt.edu.mgt:
                                                                                        Req'd    Req'd       Elap
      Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
      ----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
      68750.master.cluster    jkrometi    normal_q br_symmetric.qsu      0     2     32    --   00:30:00 R  00:00:35

Here it is complete (state C):

      [jkrometi@brlogin2 mpi]$ qstat -u jkrometi

      bradmin2.arc.vt.edu.mgt:
                                                                                        Req'd    Req'd       Elap
      Job ID                  Username    Queue    Jobname          SessID  NDS   TSK   Memory   Time    S   Time
      ----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
      68750.master.cluster    jkrometi    normal_q br_symmetric.qsu 109860     2     32    --   00:30:00 C       --

Then we can confirm that we got the expected output by looking at the output file (in this case, br_symmetric.qsub.o68750). The output file is quite long in this case (“Hello World” and results from each of the 48 processes – 16 on each of the two CPUs and four on each of the four MICs), so only portions of it are printed here:

      Hello world: rank 0 of 48 running on br012.
      Hello world: rank 1 of 48 running on br012.
      Hello world: rank 2 of 48 running on br012.
      Hello world: rank 3 of 48 running on br012.
      [output truncated]
      Hello world: rank 22 of 48 running on br012-mic1.
      Hello world: rank 23 of 48 running on br012-mic1.
      Hello world: rank 24 of 48 running on br006.
      Hello world: rank 25 of 48 running on br006.
      [output truncated]
      Elapsed time from rank 45:      49.41 (sec) 
      Elapsed time from rank 46:      49.38 (sec) 
      Elapsed time from rank 47:      53.31 (sec) 
      Out of 4294967295 points, there are 2248823941 points inside the sphere => pi=  3.141571044922

The full output is provided here.

Resources

The following references may be helpful for those hoping to learn more about writing and optimizing code for the MIC coprocessor:

  1. Intel’s Main MIC Developers Website
  2. Stampede User Guide at the Texas Advanced Computing Center (TACC)
  3. Intel Xeon Phi Coprocessor High Performance Programming by Jim Jeffers and James Reinders (copyright 2013, published by Morgan Kaufmann, ISBN 978-0-124-10414-3)