OpenMP is an Application Program Interface (API) for parallel programming in C, C++, and Fortran on shared-memory machines. It is maintained by the OpenMP Architecture Review Board (ARB) and supported on a wide array of system architectures and operating systems. It is generally easier to use but less versatile than MPI (though many programmers use a hybrid approach, where shared-memory parallelism is achieved via OpenMP and distributed memory communication is achieved through MPI).
OpenMP programs use “fork-join” parallelism, where the program executes sequentially until directed to create parallel (“slave”) threads. When a parallel section of the program is encountered, such as a large do or for loop, the iterations of the loop can be split up and executed in parallel. The parallel section can then be exited, after which sequential execution resumes until the next parallel section is encountered. The user can change the number of parallel threads for each parallel section. Note that this is fundamentally different from MPI, where all processes execute from the beginning of the program to the end.
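The fork-join pattern described above can be sketched in a few lines of C. This is a minimal illustration, not code from the original examples: the function name `count_team_members` and its `requested` parameter are hypothetical, and the number of threads actually created may be smaller than the number requested. If the file is compiled without the OpenMP flag, the pragmas are ignored and the function simply runs with one thread.

```c
/* Hypothetical helper: returns how many threads executed the parallel region. */
int count_team_members(int requested)
{
    int members = 0;   /* sequential part: only the initial thread runs here */

    /* Fork: a team of up to `requested` threads executes the block below. */
    #pragma omp parallel num_threads(requested)
    {
        /* Each thread registers itself once; atomic keeps the update safe. */
        #pragma omp atomic
        members += 1;
    }
    /* Join: the team has finished and sequential execution resumes. */

    return members;
}
```

Calling `count_team_members(4)` from `main` would typically return 4 on a machine with at least four cores when OpenMP is enabled, and 1 when it is not.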
OpenMP is implemented primarily through compiler directives, which can be easily added to an existing serial C or Fortran program. The directives look like comment statements, and the compiler can be told to ignore or activate them. This means that the conversion of a sequential code to one that takes full advantage of OpenMP can be done in stages, and at every step the user has a working code with improved performance. OpenMP also provides library functions and environment variables that can be used in a program.
OpenMP directives are specially formatted comments (in Fortran) or pragmas (in C and C++) that specify parallelism in the source code. In C or C++, they begin with the #pragma omp sentinel; in Fortran, they begin with the !$OMP, C$OMP, or *$OMP sentinel. Different directives produce different thread behavior:
- Parallel: Creates a team of threads; the enclosed code is executed by every thread
- Do (Fortran), For (C/C++): Work sharing of loop iterations among the threads
- Sections: Work sharing by splitting, so that one thread does one section and another does another section
- Single: For a section to be executed by only one thread (but not a particular thread)
- Master: For sections to be executed by only the master thread
- Atomic or Critical: Guarantees that a section of code will be executed by only one thread at a time. Atomic is intended for a single update of a memory location (e.g., incrementing a counter); Critical is intended for longer blocks of code that must not run concurrently
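As a rough sketch of how the Parallel, For, and Single directives listed above fit together in C, consider the function below. The name `fill_squares` and its parameters are hypothetical, chosen only for this illustration. Without the OpenMP compiler flag the pragmas are ignored and the code runs serially with the same result.

```c
/* Hypothetical helper: stores i*i in out[i] and returns how many times the
   Single block ran (expected to be 1, whatever the size of the team). */
int fill_squares(long *out, int n)
{
    int single_runs = 0;

    #pragma omp parallel            /* Parallel: create a team of threads */
    {
        #pragma omp for             /* For: iterations are divided among the team */
        for (int i = 0; i < n; i++) {
            out[i] = (long)i * i;
        }

        #pragma omp single          /* Single: exactly one thread runs this */
        single_runs += 1;
    }

    return single_runs;
}
```

Every thread enters the parallel region, but each loop iteration is executed by only one of them, and the Single block runs once in total.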
Directive behavior can be adjusted through the use of clauses, such as (but not limited to) the following:
- Data Scoping (e.g., Private, Shared): Changes which variables are shared among threads (only have a single value) or have private copies for each thread. Many directives accept one or both of the Private and Shared clauses.
- Schedule (e.g., Static, Guided, Dynamic): Changes how work, such as the iterations in a parallel for loop, is divided among the threads to create better load balancing and minimize thread idle-time. Note that the options, such as Guided, that may produce better balancing also incur some overhead during execution, so they may not always provide better performance depending on the application.
- Reduction: Provides a thread-safe way to combine private copies of a variable into a single result at the end of a parallel section. For an example, see the vector norm program in the Examples section.
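- Nowait: Eliminates the synchronization (barrier) that occurs by default at the end of a work-sharing construct, such as a Do/For loop, Sections, or Single block. The barrier at the end of the parallel region itself cannot be removed.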
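The sketch below shows one way the Private, Reduction, and Schedule clauses might be combined on a single loop in C. It is an illustration only and is not the openmp_vnorm program referenced elsewhere on this page; the function name `sum_of_squares` and the variable names are assumptions made for the example.

```c
/* Hypothetical helper: returns v[0]^2 + v[1]^2 + ... + v[n-1]^2. */
double sum_of_squares(const double *v, int n)
{
    double sum = 0.0;   /* combined across threads by the reduction clause */
    double term;        /* scratch value; each thread gets its own copy */

    /* private(term):    every thread works on its own `term`
       reduction(+:sum): every thread accumulates a private partial sum,
                         and the partial sums are added together at the end
       schedule(static): iterations are divided into fixed blocks up front */
    #pragma omp parallel for private(term) reduction(+:sum) schedule(static)
    for (int i = 0; i < n; i++) {
        term = v[i] * v[i];
        sum += term;
    }

    return sum;
}
```

Taking the square root of the returned value would give the Euclidean norm of the vector. Replacing `schedule(static)` with `schedule(dynamic)` or `schedule(guided)` changes only how iterations are handed out, not the result.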
OpenMP provides a few simple functions that can be called in a program to guide execution:
|omp_get_num_threads()|Returns the number of threads in the current team|
|omp_get_thread_num()|Returns the ID (0 to n-1) of the thread calling it|
|omp_get_num_procs()|Returns the number of processors available to the program|
|omp_in_parallel()|True if in a parallel region and multiple threads are executing|
|omp_set_num_threads(n)|Sets the number of threads for subsequent parallel regions to n|
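A short C sketch of how some of these functions might be used together follows. The function name `observed_team_size` is hypothetical, and the serial stand-ins in the `#else` branch are an assumption of this sketch (they are not part of OpenMP); they exist only so that the file still builds when the OpenMP flag is left off.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch, not part of OpenMP). */
static void omp_set_num_threads(int n) { (void)n; }
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* Hypothetical helper: requests n threads and returns the team size that
   was actually observed inside the parallel region. */
int observed_team_size(int n)
{
    int team = 0;

    omp_set_num_threads(n);          /* applies to later parallel regions */

    #pragma omp parallel
    {
        /* Only thread 0 records the size, so there is a single writer. */
        if (omp_get_thread_num() == 0) {
            team = omp_get_num_threads();
        }
    }

    return team;
}
```

The runtime is allowed to provide fewer threads than requested, so the returned value should be treated as the authoritative team size.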
Shared memory can be difficult to manage when multiple threads need to edit the same shared variable. For example, assume that a program uses two threads (labeled T1 and T2) and each is supposed to increment (add 1 to) the variable x. Then each thread needs to read the current value of x, calculate its new value, and write that value to x. However, the order in which these steps occur – which cannot be controlled by the program – affects the resulting value of x:
|Start: x=0|Start: x=0|
|1. T1 reads x=0|1. T1 reads x=0|
|2. T1 calculates x=0+1=1|2. T2 reads x=0|
|3. T1 writes x=1|3. T1 calculates x=0+1=1|
|4. T2 reads x=1|4. T2 calculates x=0+1=1|
|5. T2 calculates x=1+1=2|5. T1 writes x=1|
|6. T2 writes x=2|6. T2 writes x=1|
This is an example of a “race condition”, where two threads are “racing” to complete a task and the order in which they finish affects the result of the program. OpenMP has some mechanisms to avoid these situations, as described below. Note, however, that these mechanisms serialize portions of the program, which reduces the performance benefit of multithreading, so their use should be minimized when possible.
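The two orderings shown above can be replayed deterministically in ordinary serial C, which makes the lost update easy to see. This is only a simulation of the interleavings, not real multithreaded code: the local variables `t1` and `t2` stand in for the private working values of threads T1 and T2, and both function names are hypothetical.

```c
/* Replays the first ordering: T1 finishes before T2 starts. */
int replay_one_after_another(void)
{
    int x = 0;
    int t1 = x;        /* 1. T1 reads x=0 */
    t1 = t1 + 1;       /* 2. T1 calculates 0+1=1 */
    x = t1;            /* 3. T1 writes x=1 */
    int t2 = x;        /* 4. T2 reads x=1 */
    t2 = t2 + 1;       /* 5. T2 calculates 1+1=2 */
    x = t2;            /* 6. T2 writes x=2 */
    return x;
}

/* Replays the second ordering: both threads read before either writes. */
int replay_overlapped(void)
{
    int x = 0;
    int t1 = x;        /* 1. T1 reads x=0 */
    int t2 = x;        /* 2. T2 reads x=0 */
    t1 = t1 + 1;       /* 3. T1 calculates 0+1=1 */
    t2 = t2 + 1;       /* 4. T2 calculates 0+1=1 */
    x = t1;            /* 5. T1 writes x=1 */
    x = t2;            /* 6. T2 writes x=1, so T1's increment is lost */
    return x;
}
```

The first function returns 2 and the second returns 1, even though both perform "two increments". In a real threaded program, which of the two orderings occurs can differ from run to run.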
- The Atomic directive can be used to ensure that a single update statement will be executed by only one thread at a time. So adding an Atomic directive before the line x=x+1 would eliminate the race condition in the example above by ensuring that a thread would not start executing that line while another thread was in the process of executing it.
- The Critical directive is similar to Atomic, except that it can be used for more than one instruction. So if no more than one thread should be executing a section of code at a time, the programmer should bracket that code in a Critical directive.
- Locks provide a more flexible means of ensuring mutual exclusion between threads by allowing the programmer to set up a variable (called a lock) that controls access to a given section of code. Locks are manipulated through library functions (shown here with their Fortran names; the C/C++ equivalents are lowercase, such as omp_init_lock, and take a pointer to the lock variable):
  - OMP_INIT_LOCK(lck): Initialize a lock variable
  - OMP_SET_LOCK(lck): “Take” the lock so that no other threads can access the section of code. To be used as a thread begins executing the section of code.
  - OMP_UNSET_LOCK(lck): “Release” the lock so that another thread can access the section of code. To be used when a thread is done executing the section of code.
  - OMP_DESTROY_LOCK(lck): Destroy a lock variable
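The three mechanisms above can be compared side by side in C. The sketch below is illustrative only: the function names are hypothetical, and the serial stand-ins in the `#else` branch are an assumption of this sketch (not part of OpenMP) so that the file also builds without the OpenMP flag. Each function increments a shared counter n times and should return n no matter how many threads are used.

```c
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial stand-ins (an assumption of this sketch, not part of OpenMP). */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Atomic: protects the single update statement that follows it. */
int count_with_atomic(int n)
{
    int x = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        x += 1;
    }
    return x;
}

/* Critical: protects a whole block, one thread at a time. */
int count_with_critical(int n)
{
    int x = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        {
            x = x + 1;
        }
    }
    return x;
}

/* Lock: the programmer takes and releases the lock explicitly. */
int count_with_lock(int n)
{
    int x = 0;
    omp_lock_t lck;
    omp_init_lock(&lck);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        omp_set_lock(&lck);      /* wait until the lock is available */
        x = x + 1;
        omp_unset_lock(&lck);    /* let another thread proceed */
    }
    omp_destroy_lock(&lck);
    return x;
}
```

All three versions serialize the increment, so for a simple counter like this a Reduction clause would usually be the faster choice; the sketch is meant only to show the syntax of each mechanism.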
To compile an OpenMP program, you need to add flags to your compiler command:
- The OpenMP flag for the Intel compiler is -openmp. So to compile the C program myopenmp.c, the command would be: icc myopenmp.c -openmp -o icc_omp.out. Note that more recent Intel compiler releases use -qopenmp in place of -openmp.
- The OpenMP flag for the GNU compiler is -fopenmp. So to compile the C program myopenmp.c, the command would be: gcc myopenmp.c -fopenmp -o gcc_omp.out. Note that GCC support for OpenMP began with GCC version 4.2, so OpenMP will not work with previous versions of the compiler. GCC versions 4.2.x and 4.3.x implement OpenMP spec v2.5; GCC 4.4.x introduced support for OpenMP spec v3.0, and later GCC releases support newer versions of the specification.
The OpenMP Architecture Review Board (ARB) provides instructions for a wide array of other compilers.
To compile these examples, see the Compiling section, above. Note that you will have to load the compiler module in order for the compiler command to work; for instructions, see the Software Modules documentation or the documentation for the system that you want to compile on.
- The programs omp_mm.c and omp_mm.f demonstrate matrix multiplication using OpenMP.
- The program omp_scope demonstrates the impact of various data scoping clauses in OpenMP.
- The C program openmp_vnorm calculates the norm of a vector using the Reduction() clause in OpenMP.
- Lawrence Livermore National Laboratory (LLNL) provides a handful of C and Fortran examples.