CUDA

Overview

GPU stands for Graphics Processing Unit. Like a CPU (Central Processing Unit), a GPU is a single-chip processor. However, GPUs are specialized primarily for real-time rendering and 3D computation, such as video animations, which run inefficiently on the CPU. The GPU accelerates applications running on the CPU by offloading some of the compute-intensive portions of the code, while the rest of the application continues to run on the CPU. The application runs faster because it exploits the massively parallel processing power of the GPU. To put this into perspective, an NVIDIA graphics card can recompute the color of every pixel on a 20-inch monitor many times per second. Because of the massive parallelism exposed by GPUs and their radically different architecture compared to CPUs, it is difficult to program GPUs for general-purpose computation.

Fortunately, NVIDIA GPUs are easily programmable using the CUDA extensions. CUDA stands for Compute Unified Device Architecture and is an extension of the C programming language. CUDA allows the programmer to take advantage of the massive parallel computing power of an NVIDIA graphics card to perform general-purpose computation.

Programming Model

The CUDA programming model has three abstractions:

  1. Kernels (GPU functions)
  2. Hierarchical thread groups
  3. Memory hierarchy

Kernels

CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads, as opposed to only once like regular C functions. We will look at how to go about writing a kernel later in the documentation.

Threads are organized into one-dimensional, two-dimensional, or three-dimensional thread blocks (referred to simply as blocks). Blocks, in turn, are organized into a one-dimensional, two-dimensional, or three-dimensional grid. This provides a natural way to invoke computation across the elements of a domain such as a vector, matrix, or volume. The figure below illustrates the thread hierarchy in the CUDA programming model. Each block within a grid and each thread within a block is identified using built-in variables. We will take a closer look at how to identify threads and blocks in a kernel later.

Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution as illustrated by the figure below. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory. There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages. The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
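
As a minimal sketch of these memory spaces (the kernel name reverseInBlock and the fixed block size of 128 threads are assumptions for illustration), the kernel below uses a per-thread local variable, a per-block shared array, and global memory:

    __global__ void reverseInBlock(float* data)
    {
        // Per-thread local variable (normally held in a register)
        int i = threadIdx.x;

        // Per-block shared memory: visible only to the threads of this block
        __shared__ float tile[128];

        tile[i] = data[blockIdx.x * blockDim.x + i];   // read from global memory
        __syncthreads();                               // wait until the whole block has loaded

        // Write this block's elements back to global memory in reverse order
        data[blockIdx.x * blockDim.x + i] = tile[blockDim.x - 1 - i];
    }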

These abstractions are exposed to the programmer through a set of language extensions via the CUDA programming environment. A CUDA program typically consists of one or more modules that are executed on either the host (CPU) or a device (GPU). The program supplies a single set of source file(s) containing both host and device code. The host code is C code and is compiled with the host's standard C compiler. The device code is written using C extended with keywords for labeling data-parallel functions (kernels) and their data structures. At compile time, nvcc separates host and device code, offloading compilation of the host code to the native compiler and compiling the device code into a format suitable for execution on the device (GPU).

NVIDIA provides two interfaces for writing CUDA programs: the CUDA runtime API and the CUDA driver API. Developers must choose which one to use for a particular application because their usage is mutually exclusive. We will only cover the CUDA runtime API in this documentation. Irrespective of the API chosen, the following steps need to be performed, at a basic level, to execute a program on the GPU (a minimal sketch of these steps follows the list):

  1. Initialize/acquire the device (GPU)
  2. Allocate memory on device (GPU)
  3. Copy data (input) from host (CPU) to device (GPU)
  4. Execute the kernel on device (GPU)
  5. Copy data (result) from device (GPU) to host (CPU)
  6. Deallocate memory on device (GPU)
  7. Release device (GPU)
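
Below is a minimal host-side sketch of these seven steps using the runtime API. The kernel name MyKernel, the element count N, and the block size of 128 are placeholders, and error checking is omitted for brevity:

    float *h_data = (float*)malloc(N * sizeof(float));    // host buffer (input/output)
    float *d_data;

    cudaSetDevice(0);                                     // 1. acquire the device
    cudaMalloc((void**)&d_data, N * sizeof(float));       // 2. allocate memory on the device
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);                   // 3. copy input from host to device
    MyKernel<<<(N + 127) / 128, 128>>>(d_data, N);        // 4. execute the kernel
    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);                   // 5. copy result from device to host
    cudaFree(d_data);                                     // 6. deallocate device memory
    cudaDeviceReset();                                    // 7. release the device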

CUDA Runtime API

CUDA is an extension to the C Programming Language. It adds function type qualifiers to specify execution on the host or device and variable type qualifiers to specify the memory location on the device.

Function Type Qualifiers

Qualifier    Description
__device__   Executed on the device and callable only from the device.
__global__   Executed on the device and callable only from the host.
__host__     Executed on the host and callable only from the host.
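
As an illustrative sketch of how these qualifiers are used together (the function names square, squareAll, and hostSetup are invented for this example):

    __device__ float square(float x)          // runs on the device, callable only from device code
    {
        return x * x;
    }

    __global__ void squareAll(float* data)    // kernel: runs on the device, launched from the host
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = square(data[i]);
    }

    __host__ void hostSetup(void)             // ordinary host function (the default if no qualifier is given)
    {
        // ... allocate device memory and launch squareAll<<<...>>>() here ...
    }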

Variable Type Qualifiers

__device__
    Resides in:       global memory space
    Accessible from:  all threads within the grid, and from the host
    Lifetime:         application

__constant__ (optionally used with __device__)
    Resides in:       constant memory space
    Accessible from:  all threads within the grid, and from the host
    Lifetime:         application

__shared__ (optionally used with __device__)
    Resides in:       shared memory space of a thread block
    Accessible from:  all threads within the block
    Lifetime:         block
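
The sketch below shows where declarations with these qualifiers typically appear (the names d_scale, d_coeffs, and applyCoeffs are hypothetical): __device__ and __constant__ variables are declared at file scope, while __shared__ variables are declared inside a kernel.

    __device__   float d_scale = 2.0f;   // global memory, lives for the whole application
    __constant__ float d_coeffs[16];     // constant memory, read-only in kernels; typically
                                         // filled from the host with cudaMemcpyToSymbol()

    __global__ void applyCoeffs(float* data)
    {
        __shared__ float tile[128];      // shared memory, one copy per block, block lifetime
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = data[i];
        __syncthreads();
        data[i] = tile[threadIdx.x] * d_scale + d_coeffs[0];
    }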

Built-in Variables

Variable    Description                       Type
gridDim     Dimensions of the grid            dim3
blockIdx    Block index within the grid       uint3
blockDim    Dimensions of the block           dim3
threadIdx   Thread index within the block     uint3
warpSize    Warp size in number of threads    int
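
For example, a kernel launched over a two-dimensional grid of two-dimensional blocks can combine these variables to compute a unique global index (the kernel name indexDemo and the width parameter are assumptions for this sketch):

    __global__ void indexDemo(int* out, int width)
    {
        // Global x/y coordinates of this thread within the whole grid
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;

        // Flatten to a unique index into a row-major 2D array of the given width
        int idx = y * width + x;
        out[idx] = idx;
    }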

Execution Configuration

Any call to a __global__ function must specify the execution configuration for that call. The execution configuration defines the dimensions of the grid and of the blocks that will be used to execute the function on the device. It is specified by inserting an expression of the form <<< Dg, Db, Ns, S >>> between the function name and the parenthesized argument list, where:

  • Dg specifies the dimension and size of the grid, such that Dg.x * Dg.y * Dg.z equals the number of blocks being launched
  • Db specifies the dimension and size of each block, such that Db.x * Db.y * Db.z equals the number of threads per block
  • Ns specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory. It defaults to 0.
  • S specifies the associated stream. S is an optional argument which defaults to 0.

As an example, a function declared as:

    __global__ void Func(float* parameter);

must be called like:

    Func<<< Dg, Db, Ns >>>(parameter);
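
A hedged illustration of a fuller launch, assuming a width x height domain and a kernel that uses dynamically allocated shared memory (width, height, and the chosen block size are placeholders, not part of the original example):

    dim3 Db(16, 16);                           // 256 threads per block
    dim3 Dg((width  + Db.x - 1) / Db.x,        // enough blocks to cover the whole domain
            (height + Db.y - 1) / Db.y);
    size_t Ns = Db.x * Db.y * sizeof(float);   // bytes of dynamic shared memory per block
    Func<<<Dg, Db, Ns>>>(parameter);

Inside the kernel, dynamically allocated shared memory of this kind is accessed through a declaration of the form extern __shared__ float buffer[];.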

Some example programs are provided in the Examples section, below.

Handling Multiple GPUs

On many systems, such as HokieSpeed, a single node houses more than one GPU. In this section, we will see simple examples of how to handle multiple devices.

Device Enumeration

A host system can have multiple devices. The number of GPUs attached to the host and their properties can be queried using cudaGetDeviceCount() and cudaGetDeviceProperties(), respectively. The following code sample shows how to enumerate these devices, query their properties, and determine the number of CUDA-enabled devices.

    int deviceCount;

    cudaGetDeviceCount(&deviceCount);
   
    int device;
    for (device = 0; device < deviceCount; ++device)
    {
      cudaDeviceProp deviceProp;
      cudaGetDeviceProperties(&deviceProp, device);
      printf("Device %d has compute capability %d.%d.\n", device, deviceProp.major, deviceProp.minor);
    }

Device Selection

A host thread can set the device it operates on at any time by calling cudaSetDevice(). Device memory allocations and kernel launches are made on the currently set device. If no call to cudaSetDevice() is made, the current device is device 0 (i.e., the default device is 0). The following code sample illustrates how setting the current device affects memory allocation and kernel execution.

    size_t size = 1024 * sizeof(float);
   
    // Set device 0 as current
    cudaSetDevice(0);
   
    // Allocate memory on device 0
    float* p0;
    cudaMalloc(&p0, size);
   
    // Launch kernel on device 0
    MyKernel<<<1000, 128>>>(p0);
   
    // Set device 1 as current
    cudaSetDevice(1);
   
    // Allocate memory on device 1
    float* p1;
    cudaMalloc(&p1, size);
   
    // Launch kernel on device 1
    MyKernel<<<1000, 128>>>(p1);

A sample CUDA/OpenMP program that uses multiple GPUs is provided in the Examples section, below.

Examples

Example 1: Basic CUDA

We will put together a simple example of vector addition (the complete program is here). The first step is to allocate memory on the GPU. This is done using cudaMalloc().

    float *devPtrA; 			// Vector A
    float *devPtrB; 			// Vector B
    float *devPtrC; 			// Result
    int memsize= SIZE * sizeof(float); 

    //Allocate Memory on the device	
    cudaMalloc((void**)&devPtrA, memsize); 
    cudaMalloc((void**)&devPtrB, memsize); 
    cudaMalloc((void**)&devPtrC, memsize);

The next step is to copy the data (input) from the host to the device. This is accomplished using cudaMemcpy(). In our case, we copy the host arrays A and B into the device arrays devPtrA and devPtrB.

    // Copy A from the host into devPtrA on the device
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    // Copy B from the host into devPtrB on the device
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

The next step is to perform the computation on the GPU, which is done by launching a kernel. We will discuss how to write GPU kernels below.

    //Launch the kernel 
    vecAdd<<<1, N>>>(devPtrA,  devPtrB, devPtrC);

The next step is to copy the data (result) from the device to the host. In our case, we copy the devPtrC array from the device into the host array C.

    // Copy devPtrC from the device into C on the host
    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);

The final step is to deallocate the memory on the GPU, which is accomplished using cudaFree().

    //Deallocate the memory on GPU
    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);

Now that we have seen how to run a program on the GPU, we will take a closer look at the kernel used for the computation. The code below shows a simple kernel:

    // Kernel definition  
    __global__  void vecAdd(float* A, float* B, float* C) 
    { 
          // threadIdx.x is a built-in variable  provided by CUDA at runtime 
          int i = threadIdx.x; 
          C[i] = A[i] + B[i]; 
    }

As you can see above, the kernel is written in C with a few additional extensions. First, there is the CUDA-specific keyword __global__ in front of the declaration of vecAdd(). This keyword indicates that the function is a kernel; when it is called from host code, it generates a grid of threads on the device. The second extension is the set of built-in variables such as threadIdx, blockIdx, and blockDim, which give a thread's index within its block, the block's index within the grid, and the dimensions of the block, respectively, inside the running kernel.
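
The vecAdd kernel above assumes the whole vector fits in a single block (it is launched with <<<1, N>>>). The following is a hedged sketch of a variant that uses blockIdx and blockDim to handle vectors larger than one block (the kernel name vecAddMulti and the block size of 256 are assumptions for illustration):

    // A variant of vecAdd that is not limited to a single block
    __global__ void vecAddMulti(float* A, float* B, float* C, int N)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N)                  // guard the last, partially filled block
            C[i] = A[i] + B[i];
    }

    // Launch enough 256-thread blocks to cover all N elements
    vecAddMulti<<<(N + 255) / 256, 256>>>(devPtrA, devPtrB, devPtrC, N);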

Example 2: CUDA + OpenMP

This example uses OpenMP and CUDA to perform computation on multiple GPU devices. See the Handling Multiple GPUs section for more information on working with multiple devices. We will also see how to compile the program later in this section. The following code snippet shows the device enumeration part of our program.

    // Variable which holds number of GPUs
    int num_gpus = 0;

    // Determine the number of CUDA capable GPUs
    cudaGetDeviceCount(&num_gpus);
    if (num_gpus < 1)
    {
        printf("No CUDA Capable GPU(s) Detected \n");
        return 1;
    }

    // Display the CPU and GPU processor specification
    printf("number of host CPUs:\t%d\n", omp_get_num_procs());
    printf("number of CUDA devices:\t%d\n", num_gpus);
    for (int i = 0; i < num_gpus; i++)
    {
        cudaDeviceProp dprop;
        cudaGetDeviceProperties(&dprop, i);
        printf("\t Device %d is a %s\n", i, dprop.name);
    }

The following code snippet shows how we assign GPUs to each of the threads. In our example, we will only use one thread per GPU.

    // Set the number of threads to the number of GPUs on the system
    omp_set_num_threads(num_gpus);

    #pragma omp parallel
    {
        unsigned int cpu_thread_id = omp_get_thread_num();
        unsigned int num_cpu_threads = omp_get_num_threads();

        // Assign and check the GPU device for each thread
        int gpu_id = -1;
        cudaSetDevice(cpu_thread_id % num_gpus);        
        cudaGetDevice(&gpu_id);

        // Pointer to device memory used by this CPU thread
        int *d_a = 0;

        // Pointer to this CPU thread's portion of the host array a
        int *sub_a = a + cpu_thread_id * n / num_cpu_threads;

        unsigned int nbytes_per_kernel = nbytes / num_cpu_threads;
        dim3 gpu_threads(128);  // 128 threads per block
        dim3 gpu_blocks(n / (gpu_threads.x * num_cpu_threads));

        //Allocate memory on the device
        cudaMalloc((void**)&d_a, nbytes_per_kernel);

        //Initialize the array on the device with zeros
        cudaMemset(d_a, 0, nbytes_per_kernel);

        //Copy data from host to device
        cudaMemcpy(d_a, sub_a, nbytes_per_kernel, cudaMemcpyHostToDevice);
      
        //Launch the kernel
        kernelAddConstant<<<gpu_blocks, gpu_threads>>>(d_a, b);

        //Copy the result  from the device to the host
        cudaMemcpy(sub_a, d_a, nbytes_per_kernel, cudaMemcpyDeviceToHost);
              
        //Deallocate the memory on the device
        cudaFree(d_a);
    }

We can use nvcc to compile programs that combine CUDA and OpenMP. A makefile similar to the one used in the basic example is sufficient. Notice the addition of -Xcompiler -fopenmp, which passes the -fopenmp flag through to the host compiler, and -lgomp, which links the GNU OpenMP runtime.

    # NVCC is the path to nvcc. Here it is assumed that /usr/local/cuda is on one's PATH.
    NVCC = nvcc

    NVCCFLAGS = -Xcompiler -fopenmp
    NVCCINC = -I$(CUDA_INC)
    LFLAGS = -L$(CUDA_LIB64) -lcuda -lcudart -lgomp

    # nvcc -Xcompiler -fopenmp -lgomp -o ompCuda ompCuda.cu
    ompCuda:
      $(NVCC) $(NVCCFLAGS) $(NVCCINC) $(LFLAGS) -o ompCuda ompCuda.cu

    clean:
      rm -f ompCuda

The program is compiled by using the make command and executed like any other executable:

    make
    ./ompCuda

Other Examples

  • CUDA MPI program:
    1. Requires modules for GCC, CUDA, and an MPI implementation (e.g. OpenMPI): module load gcc openmpi cuda. Note that you may need to purge the modules before this (module purge); alternatively, if the Intel compiler is loaded in place of GCC, you can replace Intel with GCC using the module swap command (module swap intel gcc).
    2. To compile use this makefile or this command line: nvcc -arch sm_13 -I$VT_MPI_INC -L$VT_MPI_LIB -lmpi -lcuda -lcudart -o run-cuda-mpi cuda-mpi.cu. ($VT_MPI_INC and $VT_MPI_LIB are environment variables that point to directories associated with the MPI module loaded.)
    3. In the qsub script, run with the following command: mpiexec -npernode 1 ./run-cuda-mpi (This would run with 1 process per node, which is ideal for CUDA code so that you don’t have more than one process trying to access the same GPU.)
  • CUDA OpenMP program:
    1. Requires modules for GCC and CUDA: module load gcc cuda. Note that you may need to purge the modules before this (module purge); alternatively, if the Intel compiler is loaded in place of GCC, you can replace Intel with GCC using the module swap command (module swap intel gcc).
    2. To compile use this makefile or this command line: nvcc -Xcompiler -fopenmp -lcuda -lcudart -lgomp -o run-cuda-omp cuda-omp.cu
    3. In the qsub script, run with the following command: ./run-cuda-omp
  • CUDA Matrix Multiplication program:
    1. Requires modules for GCC and CUDA: module load gcc cuda. Note that you may need to purge the modules before this (module purge); alternatively, if the Intel compiler is loaded in place of GCC, you can replace Intel with GCC using the module swap command (module swap intel gcc).
    2. To compile use this command line: nvcc -lcuda -lcudart -o run-cuda-matmul MatMul.cu
    3. In the qsub script, run with the following command: ./run-cuda-matmul