Huckleberry User Guide

Huckleberry is a high-performance computing system targeted at deep learning applications.  Huckleberry consists of two login nodes and fourteen IBM “Minsky” S822LC compute nodes. Each of the compute nodes is equipped with:

  • Two IBM Power8 CPUs (3.26 GHz) with 256 GB of memory
  • Four NVIDIA P100 GPUs with 16 GB of memory each
  • NVLink interfaces connecting CPU and GPU memory spaces
  • Mellanox EDR InfiniBand (100 Gb/s) interconnect
  • Ubuntu 16.04 OS

Understanding non-uniform memory access (NUMA) patterns is important to get the full benefit of the S822LC compute nodes on Huckleberry.  The memory bandwidth associated with data movement within each compute node is summarized in Figure 1.  Note that each Power8 CPU is coupled to two P100 GPUs through NVLink, which supports bi-directional data transfer rates of 80 GB/s.  The theoretical maximum memory bandwidth for each Power8 CPU is 115 GB/s.  The theoretical maximum memory bandwidth for each NVIDIA P100 GPU is 720 GB/s.

Figure 1. Theoretical memory bandwidth for data transfers within the IBM S822LC Compute node (image source: NVIDIA).
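
The CPU-to-GPU connectivity on a compute node can be inspected directly with nvidia-smi, which prints a topology matrix showing which GPUs share an NVLink connection with which CPU socket (this is a standard NVIDIA utility, not something specific to Huckleberry; run it from a job on a compute node):

nvidia-smi topo -m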

To access Huckleberry, users should log in with:

ssh huckleberry1.arc.vt.edu

Slurm has been installed on Huckleberry and supports the scheduling of both batch jobs and interactive jobs.

Basic Job Submission and Monitoring

The current configuration is very basic, but allows users to run jobs either through the batch scheduler or interactively. The following is a basic “hello world” job submission script requesting 500 GB of memory and all four Pascal P100 GPUs on a compute node.

NOTE: asking for -N 1 without specifying how many cores per node will default to only one core (equivalent to -n 1). If you would like to get the full node exclusively, you should ask for all of the cores on the node using the -n flag, or use the --exclusive flag, as illustrated below.
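
For example, adding one of the following lines to the script header (alongside -N 1) would claim the whole node; 160 is the number of processor cores that Slurm sees on each node, as described later in this guide:

#SBATCH -n 160        # ask for every core Slurm sees on the node
#SBATCH --exclusive   # or, alternatively, request exclusive access to the node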


#!/bin/bash
#SBATCH -J hello-world
#SBATCH -p normal_q
#SBATCH -N 1  # this will not assign the node exclusively. See the note above for details
#SBATCH -t 10:00
#SBATCH --mem=500G
#SBATCH --gres=gpu:pascal:4
echo "hello world"

To submit a job to the batch queue, Slurm provides the sbatch command (which is the analog of Torque’s qsub). Assuming that the above is copied into a file “hello.sh,” a job can be submitted to the scheduler using


mcclurej@hulogin1:~/Slurm$ sbatch hello.sh
Submitted batch job 5

To check on the status of jobs, use the squeue command,


mcclurej@hulogin1:~/Slurm$ squeue -u mcclurej
  JOBID PARTITION     NAME     USER ST    TIME NODES NODELIST(REASON)
      5     debug hello-wo mcclurej  R INVALID     1 hu001

Output from the job will be written to the file slurm-5.out.

To cancel a job, provide the jobid to the scancel command.
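
For example, to cancel the hello-world job submitted above (job ID 5),

scancel 5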

Slurm provides the srun command to launch parallel jobs. Typically this would replace mpirun for an MPI job.
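
As a sketch, an MPI executable (here a hypothetical binary named my_mpi_app) could be launched with 16 tasks from within a batch script or allocation using

srun -n 16 ./my_mpi_app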

For more comprehensive information, SchedMD has a handy Slurm command cheat sheet.

Interactive Jobs

To run a job interactively, a two-step process is required. First, one should request a reservation using salloc (e.g. one compute node for 10 minutes). If you would like exclusive access to the node, you should specify the number of cores or use the --exclusive flag as noted in the previous section.

salloc -N1 -t 10:00

Second, to start an interactive shell on the allocated node, pass --pty /bin/bash to srun,

srun --pty /bin/bash

Note that srun will not work unless you have first requested an allocation with salloc to reserve nodes for this purpose.
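
Putting the two steps together, a minimal interactive session (one compute node for 10 minutes, then a shell on that node) looks like

salloc -N1 -t 10:00
srun --pty /bin/bash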

Requesting Individual GPUs

In many cases jobs will require fewer than the four GPUs available on each Huckleberry compute node.  GPUs can be requested as a generic resource (GRES) through Slurm by requesting a specific number of processor cores and GPUs.  To request one processor core and one GPU in an interactive session with 8 GB of memory per processor core,

salloc -n1 -t 10:00 --mem-per-cpu=8G --gres=gpu:pascal:1

The example batch submission script shown below requests the equivalent resources for a batch job.


#!/bin/bash
#SBATCH -J gpu-alloc
#SBATCH -p normal_q
#SBATCH -n 1
#SBATCH -t 10:00
#SBATCH --mem-per-cpu=8G
#SBATCH --gres=gpu:pascal:1
echo "Allocated GPU with ID $CUDA_VISIBLE_DEVICES"

 

Slurm will set the $CUDA_VISIBLE_DEVICES environment variable automatically based on your request.  Multiple processor cores and/or GPUs can be requested in the same manner.  For example, to request two GPUs and 80 CPU cores,

salloc -n80 -t 10:00 --mem-per-cpu=4G --gres=gpu:pascal:2

The two Power8 CPUs on each node are viewed by Slurm as 160 processor cores, since each physical core presents eight hardware threads.

Software

Software modules are available on Huckleberry and function in the same manner as on other ARC systems; e.g. the following syntax will load the module for CUDA


module load cuda
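
To see which modules are available on Huckleberry or which are currently loaded, the usual module commands apply (these are generic environment-module commands, not specific to Huckleberry):

module avail
module list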

Additionally, IBM’s PowerAI deep learning software is installed under /opt/DL. For brief tutorials, please click on any of the package names below.

  • caffe: /opt/DL/caffe-ibm
  • tensorflow: /opt/DL/tensorflow
  • torch: /opt/DL/torch
  • theano: /opt/DL/theano
  • openblas: /opt/DL/openblas
  • digits: /opt/DL/digits
  • jupyter notebooks
  • bazel: /opt/DL/bazel
  • chainer: /opt/DL/chainer

For additional information, please refer to the PowerAI User Guide.
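
As an illustration, the PowerAI TensorFlow build is activated by sourcing its activation script, exactly as done in the Python example below; the other packages under /opt/DL follow a similar layout, though the precise script names may differ:

source /opt/DL/tensorflow/bin/tensorflow-activate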

Python

For users who would like to customize their Python environment, we provide online documentation for best practices to manage Python on ARC systems.

As an example, to locally install Keras with the TensorFlow backend on Huckleberry, the following approach can be used


# create a directory for locally installed Python packages and point PYTHONUSERBASE at it
mkdir -p $HOME/huckleberry/python
export PYTHONUSERBASE=$HOME/huckleberry/python

# load CUDA and activate the PowerAI TensorFlow installation
module load cuda
source /opt/DL/tensorflow/bin/tensorflow-activate

# install keras into the local directory without pulling in its dependencies
pip install --user --no-deps keras

This basic procedure should work for most Python packages that can be installed from the Python package manager (pip).
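
To confirm that the locally installed package is picked up (assuming PYTHONUSERBASE is exported and the TensorFlow environment has been activated in the same shell or job script), a quick check is

python -c "import keras; print(keras.__version__)"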