DragonsTooth

Overview

DragonsTooth is a 48-node system designed to support general batch HPC jobs. The table below lists the technical details of each DragonsTooth node. Nodes are connected to each other and to storage via 10 gigabit ethernet (10GbE), a communication channel with high bandwidth but higher latency than InfiniBand (IB). As a result, DragonsTooth is better suited to jobs that require less internode communication and/or less I/O interaction with non-local storage than NewRiver, which has similar nodes but a low-latency IB interconnect. To support I/O-intensive jobs, DragonsTooth nodes are each outfitted with nearly 2 TB of solid-state local disk. DragonsTooth was released to the Virginia Tech research community in August 2016.

In November 2018, DragonsTooth was reprovisioned with Slurm as its scheduler, replacing Moab/Torque.

Technical Specifications

CPU                     2 x Intel Xeon E5-2680v3 (Haswell), 2.5 GHz, 12-core
Memory                  256 GB 2133 MHz DDR4
Local Storage           4 x 480 GB SSD drives
Theoretical Peak (DP)   806 GFlop/s

Policies

Note: DragonsTooth is governed by an allocation manager, meaning that in order to run most jobs on it, you must be an authorized user of an allocation that has been submitted and approved. The open_q partition (queue) is available to jobs that are not charged to an allocation, but it has tight usage restrictions (see below for details) and so is best used for initial testing in preparing allocation requests. For more on allocations, click here.

As described above, communications between nodes and between a node and storage will have higher latency on DragonsTooth than on other ARC clusters. For this reason the queue structure is designed to allow more jobs and longer-running jobs than on other ARC clusters.

DragonsTooth has three partitions (queues):

  • normal_q for production (research) runs.
  • dev_q for short testing, debugging, and interactive sessions. dev_q provides slightly elevated job priority to facilitate code development and job testing prior to production runs.
  • open_q provides access for small jobs and for evaluating system features. open_q does not require an allocation; it can be used by new users or by researchers evaluating system performance in preparation for an allocation request.

The settings for the partitions are:

QUEUE             NORMAL_Q                 DEV_Q          OPEN_Q
Access to         dt003-dt048              dt003-dt048    dt003-dt048
Max Jobs          288 per user,            1 per user     1 per user
                  432 per allocation
Max Nodes         12 per user,             12 per user    4 per user
                  18 per allocation
Max Core-Hours*   34,560 per user,         96 per user    192 per user
                  51,840 per allocation
Max Walltime      30 days                  2 hr           4 hr

Other notes:

  • Shared node access: more than one job can run on a node.

* A user cannot, at any one time, have more than this many core-hours allocated across all of their running jobs, so you can run long jobs or large/many jobs, but not both. For example, with 24 cores per node, a user running 144-hour jobs in normal_q can hold at most 34,560 / (24 x 144) = 10 nodes at once. For illustration, the following table lists how many nodes a user can allocate for a given walltime:

Walltime           Max Nodes (per user)   Max Nodes (per allocation)
72 hr (3 days)     12                     18
144 hr (6 days)    10                     15
360 hr (15 days)   4                      6
720 hr (30 days)   2                      3

Software

For a list of software available on DragonsTooth, as well as a comparison of software available on all ARC systems, click here.

Note that a user will have to load the appropriate module(s) in order to use a given software package on the cluster. The module avail and module spider commands can also be used to find software packages available on a given system.
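
For example, one common workflow is to search for a package and then load it before use; the package name below (gcc) is only illustrative and may not match what is actually installed on DragonsTooth:

#List the modules available on the current system
module avail

#Search all module trees for a package (the name gcc is only an example)
module spider gcc

#Load a package before using it (loads the default version if none is given)
module load gcc

#Show the modules currently loaded in your session
module list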

Usage

The cluster is accessed via ssh to one of the login nodes below. Log in using your username (usually Virginia Tech PID) and password. You will need an SSH Client to log in; see here for information on how to obtain and use an SSH Client.

  • dragonstooth1.arc.vt.edu
  • dragonstooth2.arc.vt.edu
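
For example, from a terminal on a Linux or Mac system, a login to the first node listed above would look like the following, with yourPID replaced by your own username:

ssh yourPID@dragonstooth1.arc.vt.edu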

Job Submission

Access to all compute nodes is controlled via the Slurm job scheduler. See the Slurm Job Submission page here. The basic flags are:

#SBATCH -p normal_q (or other partition, see Policies)
#SBATCH -A <yourAllocation> (see Policies)
#SBATCH -t dd-hh:mm:ss
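
Putting these flags together, a minimal single-core job script might look like the sketch below; the partition, allocation, walltime, module, and program names are placeholders to be replaced with your own:

#!/bin/bash
#SBATCH -p normal_q
#SBATCH -A <yourAllocation>
#SBATCH -t 1-00:00:00
#SBATCH --ntasks=1

#Load any modules the program needs (name is a placeholder)
module load <yourModule>

#Run the program
./your_program

Submit the script with sbatch; Slurm prints the ID of the newly queued job:

sbatch your_job_script.sh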

The DragonsTooth cluster formerly ran a different scheduler (Moab/Torque) that accepted #PBS-style directives. ARC implemented configurations during the transition to Slurm so that most of these directives and commands continue to work without modification.
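
For instance, a legacy script whose header carried standard Torque-style directives such as the ones below (the values shown are placeholders, not a recommendation) is the kind of script these compatibility settings are meant to accommodate:

#PBS -q normal_q
#PBS -l nodes=2:ppn=24
#PBS -l walltime=24:00:00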

Shared Node

DragonsTooth compute nodes can be shared by multiple jobs. Resources can be requested by specifying the number of nodes, processes per node (ppn), cores, memory, etc. See example resource requests below:

#Request exclusive access to all resources on 2 nodes 
#SBATCH --nodes=2 
#SBATCH --exclusive

#Request 4 cores (on any number of nodes)
#SBATCH --ntasks=4

#Request 2 nodes with 12 tasks running on each
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=12

#Request 12 tasks with 20 GB of memory per core
#SBATCH --ntasks=12 
#SBATCH --mem-per-cpu=20G

#Request 5 nodes and spread 50 tasks evenly across them
#SBATCH --nodes=5
#SBATCH --ntasks=50
#SBATCH --spread-job
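
One pattern not shown above is a single multithreaded (e.g., OpenMP) process that needs several cores on one node; the core count here is only an example:

#Request 1 task with 8 cores on a single node for a multithreaded program
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8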

Finding Information

Check status of a job after submission:

squeue

Get detailed information about a running job:

scontrol show job <job-number>

Check status of the cluster’s nodes and partitions:

sinfo
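
These commands accept the usual Slurm options for narrowing their output. For example, to show only your own jobs (replace yourPID with your username):

squeue -u yourPID

To show only the nodes in a particular partition:

sinfo -p normal_q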

Examples

This shell script provides a template for submission of jobs on DragonsTooth. The comments in the script include notes about how to request resources, load modules, submit MPI jobs, etc.

To use this script template, create your own copy and edit it as described here.
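
The linked template is the authoritative starting point. As a rough sketch of the kind of MPI job it covers, a script might resemble the following, where the allocation, walltime, modules, and executable are all placeholders and the launch command may differ depending on the MPI stack in use:

#!/bin/bash
#SBATCH -p normal_q
#SBATCH -A <yourAllocation>
#SBATCH -t 1-00:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=24

#Load the compiler and MPI modules used to build the program (names are placeholders)
module load <yourCompiler> <yourMPI>

#Launch one MPI process per allocated task
srun ./your_mpi_program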