Blueridge Decommission

The Blueridge cluster has served as the largest-scale HPC cluster at Virginia Tech since its release in 2013. The university is now purchasing a new cluster to replace Blueridge, which is being decommissioned. This page provides both a timeline for the decommissioning process and technical resources for migrating data and workflows off the cluster and possibly onto other HPC resources.

Timeline for decommission

Blueridge has been a highly productive resource for researchers and students. The Fall 2019 semester is the final semester during which the cluster will be available. If you still have data or computational workflows which need to run on Blueridge, use this semester to complete all work and move data to other locations. You can find recommendations on how to proceed below.

Here is the scheduled timeline:

  • August 1 – End of updates.
  • September 1 – No new allocations or renewals.
  • December 31 – End of compute. All jobs must end on or before this date.
  • January 2020 – Decommission of /lustre storage.
  • Spring 2020 – New (yet to be named) large-scale compute cluster arrives.

Migrating workloads

Until the pending large-scale compute cluster becomes available in Spring 2020, the Newriver, Cascades, and Dragonstooth clusters are viable options for moderate-scale compute workloads. Newriver is a PBS-managed cluster and provides similar interactivity to Blueridge. Cascades and Dragonstooth are newer Slurm clusters. This webpage provides some information about migrating from PBS to Slurm on ARC clusters.

Migrating data off of Blueridge (/lustre only)

Filesystem                        Size  Used Avail Use% Mounted on
...@o2ib0:/lustre                 785T  565T  181T  76% /lustre
clproto-ha....:/gpfs/work         2.1P  1.3P  799T  62% /work
clproto...    :/gpfs/home        1002T  591T  411T  59% /groups
vt-archiv.... :/gpfs/archive/arc  420T  350T   71T  84% /vtarchive
qumulo.arc.internal:/home         281T  177T  104T  63% /home

The output above shows the five notable user-facing filesystems available on Blueridge. Of these, only the Lustre filesystem is affected by the decommission, because it is the only storage location unique to Blueridge. The other four (/work, /groups, /vtarchive, and /home) are available on all other ARC clusters, and data stored on them will not be affected.

In short, if you have data stored on /lustre which you need to keep, you must migrate it somewhere else. Please be selective about the files you decide to migrate and avoid keeping duplicate copies of data.

Environment variables are set on Blueridge as convenient shortcuts for storage paths, but they may cause some confusion. In particular, $WORK is set differently on each cluster. Use these tools to confirm that the full paths to your data are what you expect:

  • env (prints all environment variables which are currently set)
  • pwd (prints the current working directory)
[mypid@brlogin2 ~]$ env | grep work

[mypid@brlogin2 ~]$ cd $WORK
[mypid@brlogin2 mypid]$ pwd

On other ARC clusters, $WORK points to the GPFS location mounted at /work, but on Blueridge it refers to the /lustre location. It is always safest to use explicit, full paths.
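As a rough illustration of checking where a path really lives, the sketch below classifies a full path by its leading mount point. The function name storage_area is hypothetical, and the prefixes are simply the mount points from the df output above; it only compares strings and does not query the filesystem.

```shell
# Hypothetical helper: report which storage area a full path belongs to,
# judging only by its prefix (the mount points listed in the df output).
storage_area() {
    case "$1" in
        /lustre|/lustre/*)       echo "lustre" ;;     # unique to Blueridge
        /work|/work/*)           echo "work" ;;
        /groups|/groups/*)       echo "groups" ;;
        /vtarchive|/vtarchive/*) echo "vtarchive" ;;
        /home|/home/*)           echo "home" ;;
        *)                       echo "other" ;;
    esac
}

# Example use on a cluster (pwd -P prints the physical path, resolving symlinks):
#   storage_area "$(cd "$WORK" && pwd -P)"
```

If the result for $WORK on Blueridge is lustre, the data in that directory is part of what needs to be migrated.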

Data migration tips

Assess what data you need to move. Data in /lustre is the primary concern. Small data sets (fewer than 1000 files and less than 10GB total size) should be quick and easy to move, but larger data sets may take some planning (tar, compress, run in background). Your usage information is presented at login when you connect to the system. For example:

data usage:
USER    FILESYS/SET   DATA (GiB)   QUOTA (GiB)   FILES   QUOTA     NOTE
mypid   /home         467.7        640           -       -
mypid   /lustre       8.0          14336         28      3000000

You may want to move all the files you want to keep into a single directory for exporting. You can use du -sh /lustre/work/mypid/exportdir to compute the size of a particular directory and all its contents.
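To check an export directory against the "small data set" guideline above, a sketch along these lines may help. The function name dataset_summary is hypothetical, and it treats 10GB as 10 GiB, which is an assumption rather than something the guideline states.

```shell
# Hypothetical helper: count regular files and total size (KiB) under a
# directory, then label it "small" or "large" using the guideline of
# fewer than 1000 files and less than 10 GiB (assumed meaning of 10GB).
dataset_summary() {
    local dir="$1" files size_kib
    files=$(find "$dir" -type f | wc -l | tr -d ' ')
    size_kib=$(du -sk "$dir" | awk '{print $1}')
    if [ "$files" -lt 1000 ] && [ "$size_kib" -lt $((10 * 1024 * 1024)) ]; then
        echo "small files=$files size_kib=$size_kib"
    else
        echo "large files=$files size_kib=$size_kib"
    fi
}

# Example use on the cluster:
#   dataset_summary /lustre/work/mypid/exportdir
```

A result starting with large suggests planning the transfer (tar, compress, run in a job) rather than copying the files directly.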

Reduce consumption by archiving and/or purging data which is no longer needed before initiating a transfer. This can make a migration easier and faster.

If you have a large dataset to move, it will be helpful to package it into one or more tarballs and to compress them. Where possible, separate datasets into chunks which compress to manageable sizes between 1GB and 1TB. Compression rates vary widely with the types of data. Image/video data is often already stored in a well compressed format and may not compress further, while text files often compress by 90% or more.

  • !! DO tar and DO compress data destined for /vtarchive

Example of creating a compressed tarball and some optional flags:

tar --create --xz --file=exportdir.tar.xz /lustre/work/mypid/exportdir

More options:
  --bzip2 or --gzip   may also be used for compression
  --sparse            handle sparse files efficiently
  --remove-files      remove files after adding them to the archive
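For a dataset that is already organized into subdirectories, one way to produce several manageable tarballs is to archive each subdirectory separately. The following is a minimal sketch under that assumption; the function name tar_each_subdir and its arguments are hypothetical, and the resulting chunk sizes depend entirely on how the data is laid out.

```shell
# Hypothetical helper: create one compressed tarball per immediate
# subdirectory of SRC_DIR, writing the archives into OUT_DIR.
# Usage: tar_each_subdir SRC_DIR OUT_DIR [--xz|--gzip|--bzip2]
tar_each_subdir() {
    local src="$1" out="$2" flag="${3:---xz}"
    local ext sub name
    case "$flag" in
        --xz)    ext="tar.xz" ;;
        --gzip)  ext="tar.gz" ;;
        --bzip2) ext="tar.bz2" ;;
        *) echo "unsupported compression flag: $flag" >&2; return 1 ;;
    esac
    mkdir -p "$out" || return 1
    for sub in "$src"/*/; do
        [ -d "$sub" ] || continue
        name=$(basename "$sub")
        # -C makes the stored paths relative to SRC_DIR rather than absolute
        tar --create "$flag" --file="$out/$name.$ext" -C "$src" "$name" || return 1
        # Listing the archive is a cheap check that it can be read back
        tar --list --file="$out/$name.$ext" > /dev/null || return 1
    done
}

# Example use on the cluster (paths are placeholders):
#   tar_each_subdir /lustre/work/mypid/exportdir /lustre/work/mypid/tarballs
```

The sketch deliberately leaves the source files in place, so you can confirm the archives are complete before removing anything.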

If you’re processing several datasets or chunks at once, please perform the actions in the context of a job running on a compute node so that login node resources remain available for others.
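As one possible way to do this non-interactively, the sketch below writes a small PBS batch script that builds a tarball on a compute node. The walltime, allocation (MyAlloc), and queue (normal_q) are copied from the interact example elsewhere on this page; the nodes=1:ppn=1 resource request and the file name tar_export.pbs are assumptions, so adjust them to match your allocation and the scheduler's accepted syntax.

```shell
# Write a hypothetical PBS job script into the current directory.
cat > tar_export.pbs <<'EOF'
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l walltime=6:00:00
#PBS -q normal_q
#PBS -A MyAlloc

# Placeholder paths: replace mypid and exportdir with your own.
cd /lustre/work/mypid || exit 1
tar --create --xz --file=exportdir.tar.xz exportdir
EOF

# Submit from a login node once the paths and allocation are correct:
#   qsub tar_export.pbs
```

Keep in mind that /vtarchive is not mounted on the compute nodes, so a job like this should write its tarball to a filesystem that is.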

Choose the destination(s). Make sure that the data you want to move will not exceed the space available (df -h) or any quota restrictions on the destination filesystem. Consider timing (how long might it take; is now a good time to start) and where to execute the transfer from (push vs. pull, login node vs. compute node) and then initiate it when you’re ready.
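The free-space check can be sketched as a comparison of du and df output. The function name fits_in_destination is hypothetical, and the check only considers free space on the destination filesystem; it knows nothing about per-user quotas, which you still need to check separately.

```shell
# Hypothetical helper: succeed (exit status 0) if the data under SRC would
# fit in the free space of the filesystem that holds DEST.
fits_in_destination() {
    local src="$1" dest="$2" need_kib avail_kib
    need_kib=$(du -sk "$src" | awk '{print $1}')
    # -P keeps each filesystem on one line, so field 4 is the available space
    avail_kib=$(df -Pk "$dest" | awk 'NR==2 {print $4}')
    echo "need=${need_kib}KiB available=${avail_kib}KiB"
    [ "$need_kib" -lt "$avail_kib" ]
}

# Example use on the cluster (paths are placeholders):
#   fits_in_destination /lustre/work/mypid/exportdir /work/mypid && echo "enough space"
```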

Some typical destination options:

  1. Download to local/personal computer. Tools – interact, screen, scp (Linux/Mac), WinSCP, FileZilla (Windows), “&”
  2. Migrate to a different ARC filesystem. Keep quotas and data lifecycle in mind. Tools (command line) – interact, screen, run in background “&”, tar, cp, mv, rsync
  3. Archive data to /vtarchive. Tools – interact, screen, “&”, tar, cp, mv, rsync. Files sent to the /vtarchive tape system should be large (1GB-1TB) and already compressed for best results. Large quantities of small (<100MB) files are inappropriate for the tape library and can severely hamper its performance and reliability.
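Before sending a directory to /vtarchive, it may be worth checking how many small files it contains. The sketch below counts regular files under 100 MiB, which is an assumed reading of the "<100MB" guideline above; the function name count_small_files is hypothetical.

```shell
# Hypothetical helper: count regular files under DIR that are smaller than
# 100 MiB (104857600 bytes). The "c" suffix tells find to compare in bytes.
count_small_files() {
    find "$1" -type f -size -104857600c | wc -l | tr -d ' '
}

# Example use on the cluster (path is a placeholder):
#   count_small_files /lustre/work/mypid/exportdir
```

A large count suggests bundling those files into tarballs first rather than copying them to the tape system individually.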

Notes regarding these tools:

  • interact – this command can be used to start a job on Blueridge and provide you with an interactive shell on a compute node. This may be useful for running tar, compression, and even file transfers without risking overloading the login nodes. The default duration of an interact job is one hour, but this can be increased by passing some options as shown below. One caveat is that /vtarchive is not mounted on the compute nodes.
interact -l walltime=6:00:00 -A MyAlloc -q normal_q
  • scp (secure copy) – any filesystem on any host you can ssh to is a valid source/destination for scp. The standard format is scp sourcefile destination. When specifying a remote host, use a colon and then specify the full path of the file to copy. Example:
Personal-Mac:~ mypid$ scp ./
damdir.tar                                    100%   21MB  30.1MB/s   00:00
Personal-Mac:~ mypid$
  • screen – this command can be used to essentially disconnect long-running tasks from the current shell so that you can log off from the cluster without forcing the task to quit.
    • Use “screen” to start a screen session.
    • Use “Ctrl+a,d” to disconnect from a running screen session (it will continue running on its own)
    • Use “screen -r” to resume a screen session
    • Use “exit” to terminate a running screen session.
#Start a screen session
[mypid@brlogin2 ~]$ screen

#Start a task which will take a while to complete
[mypid@brlogin2 ~]$ for ii in {1..10}; do echo $ii; sleep 10; done

# entered Ctrl+A,D to detach

[mypid@brlogin2 ~]$ 

#Check back in on the running screen session a while later
[mypid@brlogin2 ~]$ screen -r

[mypid@brlogin2 ~]$ for ii in {1..10}; do echo $ii; sleep 10; done
#Still not done, so use Ctrl+A,D to detach again
[mypid@brlogin2 ~]$

#Log off from the system
[mypid@brlogin2 ~]$ exit
Connection to closed.

#Log back in a while later
Personal-Mac:~ mypid$ ssh

#Resume the screen session which is still running
[mypid@brlogin2 ~]$ screen -r

[mypid@brlogin2 ~]$ for ii in {1..10}; do echo $ii; sleep 10; done
[mypid@brlogin2 ~]$ 
#Long-running process has completed, so exit to terminate the screen session
[mypid@brlogin2 ~]$ exit

[screen is terminating]
[mypid@brlogin2 ~]$
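
Whichever transfer tool you use, it may be worth confirming that the copies are intact before deleting anything from /lustre. The following is a minimal sketch using SHA-256 checksums; the function names and the manifest file name SHA256SUMS are hypothetical, and it assumes the sha256sum command is available (on a Mac, shasum -a 256 plays a similar role).

```shell
# Hypothetical helper: record SHA-256 checksums for every tarball in DIR
# in a manifest file named SHA256SUMS inside that directory.
make_manifest() {
    ( cd "$1" && sha256sum -- *.tar* > SHA256SUMS )
}

# Hypothetical helper: re-check the files in DIR against DIR/SHA256SUMS.
# A non-zero exit status means at least one file is missing or differs.
verify_manifest() {
    ( cd "$1" && sha256sum -c SHA256SUMS > /dev/null )
}

# Example use (paths are placeholders):
#   make_manifest /lustre/work/mypid/tarballs      # before the transfer
#   verify_manifest /work/mypid/tarballs           # after copying tarballs and SHA256SUMS
```

Copy the SHA256SUMS file along with the tarballs so that the check can be run at the destination.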