ARC TA – Hailey Larose/David Haak

Welcome message

Welcome!

Oftentimes in bioinformatics, we need to run software and programs that our normal computers or laptops are unable to handle. When this situation arises, the ARC super-computing environment helps alleviate some of this burden.

This FAQ covers the bioinformatics programs that have been pre-installed and are ready for use on the ARC servers. These programs exist in the form of modules, which can be loaded into the environment through a submission script without having to install and compile the software yourself. This saves you time, and saves hard drive space by preventing copies of the same software from being installed across multiple accounts.

The goal of these FAQs is to provide new users with a quick, easy-to-understand reference for using common bioinformatics software in the ARC environment. Each tutorial is broken into four parts: 1. general information about the program, 2. required input files, 3. commands, 4. output files and an example submission script. For a new user beginning bioinformatics, reading a manual can be overwhelming and critical information can be missed. It is my hope that these short tutorials can offer some key information and serve as a reference; however, they are not a replacement for reading the software manuals. When using a particular piece of software, please always consult its manual for full details of all parameters and descriptions of the algorithms used.

Some basics:

1. Before proceeding it is important that you have an understanding of the different types of servers ARC offers. Please read the following links to familiarize yourself with the four ARC servers we will cover in this FAQ:

1. {HokieOne}

2. {HokieSpeed}

3. {BlueRidge}

4. {NewRiver}

2. You should also familiarize yourself with the concept of {software modules}. This system is utilized across all ARC servers and we will reference it throughout this FAQ.

3. And if you haven’t already, please use the following {link} to create an account, requesting access to HokieOne and HokieSpeed. BlueRidge and NewRiver are strongly recommended, but will require an {allocation approval} before access is granted. You will need your PID and password.

4. To log into each server you will use the ssh command followed by your username and the server address, as such:

ssh username@hokieone.arc.vt.edu

ssh username@hokiespeed1.arc.vt.edu

ssh username@blueridge1.arc.vt.edu

ssh username@newriver1.arc.vt.edu

The programs that you will use for your analysis will depend highly on the type of data you are working with. This FAQ is organized by data type to better explain common data analysis pipelines that ARC can assist you with. Below are pipelines used for genomic and transcriptome data. By clicking on a particular cell of the diagram, you will be redirected to the FAQ for that software. Alternatively, you can consult the list at the end of this page for direct links to all software covered in this FAQ.

Transcriptome data processing 

Transcriptomes offer a snapshot of a species’ gene expression profile under a given condition or at a given time. RNA-sequencing has uses across many fields, can answer numerous questions, and, thanks to modern advances, is relatively cheap to obtain and straightforward to process.

When you first receive your transcriptome data, it will most likely be in a raw, unprocessed format from the sequencing facility, potentially containing poor-quality bases, poor-quality reads and sequencing adaptors. This data is in .fastq format and will need to be processed, or trimmed. To assess the initial quality of your raw transcriptome data, a quality-assessment program needs to be run, and ARC has installed {FastQC} to serve this purpose. After using FastQC to get an initial idea of your data’s quality, a trimming program is run to correct any issues. Many programs exist for this purpose, and ARC has installed {Trimmomatic} as a module for users’ convenience. After running your trimming program, re-check the quality to make sure the initially identified issues are resolved.

Once your data has been pre-processed, you can move on to assembly. The type of assembly you’ll perform, and the software you’ll use to do it, depend highly on the available resources for the species you are working with as well as the type of sequencing performed. ARC offers resources for both de-novo assembly (assembly in the absence of a genome) and reference-guided assembly (assembly using a genome as a reference). This software includes {Trinity} for de-novo or reference-guided assembly, and {Cufflinks} for reference-guided assembly. Both offer their own specific downstream pipelines for transcriptome analysis.

After you have an assembled transcriptome, you will want to annotate it against a database useful for your species. This could be an annotated reference genome, a compilation of genomes from related species or taxa, or a publicly available comprehensive database such as nr or UniProt. ARC has installed the module {ncbi-blast+} for this purpose.

Once your transcriptome is assembled and annotated, you can begin to explore your project-specific questions. ARC offers various modules to assist in some commonly explored post-assembly analyses, such as the {Cufflinks suite}, {Hmmer}, and R, which can load Bioconductor.

Each of these programs relies on various dependencies. Many of these dependencies are pre-loaded on ARC as modules and include aligners such as {Bowtie}, {Bowtie2}, {BWA}, and {Diamond}, as well as utilities such as {MySQL}, {Picard}, {Bamtools}, {BCFtools}, {Samtools}, and {Bedtools}. The required dependencies for each piece of software are listed in the individual software FAQs.

Genomic data processing

All software covered in this FAQ:

——— transcriptomics and aligners:

Trinity

Bowtie

Bowtie2

BWA 

Diamond 

FastQC

Trimmomatic

TopHat

Cufflinks

ncbi-blast+

——— genomics:

ABySS

GATK

Hmmer

 

ABySS (Assembly By Short Sequences)

ABySS is a de novo, parallel genome assembler designed for short paired-end reads. The single-processor version is useful for assembling genomes up to 100 Mbases in size. The parallel version is implemented using MPI and is capable of assembling larger genomes.

Where to run:

ABySS is loaded on NewRiver, BlueRidge and HokieSpeed. This software can perform parallel processing with or without MPI configuration. This means you can run it on multiple cores on a single machine, or use multiple machines (nodes, in this case) for assembly. To run ABySS on the cluster with multiple nodes, the openmpi module is used.

Input files:

ABySS runs on paired-end read files, with or without unpaired single-end read files. The input files must be in one of the following formats: FASTA, FASTQ, qseq, export, SRA, SAM or BAM, and may be compressed with gz, bz2 or xz, and may be tarred.

The input reads need to have the suffixes /1 and /2 (name1 name2 or name_1 name_2) to identify the forward and reverse reads. Reads without mates are specified with the -se parameter. Reads from multiple libraries can be run at once to assemble one de novo genome.

Commands:

To run ABySS, the following additional ARC modules will need to be loaded in your environment or script:

1. Python: module load python/2.7.10

2. Boost: module load boost/1.58.0

3. Open MPI: module load openmpi/1.8.5

4. Sparsehash: module load sparsehash/2.0.2

Basic usage of ABySS to generate a de novo assembly of a single paired-end library is:

abyss-pe name=Arabidopsis k=64 in='forward1.fa reverse2.fa'

The required arguments include:

abyss-pe : invokes the ABySS paired-end assembly Makefile script.

name : the name of the assembly; the results are stored in $name-scaffolds.fa

k : size of the k-mer (consult the ABySS manual for tips on how to derive this)

in : input files when assembling data from a single paired-end library.

To see a more detailed list of all parameters and options for running ABySS, consult the developer’s online {manual – https://github.com/bcgsc/abyss#abyss}, or run the following command from any directory:

less /opt/apps/gcc5_2/openmpi1_8/abyss/1.5.2/share/man/man1/abyss-pe.1
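Putting this together, below is a minimal submission script sketch for a multi-node MPI run of ABySS. The abyss/1.5.2 module name is inferred from the installation path above; the node counts, allocation name, group list, and input file names are placeholders to adapt to your data and server:

#!/bin/bash

#PBS -l nodes=2:ppn=16

#PBS -l walltime=48:00:00

#PBS -q normal_q

#PBS -A allocation

#PBS -W group_list=blueridge

# Dependencies listed above, plus the ABySS module itself

module load python/2.7.10

module load boost/1.58.0

module load openmpi/1.8.5

module load sparsehash/2.0.2

module load abyss/1.5.2

cd $PBS_O_WORKDIR

# np sets the number of MPI processes (here 2 nodes x 16 cores = 32)

abyss-pe np=32 name=Arabidopsis k=64 in='forward1.fa reverse2.fa'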

Output files:

The assembled contigs will be stored in the $name-contigs.fa file generated in the submission directory, where $name is what was specified on the command line with the name parameter. Contigs are in fasta format.

Cufflinks:

Cufflinks is both the name of a suite of tools and a program within that suite. Cufflinks the program assembles transcriptomes from RNA-seq data using a reference genome, and quantifies expression values for differential expression analysis. Cufflinks the suite of tools contains additional programs that can be used for different analyses for RNA-Seq experiments. This tutorial will cover Cufflinks the program and how to use it for genome-guided RNA-Seq assembly.

Where to run: 

Cufflinks is installed on NewRiver, BlueRidge and HokieSpeed. The best server on which to run this software will depend on how quickly you want your files to run, the server load, your input file sizes, and whether you are running the entire Cufflinks suite or just parts of it. An example script is provided at the end of this section for running on NewRiver.

Input files: 

Cufflinks requires a file of RNA-Seq read alignments in SAM or BAM format containing appropriate custom tags for individual alignments. The software TopHat generates this output file and is recommended for use with Cufflinks, as it generates potential alternative splicing alignments. Cufflinks will accept SAM alignments generated by any read mapper, however. The SAM alignment file must be sorted by reference position. The TopHat output file will already have this step completed, but the module {samtools} can be used to do this on alignment files generated by other mappers.

Commands: 

To run Cufflinks, the following additional ARC modules will need to be loaded in your environment or script:

1. python/2.7.10

2. boost/1.58.0

3. samtools/1.2

4. bowtie2/2.2.5 (default, however it can use bowtie)

Basic usage of Cufflinks is:

cufflinks [options] <aligned_reads.(sam/bam)>

The only required input file is an aligned_reads.(sam/bam) generated by TopHat or another mapping program. Cufflinks can run with or without a reference annotation GTF file. To see all options available for running Cufflinks, type:

module load python/2.7.10 boost/1.58.0 samtools/1.2 cufflinks/2.2.1

cufflinks --help

Output files:

Cufflinks produces three output files:

1. transcripts.gtf : a GTF file containing Cufflinks’ assembled isoforms.

2. isoforms.fpkm_tracking : file containing the isoform expression values in FPKM Tracking Format.

3. genes.fpkm_tracking : file containing the gene expression values in FPKM Tracking Format

The isoforms.fpkm_tracking and genes.fpkm_tracking files can be used for differential expression analysis, either directly through the {Cufflinks suite’s} cuffdiff, or modified for other differential expression programs such as DESeq.
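As noted in the ‘Where to run’ section, below is a minimal NewRiver submission script sketch for Cufflinks. The resource requests, allocation name, group list, and accepted_hits.bam input file are placeholders to adapt to your own data:

#!/bin/bash

#PBS -l nodes=1:ppn=8

#PBS -l walltime=24:00:00

#PBS -q normal_q

#PBS -A allocation

#PBS -W group_list=newriver

module load python/2.7.10

module load boost/1.58.0

module load samtools/1.2

module load bowtie2/2.2.5

module load cufflinks/2.2.1

cd $PBS_O_WORKDIR

# -p sets the number of threads; -o names the output directory

cufflinks -p 8 -o cufflinks_output accepted_hits.bam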

FastQC:

Before beginning to process raw data, it is important to perform a quality check (QC) of the reads to ensure that the raw data looks all right and that there are no major problems or biases in your data. FastQC provides a QC report, in a visual format, of problems that may originate either from the sequencer or in the starting library material.

There are two ways to run FastQC: a non-interactive command-line mode or a stand-alone graphical application. We will discuss how to use FastQC on the ARC servers using the command-line mode. Being able to run this program on the command line is beneficial for integrating it into a larger analysis pipeline, allowing for the high-throughput processing of a large number of files.

Where to run:

FastQC is available on NewRiver, BlueRidge and HokieSpeed. The best server on which to run this software will depend on how quickly you want your files to run, the server load, your input file sizes, and whether you are running it as part of a larger analysis pipeline. Generally, when running it as stand-alone software, all three servers will be acceptable as this program requires limited memory resources. An example script is provided at the end of this section for running on NewRiver.

Input files:

FastQC supports files in the following formats:

  • FastQ (all quality encoding variants)
  • Casava FastQ files
  • Colorspace FastQ
  • GZip compressed FastQ
  • SAM
  • BAM
  • SAM/BAM Mapped only (normally used for colorspace data)

Commands:

To load FastQC into the environment first type:

module load fastqc/0.11.3

To run non-interactively you simply have to specify a list of files to process on the command line:

fastqc [options] file1.txt file2.txt … fileN.txt

You can specify as many files to process in a single run as you like. There are a few extra options you can specify when running non-interactively.  Full details of these can be found by running:

module load fastqc/0.11.3

fastqc --help

Output files:

If you want to save your reports in a folder other than the folder which contained your original FastQ files then you can specify an alternative location by setting the output directory:

-o /some/other/dir/

The output report will be in two formats: a .html file and a .zip file. The .html file gives a visual output of the results, opening in a web browser. The .zip file contains the results in text format as well as retaining the graphical format. The .zip file can be inflated into a directory, and files contained within that directory can be integrated into a larger analysis pipeline (for example: examining the summary report to evaluate whether the data set passes or fails certain quality checks before the analysis pipeline proceeds).
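As noted in the ‘Where to run’ section, below is a minimal NewRiver submission script sketch for FastQC. The file names, allocation name, group list, and resource requests are placeholders:

#!/bin/bash

#PBS -l nodes=1:ppn=4

#PBS -l walltime=4:00:00

#PBS -q normal_q

#PBS -A allocation

#PBS -W group_list=newriver

module load fastqc/0.11.3

cd $PBS_O_WORKDIR

mkdir -p fastqc_reports

# -t sets how many files are processed in parallel; -o sets the report directory

fastqc -t 4 -o fastqc_reports sample_1.fastq.gz sample_2.fastq.gz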

Trimmomatic:

Trimmomatic performs a variety of useful trimming tasks for Illumina paired-end and single-end data. The selection of trimming steps and their associated parameters is supplied on the command line. Based on your results from FastQC (or another quality control program), raw sequence data can be processed, or trimmed, to remove bias, adaptors, and poor-quality reads/bases.

Where to run:

Trimmomatic is available on NewRiver, BlueRidge and HokieSpeed. The best server on which to run this software will depend on how quickly you want your files to run, the server load, your input file sizes, and whether you are running it as part of a larger analysis pipeline. Generally, when running it as stand-alone software, all three servers will be acceptable as this program requires limited memory resources. An example command is provided in the Commands section below.

Input files:

Trimmomatic works with FastQ files (using phred+33 or phred+64 quality scores, depending on the Illumina pipeline used), either in uncompressed or gzip’ed format (.gz extension). The program will automatically determine which format is provided. Trimmomatic accepts both single-end and paired-end read files; this is specified on the command line.

Commands:

Trimmomatic is a Java-based program, so it is invoked using java -jar trimmomatic-0.33.jar.

There are many parameters that can be set on the command line to customize the trimming steps (an example command is shown after this list):

  • ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
  • SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
  • LEADING: Cut bases off the start of a read, if below a threshold quality
  • TRAILING: Cut bases off the end of a read, if below a threshold quality
  • CROP: Cut the read to a specified length
  • HEADCROP: Cut the specified number of bases from the start of the read
  • MINLEN: Drop the read if it is below a specified length
  • TOPHRED33: Convert quality scores to Phred-33
  • TOPHRED64: Convert quality scores to Phred-64
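Below is a minimal paired-end sketch adapted from the canonical example in the Trimmomatic documentation. The input/output file names are placeholders, and the exact path to the trimmomatic jar may differ on ARC once the module is loaded (check with module show trimmomatic):

java -jar trimmomatic-0.33.jar PE -phred33 input_1.fq.gz input_2.fq.gz output_1_paired.fq.gz output_1_unpaired.fq.gz output_2_paired.fq.gz output_2_unpaired.fq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This removes TruSeq3 adapters, trims low-quality leading and trailing bases, scans each read with a 4-base sliding window and cuts when the average quality drops below 15, and finally drops any read shorter than 36 bases.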

Output files:

For single-end data, you will specify one input and one output file on the command line, in addition to the desired processing steps. You will get one output file per sample containing all reads that survived the processing.

For paired-end data, two input files and four output files are specified on the command line. You will get four output files per sample: two output files for the ‘paired’ output, where both reads survived the processing, and two for the corresponding ‘unpaired’ output, where one read survived but its partner did not.

These outputs will remain in FASTQ format.

Trinity:

Trinity assembles transcript sequences from Illumina RNA-seq data and can utilize both single and paired-end read data. Trinity can assemble data with or without a reference genome (de-novo or reference guided), and contains a comprehensive downstream pipeline for transcriptome annotation and differential expression analysis. In the latest release, Trinity can also assemble PacBio reads.

For more information on Trinity:

https://github.com/trinityrnaseq/trinityrnaseq/wiki

Where to run:

Trinity is currently installed on all the ARC servers; however, it is usually not appropriate to run Trinity on HokieSpeed or BlueRidge.

Trinity’s developers recommend using ~1GB of RAM per ~1 million ~76-base Illumina paired-end raw reads. This means that for 100 million raw reads, 100GB of RAM should be requested.

Let’s consider the memory allocations of each server. BlueRidge has 64GB of RAM per node and HokieSpeed has 24GB of RAM per node. Neither server is set up to utilize shared memory, so you are limited by the power and memory available on a single node. If the raw reads to be assembled require less memory than is available on a HokieSpeed or BlueRidge node, then the assembly can be run on those servers. However, two of the ARC servers, HokieOne and NewRiver, are set up as shared-memory machines. This means that a job can request more than one node, and the memory and CPUs on each node can be shared amongst one job. HokieOne and NewRiver are set up slightly differently.

HokieOne

HokieOne can be thought of as ‘one big node’ (hence the name). In HokieOne, memory is allocated on a socket-by-socket basis, which can be requested to meet your needs. HokieOne has a maximum 144-hour wall-time limit per job, with each user restricted to 6,912 core-hours. Scheduling jobs to utilize the resources to their fullest potential requires a bit of math, as a user cannot, at any one time, have more than this many core-hours allocated across all of their running jobs. In addition to this core-hour restriction, a user cannot have more than 12 jobs running at one time. Please consult the HokieOne FAQ for a further description of how to most efficiently request resources while staying in compliance with the per-user core-hours allocation.

NewRiver

While able to run OpenMP and MPI parallelization programs, NewRiver is not a true shared-memory machine like HokieOne. NewRiver is equipped with a variety of node types that have different CPU and memory allocations to handle a wide array of jobs, depending on your needs. This is covered in the NewRiver ARC FAQ. For Trinity on NewRiver, we need to request nodes that are high in memory to handle the very memory-intensive steps that accompany de novo assembly.

We will discuss how to run Trinity on each of these servers specifically, providing example submission scripts.

Input files:

Trinity can run on either single-end or paired-end reads.

Commands:

Trinity requires multiple dependencies before it will run correctly. These are all present on the servers as modules, and they need to be loaded in the submission script in addition to Trinity. These include:

A compiler (gcc or intel – check system for specific versions)

Java (jdk/1.7.0)

Bowtie (bowtie/0.12.7)

There are three required parameters to set on the Trinity de novo command line:

Required:

--seqType <string> : type of reads (fa or fq)

--max_memory <string> : suggested max memory for Trinity to use where limiting can be enabled (jellyfish, sorting, etc.), provided in GB of RAM, i.e. ‘--max_memory 10G’

If paired reads:

--left <string> : left reads, one or more file names (separated by commas, not spaces)

--right <string> : right reads, one or more file names (separated by commas, not spaces)

Or, if unpaired reads:

--single <string> : single reads, one or more file names, comma-delimited (note: if a single file contains pairs, you can use the flag --run_as_paired)

Additional parameters can also be set. To see what additional parameters are available, run:

module load gcc/5.2.0 jdk/1.7.0 trinityrnaseq/2.0.6

Trinity --help

A list of all ‘Misc.’ commands should print to the terminal screen.

** Note: HokieOne, BlueRidge and HokieSpeed have a default compiler loaded when you log in. An appropriate compiler must be loaded into the environment before certain modules can be loaded. On NewRiver, there is no compiler loaded when you log in. Be sure to run module load gcc/5.2.0 before loading Trinity. **

Alternatively, you can type Trinity --show_full_usage_info to get detailed information about every command, and additional settings that can be altered for each of the three main programs in Trinity.

Quality trimming of raw reads can be performed by Trimmomatic within Trinity by the simple addition of the --trimmomatic parameter:

--trimmomatic : run Trimmomatic to quality-trim reads (see ‘--quality_trimming_params’ under the full usage info for tailored settings).

--quality_trimming_params <string> : defaults to “ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25”

Trinity can also perform genome-guided assembly. To run Trinity in this mode, the --genome_guided_bam and --genome_guided_max_intron <int> parameters are also required.

--genome_guided_bam <string> : genome-guided mode; provide the path to a coordinate-sorted bam file (see the genome-guided param section under --show_full_usage_info)

--genome_guided_max_intron <int> : maximum allowed intron length (also maximum fragment span on the genome)

Output files:

The assembled transcripts are written to a single fasta file, Trinity.fasta, inside the output directory specified with --output (trinity_output in the example script below).

Example Scripts:

The server on which you choose to run Trinity will influence how you request the resources needed to run the job. Below are example submission scripts for running Trinity for de novo transcriptome assembly on both HokieOne and NewRiver, using the sample data provided in the modules.

** Note: examples of the genome-guided transcriptome assembly command line are provided on the Trinity website. Using these template scripts, you can modify the command line to utilize this function.

Example submission script for HokieOne: trinity_HO.sh

Example submission script for NewRiver: trinity_NR.sh

trinity_NR.sh

#!/bin/bash

# For a data set with ~500 million reads, we'd need approximately 500GB of RAM, so we request enough resources that we are sent to a 'Big Data' node.

#PBS -l nodes=1:ppn=24,pmem=20gb

# This requests a 'Big Data' node with 512GB of RAM; it says: I need 24 processors per node, each with 20GB of RAM.

#PBS -l walltime=72:00:00

#PBS -q normal_q

#PBS -A allocation

#PBS -W group_list=newriver

#PBS -M username@vt.edu

#PBS -m bea

# Modules required by Trinity

module load gcc/5.2.0

module load jdk/1.7.0

module load trinityrnaseq/2.0.6

module load bowtie/1.1.2

module load samtools/1.2

Trinity --seqType fq --left reads.left.fq --right reads.right.fq --max_memory 500G --output trinity_output

NCBI-Blast+

This stand-alone BLAST (Basic Local Alignment Search Tool) program is designed for searching for sequence similarity against a database or reference of choice. It works with both nucleotide and protein queries and databases. The blast+ package offers search tools as well as database tools for formatting and manipulating databases. This program is frequently used to annotate transcriptomes and genomes.

Where to run:

Blasting is a true bottleneck in bioinformatics research. NCBI-Blast+ is currently installed on NewRiver. The size of the database and the query will play a huge role in how long this program needs to run and how many resources need to be requested. If working with more than 2,000 sequences, it is advisable to split your query into smaller pieces and blast each piece individually against the database. The example script at the end of this section shows how to do this in a high-throughput manner on NewRiver.

Input files:

To run any of the blasting programs, you must have a query and a database file. The query and the database must be in the form of uncompressed fasta sequences. The database must be formatted using the commands below.

Commands:

Prior to running Blast+, you must have a properly formatted database. Blast+ has database application tools to accomplish this:

makeblastdb : create a custom database from a multi-fasta file of nucleotide or protein sequences.

Basic usage of makeblastdb is:

makeblastdb -in inputdb.fasta -dbtype nucl/prot -parse_seqids

The required arguments include:

-in : name of the input database in fasta format

-dbtype : the type of data stored within the database; can be nucl for nucleotide or prot for protein.

makeblastdb has many other parameters and options, and these are covered comprehensively in the ncbi-blast+ manual, or by running:

module load ncbi-blast+/2.2.30

makeblastdb -help

Blast+ has many different search commands that can be used to fit your query and desired database. The most common are:

blastn : compares a nucleotide query to a nucleotide database

blastp : compares a protein query to a protein database.

blastx : compares a nucleotide query translated into six frames against a protein database.

tblastx : compares a nucleotide query translated into six frames against a nucleotide database translated in all six frames.

tblastn : compares a protein query against a nucleotide database translated into all six frames.

Basic usage of these commands includes three parameters: the search method, the database name, and the query name. An example is given below for performing a blastn search:

blastn -db nt_db -query nt.fasta -out results.out

There are many different methods, parameters and options for performing a blast search. These are covered comprehensively in the ncbi-blast+ manual; what is listed here is only a small piece of what this software is capable of, covering just the most common commands. To learn more about any specific command, you can also run:

module load ncbi-blast+/2.2.30

[command] -help

Output files:

The output file contains the top hits for your query sequences that match the specified criteria. The output can be written in one of 13 different formats that contain different alignment view options, and this is specified with the -outfmt option. These include:

0 = pairwise,

1 = query-anchored showing identities,

2 = query-anchored no identities,

3 = flat query-anchored, show identities,

4 = flat query-anchored, no identities,

5 = XML Blast output,

6 = tabular,

7 = tabular with comment lines,

8 = Text ASN.1,

9 = Binary ASN.1,

10 = Comma-separated values,

11 = BLAST archive format (ASN.1)

12 = JSON Seqalign output

Please consult the NCBI-Blast+ manual to fully understand default parameters and the extensive ways to customize your blasting search.
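As referenced in the ‘Where to run’ section, below is a minimal NewRiver submission script sketch for blasting a large query in pieces. It assumes the query has already been split into files named query_chunk_*.fasta (for example, with a small script of your own); the database name, allocation, group list, and resource requests are placeholders:

#!/bin/bash

#PBS -l nodes=1:ppn=24

#PBS -l walltime=48:00:00

#PBS -q normal_q

#PBS -A allocation

#PBS -W group_list=newriver

module load ncbi-blast+/2.2.30

cd $PBS_O_WORKDIR

# Blast each chunk in turn, writing tabular (-outfmt 6) results per chunk

for chunk in query_chunk_*.fasta; do

blastn -db nt_db -query "$chunk" -out "${chunk%.fasta}.out" -outfmt 6 -num_threads 24

done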

Bowtie

Bowtie is an ultrafast short-read aligner for aligning large sets of short DNA or RNA-Seq reads to large genomes (or transcriptomes). It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour. Bowtie works best with shorter reads (25-50bp) but can align reads as long as 1,024 bases at a slower rate. Bowtie takes an indexed reference and a set of reads as input and outputs a list of alignments. Bowtie does not report gapped alignments.

Bowtie is used in many other bioinformatics tools such as TopHat, Cufflinks and Trinity. Bowtie can also be used as a stand-alone software for read mapping.

Where to run:

Bowtie is designed to be a memory-efficient program, requiring no more memory than a typical workstation. It is installed as an ARC module on all four servers. The server you choose to run this on will depend on whether you are running Bowtie as stand-alone software or as a dependency of another program with different memory requirements.

Input files:

Bowtie takes an indexed reference and a set of reads, single or paired-end, as input. The index is generated by running the bowtie-build command on a reference genome or transcriptome, and consists of multiple files. Reads can be either single-end or paired-end and need to be in either FASTA or FASTQ files. Sequences may be of mixed length.

Command line:

To run bowtie, the following additional ARC module will need to be loaded in your environment or script:

module load bowtie/1.1.2 (please check the specific version on the server you are using)

Before running bowtie for alignment, the reference to which you are aligning must be indexed. This is done using the bowtie-build command, which outputs a set of six files that constitute the index. The basic usage of bowtie-build is:

bowtie-build [options]* <reference_in> <ebwt_base>

The required arguments include:

<reference_in> : a comma-separated list of FASTA files containing the reference sequences to be aligned to.

<ebwt_base> : the desired basename of the index files that will be written as output.

Additional parameters and options for bowtie-build can be retrieved from the software manual or by running:

module load bowtie/1.1.2

bowtie-build -h

After the index is built, bowtie can be run to align reads to the reference. The basic usage of bowtie is:

bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

The required arguments include:

<ebwt> : the base name of the index to be searched. This is the name assigned when running bowtie-build.

<m1> : a comma-separated list of the files containing the forward, or left, reads.

<m2> : a comma-separated list of the files containing the reverse, or right, reads. The sequence files must correspond, file for file, with those specified for <m1>.

<hit> : the file to write the alignments to.

Additional parameters and options for bowtie can be retrieved from the software manual or by running:

module load bowtie/1.1.2

bowtie -h
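Putting the two steps together, a minimal worked example (all file names hypothetical) looks like:

module load bowtie/1.1.2

# Build the index from the reference

bowtie-build genome.fa genome_index

# Align paired-end reads, writing SAM output (-S)

bowtie -S genome_index -1 left_reads.fq -2 right_reads.fq alignments.sam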

Output files:

The bowtie-build command outputs a set of six files with the suffixes .1.ebwt, .2.ebwt, .3.ebwt, .4.ebwt, .rev.1.ebwt, and .rev.2.ebwt. These files constitute the index and are all that is required to align the reads to the reference. The original reference file is no longer needed.

Bowtie writes all alignments, one per line, to the output file specified on the command line. Bowtie can also write the output in SAM format by specifying the -S/--sam option on the command line.

In your standard error file, bowtie will write the mapping efficiencies for the run.

Bowtie2

Bowtie2 is an ultrafast short-read aligner for aligning large sets of short DNA or RNA-Seq reads to large genomes (or transcriptomes). Bowtie2 works best with reads from around 50 up to 100s or 1,000s of bps aligned to relatively large genomes. Bowtie2 takes an indexed reference and a set of reads as input and outputs a list of alignments. Bowtie2 supports gapped, local and paired-end alignment modes.

Bowtie2 is used in many other bioinformatics tools such as TopHat, Cufflinks and Trinity. Bowtie2 can also be used as a stand-alone software for read mapping.

Bowtie2 was developed after Bowtie to handle the increasingly longer reads produced by sequencing advances. When working with reads greater than 50bps, Bowtie2 is generally faster, more sensitive, and more memory efficient than Bowtie. Bowtie2 supports gapped and local alignments that allow alignments to overlap ambiguous characters, while Bowtie just finds ungapped alignments where the read might align entirely with no ambiguity. Bowtie2 has no upper read length limit, unlike Bowtie, which maxes out around 1000bps.

Please note that Bowtie and Bowtie2 each have their niche. The command line arguments for Bowtie2 are different from Bowtie.

Where to run:

Bowtie2 is designed to be a memory-efficient program, requiring no more memory than a typical workstation. For example, the developers have reported that it takes approximately 3.2GB of RAM to index the human genome. Bowtie2 is installed as an ARC module on all four servers. The server you choose to run this on will depend on whether you are running Bowtie2 as stand-alone software or as a dependency of another program with different memory requirements. Multiple processors can be used to achieve greater alignment speed.

Input files:

Bowtie2 takes an indexed reference and a set of reads, single or paired-end, as input. The index is generated by running the bowtie2-build command on a reference genome or transcriptome, and consists of multiple files. Reads can be either single-end or paired-end and need to be in either FASTA or FASTQ files. Sequences may be of mixed length.

Command line:

To run bowtie2, the following additional ARC module will need to be loaded in your environment or script:

bowtie2/2.2.5

Before running bowtie2 for alignment, the reference to which you are aligning must be indexed. This is done using the bowtie2-build command, which outputs a set of six files that constitute the index. The basic usage of bowtie2-build is:

bowtie2-build [options]* <reference_in> <bt2_base>

The required arguments include:

<reference_in> : a comma-separated list of FASTA files containing the reference sequences to be aligned to.

<bt2_base> : the desired basename of the index files that will be written as output.

Additional parameters and options for bowtie2-build can be retrieved from the software manual or by running:

module load bowtie2/2.2.5

bowtie2-build -h

After the index is built, bowtie2 can be run to align reads to the reference. The basic usage of bowtie2 is:

bowtie2 [options]* -x <bt2-idx> {-1 <m1> -2 <m2> | -U <r>} -S [<hit>]

The required arguments include:

-x <bt2-idx> : the base name of the index to be searched. This is the name assigned when running bowtie2-build.

-1 <m1> : a comma-separated list of the files containing the forward, or left, reads.

-2 <m2> : a comma-separated list of the files containing the reverse, or right, reads. The sequence files must correspond, file for file, with those specified for <m1>.

-U <r> : a comma-separated list of the files containing unpaired reads to be aligned. **This is optional.

-S <hit> : the file to write the alignments to.

Additional parameters and options for bowtie2 can be retrieved from the software manual or by running:

module load bowtie2/2.2.5

bowtie2 -h
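Putting the two steps together, a minimal worked example (all file names hypothetical) looks like:

module load bowtie2/2.2.5

# Build the index from the reference

bowtie2-build genome.fa genome_index

# Align paired-end reads, writing SAM output

bowtie2 -x genome_index -1 left_reads.fq -2 right_reads.fq -S alignments.sam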

Output files:

The bowtie2-build command outputs a set of six files with the suffixes .1.bt2, .2.bt2, .3.bt2, .4.bt2, .rev.1.bt2, and .rev.2.bt2. These files constitute the index and are all that is required to align the reads to the reference. The original reference file is no longer needed.

Bowtie2 writes all alignments into a SAM-formatted file. In your standard error file, bowtie2 will write the mapping efficiencies for the run.

BWA (Burrows-Wheeler Aligner)

BWA is a mapping program for mapping low-divergence sequences against a reference genome. It contains three different algorithms that are designed to handle different sequence lengths: BWA-Backtrack for Illumina sequence reads up to 100bp, and BWA-SW and BWA-MEM for reads from 70bp to 1Mbp. Each of these algorithms is suited for a particular data type.

Where to run:

BWA is installed on NewRiver, BlueRidge, HokieSpeed and HokieOne. According to the developers, it takes approximately 5GB of RAM to index the complete human genome; for short reads, the aln command uses around 3GB of RAM and the sampe (paired-end alignment) command requires around 5.5GB of RAM. These memory requirements can be met on all four systems. The exact memory required for running your alignment will depend heavily on the database you are aligning to as well as the size of the input files. Generally, requesting more resources will help the job finish more quickly.

Input Files:

BWA requires an FM-indexed database file.

Command Line:

All commands in BWA are invoked using different sub-commands following ‘bwa’.

For each algorithm, BWA must first construct an index for the database using the index command. There are two different options for indexing the database: the default algorithm ‘is’ for smaller databases (less than 2GB) and the ‘bwtsw’ algorithm for databases larger than 2GB. This is invoked using the index command:

bwa index [-p prefix] [-a algoType] <in.db.fasta>

The required arguments include:

-p : the desired prefix, or name, of the output database.

-a : the algorithm used to construct the index file. The available options are ‘is’ and ‘bwtsw’.

After the database is created, three different alignment algorithms can be invoked depending on your needs: BWA-Backtrack, BWA-SW and BWA-MEM.

BWA-Backtrack is designed for Illumina sequence reads up to 100bp and is invoked with bwa aln/samse/sampe depending on the input data type and desired output. Please consult the BWA manual for a detailed description of these algorithms.

Basic usage of BWA aln is:

bwa aln <in.db.fasta> <in.query.fq> > <out.sai>

Generates the SA coordinates of the input reads.

Basic usage of BWA samse is:

bwa samse <in.db.fasta> <in.sai> <in.fq> > <out.sam>

Generates alignments in SAM format from single-end reads.

Basic usage of BWA sampe is:

bwa sampe <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>

Generates alignments in SAM format from paired-end reads.

BWA-SW is for Illumina sequence reads ranging from 70bp to 1Mbp.

Basic usage of BWA bwasw is:

bwa bwasw <in.db.fasta> <reads.fq> [mate.fq]

Generates alignments of the query sequences from the reads.fq file. This can take both paired-end and single-read data. If mate.fq is specified, it will run in paired-end mode.

BWA-MEM is for Illumina sequence reads ranging from 70bp to 1Mbp. It is similar to BWA-SW but is preferable for high-quality reads as it is faster and more accurate.

Basic usage of BWA mem is:

bwa mem db.prefix reads.fq [mates.fq]

Generates alignments of the query sequences to the database file. This can take both paired-end and single-read data. If mates.fq is specified, it will run in paired-end mode.
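Putting the pieces together, a minimal worked example using BWA-MEM (file names hypothetical; check module avail bwa for the installed version on your server):

module load bwa

# Index the reference (bwtsw for databases larger than 2GB)

bwa index -p genome_db -a bwtsw genome.fa

# Align paired-end reads; the SAM output is written to stdout

bwa mem genome_db reads_1.fq reads_2.fq > alignments.sam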

 

Output Files:

Output files will vary depending on the command run. Most alignment commands produce a SAM file.

Diamond (Double index alignment of next-generation sequencing data)

Diamond is a high-throughput, ultrafast, BLAST-compatible program for aligning short DNA reads against a reference protein database such as NR. According to the developers, this program aligns 100-150bp Illumina reads at 20,000 times the speed of BLASTX with comparable sensitivity in default mode.

Where to run:

Blasting and aligning are often bottlenecks in bioinformatics research. They are often computationally heavy and require long stretches of computational hours to complete. Diamond is installed on NewRiver, BlueRidge and HokieSpeed. Diamond is designed with an ‘adaptable memory footprint’ that does not depend on the size of the input, unlike many other programs. Diamond is programmed to break down the input query and the reference database into chunks of sequence letters that are compared against each other one at a time. This means that a minimum amount of RAM is not required for a specific input data size, but more memory will let the program run considerably faster. For this type of program, using the maximum cores and available RAM is recommended for faster processing. The memory allocation can be controlled by the user by setting a block-size value. The default setting uses approximately 12GB of RAM. Diamond can be run on all three servers successfully.

Input files:

Diamond requires two input files:

1. a read file containing DNA reads 100-150bp long in .fna (fasta nucleic acid) format.

2. a protein database in fasta format. This will need to be formatted using Diamond before alignment.

Command line:

Before running Diamond, a reference database will need to be formatted using the following command:

diamond makedb --in <database> -d <basename>

The required arguments are:

--in : the protein database in fasta format.

-d : the desired name of the formatted database.

This will create a binary Diamond database file with the specified base name.

After the formatted database is created, the alignment command can be invoked using the following command:

diamond blastx -d <basename> -q <.fna> -a <output file> -t <temporary directory>

OR

diamond blastp -d <basename> -q <.fna> -a <output file> -t <temporary directory>

The required arguments include:

blastx : align translated DNA query sequences against a protein reference database.

blastp : align protein query sequences against a protein reference database.

-d : basename of the formatted database.

-q : the input query file of reads.

-a : desired name of the output file.

-t : desired name of the temporary directory where intermediate files will be held.

Diamond has other parameters and options; these are covered comprehensively in the Diamond manual, or by running:

module load diamond/0.6.13

diamond -h

Output files:

By default, output is written in .daa (diamond alignment archive) format. The .daa file can be converted to tab (BLAST tabular format) or SAM format using the ‘view’ command.
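Putting the whole workflow together, a minimal sketch (file names hypothetical) looks like:

module load diamond/0.6.13

# Format the protein reference database

diamond makedb --in proteins.faa -d proteins

# Align translated DNA reads against it, writing a .daa archive

diamond blastx -d proteins -q reads.fna -a matches -t /tmp

# Convert the .daa archive to BLAST tabular format

diamond view -a matches.daa -o matches.tsv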

HMMER:

HMMER is commonly used software for predicting homologous protein or nucleotide sequences against a sequence database. It is utilized by many bioinformatics resources such as Pfam, InterPro and UGENE. This software uses profile hidden Markov models (HMMs) to best predict homologs.

Where to run:

Input files:

HMMER handles both aligned and unaligned data formats. Acceptable formats include: Stockholm, FASTA, Clustal, NCBI PSI-BLAST, PHYLIP, Selex, UCSC SAM, EMBL, and GenBank. HMMER will auto-detect the format.

Commands:

HMMER has many applications. Some of the main applications are (a brief example follows this list):

1. Searching protein databases in place of BLASTp, with the commands phmmer and jackhmmer.

2. Scanning a protein database to identify genes with strongly conserved amino acid and spacing patterns of interest.

3. Automated annotation of the domain structures of proteins.
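As a minimal sketch of these applications (the module name and file names are placeholders; check module avail hmmer for the version installed on your server):

module load hmmer

# BLASTp-style search of a single protein query against a protein database

phmmer query.fasta target_proteins.fasta

# Search a profile HMM against a sequence database

hmmsearch profile.hmm target_proteins.fasta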

Output files:

TopHat

TopHat is a program that aligns RNA-Seq reads to a reference genome to identify splice junctions in the exon regions using the short-read aligner {Bowtie}. This program can be run with or without a reference annotation file or with a known insertion/deletion file.

TopHat works by mapping RNA-Seq reads to a reference genome, identifying candidate exons, and building a database of possible splice junctions. It first identifies regions where the reads map end to end, terming these regions where reads pile up “islands” in the reference genome. Many of these islands will be exons. TopHat then runs an internal program to find splice junctions in the unmapped reads. TopHat also generates an alignment file that can be used in the Cufflinks pipeline.

Where to run:

TopHat, as well as all required dependencies, is installed on NewRiver, BlueRidge and HokieSpeed. The best server on which to run this software will depend on how quickly you want your files to run, the server load, your input file sizes, and the reference-genome size. Usually it is appropriate to use a machine with higher memory, such as BlueRidge or NewRiver, especially when working with transcriptome coverage greater than 10X. An example script will be provided for running on NewRiver.

Input files:

TopHat accepts reads in FASTA or FASTQ format, with FASTQ being the preferred format. TopHat does not support the mixing of these formats in the same input file. TopHat is designed to work with reads produced from Illumina sequencing and is optimized for reads 75bp or longer. TopHat will accept .gz files.

TopHat can run on either single-end or paired-end reads, and can handle both in the same run.

Commands:

To run TopHat, the following additional ARC modules will need to be loaded in your environment or script:

1. python/2.7.10

2. boost/1.58.0

3. samtools/1.2

4. bowtie2/2.2.5 (default, however it can use bowtie)

Basic usage of TopHat is:

tophat [options]* <genome_index_base> <reads1_1[,…,readsN_1]> [reads1_2,…readsN_2]

The required arguments include:

<genome_index_base> : the base name of the genome index to be searched. This is created using the bowtie2-build command (or bowtie-build if using Bowtie).

<reads1_1[,…,readsN_1]> : a comma-separated list of files in the appropriate format, with no spaces. When working with paired-end reads, *_1 is equivalent to the forward, or left, reads.

<reads1_2[,…,readsN_2]> : a comma-separated list of files in the appropriate format, with no spaces. When working with paired-end reads, *_2 is equivalent to the reverse, or right, reads, and the files must appear in the same order as the *_1 files.
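For example, a minimal paired-end run (index and read file names hypothetical) would look like:

module load python/2.7.10 boost/1.58.0 samtools/1.2 bowtie2/2.2.5 tophat/2.1

# -o names the output directory; reads_1/reads_2 are the left/right mates

tophat -o tophat_output genome_index reads_1.fq reads_2.fq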

TopHat has numerous parameters and options; these are covered comprehensively in the TopHat manual, or by running:

module load python/2.7.10 boost/1.58.0 tophat/2.1

tophat --help

A list of all additional commands and descriptions will print to the terminal screen.

The default values are optimized for mammalian RNA-Seq reads. The developers suggest that if you are running TopHat on a different class of organism, you set stricter parameters. Please refer to the TopHat manual for more detailed suggestions.

Output Files: 

Numerous files will be produced in the output directory; however, most of these will be intermediate files of little importance to you. The output files that are of most relevance are:

1. accepted_hits.bam : a list of the alignments in BAM format. This output file can be the input file for Cufflinks if proceeding with that pipeline.

2. junctions.bed : a BED track of junctions reported by TopHat.

3. insertions.bed and deletions.bed : files reporting the insertions and deletions found by TopHat.