Welcome to the variant analysis tutorial page.

This is a pipeline for SNP discovery.

A reference genome (fasta format) and one or two read files (fastq format), for single-end or paired-end data respectively, are required.

0. Import Data

Create a new history, and import the following data:

I. Pre-Processing

Step 0: Use 'fastqc' to check the quality of the reads (fastq format).

The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application
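
In Galaxy you simply select the fastq dataset(s) as input; for reference, the equivalent command line would look roughly like this (the read file names are placeholders for this tutorial's data):

    # run FastQC on both read files and write the reports to a dedicated directory
    mkdir -p fastqc_reports
    fastqc -o fastqc_reports reads_1.fastq reads_2.fastq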

The following links show the results after using 'fastqc':

Step 1: Pre-processing: clean the reads (fastq format) using 'fqCleaner'.

fqCleaner v.0.4.1

 USAGE :
    fqCleaner.sh [options]
  where 'options' are :
   -f   FASTQ formatted input file name (mandatory option)               
   -r   when using  paired-ends data,  this option  allows inputting the
                 second file (i.e. reverse reads)
   -q      quality score  threshold (default: 20);  all bases with quality  
                score below this threshold are considered as non-confident       
   -l      minimum required length for a read (default: 30)                 
   -p      minimum percent of bases (default: 80) that must have a quality  
                score higher than the fixed threshold (set by option -q)         
   -s   a sequence of tasks to be iteratively performed, each set with
                 one of the following letters:
                  Q: each read is trimmed off to  remove non-confident bases at  
                     5' and 3' ends (quality score threshold is set with option  
                     -q); all short reads are discarded (minimum read length is  
                     set with option -l)                                         
                  F: each read containing  too few confident  bases is filtered  
                     out (the minimum percentage of confident bases is set with  
                     option -p)                                                  
                  A: each artefactual read is filtered out                       
                  C: contaminant oligonucleotide sequences are trimmed off when  
                      occurring in either  5' or 3' ends of each read;  all short
                     reads  are  discarded  (minimum  read  length is  set with  
                     option -l);  user contaminant  sequences can  be set  with  
                     option -c                                                   
                  D: all duplicated single- or paired-ends reads are removed     
                  d: (only for  paired-ends data)  all duplicated  reads within  
                     each input file are removed                                 
                  N: the number of (remaining) reads is displayed                
                (default sequence of tasks: NQNCNFNANDN)                         
   -c   sequence file containing contaminant nucleotide sequences  (one  
                 per line) to be trimmed off during step 'C' (this option doesn't work yet in Galaxy)
   -x     single-end: output file name;  paired-ends: forward read output  
                file name (default: ..fq )                 
   -y     paired-ends data only: reverse read output file name  (default:  
                ..fq)                                      
   -z     paired-ends data only:  single read output file name  (default:  
                ..sgl.fq)  

In this example, we keep all default parameters.
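
For reference, the equivalent command line built from the usage above would be (read and output file names are placeholders; the defaults are the ones listed in the help):

    # paired-end example with all default parameters (-q 20, -l 30, -p 80, -s NQNCNFNANDN)
    fqCleaner.sh -f reads_1.fastq -r reads_2.fastq \
                 -x cleaned_1.fq -y cleaned_2.fq -z cleaned_single.fq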

The following links show the results after using 'fqCleaner':

Step 2: Run 'fastqc' again to check the read quality after pre-processing.

The following links show the results after re-using 'fastqc':

II. MAPPING

Step 3: Map the reads to the reference genome using 'BWA'.

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.5.9-r16
Contact: Heng Li

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         pac_rev       generate reverse PAC
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment

--- For paired-end data: bwa sampe ---

Usage:   bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>

Options: -a INT   maximum insert size [500]
         -o INT   maximum occurrences for one end [100000]
         -n INT   maximum hits to output for paired reads [3]
         -N INT   maximum hits to output for discordant pairs [10]
         -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
         -f FILE  sam file to output results to [stdout]
         -r STR   read group header line such as `@RG\tID:foo\tSM:bar' [null]
         -P       preload index into memory (for base-space reads only)
         -s       disable Smith-Waterman for the unmapped mate
         -A       disable insert size estimate (force -s)

Notes: 1. For SOLiD reads, <in1.fq> corresponds R3 reads and <in2.fq> to F3.
       2. For reads shorter than 30bp, applying a smaller -o is recommended to
          get a sensible speed at the cost of pairing accuracy.
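
Outside Galaxy, this mapping step corresponds roughly to the following commands (a minimal sketch; file names are placeholders and all options are left at their defaults):

    # index the reference genome once
    bwa index reference.fasta
    # gapped/ungapped alignment of each cleaned read file
    bwa aln reference.fasta cleaned_1.fq > reads_1.sai
    bwa aln reference.fasta cleaned_2.fq > reads_2.sai
    # generate the paired-end alignment in SAM format
    bwa sampe -f mapping.sam reference.fasta reads_1.sai reads_2.sai cleaned_1.fq cleaned_2.fq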

The following link shows the result after using 'BWA':

Step 4: Convert the output SAM format mapping file to BAM format using 'SAM-to-BAM'.

This tool uses the SAMTools toolkit to produce an indexed BAM file based on a sorted input SAM file.
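
A rough command-line equivalent of this step using SAMtools directly (samtools 0.1.x syntax; file names are placeholders):

    # convert SAM to BAM, sort by coordinate, then index
    samtools view -bS mapping.sam > mapping.bam
    samtools sort mapping.bam mapping.sorted
    samtools index mapping.sorted.bam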

The following link shows the result after using 'SAM-to-BAM':

III. SAMTOOLS SNP CALLING

Step 5: Use 'mpileup' for SNP calling.

'mpileup' is a SAMtools command that collects summary information from the input BAM files, computes the likelihood of the data given each possible genotype, and stores the likelihoods in BCF format (via link).

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:

       -6           assume the quality is in the Illumina-1.3+ encoding
       -A           count anomalous read pairs
       -B           disable BAQ computation
       -b FILE      list of input BAM files [null]
       -C INT       parameter for adjusting mapQ; 0 to disable [0]
       -d INT       max per-BAM depth to avoid excessive memory usage [250]
       -E           extended BAQ for higher sensitivity but lower specificity
       -f FILE      faidx indexed reference sequence file [null]
       -G FILE      exclude read groups listed in FILE [null]
       -l FILE      list of positions (chr pos) or regions (BED) [null]
       -M INT       cap mapping quality at INT [60]
       -r STR       region in which pileup is generated [null]
       -R           ignore RG tags
       -q INT       skip alignments with mapQ smaller than INT [0]
       -Q INT       skip bases with baseQ/BAQ smaller than INT [13]

Output options:

       -D           output per-sample DP in BCF (require -g/-u)
       -g           generate BCF output (genotype likelihoods)
       -O           output base positions on reads (disabled by -g/-u)
       -s           output mapping quality (disabled by -g/-u)
       -S           output per-sample strand bias P-value in BCF (require -g/-u)
       -u           generate uncompress BCF output

SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):

       -e INT       Phred-scaled gap extension seq error probability [20]
       -F FLOAT     minimum fraction of gapped reads for candidates [0.002]
       -h INT       coefficient for homopolymer errors [100]
       -I           do not perform indel calling
       -L INT       max per-sample depth for INDEL calling [250]
       -m INT       minimum gapped reads for indel candidates [1]
       -o INT       Phred-scaled gap open sequencing error probability [40]
       -P STR       comma separated list of platforms for indels [all]

Notes: Assuming diploid individuals.
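
A minimal command-line sketch of this step (samtools 0.1.x syntax; file names are placeholders), generating genotype likelihoods in BCF format:

    # the reference must be indexed with faidx first
    samtools faidx reference.fasta
    # -u: uncompressed BCF output with genotype likelihoods; -f: indexed reference
    samtools mpileup -uf reference.fasta mapping.sorted.bam > variants.raw.bcf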

The following link shows the result after using 'mpileup':

Step 6: Convert the output pileup file to VCF format using 'Pileup to VCF'.

This step converts the pileup file to VCF format (via link) so that the SNP information is displayed more readably.
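
The Galaxy 'Pileup to VCF' tool is its own wrapper; as a rough command-line approximation, the BCF produced by mpileup can be turned into a filtered VCF with the bcftools and vcfutils.pl utilities shipped with SAMtools 0.1.x (file names are placeholders):

    # convert BCF to VCF and apply a basic maximum-depth filter
    bcftools view variants.raw.bcf | vcfutils.pl varFilter -D100 > variants.flt.vcf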

The following link shows the result after using 'Pileup to VCF':

IV. GATK2 SNP CALLING

Notice: The GATK2 SNP calling step is independent of the SAMTOOLS SNP calling step.

Step 5': Use 'Add or Replace Groups' to add read groups to the reads.

Many downstream analysis tools (such as GATK, for example) require BAM datasets to contain read groups. Even if you are not going to use GATK, setting read groups correctly from the start will simplify your life greatly.
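
The Galaxy tool is a wrapper around Picard's AddOrReplaceReadGroups; a minimal sketch of the underlying command (all read-group values below are placeholders to adapt to your sample):

    java -jar AddOrReplaceReadGroups.jar \
         INPUT=mapping.sorted.bam OUTPUT=mapping.rg.bam \
         RGID=sample1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 \
         SORT_ORDER=coordinate CREATE_INDEX=true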

The following link shows the result after using 'Add or Replace Groups':

Step 6': Create the indel intervals with 'Realigner Target Creator'.

Notice: This step depends on Step 5'.

A GATK tool to emit intervals for the Local Indel Realigner to target for cleaning. Ignores 454 reads, MQ0 reads, and reads with consecutive indel operators in the CIGAR string.

For more information about this tool, please see the documentation (via link).
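
A minimal sketch of the underlying GATK 2.x command (file names are placeholders; the reference needs a .fai index and a Picard sequence dictionary):

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
         -R reference.fasta -I mapping.rg.bam \
         -o indel_targets.intervals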

The following link shows the result after using 'Realigner Target Creator'.

Step 7': Realign the reads around the indel intervals against the reference genome with 'Indel Realigner'.

Notice: This step depends on Step 6'.

A GATK tool to perform local realignment of reads based on misalignments due to the presence of indels. Unlike most mappers, this walker uses the full alignment context to determine whether an appropriate alternate reference (i.e. indel) exists and updates SAM Records accordingly.

For more information about this tool, please see the documentation (via link).
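
A minimal sketch of the underlying GATK 2.x command, reusing the intervals created in Step 6' (file names are placeholders):

    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
         -R reference.fasta -I mapping.rg.bam \
         -targetIntervals indel_targets.intervals -o mapping.realigned.bam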

The following link shows the result after using 'indel realigner'.

Step 8': Use 'Unified genotyper' for SNP calling.

Notice: This step depends on Step 5'.

A GATK variant caller that unifies the approaches of several disparate callers. It works for single-sample and multi-sample data, and the user can choose from several different incorporated calculation models.

For more information about this tool, please see the documentation (via link).
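
A minimal sketch of the underlying GATK 2.x command (file names are placeholders; -glm BOTH calls SNPs and indels together):

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
         -R reference.fasta -I mapping.realigned.bam \
         -glm BOTH -o gatk_variants.vcf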

The following links show the results after using 'Unified genotyper'.

Step 9: Compare the SNPs called by mpileup with those called by GATK.
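
One simple way to compare the two call sets on the command line is vcftools (an assumption here, not part of the Galaxy history; file names are placeholders):

    # reports sites shared by, and unique to, each VCF in *.diff.* output files
    vcftools --vcf samtools_variants.vcf --diff gatk_variants.vcf --out mpileup_vs_gatk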

V. De novo assembly

Step 10: De novo assembly with 'clc_assembler'.

Notice: This step is independent of the MAPPING, MPILEUP and GATK steps.

This step doesn't need the reference genome (fasta file); only the cleaned reads are required.

This tool assembles the cleaned reads without genome reference and outputs contig sequences in fasta format. It can also output scaffolding annotation in a GFF output format file.
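
A quick sanity check of the resulting contig file can be done with standard shell tools ('contigs.fasta' is a placeholder name for the clc_assembler output):

    # number of contigs
    grep -c ">" contigs.fasta
    # total assembled length in bases
    grep -v ">" contigs.fasta | tr -d '\n' | wc -c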

The following links show the results after using 'clc_assembler'.

VI. Pipeline -> Workflow

Step 11: You can make a workflow from your history.

History --> Setting --> Extract Workflow --> choose the steps that you want to have in your workflow --> create workflow

Go to the Workflow section; you will find your pipeline in the list.

The following link shows the created workflow: