Welcome to the variant analysis tutorial page.

This is a pipeline for SNP discovery.

A reference genome (fasta format) and one or two read files (fastq format), for single-end or paired-end data respectively, are required.

0. Import Data

Create a new history, and import the following data:

I. Pre-Processing

Step 0: Use 'fastqc' to check the quality of the reads (fastq format).

The main functions of FastQC are:

  • Import of data from BAM, SAM or FastQ files (any variant)
  • Providing a quick overview to tell you in which areas there may be problems
  • Summary graphs and tables to quickly assess your data
  • Export of results to an HTML based permanent report
  • Offline operation to allow automated generation of reports without running the interactive application
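
In Galaxy you simply select the fastq dataset(s) as input; for reference, the equivalent command line would look roughly like this (the read file names are placeholders for this tutorial's data):

    # run FastQC on both read files and write the reports to a dedicated directory
    mkdir -p fastqc_reports
    fastqc -o fastqc_reports reads_1.fastq reads_2.fastq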

The following links show the results after using 'fastqc':

Step 1: Pre-processing: clean the reads (fastq format) using 'fqCleaner'.

fqCleaner v.0.4.1

 USAGE :
    fqCleaner.sh [options]
  where 'options' are :
   -f   FASTQ formatted input file name (mandatory option)               
   -r   when using  paired-ends data,  this option  allows inputting the
                 second file (i.e. reverse reads)
   -q      quality score  threshold (default: 20);  all bases with quality  
                score below this threshold are considered as non-confident       
   -l      minimum required length for a read (default: 30)                 
   -p      minimum percent of bases (default: 80) that must have a quality  
                score higher than the fixed threshold (set by option -q)         
   -s   a sequence of tasks to be iteratively performed, each set with
                 one of the following letters:
                  Q: each read is trimmed off to  remove non-confident bases at  
                     5' and 3' ends (quality score threshold is set with option  
                     -q); all short reads are discarded (minimum read length is  
                     set with option -l)                                         
                  F: each read containing  too few confident  bases is filtered  
                     out (the minimum percentage of confident bases is set with  
                     option -p)                                                  
                  A: each artefactual read is filtered out                       
                  C: contaminant oligonucleotide sequences are trimmed off when  
                      occurring in either  5' or 3' ends of each read;  all short
                     reads  are  discarded  (minimum  read  length is  set with  
                     option -l);  user contaminant  sequences can  be set  with  
                     option -c                                                   
                  D: all duplicated single- or paired-ends reads are removed     
                  d: (only for  paired-ends data)  all duplicated  reads within  
                     each input file are removed                                 
                  N: the number of (remaining) reads is displayed                
                (default sequence of tasks: NQNCNFNANDN)                         
   -c   sequence file containing contaminant nucleotide sequences  (one  
                 per line) to be trimmed off during step 'C' (this option doesn't work yet in Galaxy)
   -x     single-end: output file name;  paired-ends: forward read output  
                file name (default: ..fq )                 
   -y     paired-ends data only: reverse read output file name  (default:  
                ..fq)                                      
   -z     paired-ends data only:  single read output file name  (default:  
                ..sgl.fq)  

In this example, we keep all default parameters.
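
For reference, the equivalent command line built from the usage above would be (read and output file names are placeholders; the defaults are the ones listed in the help):

    # paired-end example with all default parameters (-q 20, -l 30, -p 80, -s NQNCNFNANDN)
    fqCleaner.sh -f reads_1.fastq -r reads_2.fastq \
                 -x cleaned_1.fq -y cleaned_2.fq -z cleaned_single.fq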

The following links show the results after using 'fqCleaner':

Step 2: Run 'fastqc' again to check the read quality after pre-processing.

The following links show the results after re-using 'fastqc':

II. MAPPING

Step 3: Map the reads to the reference genome using 'BWA'.

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.5.9-r16
Contact: Heng Li

Usage:   bwa <command> [options]

Command: index         index sequences in the FASTA format
         aln           gapped/ungapped alignment
         samse         generate alignment (single ended)
         sampe         generate alignment (paired ended)
         bwasw         BWA-SW for long queries

         fa2pac        convert FASTA to PAC format
         pac2bwt       generate BWT from PAC
         pac2bwtgen    alternative algorithm for generating BWT
         bwtupdate     update .bwt to the new format
         pac_rev       generate reverse PAC
         bwt2sa        generate SA from BWT and Occ
         pac2cspac     convert PAC to color-space PAC
         stdsw         standard SW/NW alignment

--- For paired-end data: bwa sampe ---

Usage:   bwa sampe [options] <prefix> <in1.sai> <in2.sai> <in1.fq> <in2.fq>

Options: -a INT   maximum insert size [500]
         -o INT   maximum occurrences for one end [100000]
         -n INT   maximum hits to output for paired reads [3]
         -N INT   maximum hits to output for discordant pairs [10]
         -c FLOAT prior of chimeric rate (lower bound) [1.0e-05]
         -f FILE  sam file to output results to [stdout]
         -r STR   read group header line such as `@RG\tID:foo\tSM:bar' [null]
         -P       preload index into memory (for base-space reads only)
         -s       disable Smith-Waterman for the unmapped mate
         -A       disable insert size estimate (force -s)

Notes: 1. For SOLiD reads, <in1.fq> corresponds R3 reads and <in2.fq> to F3.
       2. For reads shorter than 30bp, applying a smaller -o is recommended to
          get a sensible speed at the cost of pairing accuracy.
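
Outside Galaxy, this mapping step corresponds roughly to the following commands (a minimal sketch; file names are placeholders and all options are left at their defaults):

    # index the reference genome once
    bwa index reference.fasta
    # gapped/ungapped alignment of each cleaned read file
    bwa aln reference.fasta cleaned_1.fq > reads_1.sai
    bwa aln reference.fasta cleaned_2.fq > reads_2.sai
    # generate the paired-end alignment in SAM format
    bwa sampe -f mapping.sam reference.fasta reads_1.sai reads_2.sai cleaned_1.fq cleaned_2.fq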

The following link shows the result after using 'BWA':

Step 4: Convert the output SAM format mapping file to BAM format using 'SAM-to-BAM'.

This tool uses the SAMTools toolkit to produce an indexed BAM file based on a sorted input SAM file.
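
A rough command-line equivalent of this step using SAMtools directly (samtools 0.1.x syntax; file names are placeholders):

    # convert SAM to BAM, sort by coordinate, then index
    samtools view -bS mapping.sam > mapping.bam
    samtools sort mapping.bam mapping.sorted
    samtools index mapping.sorted.bam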

The following link shows the result after using 'SAM-to-BAM':

III. SAMTOOLS SNP CALLING

Step 5: Use 'mpileup' for SNP calling.

'mpileup' is a SAMtools command that collects summary information from the input BAM files, computes the likelihood of the data given each possible genotype, and stores the likelihoods in BCF format (via link).

Usage: samtools mpileup [options] in1.bam [in2.bam [...]]

Input options:

       -6           assume the quality is in the Illumina-1.3+ encoding
       -A           count anomalous read pairs
       -B           disable BAQ computation
       -b FILE      list of input BAM files [null]
       -C INT       parameter for adjusting mapQ; 0 to disable [0]
       -d INT       max per-BAM depth to avoid excessive memory usage [250]
       -E           extended BAQ for higher sensitivity but lower specificity
       -f FILE      faidx indexed reference sequence file [null]
       -G FILE      exclude read groups listed in FILE [null]
       -l FILE      list of positions (chr pos) or regions (BED) [null]
       -M INT       cap mapping quality at INT [60]
       -r STR       region in which pileup is generated [null]
       -R           ignore RG tags
       -q INT       skip alignments with mapQ smaller than INT [0]
       -Q INT       skip bases with baseQ/BAQ smaller than INT [13]

Output options:

       -D           output per-sample DP in BCF (require -g/-u)
       -g           generate BCF output (genotype likelihoods)
       -O           output base positions on reads (disabled by -g/-u)
       -s           output mapping quality (disabled by -g/-u)
       -S           output per-sample strand bias P-value in BCF (require -g/-u)
       -u           generate uncompress BCF output

SNP/INDEL genotype likelihoods options (effective with `-g' or `-u'):

       -e INT       Phred-scaled gap extension seq error probability [20]
       -F FLOAT     minimum fraction of gapped reads for candidates [0.002]
       -h INT       coefficient for homopolymer errors [100]
       -I           do not perform indel calling
       -L INT       max per-sample depth for INDEL calling [250]
       -m INT       minimum gapped reads for indel candidates [1]
       -o INT       Phred-scaled gap open sequencing error probability [40]
       -P STR       comma separated list of platforms for indels [all]

Notes: Assuming diploid individuals.
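
A minimal command-line sketch of this step (samtools 0.1.x syntax; file names are placeholders), generating genotype likelihoods in BCF format:

    # the reference must be indexed with faidx first
    samtools faidx reference.fasta
    # -u: uncompressed BCF output with genotype likelihoods; -f: indexed reference
    samtools mpileup -uf reference.fasta mapping.sorted.bam > variants.raw.bcf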

The following link shows the result after using 'mpileup':

Step 6: Convert the output pileup file to VCF format using 'Pileup to VCF'.

This step converts the pileup file to VCF format (via link) so that the SNP information is displayed more readably.
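
The Galaxy 'Pileup to VCF' tool is its own wrapper; as a rough command-line approximation, the BCF produced by mpileup can be turned into a filtered VCF with the bcftools and vcfutils.pl utilities shipped with SAMtools 0.1.x (file names are placeholders):

    # convert BCF to VCF and apply a basic maximum-depth filter
    bcftools view variants.raw.bcf | vcfutils.pl varFilter -D100 > variants.flt.vcf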

The following link shows the result after using 'Pileup to VCF':

IV. GATK2 SNP CALLING

Notice: The GATK2 SNP calling step is independent of the SAMTOOLS SNP calling step.

Step 5': Use 'Add or Replace Groups' to add read groups to the reads.

Many downstream analysis tools (such as GATK, for example) require BAM datasets to contain read groups. Even if you are not going to use GATK, setting read groups correctly from the start will simplify your life greatly.
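
The Galaxy tool is a wrapper around Picard's AddOrReplaceReadGroups; a minimal sketch of the underlying command (all read-group values below are placeholders to adapt to your sample):

    java -jar AddOrReplaceReadGroups.jar \
         INPUT=mapping.sorted.bam OUTPUT=mapping.rg.bam \
         RGID=sample1 RGLB=lib1 RGPL=illumina RGPU=unit1 RGSM=sample1 \
         SORT_ORDER=coordinate CREATE_INDEX=true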

The following link shows the result after using 'Add or Replace Groups':

Step 6': Create the indel intervals with 'Realigner Target Creator'.

Notice: This step depends on Step 5'.

A GATK tool to emit intervals for the Local Indel Realigner to target for cleaning. Ignores 454 reads, MQ0 reads, and reads with consecutive indel operators in the CIGAR string.

For more information about this tool, please see the documentation (via link).
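
A minimal sketch of the underlying GATK 2.x command (file names are placeholders; the reference needs a .fai index and a Picard sequence dictionary):

    java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator \
         -R reference.fasta -I mapping.rg.bam \
         -o indel_targets.intervals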

The following link shows the result after using 'Realigner Target Creator'.

Step 7': Realign the reads around the indel intervals against the reference genome with 'Indel Realigner'.

Notice: This step depends on Step 6'.

A GATK tool to perform local realignment of reads based on misalignments due to the presence of indels. Unlike most mappers, this walker uses the full alignment context to determine whether an appropriate alternate reference (i.e. indel) exists and updates SAM Records accordingly.

For more information about this tool, please see the documentation (via link).
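
A minimal sketch of the underlying GATK 2.x command, reusing the intervals created in Step 6' (file names are placeholders):

    java -jar GenomeAnalysisTK.jar -T IndelRealigner \
         -R reference.fasta -I mapping.rg.bam \
         -targetIntervals indel_targets.intervals -o mapping.realigned.bam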

The following link shows the result after using 'indel realigner'.

Step 8': Use 'Unified genotyper' for SNP calling.

Notice: This step depends on Step 5'.

A GATK variant caller that unifies the approaches of several disparate callers. It works for single-sample and multi-sample data, and the user can choose from several different incorporated calculation models.

For more information about this tool, please see the documentation (via link).
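
A minimal sketch of the underlying GATK 2.x command (file names are placeholders; -glm BOTH calls SNPs and indels together):

    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
         -R reference.fasta -I mapping.realigned.bam \
         -glm BOTH -o gatk_variants.vcf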

The following links show the results after using 'Unified genotyper'.

Step 9: Compare the SNPs called by mpileup with those called by GATK.
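
One simple way to compare the two call sets on the command line is vcftools (an assumption here, not part of the Galaxy history; file names are placeholders):

    # reports sites shared by, and unique to, each VCF in *.diff.* output files
    vcftools --vcf samtools_variants.vcf --diff gatk_variants.vcf --out mpileup_vs_gatk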

V. De novo assembly

Step 10: De novo assembly with 'clc_assembler'.

Notice: This step is independent of the MAPPING, MPILEUP and GATK steps.

This step doesn't need the reference genome (fasta file); only the cleaned reads are required.

This tool assembles the cleaned reads without genome reference and outputs contig sequences in fasta format. It can also output scaffolding annotation in a GFF output format file.
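
A quick sanity check of the resulting contig file can be done with standard shell tools ('contigs.fasta' is a placeholder name for the clc_assembler output):

    # number of contigs
    grep -c ">" contigs.fasta
    # total assembled length in bases
    grep -v ">" contigs.fasta | tr -d '\n' | wc -c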

The following links show the results after using 'clc_assembler'.

VI. Pipeline -> Workflow

Step 11: You can make a workflow from your history.

History --> Setting --> Extract Workflow --> choose the steps that you want to have in your workflow --> create workflow

Go to the Workflow section; you will find your pipeline in the list.

The following link shows the created workflow: