Introduction to Galaxy

Instructor: Stephanie Le Gras
Duration: 3.5 hours
Content: Description of the key features of Galaxy (Lecture); practical session on basic features of Galaxy (Hands-on)
Prerequisites: None

1 Log in to GalaxEast

2 History

2.1 Create a new history

2.2 Change the name of the history to “DNA-seq data analysis”

3 Import files from your computer to Galaxy

  1. Download the file “sample.bed.gz” following this link and upload it to Galaxy.
  • The genome is: Mouse (mm9)
  • The format is: bed
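
As a reminder of what the "bed" format choice implies, here is a minimal sketch of how a BED line can be read. BED is tab-delimited, with the chromosome, a 0-based start and an exclusive end in the first three columns. The function names and the example line are hypothetical and are not taken from sample.bed.gz.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from one BED line.

    Only the three mandatory columns are read; extra columns are ignored.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 tab-separated columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    if start < 0 or end < start:
        raise ValueError("invalid coordinates: start=%d end=%d" % (start, end))
    return chrom, start, end


def interval_length(start, end):
    """Length of a BED interval (start is 0-based, end is exclusive)."""
    return end - start


# Hypothetical example line, for illustration only.
chrom, start, end = parse_bed_line("chr4\t100\t250\tregion1")
print(chrom, interval_length(start, end))
```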

4 Remove a dataset

  1. Remove the dataset sample.bed from your history by clicking on its delete button.
  2. You are told that your history is empty. Look at the size of your history.
    1. Click on “deleted” at the top of the history panel (below the history name). Permanently remove the file from the disk by clicking on “Supprimer définitivement du disque” (“Permanently delete from disk”).
    2. Click on “hide deleted”.

5 Running a tool

  1. Download the two files CRN-107_11-R1.fastq.gz and CRN-107_11-R2.fastq.gz following this link.
  2. Import them to your history called “DNA-seq data analysis”
    • The genome is: Human (hg19)
    • The format is: <auto detect>

  1. Use the tool “FastQC Read Quality reports” to run a quality analysis on the datasets “CRN-107_11-R1.fastq” and “CRN-107_11-R2.fastq”.
    1. Use default parameters.

What is the quality encoding of the two fastq files?
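
FastQC reports the encoding in its "Basic Statistics" module. To see the idea behind such a guess, here is a minimal sketch that looks at the range of quality characters in a FASTQ file. It assumes 4-line FASTQ records and the usual rule of thumb that characters below ASCII 59 only occur with an offset of 33. The function names and the example reads are hypothetical, and this is not FastQC's own implementation.

```python
def quality_lines_from_fastq(text):
    """Return the quality line of each record, assuming 4-line FASTQ records."""
    lines = text.strip().splitlines()
    return lines[3::4]


def guess_quality_encoding(quality_lines):
    """Guess the quality offset from the observed character range."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters found")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Characters this low cannot occur with an offset of 64.
        return "Phred+33"
    if highest > 74:
        # Characters this high are not expected with an offset of 33
        # (this also covers the older Solexa+64 scale).
        return "Phred+64"
    # All characters fall in the range shared by both encodings.
    return "ambiguous"


# Hypothetical reads, for illustration only.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n#5?I\n"
print(guess_quality_encoding(quality_lines_from_fastq(fastq)))
```

A file that only contains high quality values can fall entirely within the shared range, which is why the sketch returns "ambiguous" instead of guessing.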

6 Running tools without a workflow

Analyze CRN-107 data from reads to variant annotation.

Run the following tools:

  1. BWA-MEM to align reads to the reference genome
  2. Picard MarkDuplicates to identify duplicated reads
  3. FreeBayes to detect variants
  4. SnpEff to annotate variants
  5. VCFtoTab-delimited to convert the annotated variants to a tab-delimited table

To run the tools you will need the following files:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq
  • CaptureDesign_chr4.bed (download it from here)

Import missing files from the data library “DNA-seq test datasets”.

Here are the parameters to use for each of the tools. All parameters not mentioned are to be used with default values.

  1. Map with BWA-MEM - map medium and long reads (> 100 bp) against reference genome
    1. Using reference genome: hg19
    2. Single or Paired-end reads: Paired
    3. Select first set of reads: CRN-107_11-R1.fastq
    4. Select second set of reads: CRN-107_11-R2.fastq
    5. Set read groups information? Set read groups (Picard style)
      1. Read group identifier (ID): Auto-assign Yes
      2. Read group sample name (SM): Auto-assign Yes
      3. Library name (LB): Auto-assign Yes
      4. Platform/technology used to produce the reads (PL): ILLUMINA
      5. Platform unit (PU): HS026.2
      6. Sequencing center that produced the read (CN): Genomeast
      7. Description (DS): CRN-107
      8. Predicted median insert size (PI): 250
      9. Date that run was produced (DT): 2017-12-13
  2. MarkDuplicates examine aligned records in BAM datasets to locate duplicate molecules.
    1. Select SAM/BAM dataset or dataset collection: output of BWA mem
    2. Select validation stringency: Silent
  3. FreeBayes bayesian genetic variant detector
    1. BAM or CRAM dataset: output (bam) of markduplicates
    2. Using reference genome: hg19
    3. Limit analysis to regions in this BED dataset: CaptureDesign_chr4.bed
  4. SnpEff Variant effect and annotation
    1. Sequence changes (SNPs, MNPs, InDels): output of FreeBayes (VCF)
    2. Input format: VCF
    3. Output format: VCF (only if input is VCF)
    4. Genome source: Downloaded on demand
      1. Snpff Genome Version Name (e.g. GRCh38.86): hg19
  5. VCFtoTab-delimited: Convert VCF data into TAB-delimited format
    1. Select VCF dataset to convert: output of SnpEff
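
The read group settings of the BWA-MEM step end up as an @RG line in the header of the alignment file. The sketch below shows how the form fields map to SAM header tags. The values for ID, SM and LB are auto-assigned by Galaxy, so the "CRN-107_11" used for them here is only a placeholder assumption. The function name is hypothetical.

```python
def build_read_group_line(fields):
    """Build a tab-separated SAM @RG header line from a tag -> value mapping."""
    order = ["ID", "SM", "LB", "PL", "PU", "CN", "DS", "PI", "DT"]
    parts = ["@RG"]
    for tag in order:
        if tag in fields:
            parts.append("%s:%s" % (tag, fields[tag]))
    return "\t".join(parts)


read_group = {
    "ID": "CRN-107_11",  # placeholder: auto-assigned by Galaxy
    "SM": "CRN-107_11",  # placeholder: auto-assigned by Galaxy
    "LB": "CRN-107_11",  # placeholder: auto-assigned by Galaxy
    "PL": "ILLUMINA",
    "PU": "HS026.2",
    "CN": "Genomeast",
    "DS": "CRN-107",
    "PI": "250",
    "DT": "2017-12-13",
}

print(build_read_group_line(read_group))
```

Downstream tools such as MarkDuplicates and FreeBayes rely on these tags to tell samples and libraries apart, which is why they are set at the alignment step.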

How many variants are called?
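
One way to check the number of variants outside Galaxy is to count the records of the VCF file: every line that does not start with "#" is one variant record. This is a minimal sketch with a hypothetical function name and made-up example records. Note that a record listing several alternate alleles still counts as one line.

```python
def count_variant_records(vcf_text):
    """Count the non-header, non-empty lines of a VCF file."""
    count = 0
    for line in vcf_text.splitlines():
        if line.strip() and not line.startswith("#"):
            count += 1
    return count


# Made-up VCF content, for illustration only.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr4\t1000\t.\tA\tG\t50\t.\t.\n"
    "chr4\t2000\t.\tC\tT,G\t60\t.\t.\n"
)
print(count_variant_records(example_vcf))
```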

7 Create a workflow out of an existing history

A workflow can be created from an existing history by clicking the history options button and selecting “Extract Workflow”.

7.1 Extract a workflow out of the history called "DNA-seq data analysis"

7.2 Rename the workflow "DNA-seq data analysis"

8 Edit a workflow with the workflow editor

8.1 Open the workflow editor with the workflow "DNA-seq data analysis"

8.2 Add steps to the workflow

Your workflow should look like this before editing:

Add the following tools:

  1. Samtools flagstat to compute mapping statistics (after BWA mem)
  2. Filter SAM or BAM, output SAM or BAM to select aligned reads with a mapping quality >= 20 (after MarkDuplicates)
  3. Samtools flagstat to compute mapping statistics after removing reads with low mapping qualities (after Filter)

Here are the parameters to use for each of the tools:

  1. Flagstat tabulate descriptive stats for BAM dataset
    1. BAM File to Convert: output of BWA mem
  2. Filter SAM or BAM, output SAM or BAM files on FLAG MAPQ RG LN or by region
    1. SAM or BAM file to filter: output of Picard MarkDuplicates
    2. Minimum MAPQ quality score: 20
  3. Flagstat tabulate descriptive stats for BAM dataset
    1. BAM File to Convert: output of Filter
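
To make the "Minimum MAPQ quality score: 20" setting concrete, here is a minimal sketch of the same selection applied to SAM text lines, where the mapping quality is the fifth tab-separated column. The function name and the example alignments are hypothetical. The Galaxy tool works on BAM files, so this only illustrates the rule and not the tool itself.

```python
def filter_by_mapq(sam_lines, min_mapq=20):
    """Keep header lines and alignments whose MAPQ is >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are always kept.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept


# Made-up alignments, for illustration only.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t99\tchr4\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "read2\t0\tchr4\t150\t3\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t0\tchr4\t180\t20\t4M\t*\t0\t0\tACGT\tIIII",
]
for line in filter_by_mapq(sam_lines):
    print(line.split("\t")[0])
```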

The final workflow should look like this (new tools are in black boxes):

Save the workflow once you are done editing it.

9 Run a workflow

9.1 Import files

Import the following files from the data library “DNA-seq test datasets” to a new history:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq
  • CaptureDesign_chr4.bed

9.2 Run the workflow DNA-seq data analysis

  1. Choose the right files.
  2. Check the parameters.

How many reads are discarded due to low mapping quality?
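
The answer can be derived by comparing the two Flagstat outputs, before and after the filter. The sketch below reads the "in total" line of each report and subtracts the totals. It assumes that MarkDuplicates only flags duplicates without removing them, so the totals are comparable. The function names and the counts in the example are hypothetical. Flagstat counts alignment records, so secondary or supplementary alignments are included in the totals.

```python
import re


def total_reads(flagstat_text):
    """Return the total (QC-passed + QC-failed) from a flagstat report."""
    for line in flagstat_text.splitlines():
        match = re.match(r"(\d+) \+ (\d+) in total", line)
        if match:
            return int(match.group(1)) + int(match.group(2))
    raise ValueError("no 'in total' line found")


def discarded_reads(before_text, after_text):
    """Number of records removed between two flagstat reports."""
    return total_reads(before_text) - total_reads(after_text)


# Made-up reports, for illustration only.
before = (
    "1000 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "980 + 0 mapped (98.00% : N/A)\n"
)
after = (
    "940 + 0 in total (QC-passed reads + QC-failed reads)\n"
    "940 + 0 mapped (100.00% : N/A)\n"
)
print(discarded_reads(before, after))
```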
