Quality control of raw sequencing data, data cleaning, FASTQ file handling

Instructor Stephanie Le Gras
Duration 3.5 hours
Content Quality control of raw sequencing data, data cleaning, FASTQ file handling (lecture)
Practical session on quality controls of raw sequencing data (Hands-on)
Prerequisites None

1 Prepare the analysis environment

1.1 Create a working directory

cd /work/c-shd/[your login]
mkdir coursQC_Mapping

1.2 Move to the directory you've just created

cd coursQC_Mapping

1.3 Copy the datasets into your working directory

cp /user2/c-shd/sll13384/DU/Data/Sample_CRN-107_11/CRN-107_11-R*.fastq.gz . 

1.4 Uncompress the files

gunzip CRN-107_11-R1.fastq.gz
gunzip CRN-107_11-R2.fastq.gz

1.5 Create a virtual working environment

mkdir Software
cd Software
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
# Press enter
# Press space until the end of the license
# Enter Yes
# Enter /work/c-shd/[your login]/Software/miniconda2
# Enter yes
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda create --name dnaseq bedtools samtools bwa fastqc cutadapt picard gatk
# enter y
cp /user2/c-shd/sll13384/DU/Software/GenomeAnalysisTK-3.6.tar.bz2 .
source activate dnaseq
gatk-register $(pwd)/GenomeAnalysisTK-3.6.tar.bz2

1.6 Source working environment

source activate dnaseq
source /user2/c-shd/sll13384/DU/Scripts/envTD.sh

2 Compute the number of sequenced reads

File:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq

Tool:

  • Use the command-line tool 'wc' (word count)

Answer
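
One possible approach, given as a sketch: a FASTQ record normally spans four lines, so the number of reads is the line count reported by 'wc -l' divided by 4. The helper name count_reads and the two-read toy file are illustrative assumptions, not part of the course material.

```shell
# Hypothetical helper: reads = lines / 4, assuming the usual
# four-lines-per-record FASTQ layout (no wrapped sequences).
count_reads() {
    n_lines=$(wc -l < "$1")
    echo $(( n_lines / 4 ))
}

# Toy two-read file so that the sketch runs on its own
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' > "$demo"

n_reads=$(count_reads "$demo")
echo "$n_reads"   # prints 2
rm -f "$demo"
```

On the course data the helper would be called as count_reads CRN-107_11-R1.fastq. Both files of a paired-end run are expected to contain the same number of reads.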

3 Assess the quality of the raw data

Files:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq

Tool:

  • fastqc

Tips

  • Put the results into a directory called Fastqc

3.1 Create an output directory named Fastqc

Answer

3.2 Run the command fastqc on the two fastq files

By default, FastQC groups base positions into bins for reads longer than 50 bp. Disable this grouping.

Answer
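
A minimal sketch of steps 3.1 and 3.2 together, using the --nogroup and --outdir options that also appear in section 9. The guard that skips a file when it or fastqc is missing is an addition for illustration, so that the block can be run outside the course environment.

```shell
# Create the output directory (-p: no error if it already exists)
mkdir -p Fastqc

# --nogroup disables the grouping of base positions for long reads;
# --outdir sends the reports to the Fastqc directory
for fq in CRN-107_11-R1.fastq CRN-107_11-R2.fastq; do
    if [ -f "$fq" ] && command -v fastqc >/dev/null 2>&1; then
        fastqc --nogroup --outdir Fastqc "$fq"
    else
        echo "skipping $fq (file or fastqc not found)"
    fi
done
```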

3.3 Use Firefox to look at the resulting HTML file (fastqc_report.html)

Answer

4 Trim the last nucleotide from sequences

Input files:

  • CRN-107_11-R1.fastq
  • CRN-107_11-R2.fastq

Output files:

  • CRN-107_11-R1_shorter.fastq
  • CRN-107_11-R2_shorter.fastq

Tool:

  • fastx_trimmer

Tips:

  • Be careful: the quality encoding is Phred+33, so pass the option -Q 33 to the FASTX-Toolkit.

Answer
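
A possible command, as a sketch: fastx_trimmer is assumed here to accept -t N (trim N nucleotides from the end of each read) alongside -i and -o for the input and output files. The variable names and the guard around the call are illustrative additions.

```shell
trim_status=skipped
for read in R1 R2; do
    src="CRN-107_11-${read}.fastq"
    dst="CRN-107_11-${read}_shorter.fastq"
    if [ -f "$src" ] && command -v fastx_trimmer >/dev/null 2>&1; then
        # -Q 33: qualities are Phred+33 encoded
        # -t 1 : trim one nucleotide from the end of every read
        fastx_trimmer -Q 33 -t 1 -i "$src" -o "$dst" && trim_status=done
    fi
done
echo "trimming: $trim_status"
```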

4.1 Check the sequences length after trimming

  • Use a combination of 'head', 'tail' and 'wc'
  • Or open the file in a text editor with the column number displayed

Answer
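
The 'head', 'tail' and 'wc' combination can be sketched as follows. The second line of a FASTQ record holds the sequence, and 'wc -c' also counts the trailing newline, hence the subtraction. The toy file is an assumption made so that the block runs on its own.

```shell
# Toy one-read file with a 9-nucleotide sequence
demo=$(mktemp)
printf '@r1\nACGTACGTA\n+\nIIIIIIIII\n' > "$demo"

# Take line 2 (the sequence of the first read) and count its characters
n_chars=$(head -n 2 "$demo" | tail -n 1 | wc -c)

# wc -c includes the newline character, so remove it from the count
seq_len=$(( n_chars - 1 ))
echo "$seq_len"   # prints 9
rm -f "$demo"
```

On the course data, "$demo" would be replaced by CRN-107_11-R1_shorter.fastq, and the result compared with the same measurement on CRN-107_11-R1.fastq.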

5 Cut Illumina's adapters using Cutadapt

Input files:

  • CRN-107_11-R1_shorter.fastq
  • CRN-107_11-R2_shorter.fastq

Output files:

  • CRN-107_11-R1_trimmed.fastq
  • CRN-107_11-R2_trimmed.fastq

Tips:

  • The adapter sequence is: AGATCGGAAGAGC
  • Sequence length after trimming should be higher than 30
  • Check the Cutadapt website for help on trimming adapters from paired-end sequences.

5.1 Compute reverse complement of the adapter sequence using fastx_reverse_complement

Answer
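
A sketch of one way to do this. fastx_reverse_complement reads a FASTA or FASTQ file, so the adapter is first written to a small FASTA file (the names adapter.fa and adapter_rc.fa are hypothetical). The 'rev' and 'tr' pipeline is not part of the FASTX-Toolkit and is only included as an independent cross-check.

```shell
adapter=AGATCGGAAGAGC

# Cross-check with standard tools: reverse the string, then complement each base
rc=$(printf '%s\n' "$adapter" | rev | tr ACGT TGCA)
echo "$rc"   # prints GCTCTTCCGATCT

# Same operation with the FASTX-Toolkit, if it is available
if command -v fastx_reverse_complement >/dev/null 2>&1; then
    printf '>adapter\n%s\n' "$adapter" > adapter.fa
    fastx_reverse_complement -i adapter.fa -o adapter_rc.fa
    cat adapter_rc.fa
fi
```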

5.2 Run cutadapt

Answer
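
A hedged sketch of a paired-end Cutadapt call. Which sequence to use for the second read is an assumption: the block passes the same adapter to both -a and -A, and the adapter_r2 variable can be replaced by the reverse complement from step 5.1 if that is what the exercise intends. Note that -m 30 keeps reads of at least 30 nucleotides; the variable names and the guard are illustrative.

```shell
adapter_r1=AGATCGGAAGAGC
# Assumption: same adapter on read 2; swap in the sequence from step 5.1 if required
adapter_r2=AGATCGGAAGAGC

cutadapt_status=skipped
if [ -f CRN-107_11-R1_shorter.fastq ] && [ -f CRN-107_11-R2_shorter.fastq ] \
   && command -v cutadapt >/dev/null 2>&1; then
    # -a / -A : 3' adapter for read 1 / read 2
    # -m 30   : discard a pair when a read is shorter than 30 after trimming
    # -o / -p : output files for read 1 / read 2
    cutadapt -a "$adapter_r1" -A "$adapter_r2" -m 30 \
        -o CRN-107_11-R1_trimmed.fastq -p CRN-107_11-R2_trimmed.fastq \
        CRN-107_11-R1_shorter.fastq CRN-107_11-R2_shorter.fastq \
        && cutadapt_status=done
fi
echo "cutadapt: $cutadapt_status"
```

Processing both files in a single call keeps the two output files synchronized, since a pair is either kept or discarded as a whole.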

6 Remove bad quality sequences

Files:

  • CRN-107_11-R1_trimmed.fastq
  • CRN-107_11-R2_trimmed.fastq

Tool:

  • DynamicTrim.pl (SolexaQA)

Tips:

  • Threshold for trimming: Phred score >= 10

6.1 Create an output directory for the data named SolexaQA

Answer

6.2 Run the tool DynamicTrim.pl

Answer
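
A sketch covering steps 6.1 and 6.2. The options -h (Phred score cutoff) and -d (output directory) are assumed from the SolexaQA version 2 usage message and should be checked against the installed version. The guard around the call is an illustrative addition.

```shell
# Step 6.1: output directory (-p: no error if it already exists)
mkdir -p SolexaQA

# Step 6.2: trim each read to its longest segment with Phred score >= 10
for fq in CRN-107_11-R1_trimmed.fastq CRN-107_11-R2_trimmed.fastq; do
    if [ -f "$fq" ] && command -v DynamicTrim.pl >/dev/null 2>&1; then
        # Assumed options: -h 10 sets the Phred cutoff, -d the output directory
        DynamicTrim.pl "$fq" -h 10 -d SolexaQA
    else
        echo "skipping $fq (file or DynamicTrim.pl not found)"
    fi
done
```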

Rename resulting files

mv SolexaQA/CRN-107_11-R1_trimmed.fastq.trimmed CRN-107_R1.fastq
mv SolexaQA/CRN-107_11-R2_trimmed.fastq.trimmed CRN-107_R2.fastq

7 Compress all output files

gzip CRN-107_11-R1.fastq CRN-107_11-R2.fastq
gzip CRN-107_11-R1_shorter.fastq CRN-107_11-R2_shorter.fastq
gzip CRN-107_11-R1_trimmed.fastq CRN-107_11-R2_trimmed.fastq
gzip CRN-107_R1.fastq CRN-107_R2.fastq

8 Put all temporary files into the directory named intermedFastqFiles

# Make output directory
mkdir intermedFastqFiles

# Move compressed files 
mv CRN-107_11-R1_shorter.fastq.gz CRN-107_11-R2_shorter.fastq.gz intermedFastqFiles 
mv CRN-107_11-R1_trimmed.fastq.gz CRN-107_11-R2_trimmed.fastq.gz intermedFastqFiles

9 Run FastQC on the final files

9.1 Create an output directory Fastqc_final

mkdir Fastqc_final

9.2 Run FastQC on the final fastq files

fastqc --nogroup --outdir Fastqc_final CRN-107_R1.fastq.gz CRN-107_R2.fastq.gz