Quality control of raw sequencing data, data cleaning, FASTQ file handling
| Instructor | Stephanie Le Gras |
|---|---|
| Duration | 3.5 hours |
| Content | Quality control of raw sequencing data, data cleaning, FASTQ file handling (lecture); practical session on quality controls of raw sequencing data (hands-on) |
| Prerequisites | None |
1 Prepare the analysis environment
1.1 Create a working directory
cd /work/c-shd/[your login]
mkdir coursQC_Mapping
1.2 Move to the directory you've just created
cd coursQC_Mapping
1.3 Copy the datasets into your working directory
cp /user2/c-shd/sll13384/DU/Data/Sample_CRN-107_11/CRN-107_11-R*.fastq.gz .
1.4 Uncompress the files
gunzip CRN-107_11-R1.fastq.gz
gunzip CRN-107_11-R2.fastq.gz
1.5 Create a virtual working environment
mkdir Software
cd Software
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh
# Press enter
# Press space until the end of the license
# Enter Yes
# Enter /work/c-shd/[your login]/Software/miniconda2
# Enter yes
conda config --add channels conda-forge
conda config --add channels defaults
conda config --add channels r
conda config --add channels bioconda
conda create --name dnaseq bedtools samtools bwa fastqc cutadapt picard gatk
# enter y
cp /user2/c-shd/sll13384/DU/Software/GenomeAnalysisTK-3.6.tar.bz2 .
source activate dnaseq
gatk-register $(pwd)/GenomeAnalysisTK-3.6.tar.bz2
1.6 Source working environment
source activate dnaseq
source /user2/c-shd/sll13384/DU/Scripts/envTD.sh
2 Compute the number of sequenced reads
Files:
- CRN-107_11-R1.fastq
- CRN-107_11-R2.fastq
Tool:
- Use the command-line tool 'wc' ('wc' stands for word count)
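A FASTQ record normally spans four lines (header, sequence, '+', qualities), so the number of reads is the number of lines divided by four. The sketch below assumes unwrapped four-line records; `count_reads` is a helper name introduced here for illustration and is not part of the course material.

```shell
# Hypothetical helper: number of reads in an uncompressed FASTQ file,
# assuming exactly four lines per record (no wrapped sequences).
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Count the reads in both files of the pair, if they are present
for fq in CRN-107_11-R1.fastq CRN-107_11-R2.fastq; do
    if [ -f "$fq" ]; then
        echo "$fq: $(count_reads "$fq") reads"
    fi
done
```

Both files of a paired-end run should report the same number of reads.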
3 Assess the quality of the raw data
Files:
- CRN-107_11-R1.fastq
- CRN-107_11-R2.fastq
Tool:
- fastqc
Tips:
- Put the results into a directory called Fastqc
3.1 Create an output directory named Fastqc
3.2 Run the command fastqc on the two fastq files
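By default, FastQC groups base positions into bins in its per-position plots when reads are longer than 50 bp. Disable this grouping so that every position is reported individually.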
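Steps 3.1 and 3.2 can be sketched as follows. The `--nogroup` and `--outdir` options are the ones used later in this handout; the guard around the call is an addition made here so that the snippet fails gently when the dnaseq environment is not active or the input files are missing.

```shell
# 3.1: create the output directory (-p: no error if it already exists)
mkdir -p Fastqc

# 3.2: run FastQC on both files without grouping base positions
if command -v fastqc >/dev/null 2>&1 \
   && [ -f CRN-107_11-R1.fastq ] && [ -f CRN-107_11-R2.fastq ]; then
    fastqc --nogroup --outdir Fastqc CRN-107_11-R1.fastq CRN-107_11-R2.fastq
else
    echo "fastqc or the input files were not found; activate dnaseq first" >&2
fi
```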
3.3 Use Firefox to open the resulting HTML report (fastqc_report.html)
4 Trim the last nucleotide from sequences
Input files:
- CRN-107_11-R1.fastq
- CRN-107_11-R2.fastq
Output files:
- CRN-107_11-R1_shorter.fastq
- CRN-107_11-R2_shorter.fastq
Tool:
- fastx_trimmer
Tips:
- Be careful: the quality scores are encoded in Phred+33, so tell the FASTX-Toolkit by passing the option -Q 33.
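One possible way to run this step is sketched below. It assumes that the installed FASTX-Toolkit version supports the `-t N` option of `fastx_trimmer` (trim N nucleotides from the end of each read); with versions that lack it, `-l` with the position of the last base to keep is the alternative. The loop and the guard are conveniences added here.

```shell
# Trim the last nucleotide of every read in R1 and R2
for read in R1 R2; do
    in="CRN-107_11-${read}.fastq"
    out="CRN-107_11-${read}_shorter.fastq"
    if command -v fastx_trimmer >/dev/null 2>&1 && [ -f "$in" ]; then
        # -Q 33: Phred+33 qualities; -t 1: remove one base from the 3' end
        fastx_trimmer -Q 33 -t 1 -i "$in" -o "$out"
    else
        echo "skipping $in: fastx_trimmer or the input file is missing" >&2
    fi
done
```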
4.1 Check the sequences length after trimming
- Use a combination of 'head', 'tail' and 'wc'
- Or open the file in a text editor with the column number displayed
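The 'head', 'tail' and 'wc' combination can be sketched like this. The second line of a FASTQ file holds the first sequence, so isolating it and counting its characters gives the read length. `first_read_length` is a helper name introduced here for illustration.

```shell
# Hypothetical helper: length of the first read of a FASTQ file.
# head keeps lines 1-2, tail keeps line 2 (the sequence), tr strips the
# newline so that wc -c counts only the bases.
first_read_length() {
    head -n 2 "$1" | tail -n 1 | tr -d '\n' | wc -c
}

# Compare the read length before and after trimming
for fq in CRN-107_11-R1.fastq CRN-107_11-R1_shorter.fastq; do
    if [ -f "$fq" ]; then
        echo "$fq: $(first_read_length "$fq") bases"
    fi
done
```

The trimmed file should report one base fewer than the original. This checks only the first read, which is enough when all raw reads have the same length.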
5 Cut Illumina's adapters using Cutadapt
Input files:
- CRN-107_11-R1_shorter.fastq
- CRN-107_11-R2_shorter.fastq
Output files:
- CRN-107_11-R1_trimmed.fastq
- CRN-107_11-R2_trimmed.fastq
Tips:
- The adapter sequence is: AGATCGGAAGAGC
- Sequence length after trimming should be greater than 30
- Check the Cutadapt website for help on trimming adapters in paired-end sequences.
5.1 Compute the reverse complement of the adapter sequence using fastx_reverse_complement
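A minimal sketch of this step: fastx_reverse_complement expects FASTA or FASTQ input, so the adapter is wrapped in a one-record FASTA stream (the record name `adapter` is arbitrary). The fallback with `rev` and `tr` is an addition made here for machines where the FASTX-Toolkit is not on the PATH; it gives the same result for plain A/C/G/T sequences.

```shell
adapter=AGATCGGAAGAGC

if command -v fastx_reverse_complement >/dev/null 2>&1; then
    # Feed a one-record FASTA on stdin and keep only the sequence line
    adapter_rc=$(printf '>adapter\n%s\n' "$adapter" \
                 | fastx_reverse_complement | tail -n 1)
else
    # Fallback: reverse the string, then complement each base
    adapter_rc=$(printf '%s\n' "$adapter" | rev | tr ACGT TGCA)
fi

echo "$adapter_rc"   # GCTCTTCCGATCT
```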
5.2 Run cutadapt
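One way to trim both reads in a single call is sketched below. It rests on several assumptions: the installed Cutadapt is recent enough to support paired-end mode with `-A` and `-p`; the adapter is searched at the 3' end of both reads using the sequence given in the tips; and `-m 30` (discard pairs in which a read is shorter than 30 bases) is taken as the intended length filter, where `-m 31` would be needed for a strict "greater than 30". The handout does not state how the reverse complement from step 5.1 is meant to be used, so it does not appear here.

```shell
adapter=AGATCGGAAGAGC
min_len=30

if command -v cutadapt >/dev/null 2>&1 \
   && [ -f CRN-107_11-R1_shorter.fastq ] && [ -f CRN-107_11-R2_shorter.fastq ]; then
    # -a: 3' adapter on read 1; -A: 3' adapter on read 2
    # -o / -p: output files for read 1 and read 2
    cutadapt -a "$adapter" -A "$adapter" -m "$min_len" \
        -o CRN-107_11-R1_trimmed.fastq -p CRN-107_11-R2_trimmed.fastq \
        CRN-107_11-R1_shorter.fastq CRN-107_11-R2_shorter.fastq
else
    echo "cutadapt or the input files were not found" >&2
fi
```

Processing the pair in one call keeps the two output files synchronized: when one read of a pair is discarded, its mate is discarded too.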
6 Remove bad quality sequences
Files:
- CRN-107_11-R1_trimmed.fastq
- CRN-107_11-R2_trimmed.fastq
Tool:
- DynamicTrim.pl (SolexaQA)
Tips:
- Threshold for trimming: Phred score >= 10
6.1 Create an output directory for the data named SolexaQA
6.2 Run the tool DynamicTrim.pl
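Steps 6.1 and 6.2 might look like the sketch below. The `-h` (Phred cutoff) and `-d` (output directory) options are assumed from the usage message of the Perl version of SolexaQA; check the help output of DynamicTrim.pl in the course environment before relying on them.

```shell
# 6.1: create the output directory
mkdir -p SolexaQA

# 6.2: trim each read to its longest segment with Phred scores >= 10
for read in R1 R2; do
    in="CRN-107_11-${read}_trimmed.fastq"
    if command -v DynamicTrim.pl >/dev/null 2>&1 && [ -f "$in" ]; then
        DynamicTrim.pl "$in" -h 10 -d SolexaQA
    else
        echo "skipping $in: DynamicTrim.pl or the input file is missing" >&2
    fi
done
```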
Rename resulting files
mv SolexaQA/CRN-107_11-R1_trimmed.fastq.trimmed CRN-107_R1.fastq
mv SolexaQA/CRN-107_11-R2_trimmed.fastq.trimmed CRN-107_R2.fastq
7 Compress all output files
gzip CRN-107_11-R1.fastq CRN-107_11-R2.fastq
gzip CRN-107_11-R1_shorter.fastq CRN-107_11-R2_shorter.fastq
gzip CRN-107_11-R1_trimmed.fastq CRN-107_11-R2_trimmed.fastq
gzip CRN-107_R1.fastq CRN-107_R2.fastq
8 Move all intermediate files into a directory named intermedFastqFiles
# Make output directory
mkdir intermedFastqFiles
# Move compressed files
mv CRN-107_11-R1_shorter.fastq.gz CRN-107_11-R2_shorter.fastq.gz intermedFastqFiles
mv CRN-107_11-R1_trimmed.fastq.gz CRN-107_11-R2_trimmed.fastq.gz intermedFastqFiles
9 Run FastQC on the final files
9.1 Create an output directory Fastqc_final
mkdir Fastqc_final
9.2 Run FastQC on the final fastq files
fastqc --nogroup --outdir Fastqc_final CRN-107_R1.fastq.gz CRN-107_R2.fastq.gz