Introduction to ChIP-seq data analysis
Instructor | Stephanie Le Gras |
---|---|
Duration | 4 hours |
Content | Analyzing ChIP-seq data (Lecture); hands-on sessions run on Galaxy (Practical); link to UCSC |
Prerequisites | Introduction to Galaxy training; basic knowledge of ChIP-seq experiments |
Description of the training dataset
We are using ChIP-seq data from:
Strub, T., Giuliano, S., Ye, T., Bonet, C., Keime, C., Kobi, D., Le Gras, S., Cormont, M., Ballotti, R., and Bertolotto, C. (2011). Essential role of microphthalmia transcription factor for DNA replication, mitosis and genomic stability in melanoma. Oncogene 30, 2319–2332.
In this study, the authors performed ChIP-seq on the transcription factor MITF in the melanoma cell line 501Mel.
We have 2 datasets:
- MITF
- Control
Practical session
During the practical session, we are going to use the GalaxEast platform.
1 Question
1.1 Visualize the WIG files for MITF and the control in UCSC
Go to the UCSC website.
The files are:
- wigs_for_ctrl_2.wig.gz (control)
- wigs_for_mitf_2.wig.gz (MITF)
(To download a file, click on Télécharger, i.e. "Download".)
To upload personal tracks to UCSC, go to My Data (top menu)/Custom Tracks.
Select the right genome assembly before uploading your data. The data were aligned to the human genome assembly hg19.
1.2 Go to chromosome 2 in the Genome Browser
1.3 Change display mode for each track from dense to full
To change the display mode of a track, you can:
- right-click on the track and select the required display mode
- scroll down below the plot to the “Custom tracks” section and change the display mode of the two uploaded tracks there.
Look at the following genes:
- ANKRD30BL
- CFAP221
- DBI
Do you see peaks at these locations?
2 Question
2.1 Go to GalaxEast
2.2 Log in to GalaxEast
2.3 Create a new history called « ChIP-seq data analysis »
2.4 Import datasets from the "Shared Data" top menu. Data are in « Chip seq test dataset (chr2) ».
Go to Shared Data/Data Libraries/Chip seq test dataset (chr2). Import the two datasets.
3 Question
3.1 Run MACS 1.4.2 on the data using MITF (2) and control datasets as inputs.
Use default parameters except for:
- tag size (54)
- Effective genome size (75% of the size of chr2: 182400000)
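The effective genome size above can be checked with a short calculation. This is a minimal sketch, assuming a chr2 length of 243,199,373 bp in hg19 (a value from the UCSC chromosome sizes, not given in this tutorial); the function name is only illustrative.

```python
# Assumption: length of chr2 in the hg19 assembly, taken from UCSC chromosome sizes.
CHR2_LENGTH_HG19 = 243_199_373
# Fraction of the chromosome used as the effective size in this exercise.
MAPPABLE_FRACTION = 0.75


def effective_genome_size(chrom_length: int, fraction: float) -> int:
    """Return the given fraction of a chromosome length, in base pairs."""
    return int(chrom_length * fraction)


size = effective_genome_size(CHR2_LENGTH_HG19, MAPPABLE_FRACTION)
print(size)             # exact value: 182399529
print(round(size, -5))  # rounded to the nearest 100 kb: 182400000
```

The rounded value is the one entered in the MACS tool form.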
3.2 Take a look at the result files
What is the fragment length estimated by MACS? How many peaks are called?
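If you download the peak file, the peak count can also be double-checked outside Galaxy. The sketch below assumes a BED-like text in which each peak is one line and any `track`, `browser` or `#` lines are headers; the example records are made up for illustration and are not MACS output.

```python
def count_bed_records(bed_text: str) -> int:
    """Count data lines in BED text, skipping blank and header lines."""
    count = 0
    for line in bed_text.splitlines():
        line = line.strip()
        # Header and comment lines do not describe a peak.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        count += 1
    return count


# Hypothetical two-peak example (coordinates and scores are invented).
example = """track name="hypothetical MACS peaks"
chr2\t100\t350\tpeak_1\t55.2
chr2\t900\t1200\tpeak_2\t78.9
"""
print(count_bed_records(example))  # 2
```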
3.3 Have a look at the different MACS result files
4 Question
4.1 Re-run MACS using changed parameters
To rerun a tool with the same parameters, click on the button with the two rounded arrows in one of the datasets generated by MACS.
In the tool form, change only the following parameters:
- Do not build the shifting model
- Arbitrary shift size: 100
How many peaks are found now?
5 Question
5.1 Annotate the peaks with Homer annotatePeaks
Now that we have a list of regions bound by the protein, we would like to know which genes are near the MITF peaks. This is done using Homer annotatePeaks.
Tips
- Use the (peaks: bed) dataset generated by MACS as input for this step.
- The only parameter to change is the genome version, which has to be set to hg19.
6 De novo motif discovery
We are going to run the de novo motif discovery in regions +/- 40 nucleotides around the summits of peaks detected by MACS. The tool (MEME) we are going to use to run the analysis needs the nucleotide sequences as input. So far, we have the genomic coordinates of the peak summits (1 nucleotide long). To get the right input to MEME, we need:
- To compute the genomic coordinates of the peak summits +/-40
- To extract the nucleotide sequences from previous coordinates
6.1 Upload the dataset with the positions of the peak summits generated by MACS to Galaxy.
Use the peak summits dataset generated by MACS (found in the html dataset).
Download it from Galaxy, then upload it back to Galaxy as a new dataset.
Tips:
- Data type: bed
- Genome: hg19
6.2 Compute the coordinates of the peak summits +/-40
Tips:
- Use the utility called “SlopBed”
- Use the chromosome length file hg19.len from the data library “Chromosome length”
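To make clear what this step computes, here is a minimal sketch of the interval extension, assuming BED conventions (0-based start, end excluded) and clamping to the chromosome boundaries, which is why a chromosome length file is needed. The function name and the summit position are hypothetical; this is not the SlopBed implementation itself.

```python
def slop_interval(chrom, start, end, flank, chrom_lengths):
    """Extend a BED interval by `flank` bases on each side.

    Coordinates are clamped so the result stays within
    [0, chromosome length].
    """
    new_start = max(0, start - flank)
    new_end = min(chrom_lengths[chrom], end + flank)
    return chrom, new_start, new_end


# Assumption: chr2 length in hg19, as listed in UCSC chromosome sizes.
chrom_lengths = {"chr2": 243_199_373}

# Hypothetical 1-nucleotide peak summit.
summit = ("chr2", 1000, 1001)
window = slop_interval(*summit, flank=40, chrom_lengths=chrom_lengths)
print(window)  # ('chr2', 960, 1041)
```

The resulting window is 81 nucleotides long: the summit plus 40 nucleotides on each side.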
6.3 Extract the fasta sequences
Tips:
- Use the utility called “Extract Genomic DNA using coordinates from assembled/unassembled genomes”
- Use the dataset with coordinates of peak summits +/-40 as input
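Conceptually, the extraction slices the reference sequence at each interval and writes one FASTA record per interval. The sketch below illustrates this with a made-up toy sequence held in memory; the function name and the header format are assumptions, and the Galaxy tool may label its records differently.

```python
def extract_fasta(intervals, genome):
    """Return FASTA text with one record per (chrom, start, end) interval.

    Intervals follow BED conventions: 0-based start, end excluded.
    """
    records = []
    for chrom, start, end in intervals:
        seq = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{seq}")
    return "\n".join(records)


# Made-up 20-nucleotide "chromosome", for illustration only.
toy_genome = {"chrToy": "ACGTACGTCACGTGACTTGA"}

print(extract_fasta([("chrToy", 6, 14)], toy_genome))
# >chrToy:6-14
# GTCACGTG
```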
6.4 Run MEME
Use MEME 4.8.0!
Tips:
- Use default parameters except for:
  - Search for 2 motifs
  - Width of motifs should be between 6 and 12
  - E-value to stop looking for motifs: 1
  - I certify that I am not using this tool for commercial purposes.: Yes
- To display additional parameters, select Advanced in the “Options Configuration” drop-down list.