ggplot2
The best way to follow this lesson is to try every example yourself as soon as you read it.
R commands will be presented in gray boxes:
print('Hello world!')
## [1] "Hello world!"
A white box following a gray box shows R's output. If you run the commands yourself, you will get the same output. Blue boxes contain exercises to check that you have understood the notions covered in this lesson:
EXERCISE
This box will contain exercises that you should complete to make sure that you have understood all the notions.
Solutions to exercises will be given in green boxes (they will only be available after the session).
SOLUTION
This box will contain answers to the previous exercises.
Purple boxes contain additional information (for advanced users), so it is not necessary to read or understand them on a first reading. You will probably understand them once you have gained some experience with R.
INFO
This box will contain additional information that you do not need to understand on a first reading.
Orange boxes contain warnings and good practices to follow in order to use R well.
WARNING
This box will contain warnings or good practices that you must pay attention to.
R offers many ways to generate graphics; it includes a graphics package that is loaded by default at the beginning of each new session. Nevertheless, we will present here another approach to generating graphics, provided by a package named ggplot2, which is based on The Grammar of Graphics book (Wilkinson 2005).
To use ggplot2, we must first install the package. To do this, we will use the install.packages() function, seen in the previous chapter:
install.packages("ggplot2")
As ggplot2 is part of the Tidyverse, a collection of packages that share a common philosophy and are useful for manipulating scientific data, you can instead install the whole set of packages (which includes ggplot2) with
install.packages("tidyverse")
However, we will only present the ggplot2 package here. In all cases, you can load the ggplot2 functions with
library(ggplot2)
or, only if you installed the whole tidyverse, with
library(tidyverse)
Data are obviously the main component of a plot. To draw a graphic with ggplot2, your data must be in the tidy format (Wickham 2014). This is a two-dimensional format (stored in a data.frame or data.frame-like structure) where each column corresponds to a variable, each row to an observation and each cell to a value (Figure 2.1).
Figure 2.1: Visualization of the tidy format where each column corresponds to a variable, each row to an observation and each cell to a value. Picture taken from here.
If you have installed and loaded the tidyverse package, you have access to the gather() and spread() functions (from the tidyr package, part of tidyverse) to reshape the data as needed. You can also use the melt() function of the reshape2 package.
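As an illustration of this reshaping, here is a minimal sketch assuming the tidyr package is installed. The table `wide` and its column names are made up for the example; they are not part of the lesson's data.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample.
wide <- data.frame(
  gene    = c("geneA", "geneB"),
  sample1 = c(10, 0),
  sample2 = c(12, 3)
)

# gather() stacks the sample columns into key/value pairs,
# giving one row per (gene, sample) observation: the tidy format.
long <- gather(wide, key = "sample", value = "count", sample1, sample2)
long

# spread() performs the opposite operation and goes back to the wide layout.
spread(long, key = "sample", value = "count")
```

Recent versions of tidyr recommend pivot_longer() and pivot_wider() for the same task, but gather() and spread() remain available.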
To introduce the specificities of ggplot2, we will use the mpg dataset (which becomes available as soon as the ggplot2 package is loaded)
mpg
To learn more about this dataset you can use the help()
function on mpg
or the shortcut command ?mpg
.
ggplot2 philosophy
The philosophy behind ggplot2 is difficult to describe succinctly. You always start with the ggplot() function to initialize the plot, and then use the + operator to add layers (like geom_point()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()). You also need to specify the data you will use and the aesthetic mapping aes() (which variable of the data is mapped to which aesthetic component). These two pieces of information can be given either to the main function ggplot(), in which case all the following layers inherit them (unless otherwise specified in a given layer), or to the individual layer functions, to specify the particular data and aesthetic mapping used by that layer.
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point()
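The two ways of supplying the data and the mapping can be sketched as follows. The object names (`p_global`, `p_local`, `p_mixed`) and the choice of highlighting the "2seater" class are only illustrative.

```r
library(ggplot2)

# Data and mapping given to ggplot(): every layer inherits them.
p_global <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()

# Data and mapping given to the layer itself: only that layer uses them.
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = cty, y = hwy))

# A layer can also override part of what it inherits, here the data:
# the second layer redraws only the two-seater cars, in red.
p_mixed <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_point(data = mpg[mpg$class == "2seater", ], colour = "red")
```

The first two plots should look the same; the third keeps the inherited mapping but draws its second layer from a subset of the data.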
EXERCISE
What will you get if you indicate only the ggplot()
function?
SOLUTION
You will get an empty plot…
ggplot()
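Layers are responsible for creating the objects that are perceived on the plot. A layer is composed of five parts: the data, the aesthetic mapping, a geometric object, a statistical transformation and a position adjustment.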
INFO
To create a new layer, ggplot2 offers the generic layer() function. Nevertheless, this function is rarely used; the predefined functions prefixed with geom_, which indicate a specific geometric object, or stat_, which indicate a specific statistical transformation, are used most often.
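To make the relation between layer() and the predefined functions concrete, here is a small sketch that builds the same scatter plot both ways. The object names are illustrative.

```r
library(ggplot2)

# Explicit layer: geom, stat and position are all spelled out.
p_layer <- ggplot(mpg, aes(x = cty, y = hwy)) +
  layer(geom = "point", stat = "identity", position = "identity")

# Usual shortcut: geom_point() uses the same stat and position by default.
p_geom <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point()
```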
Each layer needs a tidy dataset from which the objects will be drawn, and an aesthetic mapping (defined with the aes() function) that determines which variables of the tidy dataset are mapped to which aesthetics of the geometric object. Of course, each geometric object has its own set of aesthetics, some of which are required. For instance, drawing a scatter plot with geom_point() requires at least the x and y aesthetics. If no dataset and/or mapping is specified, the layer inherits those specified in the ggplot() function. You can find all the aesthetics available and required for a given layer with the help() function on that layer.
INFO
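If you want to force an aesthetic to be independent of the data (e.g. to use larger or smaller points), you can set the corresponding aesthetic to a constant value. This is usually done outside the aes() function, as an argument of the layer; a constant placed inside aes() is treated as data and goes through the scale.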
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point(size=10)
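The difference between setting an aesthetic and mapping it can be checked with the colour aesthetic. This is a sketch; `p_set` and `p_map` are illustrative names, and layer_data() is used only to inspect the colours that ggplot2 ends up with.

```r
library(ggplot2)

# Setting: colour is a fixed property of the layer, outside aes().
p_set <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(colour = "blue")

# Mapping: inside aes(), "blue" is treated as a one-level variable,
# so the colour scale picks the colour and a legend is drawn.
p_map <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(colour = "blue"))

unique(layer_data(p_set)$colour)  # the colour that was set
unique(layer_data(p_map)$colour)  # a colour chosen by the default scale
```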
You can also transform the data before plotting them.
ggplot(mpg, aes(x=log10(1+cty), y=log10(1+hwy))) + geom_point()
The geometric object (geom) indicates which type of plot will be drawn. For example, in our last graphic we used a point geom (geom_point()), which created a scatter plot, whereas using a line geom (geom_line()) creates a line plot.
ggplot(mpg, aes(x=cty, y=hwy)) + geom_line()
We can classify the geometric objects by their dimensionality (the minimum number of variables they depend on). Table 4.1 presents some of the available geoms.
Function | Dimension | Description |
---|---|---|
geom_histogram() | 1 | Draws a histogram. |
geom_point() | 2 | Draws a scatter plot. |
geom_boxplot() | 2 | Draws a box-and-whisker plot. |
geom_violin() | 2 | Draws a violin plot. |
geom_tile() | 3 | Draws a heatmap. |
EXERCISE
For each function listed in Table 4.1, draw a plot using the mpg dataset.
SOLUTION
ggplot(mpg, aes(x=cty)) + geom_histogram()
ggplot(mpg, aes(x=manufacturer, y=hwy)) + geom_boxplot()
ggplot(mpg, aes(x=manufacturer, y=hwy)) + geom_violin()
ggplot(mpg, aes(x=manufacturer, y=year, fill=cty)) + geom_tile()
Statistical transformations can be applied to the data before they are used, in order to transform them or to compute new variables to map to aesthetics. To be consistent, a statistical transformation \(f\) must obey the following rules: \(f(x+a)=f(x)+a\) and \(f(b\times x)=b\times f(x)\). These two rules ensure that the statistical transformation is location-scale invariant: shifting or rescaling the data shifts or rescales the result in the same way.
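The two rules can be checked numerically. The sketch below uses base R only, with made-up values for `x`, `a` and `b`, and takes the mean as an example of a transformation that obeys both rules.

```r
# Check the two rules numerically for f = mean, with arbitrary a and b.
x <- c(9, 12, 18, 21, 35)   # made-up values
a <- 5
b <- 3

all.equal(mean(x + a), mean(x) + a)   # translation rule
all.equal(mean(b * x), b * mean(x))   # scaling rule

# A counter-example: the variance does not follow the scaling rule,
# since var(b * x) equals b^2 * var(x).
isTRUE(all.equal(var(b * x), b * var(x)))
```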
Some statistical transformations are provided by the ggplot2 package. They all begin with stat_, for instance:
ggplot(mpg, aes(x=manufacturer, y=cty)) + stat_boxplot()
You will get the same result as with the geom_boxplot() function. In fact, geom_boxplot() has stat="boxplot" as its default option and stat_boxplot() has geom="boxplot" as its default, so the two functions are equivalent with their default options. If you do not apply the stat="boxplot" transformation, you will not be able to draw a boxplot, because geom="boxplot" also requires the lower, upper, middle, ymin and ymax aesthetics, which you have not mapped in the aes() function (they are in fact computed by the boxplot statistical transformation). For instance, to use geom="boxplot" without any statistical transformation, we must first compute the quantiles ourselves and set the stat="identity" option:
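# Compute the five quantiles (0%, 25%, 50%, 75%, 100%) of cty per manufacturer
df <- aggregate(mpg$cty, by=list(mpg$manufacturer), quantile)
# aggregate() stores the quantiles in a matrix column: flatten it into regular columns
df <- cbind(df[1], df[[2]])
colnames(df) <- c("manufacturer", "ymin", "lower", "middle", "upper", "ymax")
ggplot(df, aes(x=manufacturer, ymin=ymin, lower=lower, middle=middle, upper=upper, ymax=ymax)) +
  geom_boxplot(stat="identity")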
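To see the variables that the boxplot statistical transformation computes on its own, you can inspect the data of the layer with layer_data(). This is a sketch; `p` and `computed` are illustrative names.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = manufacturer, y = cty)) +
  geom_boxplot()

# layer_data() returns the data of a layer after the stat has been applied:
# one row per box, with the aesthetics required by geom="boxplot".
computed <- layer_data(p, 1)
computed[, c("ymin", "lower", "middle", "upper", "ymax")]
```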
Position adjustments are responsible for adjusting the position of overlapping geoms. Like statistical transformations, they share a common prefix, position_. You can define them in the geom_ or stat_ function with the position option. For instance,
ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "dodge")
ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "stack")
ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "fill")
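The effect of each adjustment can also be read from the computed bar heights. In this sketch, `bars()` is a small helper written for the example (it is not a ggplot2 function).

```r
library(ggplot2)

# Hypothetical helper: same bar plot, with the position adjustment as a parameter.
bars <- function(position) {
  ggplot(mpg) +
    geom_bar(aes(x = manufacturer, fill = class), position = position)
}

# Highest point reached by any bar under each adjustment.
max(layer_data(bars("stack"))$ymax)  # tallest stack: largest count per manufacturer
max(layer_data(bars("dodge"))$ymax)  # tallest single bar: largest count per manufacturer and class
max(layer_data(bars("fill"))$ymax)   # stacks rescaled so that each one sums to 1

# The string is a shortcut for the matching position_*() function,
# which can be called directly when options are needed.
p_dodge <- ggplot(mpg) +
  geom_bar(aes(x = manufacturer, fill = class),
           position = position_dodge(preserve = "single"))
```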
EXERCISE
Using the mpg dataset, can you reproduce the following graphic? You can use the jitter position to randomize the position of the points.
SOLUTION
ggplot(mpg, aes(x=manufacturer, y=cty)) +
geom_boxplot(outlier.shape=NA) +
geom_point(pch = 21, position="jitter")
Scales control how the data are mapped to the aesthetics, so there is one scale for each aesthetic. Scale functions follow the pattern scale_[aesthetic]_[scale], for instance scale_x_continuous(), which defines how a continuous variable is mapped to the x aesthetic (x-axis), or scale_fill_gradient(), which defines how a continuous variable is mapped to a two-color gradient for the fill aesthetic. For instance:
ggplot(mpg, aes(x=cty, y=hwy, color=cyl)) +
geom_point() +
scale_color_gradient(low ="skyblue", high= "navy")
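The naming pattern applies in the same way to other aesthetics and scale types. Here is a sketch combining a continuous position scale with a discrete colour scale; the axis title, the breaks and the "Set2" palette are arbitrary choices for the example.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy, colour = class)) +
  geom_point() +
  # x aesthetic, continuous scale: axis title and tick positions
  scale_x_continuous(name = "City miles per gallon", breaks = seq(10, 35, by = 5)) +
  # colour aesthetic, ColorBrewer scale for a discrete variable
  scale_colour_brewer(palette = "Set2")
```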
Facets divide a plot into subplots based on the values of one or more discrete variables. This is useful to investigate whether patterns are the same or different across the values of these variables. For example:
ggplot(mpg) +
geom_point(aes(x=cty, y=hwy)) +
facet_wrap(~manufacturer)
To use facets, you can use either the facet_wrap() or the facet_grid() function. facet_wrap() is typically used when you have one faceting variable, and facet_grid() when you have two. The default facet function is facet_null(), so adding or removing it does not change the plot; however, it is useful when we need to reset the facets of a plot.
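A sketch of facet_grid() with two variables, followed by a reset with facet_null(). The choice of drv for the rows and year for the columns is arbitrary.

```r
library(ggplot2)

# Rows are defined by drv and columns by year.
p <- ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy)) +
  facet_grid(drv ~ year)

# Adding facet_null() to a faceted plot puts everything back in a single panel.
p_reset <- p + facet_null()
```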
A coordinate system associates each value in the data with a unique position on the plot. The most often used one is the Cartesian coordinate system (represented by the coord_cartesian() function), which is used to map an object in two dimensions, but there are other representations, like the polar coordinate system (represented by the coord_polar() function) or, less frequently used, various map projections (e.g. see the coord_map() function).
As a coordinate system transforms the shape of objects, the polar coordinate system can be used to generate pie charts from bar plots. For instance:
ggplot(mpg) +
geom_bar(aes(x=class))
ggplot(mpg) +
geom_bar(aes(x="", fill=class), color="white") +
coord_polar("y") +
theme_void()
All functions associated with a coordinate system look like coord_*. Another useful function is coord_flip(), which flips the coordinates so that the x-axis becomes vertical and the y-axis horizontal.
ggplot(mpg) +
geom_bar(aes(x=class)) +
coord_flip()
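Coordinate functions also accept options. As a sketch, coord_cartesian() can be given xlim and ylim to zoom on a region of the plot; the limits below are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  # Zoom on a region: the limits only change the visible window.
  coord_cartesian(xlim = c(10, 20), ylim = c(15, 30))

# All observations are still present in the layer data.
nrow(layer_data(p)) == nrow(mpg)
```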
The default theme of graphics generated by the ggplot2 package has a gray background with a white grid marking the axis ticks inside the graphic. You can change this theme using complete themes:
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme_dark()
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme_classic()
The complete themes included in the package are theme_grey(), theme_gray(), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic(), theme_void() and theme_test().
You can also change one or several components of the theme with the theme() function. For instance, to set a red background:
ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme(panel.background=element_rect(fill="red"))
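Several components can be changed in one call, and theme() can be combined with a complete theme. The sketch below is only an example of such a combination; the specific settings are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  theme_classic() +                                      # complete theme first
  theme(
    axis.title = element_text(face = "bold"),            # a text element
    panel.grid.major = element_line(colour = "grey80"),  # a line element
    legend.position = "none"                             # a plain setting
  )
```

The call to theme() comes after theme_classic() because a complete theme replaces the theme settings added before it.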
You can even change the default theme for the whole R session using the theme_set(), theme_update() and theme_replace() functions, so that all subsequent graphics inherit your choice. The theme_get() function returns the current theme.
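A sketch of a session-wide change that is undone afterwards. It relies on theme_set() returning the previously active theme; `old_theme` is an illustrative name.

```r
library(ggplot2)

# theme_set() returns the previous theme, so it can be restored later.
old_theme <- theme_set(theme_bw())

# Any plot printed while theme_bw() is the default will use it.
p <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# Restore the theme that was active before.
theme_set(old_theme)
```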
You can easily find new themes on the internet; for instance, the ggthemes package provides additional themes.
You can export graphics in a variety of formats using pdf(), jpeg(), png(), bmp(), tiff()…, depending on the output format you want.
To do that, draw your graphics after calling the function and finish by calling the dev.off() function (otherwise all subsequent graphics will also be written to the same output). For instance:
pdf("test.pdf")
ggplot(mpg, aes(x=manufacturer, y=cty)) + stat_boxplot()
dev.off()
Remember that the path given to pdf() or an equivalent function can be absolute or relative. If it is relative, it is resolved from your working directory (see the getwd() function).
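When the export is done from a script, a function or a loop, the plot is not printed automatically, so an explicit print() is needed between the opening of the device and dev.off(). The following sketch writes to a temporary directory; the file name and the dimensions are arbitrary.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = manufacturer, y = cty)) + stat_boxplot()

# Hypothetical output location, built from R's temporary directory.
out_file <- file.path(tempdir(), "test.pdf")

pdf(out_file, width = 7, height = 5)  # dimensions in inches
print(p)   # explicit print(): needed inside functions, loops or sourced scripts
dev.off()

file.exists(out_file)
```

The ggplot2 package also provides the ggsave() function, which opens and closes the device for you.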
You can visit many websites or read books for inspiration, to learn more about ggplot2, or to discover the many extensions associated with this package.
There is also a cheat sheet that is very useful when drawing graphics with ggplot2, because it summarizes all the functions.
EXERCISE
For this purpose, we will use the data that we generated and exported in the previous homework.
1. Load these data into R.
2. Compare the D5plus and D5minus conditions:
2.1. Create a data.frame containing two columns, D5plus and D5minus, where each value corresponds to the mean of the normalized gene counts of the samples belonging to the D5plus and D5minus conditions, respectively.
2.2. From this data.frame, draw a scatter plot of the log10-transformed values (don't forget to add a pseudo-count of 1), with column D5plus on the x-axis and column D5minus on the y-axis.
2.3. Add the diagonal line in red. You can use the geom_abline() function to do that.
3. Plot the normalized counts of the genes ENSG00000151388, ENSG00000067082 and ENSG00000105290 in each experimental condition:
3.1. Load the reshape2 package. If it is not installed, install it first.
3.2. Use the melt() function to put the normalized counts into a tidy format. You can convert them into a matrix first, so that you get the row names as a variable. As the melt() function converts character into factor, convert the Var1 and Var2 columns back into character vectors.
3.3. Add a column Var3 that contains the experimental condition name (D5plus, D5minus, D5shX3 and D5shX5) associated with each sample, and filter the resulting table on Var1 to keep only the requested Ensembl gene IDs.
3.4. Use the ggplot() and geom_jitter() functions to plot variable Var3 on the x-axis and variable value on the y-axis (log10 transformed, with a pseudo-count of 1), and use facet_wrap() on the Var1 variable. For better visualisation, you can set the width option of the geom_jitter() function to 0.25.
3.5. Add a red line connecting the mean of each condition with the stat_summary() function. To use a line graph with a discrete scale, we must tell ggplot2 which data points must be grouped together. In this case, all points should be connected, so we can add the group=1 option. Nevertheless, when more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
3.6. Add a title and axis labels, and rotate the x-axis tick labels (you can use the theme() function to do that, but with which option and which setting?)
SOLUTION
Remember that there is no unique answer! The goal is to get the corresponding results.
Question 1
data <- read.table("../01_IntroductionToR/GSE153921_dge.tsv",
sep="\t", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE, dec=",")
head(data)
Question 2.1
df_D5minus <- data[startsWith(colnames(data), "D5minus")&endsWith(colnames(data), "(normalized)")]
df_D5plus <- data[startsWith(colnames(data), "D5plus")&endsWith(colnames(data), "(normalized)")]
df <- data.frame("D5minus"=apply(df_D5minus, 1, mean),
"D5plus"=apply(df_D5plus, 1, mean))
head(df)
Question 2.2
ggplot(df, aes(x=log10(1+D5plus), y=log10(1+D5minus))) +
geom_point()
Question 2.3
ggplot(df, aes(x=log10(1+D5plus), y=log10(1+D5minus))) +
geom_point() +
geom_abline(slope=1, color="red")
Question 3.1
install.packages("reshape2")
library(reshape2)
Question 3.2
filter_column <- endsWith(colnames(data), "(normalized)")
dm <- melt(as.matrix(data[filter_column]))
dm$Var1 <- as.character(dm$Var1)
dm$Var2 <- as.character(dm$Var2)
head(dm)
Question 3.3
dm$Var3 <- substr(dm$Var2, 1, nchar(dm$Var2)-nchar("X (normalized)"))
filter_gene <- c("ENSG00000151388", "ENSG00000067082", "ENSG00000105290")
dm <- dm[dm$Var1%in%filter_gene, ]
head(dm)
Question 3.4
ggplot(dm,
mapping=aes(x=Var3, y=log10(1+value))) +
geom_jitter(width=0.25) +
facet_wrap(~Var1)
Question 3.5
ggplot(dm,
mapping=aes(x=Var3, y=log10(1+value))) +
geom_jitter(width=0.25) +
facet_wrap(~Var1)+
stat_summary(group=1, geom="line",
color="red", fun="mean")
Question 3.6
ggplot(dm,
mapping=aes(x=Var3, y=log10(1+value))) +
geom_jitter(width=0.25) +
facet_wrap(~Var1)+
stat_summary(group=1, geom="line",
color="red", fun="mean") +
theme(axis.text.x=element_text(angle=90, vjust=0.5)) +
labs(title="Gene variation",
x="",
y="Count (log10 transformed plus a pseudo-count of 1)")
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.
Wilkinson, Leland. 2005. The Grammar of Graphics. Second Edition. New York: Springer-Verlag.