How to use this document?

The best way to follow this lesson is to try every example yourself as soon as you read it.

R commands will be presented in gray boxes:

print('Hello world!')
## [1] "Hello world!"

A white box following a gray box shows R’s output. If you run the commands yourself, you will get the same output. Blue boxes contain exercises that check whether you have understood the notions seen in this lesson:

EXERCISE

This box will contain exercises that you should do to make sure that you have understood all the notions.

Solutions to the exercises will be given in green boxes (they will only be available after the session).

SOLUTION

This box will contain answers to the previous exercises.

Purple boxes contain additional information (for advanced users), so it is not necessary to read or understand them on a first reading. You will probably understand them once you have gained some experience with R.

INFO

This box will contain additional information that you do not need to understand on your first reading.

Orange boxes contain warnings to keep in mind in order to use R properly and follow good practices.

WARNING

This box will contain warnings or good practices that you should pay attention to.

1 Introduction

R offers many ways to generate graphics, and it includes a graphics package that is loaded by default at the beginning of each new session. Nevertheless, we will present here another approach to generating graphics, provided by a package named ggplot2, which is based on the book The Grammar of Graphics (Wilkinson 2005).

To use ggplot2, we must first install the package. To do this, we will use the install.packages() function, seen in the previous chapter:

install.packages("ggplot2")

As ggplot2 is part of the Tidyverse, a collection of packages that share a common philosophy and are useful for manipulating scientific data, you can alternatively install the whole set of packages (which includes ggplot2) with

install.packages("tidyverse")

However, we will only present the ggplot2 package here. In either case, you can load the ggplot2 functions with

library(ggplot2)

or, if you chose the second installation, with

library(tidyverse)

2 Tidy data

Data are obviously the main component of a plot. To draw a graphic with ggplot2, your data must be in the tidy format (Wickham 2014). This format is a two-dimensional format (stored in a data.frame or data.frame-like structure) where each column corresponds to a variable, each row to an observation and each cell to a value (Figure 2.1).

Figure 2.1: Visualization of the tidy format, where each column corresponds to a variable, each row to an observation and each cell to a value. Picture taken from https://cfss.uchicago.edu/notes/tidy-data/.

If you have installed and loaded the tidyverse package, you have access to the gather() and spread() functions (from the tidyr package, part of the Tidyverse) to reshape your data as needed; recent versions of tidyr recommend pivot_longer() and pivot_wider() instead. You can also use the melt() function of the reshape2 package.
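As an illustration of such a reshaping, here is a minimal sketch that assumes the tidyr package is installed. The wide table and its column names (gene, sampleA, sampleB) are invented for the example and are not part of the lesson data.

```r
library(tidyr)

# Hypothetical wide table: one row per gene, one column per sample
wide <- data.frame(gene    = c("g1", "g2"),
                   sampleA = c(10, 20),
                   sampleB = c(12, 18))

# Tidy (long) layout: one row per (gene, sample) observation
tidy <- pivot_longer(wide,
                     cols      = c(sampleA, sampleB),
                     names_to  = "sample",
                     values_to = "count")
tidy
```

Each row of the result now holds a single observation, which is the layout ggplot2 expects.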

To introduce the specifics of ggplot2, we will use the mpg dataset (which becomes available as soon as the ggplot2 package is loaded):

mpg

To learn more about this dataset you can use the help() function on mpg or the shortcut command ?mpg.

3 ggplot2 philosophy

The philosophy behind ggplot2 is difficult to describe succinctly. You always start with the ggplot() function to initialize the plot, and then use the + operator to add layers (like geom_point()), scales (like scale_colour_brewer()), faceting specifications (like facet_wrap()) and coordinate systems (like coord_flip()). You also need to specify the data you will use and the aesthetic mapping aes() (which variable of the data is matched with which aesthetic component). These two pieces of information can be given either to the main function ggplot(), in which case all the following layers inherit them (unless otherwise specified in a given layer), or to the individual layer functions, to specify the particular data and aesthetic mapping used by that layer.

ggplot(mpg, aes(x=cty, y=hwy)) + geom_point()
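To make the inheritance rule more concrete, the following sketch builds the same scatter plot in two ways and then extends the inherited mapping in a single layer. The object names (p_global, p_local, p_extended) are ours and only serve the illustration.

```r
library(ggplot2)

# Data and mapping given once in ggplot(): every layer inherits them
p_global <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()

# Same plot, but data and mapping are given to the layer itself
p_local <- ggplot() + geom_point(data = mpg, mapping = aes(x = cty, y = hwy))

# A layer can also add to the inherited mapping (here, colour by class)
p_extended <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(aes(colour = class))
```

Printing p_global and p_local should give the same picture, whereas p_extended adds one colour per vehicle class.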

EXERCISE

What will you get if you call only the ggplot() function?

SOLUTION

You will get an empty plot…

ggplot()

4 Layers

Layers are responsible for creating the objects that are perceived on the plot. A layer is composed of five parts:

  • a geometric object (geom) used to display the data;
  • a statistical transformation (stat) to use on the data;
  • the data (in the tidy format) from which the object will be constructed;
  • an aesthetic mapping that describes how the data variables should map to the aesthetic attributes of the geometric object;
  • a position adjustment.

INFO

To create a new layer, ggplot2 offers the generic layer() function. Nevertheless, this function is rarely used; the predefined functions prefixed with geom_, which indicate a specific geometric object, or stat_, which indicate a specific statistical transformation, are used most often.
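For the curious, the sketch below spells out a scatter-plot layer with layer(), naming the geom, stat and position explicitly, and compares it with the geom_point() shortcut. It is only meant to show how the five parts fit together, not as a recommended way of writing plots.

```r
library(ggplot2)

# Explicit layer: geom, stat and position are all named by hand;
# data and mapping are inherited from ggplot()
p_layer <- ggplot(mpg, aes(x = cty, y = hwy)) +
  layer(geom = "point", stat = "identity", position = "identity")

# Usual shortcut, which relies on the same defaults
p_geom <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point()
```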

4.1 Data and mapping

Each layer needs a tidy dataset from which the objects will be drawn and an aesthetic mapping (defined with the aes() function), which determines which variables of the tidy dataset map to which aesthetics of the geometric object. Each geometric object has its own set of aesthetics, some of which are required. For instance, drawing a scatter plot with geom_point() needs at least the x and y aesthetics. If no dataset and/or mapping is specified, the layer inherits the ones that were specified in the ggplot() function. You can find all the aesthetics available and required for a given layer by calling the help() function on that layer.

INFO

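If you want to force an aesthetic to be independent of the data (e.g. to use a bigger or smaller point), you can set the corresponding aesthetic to a fixed value outside the aes() function, as an argument of the layer. A constant placed inside aes() is instead treated as a data value and passed through the corresponding scale.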

ggplot(mpg, aes(x=cty, y=hwy)) + geom_point(size=10)
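The difference between setting an aesthetic (outside aes()) and mapping it (inside aes()) is easiest to see with colours. In this sketch, the names p_set and p_map are ours; layer_data() is used to inspect the colours that ggplot2 actually computes.

```r
library(ggplot2)

# Setting: every point is drawn in blue
p_set <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point(colour = "blue")

# Mapping: "blue" is treated as a one-level variable and goes through
# the default colour scale, so the points are NOT drawn in blue
p_map <- ggplot(mpg, aes(x = cty, y = hwy)) + geom_point(aes(colour = "blue"))

unique(layer_data(p_set)$colour)
unique(layer_data(p_map)$colour)
```

The second plot also gains a legend with a single entry labelled "blue", which is usually a sign that a constant was placed inside aes() by mistake.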

You can also transform the data before plotting them.

ggplot(mpg, aes(x=log10(1+cty), y=log10(1+hwy))) + geom_point()

4.2 Geometric object

The geometric object (geom) indicates which type of plot will be drawn. For example, in our last graphic we used a point geom (geom_point()), which created a scatter plot, whereas using a line geom (geom_line()) creates a line plot.

ggplot(mpg, aes(x=cty, y=hwy)) + geom_line()

We can classify the geometric objects by their dimensionality (the minimum number of variables they depend on). Table 4.1 presents some of the available geoms.

Table 4.1: Some geometric objects that can be used with ggplot2.

Function            Dimension   Description
geom_histogram()    1           Draws a histogram.
geom_point()        2           Draws a scatter plot.
geom_boxplot()      2           Draws a box-and-whisker plot.
geom_violin()       2           Draws a violin plot.
geom_tile()         3           Draws a heatmap.

EXERCISE

For each function listed in Table 4.1, draw a plot using the mpg dataset.

SOLUTION

ggplot(mpg, aes(x=cty)) + geom_histogram()

ggplot(mpg, aes(x=manufacturer, y=hwy)) + geom_boxplot()

ggplot(mpg, aes(x=manufacturer, y=hwy)) + geom_violin()

ggplot(mpg, aes(x=manufacturer, y=year, fill=cty)) + geom_tile()

4.3 Statistical transformation

Statistical transformations can be applied to the data before they are used, in order to transform them or to obtain new aesthetics to map. To be consistent, a statistical transformation \(f\) must obey the following rules: \(f(x+a)=f(x)+a\) and \(f(b\times x)=b\times f(x)\). These two rules ensure that the statistical transformation is invariant under translation and scaling.
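As a quick numerical check of these two rules, we can verify them for the mean and the median on a small vector. The values of x, a and b below are arbitrary and chosen only for the illustration.

```r
x <- c(2, 4, 9)   # arbitrary data
a <- 3            # translation
b <- 5            # scaling factor

# f(x + a) = f(x) + a
shift_ok <- isTRUE(all.equal(mean(x + a), mean(x) + a)) &&
  isTRUE(all.equal(median(x + a), median(x) + a))

# f(b * x) = b * f(x)
scale_ok <- isTRUE(all.equal(mean(b * x), b * mean(x))) &&
  isTRUE(all.equal(median(b * x), b * median(x)))

c(shift_ok, scale_ok)
```

By contrast, a function such as var() would fail the first rule, since shifting the data does not change the variance.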

Some statistical transformations are provided by the ggplot2 package. They all begin with stat_, for instance:

ggplot(mpg, aes(x=manufacturer, y=cty)) + stat_boxplot()

You will get the same result as with the geom_boxplot() function. In fact, geom_boxplot() uses stat="boxplot" by default and stat_boxplot() uses geom="boxplot" by default, so the two functions are equivalent with their default options. If you do not apply the stat="boxplot" transformation, you will not be able to draw the boxplot, because geom="boxplot" also requires the lower, upper, middle, ymin and ymax aesthetics, which you have not mapped in the aes() function (they are computed by the boxplot statistical transformation). For instance, to use geom="boxplot" without any statistical transformation, we must first compute the quantiles ourselves and set the stat="identity" option:

df <- aggregate(mpg$cty, by=list(mpg$manufacturer), quantile)
df <- cbind(df[1], df[[2]])
colnames(df) <- c("manufacturer", "ymin", "lower", "middle", "upper", "ymax")
ggplot(df, aes(x=manufacturer, ymin=ymin, lower=lower, middle=middle, upper=upper, ymax=ymax)) + 
  geom_boxplot(stat="identity")

4.4 Position adjustment

Position adjustments are responsible for adjusting the position of overlapping geoms. Like statistical transformations, they share a common prefix: position_. You can define them in the geom_ or stat_ function with the position option. For instance,

ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "dodge")

ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "stack")

ggplot(mpg) + geom_bar(aes(x = manufacturer, fill = class), position = "fill")
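The character shortcuts used above ("dodge", "stack", "fill") correspond to the functions position_dodge(), position_stack() and position_fill(). The sketch below, with object names of our own choosing, shows the function form and checks with layer_data() that "fill" rescales every bar to a total height of 1.

```r
library(ggplot2)

# String shortcut and explicit function give the same adjustment
p_str <- ggplot(mpg) +
  geom_bar(aes(x = manufacturer, fill = class), position = "dodge")
p_fun <- ggplot(mpg) +
  geom_bar(aes(x = manufacturer, fill = class), position = position_dodge())

# With position_fill(), the stacked bars are normalized to proportions
p_fill <- ggplot(mpg) +
  geom_bar(aes(x = manufacturer, fill = class), position = position_fill())
max(layer_data(p_fill)$ymax)
```

The function form is mainly useful when an adjustment needs extra arguments, which the string shortcut cannot carry.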

EXERCISE

Using the mpg dataset, can you reproduce the following graphic?

You can use the jitter position to randomize the position of the points.

SOLUTION

ggplot(mpg, aes(x=manufacturer, y=cty)) + 
  geom_boxplot(outlier.shape=NA) + 
  geom_point(pch = 21, position="jitter")

5 Scales

Scales control how the data are mapped to the aesthetics, so there is one scale for each aesthetic. Scale functions follow the pattern scale_[aesthetic]_[scale], for instance scale_x_continuous(), which defines how a continuous variable is mapped to the x aesthetic (x-axis), or scale_fill_gradient(), which defines how a continuous variable is mapped to a two-color gradient for the fill aesthetic. For instance:

ggplot(mpg, aes(x=cty, y=hwy, color=cyl)) +
  geom_point() +
  scale_color_gradient(low ="skyblue", high= "navy")
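The same naming pattern applies to discrete variables and to axes. As a hedged sketch, the example below combines a continuous x scale with a manual colour scale for the drv variable of mpg; the axis title and the three colours are arbitrary choices made for the illustration.

```r
library(ggplot2)

p_scales <- ggplot(mpg, aes(x = cty, y = hwy, colour = drv)) +
  geom_point() +
  # scale for the x aesthetic: axis title and limits
  scale_x_continuous(name = "City miles per gallon", limits = c(0, 40)) +
  # scale for the colour aesthetic: one colour per level of drv
  scale_colour_manual(values = c("4" = "black", "f" = "orange", "r" = "purple"))
```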

6 Facet

Facets divide a plot into subplots based on the values of one or more discrete variables. This is used to investigate whether patterns are the same or different across the levels of the chosen variables. For example:

ggplot(mpg) + 
  geom_point(aes(x=cty, y=hwy)) + 
  facet_wrap(~manufacturer)

To use facets, you can use either the facet_wrap() or the facet_grid() function. facet_wrap() is often used when you have one variable to facet by, and facet_grid() when you have two. The default facet function is facet_null(), so adding or removing it does not change the plot; however, it is useful when we need to reset the facets of a plot.
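For comparison with the facet_wrap() example, here is a small sketch of facet_grid() with two variables of mpg, drv in rows and cyl in columns. The choice of these two variables is ours.

```r
library(ggplot2)

# One panel per combination of drv (rows) and cyl (columns)
p_grid <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)
```

Unlike facet_wrap(), the grid keeps a panel for every combination of the two variables, including combinations for which the data contain no observation.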

7 Coordinate system

A coordinate system associates each value in the data with a unique position on the plot. The most often used one is the Cartesian coordinate system (represented by the coord_cartesian() function), which is used to map an object with two dimensions, but there are also other representations, such as the polar coordinate system (represented by the coord_polar() function) or, less frequently used, various map projections (e.g. see the coord_map() function).

As the coordinate system transforms the shape of objects, the polar coordinate system can be used to generate pie charts from bar plots. For instance:

ggplot(mpg) + 
  geom_bar(aes(x=class))

ggplot(mpg) + 
  geom_bar(aes(x="", fill=class), color="white") + 
  coord_polar("y") +
  theme_void()

All functions associated with a coordinate system are named coord_*. Another useful function is coord_flip(), which flips the coordinates so that the x-axis becomes vertical and the y-axis horizontal.

ggplot(mpg) + 
  geom_bar(aes(x=class)) +
  coord_flip()

8 Theme

The default theme of graphics generated by the ggplot2 package is a gray background with a grid of white lines extending the axis ticks inside the graphic. You can change this theme using complete themes:

ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme_dark()

ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme_classic()

The complete themes included in the package are theme_grey(), theme_gray(), theme_bw(), theme_linedraw(), theme_light(), theme_dark(), theme_minimal(), theme_classic(), theme_void(), and theme_test(). You can also change one or several components of the theme with the theme() function. For instance, to set a red background:

ggplot(mpg, aes(x=cty, y=hwy)) + geom_point() + theme(panel.background=element_rect(fill="red"))

You can even change the default theme for the whole R session using the theme_set(), theme_update() and theme_replace() functions, so that any subsequent graphics inherit your choice. The theme_get() function returns the current theme.

You can easily find new themes on the internet; for instance, the ggthemes package provides additional themes.
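A minimal sketch of a session-wide theme change is given below. It relies on the fact that theme_set() returns the previously active theme, which we keep in a variable of our own (old_theme) so that the original default can be restored afterwards.

```r
library(ggplot2)

# Switch the session default to theme_bw() and remember the previous theme
old_theme <- theme_set(theme_bw())

# Adjust a single element of the now-active theme
theme_update(legend.position = "bottom")

# Any plot printed at this point would use the modified theme_bw(),
# without adding a theme_*() call to the plot itself

# Restore the theme that was active before this example
theme_set(old_theme)
```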

9 Exporting

You can export graphics in a variety of formats using pdf(), jpeg(), png(), bmp(), tiff()… depending on the output format you want. To do that, you must draw your graphics after calling the function and finish by calling the dev.off() function (otherwise all subsequent graphics will also be printed into the same output). For instance:

pdf("test.pdf")
ggplot(mpg, aes(x=manufacturer, y=cty)) + stat_boxplot()
dev.off()

Remember that the path given to pdf() or an equivalent function can be absolute or relative. If it is relative, it is resolved from your working directory (see the getwd() function).
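As an alternative to opening and closing a device by hand, ggplot2 also provides the ggsave() function, which chooses the device from the file extension. The sketch below writes into R's temporary directory so that it does not touch your working directory; the file name and the dimensions are arbitrary.

```r
library(ggplot2)

# Keep the plot in a variable so that it can be passed to ggsave()
p_box <- ggplot(mpg, aes(x = manufacturer, y = cty)) + geom_boxplot()

# Hypothetical output location, chosen only for this example
out_file <- file.path(tempdir(), "boxplot_example.pdf")

# width and height are in inches by default
ggsave(out_file, plot = p_box, width = 8, height = 4)
```

Note also that when a plot is created inside a function, a loop or a sourced script, it is not drawn automatically: between pdf() and dev.off() you then need to call print() on the plot explicitly.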

10 To go further

You can visit many websites or read books for inspiration, to learn more about ggplot2, or to discover the many extensions associated with this package:

There is also a cheat sheet that is very useful when drawing graphics with ggplot2 because it summarizes all the functions:

11 Homework

EXERCISE

For this homework, we will use the data that we generated and exported in the previous homework.

  1. Load the data into a variable
  2. Now we want to see whether there is a bias between the samples from the D5plus and D5minus conditions:
    1. Generate a new data.frame containing two columns, D5plus and D5minus, in which each value corresponds to the mean of the normalized gene counts of the samples belonging to the D5plus and D5minus conditions, respectively.
    2. With the previous data.frame, draw a scatter plot of the log10-transformed values (don’t forget to add a pseudo-count of 1), with column D5plus on the x-axis and column D5minus on the y-axis.
    3. Also draw the identity function (\(f(x)=x\)) in red to better visualize whether a bias exists; you can use the geom_abline() function to do that.
  3. We will now examine the expression of three genes, with the following Ensembl IDs: ENSG00000151388, ENSG00000067082 and ENSG00000105290, in each experimental condition:
    1. Load the reshape2 package. If it is not installed, install it first.
    2. Filter the columns of the input data to keep only the normalized gene counts, and use the melt() function to convert these data into a tidy format. You can convert the data into a matrix first, so that you get the row names as a variable. As the melt() function converts characters into factors, convert the Var1 and Var2 columns into character vectors.
    3. Add a column named Var3 that contains the name of the experimental condition (D5plus, D5minus, D5shX3 or D5shX5) associated with each sample, and filter the resulting table on Var1 to keep only the requested Ensembl gene IDs.
    4. Draw a graphic using the ggplot() and geom_jitter() functions, plotting variable Var3 on the x-axis and variable value on the y-axis (log10 transformed, with a pseudo-count of 1), and use facet_wrap() on the Var1 variable. For a better visualization, you can set the width option of the geom_jitter() function to 0.25.
    5. Now we want to better visualize how the expression values change between conditions. To do that, we wish to draw a red line joining the mean values of the experimental conditions: add a layer to the previous graphic using the stat_summary() function. To draw a line graph with a discrete scale, we must tell ggplot2 which data points must be grouped together. In this case, all points should be connected, so we can add the group=1 option. Nevertheless, when more variables are used and multiple lines are drawn, the grouping for lines is usually done by variable.
    6. To finish, remove the x-axis title from the graphic, rename the y-axis title to “Count (log10 transformed plus a pseudo-count of 1)” and add the title “Gene variation” to the graphic. Can you also rotate the x-axis text elements by 90°? (Use the theme() function to do that, but with which option and which setting?)

SOLUTION

Remember that there is no unique answer! The goal is to obtain the corresponding results.

Question 1

data <- read.table("../01_IntroductionToR/GSE153921_dge.tsv", 
                   sep="\t", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE, dec=",")
head(data)

Question 2.1

df_D5minus <- data[startsWith(colnames(data), "D5minus")&endsWith(colnames(data), "(normalized)")]
df_D5plus <- data[startsWith(colnames(data), "D5plus")&endsWith(colnames(data), "(normalized)")]
df <- data.frame("D5minus"=apply(df_D5minus, 1, mean),
                 "D5plus"=apply(df_D5plus, 1, mean))
head(df)

Question 2.2

ggplot(df, aes(x=log10(1+D5plus), y=log10(1+D5minus))) + 
  geom_point()

Question 2.3

ggplot(df, aes(x=log10(1+D5plus), y=log10(1+D5minus))) + 
  geom_point() +
  geom_abline(slope=1, color="red")

Question 3.1

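# Install reshape2 only if it is not already available
if (!requireNamespace("reshape2", quietly=TRUE)) install.packages("reshape2")
library(reshape2)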

Question 3.2

filter_column <- endsWith(colnames(data), "(normalized)")
dm <- melt(as.matrix(data[filter_column]))
dm$Var1 <- as.character(dm$Var1)
dm$Var2 <- as.character(dm$Var2)
head(dm)

Question 3.3

dm$Var3 <- substr(dm$Var2, 1, nchar(dm$Var2)-nchar("X (normalized)"))
filter_gene <- c("ENSG00000151388", "ENSG00000067082", "ENSG00000105290")
dm <- dm[dm$Var1%in%filter_gene, ]
head(dm)

Question 3.4

ggplot(dm,
       mapping=aes(x=Var3, y=log10(1+value))) + 
  geom_jitter(width=0.25) +
  facet_wrap(~Var1)

Question 3.5

ggplot(dm,
       mapping=aes(x=Var3, y=log10(1+value))) + 
  geom_jitter(width=0.25) +
  facet_wrap(~Var1) +
  stat_summary(group=1, geom="line",
               color="red", fun="mean")

Question 3.6

ggplot(dm,
       mapping=aes(x=Var3, y=log10(1+value))) + 
  geom_jitter(width=0.25) +
  facet_wrap(~Var1) +
  stat_summary(group=1, geom="line",
               color="red", fun="mean") +
  theme(axis.text.x=element_text(angle=90, vjust=0.5)) +
  labs(title="Gene variation", 
       x="", 
       y="Count (log10 transformed plus a pseudo-count of 1)")

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

Wilkinson, Leland. 2005. The Grammar of Graphics. Second Edition. New York: Springer-Verlag.