How to use this document?

R commands will be presented in a gray box:

print('Hello world!')
## [1] "Hello world!"

A white box following a gray box shows R’s output. A blue box contains exercises to check that you have understood the notions:

EXERCISE

This box will contain exercises that you must do to make sure that you have understood all the notions.

Solutions to exercises will be given in a green box (they will only be available after the session).

SOLUTION

This box will contain answers to the previous exercises.

A purple box contains additional information for advanced users, so it is not necessary to read or understand it on a first reading. You will probably understand it once you have gained some experience with R.

INFO

This box will contain additional information that you do not need to understand on your first reading.

An orange box contains warnings and good practices to follow in order to use R properly.

WARNING

This box will contain warnings or good practices that you must pay attention to.

1 Introduction

1.1 What is R?

R is a programming language initially designed for statistical computing and graphics, created by Ross Ihaka and Robert Gentleman (Ihaka and Gentleman 1996).

R is an interpreted language, meaning that every script written in R needs a piece of software to run it. This software is called an interpreter. By extension, the interpreter that reads R scripts is also called R. It is freely available under the GNU General Public License for a wide variety of UNIX platforms, Windows and macOS (www.r-project.org).

1.2 How can I program in R?

An R script is just a text file (not a Word document), so any text editor can be used to write a script in the R language. Nevertheless, some programs are better suited than others because they provide features such as syntax highlighting, autocompletion, help, a debugger, and other tools that help programmers write R scripts. Such a program is called an IDE (integrated development environment) and is designed to increase the developer’s productivity.

A widely used IDE for editing R scripts is RStudio (rstudio.com). A good thing about RStudio is that it embeds an R console next to the editor. Note that RStudio Desktop relies on an R interpreter installed on your computer, so install R itself first (www.r-project.org). Once both are installed, you are ready to write R scripts or to execute R commands.

Figure 1.1 shows what this software looks like.

Figure 1.1: Screenshot of RStudio software.

Let’s start with our first exercise:

EXERCISE

On your computer, download and install R and then the RStudio Desktop software, and use it throughout this lesson to solve the exercises. If it is not your personal computer and RStudio is not available, you can go to https://rstudio.cloud and create a free account in order to use an online version of RStudio for the purpose of this lesson.

2 Basic concept

2.1 Script vs real-time mode

You can work with R in two different ways: either by writing a script and calling R to interpret it, or in an interactive mode by typing your commands in real time. Both approaches have their advantages and drawbacks depending on what you wish to do.

I think that the interactive mode is a good way to perform simple tasks, to explore your data, to decide what you wish to do with your data, or to learn and explore R. The script approach is better if you want to run the same code on different input data, as in a pipeline.

WARNING

Whichever way you use R, I recommend that you always save the commands that you used to get a result or a graphic.

In a script or in interactive mode, a command ends with the newline character “\n”, which you get when you press the return key on your keyboard. You can also write several commands on the same line, separated by the semicolon character “;”. Everything after a hash character “#” is ignored by the interpreter until the end of the line; this symbol is used to add comments to your code. Hence, if your comment spans several lines, each line must start with a hash character “#”.

1+1 # This is a comment and will not be interpreted
## [1] 2
1*3 ; 1-5
## [1] 3
## [1] -4

2.2 Object assignment

In R you will manipulate different kinds of objects, like integers, characters, tables or functions. Each object is stored in memory and must be associated with a variable name in order to be used afterward. To assign an object to a variable name, just use the <- operator:

my_variable <- 1

This operator acts from right to left: it tells R to assign the value on the right side of the arrow to the variable on the left side. In the interactive mode, typing the name of a variable prints its content on the screen:

my_variable
## [1] 1

INFO

R provides two other operators to achieve this: -> and =. The operator -> acts in the same way as the <- operator but from left to right. The operator = acts like <- at the top level but has a different meaning in other kinds of expressions. For instance:

matrix(1, ncol = 2)
##      [,1] [,2]
## [1,]    1    1

Will create a two-column matrix, and now:

matrix(1, ncol <- 2)
##      [,1]
## [1,]    1
## [2,]    1
ncol
## [1] 2

Will create a two-row matrix and a variable called ncol which evaluates to 2! What happened is that in matrix(1, ncol <- 2) R didn’t set the function argument ncol to 2, but assigned 2 to a variable called ncol and passed its value (ie 2) as the second positional argument of the matrix() function, which is the nrow argument…

To summarize, get into the habit of using the <- operator for assignment, and use the = operator only inside a function call to give a value to one of the function’s arguments.
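
As a minimal sketch of the three assignment forms described above (the variable names a, b and d are only illustrative and do not come from the original text):

```r
a <- 1      # usual right-to-left assignment
2 -> b      # same operation written from left to right
d = 3       # `=` also assigns at the top level, but `<-` is preferred
a + b + d
## [1] 6
```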

A variable name can only contain letters, digits, and the special symbols _ and . (so, no space). It cannot start with a digit or with an underscore _, and when the first symbol is a dot ., the next symbol cannot be a digit. R is case sensitive, so a variable named data is different from Data or DATA. Any alphabetical symbol can be used (even Chinese characters!); however, it is not recommended to use symbols other than the common ones (English alphabet), because this makes your code harder for other people to read (especially if they do not read Chinese…)!

福禄寿 <- 3.14
福禄寿
## [1] 3.14

INFO

You can use any variable name you want as long as it is enclosed in backquotes `:

`!%():` <- 'This is a bad name for a variable'
`!%():`
## [1] "This is a bad name for a variable"

WARNING

R contains some keywords that you cannot use to name a variable. Use ?reserved or help(reserved) to get the complete list. Briefly: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_ are reserved keywords.

2.3 Function call

A function is an object which stores a piece of code and which aims to perform a given task or computation. Moreover, a function can depend on input parameters, have options, and return an object.

INFO

A function can also update an existing object, but this is rarely the case in R, whose philosophy is to pass a copy of the input object to the function.

R recognizes a function by its variable name; to run it, you must add parentheses after the variable name. If the function needs parameters, they must be written between the parentheses.

print("Hello world")
## [1] "Hello world"

As you have understood, the print() function prints the string given as parameter on the screen.

INFO

If you give R only the name of the function (without the parentheses), then R will print some information about it, sometimes the code of the function itself.

Another useful function is the help() function, which displays the help text describing a function. As a shortcut, the operator ? placed before a function name gives the same result.

INFO

To get help on an operator, just write it between two backquote characters `:

help(`?`)

The quit() function, or its alias q(), terminates the current R session.

WARNING

When exiting, R asks you whether you want to save the current session. To avoid this question, you can use q("no") to quit without saving the current session, or q("yes") to save it. The R session is saved into the file .RData in the current directory and is automatically loaded if you run R again in the same directory (the one containing the .RData file). This file contains an image of all the objects that you generated during your work, which avoids regenerating them the next time you run R. Nevertheless, I recommend that you never save the R session, and that you do not get into the habit of doing so, but rather save your commands into a script in order to regenerate the objects and to keep a record of how you generated them. For more explanation, see Section 5.

The functions getwd() and setwd() will respectively return the absolute path to your working directory and set a new working directory by indicating a new absolute path or a relative path starting from your working directory.

The function is.function() returns TRUE if a variable contains a function.
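
The functions above can be tried with a short sketch like the following. The variable name old_dir is hypothetical, and tempdir() (a base R function that returns a temporary directory) is used here only so that the example has a directory that is sure to exist:

```r
is.function(print)   # print is a function
## [1] TRUE
is.function(1)       # a number is not a function
## [1] FALSE
old_dir <- getwd()   # remember the current working directory
setwd(tempdir())     # move to a temporary directory
setwd(old_dir)       # and come back to the initial one
```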

Table 2.1 lists the functions seen in this section.

Table 2.1: List of useful R functions.
Function        Description
print()         Print on the screen the value given as argument.
help()          Show the help page of the function given as argument.
quit() or q()   Exit from R.
getwd()         Get the working directory.
setwd()         Set the working directory to the path given as argument.
is.function()   Return TRUE when the object given as argument is a function.

2.4 Data type

Each programming language has its own set of atomic data types and data structures. Atomic data is the lowest level at which data is stored, and data structures are collections of atomic data organized in some way. Any program consists of manipulating, transforming, combining and associating these atomic data in order to implement the desired algorithm. We will present here only the most frequently used ones.

2.4.1 Atomic data

In R, the atomic data are the following:

  • logical,
  • integer,
  • double,
  • complex,
  • character,
  • and raw.

They will be discussed below, except for the raw type that is rarely used and represents binary data.

R provides a function to test whether a variable contains an atomic object. This function is is.atomic(), and its help page lists all the atomic types available in R.

is.atomic(1) ; is.atomic(help)
## [1] TRUE
## [1] FALSE
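
To see which atomic type a value has, one option is the base function typeof(), which is not introduced in the text above and is used here only as an illustration:

```r
typeof(TRUE)
## [1] "logical"
typeof(1L)
## [1] "integer"
typeof(1)
## [1] "double"
typeof(1+0i)
## [1] "complex"
typeof("a")
## [1] "character"
typeof(as.raw(1))   # as.raw() builds a raw (binary) value
## [1] "raw"
```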

2.4.1.1 Numeric data

R provides three numeric types: integer, double and complex, which respectively correspond to the set of integers (\(\mathbb{Z}\)), real numbers (\(\mathbb{R}\)) and complex numbers (\(\mathbb{C}\)). The most used one is the double type, so by default each number you write (eg 1) is considered a double. You can also use scientific notation to declare a double, eg 1e6. If you wish to specify that the number is an integer, you need to add L at the end of the number (eg 1L), and if you wish to specify a complex number, you need to add its imaginary part, indicated by i (eg 1+0i). Negative numbers are preceded by the minus sign - (eg -1). The double type has three additional special values: Inf, -Inf and NaN (for not a number).

is.double(1) ; is.integer(1L) ; is.complex(1+0i)
## [1] TRUE
## [1] TRUE
## [1] TRUE

Table 2.2 lists all available arithmetic operators.

Table 2.2: List of R arithmetic operators.
Operator   Description
+          Addition.
-          Subtraction.
*          Multiplication.
/          Division.
^ or **    Exponent.
%%         Modulus (remainder of the division). Does not work with the complex type.
%/%        Integer division. Does not work with the complex type.

Using one of these operators between two numbers of the same type returns a number of the same type, except for the operators /, ^ and **, which return a double when used with two integers. When the two numbers have different types, the result takes the most general of the two types, ie an operation involving a complex number always returns a complex number, and an operation between a double and an integer always returns a double.
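
These typing rules can be checked with a small sketch. It relies on typeof(), a base R function that returns the internal type of a value and that is an addition to the original text:

```r
typeof(1L + 2L)     # integer + integer stays integer
## [1] "integer"
typeof(7L %/% 2L)   # integer division of two integers stays integer
## [1] "integer"
typeof(1L / 2L)     # division of two integers gives a double
## [1] "double"
typeof(2L ^ 2L)     # exponentiation of two integers gives a double
## [1] "double"
typeof(1L + 2.5)    # integer and double give a double
## [1] "double"
typeof(2.5 + 1i)    # anything combined with a complex gives a complex
## [1] "complex"
```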

When an expression combines several operators, R uses priorities to resolve ambiguities. For instance, in the expression 1+3*5, should the addition or the multiplication be performed first? With the priorities used by R, the multiplication is computed first, so this expression is equivalent to 1+(3*5). To perform the addition first, you need to put it in parentheses, eg (1+3)*5, which tells R to compute the addition first and then to multiply the result by 5. Table 2.3 gives the priorities used by R for arithmetic operators. When several operators of the same priority appear in an expression, R evaluates them from left to right, except for the exponent operator, which is evaluated from right to left.

Table 2.3: Priority levels, from the highest to the lowest priority, used by R to resolve ambiguities with arithmetic operators.
Operator   Priority level
^ or **    ****
%%, %/%    ***
*, /       **
+, -       *
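
A few expressions illustrate how R applies its documented precedence rules (see ?Syntax). The unary minus used in the third line is not part of the table and is shown only as a related pitfall:

```r
1 + 3 * 5        # multiplication before addition
## [1] 16
(1 + 3) * 5      # parentheses force the addition first
## [1] 20
-2 ^ 2           # exponent before unary minus: -(2^2)
## [1] -4
2 ^ 3 ^ 2        # exponents are grouped from right to left: 2^(3^2)
## [1] 512
5 * 7 %/% 2      # %/% binds tighter than *: 5 * (7 %/% 2)
## [1] 15
(5 * 7) %/% 2
## [1] 17
```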

Some mathematical functions are available in R; Table 2.4 lists some of them.

Table 2.4: Some mathematical functions available in R.
Function                    Description
abs(x)                      Absolute value of \(x\).
sqrt(x)                     Square root of \(x\).
log(x), log2(x), log10(x)   Natural, binary and common logarithm of \(x\) (base \(e\), \(2\) and \(10\)).
log(x, b)                   Logarithm of \(x\) in base \(b\).
exp(x)                      Exponential of \(x\), ie \(e^x\).
ceiling(x)                  Get the least integer greater than or equal to \(x\), ie \(\lceil x\rceil\).
floor(x)                    Get the greatest integer less than or equal to \(x\), ie \(\lfloor x\rfloor\).
trunc(x)                    Get the integer part of \(x\), ie \([x]\).
round(x, digits=n)          Round \(x\) to \(n\) decimal places. By default, \(n=0\).
cos(x), sin(x), tan(x)      Cosine, sine and tangent of \(x\), with \(x\) expressed in radians.
acos(x), asin(x), atan(x)   Arc-cosine, arc-sine and arc-tangent of \(x\); the result is expressed in radians.
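
As a quick illustration of some functions of Table 2.4 (the input values are arbitrary, and pi is R's built-in constant):

```r
sqrt(16)
## [1] 4
abs(-3.5)
## [1] 3.5
log(100, 10)                  # logarithm of 100 in base 10
## [1] 2
floor(-1.5) ; ceiling(-1.5) ; trunc(-1.5)
## [1] -2
## [1] -1
## [1] -1
round(3.14159, digits = 2)
## [1] 3.14
cos(pi)                       # angles are expressed in radians
## [1] -1
```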

INFO

An operator is just a shortcut to call a specific (frequently used) function. To access this function, just write the operator between two backquotes:

`+`(5,2)
## [1] 7

We can also define our own operators but they must be wrapped with two % characters:

`%,%` <- function(a, b) { (a+1)*(b+1) }
2 %,% 1
## [1] 6

2.4.1.2 Character type

The character type represents textual data composed of letters, numbers, and symbols. A character string must be written between two single quotes ' or two double quotes ". Which quotation mark you choose is not very important; however, the same quotation mark must be used to open and to close the character string. Having two possibilities is useful when the character string itself contains a quotation mark: choosing the other one to delimit the string is then a good idea:

"I'm learning R"
## [1] "I'm learning R"

Alternatively, you can put a backslash \ before a quotation mark to tell R not to treat it as the closing quotation mark:

'I\'m learning R'
## [1] "I'm learning R"

R does not provide dedicated operators for the character data type, but there are plenty of useful functions. Table 2.5 lists some of them.

Table 2.5: Some functions that can be used with character data type.
Function                           Description
paste0(a, b, ...)                  Get a character string corresponding to the concatenation of a, b and so on.
paste(a, b, ..., sep=x)            Get a character string corresponding to the concatenation of a, b and so on, each separated by x. By default x is a blank space (" ").
print(a)                           Print the content of a.
nchar(a)                           Get the number of symbols in the character string.
tolower(a), toupper(a)             Get a character string with all letters in lower or upper case respectively.
chartr(x, y, a)                    Get a character string where each character of x found in a is replaced by the character at the same position in y.
substr(a, x, y)                    Get the substring of a starting at position x and ending at position y (the first symbol is in position 1).
startsWith(x, a), endsWith(x, a)   Return TRUE if x starts (resp. ends) with a, FALSE otherwise.
2.4.1.3 Logical type

The logical type represents boolean data type, ie true or false. In R, these values are in upper case: TRUE or FALSE.

WARNING

In R there also exist global variables T and F that are respectively initialized to TRUE and FALSE, the latter being reserved keywords. This means that TRUE and FALSE will always denote the true and false boolean values, whereas T and F can be overwritten with another value, including their opposite!

So, to prevent errors, YOU MUST NEVER USE T AND F TO REPRESENT BOOLEAN VALUES, because careless people who reuse your code may reassign these variables to other values, AND YOU MUST NEVER USE T AND F AS VARIABLES, because some careless developers have used them as boolean values in code that you will inevitably rely on to perform specific computations.
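
A minimal sketch of the problem, to be tried in a throwaway session only. The base functions isTRUE() and rm() are used here to inspect and then remove the variable:

```r
T <- FALSE       # legal: T is an ordinary variable, not a reserved keyword
isTRUE(T)        # any code relying on T now silently misbehaves
## [1] FALSE
rm(T)            # remove our variable; the built-in T is visible again
isTRUE(T)
## [1] TRUE
```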

With this data type, you can perform boolean algebra. Table 2.6 lists all available logical operators.

Table 2.6: R boolean operators from the highest to the lowest priority.
Operator   Priority   Description
!          ***        Unary operator not, eg !TRUE is equivalent to FALSE.
&, &&      **         Binary operator and, returns FALSE when at least one of the two operands is FALSE.
|, ||      *          Binary operator or, returns TRUE when at least one of the two operands is TRUE.

The operators | (resp. &) and || (resp. &&) do not behave in exactly the same way; we will see an example of their differences later.

Table 2.7 presents the operators used to compare two values of any type. All these operators have the same priority and return a logical value depending on the outcome of the requested comparison. A comparison is performed after any arithmetic operators and before any boolean operators.

Table 2.7: R comparison operators. All operators have the same priority and return a logical value depending on the outcome of the test.
Operator   Description
<          Less than.
<=         Less than or equal to.
>          Greater than.
>=         Greater than or equal to.
==         Exactly equal to.
!=         Not exactly equal to.
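
The order in which arithmetic, comparison and boolean operators are evaluated can be illustrated with a few arbitrary expressions:

```r
1 + 1 == 2              # arithmetic first, then comparison
## [1] TRUE
3 > 2 & 2 > 5           # comparisons first, then the boolean operator
## [1] FALSE
TRUE | FALSE & FALSE    # & before |: TRUE | (FALSE & FALSE)
## [1] TRUE
(TRUE | FALSE) & FALSE  # parentheses change the result
## [1] FALSE
!TRUE | TRUE            # ! before |: (!TRUE) | TRUE
## [1] TRUE
"apple" != "banana"     # comparisons also work on character values
## [1] TRUE
```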

WARNING

Due to floating-point precision, it is strongly recommended never to use the equality operators (== and !=) to compare double or complex values. Instead, compare them up to a tolerance, ie test whether the absolute difference between the two values is small enough. For instance:

(0.1 + 0.05) == 0.15 ; abs((0.1 + 0.05) - 0.15) < 1e-6
## [1] FALSE
## [1] TRUE
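
As an alternative to writing the tolerance by hand, base R offers all.equal(), which is not mentioned in the original text. It returns TRUE when two values are equal up to a small tolerance and a descriptive message otherwise, which is why it is usually wrapped in isTRUE():

```r
isTRUE(all.equal(0.1 + 0.05, 0.15))   # equal up to a small tolerance
## [1] TRUE
isTRUE(all.equal(1, 1.1))             # clearly different values
## [1] FALSE
```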

Table 2.8 lists some of the functions used with logical values.

Table 2.8: Some functions that can be used with logical data type.
Function   Description
all(a)     Return TRUE if all values of vector a are TRUE.
any(a)     Return TRUE if at least one value of vector a is TRUE.

2.4.1.4 Conversion between types

R provides many functions to convert one atomic data type into another, when the conversion is possible. Table 2.9 lists the functions used to convert a value to another type, as well as functions to test whether a value belongs to a specific type.

Table 2.9: R functions to convert or test for a specified data type.
Function                                     Description
as.integer(x), as.double(x), as.complex(x)   Convert x into an integer, double or complex data type.
as.character(x)                              Convert x into a character data type.
as.logical(x)                                Convert x into a logical data type.
is.integer(x), is.double(x), is.complex(x)   Return TRUE if x is an integer, double or complex type, FALSE otherwise.
is.character(x)                              Return TRUE if x is a character value type, FALSE otherwise.
is.logical(x), isFALSE(x), isTRUE(x)         Return TRUE if x is, respectively, a logical value, the single value FALSE or the single value TRUE; FALSE otherwise.
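
A short sketch of these conversion and test functions on arbitrary values. suppressWarnings() is a base R function used here only to hide the warning raised by the impossible conversion:

```r
as.integer("42")
## [1] 42
as.integer(3.9)                       # the decimal part is dropped, not rounded
## [1] 3
as.double(TRUE)
## [1] 1
as.character(1.5)
## [1] "1.5"
as.logical("TRUE") ; as.logical(0)
## [1] TRUE
## [1] FALSE
suppressWarnings(as.integer("abc"))   # impossible conversion gives NA
## [1] NA
is.character("42") ; is.integer(42)   # 42 without L is a double
## [1] TRUE
## [1] FALSE
```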

2.4.2 Data structures

Data structures are collections of atomic data or even of other data structures. We present here only the most frequently used ones.

2.4.2.1 Vector

A vector is the data structure that represents a collection of data of the same type. It is the primary data structure in R, so even a single value is seen by R as a vector. For instance:

is.vector(1)
## [1] TRUE

You can use the function c() to declare a vector containing multiple elements. Here is a vector of logical data:

c(TRUE, FALSE, FALSE, FALSE, TRUE)
## [1]  TRUE FALSE FALSE FALSE  TRUE

As a vector can only contain data of a single type, this function will also try to coerce the elements to a common type if they differ. Coercion goes from the lowest to the highest type, in the order logical, integer, double, complex, character:

c(TRUE, 2L, 3 + 1i, 'k')
## [1] "TRUE" "2"    "3+1i" "k"
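
The intermediate steps of this coercion chain can be observed one pair at a time. typeof() is a base R function, added here for illustration, that returns the type of the resulting vector:

```r
c(TRUE, 2L)            # logical and integer give an integer vector
## [1] 1 2
c(1L, 2.5)             # integer and double give a double vector
## [1] 1.0 2.5
typeof(c(1.5, 2+1i))   # double and complex give a complex vector
## [1] "complex"
typeof(c(1L, "a"))     # anything with a character gives a character vector
## [1] "character"
```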

The function length() returns the number of elements of a given vector:

length(c("A", "B", "C", "D", "E"))
## [1] 5

Combining vectors

The function c() means combine, so it can also be used to combine many vectors into a single one:

fruits <- c("Apple", "oranges", "banana")
vegetables <- c("cabbage", "spinach", "tomatoes")
all_basket_items <- c(fruits, vegetables)
all_basket_items
## [1] "Apple"    "oranges"  "banana"   "cabbage"  "spinach"  "tomatoes"

Sequential vectors

The function seq(from, to, by) returns a sequential vector starting at from, ending at to, with consecutive elements separated by the interval by.

seq(2, 100, 2)
##  [1]   2   4   6   8  10  12  14  16  18  20  22  24  26  28  30  32  34  36  38
## [20]  40  42  44  46  48  50  52  54  56  58  60  62  64  66  68  70  72  74  76
## [39]  78  80  82  84  86  88  90  92  94  96  98 100

The by argument can be omitted and then is set to 1:

seq(1, 50)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

A shortcut command using the : operator gives the same result:

1:50
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

With the : operator, the step can only be 1 (or -1 when the first value is greater than the second one).

INFO

The value shown in brackets at the beginning of each output line is the position (the first position is 1) of the first value printed on that line.

1:100
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

The seq() function has more options, however its main usage has been presented. See the help of this function for further explanation.
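
As a brief sketch of a few of these other usages (length.out is an argument of seq() documented in its help page; the values are arbitrary):

```r
seq(0, 1, by = 0.25)         # a non-integer step
## [1] 0.00 0.25 0.50 0.75 1.00
seq(10, 1, by = -3)          # a negative step counts down
## [1] 10  7  4  1
seq(0, 1, length.out = 5)    # ask for 5 evenly spaced values instead of a step
## [1] 0.00 0.25 0.50 0.75 1.00
5:1                          # the : operator counts down when from > to
## [1] 5 4 3 2 1
```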

Repeating vectors

The function rep() is used to repeat each element of a vector and/or the vector itself as many times as indicated:

rep(1:10, times=10)
##   [1]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
##  [26]  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
##  [51]  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5
##  [76]  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10  1  2  3  4  5  6  7  8  9 10
rep(1:10, each=2)
##  [1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10
rep(1:10, each=2, times=5)
##   [1]  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10  1  1  2  2  3
##  [26]  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10  1  1  2  2  3  3  4  4  5  5
##  [51]  6  6  7  7  8  8  9  9 10 10  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8
##  [76]  8  9  9 10 10  1  1  2  2  3  3  4  4  5  5  6  6  7  7  8  8  9  9 10 10

Named vectors

We can associate to each element of a vector a name using the names() function:

vect <- 1:10
names(vect) <- paste0('elt:', 1:10)
vect
##  elt:1  elt:2  elt:3  elt:4  elt:5  elt:6  elt:7  elt:8  elt:9 elt:10 
##      1      2      3      4      5      6      7      8      9     10

The same function is used to retrieve the names of the elements as a vector:

names(vect)
##  [1] "elt:1"  "elt:2"  "elt:3"  "elt:4"  "elt:5"  "elt:6"  "elt:7"  "elt:8" 
##  [9] "elt:9"  "elt:10"

Filtering in a vector

You can access an element of a vector by giving its position (the first element is at position 1) in brackets after the vector name, eg vect[10]. By extension, if you give a vector of positions, you will get all the elements at these positions:

vect[c(2,5)]
## elt:2 elt:5 
##     2     5

If the vector is named, you can also access its elements by their names:

vect[c("elt:5", "elt:10")]
##  elt:5 elt:10 
##      5     10

Leave the brackets empty to get a copy of the vector:

vect[]
##  elt:1  elt:2  elt:3  elt:4  elt:5  elt:6  elt:7  elt:8  elt:9 elt:10 
##      1      2      3      4      5      6      7      8      9     10

When an element is not found, NA is returned:

vect["elt:50"]
## <NA> 
##   NA
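
Two other common ways of filtering, not covered in the text above, are negative positions (to exclude elements) and logical vectors (to keep the elements for which the value is TRUE). A minimal sketch with a hypothetical vector v:

```r
v <- c(10, 20, 30, 40)           # hypothetical example vector
v[-1]                            # everything except the first element
## [1] 20 30 40
v[c(TRUE, FALSE, TRUE, FALSE)]   # keep positions flagged TRUE
## [1] 10 30
v[v > 15]                        # keep values greater than 15
## [1] 20 30 40
```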

Operators and functions on a vector

The operators and the functions seen previously can be applied to a vector:

c(1,2) + c(2,3)
## [1] 3 5
chartr("i", "I", c("R", "is", "incredible"))
## [1] "R"          "Is"         "IncredIble"

If two vectors do not have the same length, R recycles the shortest one by concatenating it with itself until it reaches the length of the longest one (with a warning when the longest length is not a multiple of the shortest one). This is useful if you wish to apply an operation to each element of a vector.

c(1,2,3) + c(2,3)
## [1] 3 5 5
c(1,2,3) + 1
## [1] 2 3 4
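
A few more cases of this recycling behaviour, chosen here so that the longer length is a multiple of the shorter one:

```r
1:6 + c(10, 20)    # c(10, 20) is reused three times
## [1] 11 22 13 24 15 26
1:6 * 2            # a single value is reused for every element
## [1]  2  4  6  8 10 12
c(1, 2, 3) > 2     # comparison operators follow the same rule
## [1] FALSE FALSE  TRUE
```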

WARNING

Do you remember that we have two distinct operators for each of the boolean operations and (& and &&) and or (| and ||)? When the vectors have one element, these operators give the same result. The difference arises when a vector has more than one element. In this case, the & and | operators act like the + or * operators: they are applied to each element of the vectors and return a vector as long as the longest one. The && and || operators, in contrast, are meant for single values and return a vector of length 1: older versions of R silently used only the first element of each vector, whereas since R 4.3.0 giving them a vector with more than one element is an error. To compare the first elements, select them explicitly:

c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
## [1]  TRUE  TRUE  TRUE FALSE
c(TRUE, TRUE, FALSE, FALSE)[1] || c(TRUE, FALSE, TRUE, FALSE)[1]
## [1] TRUE

2.4.2.2 Matrix and Array

The array data structure stands for a vector with several dimensions, and a matrix is a two-dimensional array. As matrices and arrays are like vectors, they can only store data of a single type.

matrix(1:100, nrow=10)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1   11   21   31   41   51   61   71   81    91
##  [2,]    2   12   22   32   42   52   62   72   82    92
##  [3,]    3   13   23   33   43   53   63   73   83    93
##  [4,]    4   14   24   34   44   54   64   74   84    94
##  [5,]    5   15   25   35   45   55   65   75   85    95
##  [6,]    6   16   26   36   46   56   66   76   86    96
##  [7,]    7   17   27   37   47   57   67   77   87    97
##  [8,]    8   18   28   38   48   58   68   78   88    98
##  [9,]    9   19   29   39   49   59   69   79   89    99
## [10,]   10   20   30   40   50   60   70   80   90   100
mat <- matrix(1:100, ncol=10, byrow=TRUE)
mat
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100
arr <- array(1:100, dim=c(10,5,2))
arr
## , , 1
## 
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1   11   21   31   41
##  [2,]    2   12   22   32   42
##  [3,]    3   13   23   33   43
##  [4,]    4   14   24   34   44
##  [5,]    5   15   25   35   45
##  [6,]    6   16   26   36   46
##  [7,]    7   17   27   37   47
##  [8,]    8   18   28   38   48
##  [9,]    9   19   29   39   49
## [10,]   10   20   30   40   50
## 
## , , 2
## 
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]   51   61   71   81   91
##  [2,]   52   62   72   82   92
##  [3,]   53   63   73   83   93
##  [4,]   54   64   74   84   94
##  [5,]   55   65   75   85   95
##  [6,]   56   66   76   86   96
##  [7,]   57   67   77   87   97
##  [8,]   58   68   78   88   98
##  [9,]   59   69   79   89   99
## [10,]   60   70   80   90  100

We can also use the [ ] operator to access elements, and names() to name elements:

mat[c(3,2), c(7,3)]
##      [,1] [,2]
## [1,]   27   23
## [2,]   17   13
arr[c(3,4), c(3,2), c(1,2)]
## , , 1
## 
##      [,1] [,2]
## [1,]   23   13
## [2,]   24   14
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   73   63
## [2,]   74   64
names(mat) <- paste0("elt:", 1:100)
mat
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##  [1,]    1    2    3    4    5    6    7    8    9    10
##  [2,]   11   12   13   14   15   16   17   18   19    20
##  [3,]   21   22   23   24   25   26   27   28   29    30
##  [4,]   31   32   33   34   35   36   37   38   39    40
##  [5,]   41   42   43   44   45   46   47   48   49    50
##  [6,]   51   52   53   54   55   56   57   58   59    60
##  [7,]   61   62   63   64   65   66   67   68   69    70
##  [8,]   71   72   73   74   75   76   77   78   79    80
##  [9,]   81   82   83   84   85   86   87   88   89    90
## [10,]   91   92   93   94   95   96   97   98   99   100
## attr(,"names")
##   [1] "elt:1"   "elt:2"   "elt:3"   "elt:4"   "elt:5"   "elt:6"   "elt:7"  
##   [8] "elt:8"   "elt:9"   "elt:10"  "elt:11"  "elt:12"  "elt:13"  "elt:14" 
##  [15] "elt:15"  "elt:16"  "elt:17"  "elt:18"  "elt:19"  "elt:20"  "elt:21" 
##  [22] "elt:22"  "elt:23"  "elt:24"  "elt:25"  "elt:26"  "elt:27"  "elt:28" 
##  [29] "elt:29"  "elt:30"  "elt:31"  "elt:32"  "elt:33"  "elt:34"  "elt:35" 
##  [36] "elt:36"  "elt:37"  "elt:38"  "elt:39"  "elt:40"  "elt:41"  "elt:42" 
##  [43] "elt:43"  "elt:44"  "elt:45"  "elt:46"  "elt:47"  "elt:48"  "elt:49" 
##  [50] "elt:50"  "elt:51"  "elt:52"  "elt:53"  "elt:54"  "elt:55"  "elt:56" 
##  [57] "elt:57"  "elt:58"  "elt:59"  "elt:60"  "elt:61"  "elt:62"  "elt:63" 
##  [64] "elt:64"  "elt:65"  "elt:66"  "elt:67"  "elt:68"  "elt:69"  "elt:70" 
##  [71] "elt:71"  "elt:72"  "elt:73"  "elt:74"  "elt:75"  "elt:76"  "elt:77" 
##  [78] "elt:78"  "elt:79"  "elt:80"  "elt:81"  "elt:82"  "elt:83"  "elt:84" 
##  [85] "elt:85"  "elt:86"  "elt:87"  "elt:88"  "elt:89"  "elt:90"  "elt:91" 
##  [92] "elt:92"  "elt:93"  "elt:94"  "elt:95"  "elt:96"  "elt:97"  "elt:98" 
##  [99] "elt:99"  "elt:100"

WARNING

When filtering a matrix or an array, R always returns a data structure with the smallest possible number of dimensions. This can lead to errors: we often use variables to filter a matrix or an array without knowing their content in advance, and the subsequent code assumes that the result has the same kind of data structure, which may not be the case.

is.matrix(mat[ 3, c(2,4)])
## [1] FALSE

To prevent such errors, use the drop=FALSE option of the [ ] operator:

is.matrix(mat[ 3, c(2,4), drop=FALSE])
## [1] TRUE

The dim() function returns the dimension of the data structure:

dim(mat)
## [1] 10 10
dim(arr)
## [1] 10  5  2

You can also use the ncol() and nrow() functions on a matrix-like object.

INFO

An array or a matrix is a vector with a dimension attribute. Hence:

vect <- 1:100
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] TRUE
## [1] FALSE
## [1] FALSE
dim(vect) <- c(10, 10)
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] FALSE
## [1] TRUE
## [1] TRUE
dim(vect) <- NULL
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] TRUE
## [1] FALSE
## [1] FALSE

The colnames(), rownames() and dimnames() functions are used to get or set the names of the columns, of the rows and of all the dimensions, respectively, of an array-like structure.

names(mat) <- NULL
colnames(mat) <- paste0("col:", 1:10)
mat
##       col:1 col:2 col:3 col:4 col:5 col:6 col:7 col:8 col:9 col:10
##  [1,]     1     2     3     4     5     6     7     8     9     10
##  [2,]    11    12    13    14    15    16    17    18    19     20
##  [3,]    21    22    23    24    25    26    27    28    29     30
##  [4,]    31    32    33    34    35    36    37    38    39     40
##  [5,]    41    42    43    44    45    46    47    48    49     50
##  [6,]    51    52    53    54    55    56    57    58    59     60
##  [7,]    61    62    63    64    65    66    67    68    69     70
##  [8,]    71    72    73    74    75    76    77    78    79     80
##  [9,]    81    82    83    84    85    86    87    88    89     90
## [10,]    91    92    93    94    95    96    97    98    99    100
dimnames(mat)
## [[1]]
## NULL
## 
## [[2]]
##  [1] "col:1"  "col:2"  "col:3"  "col:4"  "col:5"  "col:6"  "col:7"  "col:8" 
##  [9] "col:9"  "col:10"
dimnames(arr) <- list(paste0("row:",1:10), paste0("col:",1:5), paste0("pro:",1:2))
arr
## , , pro:1
## 
##        col:1 col:2 col:3 col:4 col:5
## row:1      1    11    21    31    41
## row:2      2    12    22    32    42
## row:3      3    13    23    33    43
## row:4      4    14    24    34    44
## row:5      5    15    25    35    45
## row:6      6    16    26    36    46
## row:7      7    17    27    37    47
## row:8      8    18    28    38    48
## row:9      9    19    29    39    49
## row:10    10    20    30    40    50
## 
## , , pro:2
## 
##        col:1 col:2 col:3 col:4 col:5
## row:1     51    61    71    81    91
## row:2     52    62    72    82    92
## row:3     53    63    73    83    93
## row:4     54    64    74    84    94
## row:5     55    65    75    85    95
## row:6     56    66    76    86    96
## row:7     57    67    77    87    97
## row:8     58    68    78    88    98
## row:9     59    69    79    89    99
## row:10    60    70    80    90   100

2.4.2.3 List

A list is a collection of different data structures:

list(c("A","B"), matrix(1:10, nrow=2))
## [[1]]
## [1] "A" "B"
## 
## [[2]]
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10
list("Elt1"=c("A","B"), "Elt2"=c(TRUE,FALSE))
## $Elt1
## [1] "A" "B"
## 
## $Elt2
## [1]  TRUE FALSE

You can access, define or redefine the names of its elements with the names() function. The [ ] operator allows you to access one or more elements. Nevertheless, the [ ] operator applied to a list always returns a list, even if you ask for only one element:

my_list <- list("A"=c("A","B"), "B"=matrix(1:10, nrow=2))
is.list(my_list[1])
## [1] TRUE

To access a single element, you must use the [[ ]] operator, or the $ operator if the elements are named:

is.list(my_list[[1]])
## [1] FALSE
is.list(my_list$A)
## [1] FALSE
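The use of names() on a list can be sketched as follows. The list l and the names given to its elements are hypothetical and only serve the illustration:

```r
# Hypothetical list, used only for illustration
l <- list(c("A", "B"), c(TRUE, FALSE))
names(l)                         # the elements have no name yet
## NULL
names(l) <- c("Letters", "Flags")
names(l)[2] <- "Logicals"        # redefine a single name
names(l)
## [1] "Letters"  "Logicals"
l$Logicals                       # access by name with $
## [1]  TRUE FALSE
l[["Letters"]]                   # access by name with [[ ]]
## [1] "A" "B"
```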

2.4.2.4 Data frame

A data.frame is a special case of a list in the sense that each of its elements must be a vector of the same length. This allows you to create a matrix-like table whose columns are no longer constrained to be of the same type.

d <- data.frame("Number"=c(1,2,3), 
                "Character"=c("A","B","C"), 
                "Logical"=c(TRUE, TRUE, FALSE))
d

As they are like matrices, you can give row and column names to a data.frame. Row names can be accessed and set with the rownames() or row.names() functions, and column names with the colnames() or names() functions. You can then use these names to filter the data.frame as you would a matrix.

rownames(d) <- paste0('row', 1:3)
colnames(d)
## [1] "Number"    "Character" "Logical"
d
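As an illustration of filtering by names, here is a minimal sketch. It rebuilds a table equivalent to the one above under the hypothetical name d2, so that it can be run on its own:

```r
# Same content as the data.frame above, stored under a hypothetical name
d2 <- data.frame("Number" = c(1, 2, 3),
                 "Character" = c("A", "B", "C"),
                 "Logical" = c(TRUE, TRUE, FALSE),
                 stringsAsFactors = FALSE)
rownames(d2) <- paste0("row", 1:3)

d2["row2", "Character"]            # one cell, selected by row and column names
## [1] "B"
d2[c("row1", "row3"), "Number"]    # several rows of one column
## [1] 1 3
sub <- d2[, c("Number", "Logical")]   # several columns: still a data.frame
colnames(sub)
## [1] "Number"  "Logical"
```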

WARNING

With a data.frame, the [ ] operator works either like the matrix operator or like the list operator. Used with two arguments, it works like the matrix operator; used with a single argument, it works like the list operator.

Works like a matrix operator

d[3,2] ; d[ , 'Logical', drop=TRUE] ; d[, 'Logical', drop=FALSE]
## [1] "C"
## [1]  TRUE  TRUE FALSE

Works like a list operator

d['Logical'] ; d[['Logical']]
## [1]  TRUE  TRUE FALSE
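To see what each form returns, one option is to check the class of the result. This is only a sketch on a small hypothetical data.frame named d3:

```r
# Hypothetical data.frame, used only for illustration
d3 <- data.frame("Number" = c(1, 2, 3),
                 "Logical" = c(TRUE, TRUE, FALSE))

class(d3[, "Logical"])                 # matrix-like: the single column is dropped to a vector
## [1] "logical"
class(d3[, "Logical", drop = FALSE])   # matrix-like: the data.frame structure is kept
## [1] "data.frame"
class(d3["Logical"])                   # list-like with [ ]: always a data.frame
## [1] "data.frame"
class(d3[["Logical"]])                 # list-like with [[ ]]: the column itself
## [1] "logical"
```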

3 Manipulating objects

In the previous section, we have seen how data can be represented with different data types, and we presented data structures that gather data together. In this section, we will see how to manipulate these data and structures in order to perform more complex tasks.

3.1 Importing data

R can read a variety of input file formats; however, many of them are only available through specific packages. As one of the main strengths of R is the manipulation of tabular data, we present here only functions that are included in a standard R installation and that are dedicated to reading tabular data.

3.1.1 What kind of tabular format should I use?

Many tabular file formats exist. The most common one is the Excel format. Nevertheless, Excel is not the best way to share or back up your data; see (Ziemann, Eren, and El-Osta 2016).

We present here two commonly used text formats suitable for representing tabular data: CSV (comma-separated values) and TSV (tab-separated values). As they are text formats, any text editor can open, read and edit them (even Excel!). Each line represents a row of your table, and columns are separated either by a comma , (in the French-like format the separator is a semicolon ; in order to avoid a conflict with the decimal comma) or by a tabulation character \t. By convention, the extension of a CSV file is .csv and the extension of a TSV file is .tsv. However, you can use any extension you want (as they are text formats, the .txt extension is also often used). So, before importing any file into R, I recommend that you open it, or at least look at its first lines, to see which separator and which decimal point are used, and to spot any other particularity, such as the presence or absence of a header.

Example of CSV format:

## Column1,Column2
## 1,2
## 3,4

Example of TSV format:

## Column1  Column2
## 1    2
## 3    4

These two file formats (and their particularities, eg with the decimal point) can be easily imported and exported with R.

3.1.2 Reading tabular data

The generic function to import data in a tabular text format is the read.table() function. You will also find wrappers around this function to import the CSV format, read.csv() and read.csv2(), which respectively import the US CSV format (the separator is a comma , and the decimal point is a dot .) and the French-like CSV format (the separator is a semicolon ; and the decimal point is a comma ,). The read.delim() and read.delim2() functions are suitable for reading the TSV format. The first one is for US numbers (the decimal point is a dot .), the second one for French-like numbers (the decimal point is a comma ,). Don’t hesitate to use the help() function on these functions to see more particularities (with header, no header, etc.).
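The sketch below illustrates these functions on two tiny files written to a temporary location. The file contents and the variable names (csv_us, csv_fr, us, fr, gen) are hypothetical and only serve the illustration:

```r
# Write two tiny example files to a temporary location
csv_us <- tempfile(fileext = ".csv")
csv_fr <- tempfile(fileext = ".csv")
writeLines(c("Column1,Column2", "1.5,2", "3,4"), csv_us)   # US style
writeLines(c("Column1;Column2", "1,5;2", "3;4"), csv_fr)   # French-like style

us <- read.csv(csv_us)     # separator ",", decimal point "."
fr <- read.csv2(csv_fr)    # separator ";", decimal point ","
all.equal(us, fr)          # checks that both imports hold the same table

# The same import with the generic function and explicit options
gen <- read.table(csv_fr, header = TRUE, sep = ";", dec = ",")
```

Spelling out header, sep and dec with read.table() is more verbose, but it makes the assumptions about the file visible in the script.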

WARNING

When importing data with the read.table() function (or any of its wrappers), R may automatically convert character columns to the factor data type (this is the default behaviour in R versions prior to 4.0). This data type is a particular kind of integer vector that looks like a character vector and that was designed to store a finite set of values (like categories). Nevertheless, manipulating this kind of data can easily lead to mistakes.

So, I recommend that you always set the stringsAsFactors option to FALSE (unless you know what you are doing) and, if needed, convert your character vectors to factors at the last moment.

Be careful: in R version 4.0 and above, this option is set to FALSE by default, whereas it is TRUE in earlier versions. So get into the habit of specifying this option each time you use one of these functions.
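A classic example of such a mistake is sketched below: converting a factor directly to numbers returns its internal integer codes, not the values it displays. The vectors x_chr and x_fac and the small table tab are hypothetical:

```r
# Numbers stored as character, then converted to a factor
x_chr <- c("10", "20", "10")
x_fac <- factor(x_chr)

as.numeric(x_chr)                  # the values themselves
## [1] 10 20 10
as.numeric(x_fac)                  # the internal integer codes, not the values
## [1] 1 2 1
as.numeric(as.character(x_fac))    # go back through character to get the values
## [1] 10 20 10

# Import text columns as character, whatever the R version
tab <- read.csv(text = "id,group\n1,ctrl\n2,treated", stringsAsFactors = FALSE)
class(tab$group)
## [1] "character"
tab$group <- factor(tab$group)     # convert only when a factor is really needed
```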

3.2 Conditional statement

A conditional statement tells R to run one block of code or another depending on the result of an expression; it is the well-known if-then-else. Below is the syntax of this statement in R:

if (EXPRESSION) {
  CODE1
} else {
  CODE2
}

The result of EXPRESSION should be a logical value, or a value that can be converted into a logical value. If the result of EXPRESSION is TRUE, the commands of CODE1 are run; otherwise, the commands of CODE2 are run. The else block is optional: without it, no code is run when the result of EXPRESSION is FALSE. If a code block contains only one instruction, you can omit the braces.

if (257%%2==0) {
  print('Even number')
} else {
  print('Odd number')
}
## [1] "Odd number"
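The optional else block, the omission of braces and the conversion of a number into a logical value can be sketched as follows (the variable x and its value are hypothetical):

```r
x <- 4L   # hypothetical value

# No else block: nothing is run when the expression is FALSE
if (x %% 2L == 1L) {
  print('Odd number')
}

# Single instruction: the braces can be omitted
if (x %% 2L == 0L) print('Even number')
## [1] "Even number"

# A number is converted into a logical value: 0 is FALSE, any other number is TRUE
if (x %% 2L) print('Odd number') else print('Even number')
## [1] "Even number"
```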

3.3 Iterative statement

An iterative statement is used when a block of code should be run many times, sometimes depending on a value. R provides three statements to perform a loop: repeat, while and for. Any loop can be written with any of the three, so the choice is mostly a matter of the developer’s preference, although it is easier to loop over the values of a vector with the for statement.

break and next instructions

These two instructions are used inside the body of a loop, either to exit the loop (break) or to skip to the next iteration (next), which avoids running unwanted code for specific iterations.

repeat loops

The general syntax of the repeat loop is the following:

repeat {
  CODE
}

When R enters this kind of loop, it loops indefinitely until it reaches a break instruction inside the block of code. So, before running your script, make sure that this instruction is present inside the repeat loop and can be reached.

n <- 17L
repeat {
  print(n)
  if (n==1L)
    break
  else
    n <- if (n%%2L==0L) n%/%2L else 3L*n+1L
}
## [1] 17
## [1] 52
## [1] 26
## [1] 13
## [1] 40
## [1] 20
## [1] 10
## [1] 5
## [1] 16
## [1] 8
## [1] 4
## [1] 2
## [1] 1

while loops

The general syntax of the while loop is the following:

while (EXPRESSION) {
  CODE
}

First, EXPRESSION is evaluated. If it is TRUE, the block CODE is run, then EXPRESSION is evaluated again. If it is TRUE again, CODE is run again, and so on until EXPRESSION evaluates to FALSE. If EXPRESSION evaluates to FALSE the first time, the commands inside CODE are never run. The example below is equivalent to the one in the repeat section:

n <- 17L
print(n)
## [1] 17
while (n!=1L) {
  n <- if (n%%2L==0L) n%/%2L else 3L*n+1L
  print(n)
}
## [1] 52
## [1] 26
## [1] 13
## [1] 40
## [1] 20
## [1] 10
## [1] 5
## [1] 16
## [1] 8
## [1] 4
## [1] 2
## [1] 1

for loops

The general syntax of the for loop is the following:

for (VAR in VECT) {
  CODE
}

This loop runs CODE once for each value of VECT. CODE can access the current value through the VAR variable. For instance:

for (i in 1:10) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
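The break and next instructions described at the beginning of this section can be sketched in a for loop. The vector kept is a hypothetical helper that only records which values reach the end of the loop body:

```r
kept <- integer(0)   # hypothetical helper: values that are actually printed
for (i in 1:10) {
  if (i %% 2 == 0)
    next             # even value: skip the rest and go to the next iteration
  if (i > 7)
    break            # leave the loop as soon as an odd value exceeds 7
  print(i)
  kept <- c(kept, i)
}
## [1] 1
## [1] 3
## [1] 5
## [1] 7
```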

Loop versus vectorized computing

As R is designed to compute over vectors, it is often faster to use vectorized computing instead of loops: a vectorized operation is applied to all elements of a vector in a single call, the iteration being done in compiled code rather than element by element in R code. R also provides the apply(), lapply(), sapply(), vapply() and tapply() functions, which offer concise ways around explicit loops by applying a function to each element of a structure. apply() is ideally suited to arrays, including matrices. tapply() applies a function to each cell of a ragged array, that is, to groups of values of possibly different sizes defined by one or more factors. lapply(), sapply() and vapply() work on both vectors and lists: lapply() always returns a list, sapply() tries to simplify the result into a vector or a matrix, and vapply() does the same while checking that each result has the type and length you declared.

For instance:

apply(matrix(1:100, ncol=10), 1, sum)
##  [1] 460 470 480 490 500 510 520 530 540 550
sapply(1:10, function(x) x**2)
##  [1]   1   4   9  16  25  36  49  64  81 100
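To make the comparison concrete, the sketch below computes the same squares with an explicit loop, with a vectorized operator and with vapply(), then uses tapply() on groups. All variable names (x, res_loop, res_vect, res_vapply, groups) are hypothetical:

```r
x <- 1:10

# Explicit loop: one element at a time
res_loop <- numeric(length(x))
for (i in seq_along(x)) {
  res_loop[i] <- x[i]^2
}

# Vectorized: the operator is applied to the whole vector in one call
res_vect <- x^2

# vapply(): like sapply(), but the type and length of each result are declared
res_vapply <- vapply(x, function(v) v^2, numeric(1))

identical(res_loop, res_vect) ; identical(res_vect, res_vapply)
## [1] TRUE
## [1] TRUE

# tapply(): apply a function to groups of values defined by a second vector
groups <- rep(c("odd", "even"), times = 5)
tapply(x, groups, sum)
## even  odd 
##   30   25 
```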

3.4 Functions

Functions encapsulate a piece of code that performs a certain task, eg computing a mean. A function can take variables as input and can return a result or not; eg to compute a mean, you need input values (what should the mean be computed on?) and a returned value (the mean).

In R, a function is declared with the keyword function and, to be used later, it must be assigned to a variable name:

my_mean <- function(a, b) {
  return((a + b) / 2)
}
my_mean(2, 5)
## [1] 3.5

In R, the value of the last expression is automatically returned, so if there is no ambiguity you can omit the return() function.

my_mean <- function(a, b) {
  (a + b) / 2
}
my_mean(2, 5)
## [1] 3.5

You can also give a default value to an argument:

my_mean <- function(a, b, p=1) {
  (a*p + b*p) / 2
}
my_mean(2, 5)
## [1] 3.5
my_mean(c(5, 5), c(5, 5), p=c(2, 4))
## [1] 10 20

3.5 Exporting data

To save your data in a tabular text format, R provides the generic write.table() function. As with the read.table() function, you will find wrappers around this function to write CSV files: write.csv() and write.csv2(). The first one is for the US CSV format and the second one for the French CSV format. There is no wrapper function to save your data in TSV format, but you can easily do so by setting the sep option to a tabulation character "\t".

You can also save an object in a binary format that you can later load into R again. Use the saveRDS() and readRDS() functions to save a single R object into a file and load it in a later R session. By convention, the “.rds” file name extension is the most often used.

You can also save several R objects into a single file using the save() and load() functions. In this case, the most often used file name extension is “.RData”.
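The three ways of saving data can be sketched as follows. The table tab, the objects a and b and the temporary file names are hypothetical; readRDS() is the base R function that reads back a file written by saveRDS():

```r
tab <- data.frame(Gene = c("g1", "g2"), Count = c(1.5, 3))  # hypothetical table

# TSV export: write.table() with a tabulation as separator
tsv_file <- tempfile(fileext = ".tsv")
write.table(tab, tsv_file, sep = "\t", row.names = FALSE, quote = FALSE)
readLines(tsv_file)              # look at the raw content of the file

# One object in a binary file
rds_file <- tempfile(fileext = ".rds")
saveRDS(tab, rds_file)
tab_again <- readRDS(rds_file)   # the name of the restored object is free
identical(tab, tab_again)

# Several objects in one file; load() restores them under their original names
a <- 1:3 ; b <- "text"
rdata_file <- tempfile(fileext = ".RData")
save(a, b, file = rdata_file)
rm(a, b)
load(rdata_file)
```

Because load() silently recreates objects under their saved names, it can overwrite existing variables; readRDS() avoids this by letting you choose the name at assignment.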

4 Packages

Many other useful functions and data structures are available in separate packages. You can go to the Comprehensive R Archive Network (CRAN) webpage https://cran.r-project.org to see the majority of available R packages. The install.packages() function installs, in your current R installation, a package listed on this webpage. For instance, to install the tidyverse package:

install.packages('tidyverse')

and follow the instructions on your screen. To use the functions or data structures of a given package, load it with the library() function:

library(tidyverse)

Note that the name of the package must be in quotation marks with the install.packages() function, while they are not needed with the library() function.

Bioconductor https://bioconductor.org is an open source project that lists and provides a wide variety of R packages suitable for the analysis of high-throughput genomic data (Huber et al. 2015). To install a package listed on the Bioconductor website, just follow the instructions provided on the webpage of the requested package.

5 Best practices

Here is a reminder of some advice given throughout this lesson, along with some additional advice for developing in R:

  • If you have doubts about a function, don’t hesitate to use the help() function to see how it works.
  • If you want to do something but you don’t know how, search for it on Google.
  • Always save the code that generates a result or a graphic in a script, so that it can be rerun later.
  • Always use the sessionInfo() function at the end of your script to get the list of all loaded packages and their versions. This is recommended because results can sometimes differ between R or package versions. So, to be reproducible, results should be reported together with the R and package versions.
  • Never save your R session unless you know what you are doing.
  • For more useful recommendations you can read this post: https://waterdata.usgs.gov/blog/intro-best-practices/.
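As a minimal sketch of the sessionInfo() recommendation, the last lines of a script could look like this (the variable name si is hypothetical):

```r
# ... your analysis ...

# Last lines of the script: record the R and package versions
R.version.string      # version of R itself
si <- sessionInfo()   # R version, platform and loaded packages
print(si)
```

The printed content depends on your installation, which is exactly why it is worth keeping alongside your results.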

INFO

You can download a cheat sheet summarizing all basic concepts in R from here: https://github.com/rstudio/cheatsheets/blob/main/base-r.pdf.

6 Practice

EXERCISE

We will illustrate what we have learned on a published bulk RNA-seq dataset.

  1. We will use data from GSE153921, a Gene Expression Omnibus (GEO) database ID. Download the CSV file that is in the supplementary section (bottom of the webpage) and save it on your computer. This file contains the raw read counts associated with each gene of each sample. It looks as follows:

    ## ,D5minusT,D5minusA,D5minusC,D5plusT,D5plusA,D5plusB,D5shX3T,D5shX3A,D5shX3B,D5shX5T,D5shX5A,D5shX5B
    ## ENSG00000223972,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000227232,3,4,6,3,4,13,4,3,11,4,7,8
    ## ENSG00000243485,0,1,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000221311,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000237613,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000268020,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000240361,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000186092,0,0,0,0,0,0,0,0,0,0,0,0
    ## ENSG00000238009,5,6,6,3,5,11,2,4,3,3,2,3
  2. What is your current working directory? If needed, change this directory to the one where you have saved the file.

  3. Import these data into a variable named data with the read.table() function or an analogous function. Use the first column, containing the Ensembl gene ID, as row names.

  4. What is the type of the data object? If it is not a matrix object, convert it into a matrix object.

  5. How many genes does this dataset contain?

  6. Does data contain NA values? If yes, replace them with a count of 0.

  7. Replace all column names of the data matrix using the following convention: use a letter to indicate an experimental condition and a number to indicate a replicate, eg A1 represents the first replicate of the experimental condition A. We see here that we have 4 conditions and 3 replicates for each one.

  8. What are the counts of each sample for the gene ENSG00000131591?

  9. What is the total count of the B3 sample? (You can use the sum() function, use help() if needed).

  10. What is the total count of each sample? (Maybe the help page of the sum() function suggests other useful functions…)

  11. Remove genes from the data matrix that have no read count (ie 0) in each sample.

  12. (a) Write a function called size_factor that takes a numeric vector as input and returns a numeric value corresponding to the sum of all values in the input vector divided by 1 million.

      (b) Create a new matrix called data_norm where each count corresponds to the count of the data matrix divided by the size_factor value obtained for the given sample.

  13. For each experimental condition, compute from the normalized counts a “virtual” sample that contains the mean expression of each gene across all samples of that condition, and store these “virtual” samples in a matrix called data_mean. You can name each “virtual” sample after its experimental condition. (Use loops or iterative functions.)

  14. The differential expression of a given gene between two samples \(A\) and \(B\) can be measured with the fold change. A fold change is given by the ratio of the expression of the gene in \(A\) to that in \(B\); we then apply a base 2 logarithm to get the log2 fold change. As a count of 0 can be problematic (a division by 0 or a log of 0 is not defined), we will add a pseudo-count of 1 to all counts before computing the log2 fold change. Compute the log2 fold change for every possible pairwise combination and store the results in a matrix called data_fc. (To help you generate all the possible pairwise combinations, you can use the combn() function.)

  15. As we are only interested in differentially expressed genes, filter the data_fc table in order to keep only genes that are differentially expressed in at least one comparison. A gene is differentially expressed if its log2 fold change is greater than 1 (over-expressed) or lower than -1 (under-expressed).

  16. Generate a final data.frame called final that contains the following columns: gene ID, raw counts, normalized counts, mean counts of our “virtual” samples, and the log2 fold change of each comparison. To distinguish between the raw count columns and the normalized ones, you can add the (raw) and (normalized) suffixes after each sample name. (To do that you can use the merge() function.)

  17. Export the last generated table into a TSV (tab-separated values) file. For French users, you can export numbers with a comma , as the decimal point (to easily open the file with Excel).

SOLUTION

Question 1

Easy…

Question 2

getwd()
## [1] "/Users/jung/Desktop/01_IntroductionToR"
setwd('/Users/jung/Download/')

Question 3

data <- read.csv('GSE153921_Andrew_XPO7_merged_gene_counts.csv.gz',
                 stringsAsFactors=FALSE, row.names=1) 
head(data)

Question 4

class(data)
## [1] "data.frame"
data <- as.matrix(data)

Question 5

nrow(data) ; dim(data)[1]
## [1] 62161
## [1] 62161

Question 6

any(is.na(data)) ; anyNA(data)
## [1] FALSE
## [1] FALSE
# If there are NA values, replace them with 0
data[is.na(data)] <- 0

Question 7

colnames(data) <- paste0(rep(c("A", "B", "C", "D"), each=3), 
                         rep(1:3, times=4))
colnames(data)
##  [1] "A1" "A2" "A3" "B1" "B2" "B3" "C1" "C2" "C3" "D1" "D2" "D3"

Question 8

data['ENSG00000131591', ]
##  A1  A2  A3  B1  B2  B3  C1  C2  C3  D1  D2  D3 
## 329 445 292 318 243 478 375 287 507 319 338 460

Question 9

sum(data[, 'B3'])
## [1] 49844320

Question 10

colSums(data)
##       A1       A2       A3       B1       B2       B3       C1       C2 
## 31467250 39359869 33485923 37152895 28534141 49844320 37188199 33303397 
##       C3       D1       D2       D3 
## 60397769 30623686 33215566 37472948

Question 11

data <- data[apply(data, 1, sum)>0, ]
nrow(data)
## [1] 31672

Question 12.a

size_factor <- function(x) { sum(x)/1e6 }

Question 12.b

data_norm <- apply(data, 2, function(x) x/size_factor(x) )

Question 13

exp_cond <- c('A', 'B', 'C', 'D')
data_mean <- data.frame(row.names=rownames(data_norm))
for(e in exp_cond) {
  filter <- startsWith(colnames(data_norm), e)
  data_mean[[e]] <- apply(data_norm[,filter], 1, mean)
}
data_mean <- as.matrix(data_mean)
head(data_mean)
##                           A           B          C          D
## ENSG00000227232 0.125381114 0.160580807 0.12658928 0.18494992
## ENSG00000243485 0.008468863 0.000000000 0.00000000 0.00000000
## ENSG00000238009 0.163504887 0.158887742 0.07451969 0.07941129
## ENSG00000233750 0.041733504 0.006687489 0.01000899 0.00000000
## ENSG00000237683 0.091297278 0.064922138 0.07690194 0.14045040
## ENSG00000239906 0.018423295 0.000000000 0.00000000 0.00000000

Question 14

# log2 fold change with a pseudo-count of 1; note that b is the numerator here
log2_FC <- function(a, b) log2((1+b)/(1+a))
combinaison <- combn(colnames(data_mean), 2)
data_fc <- apply(combinaison, 
                 2,
                 function(x) log2_FC(data_mean[,x[1]], data_mean[,x[2]]))
colnames(data_fc) <- apply(combinaison, 2, paste, collapse=" vs ")
rownames(data_fc) <- rownames(data_mean)
head(data_fc)
##                       A vs B      A vs C      A vs D       B vs C      B vs D
## ENSG00000227232  0.044433319  0.00154799  0.07441243 -0.042885328  0.02997911
## ENSG00000243485 -0.012166541 -0.01216654 -0.01216654  0.000000000  0.00000000
## ENSG00000238009 -0.005736448 -0.11478535 -0.10823259 -0.109048900 -0.10249614
## ENSG00000233750 -0.049370365 -0.04461812 -0.05898625  0.004752243 -0.00961589
## ENSG00000237683 -0.035296205 -0.01915726  0.06355955  0.016138942  0.09885575
## ENSG00000239906 -0.026337324 -0.02633732 -0.02633732  0.000000000  0.00000000
##                      C vs D
## ENSG00000227232  0.07286444
## ENSG00000243485  0.00000000
## ENSG00000238009  0.00655276
## ENSG00000233750 -0.01436813
## ENSG00000237683  0.08271681
## ENSG00000239906  0.00000000

Question 15

data_fc <- data_fc[apply(abs(data_fc)>1, 1, any), ]
head(data_fc)
##                     A vs B     A vs C     A vs D       B vs C      B vs D
## ENSG00000217801  0.6672408  0.7931069  1.0181703  0.125866105  0.35092951
## ENSG00000162576 -1.1093824 -1.1370220 -0.8119073 -0.027639624  0.29747506
## ENSG00000162591 -2.5640394 -2.6170423 -2.4602345 -0.053002935  0.10380484
## ENSG00000049246 -1.6904630 -1.6224116 -1.7347604  0.068051423 -0.04429738
## ENSG00000116285  1.8872724  1.8459151  1.7619417 -0.041357261 -0.12533072
## ENSG00000171621  1.6634546  1.6664540  1.8179546  0.002999404  0.15450002
##                      C vs D
## ENSG00000217801  0.22506340
## ENSG00000162576  0.32511468
## ENSG00000162591  0.15680778
## ENSG00000049246 -0.11234880
## ENSG00000116285 -0.08397346
## ENSG00000171621  0.15150062

Question 16

final <- merge(data, data_norm, 
               by="row.names", suffixes=c(' (raw)', ' (normalized)'))
rownames(final) <- final[[1]]
final <- final[-1]
final <- merge(final, data_mean, 
               by="row.names")
rownames(final) <- final[[1]]
colnames(final)[1] <- 'GeneID'
final <- merge(final, data_fc, 
               by="row.names", all.x=TRUE)
final <- final[-1]
head(final)

Question 17

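# row.names=FALSE: the gene IDs are already stored in the GeneID column
write.table(final, "GSE153921_dge.tsv", sep="\t", row.names=FALSE)
# Alternative for French users: comma as decimal point
write.table(final, "GSE153921_dge.tsv", sep="\t", dec=",", row.names=FALSE)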

References

Huber, W., V. J. Carey, R. Gentleman, S. Anders, M. Carlson, B. S. Carvalho, H. C. Bravo, et al. 2015. “Orchestrating High-Throughput Genomic Analysis with Bioconductor.” Nature Methods 12 (2): 115–21. http://www.nature.com/nmeth/journal/v12/n2/full/nmeth.3252.html.
Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3): 299–314. http://www.jstor.org/stable/1390807.
Ziemann, Mark, Yotam Eren, and Assam El-Osta. 2016. “Gene Name Errors Are Widespread in the Scientific Literature.” Genome Biology 17: 177. https://doi.org/10.1186/s13059-016-1044-7.