R commands will be presented in a gray box:
print('Hello world!')
## [1] "Hello world!"
A white box following a gray box shows R's output. A blue box contains exercises to check that you have understood the notions presented:
EXERCISE
This box will contain exercises that you must do to make sure that you have understood all the notions.
Solutions to the exercises will be given in a green box (they will only be available after the session).
SOLUTION
This box will contain answers to the previous exercises.
A purple box contains additional information (for advanced users), which you do not need to read or understand on a first reading. You will probably understand it once you have gained some experience with R.
INFO
This box will contain additional information that you do not need to understand on a first reading.
An orange box contains warnings to keep in mind in order to use R well and follow good practices.
WARNING
This box will contain warnings or good practices that you must pay attention to.
R is a programming language initially designed for statistical computing and graphics, created by Ross Ihaka and Robert Gentleman (Ihaka and Gentleman 1996).
R is an interpreted language, meaning that every script written in R needs a program to run it. This program is called an interpreter. By extension, the interpreter that reads R scripts is also called R; it is freely available under the GNU General Public License for a wide variety of UNIX platforms, Windows and macOS (www.r-project.org).
An R script is just a text file (not a Word document), so any text editor can be used to write a script in the R language. Nevertheless, some programs are better than others because they provide features such as syntax highlighting, autocompletion, help, a debugger, and other tools that help programmers write R scripts. Such a program is called an IDE (integrated development environment) and is designed to increase the developer's productivity.
A widely used IDE for editing R scripts is RStudio (rstudio.com). RStudio provides an integrated R console, but note that the R interpreter itself must be installed separately, before RStudio. Once both are installed, you are ready to write R scripts or to execute R commands.
Figure 1.1 shows what this software looks like.
Figure 1.1: Screenshot of RStudio software.
Let’s start with our first exercise:
EXERCISE
On your computer, download and install the RStudio Desktop software and use it throughout to solve the exercises. If it is not your personal computer and RStudio is not available, you can go to https://rstudio.cloud and create a free account in order to get an online version of RStudio that you can use for the purpose of this lesson.
You can work with R in two different ways: either by writing a script and calling R to interpret it, or in interactive mode by typing your commands in real time. Both approaches have their advantages and drawbacks depending on what you wish to do.
I think that the interactive mode is a good way to perform simple tasks, to explore your data, to decide what you wish to do with your data, or to learn and explore R. The script approach is better if you want to run the same code on different input data, as in a pipeline.
WARNING
Whichever way you use R, I recommend that you always save the commands that you have used to obtain a result or a graphic.
In a script or in interactive mode, a command ends with the newline character "\n" that you get when you press the Return key on your keyboard. You can also write several commands on the same line, separated by a semicolon character ";". Everything after a hash character "#" is ignored by the interpreter until the end of the line; this symbol is used to add comments to your code. Hence, if your comment spans several lines, each line must start with a hash character "#".
1+1 # This is a comment and will not be interpreted
## [1] 2
1*3 ; 1-5
## [1] 3
## [1] -4
In R you will manipulate different kinds of objects, such as integers, characters, tables or functions. Each object can be stored in memory and must therefore be associated with a variable name in order to be used afterward.
To assign an object to a variable name, just use the <- operator:
my_variable <- 1
This operator acts from right to left: it tells R to assign the value on the right side of the arrow to the variable on the left side. In interactive mode, typing the name of a variable prints its content to the screen:
my_variable
## [1] 1
INFO
R provides two other operators to achieve this: -> and =. The operator -> acts in the same way as the <- operator but from left to right. The operator = acts like <- at the top level but has a different meaning in other kinds of expressions. For instance:
matrix(1, ncol = 2)
## [,1] [,2]
## [1,] 1 1
This creates a two-column matrix, whereas:
matrix(1, ncol <- 2)
## [,1]
## [1,] 1
## [2,] 1
ncol
## [1] 2
This creates a two-row matrix and a variable called ncol whose value is 2! What happened is that in matrix(1, ncol <- 2), R did not set the function argument ncol to 2, but assigned 2 to a variable called ncol and passed its value (ie 2) as the second argument of the matrix() function, which is the nrow argument…
To summarize, get into the habit of using the <- operator for assignments, and use the = operator only inside function calls to give a value to a function's argument.
A variable name can only contain letters, digits, or the special symbols _ or . (so, no spaces). It cannot start with a digit or an underscore _, and when the first symbol is a dot ., the next symbol cannot be a digit. R is case sensitive, so a variable named data is different from Data or DATA. Any alphabetical symbol can be used (even Chinese characters!); however, it is not recommended to use symbols other than the common ones (English alphabet), because this makes your code harder for other people to read (especially if they do not read Chinese…)!
福禄寿 <- 3.14
福禄寿
## [1] 3.14
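The naming rules above can be checked programmatically. The sketch below relies on the base function make.names(), which turns any string into a syntactically valid name; the helper is_valid_name() is hypothetical (it is not part of R) and simply tests whether make.names() leaves the string unchanged:

```r
# Hypothetical helper (not part of base R): a string is a valid variable
# name when make.names() does not need to modify it.
is_valid_name <- function(x) {
  make.names(x) == x
}

is_valid_name("my_variable")  # valid
is_valid_name(".hidden")      # valid: a dot followed by a letter
is_valid_name("1abc")         # invalid: starts with a digit
is_valid_name("_abc")         # invalid: starts with an underscore
is_valid_name(".2way")        # invalid: a dot followed by a digit
is_valid_name("my var")       # invalid: contains a space
```

This is only a convenience for exploring the rules; in practice you rarely need to test names yourself, since R reports a syntax error when a name is invalid.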
INFO
You can use any variable name you want as long as it is enclosed in backquotes `:
`!%():` <- 'This is a bad name for a variable'
`!%():`
## [1] "This is a bad name for a variable"
WARNING
R contains some keywords that you cannot use to name a variable. Use ?reserved or help(reserved) to get the complete list. Briefly: if, else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, NA, NA_integer_, NA_real_, NA_complex_, NA_character_ are reserved keywords.
A function is an object that stores a piece of code and performs a given task or computation. A function can depend on input parameters, have options, and return an object.
INFO
A function can also update an existing object, but this is rarely the case in R where its philosophy is to pass a copy of the input object to the function.
R recognizes a function by its variable name; to run it, you must add parentheses after the variable name. If the function needs parameters, they must be given between the parentheses.
print("Hello world")
## [1] "Hello world"
As you may have guessed, the print() function prints the string given as a parameter to the screen.
INFO
If you give R only the name of a function (without the parentheses), then R prints some information about it, sometimes the code of the function itself.
Another useful function is the help() function, which displays the help text describing a function. As a shortcut, you can put the ? operator before a function name to get the same result.
INFO
To get help on an operator, just enclose it between two backquote characters `:
help(`?`)
The quit() function, or its alias q(), terminates the current R session.
WARNING
When exiting, R asks whether you want to save the current session. To avoid this question, you can use q("no") to exit without saving the current session, or q("yes") to save it. The R session is saved into the file .RData in the current directory and is automatically loaded if you run R again in the same directory (the one containing the .RData file). This file contains an image of all the objects that you have generated during your work, which avoids regenerating them the next time you run R. Nevertheless, I recommend that you never save the R session, nor get into the habit of doing so, but rather save your commands into a script in order to regenerate the objects and to keep a record of how you generated them. For more explanation, see 5.
The functions getwd() and setwd() respectively return the absolute path of your working directory and set a new working directory, given either as an absolute path or as a path relative to the current working directory.
The function is.function() returns TRUE if a variable contains a function.
Table 2.1 lists functions seen in this section.
Function | Description |
---|---|
print() | Print to the screen the value given as an argument. |
help() | Show the help of the function whose name is given as a parameter. |
quit() or q() | Exit from R. |
getwd() | Get the working directory. |
setwd() | Set the working directory to the path given as a parameter. |
is.function() | Return TRUE when the variable given as a parameter is a function. |
Each programming language has its own set of atomic data types and data structures. Atomic data types are the lowest level at which data is stored, and data structures are collections of atomic data organized in some way. All code consists in manipulating, transforming, combining and associating these atomic data in order to implement the desired algorithm. We will present here only the most commonly used ones.
In R, the atomic data types are the following: logical, integer, double, complex, character, raw. They will be discussed below, except for the raw type, which is rarely used and represents binary data.
R provides a function to test whether a variable contains an atomic object. This function is is.atomic(), and its help page lists all the atomic types available in R.
is.atomic(1) ; is.atomic(help)
## [1] TRUE
## [1] FALSE
R provides three kinds of numeric type: integer, double and complex, which respectively correspond to the sets of integers (\(\mathbb{Z}\)), real numbers (\(\mathbb{R}\)) and complex numbers (\(\mathbb{C}\)). The most commonly used one is the double type, so, by default, every number you write (eg 1) is considered a double. You can also use scientific notation to declare a double, eg 1e6. If you wish to specify that a number is an integer, you need to add L at the end of the number (eg 1L), and if you wish to specify a complex, you need to add its imaginary part, indicated by i (eg 1+0i). Negative numbers are preceded by the minus sign - (eg -1). The double type has three special values: Inf, -Inf and NaN (for not a number).
is.double(1) ; is.integer(1L) ; is.complex(1+0i)
## [1] TRUE
## [1] TRUE
## [1] TRUE
Table 2.2 lists all available arithmetic operators.
Operator | Description |
---|---|
+ | Addition. |
- | Subtraction. |
* | Multiplication. |
/ | Division. |
^ or ** | Exponent. |
%% | Modulus (remainder of the division). Does not work with the complex type. |
%/% | Integer division. Does not work with the complex type. |
Using one of these operators between two numbers of the same type returns a number of the same type, except for the operators /, ^ and **, which return a double when used with two integers. Using these operators with two different number types always returns the most general one: an operation involving a complex returns a complex, and an operation between a double and an integer returns a double.
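These rules can be verified with the base function typeof(), which returns the type of an object as a character string. The following lines are a small illustrative sketch, not an exhaustive check:

```r
typeof(1L + 2L)    # "integer": two integers give an integer
typeof(7L %/% 2L)  # "integer": integer division keeps the integer type
typeof(7L %% 2L)   # "integer"
typeof(1L / 2L)    # "double": / on two integers returns a double
typeof(2L ^ 2L)    # "double": ^ on two integers returns a double
typeof(1L + 1.5)   # "double": integer combined with double
typeof(1.5 + 1i)   # "complex": double combined with complex
```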
When an expression combines several operators, R uses priorities to resolve ambiguities. For instance, in the expression 1+3*5, should the addition or the multiplication be performed first? With the priorities used by R, the multiplication is computed first, so this expression is equivalent to 1+(3*5). To perform the addition first, you must enclose it in parentheses, eg (1+3)*5, which tells R to compute the addition first and then multiply the result by 5. Table 2.3 gives the priorities used by R for arithmetic operators. When several operators of the same priority appear in an expression, R resolves the expression from left to right, except for the exponent operator, which is resolved from right to left.
Operator | Priority level |
---|---|
^ or ** | **** |
%% , %/% | *** |
* , / | ** |
+ , - | * |
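A few expressions make these priorities concrete. Note that, according to R's ?Syntax help page, both %% and %/% have a higher priority than * and /, and that ^ is evaluated from right to left. The values in the comments follow from these rules:

```r
1 + 3 * 5       # 16: the multiplication is performed first
(1 + 3) * 5     # 20: parentheses force the addition first
2 * 7 %/% 2     # 6:  %/% is performed first, ie 2 * (7 %/% 2)
(2 * 7) %/% 2   # 7
7 %% 3 * 2      # 2:  %% is performed first, ie (7 %% 3) * 2
10 - 4 - 3      # 3:  same priority, resolved from left to right
2 ^ 3 ^ 2       # 512: ^ is resolved from right to left, ie 2 ^ (3 ^ 2)
-2 ^ 2          # -4: ^ is performed before the unary minus
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.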
Some mathematical functions are available in R; Table 2.4 lists some of them.
Function | Description |
---|---|
abs(x) | Absolute value of \(x\). |
sqrt(x) | Square root of \(x\). |
log(x) , log2(x) , log10(x) | Natural, binary and common logarithm of \(x\) (base \(e\), \(2\) and \(10\)). |
log(x, b) | Logarithm of \(x\) in base \(b\). |
exp(x) | Exponential of \(x\), ie \(e^x\). |
ceiling(x) | Get the least integer greater than or equal to \(x\), ie \(\lceil x\rceil\). |
floor(x) | Get the greatest integer less than or equal to \(x\), ie \(\lfloor x\rfloor\). |
trunc(x) | Get the integer part of \(x\), ie \([x]\). |
round(x, digits=n) | Round the number to \(n\) decimal places. By default, \(n=0\). |
cos(x) , sin(x) , tan(x) | Cosine, sine and tangent of \(x\), with \(x\) expressed in radians. |
acos(x) , asin(x) , atan(x) | Arc-cosine, arc-sine and arc-tangent of \(x\); the result is expressed in radians. |
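The rounding functions of Table 2.4 differ mainly on negative numbers, which is easy to overlook. The short sketch below uses an arbitrary value x chosen for illustration:

```r
x <- -2.7

ceiling(x)  # -2: least integer greater than or equal to x
floor(x)    # -3: greatest integer less than or equal to x
trunc(x)    # -2: the fractional part is dropped
round(x)    # -3: nearest integer

round(3.14159, digits = 2)  # 3.14
sqrt(16)                    # 4
log(8, 2)                   # 3: logarithm of 8 in base 2
cos(pi)                     # -1: the angle is given in radians
```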
INFO
An operator is just a shortcut to call a specific (often used) function. To access this function, enclose the operator in backquotes:
`+`(5,2)
## [1] 7
We can also define our own operators, but their names must be wrapped between two % characters:
`%,%` <- function(a, b) { (a+1)*(b+1) }
2 %,% 1
## [1] 6
The character type represents textual data composed of letters, numbers, and symbols. A character string must be enclosed either in single quotes ' or in double quotes ". Which quotation mark you choose is not very important; however, the same quotation mark must be used to open and to close the character string. Having two possibilities is useful when the character string itself contains a quotation mark: in that case, using the other one to delimit the character string is a good idea:
"I'm learning R"
## [1] "I'm learning R"
Alternatively, you can put a backslash \ before a quotation mark to tell R not to treat it as the closing quotation mark:
'I\'m learning R'
## [1] "I'm learning R"
R does not provide specific operators for the character data type, but there are plenty of useful functions. Table 2.5 lists some of them.
Function | Description |
---|---|
paste0(a, b, ...) | Get a character corresponding to the concatenation of a and b and so on. |
paste(a, b, ..., sep=x) | Get a character corresponding to the concatenation of a and b and so on, each separated by x. By default x is a blank space (" "). |
print(a) | Print the content of a. |
nchar(a) | Get the number of symbols in the character. |
tolower(a) , toupper(a) | Get a character with all letters in lower or upper case respectively. |
chartr(x, y, a) | Get a character where every occurrence in a of a symbol of x is replaced by the symbol at the same position in y. |
substr(a, x, y) | Get the substring of a starting at position x and ending at position y (the first symbol is in position 1). |
startsWith(x, a) , endsWith(x, a) | Return TRUE if x starts (resp. ends) with a, FALSE otherwise. |
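Here is a short sketch of the functions of Table 2.5 applied to an arbitrary example string; the variable name word is chosen only for illustration:

```r
word <- "incredible"

paste0("R is ", word)              # "R is incredible"
paste("R", "is", word, sep = "_")  # "R_is_incredible"
nchar(word)                        # 10
toupper(word)                      # "INCREDIBLE"
chartr("ie", "IE", word)           # "IncrEdIblE": i becomes I and e becomes E
substr(word, 3, 6)                 # "cred": symbols at positions 3 to 6
startsWith(word, "in")             # TRUE
endsWith(word, "in")               # FALSE
```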
The logical type represents the boolean data type, ie true or false. In R, these values are written in upper case: TRUE or FALSE.
WARNING
R also provides the global variables T and F, which are respectively initialized to TRUE and FALSE. Unlike T and F, TRUE and FALSE are reserved keywords. This means that TRUE and FALSE will always denote the true and false boolean values, whereas T and F can be overwritten with another value, including the opposite one!
So, to prevent errors, YOU MUST NEVER USE T AND F TO REPRESENT BOOLEAN VALUES, because someone reusing your code may have reassigned these variables to other values, AND YOU MUST NEVER USE T AND F AS VARIABLE NAMES, because some developers have used them as boolean values in code that you will inevitably rely on to perform specific computations.
With this data type, you can perform boolean algebra. Table 2.6 lists all available logical operators.
Operator | Priority | Description |
---|---|---|
! | *** | Unary operator not, eg !TRUE is equivalent to FALSE. |
& , && | ** | Binary operator and, returns FALSE when at least one of the two variables is FALSE. |
\| , \|\| | * | Binary operator or, returns TRUE when at least one of the two variables is TRUE. |
The operators | (resp. &) and || (resp. &&) do not behave in exactly the same way; we will see an example of their differences later.
Table 2.7 presents the operators used to compare two values of any type. All these operators have the same priority and return a logical value indicating whether the comparison holds. A comparison is performed after any arithmetic operators and before any boolean operators.
Operator | Description |
---|---|
< | Lower than. |
<= | Lower than or equal to. |
> | Greater than. |
>= | Greater than or equal to. |
== | Exactly equal to. |
!= | Not exactly equal to. |
WARNING
Due to floating-point precision, it is strongly recommended never to use the equality operators (== and !=) to compare double or complex values. Instead, check that their difference is smaller than a small tolerance. For instance:
(0.1 + 0.05) == 0.15 ; abs((0.1 + 0.05) - 0.15) < 1e-6
## [1] FALSE
## [1] TRUE
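The tolerance-based comparison above can also be written with the base function all.equal(), which tests for near equality. Since all.equal() returns a description of the difference (not FALSE) when the values differ, it is usually wrapped in isTRUE(). The helper near() below is hypothetical and only packages the manual test shown above:

```r
x <- 0.1 + 0.05

x == 0.15                   # FALSE: exact comparison fails
isTRUE(all.equal(x, 0.15))  # TRUE: equal up to a small tolerance

# Hypothetical helper wrapping the manual comparison; `tol` is the
# largest absolute difference that is still considered "equal".
near <- function(a, b, tol = 1e-6) {
  abs(a - b) < tol
}
near(x, 0.15)               # TRUE
near(x, 0.16)               # FALSE
```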
Table 2.8 lists some functions used with logical values.
Function | Description |
---|---|
all(a) | Return TRUE if all values of the vector a are TRUE. |
any(a) | Return TRUE if a has at least one value equal to TRUE. |
R provides many functions to convert one atomic data type into another, when the conversion is possible. Table 2.9 lists functions used to convert a value from one type into another, as well as functions used to test whether a value belongs to a specific type.
Function | Description |
---|---|
as.integer(x) , as.double(x) , as.complex(x) | Convert x into an integer, double or complex data type. |
as.character(x) | Convert x into a character data type. |
as.logical(x) | Convert x into a logical data type. |
is.integer(x) , is.double(x) , is.complex(x) | Return TRUE if x is of integer, double or complex type, FALSE otherwise. |
is.character(x) | Return TRUE if x is of character type, FALSE otherwise. |
is.logical(x) , isFALSE(x) , isTRUE(x) | Return TRUE if x is respectively of logical type, the single value FALSE, or the single value TRUE; return FALSE otherwise. |
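The following lines sketch how the functions of Table 2.9 behave on a few arbitrary values, including cases where the conversion cannot be performed and NA is returned:

```r
as.integer("42")      # 42
as.integer(3.9)       # 3: the fractional part is dropped
as.double("1e3")      # 1000
as.character(12.5)    # "12.5"
as.logical("TRUE")    # TRUE
as.logical(0)         # FALSE: zero is FALSE, any other number is TRUE
as.logical("maybe")   # NA: this string cannot be interpreted as a logical

# A failed numeric conversion returns NA and emits a warning,
# silenced here with suppressWarnings().
suppressWarnings(as.integer("abc"))  # NA

is.integer(42)         # FALSE: 42 is a double, 42L would be an integer
isTRUE(c(TRUE, TRUE))  # FALSE: isTRUE() only accepts a single TRUE value
```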
Data structures are collections of atomic data or even of other data structures. We present here only the most commonly used ones.
A vector is the data structure that represents a collection of data of the same type. It is the primary data structure in R, so even each single value is viewed by R as a vector. For instance:
is.vector(1)
## [1] TRUE
You can use the function c() to declare a vector containing multiple elements. Here is a vector of logical data:
c(TRUE, FALSE, FALSE, FALSE, TRUE)
## [1] TRUE FALSE FALSE FALSE TRUE
As a vector can only contain data of a single type, this function will also try to coerce the elements to a common type if they differ. Coercion goes from lower to higher types: from logical to integer to double to complex to character:
c(TRUE, 2L, 3 + 1i, 'k')
## [1] "TRUE" "2" "3+1i" "k"
The function length() returns the number of elements of a given vector:
length(c("A", "B", "C", "D", "E"))
## [1] 5
The name of the function c() stands for combine, so it can also be used to combine several vectors into a single one:
fruits <- c("Apple", "oranges", "banana")
vegetables <- c("cabbage", "spinach", "tomatoes")
all_basket_items <- c(fruits, vegetables)
all_basket_items
## [1] "Apple" "oranges" "banana" "cabbage" "spinach" "tomatoes"
The function seq(from, to, by) returns a sequence vector starting at from, ending at to, and whose elements are separated by an interval of by.
seq(2, 100, 2)
## [1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38
## [20] 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76
## [39] 78 80 82 84 86 88 90 92 94 96 98 100
The by argument can be omitted, in which case it is set to 1:
seq(1, 50)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
A shortcut using the : operator gives the same result:
1:50
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
With the : operator, the step can only be 1.
INFO
The value in brackets at the beginning of each output line indicates the position (the first position is 1) of the first value printed on that line.
1:100
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
The seq() function has more options, but its main usage has been presented here. See the help of this function for further explanation.
The function rep() is used to repeat each element of a vector and/or the vector itself as many times as indicated:
rep(1:10, times=10)
## [1] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
## [26] 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
## [51] 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5
## [76] 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
rep(1:10, each=2)
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
rep(1:10, each=2, times=5)
## [1] 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 1 1 2 2 3
## [26] 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 1 1 2 2 3 3 4 4 5 5
## [51] 6 6 7 7 8 8 9 9 10 10 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8
## [76] 8 9 9 10 10 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10
We can associate a name with each element of a vector using the names() function:
vect <- 1:10
names(vect) <- paste0('elt:', 1:10)
vect
## elt:1 elt:2 elt:3 elt:4 elt:5 elt:6 elt:7 elt:8 elt:9 elt:10
## 1 2 3 4 5 6 7 8 9 10
The same function is used to retrieve the names of the elements as a vector:
names(vect)
## [1] "elt:1" "elt:2" "elt:3" "elt:4" "elt:5" "elt:6" "elt:7" "elt:8"
## [9] "elt:9" "elt:10"
You can access an element of a vector by indicating its position (the first element is at position 1) in brackets after the vector name, eg vect[10]. By extension, if you give a vector of positions, you will get all the elements at those positions:
vect[c(2,5)]
## elt:2 elt:5
## 2 5
If the vector is named, you can also access its elements by name:
vect[c("elt:5", "elt:10")]
## elt:5 elt:10
## 5 10
Leave the brackets empty to get a copy of the vector:
vect[]
## elt:1 elt:2 elt:3 elt:4 elt:5 elt:6 elt:7 elt:8 elt:9 elt:10
## 1 2 3 4 5 6 7 8 9 10
When an element is not found, NA is returned:
vect["elt:50"]
## <NA>
## NA
The operators seen previously can be applied to vectors, as can the functions:
c(1,2) + c(2,3)
## [1] 3 5
chartr("i", "I", c("R", "is", "incredible"))
## [1] "R" "Is" "IncredIble"
If two vectors do not have the same length, R recycles the shorter one by concatenating it with itself until it has the same length as the longer one (with a warning when the longer length is not a multiple of the shorter one). This is useful if you wish to apply the same operation to each element of a vector.
c(1,2,3) + c(2,3)
## [1] 3 5 5
c(1,2,3) + 1
## [1] 2 3 4
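The recycling rule can be explored further with the sketch below. It also captures, with the base function tryCatch(), the warning that R emits when the length of the longer vector is not a multiple of the length of the shorter one; the variable names are chosen only for illustration:

```r
# Lengths 4 and 2: the shorter vector is used exactly twice, no warning.
c(1, 2, 3, 4) + c(10, 20)   # 11 22 13 24

# Recycling also applies to comparisons and to other operators.
c(1, 5, 10) > 4             # FALSE TRUE TRUE

# Lengths 3 and 2: the result is still computed, but R emits a warning.
# Here the warning is intercepted and its message is kept as a string.
msg <- tryCatch(
  c(1, 2, 3) + c(2, 3),
  warning = function(w) conditionMessage(w)
)
msg
```

Treating this warning as a sign of a probable bug is usually a good habit, since silent partial recycling is rarely what you intended.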
WARNING
Do you remember that we have two distinct operators to perform the boolean operations and (& and &&) and or (| and ||)? When the vectors have a single element, these operators return the same result. The difference arises when a vector has more than one element. In this case, the & and | operators act like the + or * operators: they are applied to each element of the vectors and return a vector of the same length as the longer one. The && and || operators, in contrast, always return a vector of length one: older versions of R used only the first element of each vector (as in the output shown below), whereas since R 4.3.0 giving them a vector with more than one element is an error:
c(TRUE, TRUE, FALSE, FALSE) | c(TRUE, FALSE, TRUE, FALSE)
## [1] TRUE TRUE TRUE FALSE
c(TRUE, TRUE, FALSE, FALSE) || c(TRUE, FALSE, TRUE, FALSE)
## [1] TRUE
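When a single logical value is needed from whole vectors (for instance in a condition), one portable approach is to combine the element-wise operators with any() or all(). The sketch below also shows that && and || stop evaluating as soon as the result is known, so their right-hand side is sometimes never evaluated; the vectors x and y are arbitrary examples:

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

x & y        # TRUE FALSE FALSE FALSE: element-wise and
any(x & y)   # TRUE: at least one position where both are TRUE
all(x | y)   # FALSE: the last position is FALSE in both vectors

# With && and ||, the right-hand side is evaluated only if needed,
# so the calls to stop() below are never executed.
FALSE && stop("never evaluated")   # FALSE
TRUE || stop("never evaluated")    # TRUE
```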
The array data structure represents vectors with several dimensions, and a matrix is a two-dimensional array. As matrix and array objects are like vectors, they can only store data of a single type.
matrix(1:100, nrow=10)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 11 21 31 41 51 61 71 81 91
## [2,] 2 12 22 32 42 52 62 72 82 92
## [3,] 3 13 23 33 43 53 63 73 83 93
## [4,] 4 14 24 34 44 54 64 74 84 94
## [5,] 5 15 25 35 45 55 65 75 85 95
## [6,] 6 16 26 36 46 56 66 76 86 96
## [7,] 7 17 27 37 47 57 67 77 87 97
## [8,] 8 18 28 38 48 58 68 78 88 98
## [9,] 9 19 29 39 49 59 69 79 89 99
## [10,] 10 20 30 40 50 60 70 80 90 100
mat <- matrix(1:100, ncol=10, byrow=TRUE)
mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
arr <- array(1:100, dim=c(10,5,2))
arr
## , , 1
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
##
## , , 2
##
## [,1] [,2] [,3] [,4] [,5]
## [1,] 51 61 71 81 91
## [2,] 52 62 72 82 92
## [3,] 53 63 73 83 93
## [4,] 54 64 74 84 94
## [5,] 55 65 75 85 95
## [6,] 56 66 76 86 96
## [7,] 57 67 77 87 97
## [8,] 58 68 78 88 98
## [9,] 59 69 79 89 99
## [10,] 60 70 80 90 100
We can also use the [ ] operator to access elements, and names() to name them:
mat[c(3,2), c(7,3)]
## [,1] [,2]
## [1,] 27 23
## [2,] 17 13
arr[c(3,4), c(3,2), c(1,2)]
## , , 1
##
## [,1] [,2]
## [1,] 23 13
## [2,] 24 14
##
## , , 2
##
## [,1] [,2]
## [1,] 73 63
## [2,] 74 64
names(mat) <- paste0("elt:", 1:100)
mat
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
## attr(,"names")
## [1] "elt:1" "elt:2" "elt:3" "elt:4" "elt:5" "elt:6" "elt:7"
## [8] "elt:8" "elt:9" "elt:10" "elt:11" "elt:12" "elt:13" "elt:14"
## [15] "elt:15" "elt:16" "elt:17" "elt:18" "elt:19" "elt:20" "elt:21"
## [22] "elt:22" "elt:23" "elt:24" "elt:25" "elt:26" "elt:27" "elt:28"
## [29] "elt:29" "elt:30" "elt:31" "elt:32" "elt:33" "elt:34" "elt:35"
## [36] "elt:36" "elt:37" "elt:38" "elt:39" "elt:40" "elt:41" "elt:42"
## [43] "elt:43" "elt:44" "elt:45" "elt:46" "elt:47" "elt:48" "elt:49"
## [50] "elt:50" "elt:51" "elt:52" "elt:53" "elt:54" "elt:55" "elt:56"
## [57] "elt:57" "elt:58" "elt:59" "elt:60" "elt:61" "elt:62" "elt:63"
## [64] "elt:64" "elt:65" "elt:66" "elt:67" "elt:68" "elt:69" "elt:70"
## [71] "elt:71" "elt:72" "elt:73" "elt:74" "elt:75" "elt:76" "elt:77"
## [78] "elt:78" "elt:79" "elt:80" "elt:81" "elt:82" "elt:83" "elt:84"
## [85] "elt:85" "elt:86" "elt:87" "elt:88" "elt:89" "elt:90" "elt:91"
## [92] "elt:92" "elt:93" "elt:94" "elt:95" "elt:96" "elt:97" "elt:98"
## [99] "elt:99" "elt:100"
WARNING
When subsetting a matrix or array, R always returns a data structure with the smallest possible number of dimensions. This can lead to errors: we often use variables to subset a matrix or an array without knowing their content in advance, and the subsequent code assumes that the result has the same kind of data structure when in fact it does not.
is.matrix(mat[ 3, c(2,4)])
## [1] FALSE
To prevent such errors, use the drop=FALSE option of the [ ] operator:
is.matrix(mat[ 3, c(2,4), drop=FALSE])
## [1] TRUE
The dim() function returns the dimensions of the data structure:
dim(mat)
## [1] 10 10
dim(arr)
## [1] 10 5 2
You can also use the ncol() and nrow() functions on a matrix-like object.
INFO
An array or a matrix is a vector with a dimension attribute. Hence:
vect <- 1:100
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] TRUE
## [1] FALSE
## [1] FALSE
dim(vect) <- c(10, 10)
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] FALSE
## [1] TRUE
## [1] TRUE
dim(vect) <- NULL
is.vector(vect) ; is.matrix(vect) ; is.array(vect)
## [1] TRUE
## [1] FALSE
## [1] FALSE
The colnames(), rownames() and dimnames() functions are respectively used to set the names of the columns, rows and dimensions of an array-like structure.
names(mat) <- NULL
colnames(mat) <- paste0("col:", 1:10)
mat
## col:1 col:2 col:3 col:4 col:5 col:6 col:7 col:8 col:9 col:10
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
dimnames(mat)
## [[1]]
## NULL
##
## [[2]]
## [1] "col:1" "col:2" "col:3" "col:4" "col:5" "col:6" "col:7" "col:8"
## [9] "col:9" "col:10"
dimnames(arr) <- list(paste0("row:",1:10), paste0("col:",1:5), paste0("pro:",1:2))
arr
## , , pro:1
##
## col:1 col:2 col:3 col:4 col:5
## row:1 1 11 21 31 41
## row:2 2 12 22 32 42
## row:3 3 13 23 33 43
## row:4 4 14 24 34 44
## row:5 5 15 25 35 45
## row:6 6 16 26 36 46
## row:7 7 17 27 37 47
## row:8 8 18 28 38 48
## row:9 9 19 29 39 49
## row:10 10 20 30 40 50
##
## , , pro:2
##
## col:1 col:2 col:3 col:4 col:5
## row:1 51 61 71 81 91
## row:2 52 62 72 82 92
## row:3 53 63 73 83 93
## row:4 54 64 74 84 94
## row:5 55 65 75 85 95
## row:6 56 66 76 86 96
## row:7 57 67 77 87 97
## row:8 58 68 78 88 98
## row:9 59 69 79 89 99
## row:10 60 70 80 90 100
A list is a collection of elements that can be of different data structures:
list(c("A","B"), matrix(1:10, nrow=2))
## [[1]]
## [1] "A" "B"
##
## [[2]]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
list("Elt1"=c("A","B"), "Elt2"=c(TRUE,FALSE))
## $Elt1
## [1] "A" "B"
##
## $Elt2
## [1] TRUE FALSE
You can access, define or redefine the names of its elements with the names() function. The [ ] operator allows you to access one or more elements. Nevertheless, the [ ] operator applied to a list always returns a list, even if you ask for only one element:
my_list <- list("A"=c("A","B"), "B"=matrix(1:10, nrow=2))
is.list(my_list[1])
## [1] TRUE
To access a single element, you must use the [[ ]] operator, or the $ operator if the elements are named:
is.list(my_list[[1]])
## [1] FALSE
is.list(my_list$A)
## [1] FALSE
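The same operators can also be used to add, replace or remove elements. The sketch below reuses the my_list object defined above (it is recreated here so that the example is self-contained); the element name C is chosen only for illustration:

```r
my_list <- list("A" = c("A", "B"), "B" = matrix(1:10, nrow = 2))

names(my_list)               # "A" "B"
my_list[["A"]][2]            # "B": second value of the element named A

my_list$C <- c(TRUE, FALSE)  # add a new element named C
length(my_list)              # 3

my_list[["A"]] <- NULL       # assigning NULL removes the element
names(my_list)               # "B" "C"
```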
A data.frame is a special case of a list in which every element must be a vector of the same length. This allows you to create a table similar to a matrix, but whose columns are no longer constrained to be of the same type.
d <- data.frame("Number"=c(1,2,3),
"Character"=c("A","B","C"),
"Logical"=c(TRUE, TRUE, FALSE))
d
Like matrices, data.frame objects can have row and column names. Row names can be accessed and set with the rownames() or row.names() functions, and column names with the colnames() or names() functions. You can then use these names to subset the data.frame as with a matrix.
rownames(d) <- paste0('row', 1:3)
colnames(d)
## [1] "Number" "Character" "Logical"
d
WARNING
With a data.frame, the [ ] operator works either as on a matrix or as on a list. When used with two arguments, it works like the matrix operator; when used with a single argument, it works like the list operator.
Works like a matrix operator
d[3,2] ; d[ , 'Logical', drop=TRUE] ; d[, 'Logical', drop=FALSE]
## [1] "C"
## [1] TRUE TRUE FALSE
Works like a list operator
d['Logical'] ; d[['Logical']]
## [1] TRUE TRUE FALSE
In the previous section, we saw how data can be represented with different data types and presented the data structures that gather data together. In this section, we will see how to manipulate these data and structures in order to do more complex tasks.
R can read a variety of input file formats, although many of them require specific packages. As one of the main advantages of R is the manipulation of tabular data, we present here only the functions that ship with base R and that are dedicated to reading tabular data.
Many tabular file formats exist. The most common one is the Excel format. Nevertheless, Excel is not the best way to share or back up your data, see (Ziemann, Eren, and El-Osta 2016).
We present here two commonly used text formats suitable for tabular data: CSV (comma-separated values) and TSV (tab-separated values). As they are text formats, any text editor can open, read and edit them (even Excel!). Each line represents a row of your table, and columns are separated either by a comma , (the French-like format uses a semicolon ; instead, to avoid a conflict with the decimal comma) or by a tab character \t. By convention, the extension of a CSV file is .csv and the extension of a TSV file is .tsv. However, you can use any extension you want (as they are text formats, the .txt extension is also often used). So before importing any file into R, I recommend that you open it, or look at its first lines, to see which separator and decimal mark are used and whether it has any other particularities, such as the presence of a header.
Example of CSV format:
## Column1,Column2
## 1,2
## 3,4
Example of TSV format:
## Column1 Column2
## 1 2
## 3 4
These two file formats (and their particularities, e.g. the decimal mark) can easily be imported and exported with R.
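The advice of looking at the first lines of a file before importing it can also be followed from R itself. The sketch below assumes a small French-like CSV file that is written to a temporary location only for the illustration (csv_file and first_lines are hypothetical names).

```r
# Hypothetical file created only for this illustration
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Column1;Column2", "1,5;2", "3;4,25"), csv_file)

# Read only the first two lines, as plain text
first_lines <- readLines(csv_file, n = 2)
first_lines
## [1] "Column1;Column2" "1,5;2"
```

Here the output shows a header, a semicolon as separator and a comma as decimal mark, which is enough to choose the right import function and options.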
The generic function to import data in a tabular text format is the read.table() function. There are also wrappers around this function to import the CSV format, read.csv() and read.csv2(), which respectively import the US CSV format (the separator is a comma , and the decimal mark a dot .) and the French-like CSV format (the separator is a semicolon ; and the decimal mark a comma ,). The read.delim() and read.delim2() functions are suitable for the TSV format. The first one is for US numbers (the decimal mark is a dot .), the second one for French-like numbers (the decimal mark is a comma ,). Don’t hesitate to call the help() function on these functions to see more options (header, no header, etc.).
WARNING
In R versions before 4.0, importing data with the read.table() function (or any of its wrappers) automatically converts character columns to the factor data type. This data type is a particular kind of integer vector that looks like a character vector and was designed to store a finite set of values (like categories). Nevertheless, manipulating this kind of data can easily lead to mistakes.
So, I recommend that you always set the stringsAsFactors option to FALSE (unless you know what you are doing) and, if needed, convert your character vector into a factor at the last moment.
Be aware that in R version 4.0 and above this option is FALSE by default, whereas it is TRUE in earlier versions. So get into the habit of specifying this option each time you use one of these functions.
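To make the import functions more concrete, here is a minimal sketch. The three small files are hypothetical and written to temporary locations only for the illustration; the variable names (us, fr, tb, gen) are also arbitrary.

```r
# Hypothetical files, one per format
us_csv <- tempfile(fileext = ".csv")
fr_csv <- tempfile(fileext = ".csv")
tsv    <- tempfile(fileext = ".tsv")
writeLines(c("Name,Value", "a,1.5", "b,2.5"), us_csv)
writeLines(c("Name;Value", "a;1,5", "b;2,5"), fr_csv)
writeLines(c("Name\tValue", "a\t1.5", "b\t2.5"), tsv)

us <- read.csv(us_csv, stringsAsFactors = FALSE)    # comma separator, dot decimal
fr <- read.csv2(fr_csv, stringsAsFactors = FALSE)   # semicolon separator, comma decimal
tb <- read.delim(tsv, stringsAsFactors = FALSE)     # tab separator, dot decimal

# Same import as read.delim() with the generic function, options spelled out
gen <- read.table(tsv, header = TRUE, sep = "\t", dec = ".",
                  stringsAsFactors = FALSE)

class(us$Name)    # stays a character vector
class(fr$Value)   # numbers with a decimal comma are parsed as numeric
```

The four resulting data.frames hold the same values, which shows that the wrappers only differ by their default separator and decimal mark.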
A conditional statement tells R to run one block of code or another depending on the result of an expression; it is the well-known if-then-else. Below is the syntax of this statement in R:
if (EXPRESSION) {
CODE1
} else {
CODE2
}
The result of EXPRESSION must be a logical value, or a value that can be converted into one. If the result of EXPRESSION is TRUE, the commands of CODE1 are run; otherwise, the commands of CODE2 are run. The else block is optional: without it, no code is run when the result of EXPRESSION is FALSE. If a code block contains only one instruction, you can omit the braces.
if (257%%2==0) {
print('Even number')
} else {
print('Odd number')
}
## [1] "Odd number"
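The optional else block, the omission of braces and the conversion to a logical value can be sketched as follows (x and parity are hypothetical variable names).

```r
x <- 257

# No else block: nothing happens when the expression is FALSE
if (x > 1000) print('Large number')

# Single instructions, so the braces are omitted.
# x %% 2 is numeric: 0 is converted to FALSE, any other value to TRUE
parity <- if (x %% 2) 'Odd number' else 'Even number'
parity
## [1] "Odd number"
```

The second statement also shows that if returns the value of the block that was run, so its result can be assigned to a variable.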
An iterative statement is used when a block of code must be run many times, sometimes depending on a value. R provides three ways to write a loop: repeat, while and for. The three structures can express the same loops, so the choice depends on the developer’s preferences, although it is easier to loop over the values of a vector with the for statement.
break and next instructions
These two instructions are used inside the body of a loop either to exit the loop (break) or to jump to the next iteration (next), which avoids running unwanted code for specific iterations.
repeat loops
The general syntax of the repeat loop is the following:
repeat {
CODE
}
When R enters this kind of loop, it loops indefinitely until it reaches a break instruction inside the block of code. So, before running your script, make sure that this instruction is inside the repeat loop and can be reached.
n <- 17L
repeat {
print(n)
if (n==1L)
break
else
n <- if (n%%2L==0L) n%/%2L else 3L*n+1L
}
## [1] 17
## [1] 52
## [1] 26
## [1] 13
## [1] 40
## [1] 20
## [1] 10
## [1] 5
## [1] 16
## [1] 8
## [1] 4
## [1] 2
## [1] 1
while loops
The general syntax of the while loop is the following:
while (EXPRESSION) {
CODE
}
First, EXPRESSION is evaluated. If it is TRUE, the block CODE is run, then EXPRESSION is re-evaluated. If it is TRUE again, CODE is executed again, and so on until EXPRESSION evaluates to FALSE. If EXPRESSION evaluates to FALSE the first time, the commands inside CODE are never run. The example below is equivalent to the code in the repeat section:
n <- 17L
print(n)
## [1] 17
while (n!=1L) {
n <- if (n%%2L==0L) n%/%2L else 3L*n+1L
print(n)
}
## [1] 52
## [1] 26
## [1] 13
## [1] 40
## [1] 20
## [1] 10
## [1] 5
## [1] 16
## [1] 8
## [1] 4
## [1] 2
## [1] 1
for loops
The general syntax of the for loop is the following:
for (VAR in VECT) {
CODE
}
This loop runs CODE once for each value of VECT. CODE can access the current value through the variable named VAR. For instance:
for (i in 1:10) {
print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
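The break and next instructions work in a for loop too. Here is a small sketch (the odd_values variable is a hypothetical name) that skips even values and stops as soon as the current value exceeds 7:

```r
odd_values <- c()
for (i in 1:10) {
  if (i %% 2 == 0)
    next            # skip even values and go to the next iteration
  if (i > 7)
    break           # leave the loop as soon as an odd value exceeds 7
  odd_values <- c(odd_values, i)
}
odd_values
## [1] 1 3 5 7
```

The value 9 is never stored: it is odd, so next is not triggered, but it is greater than 7, so break ends the loop before the assignment.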
As R is efficient at computing over vectors, it is often faster to use vectorized computing instead of loops: a vectorized operation is applied to all the elements of a vector in a single call, rather than element by element in an R loop. R also provides the tapply(), lapply(), sapply(), vapply() and apply() functions, which offer ways around explicit loops. apply() is ideally suited to arrays, including matrices. tapply() applies a function to each cell of a ragged array, that is, to each group of values defined by the levels of one or more factors. lapply(), sapply() and vapply() apply a function to each element of a vector or a list: lapply() always returns a list, sapply() simplifies the result to a vector or a matrix when possible, and vapply() does the same while checking that each result has the type you declare.
For instance:
apply(matrix(1:100, ncol=10), 1, sum)
## [1] 460 470 480 490 500 510 520 530 540 550
sapply(1:10, function(x) x**2)
## [1] 1 4 9 16 25 36 49 64 81 100
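The other functions of this family can be sketched in the same way. The vectors values and groups below are hypothetical data invented for the illustration; the block also compares an explicit loop with its vectorized equivalent.

```r
values <- c(4, 8, 15, 16, 23, 42)
groups <- c("a", "b", "a", "b", "a", "b")   # hypothetical grouping of the values

# Explicit loop versus vectorized computing: same result
squares_loop <- numeric(length(values))
for (i in seq_along(values)) squares_loop[i] <- values[i]^2
squares_vec <- values^2

# lapply() always returns a list
lens <- lapply(list(1:3, letters), length)
# vapply() returns a vector and checks that each result is a single integer
lens_v <- vapply(list(1:3, letters), length, integer(1))
# tapply() applies mean() to each group of values
group_means <- tapply(values, groups, mean)
group_means
```

When a vectorized operation such as values^2 exists, it is usually the simplest choice; the apply family is useful when the function to run has no vectorized form.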
Functions encapsulate a piece of code that performs a certain task, e.g. computing a mean. A function can take variables as input and may or may not return a result; e.g. to compute a mean, you need input values (what should the mean be computed on?) and a returned value (the mean).
In R, a function is declared with the keyword function and, to be used later, it must be assigned to a variable name:
my_mean <- function(a, b) {
return((a + b) / 2)
}
my_mean(2, 5)
## [1] 3.5
In R, the value of the last evaluated expression is automatically returned, so if there is no ambiguity you can omit the return() function that specifies what the function returns.
my_mean <- function(a, b) {
(a + b) / 2
}
my_mean(2, 5)
## [1] 3.5
You can also give a default value to an argument:
my_mean <- function(a, b, p=1) {
(a*p + b*p) / 2
}
my_mean(2, 5)
## [1] 3.5
my_mean(c(5, 5), c(5, 5), p=c(2, 4))
## [1] 10 20
To save your data in a tabular text format, R provides the generic write.table() function. As with the read.table() function, there are wrappers around this function to write CSV files: write.csv() and write.csv2(). The first one is for the US CSV format and the second one for the French CSV format. There is no wrapper function to save your data in TSV format, but you can easily do so by setting the sep option to a tab character "\t".
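Writing a TSV file could look like the sketch below. The data.frame and the output path (tsv_file) are hypothetical; the quote and row.names options are not required, they are only a common choice to get a plain table.

```r
d <- data.frame("Number"=c(1.5,2,3), "Character"=c("A","B","C"))
tsv_file <- tempfile(fileext = ".tsv")   # hypothetical output path

# quote=FALSE avoids quotation marks around character values,
# row.names=FALSE avoids an extra first column holding the row names
write.table(d, tsv_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Check the content of the file as plain text
readLines(tsv_file)
```

If your row names carry information (gene IDs for instance), keep them or copy them into a regular column before exporting.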
You can also save an object in a binary format that can later be loaded into R again. Use the saveRDS() and readRDS() functions to save a single R object into a file and load it in your next R session. By convention, the “.rds” file name extension is the most often used.
You can also save many R objects into a single file using the save() and load() functions. In this case, the most often used file name extension is “.RData”.
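Both ways of saving R objects in a binary format can be sketched as follows. The objects m and v and the temporary file names are hypothetical and only serve the illustration.

```r
m <- matrix(1:6, nrow = 2)
v <- c("A", "B")

# One object per file: you choose the name of the restored object
rds_file <- tempfile(fileext = ".rds")
saveRDS(m, rds_file)
m_copy <- readRDS(rds_file)

# Several objects in one file: they come back under their original names
rdata_file <- tempfile(fileext = ".RData")
save(m, v, file = rdata_file)
rm(m, v)                       # remove them to simulate a new session
restored <- load(rdata_file)   # load() returns the names of the restored objects
restored
## [1] "m" "v"
```

Because load() silently overwrites existing objects that have the same names, the single-object approach is often easier to control in a script.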
A lot of other useful functions and data structures are available in separate packages. You can go to the Comprehensive R Archive Network (CRAN) webpage https://cran.r-project.org to see the majority of the available R packages. The install.packages() function installs, in your current R installation, a package listed on this webpage. For instance, to install the tidyverse package:
install.packages('tidyverse')
and follow the instructions on your screen. To use the functions or data structures of a given package, load it with the library() function:
library(tidyverse)
Note that the name of the package is written between quotation marks in the install.packages() function, while they are not needed in the library() function.
Bioconductor https://bioconductor.org is an open source project that lists and provides a wide variety of R packages suitable for the analysis of high-throughput genomic data (Huber et al. 2015). To install packages listed on the Bioconductor website, just follow the instructions provided on the webpage of the requested tool.
Here is a reminder of some advice given throughout this lesson, along with some other tips for developing in R:
Use the help() function on a function to see how it works.
Use the sessionInfo() function to get the list of all loaded packages and their versions at the end of your script. This is recommended because results can sometimes differ between R or package versions. So, to be reproducible, results should be reported together with the package and R versions.
INFO
You can download a cheat sheet summarizing all the basic concepts of R from here: https://github.com/rstudio/cheatsheets/blob/main/base-r.pdf.
EXERCISE
We will illustrate what we have learned on published bulk RNA-seq data.
We will use data from GSE153921, a Gene Expression Omnibus (GEO) database ID. Download the CSV file that is in the supplementary section (bottom of the webpage) and save it on your computer. This file contains the raw read counts associated with each gene in each sample. It looks as follows:
## ,D5minusT,D5minusA,D5minusC,D5plusT,D5plusA,D5plusB,D5shX3T,D5shX3A,D5shX3B,D5shX5T,D5shX5A,D5shX5B
## ENSG00000223972,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000227232,3,4,6,3,4,13,4,3,11,4,7,8
## ENSG00000243485,0,1,0,0,0,0,0,0,0,0,0,0
## ENSG00000221311,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000237613,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000268020,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000240361,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000186092,0,0,0,0,0,0,0,0,0,0,0,0
## ENSG00000238009,5,6,6,3,5,11,2,4,3,3,2,3
What is your current working directory? If needed, change this directory to the one where you have saved the file.
Import these data into a variable named data with the read.table() function or an analogous function. Use the first column, containing the Ensembl gene IDs, as row names.
What is the type of the data object? If it is not a matrix object, convert it into a matrix object.
How many genes does this dataset contain?
Does data contain NA values? If yes, replace them with a count value of 0.
Replace all the column names of the data matrix using the following convention: use a letter to indicate an experimental condition and a number to indicate a replicate, e.g. A1 represents the first replicate of experimental condition A. We see here that we have 4 conditions and 3 replicates for each one.
What are the counts of each sample for the gene ENSG00000131591?
What is the total count of the B3 sample? (You can use the sum() function; use help() if needed.)
What is the total count of each sample? (The help page of the sum() function may suggest other useful functions…)
Remove from the data matrix the genes that have no read count (i.e. 0) in every sample.
Write a function called size_factor that takes a numeric vector as input and returns a numeric value corresponding to the sum of all the values of the input vector divided by 1 million.
Create a new matrix called data_norm where each count corresponds to the count of the data matrix divided by the size_factor value obtained for the given sample.
From the normalized counts, compute for each experimental condition a “virtual” sample that contains the mean expression of each gene across all the samples of that condition, and store these “virtual” samples in a matrix called data_mean. You can name each “virtual” sample after its experimental condition. (Use loops or iterative functions.)
The differential expression of a given gene between two samples \(A\) and \(B\) can be measured with the fold change. A fold change is given by the ratio of the expression of the gene in \(A\) to that in \(B\), and we apply a base 2 logarithm to get the log2 fold change. As a 0 count can be problematic (a division by 0 or a log of 0 is not defined), we will add a pseudo-count of 1 to all counts before computing the log2 fold change. Compute the log2 fold changes for all possible pairwise combinations and store the results in a matrix called data_fc. (To help you generate all the possible pairwise combinations, you can use the combn() function.)
As we are only interested in differentially expressed genes, filter the data_fc table to keep only the genes that are differentially expressed in at least one comparison. A gene is differentially expressed if its log2 fold change is greater than 1 (over-expressed) or lower than -1 (under-expressed).
Generate a final data.frame called final that contains the following columns: gene ID, raw counts, normalized counts, mean counts of our “virtual” samples, and the log2 fold change of each comparison. To distinguish between the raw count columns and the normalized ones, you can add the (raw) and (normalized) suffixes after each sample name. (To do that, you can use the merge() function.)
Export the last generated table in the TSV (tab-separated values) file format. For French users, you can export numbers with a comma , as the decimal mark (to easily open the file with Excel).