Chapter 1 Introduction

1.1 What is R?

R is software for data analysis, manipulation and visualization as well as a well-developed and powerful programming language. It is free, open-source and highly extensible, and it compiles and runs on a wide variety of UNIX platforms, Windows and macOS. The R project was started by Robert Gentleman and Ross Ihaka at the University of Auckland in 1995 and is maintained by the R Core Team (2024), an international group of volunteer developers. The website of the R project is http://www.r-project.org.

Although R works fine by itself, we will use it in combination with RStudio, a so-called Integrated Development Environment (IDE), which provides a comfortable graphical user interface and some additional functionality.

1.2 Installing R

To begin, you should install R and RStudio on your computer. If you are working on a computer that has these programs already installed, you can skip this part. To install R, go to the Comprehensive R Archive Network (CRAN), for example here: https://ftp.fau.de/cran/ and install the version of R that is suitable for your operating system. After you have installed R, visit https://posit.co/download/rstudio-desktop/ and download and install RStudio Desktop. Check whether you can open RStudio by clicking on the RStudio icon on your desktop or by searching for it in your taskbar.

1.3 Getting to know RStudio

RStudio is generally divided into four subwindows. If you see only three subwindows upon opening RStudio, click File > New File > R Script. The upper left window is the R script, where you write your code. The lower left window is the console, where the code that you have written in the script gets executed and the results are displayed. The upper and lower right windows will be explained when they become relevant in the following chapters.

To try this out, type 1+1 into your script. Then mark this piece of code and either click on the Run symbol in the upper right corner of the script window or press Ctrl + Enter on your keyboard.

As you can see, the code gets reprinted in the console behind the >, which is called the prompt. The result [1] 2 is displayed directly below. The process of sending code from the script to the console is called running or executing your code. It is also possible to write code directly into the console next to the prompt and execute it by hitting Enter. However, we strongly advise typing all of your code in the script before executing it, since this makes rerunning and changing your code much easier. To make your code easier for humans to read, you can comment it in your script. Any line in your script that begins with a # will not be evaluated when sent to the console but will merely be printed:

#R does not calculate 1+1 if it is written like this:

#1+1

Anything you write in R that is not a comment is case sensitive, which means for example A and a are not the same thing to R.

1.4 R as a calculator

As you have seen in the example above, R can be used as an ordinary calculator. You can use + and - for addition and subtraction, * and / for multiplication and division and ^ for exponentiation.

Try out different calculations like the following by typing them into your script and running them. You can either run them line by line or mark several lines at once for execution.

5+3

[1] 8

7*3/2

[1] 10.5

2^3

[1] 8

(2-5)*8^2

[1] -192
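
When several operators appear in one line, R follows the usual order of operations: exponentiation first, then multiplication and division, then addition and subtraction. The following lines are a small sketch of this; the numbers are arbitrary and only serve as illustration.

```r
2 + 3 * 4      # multiplication first: same as 2 + (3 * 4), gives 14
(2 + 3) * 4    # brackets force the addition first, gives 20
2 * 3^2        # exponentiation first: same as 2 * (3^2), gives 18
-2^2           # same as -(2^2), gives -4
(-2)^2         # brackets make the base negative, gives 4
```

If you are unsure about the order in which R evaluates a calculation, adding brackets is always safe.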

The [1] in front of the output appears in front of every vector printed in the console. It is an index telling you the position of the first element in that row of output, which is useful when the vector is so long that it spans several lines in your console. You will learn what a vector is in just a moment.

1.5 Assignments

One of the most important concepts in R is the assignment of names to objects. So far the objects we have encountered are simple numbers. To assign a name to a number, you use the assignment operator <- (no space between < and -) like this:

x <- 3
some.complicated.name <- 7

We call x and some.complicated.name variables. Notice how they appear in the top right window of RStudio under the tab Environment once you have run those two lines. In the environment you will see all R objects that you have defined so far. R will list their names for you if you use ls():

ls()

[1] "some.complicated.name" "x"

The collection of named R objects showing up in the environment window is called the workspace and can be saved and reloaded, as you will learn later on. A variable name may consist of letters, digits, dots and underscores; it must not contain blank spaces or other special characters and must not begin with a number or an underscore. You can look up the value that is stored in a variable by simply typing out its name and running it:

x

[1] 3
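
If you are unsure whether a name is allowed, one way to check is the base R function make.names(), which turns any piece of text into a valid name. If the text comes back unchanged, it was already a valid name. The candidate names below are made up for illustration, and the quotes around them mark them as text (more on that in the section on character vectors).

```r
make.names("my_vector")   # returned unchanged, so this is a valid name
make.names("2nd.try")     # returns "X2nd.try": names must not begin with a number
make.names("my name")     # returns "my.name": blank spaces are not allowed
```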

Now you can use these variables for computation:

x + some.complicated.name

[1] 10

You can overwrite the value stored in your variables at any point by simply rerunning the original assignment with different values.

For example you can assign the values 2 and 80 to x and some.complicated.name and compute their product.

x <- 2
some.complicated.name <- 80
x*some.complicated.name

[1] 160

If you want to remove a variable, you can use the rm() command like this:

rm(x)

If you want to remove all variables in your workspace, you can use a combination of rm() and ls():

rm(list=ls())

As you can see, the variables now disappear from the environment window in the top right corner. The advantage of using variables instead of numbers in your code is that your code becomes reusable. Imagine having typed out a long computation that you want to perform repeatedly with different numbers. If your computation uses variable names, you only write it down once and are able to rerun it with as many different values as you like by just assigning those values to the variable one by one.
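
As a small sketch of this idea, the following lines compute the area and perimeter of a rectangle. The variable names width, height, area and perimeter are hypothetical and chosen only for this illustration.

```r
# first set of values
width <- 3
height <- 4
area <- width * height               # 12
perimeter <- 2 * (width + height)    # 14

# to reuse the computation, only the assignments change
width <- 10
height <- 2
area <- width * height               # 20
perimeter <- 2 * (width + height)    # 24
```

The two computation lines are identical in both runs; only the values assigned to width and height differ.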

1.6 Basic data structures

R has a small number of basic data structures from which all other kinds of objects can be built. We will go through the most important ones one by one.

1.6.1 Vectors

So far you have worked with single numbers. These are actually a special case of the most important data structure in R, the so-called vector. A vector in R is a sequence of elements of the same data type. For example, 8 2 4 6 is a numeric vector (i.e. a vector whose elements are numbers) with four elements, namely the numbers 8, 2, 4 and 6. You build a vector with the c() function like this:

c(8, 2, 4, 6, 2, 1)

[1] 8 2 4 6 2 1

This is a numeric vector of length 6 (i.e. it has 6 elements). The following vector is long enough to produce a line break in the console:

c(1, 23, 4, 5, 6, 7, 7, 8, 4, 2, 4, 6, 8, 98, 45, 23, 
  45, 8, 97, 23, 4, 23, 1, 3, 5, 6, 2, 45, 3, 45, 4, 1, 3)

 [1]  1 23  4  5  6  7  7  8  4  2  4  6  8 98 45 23 45  8 97 23  4 23  1  3  5
[26]  6  2 45  3 45  4  1  3

You can store vectors in variables as well:

my_vector <- c(8, 2, 4, 6, 2, 1)

Notice how the new variable my_vector now appears in the environment window on the upper right side. You can retrieve the vector stored in my_vector by typing it out and executing the code:

my_vector

[1] 8 2 4 6 2 1

Numeric

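The kind of vector you have just seen is the numeric vector (or just numeric), which is a vector containing numbers. R distinguishes two subtypes of numeric: integer, which can only hold whole numbers, and double, which can also hold numbers with decimal places. Unless you explicitly ask for integers (e.g. by writing 8L instead of 8), R stores numbers as doubles, so my_vector is technically a double even though it contains only whole numbers.

A numeric whose elements have decimal places is always a double: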

v <- c(1.5, 3.234, 7, 0.12356)
v

[1] 1.50000 3.23400 7.00000 0.12356
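
To see how R actually stores a numeric vector, you can ask for its storage type with typeof(). The sketch below uses two made-up vectors, whole and counts; the L suffix is R's way of writing an integer.

```r
whole <- c(8, 2, 4)        # whole numbers written without the L suffix
typeof(whole)              # "double"

counts <- c(8L, 2L, 4L)    # the L suffix requests integer storage
typeof(counts)             # "integer"

is.numeric(whole)          # TRUE
is.numeric(counts)         # TRUE: both subtypes count as numeric
```

For the calculations in this chapter the difference rarely matters, because R converts between the two subtypes automatically.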

You can use numeric vectors in calculations just like single numbers:

a <- c(1, 2, 3, 4)
a*3

[1]  3  6  9 12

b <- c(2, 4, 6, 7)
a+b

[1]  3  6  9 11

R executes the operation element-wise. This means the computation should involve either two vectors of the same length (like a+b in our example) or one vector and a single number (like a*3 in our example). If the lengths of the vectors in your calculation don’t match, R will recycle the shorter vector, i.e. repeat its elements until it is as long as the longer one.

long <- c(1, 2, 3, 4)
short <- c(1, 2)
long+short

[1] 2 4 4 6

Here, the shorter vector was repeated, i.e. the calculation was long + c(short, short).
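
The recycling step can be made visible with rep(), which repeats a vector up to a given length. The sketch below reuses long and short from above; the names recycled, odd and result are made up for illustration.

```r
long <- c(1, 2, 3, 4)
short <- c(1, 2)

# repeat short until it is as long as long
recycled <- rep(short, length.out = length(long))
recycled           # 1 2 1 2
long + recycled    # 2 4 4 6, the same result as long + short

# if the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning
odd <- c(1, 2, 3, 4, 5)
result <- odd + short
result             # 2 4 4 6 6
```

Such a warning is usually a sign that the two vectors were not meant to be combined, so it is worth checking their lengths when it appears.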

Character

R can not only deal with numbers, it can also deal with text. A piece of text is called a string and is written in a pair of double or single quotes. A vector containing strings as elements is called a character vector:

v2 <- c("male", "female", "female", "male")
v2

[1] "male"   "female" "female" "male"

v3 <- c('blue', 'brown', 'yellow')
v3

[1] "blue"   "brown"  "yellow"
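
Because all elements of a vector must have the same data type, R converts the elements if you mix numbers and strings in one call to c(). A minimal sketch with a made-up vector called mixed:

```r
mixed <- c(1, 2, "three")
mixed            # "1" "2" "three": the numbers have been turned into strings
class(mixed)     # "character"
```

Note the quotes around 1 and 2 in the output: they are text now, so you can no longer calculate with them.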

Logical

Another important type of vector is the logical vector, the elements of which are the so-called booleans TRUE and FALSE, which can be abbreviated as T and F (case matters: you have to use upper case letters in both versions).

c(TRUE, FALSE, TRUE)

[1]  TRUE FALSE  TRUE

c(F, T, T, T)

[1] FALSE  TRUE  TRUE  TRUE
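
One caveat about the short forms: TRUE and FALSE are reserved words in R, whereas T and F are ordinary variables that merely hold these values by default and can be overwritten. The following lines are a small sketch of what can go wrong; overwriting T here is done purely for demonstration.

```r
T            # TRUE
T <- 0       # allowed, because T is an ordinary variable name
T            # 0: T no longer means TRUE
rm(T)        # remove our own T so that the built-in T is visible again
T            # TRUE
```

Writing out TRUE and FALSE in full avoids this problem.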

Boolean values are the result of logical operations, that is, of statements that can be either true or false:

3 < 4

[1] TRUE

The most common logical operators that we will use are the following:

AND &
OR |
NOT !
greater than >
greater or equal >=
less than <
less or equal <=
equal to == (yes, you need two equal signs)
not equal to !=

The comparison operators (the last six in the list) can be used with numbers like this:

3 < 1

[1] FALSE

5 > 2

[1] TRUE

5 == 5

[1] TRUE

5 != 5

[1] FALSE

The operators &, | and ! can be used to link boolean values:

TRUE & FALSE

[1] FALSE

TRUE | FALSE

[1] TRUE

!FALSE

[1] TRUE

You can also create more complex expressions, using () to group statements:

((1+2)==(5-2)) & (7<9)

[1] TRUE
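
Like arithmetic, logical operations work element-wise on vectors, and the result is a logical vector. A short sketch using the vector a from the section on numeric vectors:

```r
a <- c(1, 2, 3, 4)
a > 2                # FALSE FALSE TRUE TRUE
a == 3               # FALSE FALSE TRUE FALSE
(a > 1) & (a < 4)    # FALSE TRUE TRUE FALSE
```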

1.6.2 Subsetting vectors

Every element of a vector can be accessed individually by referencing its position (i.e. its index) in the vector. You can for example retrieve the fourth element of my_vector like this:

my_vector[4]

[1] 6

It is also possible to select more than one element of the vector by using an integer vector of the desired indices (e.g. c(1,4,5) if you want to retrieve the first, fourth and fifth element of a vector) within the square brackets:

my_vector[c(1, 4, 5)]

[1] 8 6 2

We call this subsetting your vector. For subsetting vectors we often need longer sequences of integers. To generate a sequence of consecutive integer numbers R has the <start> : <end> operator, which is read as from <start> to <end>:

3:10 #generates sequence from 3 to 10

[1]  3  4  5  6  7  8  9 10
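
Such a sequence can be placed directly inside the square brackets. The sketch below applies this to my_vector; length() is a base R function that returns the number of elements of a vector.

```r
my_vector <- c(8, 2, 4, 6, 2, 1)
my_vector[2:4]                    # second to fourth element: 2 4 6
length(my_vector)                 # 6
my_vector[3:length(my_vector)]    # third to last element: 4 6 2 1
```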

1.6.3 Data frames

If you want to do statistics, the most likely format your data will come in is some kind of table. In R, the basic form of a table is called a data.frame and looks like this:

name	height	gender	age
John	185.2	male	25
Max	175.8	male	32
Susi	155.1	female	27
Anna	162.7	female	24

Usually every row is an observation (e.g. an individual or a measurement point) and each column is a variable on which the observation is measured (e.g. age, gender etc.). For learning purposes, R has some built-in data frames, one of which is the data.frame iris (Fisher 1936). You can have a look at a data.frame like this, though for really large data sets not all of the rows might be displayed:

View(iris) #careful, that's a capital V in `View()`

Table 1.1: The first 10 rows of the iris data set.
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa
4.6	3.4	1.4	0.3	setosa
5.0	3.4	1.5	0.2	setosa
4.4	2.9	1.4	0.2	setosa
4.9	3.1	1.5	0.1	setosa
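
If you prefer to stay in the console, a few base R functions give a quick overview of a data.frame without opening the viewer. This is only a sketch of some common options:

```r
head(iris, 10)    # prints the first 10 rows in the console
nrow(iris)        # number of rows: 150
ncol(iris)        # number of columns: 5
names(iris)       # the column names
```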

The data set iris gives measurements of sepal and petal lengths and widths of 150 flowers from three different species of iris. You can extract each of the columns with a $:

iris$Sepal.Length

  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1
 [19] 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0
 [37] 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4 6.9 5.5
 [55] 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 5.6 5.9 6.1
 [73] 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5
 [91] 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3
[109] 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2
[127] 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8 6.8
[145] 6.7 6.7 6.3 6.5 6.2 5.9

As you can see, iris$Sepal.Length is just a numeric vector! Consequently you can do calculations on these vectors, e.g. compute the mean sepal length of the flowers:

mean(iris$Sepal.Length)

[1] 5.843333

Basically, a data.frame in R is a number of vectors of the same length that have been stuck together column-wise to build a table. Each column must have a single data type, but different columns can have different data types. In this example, columns 1 to 4 are numeric and column 5 is a factor, i.e. a categorical variable whose values are displayed as text.
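
To illustrate how vectors are combined into a table, the small example table from the beginning of this section can be built with the data.frame() function. This is a minimal sketch; the object name people is made up for illustration.

```r
people <- data.frame(
  name   = c("John", "Max", "Susi", "Anna"),
  height = c(185.2, 175.8, 155.1, 162.7),
  gender = c("male", "male", "female", "female"),
  age    = c(25, 32, 27, 24),
  stringsAsFactors = FALSE    # keep the text columns as character vectors
)
people

people$age             # a numeric vector: 25 32 27 24
mean(people$height)    # 169.7
class(people$name)     # "character"
class(iris$Species)    # "factor"
```

Every argument of data.frame() becomes one column, and all of them must have the same length (here 4).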

1.6.4 Lists

While data.frames are useful to bundle together vectors of the same length, lists are used to combine more heterogeneous data. The following block of code creates a list:

#create list
my.list <- list(my_vector, long, iris[1:10,]) 
#print list
my.list

[[1]]
[1] 8 2 4 6 2 1

[[2]]
[1] 1 2 3 4

[[3]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

A list is a collection of R objects that are called the elements of the list. Lists are similar to data.frames, but while data.frames can only have vectors of the same length as their elements (i.e. the variables), lists can have all kinds of data types as elements. An element of a list can be a vector of arbitrary length, a data.frame, another list or even a function. The list we have just created contains two vectors of different lengths and a data.frame containing the first ten rows of the iris data set. You can access a single list element by referencing its position in the list using double square brackets [[]]:

my.list[[1]] #result is a vector

[1] 8 2 4 6 2 1

my.list[[3]] #result is a data.frame

   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

If you want to subset the list (i.e. keep only certain parts), use single square brackets []:

my.list[2:3] #result is a list

[[1]]
[1] 1 2 3 4

[[2]]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

Note that if you use single square brackets [], the result will always be a list, whereas using double square brackets [[]] will return whatever type the object is that you are referencing with [[]].
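
The difference becomes visible when you ask for the class of the result. The following sketch rebuilds my.list so that it runs on its own:

```r
my_vector <- c(8, 2, 4, 6, 2, 1)
long <- c(1, 2, 3, 4)
my.list <- list(my_vector, long, iris[1:10, ])

class(my.list[1])      # "list": a list that contains one element
class(my.list[[1]])    # "numeric": the vector itself
my.list[[1]][4]        # fourth element of the first list element: 6
```

The last line shows that the two kinds of brackets can be combined: [[1]] retrieves the vector and [4] then subsets it.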

my.list is an unnamed list, but it is also possible to create a named list:

#create list
my.named.list <- list(a=my_vector, b=long, c=iris[1:10,]) 
#print list
my.named.list

$a
[1] 8 2 4 6 2 1

$b
[1] 1 2 3 4

$c
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1           5.1         3.5          1.4         0.2  setosa
2           4.9         3.0          1.4         0.2  setosa
3           4.7         3.2          1.3         0.2  setosa
4           4.6         3.1          1.5         0.2  setosa
5           5.0         3.6          1.4         0.2  setosa
6           5.4         3.9          1.7         0.4  setosa
7           4.6         3.4          1.4         0.3  setosa
8           5.0         3.4          1.5         0.2  setosa
9           4.4         2.9          1.4         0.2  setosa
10          4.9         3.1          1.5         0.1  setosa

The advantage of a named list is that you can extract the list elements by their names, similar to extracting variables from a data.frame:

my.named.list$a

[1] 8 2 4 6 2 1

my.named.list$b

[1] 1 2 3 4

The square brackets [] and [[]] do, however, also work on named lists. Because lists can bundle a lot of heterogeneous data in one R object, they are quite often used to give results of functions for statistical analyses as you will see later on.
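
As a short sketch of the different ways to access a named list, the following lines rebuild my.named.list with the same contents as above so that the example runs on its own:

```r
my.named.list <- list(a = c(8, 2, 4, 6, 2, 1), b = c(1, 2, 3, 4), c = iris[1:10, ])

names(my.named.list)            # "a" "b" "c"
my.named.list[["b"]]            # access by name, same result as my.named.list$b
my.named.list[[2]]              # access by position still works
my.named.list$c$Sepal.Length    # $ can be chained: a column of the data.frame in c
```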

1.6.5 Determining the class of an object

You can find out what type of data structure an object is with the class() function:

class(my.list)

[1] "list"

class(iris)

[1] "data.frame"

class(iris$Sepal.Length)

[1] "numeric"

class(my_vector)

[1] "numeric"

class(c(TRUE,FALSE, FALSE))

[1] "logical"

1.6.6 Investigating the structure of an object

When you get more complex objects, it can sometimes be useful to get an overview of their structure with str():

str(my.list)

List of 3
 $ : num [1:6] 8 2 4 6 2 1
 $ : num [1:4] 1 2 3 4
 $ :'data.frame':   10 obs. of  5 variables:
  ..$ Sepal.Length: num [1:10] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9
  ..$ Sepal.Width : num [1:10] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
  ..$ Petal.Length: num [1:10] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5
  ..$ Petal.Width : num [1:10] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1
  ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1

This tells you that my.list is a list with 3 elements. The first two are numeric vectors, the third is a data.frame. As you can see the columns of the data.frame are also shown in part, but truncated so that they don’t clutter the console too much. You also get this information if you click on the little triangle next to the my.list object in the environment window on the upper right.

1.7 Troubleshooting

The code I’m sending to the console appears but does not seem to be executed: Check whether the last line of the console shows the prompt >. This means R is ready to receive new commands. If there is no > but a + instead, you probably forgot to close a bracket some lines before and R is waiting for the closing bracket. Just hit Esc to interrupt the current command and try again. Make sure the number of opening brackets matches the number of closing brackets. If you are in RStudio and there is no +, but a little red stop sign in the upper right corner of the console, R is still working on the computation. If it does not go away after a few moments, but you know your computation should not take this long, click the stop sign to terminate the current computation and try to find out why the computation you started will not finish.

Error: object ‘x’ not found: Either you have a typo in the object name or you forgot to define x (x being a stand-in for the variable named in your error message), so it does not show up in the environment window on the upper right. Run the assignment for x and try again.
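
Besides looking at the environment window, you can ask R directly whether an object with a given name is currently defined, using the base R function exists(). In the sketch below, some.undefined.name is a made-up name that is assumed not to be defined in your workspace yet.

```r
exists("iris")                   # TRUE: the built-in data set is available
exists("some.undefined.name")    # FALSE, as long as nothing with this name is defined
some.undefined.name <- 1
exists("some.undefined.name")    # TRUE after the assignment has been run
```

Note that the name has to be written in quotes and that the exact spelling matters, including upper and lower case.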

References

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.