Chapter 3 Intro to R

I can recall vividly how I started learning R as an undergrad and I told a friend of mine – a then grad student in education science and SPSS user – about it. He replied: “R? Isn’t that this incredibly fancy scientific calculator?” Well, he was not exactly right – but not really wrong either.

Today, you are going to take your first steps with R. In the following, you will learn how to use R as a fancy calculator: extending its functionality by installing packages, performing all kinds of calculations, storing data in objects of several kinds, and accessing them.

3.1 Installing packages

Being a fancy calculator implies that you can extend it as you want. One of the big upsides of using R is that due to its active community, whose members are permanently striving to make it a bit better, we useRs are basically standing on the shoulders of giants. You can install packages from CRAN by using the install.packages() command.

#install.packages("tidyverse") # installs the tidyverse package
# insert '#' if you want R not to execute the things that stand to its right; pretty useful for annotating code

CRAN packages have to fulfill certain requirements and packages are updated at a certain pace. If you want to use other packages or get development versions, you can also install packages from GitHub using the devtools package.

Before you can use a certain package in a session, you have to load it using the library() command.

library(tidyverse)

Now you are good to go!

3.2 Basic arithmetic operations

Using R as a calculator looks like this:

5 + 5
## [1] 10
5 + 5 * 3
## [1] 20
5 + 5^2
## [1] 30
sqrt(9)
## [1] 3

The latter, sqrt(), is no classic arithmetic operation but a function. It takes a non-negative number as input and returns its square root.
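The results above follow from the usual order of operations: exponentiation comes before multiplication, which comes before addition. As a small sketch (the object names are made up for illustration), parentheses let you change that order:

```r
# Operator precedence: ^ binds tighter than *, which binds tighter than +
without_parens <- 5 + 5 * 3   # evaluated as 5 + (5 * 3)
with_parens <- (5 + 5) * 3    # parentheses force the addition first
squared_sum <- (5 + 5)^2      # 10 squared
root_of_sum <- sqrt(16 + 9)   # the argument is evaluated before sqrt() is applied

c(without_parens, with_parens, squared_sum, root_of_sum)
## [1]  20  30 100   5
```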

3.3 Vectors

R is vector-based. That implies that we can store multiple values in vectors and perform operations on them element by element. This is pretty handy and distinguishes it from other languages like, for instance, C or Python (without NumPy).
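To give you a first impression of what "element by element" means, here is a minimal sketch. The vectors prices and quantities are hypothetical examples; c(), which creates them, is introduced properly further down.

```r
prices <- c(10, 20, 30)     # hypothetical example values
quantities <- c(1, 2, 3)

doubled <- prices * 2            # the single number is applied to every element
revenue <- prices * quantities   # elements are paired up by position
total <- sum(revenue)            # sum() collapses the vector into one number

doubled
## [1] 20 40 60
revenue
## [1] 10 40 90
total
## [1] 140
```

No loop is needed for any of this, which is what makes R feel like a calculator that works on whole columns of numbers at once.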

In R, there are two kinds of vectors: atomic vectors and lists. Atomic vectors can only contain values of one type, whereas lists can contain atomic vectors of different types – and other lists as well. It might be hard for you at first to wrap your head around this. However, it will become clear as soon as we look at some examples. Vectors can be characterized by two key properties: their type, which can be determined with typeof(), and their length, which can be determined with length(). NULL is the absence of a vector. NA, a missing value, is the absence of a value in a vector.
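The two key properties and the difference between NULL and NA can be sketched as follows. The object names are hypothetical, and c() and list(), which are explained later in this chapter, are only used here to have something to inspect.

```r
measurements <- c(1.5, 2, NA)   # hypothetical values combined into one vector

typeof(measurements)   # "double"
length(measurements)   # 3, the NA still occupies a position
is.na(measurements)    # FALSE FALSE TRUE

nothing <- NULL
length(nothing)        # 0, there is no vector at all
is.null(nothing)       # TRUE

mixed <- list(1, "a", TRUE)
typeof(mixed)          # "list"
length(mixed)          # 3
```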

In the following, I first introduce atomic vectors. Afterwards, I describe lists. Finally, I introduce augmented vectors: factors, data frames/tibbles, and dates/date-times. I will refer to atomic vectors as vectors, and to lists as lists. I will leave out matrices and arrays. We will not work with them in the course, and, honestly, I rarely use them myself.

This tutorial borrows heavily from Hadley Wickham’s “R for Data Science” (Wickham and Grolemund 2016), and Richard Cotton’s “Learning R” (Cotton 2013).

3.3.1 Atomic vectors

There exist six different types of atomic vectors: logical, integer, double, character, complex, and raw. The latter two are hardly used, hence I will not include them here. Integer and double are usually summarized under the umbrella term numeric vectors.

We can create a vector using the c() function. “c” stands for “concatenate.”

3.3.1.1 Logical vectors

Logical vectors can take three values: TRUE, FALSE, and NA. While you can create them by hand (logical_vec <- c(TRUE, FALSE, NA)), they are usually the result of comparisons. In R, you have six comparison operators:

  • <
  • >
  • <=
  • >=
  • == (always use two equal signs)
  • != (not equal)

5 > 6
## [1] FALSE
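Comparisons also work on whole vectors and then return one logical value per element. A minimal sketch with hypothetical values, which also shows where the third possible value, NA, comes from:

```r
ages <- c(17, 18, 42, NA)   # hypothetical values; the last one is missing
is_adult <- ages >= 18      # compared element by element

is_adult
## [1] FALSE  TRUE  TRUE    NA
typeof(is_adult)
## [1] "logical"
```

Comparing a missing value with anything yields NA, because R cannot know whether the condition holds.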

Sometimes, we want to store the results of what we are doing. Then, we assign our operation’s result to a meaningful name:

example_logical_vec <- 5 > 6

You may wonder how you should name your objects. In this case, just consult the tidyverse style guide. It says that you should use lowercase letters, numbers, and underscores (called “snake case”). In general, you should stick to the tidyverse style guide. The conventions you can find in there will make your life and the lives of the people who have the honor to read your code a lot easier. And if you find examples in this tutorial where I violate any of the conventions stated there and point it out, I owe you a hot beverage.

Logical vectors can also be used in a numerical context. If so, TRUE becomes 1 and FALSE 0. You will see an example when we deal with the conversion of vectors to different types.

You can look at vectors by either typing in the name and then executing it, or by calling head(). The latter is especially useful if the vectors are very long, since it only gives back the first 6 values by default. However, you can specify the length of the output by providing a different n argument.

example <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
example # too long
##  [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE
## [13] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
head(example, n = 5)
## [1]  TRUE FALSE FALSE FALSE  TRUE

3.3.1.2 Numeric vectors

Numbers in R are double by default. To make a numeric vector an integer, add L to a number, or use as.integer().

double_vec <- c(1, 2, 3, 4)
typeof(double_vec)
## [1] "double"
integer_vec <- c(1L, 2L, 3L)
typeof(integer_vec)
## [1] "integer"
typeof(as.integer(double_vec))
## [1] "integer"

Furthermore, you can create sequences of numbers by using the : operator. This will also give you an integer.

new_sequence <- 1:9
new_sequence
## [1] 1 2 3 4 5 6 7 8 9
typeof(new_sequence)
## [1] "integer"

Note that doubles are only approximate, since they represent floating point numbers. In your everyday coding, you should not worry too much about it. However, keep it in mind later on. You can read more about it in “The R Inferno” (Burns 2011, page 9).
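A minimal sketch of what "approximate" means in practice. The tolerance of 1e-8 is an arbitrary choice for illustration; all.equal() is base R's own way of comparing numbers with a small tolerance.

```r
x <- sqrt(2)^2              # mathematically exactly 2

exact_match <- x == 2                       # FALSE: x is off by a tiny rounding error
close_enough <- isTRUE(all.equal(x, 2))     # TRUE: equal up to a small tolerance
manual_check <- abs(x - 2) < 1e-8           # TRUE: the same idea written out by hand

tenths <- 0.1 + 0.2 == 0.3                  # FALSE, for the same reason
```

So whenever you compare doubles, compare them with a tolerance rather than with ==.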

Beyond that, integers only have one special value – NA, implying a missing value. Doubles have four: NA – missing value, NaN – not a number, and Inf and -Inf – infinite values. The latter three can be illustrated with the following example:

c(-1, 0, 1) / 0
## [1] -Inf  NaN  Inf

And, very important: use decimal points instead of decimal commas (especially applicable to Germans).

3.3.1.3 Character vectors

Vectors of type character can consist of more or less anything. The only thing that matters is that their inputs are wrapped in either " " or ' ' (having both comes in handy if you want to store text that itself contains quotation marks or apostrophes):

another_character <- c("hi", "1234", "!!1!", "#+*23$%&/(")
typeof(another_character)
## [1] "character"
text_character <- "I am my mother's child."
direct_speech <- '"It has never been easy to learn how to code," said my professor'

You cannot really “do” anything with character vectors in the arithmetic sense. What does work is comparison, which follows alphabetical order:

#text_character + direct_speech # remove '#' if you want to try
text_character == text_character
## [1] TRUE
"b" > "a"
## [1] TRUE

3.3.2 Working with atomic vectors

3.3.2.1 Convert between types

You can either explicitly or implicitly convert a vector to a certain type.

For explicit conversion, or coercion, you can just call the respective as.xxx() function: as.logical(), as.integer(), as.double(), or as.character(). However, having to call these functions often implies that your vector had the wrong type in the first place. Hence, try to fix the type at its source if possible; explicit conversion should be needed relatively rarely.
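A minimal sketch of explicit conversion with hypothetical inputs. Note what happens when a value cannot be converted: R returns NA and issues a warning (silenced here with suppressWarnings() to keep the output clean).

```r
from_text <- as.integer("42")           # 42L
from_text_dbl <- as.double("3.14")      # 3.14
to_text <- as.character(1.5)            # "1.5"
truncated <- as.integer(3.9)            # 3, the decimals are cut off, not rounded

# only spellings such as "TRUE", "true", "T" (and their FALSE counterparts) are understood
to_logical <- as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA

# text that is not a number cannot be converted and becomes NA
failed <- suppressWarnings(as.integer("abc"))         # NA
```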

Implicit conversion happens by using a vector in a context in which a vector of a different type is expected. One example is dealing with logical vectors. As mentioned earlier, TRUE is translated to 1, while FALSE becomes 0. This can come in pretty handy:

x <- sample(1000, 100, replace = TRUE) # draw 100 numbers between 1 and 1000
y <- x > 500 # whether numbers are greater than 500 
typeof(y)
## [1] "logical"
sum(y) # how many are greater than 500
## [1] 47
mean(y) # proportion of numbers which are greater than 500
## [1] 0.47

Also, if you build a vector out of multiple types, the most complex type always wins. Here, complex means that a vector can take many different values. Character vectors, for instance, can take basically any value:

typeof(c(TRUE, 1L))
## [1] "integer"
typeof(c(1L, 1.5))
## [1] "double"
typeof(c(1.5, "abc"))
## [1] "character"

3.3.2.2 Naming elements

Elements of vectors can be named. This can either happen during creation:

named_vector <- c(one = 1, two = 2, three = 3, four = 4, five = 5)

Or afterwards, using set_names() from the purrr package (which is part of the core tidyverse and is therefore attached by library(tidyverse), so it does not need to be loaded separately):

named_vector <- set_names(1:5, c("one", "two", "three", "four", "five"))
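If you do not have the tidyverse loaded, base R offers the same functionality. The following sketch uses names() and setNames(); the object names are made up for illustration.

```r
scores <- 1:3
names(scores) <- c("one", "two", "three")   # assign names to an existing vector

scores_too <- setNames(1:3, c("one", "two", "three"))   # base R counterpart of set_names()

identical(scores, scores_too)   # TRUE, both approaches give the same result
names(scores)                   # "one" "two" "three"
bare_scores <- unname(scores)   # drops the names again
```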

3.3.2.3 Accessing elements

If we want to access a certain element of the vector, we can tell R to do so by using square brackets [ ]. This can also be used for some filtering:

named_vector[1] # first element
## one 
##   1
named_vector[length(named_vector)] # last element, using a function, again
## five 
##    5
named_vector[-3] # all elements but the third
##  one  two four five 
##    1    2    4    5
named_vector[c(1, 3)] # first and third
##   one three 
##     1     3
named_vector[1:3] # first to third
##   one   two three 
##     1     2     3
named_vector[named_vector == 3] # elements that equal three
## three 
##     3
named_vector[named_vector %in% c(1, 2, 3)] # elements whose values also appear in another vector
##   one   two three 
##     1     2     3
named_vector[named_vector > 2] # values that are bigger than 2
## three  four  five 
##     3     4     5
rev(named_vector) # reverse vector -- using a function
##  five  four three   two   one 
##     5     4     3     2     1
named_vector[c(1, 1, 1, 2, 3, 3, 3)] # first first first second third third third element
##   one   one   one   two three three three 
##     1     1     1     2     3     3     3
named_vector[c(TRUE, TRUE, TRUE, FALSE, TRUE)] # subsetting with a logical vector -- TRUE = value at the corresponding position is retained, FALSE = value at the corresponding position is dropped
##   one   two three  five 
##     1     2     3     5
named_vector[c("one", "three")] # if the vector is named, you can also select the correspondingly named elements with a character vector
##   one three 
##     1     3

As stated in the beginning, atomic vectors can only contain data of one type. If we want to store data of several types in one object, we need to use lists.

3.3.3 Lists

Lists can contain all types of vectors, including other lists. Due to the latter feature, they are also called “recursive vectors.”

Lists can be created using list(). Naming elements works like naming elements of atomic vectors.

new_list <- list(numbers = 1:5, characters = c("Hello", "world", "!"), logical_vec = c(TRUE, FALSE), another_list = list(1:5, 6:10))

In theory, you can, for instance, look at a list by calling head():

head(new_list)
## $numbers
## [1] 1 2 3 4 5
## 
## $characters
## [1] "Hello" "world" "!"    
## 
## $logical_vec
## [1]  TRUE FALSE
## 
## $another_list
## $another_list[[1]]
## [1] 1 2 3 4 5
## 
## $another_list[[2]]
## [1]  6  7  8  9 10

Another possibility, which is especially suitable for lists, is str(), because it focuses on the structure:

str(new_list)
## List of 4
##  $ numbers     : int [1:5] 1 2 3 4 5
##  $ characters  : chr [1:3] "Hello" "world" "!"
##  $ logical_vec : logi [1:2] TRUE FALSE
##  $ another_list:List of 2
##   ..$ : int [1:5] 1 2 3 4 5
##   ..$ : int [1:5] 6 7 8 9 10

3.3.3.1 Accessing list elements

Accessing elements of a list is similar to vectors. There are basically three ways:

Using single square brackets gives you a sub-list:

sublist <- new_list[2]
sublist
## $characters
## [1] "Hello" "world" "!"
typeof(sublist)
## [1] "list"

Double square brackets give you the component:

component_1 <- new_list[[1]]
component_1
## [1] 1 2 3 4 5
typeof(component_1)
## [1] "integer"

A bit hard to grasp? I certainly agree! You can find a nice real-world metaphor here.

If the elements are named, you can also extract them using the $ operator:

vector_of_numbers <- new_list$numbers
vector_of_numbers
## [1] 1 2 3 4 5
typeof(vector_of_numbers)
## [1] "integer"
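The three ways can be chained to reach into nested lists. Here is a minimal sketch with a smaller, hypothetical list called nested, built like new_list above:

```r
nested <- list(numbers = 1:5, another_list = list(1:5, 6:10))

as_sublist <- nested["numbers"]        # single brackets: still a list (of length 1)
as_vector <- nested[["numbers"]]       # double brackets: the integer vector itself

second_inner <- nested$another_list[[2]]       # second component of the inner list
third_value <- nested$another_list[[2]][3]     # third value of that component

typeof(as_sublist)   # "list"
typeof(as_vector)    # "integer"
second_inner         # 6 7 8 9 10
third_value          # 8
```

Each [[ ]] or $ peels off one layer of the list, and a final [ ] then selects values from the atomic vector you arrive at.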

3.3.3.2 Functions for working with vectors

all() and any() return whether all or any of the elements fulfill a certain condition.

all(vector_of_numbers == 5)
## [1] FALSE
any(vector_of_numbers == 5)
## [1] TRUE

You can also determine which() element of the vector meets a certain condition.

which(vector_of_numbers %in% c(1, 5))
## [1] 1 5

subset() enables you to filter a vector, keeping only the values that meet a certain condition.

subset(vector_of_numbers, vector_of_numbers > 4)
## [1] 5
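To see how these functions relate to the square-bracket subsetting from earlier, here is a small sketch with hypothetical values. One difference worth knowing: subset() silently drops elements for which the condition is NA, whereas square brackets keep them as NA.

```r
values <- c(3, 8, 1, 9, 4)   # hypothetical values

positions <- which(values > 3)            # 2 4 5, the positions, not the values
by_bracket <- values[values > 3]          # 8 9 4
by_subset <- subset(values, values > 3)   # 8 9 4, same result

with_missing <- c(3, NA, 9)
bracket_na <- with_missing[with_missing > 3]          # NA 9, the NA is kept
subset_na <- subset(with_missing, with_missing > 3)   # 9, the NA is dropped
```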

3.3.4 Augmented vectors

In R, there are also other vector types. They are built upon the basic vectors – atomic vectors and lists. The most important ones are factors (built upon integers), date/date-time (built upon doubles), and data frames/tibbles (built upon lists).

3.3.4.1 Factors

Factors are used in R to represent categorical variables. They can only take a limited number of values. Think for example of something like party affiliation of members of the German parliament. This should be stored as a factor, because you have a limited set of values (i.e., AfD, Buendnis 90/Die Gruenen, CDU, CSU, Die Linke, FDP, SPD, fraktionslos) which apply to multiple politicians. Names, on the other hand, should be stored as characters, since there is (in theory) an infinite number of possible values.

Factors are built on top of integers. They have an attribute called “levels.”

mdbs <- factor(levels = c("AfD", "Buendnis90/Die Gruenen", "CDU", "CSU", "Die Linke", "SPD"))
levels(mdbs)
## [1] "AfD"                    "Buendnis90/Die Gruenen" "CDU"                   
## [4] "CSU"                    "Die Linke"              "SPD"
typeof(mdbs)
## [1] "integer"
mdbs
## factor(0)
## Levels: AfD Buendnis90/Die Gruenen CDU CSU Die Linke SPD

In our daily workflow, we normally convert character vectors to factors using as.factor(). We will learn more about factors – and the forcats package, which is dedicated to them – later on.
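A minimal sketch of how a character vector turns into a factor and what "built on top of integers" means. The vector parties is a hypothetical example, and factor() is used instead of as.factor() so that the levels can also be set by hand.

```r
parties <- c("SPD", "CDU", "SPD", "FDP")   # hypothetical example data

party_factor <- factor(parties)
levels(party_factor)       # "CDU" "FDP" "SPD", sorted alphabetically by default
as.integer(party_factor)   # 3 1 3 2, each value is stored as the position of its level

# the order of the levels can be set explicitly
ordered_by_hand <- factor(parties, levels = c("SPD", "CDU", "FDP"))
as.integer(ordered_by_hand)   # 1 2 1 3

# values that are not among the levels become NA
unknown_party <- factor(c("SPD", "XYZ"), levels = c("SPD", "CDU"))
is.na(unknown_party)          # FALSE TRUE
```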

3.3.4.2 Date and date-time

Dates are simply numeric vectors that indicate the number of days that have passed since 1970-01-01. We will work with dates using the lubridate package.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
date <- as.Date("1970-01-02")
unclass(date)
## [1] 1
typeof(date)
## [1] "double"

Date-times work analogously: a numeric vector that represents the number of seconds that have passed since 1970-01-01 00:00:00.

datetime <- ymd_hms("1970-01-01 01:00:00")
unclass(datetime)
## [1] 3600
## attr(,"tzone")
## [1] "UTC"

If you want to learn more on dates and times, have a look at the lubridate package which has been dedicated to them.
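Because dates and date-times are just numbers underneath, you can calculate with them. The following sketch sticks to base R (as.Date() and as.POSIXct()) rather than lubridate, and all dates in it are arbitrary examples.

```r
day_ten <- as.Date("1970-01-11")
days_since_origin <- as.numeric(day_ten)   # 10 days after 1970-01-01

later <- as.Date("1970-01-01") + 30        # adding a number adds days
format(later)                              # "1970-01-31"

# subtracting two dates gives the number of days between them (2020 was a leap year)
days_in_2020 <- as.numeric(as.Date("2021-01-01") - as.Date("2020-01-01"))   # 366

one_hour <- as.POSIXct("1970-01-01 01:00:00", tz = "UTC")
seconds_since_origin <- as.numeric(one_hour)   # 3600 seconds
```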

3.3.5 Data Frames/Tibbles

The data structure in R which is probably the most central for this course – and for working with the tidyverse in general – is the data frame (or Tibble, which is used in the context of the tidyverse packages). In the following, I will only focus on Tibbles. The differences between a Tibble and a data frame can be found here. Strictly speaking, they are augmented vectors, but since they are the most important data type when working with tidyverse packages, they get their own section.

Tibbles are built upon lists, but there are some crucial differences. Lists can contain everything (including other lists), whereas Tibbles can only contain vectors (including lists) which are all of the same length or of length 1 (in the latter case, the value is repeated until the vector is as long as the others, so-called recycling). These vectors, the Tibble’s columns, need to have names. For creating tibbles, we need the tibble package, which comes with the tidyverse. You can give columns names which are invalid variable names in R (e.g., because they contain spaces) by wrapping them in backticks (` `). If you want to work with such a column afterwards, you will also have to wrap its name in backticks.

When you’re working in RStudio, you can open a separate tab containing the tibble by either clicking on the object in the “environment” pane or by using the View() command (I had to comment it out in the script because otherwise the RMarkdown document would not have knit).

new_tibble <- tibble(
 a = 1:5,
 b = c("Hi", ",", "it's", "me", "!"),
 `an invalid name` = TRUE
)
new_tibble
## # A tibble: 5 × 3
##       a b     `an invalid name`
##   <int> <chr> <lgl>            
## 1     1 Hi    TRUE             
## 2     2 ,     TRUE             
## 3     3 it's  TRUE             
## 4     4 me    TRUE             
## 5     5 !     TRUE
# View(new_tibble)
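That such a table is "built upon lists" can be checked directly. The sketch below deliberately uses a base R data.frame() as a stand-in so that it runs without any package; tibbles behave the same way with respect to the properties shown here. The object name small_df is made up.

```r
small_df <- data.frame(a = 1:3, b = TRUE)   # b has length 1 and is recycled

typeof(small_df)    # "list", each column is one element of the list
is.list(small_df)   # TRUE
length(small_df)    # 2, the number of columns, not rows
nrow(small_df)      # 3
small_df$b          # TRUE TRUE TRUE, the recycled column
```

This is also why $ and [[ ]] work on columns exactly as they do on list elements.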

You can access a Tibble’s columns by their name by either using the $ operator or [[ ]] – like when you access named elements in a list. This will return the vector:

new_tibble$a
## [1] 1 2 3 4 5
typeof(new_tibble$a)
## [1] "integer"
new_tibble[["a"]]
## [1] 1 2 3 4 5

You can also extract by position using [[:

new_tibble[[3]]
## [1] TRUE TRUE TRUE TRUE TRUE

As it returns a vector, you can extract the vector’s value by just adding the expression in square brackets:

new_tibble[[1]][[2]] # second value of first column
## [1] 2

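Another way of accessing specific elements is by [row, column]. Unlike [[, this returns a tibble rather than a bare value, which is why the comparison below prints a small table instead of a single TRUE: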

new_tibble[[1]][[2]] == new_tibble[2, 1] # second value of first column
##         a
## [1,] TRUE

Also, you can access the entire row by leaving out the column and vice versa:

new_tibble[2, ] # second row
## # A tibble: 1 × 3
##       a b     `an invalid name`
##   <int> <chr> <lgl>            
## 1     2 ,     TRUE
new_tibble[, 1] # first column
## # A tibble: 5 × 1
##       a
##   <int>
## 1     1
## 2     2
## 3     3
## 4     4
## 5     5

References

Burns, Patrick. 2011. The R Inferno.
Cotton, Richard. 2013. Learning R. First edition. Beijing; Sebastopol, CA: O’Reilly.
McNamara, Amelia, and Nicholas J Horton. 2017. “Wrangling Categorical Data in R.” Preprint. PeerJ Preprints. https://doi.org/10.7287/peerj.preprints.3163v2.
Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition. Sebastopol, CA: O’Reilly.