4 R Fundamentals

2024-01-24

4.1 History of the R Language

The R language traces its roots to the S language, which was developed at Bell Labs (then part of AT&T, now part of Nokia) in the 1970s. R may be considered an open-source implementation of the S language, although there are some differences. The name R comes partly from the first names of the project's initiators, Ross Ihaka and Robert Gentleman, two statisticians at the University of Auckland who began the project in the early 1990s, and partly from a play on the name of the S language. R became a very popular programming language, especially useful for statistical data exploration, modeling, and analysis, through a combination of the popularity of S within academic statistics departments, the commitment of a core group of developers, and a larger community willing to spend time contributing to a shared programming effort.

4.2 Packages

A key feature of the language is the ease with which users can extend the existing language and share their extensions by creating packages. One important development was the creation of the grid package, a collection of low-level, object-oriented graphical functions that offered an alternative to the base R graphics system. This package set the stage for others to build high-level graphical packages to make it easier for general users to tap into the power of this new graphical system. The most successful of these new graphics packages was ggplot2, the first of a series of packages developed by Hadley Wickham and collaborators now released together as the tidyverse, to make it easier to use R for data science.

R is a very rich language, and with thousands of packages extending it, no one can hope to master the entire language and all of its extensions. In this course, we will primarily use code from the tidyverse so that we can master a small, useful subset of the language designed to make common data science tasks, such as data visualization and transformation, easy to do. Code written with the tidyverse has a structure that looks quite distinct from conventional R code. However, we cannot avoid base R completely, and there are many core R concepts we need to comprehend in order to master functions within the tidyverse. The remainder of this chapter introduces many important fundamental aspects of R.

4.3 Vectors

A vector is an R object that stores values all of the same type. Each column in a data frame is an example of a vector. We can also create vectors from scratch using a few common functions and operators.

  • The operator : creates a vector of consecutive integers
  • The function seq() makes more general sequences
    • Specify the start, end, and either the increment between values or the length
  • The function c() collects a number of items together

Here are several examples.

## Using the colon (:) operator to form a sequence of consecutive integers.
1:10
 [1]  1  2  3  4  5  6  7  8  9 10
## Using the function seq() with the increment to advance by
seq(1,13,3)
[1]  1  4  7 10 13
seq(13,1,-4)
[1] 13  9  5  1
## Using seq() and the length.out argument
seq(0,10,length.out = 21)
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
## Using c()
c(2,3,5,7,11)
[1]  2  3  5  7 11
c('A', 'BB', 'CCC')
[1] "A"   "BB"  "CCC"
c(TRUE, TRUE, FALSE, TRUE)
[1]  TRUE  TRUE FALSE  TRUE
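
Because a vector holds values of a single type, c() converts mixed inputs to one common type rather than reporting an error. The short sketch below illustrates this behavior; the object names mixed and nums are made up for this illustration.

```r
# Mixing a number, a string, and a logical: every value becomes a character string.
mixed <- c(1, "a", TRUE)
mixed          # "1"    "a"    "TRUE"
class(mixed)   # "character"

# Mixing numbers and logicals: TRUE becomes 1 and FALSE becomes 0.
nums <- c(2.5, TRUE, FALSE)
nums           # 2.5 1.0 0.0
class(nums)    # "numeric"
```

If a vector unexpectedly prints with quotes around its numbers, a stray character value among the inputs is a likely cause.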

4.4 Assignment

R allows multiple ways to assign a name to an object. The conventional method is to use the two-character combination <-, which is intended to resemble an arrow. I prefer to use the symbol =, which is the convention in many other programming languages. If you want to be in the majority and clearly distinguish between assignment and setting values of function arguments (such as length.out = 21 above), you may prefer <-. If you are not confused by the different uses of the symbol = in different contexts and prefer using a symbol common to other computer languages, then you may use =. This is a matter of personal preference.

In the previous examples, I created vectors which were printed and not saved. To save the vectors, assign each object to a valid name. After successful assignment, the vector is not printed. An object may be printed by typing its name.

v1 = 1:10
v1
 [1]  1  2  3  4  5  6  7  8  9 10
v2 <- seq(0,10,length.out = 21)
v2
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0  5.5  6.0  6.5  7.0
[16]  7.5  8.0  8.5  9.0  9.5 10.0
v3 = c(2,3,5,7,11)
v3
[1]  2  3  5  7 11
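
The two symbols are interchangeable at the top level, but they behave differently inside a function call. The following sketch, which reuses seq() and its length.out argument from the earlier examples, shows one way the difference can matter; the object names a and b are chosen only for illustration.

```r
# Inside a call, = names an argument. No object called length.out is created.
a <- seq(0, 10, length.out = 5)
a                        # 0.0  2.5  5.0  7.5 10.0

# Inside a call, <- performs an assignment and then passes the value by position.
# The value 5 lands in the third position of seq(), which is the increment.
b <- seq(0, 10, length.out <- 5)
b                        # 0  5 10
length.out               # 5, an object created as a side effect
```

This is one reason to avoid <- inside function calls regardless of which symbol you use for ordinary assignment.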

4.5 Arithmetic with Vectors

Vectors of the same size may be combined element-wise by various arithmetic operations.

v1 = c(1,2,3,4)
v2 = c(0,2,4,8)

## addition
v1 + v2
[1]  1  4  7 12
## subtraction
v2 - v1
[1] -1  0  1  4
## multiplication
v1*v2
[1]  0  4 12 32
## division
v2 / v1
[1] 0.000000 1.000000 1.333333 2.000000
## exponentiation
v1^v2
[1]     1     4    81 65536

Operations with a single scalar number and a vector may also occur, where, in effect, the scalar is expanded into a vector of the same length.

## Multiply each value by 10
10 * v1
[1] 10 20 30 40
## Raise each value to the 1.5 power
v2^1.5
[1]  0.000000  2.828427  8.000000 22.627417
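
To see that the scalar really is treated like a vector of repeated values, the sketch below builds that vector explicitly with rep() and compares the results. The names scalar_result and expanded_result are made up for this example.

```r
v1 <- c(1, 2, 3, 4)

# Multiplying by a scalar ...
scalar_result <- 10 * v1
# ... gives the same answer as multiplying by the scalar repeated 4 times.
expanded_result <- rep(10, length(v1)) * v1
identical(scalar_result, expanded_result)   # TRUE

# The same rule applies to a shorter vector whose length divides the longer one:
# c(1, 10) is reused as 1, 10, 1, 10.
v1 * c(1, 10)                               # 1 20  3 40
```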

There are also many built-in functions which apply to each individual element.

## square root
sqrt(v1)
[1] 1.000000 1.414214 1.732051 2.000000
## natural log
log(v1)
[1] 0.0000000 0.6931472 1.0986123 1.3862944
## log base 10
log10(v1)
[1] 0.0000000 0.3010300 0.4771213 0.6020600
log(v1, 10)
[1] 0.0000000 0.3010300 0.4771213 0.6020600
## log base 2
log(v1, 2)
[1] 0.000000 1.000000 1.584963 2.000000
## natural exponentiation
exp(v2)
[1]    1.000000    7.389056   54.598150 2980.957987
## other bases of exponentiation
2^v1
[1]  2  4  8 16
10^v2
[1] 1e+00 1e+02 1e+04 1e+08
## trig functions, using built in value of pi
sin(pi*v1)
[1]  1.224647e-16 -2.449294e-16  3.673940e-16 -4.898587e-16
cos(pi*v1)
[1] -1  1 -1  1

Arithmetic with real numbers on computers is not exact. An arithmetic operation which is exactly zero analytically might be calculated as a very small number in a computer, such as with sin() in the previous example where each value is zero in theory, but a small multiple of \(10^{-16}\) numerically in the computer.
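
One practical consequence is that testing real numbers for exact equality with == can give surprising answers. A common workaround, sketched below with base R only, is to compare within a small tolerance; the tolerance 1e-12 is an arbitrary choice for this illustration. (The tidyverse function dplyr::near() serves a similar purpose.)

```r
# Exact comparison fails because of rounding error.
0.1 + 0.2 == 0.3                     # FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE().
isTRUE(all.equal(0.1 + 0.2, 0.3))    # TRUE

# The sin() values from the previous example are tiny rather than exactly zero,
# so compare their absolute values to a small threshold.
s <- sin(pi * c(1, 2, 3, 4))
abs(s) < 1e-12                       # TRUE TRUE TRUE TRUE
```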

4.6 Numerical Summaries of Vectors

The previous examples apply mathematical functions to one element of a vector at a time. Other functions summarize the data in a vector and return a single number.

## length
length(v1)
[1] 4
## sum
sum(v1)
[1] 10
## maximum
max(v1)
[1] 4
## minimum
min(v1)
[1] 1
## mean
mean(v1)
[1] 2.5
## median
median(v1)
[1] 2.5
## quantiles (there are multiple definitions of quantiles for finite lists of numbers)
## For example, the 0.6 quantile (60th percentile) of a list of four numbers may be defined in multiple different ways.
quantile(v1, 0.6)
60% 
2.8 
quantile(v1, 0.6, type = 3)
60% 
  2 
## standard deviation
sd(v1)
[1] 1.290994
## variance (square of the standard deviation)
var(v1)
[1] 1.666667
sd(v1)^2
[1] 1.666667

R implements nine different definitions of quantiles, selected with the type argument; the default is type = 7. My favorite is type = 3.
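
To compare the definitions side by side, the following sketch computes the 0.6 quantile of the same four numbers under each type. It uses sapply() to repeat the call, which is a base R function not otherwise covered here, and the name q60 is made up for this example.

```r
v1 <- c(1, 2, 3, 4)

# Compute the 0.6 quantile once for each of the nine types.
# names = FALSE drops the "60%" label so the result is a plain numeric vector.
q60 <- sapply(1:9, function(k) quantile(v1, 0.6, type = k, names = FALSE))
names(q60) <- paste0("type", 1:9)
q60
```

The entries for type3 and type7 match the values 2 and 2.8 shown above; the other types give their own answers for the same data.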

4.7 Data Frames

One of the vital base R concepts is that of a data frame, which is a rectangular table of data where each column contains a vector of data of the same type (a variable) and each row corresponds to a single case. An example from a previous chapter is the data frame with one row per winter with variables about freezing and thawing of Lake Mendota. Here are the first few rows.

head(mendota)
# A tibble: 6 × 7
  winter  year1 intervals duration first_freeze last_thaw  decade
  <chr>   <dbl>     <dbl>    <dbl> <date>       <date>      <dbl>
1 1855-56  1855         1      118 1855-12-18   1856-04-14   1850
2 1856-57  1856         1      151 1856-12-06   1857-05-06   1850
3 1857-58  1857         1      121 1857-11-25   1858-03-26   1850
4 1858-59  1858         1       96 1858-12-08   1859-03-14   1850
5 1859-60  1859         1      110 1859-12-07   1860-03-26   1850
6 1860-61  1860         1      117 1860-12-14   1861-04-10   1860

Data frames have dimensions which specify the number of rows and columns in the rectangular array of data. These dimensions may be found with the base R functions dim(), nrow(), and ncol().

dim(mendota)
[1] 166   7
nrow(mendota)
[1] 166
ncol(mendota)
[1] 7
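
If the mendota data are not loaded, the same functions can be tried on a small data frame built by hand. The sketch below copies three columns from the first three rows shown above into a base R data frame; the name mendota_small is made up for this illustration.

```r
# Each argument to data.frame() is a vector that becomes one column.
mendota_small <- data.frame(
  winter   = c("1855-56", "1856-57", "1857-58"),
  year1    = c(1855, 1856, 1857),
  duration = c(118, 151, 121)
)

dim(mendota_small)    # 3 3
nrow(mendota_small)   # 3
ncol(mendota_small)   # 3
```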

Note a feature of R when an array of one or more values is printed to the screen; each row begins with square brackets containing the index of the first item on that row. For short arrays, like above, there is a single row label [1]. Longer arrays label each additional row as well, as in these two examples.

LETTERS[1:26]
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
1:100
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100

In the tidyverse, a standard data frame may be augmented with additional attributes by turning it into a tibble. For example, when a large tibble is printed, only some rows and columns are shown unless this behavior is overridden. In addition, there is a label under the name of each variable which specifies the type. In the mendota example above, we see some of the possibilities.

  • <chr> is short for character; each element is a string, or array of characters
  • <dbl> is short for double; each element is a number stored in double precision
    • this data type is known as numeric in other contexts
  • <date> stands for date; each element is stored as a date

We will see additional data types later in this chapter.

4.7.1 Extracting Parts of Data Frames

Data frames are special cases of lists, a general type of R container for holding data. Each column of the data frame is an element in the list. We will rarely use lists directly in this course, but there are some base R operators for lists that you should know about.

The $ operator extracts an element of a list by name. For example, to get the duration variable from mendota, we can do this.

mendota$duration
  [1] 118 151 121  96 110 117 132 104 125 118 125 123 110 127 131  99 126 144
 [19] 136 126  91 130  62 112  99 161  78 124 119 124 128 131 113  88  75 111
 [37]  97 112 101 101  91 110 100 130 111 107 105  89 126 108  97  94  83 106
 [55]  98 101 108  99  88 115 102 116 115  82 110  81  96 125 104 105 124 103
 [73] 106  96 107  98  65 115  91  94 101 121 105  97 105  96  82 116 114  92
 [91]  98 101 104  96 109 122 114  81  85  92 114 111  95 126 105 108 117 112
[109] 113 120  65  98  91 108 113 110 105  97 105 107  88 115 123 118  99  93
[127]  96  54 111  85 107  89  87  97  93  88  99 108  94  74 119 102  47  81
[145]  53 115  21  89  80 101  95  66 106  97  87 109  57  87 117  91  62  65
[163]  94  86  70  76

We can also use double square brackets [[]] to extract list items by position or name.

mendota[[4]]
  [1] 118 151 121  96 110 117 132 104 125 118 125 123 110 127 131  99 126 144
 [19] 136 126  91 130  62 112  99 161  78 124 119 124 128 131 113  88  75 111
 [37]  97 112 101 101  91 110 100 130 111 107 105  89 126 108  97  94  83 106
 [55]  98 101 108  99  88 115 102 116 115  82 110  81  96 125 104 105 124 103
 [73] 106  96 107  98  65 115  91  94 101 121 105  97 105  96  82 116 114  92
 [91]  98 101 104  96 109 122 114  81  85  92 114 111  95 126 105 108 117 112
[109] 113 120  65  98  91 108 113 110 105  97 105 107  88 115 123 118  99  93
[127]  96  54 111  85 107  89  87  97  93  88  99 108  94  74 119 102  47  81
[145]  53 115  21  89  80 101  95  66 106  97  87 109  57  87 117  91  62  65
[163]  94  86  70  76
mendota[['duration']]
  [1] 118 151 121  96 110 117 132 104 125 118 125 123 110 127 131  99 126 144
 [19] 136 126  91 130  62 112  99 161  78 124 119 124 128 131 113  88  75 111
 [37]  97 112 101 101  91 110 100 130 111 107 105  89 126 108  97  94  83 106
 [55]  98 101 108  99  88 115 102 116 115  82 110  81  96 125 104 105 124 103
 [73] 106  96 107  98  65 115  91  94 101 121 105  97 105  96  82 116 114  92
 [91]  98 101 104  96 109 122 114  81  85  92 114 111  95 126 105 108 117 112
[109] 113 120  65  98  91 108 113 110 105  97 105 107  88 115 123 118  99  93
[127]  96  54 111  85 107  89  87  97  93  88  99 108  94  74 119 102  47  81
[145]  53 115  21  89  80 101  95  66 106  97  87 109  57  87 117  91  62  65
[163]  94  86  70  76
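
The relationship among these operators can be checked on a small example. This sketch uses a hand-built base R data frame with values taken from the first three rows of mendota; the name small is made up for this illustration.

```r
small <- data.frame(
  year1    = c(1855, 1856, 1857),
  duration = c(118, 151, 121)
)

# A data frame is a list whose elements are its columns.
is.list(small)                                   # TRUE

# $ and [[ ]] return the same vector, whether [[ ]] is given a name or a position.
identical(small$duration, small[["duration"]])   # TRUE
identical(small[[2]], small[["duration"]])       # TRUE

# Single brackets with one index keep the result as a data frame instead.
class(small[["duration"]])                       # "numeric"
class(small["duration"])                         # "data.frame"
```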

Single square brackets ([]) are useful for extracting subsets of larger arrays of data by position. These may often be used in conjunction with the colon operator (:) which expands to a sequence of values, the seq() function for more general sequences, or the c() function to collect together items of the same type.

Here are a number of examples.

# The sequence from 1 to 5
1:5
[1] 1 2 3 4 5
# Another way
seq(1,5)
[1] 1 2 3 4 5
# And another way
c(1,2,3,4,5)
[1] 1 2 3 4 5
# Numbers from 1 to 13 by 3
seq(1, 13, 3)
[1]  1  4  7 10 13
# A collection of values
c(2,3,5,7,11)
[1]  2  3  5  7 11

The next examples demonstrate the use of single brackets to take subsets from single arrays.

## The third item from the duration variable of mendota
mendota$duration[3]
[1] 121
## A range of items
mendota$duration[1:5]
[1] 118 151 121  96 110
## A few specific items
mendota$duration[c(2, 3, 5, 7, 11)]
[1] 151 121 110 132 125
## All items larger than 140
mendota$duration[mendota$duration > 140]
[1] 151 144 161

This last example works in the following way. The expression mendota$duration > 140 is a vector of length 166, the number of rows in the mendota data frame, where each element is either TRUE or FALSE. Only the TRUE elements are retained after subsetting with [].
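
The two steps can be separated to make the intermediate logical vector visible. The sketch below uses only the first five durations from mendota and a cutoff of 115 days, both chosen to keep the output short; the names d and keep are made up for this example.

```r
d <- c(118, 151, 121, 96, 110)

# Step 1: the comparison produces one TRUE or FALSE per element.
keep <- d > 115
keep          # TRUE  TRUE  TRUE FALSE FALSE

# Step 2: subsetting with the logical vector retains the TRUE positions.
d[keep]       # 118 151 121

# which() reports the positions of the TRUE values.
which(keep)   # 1 2 3
```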

We can also use single square brackets to extract some rows and columns from the data frame (or other types of arrays with possibly more dimensions) by separating the dimensions with commas.

# Get the first two rows of the columns 2, 3 and 5
mendota[1:2, c(2,3,5)]
# A tibble: 2 × 3
  year1 intervals first_freeze
  <dbl>     <dbl> <date>      
1  1855         1 1855-12-18  
2  1856         1 1856-12-06  

With the tidyverse, we will almost always use very different syntax when manipulating and extracting data in data frames. We will use the base commands and operators :, c(), and seq() at times, but will rarely use the operators $, [[]], or []. However, it is useful to know these operators as non-tidyverse code uses all of these operators extensively and when examining R code in other places, you will see them frequently.

4.8 Data Types

In the data frame mendota, we have seen three primitive data types in R: numeric, character, and date. Here is a longer list of possible data types and examples.

  • integer: an integer such as 0, 3, and \(-17\)
  • numeric: a real number
  • character: a string, or sequence of character values
  • logical: true or false (use all caps, TRUE or FALSE in code)
  • date: a date
  • datetime: a date and time

Quantitative values in R are treated as numeric by default, even if the value is an integer. To explicitly specify that a value should be treated as an integer, add an L after the number. Strings are created by putting single or double quotes around a pattern of characters.

a = 2
class(a)
[1] "numeric"
b = 2L
class(b)
[1] "integer"
s = '2'
class(s)
[1] "character"
d = "2"
class(d)
[1] "character"
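
The function class() can be applied in the same way to the other types in the list. The sketch below is a quick tour using base R only; the particular date and time values are arbitrary examples, and the time zone "UTC" is specified only to make the example self-contained.

```r
class(1:3)                         # "integer" (the colon operator creates integers)
class(c(1, 2, 3))                  # "numeric"
class(TRUE)                        # "logical"
class(as.Date("1855-12-18"))       # "Date"

# Date-times in base R have a two-part class.
class(as.POSIXct("1855-12-18 08:00:00", tz = "UTC"))   # "POSIXct" "POSIXt"
```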

4.8.1 Conversions between types

R offers several functions to convert data from one type to another. These have the form as.type() where type is replaced by the desired data type. When the conversion is not allowed, a missing value NA is introduced.

Changing characters to numbers.

a = "2"
a
[1] "2"
class(a)
[1] "character"
b = as.integer(a)
b
[1] 2
class(b)
[1] "integer"
n = as.numeric(a)
n
[1] 2
class(n)
[1] "numeric"
m = as.numeric("a")
Warning: NAs introduced by coercion
m
[1] NA
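
The conversion functions work on whole vectors, so a single call can flag every entry that fails to convert. A minimal sketch follows, where the input values are invented for illustration and suppressWarnings() is used only to keep the output tidy.

```r
raw <- c("10", "2.5", "abc")

# Each element is converted separately; only the bad entry becomes NA.
nums <- suppressWarnings(as.numeric(raw))
nums            # 10.0  2.5   NA

# is.na() locates the entries that could not be converted.
raw[is.na(nums)]   # "abc"
```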

Changing a number to a character.

s = 1234
s
[1] 1234
class(s)
[1] "numeric"
ch = as.character(s)
ch
[1] "1234"
class(ch)
[1] "character"

Numbers may be converted to logical. Zero is treated as FALSE, and all other numbers as TRUE.

as.logical(0)
[1] FALSE
as.logical(1)
[1] TRUE
as.logical(-2.5)
[1] TRUE

When logical values are converted to numbers, FALSE becomes zero and TRUE becomes 1.

as.numeric(FALSE)
[1] 0
as.numeric(TRUE)
[1] 1
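
This conversion happens automatically inside arithmetic functions, which gives a convenient way to count or take proportions. The sketch below uses the first five durations from mendota and an arbitrary cutoff of 115 days; the names d and long_winter are made up for this example.

```r
d <- c(118, 151, 121, 96, 110)
long_winter <- d > 115      # TRUE TRUE TRUE FALSE FALSE

# Each TRUE counts as 1 and each FALSE as 0.
sum(long_winter)            # 3, the number of values above 115
mean(long_winter)           # 0.6, the proportion of values above 115
```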

4.9 Valid Object Names

Objects we want to save in the active environment need to be given names. Valid names need to follow certain rules.

  • Names may contain letters, digits, periods (.), and underscore characters (_).
  • Names should begin with a letter
    • they cannot begin with a number or underscore
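    • valid names can begin with a period (provided the period is not followed by a digit), but such variables are “invisible”: they are hidden by default and, for example, are not listed by ls() unless all.names = TRUE is specified.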
    • variables that begin with a period are conventionally reserved for special internal variables that are not meant to be adjusted directly by users

Letters in names in R are case sensitive, so the names a and A are different.

It is best practice to not create objects with names that match the names of built-in R functions or objects.

  • the name c could be confused with the R function c()
  • the name pi could be confused with a built-in numerical constant of the same name
  • the name dt matches a built-in function for the density of the t distribution
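
The sketch below shows what happens when a built-in name is reused, taking pi as the example. The names masked and restored are made up for this illustration.

```r
# Assigning to pi creates a new object that hides the built-in constant.
pi <- 3
masked <- 2 * pi      # 6, not 6.283185

# The built-in value is still available with an explicit package prefix.
base::pi              # 3.141593

# Removing our object makes the built-in constant visible again.
rm(pi)
restored <- 2 * pi    # 6.283185
```

Nothing is permanently damaged, but the silent change in the value of pi in between is exactly the kind of confusion this best practice avoids.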

There are many conventions for making long variable names more readable to people reading code.

Perhaps the best practice is to separate words with underscores, a convention named snake_case as it resembles a snake that may be digesting food in discrete areas.

here_is_a_snake_case_variable_name

Another common convention is camel case, which bunches words together and capitalizes every word after the first.

camelCaseVariableName

Using periods between words is also a convention. In many modern object-oriented programming languages (and in parts of R), the period has a special meaning, so I discourage using periods in names (while recognizing that many base R functions, such as as.numeric(), do use periods in this way).

a.valid.variable.name

It is also possible to use otherwise invalid names by surrounding the name with single back ticks.

`bad name` = 1:5
`bad name` * 2
[1]  2  4  6  8 10
`2nd bad name` = 3:7
`2nd bad name`
[1] 3 4 5 6 7
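
One way to check whether a name follows the rules is the base R function make.names(), which rewrites any invalid name into a valid one. A name is valid exactly when make.names() leaves it unchanged. The candidate names below are invented for this illustration.

```r
candidates <- c("v1", "my_var", ".hidden", "2nd", "_x", "bad name")

# make.names() repairs invalid names, for example by prefixing an X
# or replacing a space with a period.
make.names(candidates)

# A name that comes back unchanged was already valid.
is_valid <- make.names(candidates) == candidates
is_valid   # TRUE  TRUE  TRUE FALSE FALSE FALSE
```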

4.10 Functions

There are many built-in functions in R. We will learn a small subset over the semester. Each function has a name. To actually call the function, you add a left and right parenthesis after the name, possibly with arguments within. Typing the name of the function alone will print the code for the function itself or some reference to the code, which is rarely what you intend. Here is an example with the function max().

# the function without ()
max
function (..., na.rm = FALSE)  .Primitive("max")
# calling the function
max(1:10)
[1] 10

4.10.1 Arguments

Many, but not all, functions in R take one or more arguments. Some arguments are required, while others have default values that are used when the argument is not specified. Each argument of a function has a name. If the name is not provided, the argument is determined by the order. When arguments are named, they can be given in any order. A single call may also mix these two modes. We present several examples using the function quantile().

4.10.2 Accessing the Documentation

You may access the documentation for a function whose name you know by typing a question mark (?) followed immediately by the function name in the console.

?quantile

Skim down to the usage section and we see the following.

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
         names = TRUE, type = 7, digits = 7, ...)

This tells us several pieces of information.

  • The first argument is called x. It does not have a default value, so it must be specified when calling the function.
    • Reading further, we learn that x is a numeric vector (in the simplest case)
  • The second argument probs by default takes the values of seq(0, 1, 0.25), or c(0, 0.25, 0.5, 0.75, 1).
    • Further documentation tells us that values need to be between 0 and 1.
  • The third argument is na.rm with default value FALSE.
    • By default, missing values (NA) are not removed. In that case, quantile() stops with an error if x contains any missing values.
  • The fourth argument is names with default value TRUE.
    • By default, the result will have a names attribute which will label the quantiles.
  • The fifth argument is type with default value 7.
    • The documentation gives details on nine different algorithms associated with nine different definitions of a quantile.
  • The sixth argument, digits, is only used when names is TRUE and specifies the precision for reporting percentages in the names labels.
  • The last argument is ..., an R convention that means that any other options will be passed on to subsequent functions.
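
To illustrate the na.rm argument described in the list above, the following sketch calls quantile() on a vector containing a missing value. The vector w is invented for this example, and tryCatch() is used only so that the error message is captured as text instead of halting the script.

```r
w <- c(1, 2, NA, 4)

# With the default na.rm = FALSE, quantile() signals an error.
# tryCatch() captures the message instead of stopping.
msg <- tryCatch(quantile(w), error = function(e) conditionMessage(e))
msg

# With na.rm = TRUE, the NA is dropped and the quantiles of 1, 2, 4 are returned.
quantile(w, na.rm = TRUE)

# Many summary functions share this argument, though mean() returns NA
# rather than an error when na.rm = FALSE.
mean(w)                 # NA
mean(w, na.rm = TRUE)   # 2.333333
```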

It does take practice to read the documentation well. Learn to skim to the arguments and to the examples.

For our example, we take as x the sequence of values from 1 to 10.

We may call quantile() on x for the default behavior which calculates the minimum, lower quartile (25th %-ile), median (50th %-ile), upper quartile (75th %-ile), and the maximum.

v = 1:10
quantile(v)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 

The object v was treated as x as it was the first argument. We could have specified this explicitly.

quantile(x = v)
   0%   25%   50%   75%  100% 
 1.00  3.25  5.50  7.75 10.00 

If we only want to find, say, the 0.1 quantile (10th percentile), we can pass 0.1 as the second argument.

quantile(v, 0.10)
10% 
1.9 

If we do not recall the order of the arguments, we can pass the arguments by name in any order.

quantile(x = v, probs = 0.10)
10% 
1.9 
quantile(probs = 0.10, x = v)
10% 
1.9 

We can also name some, but not all arguments.

quantile(v, probs = 0.1)
10% 
1.9 

If we wanted to use the default value of probs but change the value of type from 7 to 3, the most straightforward way is to pass type by name. Otherwise, we would need to supply values for all of the intervening arguments so that type could be matched by position as the fifth argument.

quantile(v, type = 3)
  0%  25%  50%  75% 100% 
   1    2    5    8   10 

When learning R, it is good practice to name the arguments every time. For common functions that you use often, you can eventually save some time typing by just passing in the arguments in order without names. Using names can make the code more readable to others and avoids hard-to-find errors where you accidentally pass in the values in the wrong order.
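
As a small illustration of an order mistake, consider the base R function rep(), whose first two arguments are x (the value to repeat) and times (how many times to repeat it). Swapping two unnamed arguments produces a valid but different result with no warning.

```r
# Unnamed arguments are matched by position: x first, then times.
rep(3, 2)               # 3 3
rep(2, 3)               # 2 2 2

# Named arguments give the same result in either order.
rep(x = 3, times = 2)   # 3 3
rep(times = 2, x = 3)   # 3 3
```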