4 R Fundamentals
2024-01-24
4.1 History of the R Language
The R language traces its roots to the S language, which was developed at Bell Labs (then part of AT&T, now part of Nokia) in the 1970s. R may be considered to be an open-source implementation of the S language, although there are some differences. The name R is partly due to the first names of the initiators of the project, Ross Ihaka and Robert Gentleman, two statisticians who were at the University of Auckland in the early 1990s when they began the project, and partly a play on the name of the S language. The combination of the popularity of S within academic statistics departments and the commitment of a core group of developers and a larger community willing to spend time to contribute to a community programming effort led to R becoming a very popular programming language, especially useful for statistical data exploration, modeling, and analysis.
4.2 Packages
A key feature of the language is the ease with which users can extend the existing language and share their extensions by creating packages. One important development was the creation of the grid package, a collection of low-level, object-oriented graphical functions that offered an alternative to the base R graphics system. This package set the stage for others to build high-level graphical packages to make it easier for general users to tap into the power of this new graphical system. The most successful of these new graphics packages was ggplot2, the first of a series of packages developed by Hadley Wickham and collaborators now released together as the tidyverse, to make it easier to use R for data science.
R is a very rich language and with thousands of packages that extend it, no one can hope to master the entire language with all of its extensions. In this course, we will use primarily code from the tidyverse so that we can master a small useful subset of the language designed to make it easy to do a number of common data science tasks, such as data visualization and transformation. Code written with tidyverse has a structure that looks quite distinct from conventional R code. However, we cannot avoid base R completely and there are many core R concepts we need to comprehend in order to master functions within the tidyverse. The remainder of this chapter introduces many important fundamental aspects of R.
4.3 Vectors
A vector is an R object that stores values all of the same type. Each column in a data frame is an example of a vector. We can also create vectors from scratch using a few common functions and operators.
- The operator : creates a vector of consecutive integers
- The function seq() makes more general sequences
  - Specify the start, end, and either the increment between values or the length
- The function c() collects a number of items together
Here are several examples.
[1] 1 2 3 4 5 6 7 8 9 10
[1] 1 4 7 10 13
[1] 13 9 5 1
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
[16] 7.5 8.0 8.5 9.0 9.5 10.0
[1] 2 3 5 7 11
[1] "A" "BB" "CCC"
[1] TRUE TRUE FALSE TRUE
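The calls that produced this output did not survive in the text. The following is a reconstruction, assuming the simplest call consistent with each printed line (the argument length.out = 21 is mentioned in the next section).

```r
# Consecutive integers with the colon operator
1:10

# Sequences with a fixed increment (by) or a fixed length (length.out)
seq(1, 13, by = 3)
seq(13, 1, by = -4)
seq(0, 10, length.out = 21)

# Collect individual values with c(); all items share one type
c(2, 3, 5, 7, 11)
c("A", "BB", "CCC")
c(TRUE, TRUE, FALSE, TRUE)
```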
4.4 Assignment
R allows multiple ways to assign a name to an object. The conventional method is to use the two-character combination <- which is intended to resemble an arrow. I prefer to use the symbol = which is the convention in many other programming languages. If you want to be in the majority and clearly distinguish between assignment and setting values to function arguments (such as length.out = 21 above), you may prefer <-. If you are not confused by the different uses of the symbol = in different contexts and prefer using a symbol common to other computer languages, then you may use =. This is a matter of personal preference.
In the previous examples, I created vectors which were printed and not saved. To save the vectors, assign each object to a valid name. After successful assignment, the vector is not printed. An object may be printed by typing its name.
[1] 1 2 3 4 5 6 7 8 9 10
[1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
[16] 7.5 8.0 8.5 9.0 9.5 10.0
[1] 2 3 5 7 11
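A sketch of assignments that would print the three vectors above. The names a, b, and primes are hypothetical, since the original code was not preserved.

```r
# Assign names with = (the author's preference); nothing is printed
a = 1:10
b = seq(0, 10, length.out = 21)

# The arrow <- does the same job and is the more common convention
primes <- c(2, 3, 5, 7, 11)

# Typing a name prints the stored object
a
b
primes
```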
4.5 Arithmetic with Vectors
Vectors of the same size may be combined element-wise by various arithmetic operations.
[1] 1 4 7 12
[1] -1 0 1 4
[1] 0 4 12 32
[1] 0.000000 1.000000 1.333333 2.000000
[1] 1 4 81 65536
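The output above is consistent with the following calls. The name v1 appears later in the chapter. The name v2 and the values of both vectors are assumptions inferred from the printed results.

```r
v1 = 1:4             # 1 2 3 4
v2 = c(0, 2, 4, 8)   # assumed values, inferred from the output

v1 + v2   # element-wise sum
v2 - v1   # element-wise difference
v1 * v2   # element-wise product
v2 / v1   # element-wise quotient
v1 ^ v2   # each element of v1 raised to the matching element of v2
```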
Operations with a single scalar number and a vector may also occur, where, in effect, the scalar is expanded into a vector of the same length.
[1] 10 20 30 40
[1] 0.000000 2.828427 8.000000 22.627417
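A reconstruction of the two calls, using the same assumed vectors v1 and v2 as in the element-wise examples.

```r
v1 = 1:4
v2 = c(0, 2, 4, 8)   # assumed values, inferred from the output

10 * v1    # same result as c(10, 10, 10, 10) * v1
v2 ^ 1.5   # same result as v2 ^ c(1.5, 1.5, 1.5, 1.5)
```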
There are also many built-in functions which apply to each individual element.
[1] 1.000000 1.414214 1.732051 2.000000
[1] 0.0000000 0.6931472 1.0986123 1.3862944
[1] 0.0000000 0.3010300 0.4771213 0.6020600
[1] 0.0000000 0.3010300 0.4771213 0.6020600
[1] 0.000000 1.000000 1.584963 2.000000
[1] 1.000000 7.389056 54.598150 2980.957987
[1] 2 4 8 16
[1] 1e+00 1e+02 1e+04 1e+08
[1] 1.224647e-16 -2.449294e-16 3.673940e-16 -4.898587e-16
[1] -1 1 -1 1
Arithmetic with real numbers on computers is not exact. An arithmetic operation which is exactly zero analytically might be calculated as a very small number in a computer, such as with sin() in the previous example, where each value is zero in theory but a small multiple of \(10^{-16}\) numerically in the computer.
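The element-wise output shown before this paragraph is consistent with the calls below (v1 and v2 are the assumed vectors from the arithmetic examples). The last two lines sketch one common way to handle inexact results: compare with a tolerance using all.equal() rather than testing exact equality.

```r
v1 = 1:4
v2 = c(0, 2, 4, 8)   # assumed values, inferred from the output

sqrt(v1)              # square root
log(v1)               # natural logarithm
log10(v1)             # base-10 logarithm
log(v1, base = 10)    # the same, using the base argument
log2(v1)              # base-2 logarithm
exp(v2)               # e raised to each element
2 ^ v1
10 ^ v2
sin(pi * v1)          # zero in theory, tiny nonzero values in practice
cos(pi * v1)

# Exact comparison fails; comparison with a tolerance succeeds
sin(pi * v1) == 0
isTRUE(all.equal(sin(pi * v1), rep(0, 4)))
```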
4.6 Numerical Summaries of Vectors
The previous examples apply mathematical functions to one element of a vector at a time. Other functions summarize all of the data in a vector and return a single number.
[1] 4
[1] 10
[1] 4
[1] 1
[1] 2.5
[1] 2.5
## quantiles (there are multiple definitions of quantiles for finite lists of numbers)
## For example, the 0.6 (60th percentile) of a list of four numbers may be defined in multiple different ways.
quantile(v1, 0.6)
60%
2.8
60%
2
[1] 1.290994
[1] 1.666667
[1] 1.666667
R has nine different definitions of quantiles implemented. My favorite is type = 3.
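Only the quantile(v1, 0.6) call survives in the text of this section. The remaining calls below are a reconstruction, assuming v1 = 1:4 and the most common summary functions that match the printed values.

```r
v1 = 1:4

length(v1)   # number of elements
sum(v1)
max(v1)
min(v1)
mean(v1)
median(v1)

quantile(v1, 0.6)             # default definition, type = 7
quantile(v1, 0.6, type = 3)   # another definition gives a different answer

sd(v1)       # standard deviation
var(v1)      # variance
sd(v1)^2     # variance again, as the squared standard deviation
```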
4.7 Data Frames
One of the vital base R concepts is that of a data frame, which is a rectangular table of data where each column contains a vector of data of the same type (a variable) and each row corresponds to a single case. An example from a previous chapter is the data frame with one row per winter with variables about freezing and thawing of Lake Mendota. Here are the first few rows.
# A tibble: 6 × 7
winter year1 intervals duration first_freeze last_thaw decade
<chr> <dbl> <dbl> <dbl> <date> <date> <dbl>
1 1855-56 1855 1 118 1855-12-18 1856-04-14 1850
2 1856-57 1856 1 151 1856-12-06 1857-05-06 1850
3 1857-58 1857 1 121 1857-11-25 1858-03-26 1850
4 1858-59 1858 1 96 1858-12-08 1859-03-14 1850
5 1859-60 1859 1 110 1859-12-07 1860-03-26 1850
6 1860-61 1860 1 117 1860-12-14 1861-04-10 1860
Data frames have dimensions which specify the number of rows and columns in the rectangular array of data. These dimensions may be found with the base R functions dim(), nrow(), and ncol().
[1] 166 7
[1] 166
[1] 7
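The mendota data set itself is not included here, so the sketch below builds a stand-in named mendota_head (a hypothetical name) from the six rows printed above. It has 6 rows rather than 166, but the three functions are used in the same way.

```r
# Stand-in for mendota: only the six rows printed above
mendota_head = data.frame(
  winter = c("1855-56", "1856-57", "1857-58", "1858-59", "1859-60", "1860-61"),
  year1 = c(1855, 1856, 1857, 1858, 1859, 1860),
  intervals = 1,
  duration = c(118, 151, 121, 96, 110, 117),
  first_freeze = as.Date(c("1855-12-18", "1856-12-06", "1857-11-25",
                           "1858-12-08", "1859-12-07", "1860-12-14")),
  last_thaw = as.Date(c("1856-04-14", "1857-05-06", "1858-03-26",
                        "1859-03-14", "1860-03-26", "1861-04-10")),
  decade = c(1850, 1850, 1850, 1850, 1850, 1860)
)

dim(mendota_head)    # number of rows, then number of columns
nrow(mendota_head)
ncol(mendota_head)
```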
Note a feature of R when an array of one or more values is printed to the screen; each row begins with square brackets containing the index of the first item on that row. For short arrays, like above, there is a single row label [1]. Longer arrays label each additional row as well, as in these two examples.
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z"
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
In the tidyverse, a standard data frame may be augmented with additional attributes by turning it into a tibble. For example, when a large tibble is printed, only some rows and columns are shown unless this behavior is overridden. In addition, there is a label under the name of each variable which specifies the type. In the mendota example above, we see some of the possibilities.
- <chr> is short for character; each element is a string, or array of characters
- <dbl> is short for double; each element is a number stored in double precision
  - this data type is known as numeric in other contexts
- <date> stands for date; each element is stored as a date
We will see additional data types later in this chapter.
4.7.1 Extracting Parts of Data Frames
Data frames are special cases of lists, a general type of R container for holding data. Each column of the data frame is an element in the list. We will rarely, if ever, use lists directly in this course, but there are some base R operators for lists that you should know about.
The $ operator extracts an element of a list by name. For example, to get the duration variable from mendota, we can do this.
[1] 118 151 121 96 110 117 132 104 125 118 125 123 110 127 131 99 126 144
[19] 136 126 91 130 62 112 99 161 78 124 119 124 128 131 113 88 75 111
[37] 97 112 101 101 91 110 100 130 111 107 105 89 126 108 97 94 83 106
[55] 98 101 108 99 88 115 102 116 115 82 110 81 96 125 104 105 124 103
[73] 106 96 107 98 65 115 91 94 101 121 105 97 105 96 82 116 114 92
[91] 98 101 104 96 109 122 114 81 85 92 114 111 95 126 105 108 117 112
[109] 113 120 65 98 91 108 113 110 105 97 105 107 88 115 123 118 99 93
[127] 96 54 111 85 107 89 87 97 93 88 99 108 94 74 119 102 47 81
[145] 53 115 21 89 80 101 95 66 106 97 87 109 57 87 117 91 62 65
[163] 94 86 70 76
We can also use double square brackets [[]] to extract list items by position or name.
[1] 118 151 121 96 110 117 132 104 125 118 125 123 110 127 131 99 126 144
[19] 136 126 91 130 62 112 99 161 78 124 119 124 128 131 113 88 75 111
[37] 97 112 101 101 91 110 100 130 111 107 105 89 126 108 97 94 83 106
[55] 98 101 108 99 88 115 102 116 115 82 110 81 96 125 104 105 124 103
[73] 106 96 107 98 65 115 91 94 101 121 105 97 105 96 82 116 114 92
[91] 98 101 104 96 109 122 114 81 85 92 114 111 95 126 105 108 117 112
[109] 113 120 65 98 91 108 113 110 105 97 105 107 88 115 123 118 99 93
[127] 96 54 111 85 107 89 87 97 93 88 99 108 94 74 119 102 47 81
[145] 53 115 21 89 80 101 95 66 106 97 87 109 57 87 117 91 62 65
[163] 94 86 70 76
[1] 118 151 121 96 110 117 132 104 125 118 125 123 110 127 131 99 126 144
[19] 136 126 91 130 62 112 99 161 78 124 119 124 128 131 113 88 75 111
[37] 97 112 101 101 91 110 100 130 111 107 105 89 126 108 97 94 83 106
[55] 98 101 108 99 88 115 102 116 115 82 110 81 96 125 104 105 124 103
[73] 106 96 107 98 65 115 91 94 101 121 105 97 105 96 82 116 114 92
[91] 98 101 104 96 109 122 114 81 85 92 114 111 95 126 105 108 117 112
[109] 113 120 65 98 91 108 113 110 105 97 105 107 88 115 123 118 99 93
[127] 96 54 111 85 107 89 87 97 93 88 99 108 94 74 119 102 47 81
[145] 53 115 21 89 80 101 95 66 106 97 87 109 57 87 117 91 62 65
[163] 94 86 70 76
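A minimal sketch of the three extractions, using a small stand-in data frame (the hypothetical mendota_head, with the first three rows and four columns of mendota) so that the code runs on its own. In mendota, duration is the fourth column.

```r
mendota_head = data.frame(
  winter = c("1855-56", "1856-57", "1857-58"),
  year1 = c(1855, 1856, 1857),
  intervals = c(1, 1, 1),
  duration = c(118, 151, 121)
)

mendota_head$duration         # by name with $
mendota_head[[4]]             # by position: duration is the fourth column
mendota_head[["duration"]]    # by name, given as a string
```

All three expressions return the same vector.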
Single square brackets ([]) are useful for extracting subsets of larger arrays of data by position. These may often be used in conjunction with the colon operator (:) which expands to a sequence of values, the seq() function for more general sequences, or the c() function to collect together items of the same type. Here are a number of examples.
[1] 1 2 3 4 5
[1] 1 2 3 4 5
[1] 1 2 3 4 5
[1] 1 4 7 10 13
[1] 2 3 5 7 11
The next examples demonstrate the use of single brackets to take subsets from single arrays.
[1] 121
[1] 118 151 121 96 110
[1] 151 121 110 132 125
[1] 151 144 161
This last example works in the following way. The expression mendota$duration > 140 is a vector of length 166, the number of rows in the mendota data frame, where each element is either TRUE or FALSE. Only the elements of mendota$duration in positions where this vector is TRUE are retained after subsetting with [].
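The four subsetting results above can be reproduced with a plain vector. The sketch below copies the first 26 values of mendota$duration from the printed output into a vector named duration (a stand-in, since the full data set is not included here).

```r
# First 26 values of mendota$duration, copied from the output above
duration = c(118, 151, 121,  96, 110, 117, 132, 104, 125, 118, 125, 123, 110,
             127, 131,  99, 126, 144, 136, 126,  91, 130,  62, 112,  99, 161)

duration[3]                   # a single element
duration[1:5]                 # the first five elements
duration[c(2, 3, 5, 7, 11)]   # elements at chosen positions

long = duration > 140         # logical vector, one TRUE or FALSE per element
long
duration[long]                # keeps only the positions where long is TRUE
```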
We can also use single square brackets to extract some rows and columns from the data frame (or other types of arrays with possibly more dimensions) by separating the dimensions with commas.
# A tibble: 2 × 3
year1 intervals first_freeze
<dbl> <dbl> <date>
1 1855 1 1855-12-18
2 1856 1 1856-12-06
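The result above corresponds to rows 1 and 2 and columns 2, 3, and 5 of mendota. A sketch with a stand-in data frame (hypothetical name mendota_head, first three rows and five columns only), which prints as a base R data frame rather than a tibble:

```r
mendota_head = data.frame(
  winter = c("1855-56", "1856-57", "1857-58"),
  year1 = c(1855, 1856, 1857),
  intervals = c(1, 1, 1),
  duration = c(118, 151, 121),
  first_freeze = as.Date(c("1855-12-18", "1856-12-06", "1857-11-25"))
)

# Rows 1 and 2; columns 2, 3, and 5
mendota_head[1:2, c(2, 3, 5)]

# Leaving a dimension empty keeps all of it: every row, two named columns
mendota_head[, c("winter", "duration")]
```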
With the tidyverse, we will almost always use very different syntax when manipulating and extracting data in data frames. We will use the base commands and operators :, c(), and seq() at times, but will rarely use the operators $, [[]], or []. However, it is useful to know these operators, as non-tidyverse code uses all of them extensively and you will see them frequently when examining R code in other places.
4.8 Data Types
In the data frame mendota, we have seen three primitive data types in R: numeric, character, and date. Here is a longer list of possible data types and examples.
- integer: an integer such as 0, 3, and \(-17\)
- numeric: a real number
- character: a string, or sequence of character values
- logical: true or false (use all caps, TRUE or FALSE in code)
- date: a date
- datetime: a date and time
Quantitative values in R are treated as numeric by default, even if the value is an integer. To explicitly specify that a value should be treated as an integer, add an L after the number.
Strings are created by putting single or double quotes around a pattern of characters.
[1] "numeric"
[1] "integer"
[1] "character"
[1] "character"
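The four results above are consistent with calls to class() such as these. The specific values are assumptions, since the original code is not shown.

```r
class(2)       # "numeric": numbers are double precision by default
class(2L)      # "integer": the L suffix requests an integer
class("two")   # "character": double quotes
class('two')   # "character": single quotes work the same way
```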
4.8.1 Conversions between types
R offers several functions to convert data from one type to another. These have the form as.type() where type is replaced by the desired data type. When the conversion is not allowed, a missing value NA is introduced.
Changing characters to numbers.
[1] "2"
[1] "character"
[1] 2
[1] "integer"
[1] 2
[1] "numeric"
Warning: NAs introduced by coercion
[1] NA
Changing a number to a character.
[1] 1234
[1] "numeric"
[1] "1234"
[1] "character"
Numbers may be converted to logical. Zero is treated as FALSE, and all other numbers as TRUE.
[1] FALSE
[1] TRUE
[1] TRUE
When logical values are converted to numbers, FALSE becomes zero and TRUE becomes 1.
[1] 0
[1] 1
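The conversions in this section can be reproduced with calls along these lines. The input values (such as "two" and -3.5) are assumptions chosen to match the printed output.

```r
x = "2"
class(x)                 # "character"
as.integer(x)            # 2, stored as an integer
as.numeric(x)            # 2, stored as a double

# A string that is not a number becomes NA; R also issues a warning,
# which is silenced here so the example runs quietly
suppressWarnings(as.numeric("two"))

as.character(1234)       # the string "1234"

as.logical(c(0, 1, -3.5))     # zero is FALSE, anything else is TRUE
as.numeric(c(FALSE, TRUE))    # FALSE becomes 0, TRUE becomes 1
```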
4.9 Valid Object Names
Objects we want to save in the active environment need to be given names. Valid names need to follow certain rules.
- Names may contain letters, digits, periods (.), and underscore characters (_).
- Names should begin with a letter
  - they cannot begin with a number or underscore
  - valid names can begin with a period (as long as a digit does not follow it), but such variables are “invisible” and do not appear in the environment listing by default.
  - variables that begin with a period are conventionally reserved for special internal variables that are not meant to be adjusted directly by users
Letters in names in R are case sensitive, so the names a and A are different.
It is best practice to not create objects with names that match the names of built-in R functions or objects.
- the name c could be confused with the R function c();
- the name pi could be confused with a built-in numerical constant of the same name;
- the name dt matches a built-in function for the density of the t distribution.
There are many conventions for making long variable names more readable to people reading code.
Perhaps the best practice is to separate words with underscores, a convention named snake_case as it resembles a snake that may be digesting food in discrete areas.
here_is_a_snake_case_variable_name
Another common convention is camel case, which bunches words together and capitalizes every word after the first.
camelCaseVariableName
Using periods between words is also a convention. In many modern object-oriented programming languages (and in parts of R), the period has a special meaning, so I discourage using periods in names (recognizing that many base R functions, such as as.numeric(), do use periods in this way).
a.valid.variable.name
It is also possible to use otherwise invalid names by surrounding the name with single back ticks.
[1] 2 4 6 8 10
[1] 3 4 5 6 7
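The names used to produce the two vectors above are not shown. As an illustration only, the sketch below uses two hypothetical names that would be invalid without backticks: one containing a space and one starting with a digit.

```r
# Backticks allow a name containing a space (hypothetical name)
`my vector` = seq(2, 10, by = 2)
`my vector`

# A name that starts with a digit also needs backticks (hypothetical name)
`2nd vector` = 3:7
`2nd vector`
```

Such names must be wrapped in backticks every time they are used, which is one reason to prefer ordinary valid names.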
4.10 Functions
There are many built-in functions in R. We will learn a small subset over the semester. Each function has a name. To actually call the function, you add a left and right parenthesis after the name, possibly with arguments within. Typing the name of the function alone will print the code for the function itself or some reference to the code, which is rarely what you intend. Here is an example with the function max().
function (..., na.rm = FALSE) .Primitive("max")
[1] 10
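The two results above contrast a function name with a function call. A sketch, assuming the vector v holds the values 1 to 10 (the original definition of v is not shown):

```r
v = 1:10   # assumed definition

max        # no parentheses: prints the function object itself
max(v)     # parentheses with an argument: calls the function

is.function(max)   # the bare name refers to a function object
```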
4.10.1 Arguments
Many, but not all, functions in R take one or more arguments. Arguments may be required, but also may have default values which are used if the argument is not specified. Each argument of a function has a name. If the name is not provided, the argument is determined by the order. When arguments are named, they can be given in any order. A single call may also mix these two modes. We illustrate with several examples using the function quantile().
4.10.2 Accessing the Documentation
You may access the documentation for a function whose name you know by typing a question mark (?) followed immediately by the function name in the console. Skim down to the usage section and we see the following.
quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
names = TRUE, type = 7, digits = 7, ...)
This tells us several pieces of information.
- The first argument is called x. It does not have a default value, so it must be specified when calling the function.
  - Reading further, we learn that x is a numeric vector (in the simplest case)
- The second argument probs by default takes the values of seq(0, 1, 0.25), or c(0, 0.25, 0.5, 0.75, 1).
  - Further documentation tells us that values need to be between 0 and 1.
- The third argument is na.rm with default value FALSE.
  - By default, missing values (NA) are not removed, and quantile() stops with an error if x contains any.
- The fourth argument is names with default value TRUE.
  - By default, the result will have a names attribute which will label the quantiles.
- The fifth argument is type with default value 7.
  - The documentation gives details on nine different algorithms associated with nine different definitions of a quantile.
- The sixth argument, digits, is only used when names is true and specifies the precision for reporting percentages in the names labels.
- The last argument is ..., an R convention that means that any other options will be passed on to subsequent functions.
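The effect of na.rm is easy to check directly. In current versions of R, quantile() signals an error when x contains NA and na.rm is FALSE, whereas simpler summaries such as mean() return NA. The sketch below uses a small made-up vector and tryCatch() to capture the error message instead of halting.

```r
x = c(1, 2, NA, 4)   # made-up data with one missing value

# Capture the error message rather than stopping the script
result = tryCatch(quantile(x), error = function(e) conditionMessage(e))
result

quantile(x, na.rm = TRUE)   # missing value dropped before computing
mean(x)                     # NA: mean() propagates the missing value
mean(x, na.rm = TRUE)
```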
It does take practice to read the documentation well. Learn to skim to the arguments and to the examples.
For our example, we take as x the vector v, which holds the sequence of values from 1 to 10. We may call quantile() on v for the default behavior, which calculates the minimum, lower quartile (25th %-ile), median (50th %-ile), upper quartile (75th %-ile), and the maximum.
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
The object v was treated as x as it was the first argument. We could have specified this explicitly.
0% 25% 50% 75% 100%
1.00 3.25 5.50 7.75 10.00
If we only want to find, say, the 0.1 quantile (10th percentile), we can pass 0.1 as the second argument.
10%
1.9
If we do not recall the order of the arguments, we can pass the arguments by name in any order.
10%
1.9
10%
1.9
We can also name some, but not all arguments.
10%
1.9
If we wanted to use the default value of probs but change the value of type from 7 to 3, the most straightforward way is to use the name type rather than passing in arguments for all of the intervening arguments so that the fifth can be specified without a name.
0% 25% 50% 75% 100%
1 2 5 8 10
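The calls behind the quantile() results in this section are not shown. The following reconstruction, assuming v = 1:10, produces the same values and walks through positional, named, and mixed argument matching.

```r
v = 1:10   # assumed definition

quantile(v)                     # v is matched to x by position
quantile(x = v)                 # the same call with x named

quantile(v, 0.1)                # the second position is probs
quantile(x = v, probs = 0.1)    # both arguments named
quantile(probs = 0.1, x = v)    # named arguments in any order
quantile(v, probs = 0.1)        # positional and named mixed

quantile(v, type = 3)           # skip ahead to type by naming it
```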
When learning R, it is good practice to name the arguments every time. For common functions that you use often, you can eventually save some time typing by just passing in the arguments in order without names. Using names can make the code more readable to others and avoids hard-to-find errors where you accidentally pass in the values in the wrong order.