Section 3 The Basics of R
R is built around a few basic pieces - once you understand them, it’s easier to understand more complex commands, since everything is built from the same basic foundations.
In programming terms, we can refer to the basic pieces that make up R as data types.
3.1 Basic data types
3.1.1 Numbers
The numeric data type allows you to work with numbers. R can do all the basic operations you’d expect: addition, subtraction, multiplication and division.
At the most basic level, you can use R as a calculator by doing standard operations like +, -, / (division), * (multiplication) and ^ (power) on numeric data:
## [1] 2
## [1] 7.5
## [1] 64
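As a minimal sketch of this calculator-style use, the lines below apply each of these operators. The numbers are illustrative choices rather than values taken from the text above.

```r
# Illustrative values only
1 + 1     # addition: 2
10 - 2.5  # subtraction: 7.5
3 * 4     # multiplication: 12
15 / 2    # division: 7.5
4 ^ 3     # power: 64
```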
R also has an integer (whole number) data type. Integers (usually) work exactly the same as numeric data, so you don’t need to worry too much about the difference for now. Integers will automatically be converted to the more general numeric format when needed:
## [1] 2
## [1] 3.1
## [1] 2.5
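One way to watch the conversion happen is to ask R for the type of a value with class(). This sketch assumes the L suffix, which is how R writes an integer value directly; the numbers are illustrative.

```r
# The L suffix creates an integer; plain numbers are numeric
class(2L)        # "integer"
class(2.5)       # "numeric"

# Mixing an integer with a numeric value gives a numeric result
class(2L + 1.1)  # "numeric"

# Dividing two integers also gives a numeric result
5L / 2L          # 2.5
```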
3.1.2 Characters (text)
The character data type allows you to store and manipulate text. Character data is created by wrapping text in either single ' or double " quotes. In programming terms, we also refer to each chunk of text as a string:
## [1] "apple"
# Note: this is still just one string. All the text, including
# the spaces, is contained in the same chunk of text
toupper("three bananas")
## [1] "THREE BANANAS"
## [1] "car"
## [1] "forgot"
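A few more string operations, as a hedged sketch. Only toupper() appears in the text above; nchar(), tolower(), substr() and paste() are other base R functions picked here for illustration, and the strings are made up.

```r
# Single and double quotes create the same kind of string
'apple'
"apple"

nchar("three bananas")      # number of characters, spaces included: 13
tolower("APPLE")            # "apple"
substr("carpet", 1, 3)      # characters 1 to 3: "car"
paste("three", "bananas")   # join strings with a space: "three bananas"
```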
3.1.3 Logical (True/False)
The logical data type is used to represent the True/False result of a logical test or comparison. These are represented by the special values of TRUE and FALSE (basically 1 and 0, with special labels attached to them). To do logical comparisons, you can use syntax like:
==: equals. Note that you need a double equal sign to compare values; a single equal sign does something different.
## [1] FALSE
<, >: less than, greater than
## [1] TRUE
<=, >=: less than or equal to, greater than or equal to
## [1] TRUE
!=: not equal to
## [1] TRUE
!: not, which reverses the result of another logical test:
## [1] FALSE
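The comparison operators listed above can be tried out as follows. The values are illustrative, not the original examples.

```r
3 == 4        # equals: FALSE
3 < 4         # less than: TRUE
4 >= 4        # greater than or equal to: TRUE
"a" != "b"    # not equal to: TRUE
!(3 < 4)      # not: reverses TRUE to FALSE

# TRUE and FALSE behave like 1 and 0 in calculations
TRUE + TRUE   # 2
```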
3.1.3.1 Combining logicals: AND and OR
More complex logical tests can be conducted by combining multiple tests with the and (&) and or (|) operators.
& takes two logicals, e.g. a & b, and returns TRUE if both a and b are TRUE, and FALSE otherwise.
## [1] TRUE
## [1] FALSE
a | b returns TRUE if either a or b (or both) is TRUE, and FALSE otherwise.
## [1] TRUE
## [1] FALSE
It’s best to wrap each individual test in parentheses () to make the logic clear.
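A short sketch of combined tests, with each individual test wrapped in parentheses. The comparisons are illustrative.

```r
(3 < 4) & (2 > 1)   # both tests are TRUE: TRUE
(3 < 4) & (2 > 5)   # second test is FALSE: FALSE
(3 < 4) | (2 > 5)   # at least one test is TRUE: TRUE
(3 > 4) | (2 > 5)   # neither test is TRUE: FALSE
```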
3.2 Converting between types
Occasionally your data will be read in from a file as the wrong type. You might be able to fix this by changing the way you read in the file, but otherwise you should convert the data to the type that makes the most sense (you might have to clean up some invalid values first).
Functions like as.character(), as.numeric() and as.logical() will convert data to the relevant type. Probably the most common type conversion you’ll have to do is when numeric data gets treated as text and is stored as character. Numeric operations like addition won’t work until you fix this:
## Error in "1" + 1: non-numeric argument to binary operator
## [1] 2
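A minimal sketch of the three conversion functions, using made-up values. The last line shows what happens to an invalid value: it becomes NA (R's missing value), which is why some clean-up may be needed first. suppressWarnings() is used here only to keep the output quiet; without it, R also warns that NAs were introduced by coercion.

```r
as.numeric("1") + 1     # text converted to a number, then added: 2
as.character(42)        # number converted to text: "42"
as.logical("TRUE")      # text converted to a logical: TRUE

# Text that is not a valid number cannot be converted and becomes NA
suppressWarnings(as.numeric("ten"))
```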
3.3 Variables: Storing Results
The results of calculations in R can be stored in variables: you give a name to the results, and then when you want to look at, use or change those results later, you access them using the same name.
You assign a value to a variable using either = or <- (these are mostly equivalent, don’t worry too much about the difference), putting the variable name on the left hand side and the value on the right.
## [1] 22
## [1] TRUE
# Changing a variable: this will overwrite the old value with the
# new one, the old value won't be available unless you've
# stored it somewhere else
scale_total = scale_total + 2
scale_total
## [1] 24
When you assign a variable, you’re asking R to remember some data so you can use it later. Understanding that simple principle will take you a long way in R programming.
The expression on the right hand side might be long and complex, but as long as it’s valid R code that creates a single value, all you’re doing is creating a single value and assigning it to a name, just like in the simple examples above.
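To illustrate, here is a sketch in which the right hand side is a slightly longer expression. The names item_1, item_2 and scale_mean are hypothetical, chosen only for this example.

```r
# Hypothetical variable names and values
item_1 <- 4
item_2 = 5     # = assigns in the same way as <- here

# The right hand side is evaluated first, then the single
# resulting value is stored under the name on the left
scale_mean <- (item_1 + item_2) / 2
scale_mean     # 4.5
```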
Variable names in R should start with a letter (a-zA-Z), and can contain letters, numbers, underscores _ and periods ., so model3, get.scores, ANX_total are all valid variable names.
3.3.1 Copying Variables
If you assign an existing value to a new variable, it will create a copy. If you change one copy, the other will stay as it was:
## [1] 3
## [1] 5
This can be useful if you want to test out some changes to your data, or create multiple different subsets of the same data.
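A small sketch of this copying behaviour, using the hypothetical names original and backup:

```r
original <- 3
backup <- original   # backup now holds its own copy of the value

original <- 5        # changing original does not affect backup

backup               # 3
original             # 5
```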
3.4 Vectors
All the data types discussed above can be stored in vectors¹, which are sequences with multiple elements of the same type. Vectors can be created using the c() function to put together multiple elements.
## [1] 1 2 3 10
The number of elements in a vector is the length:
## [1] 3
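As a hedged sketch of creating vectors and checking their length (the names scores and fruit are made up for this example). The last line shows what "same type" means in practice: if you mix types inside c(), R converts everything to one common type, here character.

```r
scores <- c(1, 2, 3, 10)
fruit <- c("apple", "banana", "cherry")

length(scores)    # 4
length(fruit)     # 3

# Mixed types are converted to a single common type
c(1, "a", TRUE)   # "1" "a" "TRUE"
```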
3.4.1 Calculating with vectors
R automatically applies most calculations and operations to every element of a vector at the same time, so you work with vectors the same way you would single values².
Adding, multiplying, or comparing a single value with a vector automatically adds, multiplies or compares the single value to every element of the vector. The result is a new vector with the same length:
## [1] 2 3 4
## [1] 30 60 90
## [1] FALSE TRUE FALSE
When you work with two vectors of the same length, R automatically matches up the first element of one vector with the first element of the other, the second element with the second element, and so on:
## [1] 11 22 33
## [1] TRUE FALSE TRUE
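Both cases can be sketched with two small vectors. The names x and y and their values are illustrative.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Single value with a vector: applied to every element
x + 1     # 2 3 4
x * 10    # 10 20 30
x == 2    # FALSE TRUE FALSE

# Two vectors of the same length: matched up element by element
x + y     # 11 22 33
y > 15    # FALSE TRUE TRUE
```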
NB: not recommended! It’s possible to do some tricks with vectors of different lengths, but it’s generally not needed. Usually, you’ll either have a single value you want to work with, or vectors that are all the same length.
## [1] 0 2 0 4 0 6 0 8
Most of the time, trying to work with two vectors of different lengths will just produce a warning or an error that lets you know something is wrong:
## Warning in c(5, 6) * c(3, 8, 4): longer object length is not a multiple of
## shorter object length
## [1] 15 48 20
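In both cases above, R is reusing (recycling) the shorter vector until it covers the longer one; the warning appears when the lengths do not divide evenly. A small sketch with illustrative values:

```r
# The shorter vector c(0, 1) is reused four times to match length 8
c(1, 2, 3, 4, 5, 6, 7, 8) * c(0, 1)   # 0 2 0 4 0 6 0 8

# The same calculation with the repetition written out by hand
c(1, 2, 3, 4, 5, 6, 7, 8) * c(0, 1, 0, 1, 0, 1, 0, 1)
```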
3.4.1.1 R is built around vectors
The details above are relevant when you need to do custom calculations manually. But of course, R already has plenty of built-in commands, and these are all designed to work with vectors. A lot of the time, you’ll just be feeding vectors into existing commands:
## [1] 6
## [1] 14
##
## Welch Two Sample t-test
##
## data: c(1, 2, 3, 4) and c(1, 2, 3, 5)
## t = -0.23355, df = 5.5846, p-value = 0.8237
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.917221 2.417221
## sample estimates:
## mean of x mean of y
## 2.50 2.75
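A few more built-in functions that take a whole vector as input, as a minimal sketch. The vector scores is hypothetical; sum(), mean(), max() and sd() are base R functions.

```r
scores <- c(1, 2, 3, 4)

sum(scores)    # total of all elements: 10
mean(scores)   # average: 2.5
max(scores)    # largest element: 4
sd(scores)     # standard deviation
```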
3.4.2 Missing values
All types of vectors allow for missing data, through the special NA value.
## [1] 29 NA 14
Generally, NA values will stay NA when you try to calculate with the vector.
## [1] 58 NA 28
## [1] "A" NA "C"
# Missing values on either side of the sum will produce
# missing values in the result
c(1, NA, 3) + c(4, 5, NA)
## [1] 5 NA NA
The is.na() function can test which values are missing:
## [1] FALSE TRUE FALSE
Functions like sum() and mean() will produce a missing result by default if any values in the input are missing. Use the na.rm = TRUE option (short for “NA remove”) to ignore the missing values and just use the values that are available:
## [1] NA
## [1] 5
Other functions in R will automatically remove missing values, but will usually warn you when they do. It’s always good to check how missing values are being treated, whatever tool you’re using.
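Putting the pieces of this section together in one sketch. The vector name ages is hypothetical; the values are illustrative.

```r
ages <- c(29, NA, 14)

ages * 2                   # NA stays NA: 58 NA 28
is.na(ages)                # FALSE TRUE FALSE
sum(is.na(ages))           # count of missing values: 1

sum(ages)                  # NA, because one value is missing
sum(ages, na.rm = TRUE)    # missing value ignored: 43
mean(ages, na.rm = TRUE)   # mean of the available values: 21.5
```

Counting the missing values with sum(is.na()) works because TRUE is treated as 1 and FALSE as 0.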
3.4.3 Indexing: accessing parts of vectors
To access parts of a vector, use square brackets [] after the vector and use integers to specify which parts you want to extract. E.g. to extract the second element:
## [1] "b"
This is known as indexing. Here, 2 is the index.
To extract multiple elements, you can use a vector as the index. R returns a new vector, containing the elements that match up to the index.
## [1] "b" "c" "d"
## [1] "b" "c" "d"
The index can include any number between 1 and the length of the vector, in any order. You can also access the same element multiple times:
## [1] "d" "a" "b" "c" "a" "b"
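The indexing patterns described above, sketched on a hypothetical vector called letters_vec. The 2:4 shorthand is base R syntax for the sequence 2, 3, 4.

```r
letters_vec <- c("a", "b", "c", "d")

letters_vec[2]             # single element: "b"
letters_vec[c(2, 3, 4)]    # several elements: "b" "c" "d"
letters_vec[2:4]           # same result using a sequence
letters_vec[c(4, 1, 1)]    # any order, with repeats: "d" "a" "a"
```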
3.4.4 Logical indexing: filtering your data based on conditions
One of the most powerful tools in R is the ability to access the subset of your data that meets a condition. If you use a logical vector as the index for your vector, R returns a new vector containing just the elements where the index was TRUE:
## [1] 20 30
This means you can filter the vector using the results of a logical test:
## [1] 20 30
This works because the result of x >= 20 is a logical vector c(TRUE, FALSE, TRUE), which works just like it did above. If you’re having trouble understanding a complex R expression, you can often pull out the individual parts and test them separately to see how they work.
This kind of logical subsetting is particularly useful once you start testing based on other vectors:
group = c("Control", "Treatment", "Treatment", "Control")
score = c(6, 5, 7, 4)
score[group == "Treatment"]
## [1] 5 7
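A sketch of pulling an expression apart, then combining conditions. The values of x and the intermediate name keep are assumptions made for illustration; group and score are the vectors defined just above.

```r
x <- c(20, 10, 30)

# Step 1: the logical test on its own
keep <- x >= 20
keep                 # TRUE FALSE TRUE

# Step 2: use the result as the index
x[keep]              # 20 30

group <- c("Control", "Treatment", "Treatment", "Control")
score <- c(6, 5, 7, 4)

score[group == "Control"]                 # 6 4
mean(score[group == "Treatment"])         # 6
score[group == "Treatment" & score > 5]   # 7
```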
3.4.5 Changing vectors
To change part of a vector, you can index the vector on the left-hand side of the = / <- symbol and put the replacement on the right-hand side. You just need to make sure the replacement is either:
- A single value, or
- The same length as the part you’re replacing.
## [1] 11 15 12 44 18
## [1] 11 6 7 8 18
## [1] 11 0 12 13 0
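Both kinds of replacement can be sketched as below, along with a replacement based on a logical test. The starting values are illustrative, and each step builds on the result of the previous one.

```r
x <- c(11, 15, 12, 44, 18)

# Replace one element with a single value
x[2] <- 0
x                        # 11 0 12 44 18

# Replace two elements with a vector of the same length
x[c(3, 4)] <- c(7, 8)
x                        # 11 0 7 8 18

# Replace every element that meets a condition with a single value
x[x > 10] <- NA
x                        # NA 0 7 8 NA
```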
¹ Actually, most things in R are vectors, even when they look like single values. Even the single elements shown above, like 3, are just vectors with a length of 1.
² In other programming languages, you might have to manually apply the operation to each element of the sequence, using something like a for loop.