Section 3 The Basics of R

R is built around a few basic pieces. Once you understand them, more complex commands are easier to follow, since everything is built from the same foundations.

In programming terms, we can refer to the basic pieces that make up R as data types.

3.1 Basic data types

3.1.1 Numbers

The numeric data type allows you to work with numbers. At the most basic level, you can use R as a calculator by applying the standard operators to numeric data: + (addition), - (subtraction), * (multiplication), / (division) and ^ (power):

1 + 1

## [1] 2

2.5 * 3

## [1] 7.5

8^2

## [1] 64

R also has an integer (whole number) data type. Integers (usually) work exactly the same as numeric data, so you don’t need to worry too much about the difference for now. Integers will automatically be converted to the more general numeric format when needed:

# You can specify that data should be integers using "L"
1L + 1L

## [1] 2

# Automatically converts the result to numeric
3L + 0.1

## [1] 3.1

5L / 2

## [1] 2.5
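
If you want to check which type R has given a value, one option is the built-in class() function. It isn't covered in the text above and is used here only as an illustration of the integer-to-numeric conversion:

```r
# Plain numbers are stored as numeric by default
class(1)

## [1] "numeric"

# The L suffix creates an integer
class(1L)

## [1] "integer"

# Mixing an integer with a numeric value gives a numeric result
class(3L + 0.1)

## [1] "numeric"
```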

3.1.2 Characters (text)

The character data type allows you to store and manipulate text. Character data is created by wrapping text in either single ' or double " quotes. In programming terms, we also refer to each chunk of text as a string:

"apple"

## [1] "apple"

# Note: this is still just one string. All the text, including
#   the spaces, is contained in the same chunk of text
toupper("three bananas")

## [1] "THREE BANANAS"

# Get part of a string.
substr("carrot", 1, 3)

## [1] "car"

# Stick multiple strings together with paste0
paste0("for", "got")

## [1] "forgot"
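
A few more small examples in the same spirit. nchar() and paste() are standard R functions that the text above doesn't mention, so treat this as an optional sketch rather than part of the core material:

```r
# Count the characters in a string (the space counts too)
nchar("three bananas")

## [1] 13

# paste() works like paste0(), but puts a space between the pieces by default
paste("three", "bananas")

## [1] "three bananas"

# Single and double quotes create exactly the same string
'apple' == "apple"

## [1] TRUE
```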

3.1.3 Logical (True/False)

The logical data type is used to represent the True/False result of a logical test or comparison. These are represented by the special values of TRUE and FALSE (basically 1 and 0, with special labels attached to them). To do logical comparisons, you can use syntax like:

==: equals. Note that you need a double equal sign to compare values. A single equal sign does something different: it assigns a value.

"a" == "b"

## [1] FALSE

<, >: less than, greater than

3 < 4

## [1] TRUE

<=, >=: less than or equal to, greater than or equal to

10 >= 10

## [1] TRUE

!=: not equal to

"hello" != "world"

## [1] TRUE

!: not, which reverses the result of another logical test:

!(5 > 3)

## [1] FALSE
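
To see what "basically 1 and 0" means in practice, you can use logical values in arithmetic. This is a small sketch: as.numeric() is introduced later in this section, and the particular tests are arbitrary examples.

```r
# TRUE counts as 1 when used in arithmetic
TRUE + TRUE

## [1] 2

# FALSE counts as 0
as.numeric(FALSE)

## [1] 0

# Adding up test results counts how many were TRUE
(3 < 4) + (10 >= 10) + ("a" == "b")

## [1] 2
```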

3.1.3.1 Combining logicals: AND and OR

More complex logical tests can be conducted by combining multiple tests with the AND (&) and OR (|) operators.

& takes two logicals, e.g. a & b, and returns TRUE if both a and b are TRUE, and FALSE otherwise.

# Both conditions are true: TRUE & TRUE is TRUE
(2 > 1) & ("a" == "a")

## [1] TRUE

# Only one condition is true: TRUE & FALSE is FALSE
(3 >= 3) & ("b" == "a")

## [1] FALSE

a | b returns TRUE if either a or b (or both) is TRUE, and FALSE otherwise.

# FALSE | TRUE is TRUE
(1 > 2) | ("a" == "a")

## [1] TRUE

# FALSE | FALSE is FALSE
(1 > 2) | ("a" == "b")

## [1] FALSE

It’s best to wrap each individual test in parentheses () to make the logic clear.
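
Parentheses matter most when you mix & and | in one expression, because R evaluates & before |. The following sketch uses bare TRUE and FALSE values in place of real tests to show how the grouping changes the answer:

```r
# Without parentheses, & is evaluated first:
#   this is read as TRUE | (FALSE & FALSE)
TRUE | FALSE & FALSE

## [1] TRUE

# Parentheses change the grouping, and here the result too
(TRUE | FALSE) & FALSE

## [1] FALSE
```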

3.2 Converting between types

Occasionally your data will be read in from a file as the wrong type. You might be able to fix this by changing the way you read in the file, but otherwise you should convert the data to the type that makes the most sense (you might have to clean up some invalid values first).

Functions like as.character(), as.numeric() and as.logical() will convert data to the relevant type. Probably the most common conversion you’ll need is from character to numeric, when numbers have been read in as text. Numeric operations like addition won’t work until you fix this:

"1" + 1

## Error in "1" + 1: non-numeric argument to binary operator

one_fixed = as.numeric("1")
one_fixed + 1

## [1] 2
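
The other conversion functions work the same way. The sketch below also shows what happens when a value can't be converted: R returns NA (a missing value, covered in the section on vectors) and warns you. The input values are made up for illustration.

```r
# Number to text
as.character(12)

## [1] "12"

# Text to logical
as.logical("TRUE")

## [1] TRUE

# Text that isn't a valid number becomes NA, with a warning
as.numeric("twelve")

## Warning: NAs introduced by coercion

## [1] NA
```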

3.3 Variables: Storing Results

The results of calculations in R can be stored in variables: you give a name to the results, and then when you want to look at, use or change those results later, you access them using the same name.

You assign a value to a variable using either = or <-, putting the variable name on the left-hand side and the value on the right. The two are mostly equivalent, so don’t worry too much about the difference.

scale_total = 3 + 8 + 5 + 2 + 4
# Accessing saved results
scale_total

## [1] 22

# Using saved results in another calculation
severe_disorder = scale_total >= 15
severe_disorder

## [1] TRUE

# Changing a variable: this will overwrite the old value with the
#   new one, the old value won't be available unless you've
#   stored it somewhere else
scale_total = scale_total + 2
scale_total

## [1] 24

When you assign a variable, you’re asking R to remember some data so you can use it later. Understanding that simple principle will take you a long way in R programming.

The expression on the right hand side might be long and complex, but as long as it’s valid R code that creates a single value, all you’re doing is creating a single value and assigning it to a name, just like in the simple examples above.

Variable names in R should start with a letter (a-zA-Z), and can contain letters, numbers, underscores _ and periods ., so model3, get.scores, ANX_total are all valid variable names.
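
As a quick illustration, here are the three example names in use, assigned with <- instead of =. The values are arbitrary. One extra fact worth knowing is that names are case sensitive.

```r
# <- does the same job as = in these assignments
model3 <- 10
get.scores <- 20
ANX_total <- model3 + get.scores
ANX_total

## [1] 30

# Names are case sensitive: anx_total is a different name from ANX_total.
# Uncommenting the next line would give an "object not found" error.
# anx_total
```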

3.3.1 Copying Variables

If you assign an existing value to a new variable, it will create a copy. If you change one copy, the other will stay as it was:

a = 3
# b is a copy of a
b = a
b = b + 2
# We didn't change a
a

## [1] 3

# We only changed the copy
b

## [1] 5

This can be useful if you want to test out some changes to your data, or create multiple different subsets of the same data.

3.4 Vectors

All the data types discussed above can be stored in vectors¹, which are sequences with multiple elements of the same type. Vectors can be created using the c() function to put together multiple elements.

c(1, 2, 3, 10)

## [1]  1  2  3 10

The number of elements in a vector is the length:

length(c("a", "b", "c"))

## [1] 3
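
Because every element of a vector has to be the same type, c() converts its inputs to a common type when they differ. A short sketch of that behaviour, along with the point that a single value is itself a vector of length 1:

```r
# Mixing numbers, text and logicals: everything becomes text
c(1, "a", TRUE)

## [1] "1"    "a"    "TRUE"

# A single value is just a vector with one element
length(3)

## [1] 1
```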

3.4.1 Calculating with vectors

R automatically applies most calculations and operations to every element of a vector at the same time, so you work with vectors the same way you would single values².

Adding, multiplying, or comparing a single value with a vector automatically adds, multiplies or compares the single value to every element of the vector. The result is a new vector with the same length:

1 + c(1, 2, 3)

## [1] 2 3 4

3 * c(10, 20, 30)

## [1] 30 60 90

"banana" == c("apple", "banana", "carrot")

## [1] FALSE  TRUE FALSE

When you work with two vectors of the same length, R automatically matches up the first element of one vector with the first element of the other, the second element with the second element, and so on:

c(1, 2, 3) + c(10, 20, 30)

## [1] 11 22 33

c("a", "b", "c") == c("a", "x", "c")

## [1]  TRUE FALSE  TRUE
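
For comparison, here is roughly what the same calculation looks like if you apply the operation to each element yourself with a for loop. It is a sketch only: it uses square-bracket indexing, which is explained later in this section, and the variable names are made up for the example.

```r
x = c(10, 20, 30)

# The manual way: visit each position and fill in the result
tripled = numeric(length(x))
for (i in seq_along(x)) {
  tripled[i] = 3 * x[i]
}
tripled

## [1] 30 60 90

# The vectorised way gives the same result in one step
3 * x

## [1] 30 60 90
```

In everyday R code the vectorised version is usually preferred, since it is shorter and harder to get wrong.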

NB: not recommended! It’s possible to do some tricks with vectors of different lengths, but it’s generally not needed. Usually, you’ll either have a single value you want to work with, or vectors that are all the same length.

# "Clever", but not recommended
c(0, 1) * c(1, 2, 3, 4, 5, 6, 7, 8)

## [1] 0 2 0 4 0 6 0 8

When the length of the longer vector isn’t a multiple of the length of the shorter one, R still returns a result, but produces a warning to let you know something is probably wrong:

c(5, 6) * c(3, 8, 4)

## Warning in c(5, 6) * c(3, 8, 4): longer object length is not a multiple of
## shorter object length

## [1] 15 48 20
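
The results in the two examples above come from R repeating the shorter vector until it is as long as the longer one. You can reproduce this by hand with rep(), which is used here purely to make the repetition visible:

```r
# c(0, 1) repeated four times lines up with the eight values
rep(c(0, 1), times = 4)

## [1] 0 1 0 1 0 1 0 1

rep(c(0, 1), times = 4) * c(1, 2, 3, 4, 5, 6, 7, 8)

## [1] 0 2 0 4 0 6 0 8

# When the lengths don't divide evenly, the last repeat is cut short,
#   which is why R warns you
rep(c(5, 6), length.out = 3)

## [1] 5 6 5
```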

3.4.1.1 R is built around vectors

The details above are relevant when you need to do custom calculations manually. But of course, R already has plenty of built-in commands, and these are all designed to work with vectors. A lot of the time, you’ll just be feeding vectors into existing commands:

sum(c(1, 2, 3))

## [1] 6

mean(c(10, 12, 16, 18))

## [1] 14

t.test(c(1, 2, 3, 4), c(1, 2, 3, 5))

## 
##  Welch Two Sample t-test
## 
## data:  c(1, 2, 3, 4) and c(1, 2, 3, 5)
## t = -0.23355, df = 5.5846, p-value = 0.8237
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.917221  2.417221
## sample estimates:
## mean of x mean of y 
##      2.50      2.75
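
It can help to notice that built-in commands come in two broad flavours: some summarise a whole vector as a single value, while others return one result per element. A small sketch using a few standard functions (the data values are arbitrary):

```r
scores = c(10, 12, 16, 18)

# Summary functions reduce a vector to a single value
max(scores)

## [1] 18

# Other functions return one result for each element
sqrt(c(4, 9, 16))

## [1] 2 3 4

nchar(c("apple", "banana", "carrot"))

## [1] 5 6 6
```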

3.4.2 Missing values

All types of vectors allow for missing data, through the special NA value.

c(29, NA, 14)

## [1] 29 NA 14

Generally, NA values will stay NA when you try to calculate with the vector.

c(29, NA, 14) * 2

## [1] 58 NA 28

toupper(c("a", NA, "c"))

## [1] "A" NA  "C"

# Missing values on either side of the sum will produce
#   missing values in the result
c(1, NA, 3) + c(4, 5, NA)

## [1]  5 NA NA

The is.na() function can test which values are missing:

is.na(c(1, NA, 3))

## [1] FALSE  TRUE FALSE

Functions like sum() and mean() will produce a missing result by default if any values in the input are missing. Use the na.rm = TRUE option (short for “NA remove”) to ignore the missing values and just use the values that are available:

mean(c(1, 3, NA, 7, 9))

## [1] NA

mean(c(1, 3, NA, 7, 9), na.rm = TRUE)

## [1] 5

Other functions in R will automatically remove missing values, but will usually warn you when they do. It’s always good to check how missing values are being treated, whatever tool you’re using.
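
One simple way to check is to count the missing values before you start. The sketch below combines is.na() with sum(), and also shows why you need is.na() rather than == to find missing values. The vector x is an invented example.

```r
x = c(1, NA, 3, NA)

# Comparing with == doesn't work: the result is itself missing
x == NA

## [1] NA NA NA NA

# is.na() gives TRUE/FALSE, and sum() counts the TRUEs
sum(is.na(x))

## [1] 2

# Add up only the values that are available
sum(x, na.rm = TRUE)

## [1] 4
```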

3.4.3 Indexing: accessing parts of vectors

To access parts of a vector, use square brackets [] after the vector and use integers to specify which parts you want to extract. E.g. to extract the second element:

c("a", "b", "c", "d")[2]

## [1] "b"

This is known as indexing. Here, 2 is the index.

To extract multiple elements, you can use a vector as the index. R returns a new vector, containing the elements that match up to the index.

c("a", "b", "c", "d")[c(2, 3, 4)]

## [1] "b" "c" "d"

# This is a handy shortcut that does the same thing
c("a", "b", "c", "d")[2:4]

## [1] "b" "c" "d"

The index can include any number between 1 and the length of the vector, in any order. You can also access the same element multiple times:

c("a", "b", "c", "d")[c(4, 1, 2, 3, 1, 2)]

## [1] "d" "a" "b" "c" "a" "b"
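
Indexing works the same way on a vector stored in a variable, which is how you'll usually use it. This sketch (with a made-up vector called fruit) also shows what happens if the index goes past the end of the vector:

```r
fruit = c("apple", "banana", "carrot", "date")

# Use length() to get the last element, whatever the length is
fruit[length(fruit)]

## [1] "date"

# Asking for a position beyond the end gives a missing value, not an error
fruit[5]

## [1] NA
```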

3.4.4 Logical indexing: filtering your data based on conditions

One of the most powerful tools in R is the ability to access the subset of your data that meets a condition. If you use a logical vector as the index for your vector, R returns a new vector containing just the elements where the index was TRUE:

c(20, 10, 30)[c(TRUE, FALSE, TRUE)]

## [1] 20 30

This means you can filter the vector using the results of a logical test:

x = c(20, 10, 30)
# Same results as above
x[x >= 20]

## [1] 20 30

This works because the result of x >= 20 is a logical vector c(TRUE, FALSE, TRUE), which works just like it did above. If you’re having trouble understanding a complex R expression, you can often pull out the individual parts and test them separately to see how they work.

This kind of logical subsetting is particularly useful once you start testing based on other vectors:

group = c("Control", "Treatment", "Treatment", "Control")
score = c(6, 5, 7, 4)
score[group == "Treatment"]

## [1] 5 7
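
Building on the same group and score vectors, here is a sketch of a few things you might do with the filtered results: summarise one group, combine two conditions with &, and count how many elements pass a test.

```r
group = c("Control", "Treatment", "Treatment", "Control")
score = c(6, 5, 7, 4)

# Mean score in the control group only
mean(score[group == "Control"])

## [1] 5

# Combine two conditions with &
score[(group == "Treatment") & (score > 5)]

## [1] 7

# Count how many scores are above 4 (each TRUE counts as 1)
sum(score > 4)

## [1] 3
```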

3.4.5 Changing vectors

To change part of a vector, you can index the vector on the left-hand side of the =/<- symbol and put the replacement on the right-hand side. You just need to make sure the replacement is either:

A single value, or
The same length as the part you’re replacing.

x = c(11, 15, 12, 13, 18)
# Replace a single element
x[4] = 44
x

## [1] 11 15 12 44 18

# Replace multiple elements
x[2:4] = c(6, 7, 8)
x

## [1] 11  6  7  8 18

# Replace based on a test
x = c(11, 15, 12, 13, 18)
x[x > 14] = 0
x

## [1] 11  0 12 13  0
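
The same pattern works with any logical test, including is.na(). As a sketch, this replaces missing values with 0. Whether 0 is a sensible replacement depends entirely on your data, so take it as an illustration of the syntax only:

```r
ages = c(29, NA, 14)

# Replace every missing value with a single value
ages[is.na(ages)] = 0
ages

## [1] 29  0 14
```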

¹ Actually, most things in R are vectors, even when they look like single values. Even the single elements shown above, like 3, are just vectors with a length of 1.
² In other programming languages, you might have to manually apply the operation to each element of the sequence, using something like a for loop.