Section 4 The Basics of R
Once you understand the basics of R’s data types, some of the more advanced features of R start to make sense.
R is built around a few basic pieces - once you understand them, it’s easier to understand more complex commands, since everything is built from the same basic foundations.
In programming terms, we can refer to the basic pieces that make up R as data types.
4.1 Basic data types
4.1.1 Numbers
The numeric data type allows you to work with numbers. R can do all the basic operations you’d expect: addition, subtraction, multiplication and division.
At the most basic level, you can use R as a calculator by doing
standard operations like +
, -
, /
(division), *
(multiplication),
^
(power) on numeric data:
> #addition
> 1 + 1
## [1] 2
> #multiplication
> 2.5 * 3
## [1] 7.5
> #to the power of
> 8^2
## [1] 64
> #exponentiating
> exp(0.36)
## [1] 1.433329
> #log transformation
> log10(6.66)
## [1] 0.8234742
R also has an integer (whole number) data type. Integers (usually) work exactly the same as numeric data, so you don’t need to worry too much about the difference for now. Integers will automatically be converted to the more general numeric format when needed:
# You can specify that data should be integers using "L"
## Addition
+ 3L 2L
## [1] 5
# if we add a decimal place, R will automatically converts the result to numeric
+ 0.1 3L
## [1] 3.1
4.1.2 Characters (text)
The character data type allows you to store and manipulate
text. Character data is created by wrapping text in either single '
or
double "
quotes. In programming terms, we also refer to each chunk of text
as a string:
"apple"
## [1] "apple"
#we can change it to uppercase
toupper("crab apple")
## [1] "CRAB APPLE"
# Get part of a string.
substr("apple", 1, 3) #telling it to get the first 3 letters
## [1] "app"
# Stick multiple strings together with paste0
paste0("crab", "apple")
## [1] "crabapple"
4.1.3 Factors (categorical data)
Factors are how R represents categorical data. They have a fixed number of levels, that are set up when you first create a factor vector:
= sample(c("Moderate", "Severe"), 10, replace=TRUE)
severity # Setting 'levels' also sets the order of the levels
= factor(severity, levels = c("Moderate", "Severe"))
sev_factor sev_factor
## [1] Moderate Moderate Moderate Moderate Moderate Moderate Severe Severe Moderate Severe
## Levels: Moderate Severe
4.1.4 Logical (True/False)
The logical data type is used to represent the True/False result
of a logical test or comparison. These are represented by the
special values of TRUE
and FALSE
(basically 1 and 0, with special labels
attached to them). To do logical comparisons, you can use syntax like:
==
: equals. Note that you need a double equal sign to compare values, a single equal sign does something different.
> "a" == "b"
## [1] FALSE
<
,>
: less than, greater than
3 < 4
## [1] TRUE
<=
,>=
: less than or equal to, greater than or equal to
10 >= 9
## [1] TRUE
!=
: not equal to
"goodbye" != "spss"
## [1] TRUE
4.2 Converting between types
Occasionally your data will be read in from a file as the wrong type. You might be able to fix this by changing the way you read in the file, but otherwise you should convert the data to the type that makes the most sense (you might have to clean up some invalid values first).
You will want to check first what variable type your variables have been read in as. We can use the “class” function and select individual variables from the dataset
class(data$sex)
## [1] "character"
class(data$neuroticism)
## [1] "numeric"
… or if we want to know it for the whole dataset, we can use the structure function.
str(data)
## tibble [1,421 x 5] (S3: tbl_df/tbl/data.frame)
## $ neuroticism : num [1:1421] 16 8 5 8 9 6 8 12 15 NA ...
## ..- attr(*, "format.spss")= chr "F2.0"
## $ extraversion: num [1:1421] 13 14 16 20 19 15 10 11 16 7 ...
## ..- attr(*, "format.spss")= chr "F2.0"
## $ sex : chr [1:1421] "female" "male" "male" "female" ...
## ..- attr(*, "format.spss")= chr "A6"
## ..- attr(*, "display_width")= int 6
## $ volunteer : chr [1:1421] "no" "no" "no" "no" ...
## ..- attr(*, "format.spss")= chr "A3"
## ..- attr(*, "display_width")= int 18
## $ treatment : num [1:1421] 1 2 2 1 2 2 2 1 1 1 ...
## ..- attr(*, "format.spss")= chr "F8.0"
We can also see this for the first variables in the dataframe by clicking the arrow next to our dataframe in the environment.
If you have data in a format that needs to be changed, for example here, we would like our treatment variable to be a factor, not numeric, we can use functions like as.character()
, as.numeric()
and as.logical()
. These functions will
convert data to the relevant type. Probably the most common type conversion
you’ll have to do is when numeric
data gets treated as text and is stored
as character
. Numeric operations like addition won’t work until you fix
this:
"1" + 1
## Error in "1" + 1: non-numeric argument to binary operator
= as.numeric("1")
one_fixed + 1 one_fixed
## [1] 2
In our dataset we will change our numeric ‘treatment’ variable to a factor, so it can be used as such in subsequent analyses. We will also convert our character vectors to factor.
$treatment <- as.factor(data$treatment)
data$sex <- as.factor(data$sex)
data$volunteer <- as.factor(data$volunteer) data
4.3 Variables: Storing Results
The results of calculations in R can be stored in variables: you give a name to the results, and then when you want to look at, use or change those results later, you access them using the same name.
You assign a value to a variable using either =
or <-
(these
are mostly equivalent, don’t worry too much about the difference), putting
the variable name on the left hand side and the value on the right.
NOTE THAT EVERYTHING ON THE RIGHT HAND SIDE IS BEING ASSIGNED TO THE NAME YOU GIVE ON THE LEFT
$personality_total <- data$neuroticism + data$extraversion
data# Accessing saved results
#data$personality_total
This variable will have also now saved into your dataframe:
# Using saved results in another calculation
$severe_personality = data$personality_total >= 20
data#data$severe_personality
# Changing a variable: this will overwrite the old value with the
# new one, the old value won't be available unless you've
# stored it somewhere else, or named the new variable something different
$personality_total <- data$personality_total + 2 data
Now our personality total variable is not the same as our original one, it has been overwritten with our new command.
When you assign a variable, you’re asking R to remember some data so you can use it later. Understanding that simple principle will take you a long way in R programming.
Variable names in R should start with a letter (a-zA-Z
), and
can contain letters, numbers, underscores _
and periods .
, so
model3
, get.scores
, ANX_total
are all valid variable names.
4.3.1 Missing values
Some of you smart cookies may have noticed that our ‘personality_total’ variable has some NAs…
Functions like sum()
and mean()
will produce a missing
result by default if any values in the input are missing. Use the
na.rm = TRUE
option (short for “NA
remove”) to ignore the missing values
and just use the values that are available:
mean(data$neuroticism)
## [1] NA
mean(data$neuroticism, na.rm = TRUE)
## [1] 11.46888
If we wanted to re-make our personality_total variable so that individuals with values for only one of the two personality variables are assigned that value as their total (we probably wouldn’t want to do this, but let’s say we do!):
$personality_total <- rowSums(cbind(data$neuroticism, data$extraversion), na.rm = TRUE) data
Other functions in R will automatically remove missing values, but will usually warn you when they do. It’s always good to check how missing values are being treated, whatever tool you’re using.