Chapter 3 Week 3
3.1 Vectors, lists, data frames and data types
In this section we will introduce you to various types of data you can store and create in R. In applied research and psychology in particular, you will often find different types of information in your data and these could be both numerical and text. We will work through a few examples below to give you an overview of various types of objects that store data and will talk about how you can access and work with the information within those.
As you saw during your lecture last week there are few key types of variables we can enounter when it comes to storing information.
There is a way to specify this in R:
When working with continuous/numeric data you will be creating numeric variables.
When working with categorical/nominal/ordinal data you will be creating factor variables.
When working with variables which store information such as TRUE or FALSE you will be using logical variables.
3.2 Get the package first
For today, the key package we will need is tidyverse
. Let’s install it first.
install.packages('tidyverse')
Next, we will need to call it from the library:
# Load from the library
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.1
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
3.3 Numeric Data
Let’s imagine that someone wanted to provide summary data for the average temperature each month in Edinburgh. Imagine that they started updating their notes monthly. They started in June and by now they have the following records:
The average temperature was 18 degrees in June, 22 in July, 19 in August, and 16 in September.
Task: I want to create a variable, called monthly.temp
that stores this data. The first number should be 18, the second 22, and so on. We want to use the combine function c()
to help us do this. To create our vector, we should write:
monthly.temp <- c(18, 22, 19, 16)
monthly.temp
## [1] 18 22 19 16
To summarise, we have created a single variable called monthly.temp
, and this variable is a vector with four elements.
So, now that we have our vector, how do we get information out of it? What if I wanted to know the average temperature for August, for example? Since we started in June, August is the third month, so let’s try:
monthly.temp[3]
## [1] 19
Turns out that the numbers I put for August were wrong, and it was actually warmer this August (not 19 but 21!). How can I fix this in my monthly.temp
variable? I could make the whole vector again, but that’s a lot of typing and it’s wasteful given that I only need to change one value.
We can just tell R to change that one specific value:
monthly.temp[3] <- 21
monthly.temp
## [1] 18 22 21 16
You can also ask R
to return multiple values at once by indexing. For example, say I wanted to know the temperature between July (the second element) and September (the fourth element). The first way to ask for an element is to simply provide the numeric position of the desired element in the structure (vector, list…) in a set of square brackets [ ]
at the end of the object name. I would ask R
:
monthly.temp[2:4]
## [1] 22 21 16
# This is equivalent to:
monthly.temp[c(2, 3, 4)]
## [1] 22 21 16
Notice that the order matters here. If I asked for it in the reverse order, then R
would output the data in the reverse too.
3.4 Text/Character Data
Although you will mostly be dealing with numeric data, this isn’t always the case. Sometimes, you’ll use text. Let’s create a simple variable to see how its done:
greeting <- "hello"
greeting
## [1] "hello"
It is important to note the use of quotation marks here. This is because R
recognises this as a “character”. A character can be a single letter, 'g'
, but it can equally well be a sentence including punctuation, "Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading."
Back to our temperature records example, I might want to create a variable that includes the names of the months. To do so, I could tell R
:
# Create months
months <- c("June", "July", "August", "September")
In simple terms, you have now created a character vector containing four elements, each of which is the name of a month. Let’s say I wanted to know what the fourth month was. What would I type?
# As before, access the fourth element of your vector
months[4]
## [1] "September"
3.5 Logical Data
A logical element can take one of two values: TRUE
or FALSE
. Logicals are usually the output of logical operations (anything that can be phrased as a yes/no question e.g. Is x equal to y?). In formal logic, TRUE
is represented as 1 and FALSE
as 0. This is also the case in R
.
If we ask R
to calculate 2 + 2, it will always give the same answer.
2+2
## [1] 4
If we want R
to judge whether something is a TRUE
statement, we have to explicitly ask. For example:
2+2 == 4
## [1] TRUE
By using the equality operator ==
, R
is being forced to make a TRUE
or FALSE
judgement.
2+2 == 3
## [1] FALSE
What if we try to force R
to believe some fake news (aka incorrect truths)?
2+2 = 3
## Error in 2 + 2 = 3: target of assignment expands to non-language object
R
cannot be convinced that easily. It understands that the 2+2
is not a variable (“non-language object”), and it won’t let you change what 2+2
is. In other words, it won’t let you change the ‘definition’ of the value of 2
.
There are several other logical operators that you can use, some of which are detailed in the table below.
Operation | R code | Example input | Example output |
---|---|---|---|
Less than | < |
1 < 2 | TRUE |
Greater than | > |
1 > 2 | FALSE |
Less than or equal to | <= |
1 <= 2 | TRUE |
Greater than or equal to | >= |
1 >= 2 | FALSE |
Equal to | == |
1 == 2 | FALSE |
Not equal to | != |
1 != 2 | TRUE |
Not | ! |
!(1==1) | FALSE |
Or | | |
(1==1) | (1==2) |
And | & |
(1==1) & (1==2) | FALSE |
Let’s apply some of these logical operators to our vectors. Let’s use our monthly.temp
vector, and ask R
whether there were any months when the temperature dropped below zero.
monthly.temp < 0
## [1] FALSE FALSE FALSE FALSE
I can then store this information in a vector:
any.temp <- monthly.temp < 0
any.temp
## [1] FALSE FALSE FALSE FALSE
To summarise, we have created a new logical vector called any.temp
, whose elements are TRUE
only if the corresponding sale is below zero.
But this output isn’t very helpful, as a big list of TRUE
and FALSE
values don’t give much insight into which months the temperature was below zero.
We can use logical indexing to ask for the names of the months where temperature was below zero. Ask R
:
months[ any.temp < 0 ]
## character(0)
3.6 Variable Classes
So far, you’ve encountered character, numeric and logical data. It is really important that you remember/know what kind of information each variable stores (and it is essential) that R
remembers, because otherwise you could run into some problems.
For example, let’s say you create the following variables:
x <- 1
y <- 2
Given that we have assigned numbers, let’s check whether they are numeric:
is.numeric(x)
## [1] TRUE
is.numeric(y)
## [1] TRUE
Great, that means that we could proceed with simple sums e.g. multiplication. However, if they contained character data, R
would provide you with an error:
x <- "blue"
y <- "yellow"
x*y
## Error in x * y: non-numeric argument to binary operator
Yes, R
is smart enough to know that you can’t multiply colours. It knows because you’ve used the quotation marks to indicate that the variable contains text. This might seem unhelpful, but it is actually quite useful, especially when working with data. For example, without quotation marks, R
would treat 10 as a number and would allow you to do sums with it. With the quotation marks, “5”, it treats it as text.
Above, we checked to specifically see whether our x
and y
variables were stored as numeric variables. But what if you can’t remember what you should be checking for? You could use the class( )
and mode( )
functions instead. The class( )
of the variable tells you the classification, and mode( )
relates to the format of the information. The former is the most useful in most cases.
x <- "hello"
class(x)
## [1] "character"
mode(x)
## [1] "character"
y <- TRUE
class(y)
## [1] "logical"
mode(y)
## [1] "logical"
z <- 10
class(z)
## [1] "numeric"
mode(z)
## [1] "numeric"
3.7 Factors
Let’s get into some more relevant examples for statistics. We have only referred to ‘numeric’ data so far but we commonly make the distinctions between nominal, ordinal, interval, and ratio numeric data.
Imagine that we had conducted a study with different treatment conditions. Within our study, all twelve participants completed the same task, but each of the three groups were given different instructions. Let’s first create a variable that tracks which group people were in:
group <- c(1,1,1,1,2,2,2,2,3,3,3,3)
Now, it wouldn’t make sense to add two to group 1, group 2, and group 3 but let’s try anyway:
group + 2
## [1] 3 3 3 3 4 4 4 4 5 5 5 5
R has now created groups 4 and 5, which don’t exist. But we allowed it to do so, as the values are currently just ordinary numbers. We need to tell R to treat “group” as a factor. We can do this using the as_factor
function.
group <- as_factor(group)
group
## [1] 1 1 1 1 2 2 2 2 3 3 3 3
## Levels: 1 2 3
This output is a little different from the first lot, but let’s check that it is now a factor:
is.factor(group)
## [1] TRUE
class(group)
## [1] "factor"
Now, let’s try to add two to the group again to see what happens.
group + 2
## Warning in Ops.factor(group, 2): '+' not meaningful for factors
## [1] NA NA NA NA NA NA NA NA NA NA NA NA
Great! Now R
knows that we’re the ones being stupid! But what if we wanted to assign meaningful labels to the different levels
of the factor? Say, for example, we had low, high, and control conditions? We can do it like this:
levels(group) <- c("low", "high", "control")
print(group)
## [1] low low low low high high high high
## [9] control control control control
## Levels: low high control
Factors are extremely helpful, and they are the main way to represent nominal scales. It is really important that you label them with meaningful names, as they can help when interpreting output.
There are lots of other ways that you can assign labels
to your levels
.
3.8 Lists
Lists arrange elements in a collection of vectors or other data structures. In other words, lists are just a collection of variables, that have no constraints on what types of variables can be included.
Emma <- list(age = 26,
siblings = TRUE,
parents = c("Mike", "Donna")
)
Here, R
has created a list variable called Emma
, which contains three different variables - age
, siblings
, and parents
. Let’s have a look at how R
stores this list:
print(Emma)
## $age
## [1] 26
##
## $siblings
## [1] TRUE
##
## $parents
## [1] "Mike" "Donna"
If you wanted to extract one element of the list, you would use the $
operator:
Emma$age
## [1] 26
You can also add new entries to the list, again using the $
. For example:
Emma$handedness <- "right"
print(Emma)
## $age
## [1] 26
##
## $siblings
## [1] TRUE
##
## $parents
## [1] "Mike" "Donna"
##
## $handedness
## [1] "right"
Nice! These are key things we want you to understand for today. Now, let’s load your Week3_Practice.Rmd file from Learn and you can try these for yourself. Try to attempt all exercises and where necessary, go back to the notes above.
Note, that you will need to create the data that we will use next week so make sure to complete all the steps. The solutions are available below but do not look at those yet.
3.9 Practice.Rmd Solutions
To do this practice, we will use tidyverse
.
Make sure that you first run the installation code for the package in your console. Remember once you have done it once, there is no need to do it ever again.
install.packages('tidyverse')
# Load packages
library('tidyverse')
3.9.1 Exercise 1
Create a list with some information about yourself or play around and store something you think can be best described using lists. Check out the example you saw in the tutorial.
# Exercise 1 (example answer)
Anastasia<- list(favourite_colour = 'blue',
married = TRUE ,
speak_languages = c("English", "Russian")
)
3.9.2 Exercise 2
Create a nominal variable called sex
with three groups: male, female, and other, with four individuals in each group. Make sure to level
and label
your variable appropriately! I provided an example for this one, try to figure out the second one by yourself.
# Exercise 2 (example answer)
# Create the variable
sex <- c(1,1,1,1,2,2,2,2,3,3,3,3)
# Transform into factor
sex <- as_factor(sex)
# Label the levels
levels(sex) <- c("male", "female", "other")
print(sex)
Create a variable called “group” that you saw in the tutorial. Remember that one to three represent low, high, and control conditions respectively.
# Create group (example answer)
group <- c(1,2,3,1,2,2,3,3,1,2,3,3)
# Tranform into factor
group <- as_factor(group)
# Label the levels
levels(group) <- c("low", "high", "control")
print(group)
3.9.3 Exercise 3
Earlier we created two variables called group
and sex
. We also have the test scores and ages of these individuals, so let’s record those as well so we can create a full data set that has everything together.
- My records for age were: 20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28
- My records were scores were: 70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67
# Exercise 3 (example answer)
age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)
We now have four variables of the same size in our environment - age
, sex
, group
, and score
. Each of them are the same size (i.e. vectors with four elements) and the first entry for age (i.e. age[1]
) corresponds to the same person for sex[1]
. All of these four variables correspond to the same data set, but R
doesn’t know this yet, we need to tell it!
Now, let’s put everything side by side:
# Now let's put everything together
mydata <- tibble(age, sex, group, score)
mydata
Note that data is now completely self-contained, and if you were to make changes to say, your original age variable stored in a vector, it will not make any changes to age stored in your data frame.
When you have large data frames, you might want to check what variables you have stored in there. To do this, you can ask R
to return the names of each of the variables using the names()
function.
# Check for the variable names
names(mydata)
# Glimpse at your data
glimpse(mydata)
This gives you a very basic overview of your data, but is a very helpful tool in displaying the breakdown of what is contained in an object.
You might want to get some specific data out of your data frame, as opposed to all four columns. You need to be specific in asking R
to return you this information. For example, let’s say you want to extract the scores.
# Select specific column
select(mydata, score) # Note how we specify data first, then the variable we want
# Same fo age
select(mydata, age)
Quick note: Remember the steps you used to create ‘mydata’ as we will use it again next week to build plots and visualisations.