Chapter 4 Week 3

4.1 Vectors, lists, data frames and data types

In this section we will introduce you to the various types of data you can store and create in R. In applied research, and in psychology in particular, you will often find different types of information in your data, and these can be both numerical and text. We will work through a few examples below to give you an overview of the various types of objects, and of how you can access the information within them.

4.2 Numeric Data

Let’s imagine that someone in the audience works for R, likes the look of this workbook, and decides to sign me up to write textbooks for them. Imagine that they then start to send me my sales data monthly. Let’s suppose I have 50 sales in June, 20 in July, 10 in August, and 70 in September, but then no other sales for the rest of the year.

Task: I want to create a variable called monthly.sales that stores this data. The first number should be 50, the second 20, and so on. We want to use the combine function c() to help us do this. To create our vector, we should write:

monthly.sales <- c(50, 20, 10, 70, 0, 0, 0)
monthly.sales
## [1] 50 20 10 70  0  0  0

To summarise, we have created a single variable called monthly.sales, and this variable is a vector with 7 elements.

So, now that we have our vector, how do we get information out of it? What if I wanted to know how many sales I made in August, for example? Since we started in June, August was the 3rd month of sales, so let’s try:

monthly.sales[3]
## [1] 10

It turns out that the numbers I received for the August sales were wrong, and I actually sold 100, not 10! How can I fix this in my monthly.sales variable? I could make the whole vector again, but that’s a lot of typing and wasteful, given that I only need to change one value.

We can just tell R to change that one specific value:

monthly.sales[3] <- 100
monthly.sales
## [1]  50  20 100  70   0   0   0

You could also use the edit() and fix() functions, but we won’t be covering these in this session. You should check them out in your own time.

You can also ask R to return multiple values at once by indexing. To ask for one element, you provide its numeric position in the structure (vector, list…) in a set of square brackets [ ] at the end of the object name; to ask for several, you provide a vector of positions. For example, say I wanted to know how many books I sold between July (2nd element) and October (5th element). I would ask R:

monthly.sales[2:5]
## [1]  20 100  70   0
# equivalent to
monthly.sales[c(2, 3, 4, 5)]
## [1]  20 100  70   0

Notice that the order matters here. If I asked for the elements in the reverse order, then R would output the data in reverse order too.

monthly.sales[5:2]
## [1]   0  70 100  20
# equivalent to
monthly.sales[c(5, 4, 3, 2)]
## [1]   0  70 100  20

Next I want to figure out how much money I’ll be making each month (given that the end of the year isn’t looking too good, I hope the next few months are!). Since I earn £5 per book, I can just multiply each element of monthly.sales by 5. Sounds pretty easy, and it is!

monthly.sales * 5
## [1] 250 100 500 350   0   0   0
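Because arithmetic on a vector is applied element by element, the result can itself be stored in a new vector and summarised. The following is only a small sketch of that idea; the names price.per.book and monthly.earnings are made up for illustration.

```r
# Sales from June to December, after the August correction
monthly.sales <- c(50, 20, 100, 70, 0, 0, 0)

# Hypothetical name for the £5 earned per book
price.per.book <- 5

# Multiplication is applied to every element of the vector
monthly.earnings <- monthly.sales * price.per.book
monthly.earnings

# Total earnings over the seven months
sum(monthly.earnings)
```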

4.3 Text/Character Data

Although you will mostly be dealing with numeric data, this isn’t always the case. Sometimes, you’ll use text. Let’s create a simple variable:

greeting <- "hello"
greeting
## [1] "hello"

It is important to note the use of quotation marks here. They tell R that this is “character” data: a string of characters, no matter how long. It can be a single letter, 'g', but it can equally well be a sentence, "Descriptive statistics can be like online dating profiles: technically accurate and yet pretty darn misleading."

Back to my R book example, I might want to create a variable that includes the names of the months. To do so, I could tell R:

months <- c("June", "July", "August", "September", "October", "November", "December")

In simple terms, you have now created a character vector containing 7 elements, each of which is the name of a month. Let’s say I wanted to know what the 5th month was. What would I type?

months[5]
## [1] "October"

4.4 Logical Data

A logical element can take one of two values, TRUE or FALSE. Logicals are usually the output of logical operations (anything that can be phrased as a yes/no question, e.g., is x equal to y?). In formal logic, TRUE is represented as 1 and FALSE as 0. This is also the case in R.
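A quick way to see that R treats TRUE as 1 and FALSE as 0 is to use logical values in arithmetic. This is a minimal sketch; the vector answers is a made-up example.

```r
# Logical values behave like 1 and 0 in arithmetic
TRUE + TRUE          # 2
as.numeric(FALSE)    # 0

# A made-up logical vector
answers <- c(TRUE, FALSE, TRUE, TRUE)

# Summing a logical vector counts how many elements are TRUE
sum(answers)         # 3
```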

If we ask R to calculate 2 + 2, it will always give the same answer:

2+2
## [1] 4

If we want R to judge whether something is a TRUE statement, we have to explicitly ask. For example:

2+2 == 4
## [1] TRUE

By using the equality operator == , R is being forced to make a TRUE or FALSE judgement.

2+2 == 3
## [1] FALSE

What if we try to force R to believe some fake news (aka incorrect truths)?

2+2 = 3
## Error in 2 + 2 = 3: target of assignment expands to non-language object

R cannot be convinced that easily. A single = is an assignment operator, so R reads this as an attempt to assign the value 3 to 2+2. Because 2+2 is not a variable (hence “non-language object”), R refuses: it won’t let you change the ‘definition’ of the value of 2+2.

There are several other logical operators that you can use, some of which are detailed in the table below.

Operation                  R code   Example input       Example output
Less than                  <        1 < 2               TRUE
Greater than               >        1 > 2               FALSE
Less than or equal to      <=       1 <= 2              TRUE
Greater than or equal to   >=       1 >= 2              FALSE
Equal to                   ==       1 == 2              FALSE
Not equal to               !=       1 != 2              TRUE
Not                        !        !(1==1)             FALSE
Or                         |        (1==1) | (1==2)     TRUE
And                        &        (1==1) & (1==2)     FALSE
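If you would like to try the operators from the table yourself, a short sketch such as the one below may help. The variables a, b and scores are made up for illustration; the comments give the result of each line.

```r
a <- 1
b <- 2

a >= b                 # FALSE: 1 is not greater than or equal to 2
(a == 1) | (b == 1)    # TRUE: at least one side is TRUE
(a == 1) & (b == 1)    # FALSE: both sides would need to be TRUE
!(a == b)              # TRUE: a == b is FALSE, and ! flips it

# The operators also work element by element on vectors
scores <- c(40, 55, 70)
in.range <- scores >= 50 & scores < 70
in.range               # FALSE TRUE FALSE
```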

Let’s apply some of these logical operators to our vectors. Let’s use our monthly.sales vector, and ask R when I actually sold a book:

monthly.sales > 0
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

I can then store this into a vector:

any.sales <- monthly.sales > 0
any.sales 
## [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

To summarise, we have created a new logical vector called any.sales, whose elements are TRUE only if the corresponding sale is > 0.

But this output isn’t very helpful, as a big list of TRUE and FALSE values doesn’t give me much insight into which months I’ve sold my book in.

We can use logical indexing to ask for the names of the months where sales are > 0. Ask R:

months[any.sales]
## [1] "June"      "July"      "August"    "September"

You can apply the same logic to find the actual sales numbers for these months too:

monthly.sales[monthly.sales > 0]
## [1]  50  20 100  70

You could also do the same thing with text. It turns out that the one store that sold my R book didn’t always have books in stock. Let’s create a variable called stock.levels to have a look at this:

stock.levels <- c("high", "high", "low", "high", "low", "out", "out")
stock.levels
## [1] "high" "high" "low"  "high" "low"  "out"  "out"

Now, apply the same logical indexing trick, but with the character vector instead, to see when the book was not in stock.

months[stock.levels == "out"]
## [1] "November" "December"

That explains the lack of sales anyway! But what if I wanted to know when the shop either had low or no copies? You could ask R one of two things:

months[stock.levels == "out" | stock.levels == "low"]
## [1] "August"   "October"  "November" "December"
#Alternatively
months[stock.levels != "high"]
## [1] "August"   "October"  "November" "December"
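A third option, not used elsewhere in this chapter, is base R's %in% operator, which checks whether each element appears in a set of values. This is offered only as a sketch of an alternative; it should return the same four months as the two commands above.

```r
months <- c("June", "July", "August", "September",
            "October", "November", "December")
stock.levels <- c("high", "high", "low", "high", "low", "out", "out")

# TRUE wherever the stock level is either "low" or "out"
low.or.out <- stock.levels %in% c("low", "out")
low.or.out

# Use the logical vector to pick out the matching months
months[low.or.out]
```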

4.4.1 Exercise

Try to create a few variables of your own, and ask R to return specific elements that you are interested in.

4.5 Variable Classes

So far, you’ve encountered character, numeric and logical data. It is really important that you know what kind of information each variable stores (and it is essential that R knows too), because otherwise you could run into some problems.

For example, let’s say you create the following variables:

x <- 1
y <- 2

Given that we have assigned numbers, let’s check and see that they are numeric:

is.numeric(x)
## [1] TRUE
is.numeric(y)
## [1] TRUE

Great, that means that we could proceed with simple arithmetic, e.g. multiplication. However, if they contained character data, R would give you an error:

x <- "blue"
y <- "yellow"
x*y
## Error in x * y: non-numeric argument to binary operator

Yes, R is smart enough to know that you can’t multiply colours. It knows because you’ve used the quotation marks to indicate that the variable contains text. This might seem unhelpful, but it is actually quite useful, especially when working with data. For example, without quotation marks, R would treat 10 as the number ten, and would allow you to do sums with it. With the quotation marks, “10”, it knows that it is text.
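If a number has ended up stored as text, base R's as.numeric() function can convert it back. The sketch below uses two made-up variable names, ten.as.text and ten.as.number, to show the difference.

```r
# The quotation marks make this text, not a number
ten.as.text <- "10"
is.numeric(ten.as.text)      # FALSE
is.character(ten.as.text)    # TRUE

# Convert the text to a number so that arithmetic works
ten.as.number <- as.numeric(ten.as.text)
is.numeric(ten.as.number)    # TRUE
ten.as.number * 2            # 20
```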

Above, we checked to specifically see whether our x and y variables were stored as numeric variables. But what if you can’t remember what you should be checking for? You could use the class( ) and mode( ) functions instead. The class( ) of the variable tells you the classification, and mode( ) relates to the format of the information. The former is the most useful in most cases.

x <- "hello"
class(x)
## [1] "character"
mode(x)
## [1] "character"
y <- TRUE
class(y)
## [1] "logical"
mode(y)
## [1] "logical"
z <- 10
class(z)
## [1] "numeric"
mode(z)
## [1] "numeric"
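In the examples above class( ) and mode( ) give the same answer, but they can differ. One case, shown here only as an illustration, is a whole number stored as an integer (written in R with an L suffix); the variable name whole is made up.

```r
# The L suffix asks R to store the value as an integer
whole <- 5L

class(whole)   # "integer": the classification of the variable
mode(whole)    # "numeric": the format of the information
```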

4.6 Factors

Let’s get into some more relevant examples for statistics. Although we have only referred to ‘numeric’ data so far, we commonly make the distinctions between nominal, ordinal, interval, and ratio. The numeric variable in R is generally fine for ratio scale data, interval, and ordinal, but what about nominal?

Imagine that we had conducted a study with different treatment conditions. Within our study, all twelve participants completed the same task, but each of the three groups was given different instructions. Let’s first create a variable that tracks which group people were in:

group <- c(1,1,1,1,2,2,2,2,3,3,3,3)

Now, it wouldn’t make sense to add two to group 1, group 2, and group 3, since they are distinct groups, but let’s try anyway:

group + 2
##  [1] 3 3 3 3 4 4 4 4 5 5 5 5

R has now created groups 4 and 5, which don’t exist. But we allowed it to do so, as the values are currently just ordinary numbers. We need to tell R to treat group as a factor. We can do this using the as.factor function.

group <- as.factor(group)
group
##  [1] 1 1 1 1 2 2 2 2 3 3 3 3
## Levels: 1 2 3

This output is a little different from the first lot, but let’s check that it is now a factor:

is.factor(group)
## [1] TRUE
class(group)
## [1] "factor"

Now, let’s try to add 2 to the group again to see what happens.

group + 2
## Warning in Ops.factor(group, 2): '+' not meaningful for factors
##  [1] NA NA NA NA NA NA NA NA NA NA NA NA

Great! Now R knows that we’re the ones being stupid! But what if we wanted to assign meaningful labels to the different levels of the factor? Say, for example, we had low, high, and control conditions. We can do it like this:

levels(group) <- c("low", "high", "control")
print(group)
##  [1] low     low     low     low     high    high    high    high   
##  [9] control control control control
## Levels: low high control

Factors are extremely helpful, and they are the main way to represent nominal scales. It is really important that you label them with meaningful names, as they can help when interpreting output.

There are lots of other ways that you can assign labels to your levels.
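One such alternative is to create the factor and label its levels in a single step with the factor() function and its levels and labels arguments. The sketch below rebuilds the same group variable this way; group.codes is a made-up name for the original numeric codes.

```r
# The original numeric group codes
group.codes <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3)

# Convert to a factor and attach labels in one call:
# code 1 becomes "low", 2 becomes "high", 3 becomes "control"
group <- factor(group.codes,
                levels = c(1, 2, 3),
                labels = c("low", "high", "control"))
group

# Check the labels and count the participants in each group
levels(group)
table(group)
```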

4.6.1 Exercise

Create a nominal variable called sex with 2 groups - male and female with 5 and 7 individuals in each group respectively. Make sure to level and label your variable appropriately!

sex <- c(1,1,1,1,1,2,2,2,2,2,2,2)
sex <- as.factor(sex)
levels(sex) <- c("male", "female")
print(sex)
##  [1] male   male   male   male   male   female female female female female
## [11] female female
## Levels: male female

4.7 Lists

Lists arrange elements in a collection of vectors or other data structures. In other words, a list is just a collection of variables, with no constraints on what types of variables can be included.

Emma <- list(age = 26,
             siblings = TRUE,
             parents = c("Mike", "Donna")
             )

Here, R has created a list variable called Emma, which contains three different variables: age, siblings, and parents. Let’s have a look at how R stores this list:

print(Emma)
## $age
## [1] 26
## 
## $siblings
## [1] TRUE
## 
## $parents
## [1] "Mike"  "Donna"

If you wanted to extract one element of the list, you would use the $ operator:

Emma$age
## [1] 26

You can also add new entries to the list, again using the $. For example:

Emma$handedness <- "right"
print(Emma)
## $age
## [1] 26
## 
## $siblings
## [1] TRUE
## 
## $parents
## [1] "Mike"  "Donna"
## 
## $handedness
## [1] "right"
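The $ operator is not the only way into a list. As a brief sketch, double square brackets [[ ]] extract an entry by name or position, and ordinary vector indexing can then be applied to whatever comes out.

```r
Emma <- list(age = 26,
             siblings = TRUE,
             parents = c("Mike", "Donna"))

# Double square brackets extract one entry, like $
Emma[["age"]]       # 26, same as Emma$age
Emma[[1]]           # 26, the first entry by position

# The extracted entry is an ordinary vector, so it can be indexed
Emma$parents[2]     # "Donna"

# Names of the entries and how many there are
names(Emma)
length(Emma)        # 3
```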

4.8 Exercise

Create a list with some information about yourself, or play around and store something you think can be described best using lists.

4.9 Matrices

A matrix has 2 dimensions, rows and columns. The first number/vector in the []s represents rows and the second columns. Leaving either position blank will return all rows/columns:

mat <- matrix(data = 1:12,
       nrow = 4,
       ncol = 3)
mat
##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12
# blank spaces technically not needed but improve code readability
mat[1, ] # first row
## [1] 1 5 9
mat[ , 1] # first column
## [1] 1 2 3 4
mat[c(2, 4), ] # rows 2 and 4, notice the c()
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12
mat[c(2, 4), 1:3] # rows 2 and 4 of columns 1-3
##      [,1] [,2] [,3]
## [1,]    2    6   10
## [2,]    4    8   12
colnames(mat) <- c("One","Two","Three")
rownames(mat) <- c("Un","Deux","Trois", "Quatre")

mat
##        One Two Three
## Un       1   5     9
## Deux     2   6    10
## Trois    3   7    11
## Quatre   4   8    12

 

To get the full matrix, we simply type its name. However, you can think of the same operation as asking for all rows and all columns of the matrix:

mat[ , ] # all rows, all columns
##        One Two Three
## Un       1   5     9
## Deux     2   6    10
## Trois    3   7    11
## Quatre   4   8    12
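Once a matrix has row and column names, those names can be used inside the square brackets in place of numeric positions. The following sketch rebuilds the same matrix so that it runs on its own.

```r
mat <- matrix(data = 1:12, nrow = 4, ncol = 3)
colnames(mat) <- c("One", "Two", "Three")
rownames(mat) <- c("Un", "Deux", "Trois", "Quatre")

# Number of rows and columns
dim(mat)                  # 4 3

# Index by name: row "Deux", column "Three"
mat["Deux", "Three"]      # 10

# A whole column by name
second.column <- mat[ , "Two"]
second.column
```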

4.10 Data Frames

In this section, we are going to achieve two things. Firstly, you are going to create a data frame, and secondly, you are going to load in some data.

Previously, we created two variables called group and sex. We also have the test scores and ages of these individuals, so let’s record those:

age <- c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28)
score <- c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)

We now have 4 variables in our environment: age, sex, group, and score. Each of them is the same size (i.e. a vector with 12 elements), and the first entry for age (i.e. age[1]) corresponds to the same person as the first entry for sex (i.e. sex[1]). All 4 variables belong to the same data set, but R doesn’t know this yet; we need to tell it!

To do this, we need to create a data frame.

mydata <- data.frame(age, sex, group, score)
mydata
##    age    sex   group score
## 1   20   male     low    70
## 2   22   male     low    89
## 3   49   male     low    56
## 4   41   male     low    60
## 5   35   male    high    68
## 6   47 female    high    62
## 7   18 female    high    93
## 8   33 female    high    63
## 9   21 female control    71
## 10  24 female control    65
## 11  22 female control    54
## 12  28 female control    67

Note that mydata is now completely self-contained: if you were to make changes to, say, your original age variable stored in a vector, it would not change the age column stored in your data frame.
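You can check this behaviour for yourself with a small experiment. The sketch below uses a shortened, three-person version of the data and a made-up data frame name, small.data, so that it does not interfere with mydata.

```r
# Shortened versions of the vectors, for illustration only
age <- c(20, 22, 49)
score <- c(70, 89, 56)
small.data <- data.frame(age, score)

# Change the original vector after the data frame was created
age[1] <- 99

age               # the vector has changed: 99 22 49
small.data$age    # the data frame column has not: 20 22 49
```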

When you have large data frames, you might want to check what variables you have stored in there. To do this, you can ask R to return the names of each of the variables using the names() function.

names(mydata)
## [1] "age"   "sex"   "group" "score"

You can also compactly display the internal structure of an R object using the str() function:

str(mydata)
## 'data.frame':    12 obs. of  4 variables:
##  $ age  : num  20 22 49 41 35 47 18 33 21 24 ...
##  $ sex  : Factor w/ 2 levels "male","female": 1 1 1 1 1 2 2 2 2 2 ...
##  $ group: Factor w/ 3 levels "low","high","control": 1 1 1 1 2 2 2 2 3 3 ...
##  $ score: num  70 89 56 60 68 62 93 63 71 65 ...

This gives you a very basic overview of your data, but is a very helpful tool in displaying the breakdown of what is contained in an object.

You might want to get some specific data out of your data frame, as opposed to the full 4 columns. You need to be specific in asking R to return this information. The simplest way is to make use of the $ operator to extract the desired information. For example, let’s say you want to extract the scores.

mydata$score
##  [1] 70 89 56 60 68 62 93 63 71 65 54 67
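The logical indexing used earlier for vectors also works with data frame columns, and with whole rows of a data frame. The sketch below rebuilds mydata from the values printed above so that it runs on its own; the rep() calls are just a shorthand for typing out the repeated labels.

```r
mydata <- data.frame(
  age   = c(20, 22, 49, 41, 35, 47, 18, 33, 21, 24, 22, 28),
  sex   = factor(rep(c("male", "female"), times = c(5, 7)),
                 levels = c("male", "female")),
  group = factor(rep(c("low", "high", "control"), each = 4),
                 levels = c("low", "high", "control")),
  score = c(70, 89, 56, 60, 68, 62, 93, 63, 71, 65, 54, 67)
)

# Scores of the female participants only
female.scores <- mydata$score[mydata$sex == "female"]
female.scores

# All columns for participants who scored above 80
# (rows go before the comma, columns after it, as with a matrix)
high.scorers <- mydata[mydata$score > 80, ]
high.scorers
```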

4.11 Loading Data (Advanced)

For most analyses that you conduct in R, the first step involves importing a data set into R. There are lots of different ways to load data into R, and many different types of data it can read too. We will talk about these in more depth in Week 5, so if it is getting overwhelming, take a deep breath and note that we will return to practising reading data in soon.

Datasets that we will use for this section can be downloaded here. It is a zip folder. Save it somewhere where you can easily find it.

We will be using the read.csv() function to do this, as our data is stored in CSV (comma-separated values) files. One thing to be mindful of here is the path to the file.

If you have saved the file within your current working directory, you can simply write:

books <- read.csv("books.csv", header = TRUE)

Note that the books dataset has now appeared in the Global Environment.
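If you want to practise the reading step without the downloaded files, one option is to write a tiny CSV file yourself and read it back. This is only a sketch: the file name example_books.csv, the object names and the three rows of values are stand-ins, and the file is written to R's temporary folder rather than your working directory.

```r
# A tiny stand-in data set with the same column names as books
example.data <- data.frame(comic = c(-44, 20, 0),
                           statbook = c(16, -14, 6))

# Build a path inside R's temporary folder and write the CSV there
csv.path <- file.path(tempdir(), "example_books.csv")
write.csv(example.data, csv.path, row.names = FALSE)

# Where is the working directory, and does the file exist?
getwd()
file.exists(csv.path)

# Read the file back in, giving the full path to it
example.books <- read.csv(csv.path, header = TRUE)
str(example.books)
```

Because csv.path holds the full path, read.csv() can find the file even though it is not in the working directory.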

There are other ways that you can do this too. For example, you can use the read.table() function, and for SPSS files you can use read.spss() from the foreign package or read_sav() from the haven package. Note that to use these SPSS import functions, you will need to load the relevant package first (more on this later).

Now that we have our data read into R, let’s have a look at it. We might first want to see a breakdown of the data frame. We can do this by using the str() function.

str(books)
## 'data.frame':    500 obs. of  2 variables:
##  $ comic   : int  -44 20 0 -18 -19 13 16 14 -6 11 ...
##  $ statbook: int  16 -14 6 -13 7 -33 3 -7 -6 -3 ...

We can see that we have 2 variables, one called comic and one called statbook. Both are numeric, and there are 500 observations in each. We could also extract the specific information in single commands:

ncol(books) #number of columns
## [1] 2
nrow(books) #number of rows
## [1] 500
colnames(books) #column names
## [1] "comic"    "statbook"

If we wanted to have a quick glance at the data, we could use the head() or tail() functions. If you really wanted to see all of your data, you could use the print() function.

head(books) #first 6 rows
##   comic statbook
## 1   -44       16
## 2    20      -14
## 3     0        6
## 4   -18      -13
## 5   -19        7
## 6    13      -33
tail(books) #last 6 rows
##     comic statbook
## 495    11        3
## 496     9       13
## 497    10      -10
## 498    11      -22
## 499    10      -10
## 500    26       10

4.12 Practical Example

(Example is partially adapted from A. Field, “Discovering statistics using R”, Sage, chapter 10, p. 400)

The example contains data on what contributed to pain relief for patients. It compares the effects of administering a sugar pill (placebo condition, dose code = 1), a low dose of a drug, for instance ibuprofen (dose code = 2), or a high dose of the same drug (dose code = 3).

We surveyed 15 participants and have two main variables: Condition, coded 1 (placebo), 2 (low dose of ibuprofen) or 3 (high dose of ibuprofen); and Pain level (effect), measured on a scale of 1-10.

Now it’s over to you: read in the ‘dose.csv’ file, check the type of data you have, and level and label where appropriate. You should also explore the data using the commands that we used above.

#Read in data
exp <- read.csv("dose.csv", header = TRUE)
exp
##    ID dose effect
## 1   1    1      3
## 2   2    1      2
## 3   3    1      1
## 4   4    1      1
## 5   5    1      4
## 6   6    2      5
## 7   7    2      2
## 8   8    2      4
## 9   9    2      2
## 10 10    2      3
## 11 11    3      7
## 12 12    3      4
## 13 13    3      5
## 14 14    3      3
## 15 15    3      6
str(exp)
## 'data.frame':    15 obs. of  3 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dose  : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ effect: int  3 2 1 1 4 5 2 4 2 3 ...
exp$dose <- factor(exp$dose, levels=c(1,2,3), labels = c("Placebo", "Low_dose", "High_dose"))
is.factor(exp$dose)
## [1] TRUE
head(exp)
##   ID     dose effect
## 1  1  Placebo      3
## 2  2  Placebo      2
## 3  3  Placebo      1
## 4  4  Placebo      1
## 5  5  Placebo      4
## 6  6 Low_dose      5
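Once dose is a labelled factor, it can be used to summarise the data by condition. The sketch below is one possible next step, not part of the original exercise: it retypes the 15 rows printed above into a data frame with the made-up name dose.data, so that it runs without the CSV file, and uses the base R functions table() and tapply().

```r
# Recreate the data shown above (values copied from the printed output)
dose.data <- data.frame(
  ID     = 1:15,
  dose   = rep(1:3, each = 5),
  effect = c(3, 2, 1, 1, 4, 5, 2, 4, 2, 3, 7, 4, 5, 3, 6)
)

# Level and label the dose variable, as in the exercise
dose.data$dose <- factor(dose.data$dose,
                         levels = c(1, 2, 3),
                         labels = c("Placebo", "Low_dose", "High_dose"))

# How many participants are in each condition?
table(dose.data$dose)

# Mean effect score within each condition
mean.effect <- tapply(dose.data$effect, dose.data$dose, mean)
mean.effect
```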