Chapter 7 Week 7

7.1 Reading in Data, Merge and More R Practice

This week we will focus on reading various datasets and merging them together where necessary. This can be useful if you collected data at different times or might be merging datasets from different studies for your analysis. We will also work through a few code examples that can help you to navigate around the data and filter observations, using filter and arrange from tidyverse.

You can find more examples in the books here:

Chapter 5 on Data Transformation - R for Data Science by Garrett Grolemund and Hadley Wickham

Chapter on Descriptive Statistics and Data Manipulation - Modern R with the Tidyverse by Bruno Rodrigues

Reading in and merging data of different formats can be tricky if you don’t have R to help you. We will walk you through a fairly simple example today to build your intuition. When you get to Years 3 and 4 you will get to work with your own dissertation data - we hope you’ll find the notes here useful then.

First things first, let’s load tidyverse.

library(tidyverse)

This week we will work with datasets that have information on students’ grades for different programmes. We have got three separate datasets which we want to put together for the analysis. The first one has the grades of 15 students. The second dataset provides the programmes for the same students. We then have a separate dataset that has information on another 15 students and their respective programmes and grades.

On top of that, each dataset comes in a different format. We have .csv, .txt, .sav (SPSS) and also .dta (Stata) formats. Confusing, right?

Our task for today will be to find an efficient way to create a single dataset that has information for these 30 students that you will then work with in your practice.

7.2 Dataset 1 (.csv)

We have student IDs and grades. In terms of observations, we have 15 in total.

  • ID (ID1, ID2, ID3… ID15)
  • grades (1-100)

The dataset comes in a format familiar to us, .csv, so we already know how to read it in. Please note that this week we also introduce the tidyverse read_csv() function, which can be useful when it comes to joining the data; we show it just below.

data_students_1 <- read.csv('data_students_1.csv')
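
If you would rather use the tidyverse reader just mentioned, a minimal sketch (assuming 'data_students_1.csv' sits in your working directory) looks like this. read_csv() comes from readr, which is loaded with the tidyverse, and returns a tibble rather than a base data frame.

# Tidyverse alternative (a sketch) - read_csv() returns a tibble and prints the column types it guessed
data_students_1_tbl <- read_csv('data_students_1.csv')
head(data_students_1_tbl)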

7.3 Dataset 2 (.txt)

We have student IDs and programme. In terms of observations, we have 15 in total.

  • ID (ID1, ID2, ID3… ID15)
  • programme (‘psych’, ‘lang’, and ‘phil’)

Here, we have got a .txt format. Not a problem for R:

data_students_2 <- read.table("data_students_2.txt", header = TRUE) # Note that we set header = TRUE, which reads the first line of the file as the column names.
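
If you want to stay within the tidyverse here too, readr also has a read_table() function for whitespace-separated text files; a quick sketch, assuming the same file name:

# Tidyverse alternative for whitespace-separated text (a sketch)
# read_table() treats the first line as column names by default (col_names = TRUE)
data_students_2_tbl <- read_table("data_students_2.txt")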

7.4 Dataset 3 (.sav and .dta)

We have student IDs, grades and programme. In terms of observations, we have 15 in total, but these are different students, so we will want to append them to the previous datasets later.

  • ID (ID16… ID30)
  • grades (1-100)
  • programme (‘psych’, ‘lang’, and ‘phil’)

Here we have the .dta format, which may look unfamiliar because the data was saved by different software (Stata). To deal with such files in R, we can install the foreign package and then read the file directly:

library(foreign) 
data_students_3 <- read.dta('data_students_3.dta')

You may also note that we have data_students_3.sav. This format comes from SPSS, a very popular piece of software among psychology researchers.

We can open it via foreign as well:

data_students_3 <- read.spss("data_students_3.sav", to.data.frame=TRUE) # note the to.data.frame argument - if you don't specify it, the data will be loaded as a list
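
As an alternative to foreign, the haven package reads both formats. It is installed alongside the tidyverse but not loaded by library(tidyverse), so attach it separately; a sketch:

# haven alternative (a sketch) - read_dta() for Stata files, read_sav() for SPSS files
# Both return tibbles, so no to.data.frame argument is needed
library(haven)
data_students_3_dta <- read_dta('data_students_3.dta')
data_students_3_sav <- read_sav('data_students_3.sav')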

7.5 Merging datasets together

Once the data is visible in the environment, we can start bringing the datasets together. There are a number of ways to do this. Let us start with the most intuitive one. We can merge datasets 1 and 2 using the ID column. We are lucky to have a unique identifier, which allows us to bring the datasets together so that grades and programme end up in one dataset.

Exploring data first can help:

head(data_students_1)
##   X   ID grades
## 1 1 ID_1     20
## 2 2 ID_2     35
## 3 3 ID_3     45
## 4 4 ID_4     85
## 5 5 ID_5     70
## 6 6 ID_6     72
head(data_students_2)
##     ID programme
## 1 ID_1     psych
## 2 ID_2      lang
## 3 ID_3      phil
## 4 ID_4     psych
## 5 ID_5      lang
## 6 ID_6      phil

We can now merge the data by ID using full_join().

# Full join
students_grades_prog <- full_join(data_students_1, data_students_2, by = c('ID')) # We can specify the unique variable we use to match the datasets via the 'by =' argument.
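
full_join() keeps every row from both datasets, filling in NA where there is no match. Since every ID here appears in both files, the other dplyr joins would give the same result, but it is worth knowing they exist; a quick sketch using the same two data frames:

# Other dplyr joins (a sketch) - identical results here because every ID matches
# inner_join() keeps only IDs found in both datasets
inner_join(data_students_1, data_students_2, by = 'ID')
# left_join() keeps all rows of the first dataset and adds matching columns from the second
left_join(data_students_1, data_students_2, by = 'ID')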

Quickly check that you got what you wanted:

head(students_grades_prog)
##   X   ID grades programme
## 1 1 ID_1     20     psych
## 2 2 ID_2     35      lang
## 3 3 ID_3     45      phil
## 4 4 ID_4     85     psych
## 5 5 ID_5     70      lang
## 6 6 ID_6     72      phil

Nice! Now we can work with this data a bit using some extra functions from tidyverse.

7.6 Sorting and arrange()

To check how the data looks when sorted we can use arrange():

# Sort in ascending order (default option)
students_grades_prog %>%
  arrange(grades) 
##     X    ID grades programme
## 1   1  ID_1     20     psych
## 2   2  ID_2     35      lang
## 3  15 ID_15     40      phil
## 4  11 ID_11     44      lang
## 5   3  ID_3     45      phil
## 6  10 ID_10     56     psych
## 7   9  ID_9     58      phil
## 8   8  ID_8     60      lang
## 9  14 ID_14     64      lang
## 10  7  ID_7     65     psych
## 11 12 ID_12     68      phil
## 12  5  ID_5     70      lang
## 13 13 ID_13     70     psych
## 14  6  ID_6     72      phil
## 15  4  ID_4     85     psych
# Sort in descending order (add 'desc')
students_grades_prog %>%
  arrange(desc(grades))
##     X    ID grades programme
## 1   4  ID_4     85     psych
## 2   6  ID_6     72      phil
## 3   5  ID_5     70      lang
## 4  13 ID_13     70     psych
## 5  12 ID_12     68      phil
## 6   7  ID_7     65     psych
## 7  14 ID_14     64      lang
## 8   8  ID_8     60      lang
## 9   9  ID_9     58      phil
## 10 10 ID_10     56     psych
## 11  3  ID_3     45      phil
## 12 11 ID_11     44      lang
## 13 15 ID_15     40      phil
## 14  2  ID_2     35      lang
## 15  1  ID_1     20     psych

We can also sort by other variables:

# Sort in ascending order by programme (just remove desc())
students_grades_prog %>%
  arrange(programme) 
##     X    ID grades programme
## 1   2  ID_2     35      lang
## 2   5  ID_5     70      lang
## 3   8  ID_8     60      lang
## 4  11 ID_11     44      lang
## 5  14 ID_14     64      lang
## 6   3  ID_3     45      phil
## 7   6  ID_6     72      phil
## 8   9  ID_9     58      phil
## 9  12 ID_12     68      phil
## 10 15 ID_15     40      phil
## 11  1  ID_1     20     psych
## 12  4  ID_4     85     psych
## 13  7  ID_7     65     psych
## 14 10 ID_10     56     psych
## 15 13 ID_13     70     psych
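
You can also sort by more than one variable at a time: arrange() accepts several columns and breaks ties from left to right. For example, to sort by programme and, within each programme, from the highest grade down:

# Sort by programme first, then by grades in descending order within each programme
students_grades_prog %>%
  arrange(programme, desc(grades))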

7.7 Data description

Now that we have a full dataset, we can do some simple analysis and study the variation in grades across the cohort and by programme.

# All grades 
students_grades_prog %>%
  summarise(mean = mean(grades),
            median = median(grades),
            sd = sd(grades))
##   mean median       sd
## 1 56.8     60 17.00084
# Group grades by programme
students_grades_prog %>%
  group_by(programme) %>%
  summarise(mean = mean(grades),
            median = median(grades),
            sd = sd(grades))
## # A tibble: 3 x 4
##   programme  mean median    sd
##   <fct>     <dbl>  <int> <dbl>
## 1 lang       54.6     60  14.6
## 2 phil       56.6     58  14.0
## 3 psych      59.2     65  24.3
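
It is often useful to report how many observations sit behind each summary statistic; n() inside summarise() counts the rows in each group. A small addition to the code above:

# Group sizes alongside the descriptives
students_grades_prog %>%
  group_by(programme) %>%
  summarise(n = n(),
            mean = mean(grades),
            sd = sd(grades))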

7.8 Using filter()

We can use the handy filter() function to compute descriptives for only certain observations in the data. Let’s look at descriptives for the ‘psych’ programme only.

# Check programme 'psych'
students_grades_prog %>%
  filter(programme == 'psych') %>%
  summarise(mean_psych = mean(grades),
            median_psych = median(grades),
            sd_psych = sd(grades))
##   mean_psych median_psych sd_psych
## 1       59.2           65 24.30432

Try with ‘lang’ and ‘phil’ too.

# Check programme 'lang'
students_grades_prog %>%
  filter(programme == 'lang') %>%
  summarise(mean_lang = mean(grades),
            median_lang = median(grades),
            sd_lang = sd(grades))
##   mean_lang median_lang  sd_lang
## 1      54.6          60 14.58767
# Check programme 'phil'
students_grades_prog %>%
  filter(programme == 'phil') %>%
  summarise(mean_phil = mean(grades),
            median_phil = median(grades),
            sd_phil = sd(grades))
##   mean_phil median_phil  sd_phil
## 1      56.6          58 13.95708

Where necessary, we can also focus on studying only specific values in our data. For instance, imagine you just wanted to study students in ‘phil’ who received grades above 50.

# Check programme 'phil', grades > 50
students_grades_prog %>%
  filter(programme == 'phil', grades > 50) %>%
  summarise(mean_phil = mean(grades),
            median_phil = median(grades),
            sd_phil = sd(grades))
##   mean_phil median_phil  sd_phil
## 1        66          68 7.211103

You will get a much higher mean now, as you have removed the lower grades.

We can also use filter() to check for the occurrence of specific values. For example, what if we focused only on very high or very low grades in ‘psych’?

# Check programme 'psych' grades above 70 or below 40
students_grades_prog %>%
  filter(programme == 'psych') %>%
  filter(grades > 70 | grades < 40) # Take a note of how we use '|' to specify OR.
##   X   ID grades programme
## 1 1 ID_1     20     psych
## 2 4 ID_4     85     psych

There are two extreme values in our data for this specification, but let’s look across all programmes as well, this time flagging grades above 80 or below 40.

# Group programmes and check for extreme values
students_grades_prog %>%
  group_by(programme) %>%
  filter(grades > 80 | grades < 40) 
## # A tibble: 3 x 4
## # Groups:   programme [2]
##       X ID    grades programme
##   <int> <fct>  <int> <fct>    
## 1     1 ID_1      20 psych    
## 2     2 ID_2      35 lang     
## 3     4 ID_4      85 psych
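
If you want to filter on several programmes at once, %in% saves you from chaining multiple == conditions with |; a quick sketch:

# Keep only the 'psych' and 'lang' programmes in one condition
students_grades_prog %>%
  filter(programme %in% c('psych', 'lang')) %>%
  arrange(desc(grades))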

7.9 Visualisations

Lastly, as usual, always visualise your data to gauge what the distribution looks like.

# Visualise
ggplot(data = students_grades_prog, aes(x = grades)) + 
  geom_histogram(bins = 15, color = 'grey', fill = 'blue') + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades') + 
  theme_minimal()

What about by programme? Try plotting with subsets. Bear in mind we don’t have that many observations.

# Example for psych only
ggplot(data = subset(students_grades_prog, programme %in% c('psych')), aes(x = grades)) +
  geom_histogram(bins = 20, color = 'grey', fill = 'blue') + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades (Psych)') + 
  theme_minimal()
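
Instead of subsetting one programme at a time, you can also ask ggplot2 to draw one panel per programme with facet_wrap(); a sketch using the merged data:

# One histogram panel per programme
ggplot(data = students_grades_prog, aes(x = grades)) + 
  geom_histogram(bins = 15, color = 'grey', fill = 'blue') + 
  facet_wrap(~ programme) + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades by Programme') + 
  theme_minimal()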

7.10 Save the file in your folder

Before finishing off, we can also write the merged dataset into our folder so it is saved for the future. Check your folder after you run the code below.

write.csv(students_grades_prog, 'student_grades_prog.csv') # Note how we first specify the object we want to save and then the name of the file including the extension '.csv'.
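
A small note: by default write.csv() also writes the row numbers as an extra unnamed column (most likely where the ‘X’ column in data_students_1 came from when that file was created). If you prefer to avoid this, either set row.names = FALSE or use the readr writer; a sketch:

# Avoid writing row numbers as an extra column (a sketch)
write.csv(students_grades_prog, 'student_grades_prog.csv', row.names = FALSE)
# Or the readr equivalent, which never writes row names
write_csv(students_grades_prog, 'student_grades_prog.csv')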

7.11 Practice.Rmd Solutions

First, make sure that all the necessary packages are loaded:

library(tidyverse)
library(foreign)

For your practice, work with the same data as in the tutorial. You will need to use filter() and arrange(), and also mutate(), to answer the questions below.

You are also required to provide some simple visualisations for your data to show what is happening in student grades by programme.

Here is the breakdown of the tasks we want you to do and the solutions:

  • Read all three datasets in (data_students_1, data_students_2, data_students_3). Since they come in different formats, make sure to check your notes from the tutorial. Note that you have data_students_3 in different formats, so you can choose which one you want to read in. After you’ve read them in, check what’s inside each dataset.
# Dataset 1
data_students_1 <- read.csv('data_students_1.csv')
head(data_students_1)
##   X   ID grades
## 1 1 ID_1     20
## 2 2 ID_2     35
## 3 3 ID_3     45
## 4 4 ID_4     85
## 5 5 ID_5     70
## 6 6 ID_6     72
# Dataset 2
data_students_2 <- read.table("data_students_2.txt", header = TRUE) 
head(data_students_2)
##     ID programme
## 1 ID_1     psych
## 2 ID_2      lang
## 3 ID_3      phil
## 4 ID_4     psych
## 5 ID_5      lang
## 6 ID_6      phil
# Dataset 3
data_students_3 <- read.dta('data_students_3.dta')
head(data_students_3)
##      ID grades programme
## 1 ID_16     40     psych
## 2 ID_17     38      lang
## 3 ID_18     50      phil
## 4 ID_19     80     psych
## 5 ID_20     69      lang
## 6 ID_21     70      phil
# Or
data_students_3 <- read.csv('data_students_3.csv')
head(data_students_3)
##   X    ID grades programme
## 1 1 ID_16     40     psych
## 2 2 ID_17     38      lang
## 3 3 ID_18     50      phil
## 4 4 ID_19     80     psych
## 5 5 ID_20     69      lang
## 6 6 ID_21     70      phil

Note that we have IDs 16-30 which means that we have got data for an extra 15 students. We can now add these to the other dataset we have. Let us first merge grades and programme for data_students_1 and data_students_2.

  • Merge datasets together. First merge data_students_1 and data_students_2, then merge the resulting data with data_students_3. Hint: you will need to use full_join().
# Dataset 1 + Dataset 2
students_grades_prog <- full_join(data_students_1, data_students_2, by = c('ID')) # We can specify the unique variable we use to match the datasets via the 'by =' argument.
# + Dataset 3
students_grades_prog_all <- full_join(students_grades_prog, data_students_3) # We do not need to specify an identifier here - by default, full_join() matches on all columns the two datasets have in common.
head(students_grades_prog_all)
##   X   ID grades programme
## 1 1 ID_1     20     psych
## 2 2 ID_2     35      lang
## 3 3 ID_3     45      phil
## 4 4 ID_4     85     psych
## 5 5 ID_5     70      lang
## 6 6 ID_6     72      phil
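
As an aside, because data_students_3 holds entirely new students (extra rows rather than extra columns), stacking the two objects with bind_rows() would get you to the same place; a sketch, assuming the objects created above:

# Row-binding alternative (a sketch) - columns are matched by name, missing ones are filled with NA
students_grades_prog_all_alt <- bind_rows(students_grades_prog, data_students_3)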
  • Work with the final dataset that has information on all students (30 observations).

  • Provide means, medians and standard deviations for grades in each programme.

students_grades_prog_all %>%
  group_by(programme) %>%
  summarise(mean = mean(grades),
            median = median(grades),
            sd = sd(grades))
## # A tibble: 3 x 4
##   programme  mean median    sd
##   <fct>     <dbl>  <dbl> <dbl>
## 1 lang       54.7   61    13.8
## 2 phil       56.9   58.5  12.0
## 3 psych      61     65.5  19.1
  • Provide a simple visualisation for each programme.
# Psych
ggplot(data = subset(students_grades_prog_all, programme %in% c('psych')), aes(x = grades)) +
  geom_histogram(bins = 20, color = 'grey', fill = 'cornsilk') + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades (Psych)') + 
  theme_minimal()

# Lang
ggplot(data = subset(students_grades_prog_all, programme %in% c('lang')), aes(x = grades)) +
  geom_histogram(bins = 20,color = 'grey', fill = 'lightblue') + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades (Lang)') + 
  theme_minimal()

# Phil
ggplot(data = subset(students_grades_prog_all, programme %in% c('phil')), aes(x = grades)) +
  geom_histogram(bins = 20, color = 'grey', fill = 'coral') + 
  labs(x = 'Grades', y = 'Frequency', title = 'Histogram of Student Grades (Phil)') + 
  theme_minimal()

Now, try to answer the following questions:

  • How many students in the dataset received grades above 70?
# Filter for grades above 70
students_grades_prog_all %>%
  filter(grades > 70)
##   X    ID grades programme
## 1 4  ID_4     85     psych
## 2 6  ID_6     72      phil
## 3 4 ID_19     80     psych

The answer is three.
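
If you would rather have R count the rows for you than read them off the screen, count() (or nrow() on the filtered result) does the job; a quick sketch:

# Count students with grades above 70 directly
students_grades_prog_all %>%
  filter(grades > 70) %>%
  count()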

  • What is the mean and the median grade for those who got more than 65?
# Mean and median for grades above 65
students_grades_prog_all %>%
  filter(grades > 65)  %>%
  summarise(mean_above_65 = mean(grades),
            median_above_65 = median(grades))
##   mean_above_65 median_above_65
## 1            72              70

The answer is 72 and 70.

  • How many students received grades between 40 and 50 in the philosophy programme?
# Phil grades between 40 and 50
students_grades_prog_all %>%
  filter(programme == 'phil') %>%
  filter(grades > 40 & grades < 50) # Note that we use '&' to specify that we want grades both less than 50 and more than 40.
##    X    ID grades programme
## 1  3  ID_3     45      phil
## 2 15 ID_30     42      phil

There are two students.

  • Considering only the philosophy programme, what were the top three grades in the cohort?
# Only phil arranged
students_grades_prog_all %>%
  filter(programme == 'phil') %>%
  arrange(desc(grades))
##     X    ID grades programme
## 1   6  ID_6     72      phil
## 2   6 ID_21     70      phil
## 3  12 ID_12     68      phil
## 4  12 ID_27     65      phil
## 5   9 ID_24     59      phil
## 6   9  ID_9     58      phil
## 7   3 ID_18     50      phil
## 8   3  ID_3     45      phil
## 9  15 ID_30     42      phil
## 10 15 ID_15     40      phil

The answer is 72, 70 and 68.
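
If your version of dplyr is 1.0.0 or newer, slice_max() pulls out the top rows directly, instead of arranging and reading off the first three; a sketch:

# Top three philosophy grades in one step (requires dplyr >= 1.0.0)
students_grades_prog_all %>%
  filter(programme == 'phil') %>%
  slice_max(grades, n = 3)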

  • Now, for language, what were the three lowest grades in the cohort?
# Only lang arranged 
students_grades_prog_all %>%
  filter(programme == 'lang')  %>%
  arrange(grades)
##     X    ID grades programme
## 1   2  ID_2     35      lang
## 2   2 ID_17     38      lang
## 3  11 ID_26     40      lang
## 4  11 ID_11     44      lang
## 5   8  ID_8     60      lang
## 6   8 ID_23     62      lang
## 7  14 ID_14     64      lang
## 8  14 ID_29     65      lang
## 9   5 ID_20     69      lang
## 10  5  ID_5     70      lang

It should be 35, 38 and 40.

Well done. It may have taken a while to build all of these code chunks, but playing with the code we show you is an essential part of the practice. Try arranging things differently and see what happens.