# Lab 7 Correlation

## 7.1 What is correlation?

When conducting empirical research, we are often interested in associations between two variables, for example, personal income and attitudes towards migrants. In this lab we will focus on how to visualize the relationship between variables and how to measure it. In quantitative research, the main variable of interest in an analysis is called the *dependent* or *response variable*, and the second is known as the *independent* or *explanatory variable*. In the example above, we can think of personal income as the independent variable and attitudes as the dependent variable.

The relationship between variables can be positive, negative or non-existent. The figure below shows these types of relationships to different extents. The association is positive when, as one of the variables increases, the second variable tends to go in the same direction (that is, it increases as well). The first plot on the left-hand side shows a strong positive relationship. As you can see, the points are closely clustered around the straight line. The next plot also shows a positive relationship, but this time it is moderate, so the points are more dispersed around the line than in the previous plot.

The plot in the middle shows two variables that are not correlated. The points do not follow any pattern and the line is flat. By contrast, the last two plots on the right-hand side show a negative relationship: when the values on the X axis increase, the values on the Y axis tend to decrease.

### 7.1.1 Data and R environment

We will continue working on the same RStudio Cloud project as in the previous session and using the 2012 Northern Ireland Life and Times Survey (NILT) data. To set up the `R` environment, please follow these steps:

- Please go to your ‘Quants lab group’ in RStudio Cloud (log in if necessary);
- Open your own copy of the ‘NILT’ project from the ‘Quants lab group’;
- Create a new Rmd file, type ‘Correlation analysis’ in the ‘Title’ section and your name in the ‘Author’ box. Leave the ‘Default Output Format’ as `HTML`.
- Save the Rmd document under the name ‘Lab7_correlation’.
- Delete all the contents of the default Rmd example (that is, everything from line 12 onwards), keeping only the first part, which contains the YAML header, and the first chunk, which contains the default chunk options.
- In the setup chunk, change `echo` from `TRUE` to `FALSE` in line 9 (this will hide the code for all chunks in your final document).
- Within the first chunk, copy and paste the following code below line 9: `knitr::opts_chunk$set(message = FALSE, warning = FALSE)`. This will hide the warnings and messages when you load the packages.

In the Rmd document, insert a new chunk and copy and paste the following code. Then, run the individual chunk by clicking on the green arrow at the top-right of the chunk.

```
# Load the packages
library(tidyverse)
# Load the data from the .rds file we created in lab 3
nilt <- readRDS("data/nilt_r_object.rds")
```

This time we will use new variables from the survey. Therefore, we need to coerce them into their appropriate type first. Insert a second chunk, copy and paste the code below. Then, run the individual chunk.

```
# Age of respondent's spouse/partner
nilt$spage <- as.numeric(nilt$spage)
# Migration
nilt <- mutate_at(nilt, vars(mil10yrs, miecono, micultur), as.numeric)
```

Also, we will create a new variable called `mig_per` by summing the respondent’s opinions in relation to migration using the following variables: `mil10yrs`, `miecono` and `micultur` (see p. 14 of the documentation to learn more about these variables). Again, insert a new chunk, copy and paste the code below, and run the individual chunk.

```
# Overall perception towards migrants
nilt <- rowwise(nilt) %>%
  # sum values
  mutate(mig_per = sum(mil10yrs, miecono, micultur, na.rm = TRUE)) %>%
  ungroup() %>%
  # assign NA to values that sum 0
  mutate(mig_per = na_if(mig_per, 0))
```
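The final `na_if()` step is needed because `sum(..., na.rm = TRUE)` returns 0 when all three answers are missing, which would otherwise look like a genuine score of zero. The sketch below illustrates the same logic on a small made-up data frame (the values are invented for illustration and are not NILT data). It uses base R's `rowSums()` as a vectorised alternative to `rowwise()`, so it is not the exact code used in the lab:

```r
# Toy data: invented values, NOT taken from the NILT survey
toy <- data.frame(
  mil10yrs = c(5, NA, NA),
  miecono  = c(7, 4, NA),
  micultur = c(6, NA, NA)
)

# Sum the three items per row, ignoring missing answers
toy$mig_per <- rowSums(toy[, c("mil10yrs", "miecono", "micultur")], na.rm = TRUE)

# A row where every item is missing sums to 0; recode those rows as NA
toy$mig_per[toy$mig_per == 0] <- NA

print(toy$mig_per)
```

In this toy example the three rows give 18, 4 and a missing value. Note that, just as in the lab code, a respondent who genuinely answered 0 to all three items would also be recoded as missing.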

### 7.1.2 Visualizing correlation

Visualizing two or more variables can help to uncover or understand the relationship between these variables. As briefly introduced in the previous session, different types of plots are appropriate for different types of variables. Therefore, we split the following sections according to the type of data to be analysed.

You do not need to run or reproduce the examples shown in the following sections in your R session with the exception of exercises that are under the *activity* headers.

#### 7.1.2.1 Numeric vs numeric

To illustrate this type of correlation, let’s start with a relatively obvious but useful example. Suppose we are interested in how people choose their spouse or partner. The first characteristic that we might look at is age. We might suspect that there is a correlation between the `nilt` respondents’ own age and their partner’s age. Since both ages are numeric variables, a scatter plot is appropriate to visualize the correlation. To do this, let’s use the functions `ggplot()` and `geom_point()`. In the aesthetics `aes()`, let’s put the respondent’s age `rage` on the X axis and the age of the respondent’s spouse/partner `spage` on the Y axis. As a general convention in quantitative research, the response/dependent variable is visualized on the Y axis and the independent variable on the X axis (you do not need to copy and reproduce the example below).

```
ggplot(nilt, aes(x = rage, y = spage)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Respondent's age vs respondent's spouse/partner age",
       x = "Respondent's age", y = "Respondent's spouse/partner age")
```

Note that in this plot the function `geom_smooth()` was used. With `method = "lm"`, it draws the straight line that best describes all the points in the graph.

From the plot above, we see that there is a strong positive correlation between the respondent’s age and their partner’s age. We see that for some individuals their partner is older, whereas for others their partner is younger. Also, there are some dots that are far away from the straight line. For example, in one case the respondent is around 60 years old and their partner is around 30 years old (can you find that dot on the plot?). These extreme values are known as *outliers*.

We may also suspect that the respondents’ sex is playing a role in this relationship. We can include this as a third variable in the plot by colouring the dots by the respondents’ sex. To do this, let’s set the `colour` argument in the aesthetics `aes()` to the categorical variable `rsex`.

```
ggplot(nilt, aes(x = rage, y = spage, colour = rsex)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, colour = "gray20") +
  labs(title = "Respondent's age vs respondent's spouse/partner age",
       x = "Respondent's age", y = "Respondent's spouse/partner age")
```

In the previous plot, we included a line which describes what it would look like if the partner’s age were exactly the same as the respondent’s age. We observe a clear pattern in which female participants are mostly on one side of the line and males on the other. As we can see, most female respondents tend to choose/have partners who are older, whereas male respondents tend to have younger ones.
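The same pattern can be summarised numerically by computing the age gap (partner’s age minus respondent’s age) and averaging it within each group. The sketch below uses a tiny made-up data frame rather than the NILT data, so the numbers are purely illustrative; only the column names `rsex`, `rage` and `spage` mirror the ones used above, and `age_gap` is a hypothetical helper column:

```r
# Toy data: invented values for illustration only, NOT NILT results
toy <- data.frame(
  rsex  = c("Male", "Male", "Male", "Female", "Female", "Female"),
  rage  = c(40, 55, 63, 38, 52, 61),
  spage = c(37, 54, 60, 41, 52, 66)
)

# Positive gap: the partner is older than the respondent
# (the point lies above the reference line with slope 1)
toy$age_gap <- toy$spage - toy$rage

# Mean age gap within each group
gap_by_sex <- tapply(toy$age_gap, toy$rsex, mean)
print(gap_by_sex)
```

A positive mean gap for a group corresponds to its points lying mostly above the reference line, and a negative one to points lying mostly below it.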

### 7.1.3 Activity 1

In the `Lab7_correlation` file, use the `nilt` data object to visualize the relationship between the following variables by creating a new chunk. Run the chunk individually and comment on what you observe from the result as text (outside the code chunks).

- Create a scatter plot to visualize the correlation between the respondent’s overall opinion in relation to migration `mig_per` and the respondent’s age `rage`. Remember that we just created the `mig_per` variable by summing three variables which were on a 0-10 scale (the higher the value, the better the person’s perception is). In `aes()`, specify `rage` on the X axis and `mig_per` on the Y axis. Use the `ggplot()` function and `geom_point()`. Also, include a straight line describing the points using the `geom_smooth()` function. Within this function, set the `method` argument to `'lm'`.
- What type of relationship do you observe? Comment as text in the Rmd on the overall result of the plot and whether this is in line with your previous expectation.

#### 7.1.3.1 Numeric vs categorical

As briefly introduced in the last lab, relationships often occur between categorical and numeric data. A good way to observe the relationship between these types of variables is a box plot, which essentially shows the distribution of the numeric values by category/group.

Let’s say we are interested in the relationship between education level and perception of migration. The variable `highqual` contains the respondent’s highest education qualification. Using `ggplot()`, we can place `mig_per` on the X axis and `highqual` on the Y axis, and plot it with the `geom_boxplot()` function. Note that before passing the dataset to `ggplot()`, we can filter out the two categories of the variable `highqual` where the education level is unknown (i.e. “Other, level unknown” or “Unclassified”).

```
nilt %>%
  filter(highqual != "Other, level unknown" & highqual != "Unclassified") %>%
  ggplot(aes(x = mig_per, y = highqual)) +
  geom_boxplot()
```

From the plot above, we see that respondents with a higher education level (at the bottom) appear to have a more positive opinion on migration than respondents with a lower education level or no qualifications (at the top). Overall, the data show a pattern in which the lower one’s education level is, the worse one’s opinion towards migration is likely to be. Since education level is an ordinal variable, we can say this is a positive relationship.

### 7.1.4 Activity 2

Using the `nilt` data object, visualize the relationship between the following variables by creating a new chunk. Run the chunk individually and comment on what you can observe from the results as text in the Rmd file to introduce the plot.

- Create a boxplot to visualize the correlation between the respondent’s overall opinion in relation to migration `mig_per` and the political party which the respondent identifies with `uninatid`. Use `ggplot()` in combination with `geom_boxplot()`. Make sure to specify `mig_per` on the Y axis and `uninatid` on the X axis in `aes()`.
- Do you think the opinion towards migration differs among the groups in the plot? Comment on the overall results in the Rmd document.

## 7.2 Measuring correlation

So far we have examined correlation by visualizing variables only. A useful practice in quantitative research is to actually measure the magnitude of the relationship between these variables. One common measure is the *Pearson* correlation coefficient. This measure results in a number that goes from -1 to 1. A coefficient below 0 implies a negative correlation, whereas a coefficient above 0 implies a positive one. When the coefficient is close to positive one (1) or negative one (-1), it implies that the relationship is strong. By contrast, coefficients close to 0 indicate a weak relationship. This technique is appropriate for measuring linear relationships between numeric variables that are approximately normally distributed, e.g. age in our dataset.
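For reference, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below uses simulated data rather than the NILT survey (the names `x`, `y`, `r_manual` and `r_builtin` are placeholders chosen for this illustration). It computes the coefficient by hand and compares it with R's built-in `cor()` function:

```r
# Simulated data for illustration; x and y are placeholder names
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in equivalent (method = "pearson" is the default)
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point precision, and because `y` was constructed to increase with `x`, the coefficient is positive.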

Let’s start by measuring the relationship between the respondent’s age and their partner’s age. To do this in R, we use the `cor()` function. In the R syntax, we first specify the variables separated by a comma. We need to be explicit by specifying the object name, the dollar sign, and the name of the variable, as shown below. We also set the `use` argument to `'pairwise.complete.obs'`. This is because one or both of the variables contain missing values, so we are telling R to use only the observations that are complete for both variables.

`cor(nilt$rage, nilt$spage, use = 'pairwise.complete.obs')`

`## [1] 0.9481297`

The correlation coefficient between these variables is 0.95. This is close to positive 1. Therefore, it is a strong positive correlation. The result is completely in line with the plot above, where we saw that the dots were close to the straight line.
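To see why the `use` argument matters, consider the minimal sketch below. It relies on a handful of made-up ages (not NILT data; the object names are hypothetical) in which two partner ages are missing. By default `cor()` returns `NA` as soon as any value is missing, whereas `'pairwise.complete.obs'` keeps only the pairs in which both values are observed:

```r
# Made-up ages for illustration; NA marks a respondent without partner data
age_resp    <- c(25, 34, 47, 52, 60, 71)
age_partner <- c(27, NA, 45, 50, NA, 68)

# Default behaviour: any missing value makes the result NA
r_default <- cor(age_resp, age_partner)

# Keep only the pairs where both values are observed
r_pairwise <- cor(age_resp, age_partner, use = "pairwise.complete.obs")

# Same result as dropping the incomplete pairs by hand
ok <- complete.cases(age_resp, age_partner)
r_by_hand <- cor(age_resp[ok], age_partner[ok])

print(c(default = r_default, pairwise = r_pairwise, by_hand = r_by_hand))
```

With only two variables, the pairwise option is equivalent to removing the incomplete rows first. The distinction becomes relevant in a correlation matrix, where each pair of variables may be based on a different set of respondents.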

What about the relationship between `rage` and `mig_per` that you plotted earlier?

`cor(nilt$rage, nilt$mig_per, use = 'pairwise.complete.obs')`

`## [1] -0.05680918`

The coefficient is very close to 0, which means that the correlation is practically non-existent. The absence of correlation is also interesting in research. For instance, one might expect that younger people would be more open to migration. However, according to these data, age does not seem to play a role in people’s opinion about migration in NI.

Let’s say that we are interested in the correlation between `mig_per` and all other numeric variables in the dataset. Instead of computing the correlations one by one, we can compute a correlation matrix. The code syntax can be read as follows: from the `nilt` data select these variables, then compute the correlation coefficients using pairwise complete observations, and then round the result to 3 decimals.

```
nilt %>%
  select(mig_per, rage, spage, rhourswk, persinc2) %>%
  cor(use = 'pairwise.complete.obs') %>%
  round(3)
```

```
##          mig_per   rage  spage rhourswk persinc2
## mig_per    1.000 -0.057 -0.132    0.082    0.228
## rage      -0.057  1.000  0.948   -0.013   -0.036
## spage     -0.132  0.948  1.000   -0.182   -0.090
## rhourswk   0.082 -0.013 -0.182    1.000    0.383
## persinc2   0.228 -0.036 -0.090    0.383    1.000
```

From the result above, we have a correlation matrix that contains the *Pearson* correlation coefficient for each pair of the selected variables. In the first row we have migration perception. You will notice that the first value is 1.00; this is because it measures the correlation of the variable with itself. The next value in the first row is the correlation with age, which is nearly 0. The remaining variables also show low coefficients, with the exception of personal income, where we see a low-to-moderate positive correlation. This can be interpreted as higher-income respondents tending to hold a more positive opinion towards migration than low-income respondents.
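When a matrix has many variables, it can help to pull out the single row of interest and order it by strength. The sketch below does this on simulated data (the data frame `toy` and its columns `a`, `b` and `c` are placeholders invented for this illustration, not NILT variables). It also relies on two general properties of a correlation matrix: the diagonal is always 1 and the matrix is symmetric:

```r
# Simulated data for illustration; column names a, b, c are placeholders
set.seed(42)
a <- rnorm(50)
toy <- data.frame(a = a, b = a + rnorm(50), c = rnorm(50))
toy$b[c(3, 10)] <- NA  # introduce some missing values

# Correlation matrix using pairwise complete observations, rounded to 3 decimals
m <- round(cor(toy, use = "pairwise.complete.obs"), 3)

# Correlations of every variable with `a`, strongest first (by absolute value)
a_row <- m["a", ]
print(a_row[order(abs(a_row), decreasing = TRUE)])
```

Ordering by absolute value puts strong negative correlations next to strong positive ones, which is usually what we want when looking for the variables most closely related to the one of interest.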

### 7.2.1 Activity 3

- Insert a new chunk in your Rmd file;
- Using the `nilt` data object, compute a correlation matrix using the following variables: `rage`, `persinc2`, `mil10yrs`, `miecono` and `micultur`, setting the `use` argument to `'pairwise.complete.obs'` and rounding the result to 3 decimals;
- Run the chunk individually and comment on whether personal income or age is correlated with the perception of migrants in relation to the specific aspects asked about in these variables (consult p. 14 of the documentation for a description of these variables);
- Knit the `Lab7_correlation` Rmd document to `.html` or `.pdf`. The output document will automatically be saved in your project.
- Discuss your previous results with your neighbour or tutor.