Chapter 13 Correlation, Causation and LM

Variables are often correlated, but such correlations are often spurious: they can be driven by the scales of your variables or by plain numerical variability rather than by a true relationship. Always have a hypothesis in mind before setting up and claiming any relationship in your data, rather than hunting through the relationships first and then suggesting one exists.

To see some fun spurious correlations, you can have a look here.

We will illustrate this with a very silly example. We have some data on ice cream sales and shark attacks, and we have found a significant, positive correlation between them. Let's get the data in and see what has happened there.

13.1 Sharks and ice cream example

library(psych)
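The chunks below assume the shark data sit in a CSV file; the filename 'sharks.csv' is our assumption, so substitute the name of your own copy:

#Read the shark attack data in (assumed filename)
sharks<-read.csv('sharks.csv')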

Let's describe the data:

describe(sharks)
##               vars  n    mean    sd  median trimmed   mad    min     max
## Year             1 84 2011.00  2.01 2011.00 2011.00  2.97 2008.0 2014.00
## Month            2 84    6.50  3.47    6.50    6.50  4.45    1.0   12.00
## SharkAttacks     3 84   34.00  8.44   35.00   34.34 10.38   14.0   50.00
## Temperature      4 84   18.73  4.03   18.38   18.85  5.02   10.7   25.36
## IceCreamSales    5 84   88.17 13.40   91.00   88.47 17.79   64.0  109.00
##               range  skew kurtosis   se
## Year           6.00  0.00    -1.29 0.22
## Month         11.00  0.00    -1.26 0.38
## SharkAttacks  36.00 -0.26    -0.92 0.92
## Temperature   14.66 -0.07    -0.89 0.44
## IceCreamSales 45.00 -0.15    -1.35 1.46

And also visualise:

library(ggplot2)
qplot(sharks$SharkAttacks, sharks$IceCreamSales, xlab='Shark Attacks', ylab='Ice Cream Sales')

We can further check for significance of the correlation:

cor.test(sharks$SharkAttacks, sharks$IceCreamSales)
## 
##  Pearson's product-moment correlation
## 
## data:  sharks$SharkAttacks and sharks$IceCreamSales
## t = 5.7247, df = 82, p-value = 1.648e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3613613 0.6717900
## sample estimates:
##       cor 
## 0.5343576

What do you think has happened here? There may, of course, be some confounding variables we forgot to control for. Those could also be mediating the effect, so we cannot observe the true effect of ice cream sales on shark attacks. The true effect we should expect is no effect at all, so something else must be at play here. Let's run a simple linear model and build up our analysis:

13.2 Simple Linear Regression in R

We are looking at the following set-up, where \(\beta_0\) represents the expected number of shark attacks when no ice cream has been sold, and \(\beta_1\) gives the marginal effect of every unit increase in ice cream sales on the number of shark attacks.

\[\begin{equation} Y_i = \beta_0 + \beta_1 * X_i + \epsilon_i \end{equation}\]

More specifically:

\[\begin{equation} \text{Shark Attacks} = \beta_0 + \beta_1 * \text{Ice Cream Sales} \end{equation}\]

#Run the model
sharks_model_1<-lm(SharkAttacks~IceCreamSales, data=sharks)
#Summarise the output
summary(sharks_model_1)
## 
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales, data = sharks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.5813  -4.6829  -0.9704   4.8391  17.7726 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.35231    5.23776   0.831    0.408    
## IceCreamSales  0.33627    0.05874   5.725 1.65e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.173 on 82 degrees of freedom
## Multiple R-squared:  0.2855, Adjusted R-squared:  0.2768 
## F-statistic: 32.77 on 1 and 82 DF,  p-value: 1.648e-07

Let's study the output carefully. What are your \(\beta_0\) and \(\beta_1\)? What do you conclude about the overall model? What about the R squared values? The F statistic?

Remember that R squared represents the ratio of the variance explained by the model to the total variation in your data, while the F statistic tests the hypothesis that the model you set up is better than the null (intercept-only) model, i.e. that adding the explanatory variables changes the ratio of explained to total variance.
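As a quick sanity check, you can reproduce the reported R squared by hand from the residual and total sums of squares:

#R squared by hand: 1 - residual SS / total SS
rss <- sum(residuals(sharks_model_1)^2)
tss <- sum((sharks$SharkAttacks - mean(sharks$SharkAttacks))^2)
1 - rss/tss   #should match the Multiple R-squared of 0.2855 above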

In fact, this model isn't that bad if you only look at its explanatory power. However, we know that it doesn't make much sense! Let us add the temperature variable into the setting.

More specifically:

\[\begin{equation} \text{Shark Attacks} = \beta_0 + \beta_1 * \text{Ice Cream Sales} + \beta_2 * \text{Temperature} \end{equation}\]

#Adding extra variable
sharks_model_2<-lm(SharkAttacks~IceCreamSales + Temperature, data=sharks)
#Summarise
summary(sharks_model_2)
## 
## Call:
## lm(formula = SharkAttacks ~ IceCreamSales + Temperature, data = sharks)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2433  -2.6065  -0.0976   2.8733  16.6965 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.56510    4.30514   0.131   0.8959    
## IceCreamSales  0.10459    0.05957   1.756   0.0829 .  
## Temperature    1.29276    0.19802   6.528  5.4e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.842 on 81 degrees of freedom
## Multiple R-squared:  0.5319, Adjusted R-squared:  0.5203 
## F-statistic: 46.01 on 2 and 81 DF,  p-value: 4.469e-14

Let's make the tables a bit easier to study:

Load the 'texreg' package, which prints nicely formatted model tables.

library(texreg)
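A side-by-side table of both models is then a single call (output omitted here):

#Compare the two models in one table
screenreg(list(sharks_model_1, sharks_model_2))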

Looking great! You can now easily compare the models and see that temperature was a confounding variable. We'll discuss model comparison in more detail a bit later.

13.3 Regression Diagnostics

Let's try running a model on a different data set. We will study what predicts the anxiety levels of students taking a statistics course. We have some idea that incomplete revision may be one of the reasons why students feel anxious.

Our model can be written as follows:

\[\begin{equation} \text{Anxiety} = \beta_0 + \beta_1 * \text{Revision} \end{equation}\]

Load the data in:

anxiety<-read.csv('anxiety.csv')

Describe and visualise the variables of interest. We introduce you to pair panels here, a handy way to check several correlations at once. You can also use the loops you learnt earlier to extract both the correlations and the p-values in one go; see the sketch after the panel below.

#Describe
describe(anxiety)
##         vars   n  mean    sd median trimmed   mad  min    max  range  skew
## Code       1 103 52.00 29.88  52.00   52.00 38.55 1.00 103.00 102.00  0.00
## Revised    2 103 19.85 18.16  15.00   16.70 11.86 0.00  98.00  98.00  1.95
## Exam       3 103 56.57 25.94  60.00   57.75 29.65 2.00 100.00  98.00 -0.36
## Anxiety    4 103 74.34 17.18  79.04   77.05 10.75 0.06  97.58  97.53 -1.95
## Gender*    5 103  1.50  0.50   2.00    1.51  0.00 1.00   2.00   1.00 -0.02
##         kurtosis   se
## Code       -1.24 2.94
## Revised     4.34 1.79
## Exam       -0.91 2.56
## Anxiety     4.73 1.69
## Gender*    -2.02 0.05
#Visualise
pairs.panels(anxiety[, 2:4])
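And here is the loop sketch promised above, pulling the correlation and p-value of each predictor against Anxiety in one go:

#Loop over predictors, testing each against Anxiety
for (v in c("Revised", "Exam")) {
  ct <- cor.test(anxiety[[v]], anxiety$Anxiety)
  cat(v, ": r =", round(ct$estimate, 2), ", p =", signif(ct$p.value, 2), "\n")
}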

Let's fit the linear model:

#Run the model
model_anx<-lm(Anxiety ~ Revised, data=anxiety)
#Summarise
summary(model_anx)
## 
## Call:
## lm(formula = Anxiety ~ Revised, data = anxiety)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.325  -4.269   1.304   6.415  36.488 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 87.66755    1.78184   49.20   <2e-16 ***
## Revised     -0.67108    0.06637  -10.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.17 on 101 degrees of freedom
## Multiple R-squared:  0.503,  Adjusted R-squared:  0.4981 
## F-statistic: 102.2 on 1 and 101 DF,  p-value: < 2.2e-16

Let's also plot the relationship and the fitted line:

#Plot
plot(anxiety$Revised, anxiety$Anxiety, main='Relationship between revision and anxiety', ylab='Anxiety score', xlab='% of all topics revised', col='red', pch=10)
#Add the line
abline(model_anx, col='red', lwd=2)

Let's go through the assumptions checklist. At first glance, do you notice anything? Most of the students in the sample have anxiety scores of roughly 60-100, and the very high anxiety levels are associated with less than 50 percent of topics being revised. Can we say that the effect of extra revision on anxiety would be the same for those at the lower end of the revision scale as for those at the higher end? We can see that the effect might flatten out gradually as revision tends towards 100 percent.

Would we trust the results we found? Let's do some diagnostics!

Data Appropriateness

  1. The predicted variable (also called the outcome, the dependent variable or Y) should be continuous (measured at interval or ratio level)
  2. The predicted variable should be unbounded, or at least cover a wide range of values
  3. The predictor variable (also called the independent variable or X) should be continuous or binary (i.e., categorical with two levels, which is sometimes also called dichotomous).

Independence

Each outcome observation should be independent of the others. This would be violated if there were some pattern of time dependency, for instance, or if subjects were clustered into larger units (such as members of the same family, or children studying in the same classroom). Independence can only be checked at the design level, so let's assume that each anxiety and revision score belongs to a different individual, and nobody has been counted twice or at different times.

Normality of Error

The errors should be normally distributed (while the original data can have any kind of distribution).

We can check normality of the error by inspecting the residual plots generated by the model. We can look at them all at once, or pick just one:

#Arrange the plotting window as a 2x2 grid, i.e. 4 plots
par(mfrow=c(2,2))
#Plot results of the model
plot(model_anx)

To revert to a single plot per window, specify:

par(mfrow=c(1,1))

To get to the Q-Q plot, for instance, you then just use:
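#Q-Q plot of the residuals only (plot number 2)
plot(model_anx, which=2)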

The plot compares how many points we have in each quantile (top 5%, top 10%, top 25% …) for our residuals and for an “ideal” normal distribution. They better be the same…

So are the two distributions of quantiles similar? In other words, is the data aligned along the diagonal? Mostly, yes. So the assumption of normality of error holds from visual inspection.

Homogeneity of the error term (homoskedasticity)

The errors of prediction (the distance between each observed point and the value predicted by your linear model) should have constant variance across the entire data range. This means that the errors should not be, for instance, smaller when the y value is small and larger when the y value is large.

You can explore this one from the plots above: look for the Scale-Location plot, or go straight for:

plot(model_anx, which=3)

We can also perform the test to check this:

#ncvTest() comes from the 'car' package
library(car)
ncvTest(model_anx)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.8274758, Df = 1, p = 0.363

In this case our p value is larger than 0.05, so the test gives no formal evidence that the constant-variance assumption is violated. The visual inspection below, however, still hints at some unevenness; don't worry about it yet, we will deal with this in a second.

In short, what we want to see visually is this: if I align my fitted values (predicted anxiety) along the x axis, how do the corresponding residuals (errors) look on the y axis? Residuals should be a mixture of everything: large, medium and small. Here the homoscedasticity of the error does not seem to hold on visual inspection; the points are sprinkled around but tend to have a greater spread on the left, indicating non-constant variation across the predicted anxiety values. There is more variability at lower levels of predicted anxiety.

Measurement

You can check the measurement levels and distributions of your variables using describe() and histograms via hist().
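For example:

#Quick look at the outcome's distribution
hist(anxiety$Anxiety, main='Distribution of anxiety scores', xlab='Anxiety score')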

Outliers

We can check for high-influence points using Cook's distance.

# Visual inspection using Cook's distance
plot(model_anx, which=4)

# Visual inspection using Residuals vs Leverage plot
plot(model_anx, which=5)

# High Cook distance threshold (value above threshold may need further attention)
(D <- 4/nrow(anxiety))
## [1] 0.03883495

We have tested for data points with a high influence on the model using a Cook's distance plot. As a rule of thumb, we treat points with D > 4/n as problematic, so here points 78, 90 and 101 have a high influence on the model (D > 4/103 ≈ 0.039). Point 40 in particular may be problematic (D = 0.04).
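Rather than reading the flagged points off the plot, you can extract them directly; a minimal sketch using the threshold D computed above:

#Cook's distances for every observation, flagging those above the threshold
cooks <- cooks.distance(model_anx)
which(cooks > D)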

A very useful key for reading the Residuals vs Leverage plot is presented below:

Linearity

The relation between X and Y should be linear, that is, the data points should align on an approximately straight line. If the shape is different from a straight line, you might try to transform one or both variables, for example using a log transformation.

These can be examined visually and by looking at residual plots:

# Check for linearity of relation between DV and IV
plot(model_anx, which=1)

# Scatterplot of IV vs DV
plot(anxiety$Revised, anxiety$Anxiety, main='Relationship between revision and anxiety', ylab='Anxiety score', xlab='% of all topics revised', col='red', pch=10)

# Pearson residuals plot
residualPlots(model_anx)

##            Test stat Pr(>|Test stat|)  
## Revised      -1.7293          0.08684 .
## Tukey test   -1.7293          0.08375 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

13.3.1 Violations of the assumptions: available treatments

One of the assumptions that is hardest not to violate is homoscedasticity. Luckily, we can apply simple standard error corrections in R.

The packages below handle robust standard error estimation: 'lmtest' provides the coefficient tests, and 'sandwich' calculates the new variance-covariance matrix that accounts for heteroscedasticity. Make sure 'texreg' is also loaded for the tables:

library(lmtest)
library(texreg)
library(sandwich)

We can use the function coeftest() to see whether the standard errors, and hence the inference, change once we consider heteroscedasticity.

#Passing the default vcov simply reproduces the uncorrected standard errors
coeftest(model_anx, vcov)
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 87.667549   1.781844  49.200 < 2.2e-16 ***
## Revised     -0.671080   0.066371 -10.111 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let's present our new results, this time using the heteroscedasticity-consistent estimator vcovHC():

screenreg(coeftest(model_anx, vcov = vcovHC(model_anx)))
## 
## ======================
##              Model 1  
## ----------------------
## (Intercept)  87.67 ***
##              (2.25)   
## Revised      -0.67 ***
##              (0.10)   
## ======================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Note that the coefficients remain the same; it is the standard errors that change once we add the correction. You can see that the standard errors for the intercept and for Revised have both increased. We can further put our results together with our main model and overwrite the old standard errors in the full output.

#Save your corrected errors
corrected_errors <- coeftest(model_anx, vcov = vcovHC(model_anx)) 

screenreg(model_anx, 
          override.se = corrected_errors[, 2], 
          override.pval = corrected_errors[, 4])
## 
## =======================
##              Model 1   
## -----------------------
## (Intercept)   87.67 ***
##               (2.25)   
## Revised       -0.67 ***
##               (0.10)   
## -----------------------
## R^2            0.50    
## Adj. R^2       0.50    
## Num. obs.    103       
## RMSE          12.17    
## =======================
## *** p < 0.001, ** p < 0.01, * p < 0.05
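If you also want heteroscedasticity-consistent confidence intervals, 'lmtest' provides coefci(), which accepts the same vcov. argument:

#Robust 95% confidence intervals
coefci(model_anx, vcov. = vcovHC(model_anx))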

Other violations may relate to your error term; you will find that standardising your variables or applying a logarithmic transformation often solves the issue (a minimal sketch follows below). Sometimes the issue comes from model specification instead; we present a brief example of that in the next section.
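A minimal sketch of the logarithmic route on the anxiety data (the +1 shift is needed because Revised contains zeros):

#Log-transform the skewed predictor and refit
anxiety$log_revised <- log(anxiety$Revised + 1)
model_anx_log <- lm(Anxiety ~ log_revised, data=anxiety)
summary(model_anx_log)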

13.4 Interaction (simple slope) and multiple explanatory factors

Let's try a different example: this time we will incorporate both a continuous and a categorical variable as independent variables. We will also consider the case where the categorical variable moderates the effect of the continuous one.

Tomorrow we will extend this further to mixed effects models. But for now, let's load a data set with some salary indicators and look at how department and years in service may explain the variation in salary.

#Load the data
salary<-read.csv('salary.csv')
#Describe
summary(salary)
##     service          salary           dept     
##  Min.   :2.608   Min.   :22.05   Min.   :0.00  
##  1st Qu.:4.126   1st Qu.:30.08   1st Qu.:0.00  
##  Median :4.882   Median :37.79   Median :0.00  
##  Mean   :4.869   Mean   :39.78   Mean   :0.48  
##  3rd Qu.:5.651   3rd Qu.:49.61   3rd Qu.:1.00  
##  Max.   :6.929   Max.   :71.05   Max.   :1.00

We'll also centre the service variable around its mean here and make sure that the department variable is coded as a factor:

## centre
salary$service_m <- salary$service - mean(salary$service)

## dept as factor
salary$dept <- factor(salary$dept, labels = c("Store Manager", "Accounts"))
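A quick check that the centring and recoding behaved as intended:

#Centred variable should have mean (numerically) zero
mean(salary$service_m)
#Department should now have two labelled levels
levels(salary$dept)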

Let's run a very simple model first:

#Simple model with no interaction
model_salary_1 <- lm(salary ~ service_m+dept, data = salary)
summary(model_salary_1)
## 
## Call:
## lm(formula = salary ~ service_m + dept, data = salary)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9315 -2.2789 -0.5309  2.5806 12.0499 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    30.339      0.952  31.868  < 2e-16 ***
## service_m       4.368      0.667   6.549 3.95e-08 ***
## deptAccounts   19.665      1.377  14.280  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.845 on 47 degrees of freedom
## Multiple R-squared:  0.8499, Adjusted R-squared:  0.8435 
## F-statistic:   133 on 2 and 47 DF,  p-value: < 2.2e-16
#Interaction
model_salary_2 <- lm(salary ~ service_m*dept, data = salary)
summary(model_salary_2)
## 
## Call:
## lm(formula = salary ~ service_m * dept, data = salary)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3320  -2.7217  -0.2861   2.8132   9.9405 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             30.1911     0.9073  33.276  < 2e-16 ***
## service_m                2.7290     0.9227   2.958  0.00488 ** 
## deptAccounts            19.6694     1.3095  15.021  < 2e-16 ***
## service_m:deptAccounts   3.1071     1.2704   2.446  0.01834 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.607 on 46 degrees of freedom
## Multiple R-squared:  0.8672, Adjusted R-squared:  0.8585 
## F-statistic: 100.1 on 3 and 46 DF,  p-value: < 2.2e-16

We can note a slight improvement in explained variance, and the interaction is significant, identifying a moderation effect. We can also plot this to see what the slopes look like. We can use the package 'interactions', which has a dedicated function for interaction plots.

#Load 'interactions' (install it first if needed)
library(interactions)
interact_plot(model_salary_2, pred = service_m, modx = dept, plot.points = TRUE,
               x.label = "Years Service", y.label = "Salary (£)",
               legend.main = "Department")