14 Simple Regression
14.1 Background Information
Prior to diving into the data, let’s revisit some high school math! You likely recall the equation of a straight line:

$$y = mx + b$$

Where $m$ is the slope and $b$ is the y-intercept. Hopefully, you can see that the line crosses the y-axis at 3. Furthermore, the slope can be calculated by taking the rise over the run:

$$m = \frac{y_2 - y_1}{x_2 - x_1}$$
Determine the equation for the following line:
You may be thinking, “STOP TYLER”, but this is relevant. This equation maps nicely onto our more general linear model that we use for our analyses:

$$y_i = b_0 + b_1x_i + e_i$$

Where $y_i$ is person $i$’s score on the outcome, $b_0$ is the intercept (the predicted value of $y$ when $x = 0$), $b_1$ is the slope, $x_i$ is person $i$’s score on the predictor, and $e_i$ is person $i$’s error (residual).
Imagine that instead of x and y, we had an independent and a dependent variable. Say we wanted to predict someone’s depression score (y; measured on a scale of 1-14) by knowing the number of cognitive distortions they have on average each day (x). It might look like the following:
We can try to find a straight line that fits all those points, but it won’t fit them perfectly. For example, maybe we can guess that the y-intercept is around 2.5, and the slope is about 0.5. This would result in:

$$\hat{y}_i = 2.5 + 0.5(x_i)$$
That’s not too bad of a guess…but it seems to have a lot of error.
Here, error is represented by the dotted lines. That is, we guessed that:

$$y_i = 2.5 + 0.5(x_i)$$
But that would mean that the points fall directly on the line. Thus, the distance between each point and the line is the error. Let’s consider person 1 (circled below).
This person had, on average, 17 cognitive distortions per day and had a depression score of 12. Our line would not predict a depression score of 12; it would predict:

$$\hat{y}_1 = 2.5 + 0.5(17) = 11$$
The difference between 12 and 11 is the error for individual 1. We typically call this a residual. The residual for individual 1 is 1.
For person 1:

$$e_1 = y_1 - \hat{y}_1 = 12 - 11 = 1$$
Hmmm. I wonder if we could find a better-fitting line that would minimize the errors across all the observations? That is, if we calculated the error for every person, as we did for person 1, the total of the squared errors would be 115.75. We can get this number lower.
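To make this idea concrete, here is a minimal R sketch of that calculation. The distortions and depression vectors are made-up values for illustration (they are not the data plotted above): we compute each person’s predicted score from the guessed line, take the residuals, and sum their squares.

# Hypothetical data for illustration (not the values from the plot above)
distortions <- c(17, 5, 10, 8, 14, 3)    # average daily cognitive distortions (x)
depression  <- c(12, 5, 8, 6, 10, 4)     # depression scores on the 1-14 scale (y)

# Our guessed line: intercept = 2.5, slope = 0.5
predicted <- 2.5 + 0.5 * distortions
errors    <- depression - predicted      # residual for each person
sum(errors^2)                            # total of the squared errors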
14.1.1 Ordinary Least Squares
Ordinary Least Squares (OLS) is an algebraic way to get the best possible solution for our regression line. It minimizes the sum of the squared errors of the line. Typically in psychology we write a simple regression as the following:

$$y_i = b_0 + b_1x_i + e_i$$

Where $y_i$ is the outcome for person $i$, $x_i$ is the predictor for person $i$, $b_0$ is the intercept, $b_1$ is the slope, and $e_i$ is the residual (error) for person $i$.
The following are the solutions to OLS simple regression. The slope is:

$$b_1 = \frac{\text{cov}(x, y)}{\text{var}(x)}$$

Where $\text{cov}(x, y)$ is the covariance between $x$ and $y$, and $\text{var}(x)$ is the variance of $x$. And the intercept solution is:

$$b_0 = \bar{y} - b_1\bar{x}$$

Where $\bar{y}$ and $\bar{x}$ are the means of $y$ and $x$.
You did it! Plugging our data into these formulas gives the intercept and slope of our best possible fitting line.
If we were to calculate the sum of the squared residuals for each person, we would get 66.1. This is much lower than the 115.75 from the line built on our best guess.
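In R, these closed-form solutions can be computed directly from the sample covariance, variance, and means. A small sketch, reusing the hypothetical distortions and depression vectors from above:

# OLS solutions: slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
b1 <- cov(distortions, depression) / var(distortions)
b0 <- mean(depression) - b1 * mean(distortions)
c(intercept = b0, slope = b1)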
Typically, when using regression, researchers talk about trying to predict some variable using another variable. The predictor is often the independent variable (x), and the outcome/criterion variable is the dependent variable (y). When we have only one predictor, we refer to this as a simple regression. Note: this in no way implies that x causes y. Consider prediction much like a relationship or association, as discussed in the last chapter on correlations.
14.2 Our Data
You are a psychologist investigating the impact of technology use at night on sleep quality. Specifically, you believe that the amount of screen time within two hours of bedtime will negatively impact the total time in REM sleep. You recruit students and ask them to measure their screen time before bed (IV) and to access their Apple Watch data to assess the amount of time in REM sleep during the night (DV).
14.2.1 Power
You review the literature and believe that the link between screen time and sleep is negative. Specifically, your best guess at the population parameter is a moderate-to-large negative effect, corresponding to Cohen’s $f^2 \approx .3333$ (for example, $R^2 = .25$ gives $f^2 = \frac{R^2}{1 - R^2} = \frac{.25}{.75} \approx .3333$). We can use pwr.f2.test() from the pwr package to find the sample size needed for 80% power:
library(pwr)

pwr.f2.test(f2=.3333,
power=.8,
u=1)
Multiple regression power calculation
u = 1
v = 23.6195
f2 = 0.3333
sig.level = 0.05
power = 0.8
So, the results suggest that we need $v \approx 23.62$ error degrees of freedom. Because the error degrees of freedom in regression are $n - u - 1$, we need $n = 23.62 + 1 + 1 \approx 26$ participants.
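If you want R to do the rounding for you, the object returned by pwr.f2.test() stores v, and the required sample size for a regression is n = v + u + 1. A small sketch:

library(pwr)

# Solve for v (error df) given the effect size, power, and one predictor
pwr_result <- pwr.f2.test(u = 1, f2 = .3333, power = .8)
ceiling(pwr_result$v + 1 + 1)   # n = v + u + 1, rounded up; gives 26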
Screen Time (minutes) | REM Sleep (minutes) |
---|---|
64 | 125 |
79 | 115 |
50 | 112 |
83 | 95 |
48 | 117 |
45 | 107 |
63 | 92 |
14 | 112 |
57 | 126 |
92 | 52 |
62 | 86 |
16 | 125 |
34 | 120 |
68 | 116 |
76 | 124 |
100 | 90 |
41 | 119 |
76 | 81 |
105 | 85 |
58 | 116 |
33 | 89 |
81 | 41 |
65 | 99 |
44 | 121 |
58 | 95 |
95 | 58 |
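Before analyzing anything in R, we need the data in a data frame. Here is one way you might enter the values from the table above; the name sr_dat and the variable names ScreenTime and REM match those used in the code later in this chapter.

# Enter the 26 observations from the table above
sr_dat <- data.frame(
  ScreenTime = c(64, 79, 50, 83, 48, 45, 63, 14, 57, 92, 62, 16, 34,
                 68, 76, 100, 41, 76, 105, 58, 33, 81, 65, 44, 58, 95),
  REM        = c(125, 115, 112, 95, 117, 107, 92, 112, 126, 52, 86, 125, 120,
                 116, 124, 90, 119, 81, 85, 116, 89, 41, 99, 121, 95, 58)
)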
14.3 Our Hypothesis
Regression will focus on the slope, $b_1$, because it captures whether Screen Time predicts REM Sleep. Our hypotheses are:

$$H_0: \beta_1 = 0$$

$$H_1: \beta_1 \neq 0$$
14.4 Our Model
Recall that we started by relating our regression to the equation of a line, $y = mx + b$. And we are hypothesizing that the outcome is the function of some variables, so we can now say:

$$REM_i = b_0 + b_1(ScreenTime_i) + e_i$$

Where $REM_i$ is person $i$’s minutes of REM sleep, $b_0$ is the intercept, $b_1$ is the slope for Screen Time, and $e_i$ is person $i$’s residual (error).
14.5 Our Analysis
We can use the formulas above to solve the regression equation. We will need the mean of the IV (Screen Time), mean of the DV (REM Sleep), their covariance, and the variances. These are as follows:
Mean_Screen | Mean_REM | Var_Screen | Var_REM | Cov |
---|---|---|---|---|
61.80769 | 100.6923 | 572.4015 | 548.8615 | -321.3015 |
Thus:

$$b_1 = \frac{\text{cov}(x, y)}{\text{var}(x)} = \frac{-321.3015}{572.4015} = -0.5613$$

We interpret this as: for every unit change in Screen Time (which was in minutes), we would predict a 0.5613 unit decrease in REM sleep. Thus, for every additional minute of screen time, we would predict 0.5613 fewer minutes of REM sleep.

We must also solve for the intercept, $b_0$:

$$b_0 = \bar{y} - b_1\bar{x} = 100.6923 - (-0.5613)(61.80769) = 135.39$$

The interpretation of this is: we would predict someone with NO screen time before bed ($x = 0$) to get 135.39 minutes of REM sleep.

We have our final equation!

$$\widehat{REM}_i = 135.39 - 0.5613(ScreenTime_i)$$
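If you would like to verify these hand calculations, here is a quick sketch in R (it assumes the sr_dat data frame entered earlier):

# Slope and intercept from the OLS formulas
b1 <- cov(sr_dat$ScreenTime, sr_dat$REM) / var(sr_dat$ScreenTime)
b0 <- mean(sr_dat$REM) - b1 * mean(sr_dat$ScreenTime)
round(c(b0 = b0, b1 = b1), 4)   # approximately 135.39 and -0.5613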
14.5.1 Effect Size
Effect size for simple regression is $R^2$, the proportion of variance in the DV explained by the model. With a single predictor, $R^2$ is simply the squared correlation between $x$ and $y$:

$$R^2 = r^2 = \left(\frac{\text{cov}(x, y)}{\sqrt{\text{var}(x)\,\text{var}(y)}}\right)^2 = \left(\frac{-321.3015}{\sqrt{572.4015 \times 548.8615}}\right)^2 = .329$$
Therefore, the model explains 32.9% of the variance in REM Sleep.
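A quick check of the effect size in R (again assuming sr_dat): in simple regression, $R^2$ is just the squared correlation between the IV and DV.

cor(sr_dat$ScreenTime, sr_dat$REM)^2   # approximately 0.329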
14.5.2 Analysis in R
Regression and ANOVAs fall under the ‘general linear model’, which indicates that an outcome (e.g., REM sleep) is modeled as a linear combination of one or more predictors. In R, we use the lm() (linear model) function to write out our regression equation.
lm(REM ~ ScreenTime, data=sr_dat)
Note that here, I have a data frame called sr_dat with two variables called ScreenTime and REM. The ~ symbol is the same as ‘equals’ or ‘is predicted by’. So, we have: REM is predicted by ScreenTime. R will automatically include an intercept and the error term.
The results of lm(REM ~ ScreenTime, data=sr_dat) should be passed into the summary() function. So, first, let’s create our model!
our_model <- lm(REM ~ ScreenTime, data=sr_dat)
And then pass that into the summary function:
summary(our_model)
Call:
lm(formula = REM ~ ScreenTime, data = sr_dat)
Residuals:
Min 1Q Median 3Q Max
-48.919 -10.800 4.189 10.637 31.274
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.3863 10.8277 12.504 5.31e-12 ***
ScreenTime -0.5613 0.1638 -3.427 0.0022 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.59 on 24 degrees of freedom
Multiple R-squared: 0.3286, Adjusted R-squared: 0.3006
F-statistic: 11.75 on 1 and 24 DF, p-value: 0.002205
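If you also want confidence intervals around the coefficients, base R’s confint() function will compute them from the fitted model:

confint(our_model)   # 95% CIs for the intercept and the ScreenTime slope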
14.5.3 Another Way
The apaTables package also produces a lovely output, and it can save that output as a Word document.
library(apaTables)
apa.reg.table(our_model)
Regression results using REM as the criterion
Predictor b b_95%_CI beta beta_95%_CI sr2 sr2_95%_CI
(Intercept) 135.39** [113.04, 157.73]
ScreenTime -0.56** [-0.90, -0.22] -0.57 [-0.92, -0.23] .33 [.05, .55]
r Fit
-.57**
R2 = .329**
95% CI[.05,.55]
Note. A significant b-weight indicates the beta-weight and semi-partial correlation are also significant.
b represents unstandardized regression weights. beta indicates the standardized regression weights.
sr2 represents the semi-partial correlation squared. r represents the zero-order correlation.
Square brackets are used to enclose the lower and upper limits of a confidence interval.
* indicates p < .05. ** indicates p < .01.
While the regular lm()
function gives exact p-values, the apa.reg.table()
function gives more info such as CIs, r, sr, and effect size.
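To actually save the Word document, pass a file name to apa.reg.table(); the file name below is just an example.

apa.reg.table(our_model, filename = "rem_regression.doc")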
14.6 Our Results
We fitted a linear model to predict REM Sleep with Screen Time. The model explains a statistically significant and substantial proportion of variance ($R^2$ = .33, F(1, 24) = 11.75, p = .002, adjusted $R^2$ = .30). Within this model, the effect of Screen Time is statistically significant and negative (b = -0.56, 95% CI [-0.90, -0.22], $\beta$ = -0.57, t(24) = -3.43, p = .002), and the intercept is 135.39 (95% CI [113.04, 157.73]).
14.7 Our Assumptions
There are a few basic assumptions:
- Homoscedasticity
- The residual variance is constant across different levels/values of the IV. R can produce a plot of residuals across each fitted value of $\hat{y}$.
plot(our_model, 1)
Here, we want a relatively straight line around 0, indicating a mean residual of 0. Furthermore, we want the dots to be dispersed equally around each fitted value. It’s hard to determine with our data because there are so few points.
- Independence
- Each observation is independent; thus, each residual is independent. You must ensure this as a researcher. For example, if you had repeated measures (e.g., two observations from each person), then these would not be independent.
- Linearity
- The relationship between IV and DV is linear. We can visually assess this using a scatterplot. We hope that the points seem to follow a straight line. We can fit our line of best fit from OLS to help with this.
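Here is a sketch of one way to produce that scatterplot with the OLS line overlaid, using ggplot2 (it assumes the sr_dat data frame from earlier):

library(ggplot2)

# Scatterplot of the raw data with the OLS line of best fit
ggplot(sr_dat, aes(x = ScreenTime, y = REM)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)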
Our data appear quite linear. Some examples of non-linear plots:
(Figure: an example of a quadratic, non-linear relationship)
- Normality of residuals
- We can assess this using Q-Q plots and the Shapiro-Wilk test, which were covered in a previous chapter. Remember, the null hypothesis of the SW test is that the data are normally distributed.
qqnorm(our_model$residuals)
shapiro.test(our_model$residuals)
Shapiro-Wilk normality test
data: our_model$residuals
W = 0.96592, p-value = 0.5211
14.8 Practice Question
Generate the regression equation for the following data, which investigate the Graduate Record Exam’s (GRE) ability to predict GPA in graduate school.
Interpret the intercept and coefficient for GRE.
Write the hypotheses.
Write up the results.
Student | GRE | GPA |
---|---|---|
1 | 163 | 1.6 |
2 | 171 | 1.9 |
3 | 173 | 1.8 |
4 | 139 | 3.1 |
5 | 174 | 3.9 |
6 | 139 | 1.7 |
7 | 162 | 1.6 |
8 | 141 | 3.6 |
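If you want to check your work in R, here is a minimal sketch for entering these data and fitting the model; the data frame name dat_prac matches the one shown in the output below, while prac_model is just an illustrative name.

# Enter the practice data (GRE and GPA for the 8 students)
dat_prac <- data.frame(
  GRE = c(163, 171, 173, 139, 174, 139, 162, 141),
  GPA = c(1.6, 1.9, 1.8, 3.1, 3.9, 1.7, 1.6, 3.6)
)

# Fit the simple regression and summarize it
prac_model <- lm(GPA ~ GRE, data = dat_prac)
summary(prac_model)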
Call:
lm(formula = GPA ~ GRE, data = dat_prac)
Residuals:
Min 1Q Median 3Q Max
-0.9105 -0.7439 -0.3900 0.6201 1.6824
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.17081 3.94968 1.056 0.332
GRE -0.01123 0.02493 -0.450 0.668
Residual standard error: 1.028 on 6 degrees of freedom
Multiple R-squared: 0.03268, Adjusted R-squared: -0.1285
F-statistic: 0.2027 on 1 and 6 DF, p-value: 0.6683
Intercept: Someone with a score of 0 on the GRE would be predicted to have a GPA of 4.17 (this is impossible).
Slope: For every one-unit increase in GRE score, we would predict a 0.011 unit decrease in GPA.
We fitted a linear model to predict GPA with GRE. The model did not explain a statistically significant proportion of variance ($R^2$ = .03, F(1, 6) = 0.20, p = .668), and the effect of GRE was not statistically significant (b = -0.01, t(6) = -0.45, p = .668).