Chapter 9 Simple Linear Regression

This chapter has not yet been updated for the Spring 2021 offering of HE-802. It is not yet ready for student use in Spring 2021 and will likely be updated before it is assigned to students in the course calendar.

This week, our goals are to…

  1. Demonstrate understanding of how ordinary least squares linear regression works.

  2. Run and interpret a simple linear regression model in R.

  3. Calculate residual errors in a linear regression model.

9.1 Tips, Tricks, and Answers From Last Week

As always, reading this section is optional, but it is based on questions I received from members of the class over the last week, and it may contain some useful information.

9.1.1 Removing Missing Data

Some of you have reported trouble caused by running regressions or tests on datasets with missing values. You can do the following to remove missing values from your data:

First, make a new dataset that contains ONLY the variables you want to use from the dataset (the one with missing values):100

if (!require(dplyr)) install.packages('dplyr')
library(dplyr)

NewData <- OldData %>%
  dplyr::select(var1, var2, var3)

Now you have a new version of your dataset called NewData which only contains the variables you will use in your analysis (which in the example above are called var1, var2, and var3).

Next, you can remove all missing data in one single command. This will remove all rows of data in which even one variable has a missing value (which in R is coded as NA):

EvenNewerData <- na.omit(NewData)

And then just use the dataset EvenNewerData for the rest of your analysis (and ignore NewData and OldData)!
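To see how this works end to end, here is a small self-contained sketch with made-up data (the variable names `var1`, `var2`, `var3`, and `extra` are hypothetical). It uses base-R subsetting in place of `dplyr::select()`, which does the same thing in this case:

```r
# Hypothetical toy dataset with missing values (NA)
OldData <- data.frame(
  var1  = c(1, 2, NA, 4),
  var2  = c("a", "b", "c", "d"),
  var3  = c(10, NA, 30, 40),
  extra = c(5, 6, 7, 8)   # a variable we will not use
)

# Keep only the variables we will analyze
# (base-R equivalent of the dplyr::select() step above)
NewData <- OldData[, c("var1", "var2", "var3")]

# Drop every row that has at least one NA
EvenNewerData <- na.omit(NewData)

nrow(OldData)        # 4 rows before
nrow(EvenNewerData)  # 2 complete rows remain (rows 1 and 4)
```

Note that row 2 is dropped because of the NA in `var3` and row 3 because of the NA in `var1`, even though each of those rows is only missing one value.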

9.1.2 Problems With Plots in R

Sometimes you might get the following error message when you try to make plots in R:

Error in plot.new() : figure margins too large

There are two ways that I suggest getting around this:

  1. Run the command par(mar=c(1,1,1,1)) in the console. Then try making your plots again. If that doesn’t work, try the next option:
  2. If you are using RStudio Cloud, reset the project:
    1. Make sure you have saved all of your work and files. This procedure will preserve all of the saved code files and datasets within your cloud file system, but it will erase everything from the environment. This will not be a problem as long as you have saved your code file and your code file contains all of the code necessary to re-load your dataset.
    2. Locate your name in the top-right corner of RStudio Cloud.
    3. Locate the icon with three dots within a circle, just to the left of your name. Click on it. A dropdown menu should appear.
    4. Click on Relaunch Project. Read the message that appears in a pop-up window, and then click OK.

Optionally, you can read more about this issue by clicking here. It is not at all necessary to read this.

9.2 Simple Linear Regression

This week we are going to explore simple linear regression, which is a method for estimating the linear association between a dependent variable and a single independent variable.

Please read/watch:

These additional resources are not necessary to read, but they may be useful to skim through to reinforce your knowledge:

Sometimes, the slope of the regression line is referred to with the Greek letter beta (\(\beta\)). Just keep in mind that this means the slope of the line.101

When we do a regression and get an estimate of a slope of the relationship between Y and X, there are two possibilities:

  1. In the population from which the sample is drawn, there is no true relationship between Y and X. The slope is 0. As you have more or less of X, Y doesn’t change at all. This is called the null hypothesis.
  2. In the population from which the sample is drawn, there is a true, non-zero relationship between Y and X. This is called the alternative hypothesis.

If the p-value of your regression estimate is less than 0.05 (or 5%), then (assuming your regression meets other conditions that we will discuss later) you can conclude that Scenario #2 above is the more plausible one and that the estimate can reasonably be generalized to the population. In statistical jargon, this is called rejecting the null hypothesis, because our analysis found strong enough evidence against it. More precisely, the p-value is the probability of obtaining an estimate at least as extreme as ours if the null hypothesis were actually true; a p-value below 0.05 means that such a result would be quite unlikely under the null hypothesis.
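To make the two scenarios concrete, here is a small simulated sketch (invented data, not part of the course materials) in which we know the true slope, so we can see how the p-value tends to behave in each case:

```r
set.seed(42)
x <- rnorm(100)

# Scenario 1: the null hypothesis is true (the true slope really is 0)
y_null <- rnorm(100)
p_null <- summary(lm(y_null ~ x))$coefficients["x", "Pr(>|t|)"]

# Scenario 2: the alternative hypothesis is true (the true slope is 0.5)
y_alt <- 0.5 * x + rnorm(100)
p_alt <- summary(lm(y_alt ~ x))$coefficients["x", "Pr(>|t|)"]

p_null  # usually well above 0.05, so we fail to reject the null hypothesis
p_alt   # usually well below 0.05, so we reject the null hypothesis
```

Because the data are random, the exact p-values vary from run to run, but with 100 observations and a true slope of 0.5 the second p-value is almost always far below 0.05.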

9.3 OLS Regression in R

Now that you have read about how ordinary least squares (OLS) linear regression works, it is time to learn how to run such a regression in R. Please read the following resource as well as the notes below:

To run an OLS linear regression in R, you will use the lm() command. Here is approximately what this will look like:

reg1 <- lm(DepVar ~ IndVar, data = mydata)

Here is what we are telling the computer to do with the code above:

  1. reg1 – Create a new regression object called reg1. You can call this whatever you want. It doesn’t need to be called reg1.
  2. <- – Assign reg1 to be the output of the function lm()
  3. lm() – Run a linear model using OLS linear regression.
  4. DepVar – This is the dependent variable in the regression model. You will write the name of your own dependent variable instead of DepVar.
  5. ~ – This is part of a formula. This is like the equals sign within an equation.
  6. IndVar – This is the independent variable in the regression model. You will write the name of your own independent variable instead of IndVar.
  7. data = mydata – Use the dataset mydata for this regression. mydata is where the dependent and independent variables are located. You will replace mydata with the name of your own dataset.

Then we would run the following command to see the results of our saved regression, reg1:

summary(reg1)

In your assignment this week, you will modify the code above to run your own OLS regression.
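As a concrete, self-contained illustration (using R's built-in mtcars dataset rather than your assignment data), here is what the full sequence looks like:

```r
# Does a car's weight (wt, in 1000s of lbs) predict its fuel efficiency (mpg)?
reg_cars <- lm(mpg ~ wt, data = mtcars)

# View the full regression output
summary(reg_cars)

# Pull out just the intercept and slope
coef(reg_cars)  # the slope for wt is about -5.34: heavier cars get fewer mpg
```

Here mpg is the dependent variable, wt is the independent variable, and mtcars plays the role of mydata in the template above.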

If you would like to see a review of how to interpret the computer regression output, the following video may be useful. But it is not necessary for you to watch it:

9.4 Assignment

9.4.1 Simple Linear Regression

In this part of the assignment, we will practice running a simple OLS linear regression in R and interpreting the results.

First, run the following code to load the fitness dataset into R:

Name <- c("Person A","Person B","Person C","Person D","Person E")
WeeklyWeightliftHours <- c(3,4,4,2,6)
WeightLiftedKG <- c(20,30,21,25,40)

fitness <- data.frame(Name, WeeklyWeightliftHours, WeightLiftedKG)
fitness
##       Name WeeklyWeightliftHours WeightLiftedKG
## 1 Person A                     3             20
## 2 Person B                     4             30
## 3 Person C                     4             21
## 4 Person D                     2             25
## 5 Person E                     6             40

This dataset contains five people and the following two variables of interest:

  • WeeklyWeightliftHours is the number of hours per week the person spends weightlifting.
  • WeightLiftedKG is how much weight the person could lift on the day of the survey.

Task 1: What is a reasonable research question that we could ask with this data?

Task 2: What is the dependent variable and independent variable for a quantitative analysis that we could do to answer this research question?

Task 3: Run a linear regression to answer your research question. Make sure to include a summary table of your regression results.

Task 4: Based on the regression output, what is the equation of the regression line? Be sure to include a slope and an intercept,102 and write the equation in the format \(y = mx+b\).

Task 5: Write the interpretation of the slope coefficient that you obtained from the regression output.

Task 6: Write the interpretation of the intercept that you obtained from the regression output.

Let’s step back and review why we do regressions. Of course, we do them to see if the dependent and independent variables are statistically associated with each other, but we also do them to find out whether the trends that we see in our data are (or are not) similar to those in the population at large.

Consider the fitness dataset above about weightlifting. Let’s say we wanted to know about the hours spent lifting and weight lifted in Boston, not just for the five people in the data. So then people in Boston would be our population of interest. The five people in the fitness dataset are five people that we surveyed out of this population. These five people are our sample. These are important terms to remember. Our goal is to use the sample (the data that we do have) to learn whatever we can about the population as a whole.

When you did the regression above, you found an association between your dependent and independent variables. But that association is only for our sample of five people. What about all of Boston? That’s where inference comes in. Inference is when you use your sample to attempt to figure out trends in your whole population. And this is what the Std. Error (standard error), t value, and Pr(>|t|) (p-value) in the regression output are all about.

To reiterate, we have our regression line for the five people. But what would the line look like for the entire population of Boston? Would it look the same or would the slope be different? If we want to know the true slope of the regression line in the population (the statistical relationship between hours spent lifting and weight lifted in all of Boston), we have to look at the other columns of the regression output and use inferential statistics.

Now look again at your regression output.

Task 7: In the output, what is the p-value for the independent variable? What does this p-value mean for the question of whether the true population regression line also has the same slope as the line for your sample?

Task 8: In the regression output, what do we get when we divide the coefficient for the independent variable by its standard error? Do you see that number somewhere else in the same line of the regression table? (You should!)
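The relationship that Task 8 asks about holds for any regression. Here is a sketch using the built-in mtcars dataset (so that it does not give away the answer for the fitness data):

```r
reg_cars <- lm(mpg ~ wt, data = mtcars)

# summary()$coefficients is a matrix with one row per term in the model
tab <- summary(reg_cars)$coefficients

est  <- tab["wt", "Estimate"]    # the slope coefficient
se   <- tab["wt", "Std. Error"]  # its standard error
tval <- tab["wt", "t value"]     # the reported t value

est / se  # this ratio...
tval      # ...matches the t value in the same row of the regression table
```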

9.4.2 Predicted Values and Residuals

In this part of the assignment, you will practice looking at the predicted values and errors in linear regression. You will continue to use the same fitness dataset from earlier.

Task 9: Copy the fitness data into a table in Word or Excel.103 Here’s a copy of it, so that you don’t have to go up and find it:

fitness
##       Name WeeklyWeightliftHours WeightLiftedKG
## 1 Person A                     3             20
## 2 Person B                     4             30
## 3 Person C                     4             21
## 4 Person D                     2             25
## 5 Person E                     6             40

Task 10: Add new columns to this table for the following items: 1) Predicted WeightLiftedKG, 2) Residual, 3) Difference from mean. Plug104 each WeeklyWeightliftHours value into the regression equation to get the Predicted WeightLiftedKG for each person. These are called the predicted values or fitted values. This is how much weight the regression line “thinks” each person lifted, based on the data we gave it.

Task 11: Calculate a residual value for each person. This is the difference between the actual and predicted value of how much each person lifted. You can think of this as the error for each person in the dataset.
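If you want to check your hand calculations afterward, R can compute the predicted values and residuals for you with the fitted() and residuals() functions. This sketch rebuilds the fitness regression so that it runs on its own (the object name reg_fit is just a placeholder):

```r
# Rebuild the fitness data and regression so this sketch is self-contained
fitness <- data.frame(
  WeeklyWeightliftHours = c(3, 4, 4, 2, 6),
  WeightLiftedKG        = c(20, 30, 21, 25, 40)
)
reg_fit <- lm(WeightLiftedKG ~ WeeklyWeightliftHours, data = fitness)

fitted(reg_fit)     # the predicted WeightLiftedKG for each person
residuals(reg_fit)  # the residual (actual minus predicted) for each person
```

A residual is always the actual value minus the predicted value, so the five residuals here are exactly `fitness$WeightLiftedKG - fitted(reg_fit)`.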

Task 12: Are predicted values and residuals calculated for the dependent variable or independent variable?

Task 13: Are predicted values and residuals calculated for the y values or the x values?105

Task 14: What is the sum of squares of the residuals? (Take each residual that you just calculated, square it, and then add up those five numbers.) This is called the SSR, or Sum of Squared Residuals.

Task 15: Calculate the difference from the mean of each Y value. Here’s how you do that:

  1. Calculate the mean of WeightLiftedKG (across all five people in the data).

  2. For each person (row), calculate the difference between the mean and WeightLiftedKG. This is the difference from the mean.

Task 16: Calculate the sum of squares of the difference from the mean. This is called SST, or Sum of Squares Total.

Task 17: Calculate \(1-\frac{SSR}{SST}\). Show your work. This should be equal to the Multiple R-squared value from the regression output.
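Once you have done Tasks 14 through 17 by hand, you can check your arithmetic in R. This sketch is self-contained and uses the same fitness data:

```r
# Rebuild the fitness data and regression
fitness <- data.frame(
  WeeklyWeightliftHours = c(3, 4, 4, 2, 6),
  WeightLiftedKG        = c(20, 30, 21, 25, 40)
)
reg_fit <- lm(WeightLiftedKG ~ WeeklyWeightliftHours, data = fitness)

# SSR: square each residual and add them up
SSR <- sum(residuals(reg_fit)^2)

# SST: square each difference from the mean of Y and add them up
SST <- sum((fitness$WeightLiftedKG - mean(fitness$WeightLiftedKG))^2)

1 - SSR / SST              # this should match...
summary(reg_fit)$r.squared # ...the Multiple R-squared in the regression output
```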

The correlation of WeightLiftedKG and WeeklyWeightliftHours is shown below:

# Calculate correlation between two vectors/variables
CorrelationWeightHours <- cor(fitness$WeeklyWeightliftHours,fitness$WeightLiftedKG)
CorrelationWeightHours
## [1] 0.7677303

Task 18: Look again at the Multiple R-squared value from the regression output. What is the relationship between Multiple R-squared and the correlation coefficient displayed above? Hint: the correlation coefficient is often referred to as R.

Remember: We want SSR to be as low as possible, which will make \(R^2\) as high as possible. SSR is calculated from the residuals, and residuals are the errors in the regression model’s predictions. SSR is all of that error totaled up, so if SSR is low, then error is low. \(R^2\) is one of the most commonly used metrics for determining how well a regression model fits your data.
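You can confirm the relationship between the correlation coefficient and Multiple R-squared directly in R, using the fitness data from above:

```r
fitness <- data.frame(
  WeeklyWeightliftHours = c(3, 4, 4, 2, 6),
  WeightLiftedKG        = c(20, 30, 21, 25, 40)
)

r   <- cor(fitness$WeeklyWeightliftHours, fitness$WeightLiftedKG)
rsq <- summary(lm(WeightLiftedKG ~ WeeklyWeightliftHours, data = fitness))$r.squared

r^2  # squaring the correlation coefficient...
rsq  # ...gives the Multiple R-squared value
```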

9.4.3 Logistical Tasks

Task 19: Include any questions you have that you wrote down as you read the chapter and did the assignment this week.

Task 20: Please submit your assignment to the online dropbox in D2L as a PDF or HTML file (which you created using RMarkdown106). Please use the same file naming procedure that we used for the previous assignments.

Task 21: Please submit your assignment to me by email.


  100. When you are doing a study that uses quantitative methods and has data with missing values, you should explain how exactly you are handling those missing values. Sometimes you may choose to impute (calculate predicted values for) those missing values rather than just dropping them from your analysis. However, for the purpose of getting your regressions and diagnostics to work in this course, the procedure explained here should be helpful.↩︎

  101. We have also referred previously to the slope as m (as in mx+b), \(b_1\) (as in \(b_1x + b_0\)), coefficient, and coefficient estimate.↩︎

  102. Note that the intercept is also sometimes referred to as the constant.↩︎

  103. You will have to submit this table with your homework, in case that helps you decide whether to use Word or Excel. You can also hand-write it, take a photo, and submit that.↩︎

  104. Keep in mind that you practiced plugging values into linear equations in last week’s assignment.↩︎

  105. This is identical to the previous question but with different terminology.↩︎

  106. You can also submit your raw RMarkdown file, if you wish.↩︎