Chapter 5 Linear Regression

  • When do we use this? When the response variable \(Y\) is numeric.

5.1 Fit a LSLR Line (Build an LSLR model)

m1 <- lm( y ~ x , data = dataset)

  • m1 = replace with any name you want for your model. However, names cannot contain spaces or special characters (*, &, etc.)
  • y = replace with your response variable
  • x = replace with your explanatory variable
  • dataset = replace with the name of your data set

NOTE: This code does NOT plot the line. The job of this code is to do all the math we need for LSLR. It computes the estimated slope and intercept, the \(R^2\), the residuals, the fitted values, and a lot more!
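
For example, if you were working with the built-in mtcars data set (used here only as a stand-in for your own data), with mpg as the response and wt as the explanatory variable, the code would look like:

# Hypothetical example: mpg is Y, wt is X, mtcars is the data set
m1 <- lm(mpg ~ wt, data = mtcars)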

5.2 Obtain the Summary Table

summary(m1)$coefficients

  • m1 = replace with the name of your LSLR model
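
For the hypothetical mtcars model from Section 5.1, this prints one row per coefficient (the intercept and wt) with columns for the estimate, standard error, t statistic, and p-value:

# Summary table for the hypothetical mtcars model
summary(m1)$coefficients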

5.3 Obtain the Estimated Slope and Intercept

m1$coefficients

  • m1 = replace with the name of your LSLR model

5.4 Writing the LSLR line

In the white space in your Markdown file (NOT IN A CODE CHUNK!), put

$$\widehat{y} = intercept + slope x$$

  • Replace the intercept and slope with the values from your line!
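
For example, the hypothetical mtcars model from Section 5.1 has an estimated intercept of roughly 37.3 and an estimated slope of roughly -5.3, so the written line would look something like:

$$\widehat{mpg} = 37.3 - 5.3 wt$$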

5.5 To Draw the LSLR line on top of a scatter plot

library(ggplot2)
ggplot(data, aes(x = , y = )) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE)

  • data = replace with the name of your data set
  • x = your x variable (no quotes)
  • y = your y variable (no quotes)
  • DO NOT change anything in the stat_smooth part of the code. Leave it as y ~ x.
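
Filling in the template with the hypothetical mtcars example (wt on the x axis, mpg on the y axis):

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE)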

5.6 Obtain the R-squared

summary(m1)$r.squared

  • m1 = replace with the name of your LSLR model

5.7 Obtain the residuals

m1$residuals

  • m1 = replace with the name of your LSLR model

5.8 Creating a residual plot

To create a residual plot, you just create a scatter plot with the residuals (above) on the Y axis.

ggplot(dataset, aes(x = y, y = m1$residuals)) + geom_point()

  • m1: replace with the name of your model
  • dataset: replace with the name of your data set
  • y: replace with the name of the column holding your Y variable.
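
Continuing the hypothetical mtcars example, where mpg is the Y variable and m1 is the model from Section 5.1:

ggplot(mtcars, aes(x = mpg, y = m1$residuals)) + geom_point()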

5.9 Obtain the studentized residuals

library(MASS)

studres(m1)

  • m1 = replace with the name of your LSLR model

5.10 Outliers

To identify outliers, you can use:

library(MASS)

which( studres(m1)> 3 | studres(m1) < -3)

  • m1 = replace with the name of your LSLR model

This will print out the row numbers of the observations in the data that are flagged as outliers.

To create a data set holding the outliers,

outliers <- dataset[ which( studres(m1)> 3 | studres(m1) < -3) , ]

  • m1 : replace with the name of your LSLR model
  • dataset: replace with the name of your data set.

To fit a model without the outliers,

lm( y ~ x, data = dataset[ -which( studres(m1)> 3 | studres(m1) < -3) , ])

  • m1 = replace with the name of your LSLR model (the same model you used to compute the studentized residuals)
  • y = replace with your response variable
  • x = replace with your explanatory variable
  • dataset = replace with the name of your data set
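
Putting these steps together for the hypothetical mtcars model (the object names out_rows and m1_noOut are just examples; the refit step should only be run if at least one outlier was actually flagged, otherwise the subsetting removes every row):

library(MASS)
# Row numbers of any observations with studentized residuals beyond +/- 3
out_rows <- which(studres(m1) > 3 | studres(m1) < -3)
# Data set holding only the flagged rows
outliers <- mtcars[out_rows, ]
# Refit the model without the flagged rows (only if out_rows is not empty)
if (length(out_rows) > 0) {
  m1_noOut <- lm(mpg ~ wt, data = mtcars[-out_rows, ])
}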

5.11 Obtain the \(\hat{Y}\) values

m1$fitted.values

  • m1 = replace with the name of your LSLR model

5.12 Find the Correlation between Two Variables

cor(dataset$var1, dataset$var2)

  • dataset = replace with the name of your data set
  • var1 = replace with the name of your first variable
  • var2 = replace with the name of your second variable
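
For example, with the mtcars data set, the correlation between mpg and wt comes out strongly negative (roughly -0.87):

cor(mtcars$mpg, mtcars$wt)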

5.13 Creating a QQPlot

qqnorm(m1$residuals, main = "The Title You Want")
qqline(m1$residuals)

  • m1 = replace with the name of your LSLR model
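
For the hypothetical mtcars model from Section 5.1, this might look like:

qqnorm(m1$residuals, main = "Normal QQ Plot of the Residuals")
qqline(m1$residuals)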

5.14 Analysis of Variance with One Model Only

anova(model)

  • model = replace with the name of your model

This gives you the breakdown of the sum of squares for one model.

5.15 Nested F-test

anova(model1, model2)

  • model1 = the smaller model
  • model2 = the larger model; model1 must be nested in model2
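
For example, with the mtcars data, a model using only wt is nested in a model using both wt and hp, so a nested F-test comparing them could look like this (m_small and m_large are just example names):

m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp, data = mtcars)
anova(m_small, m_large)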

5.16 Best Subset Selection

library(leaps)

BSSOut <- regsubsets( y ~ x1 + x2 + x3, data = , nvmax = )

  • Replace y with your response variable
  • Replace x1, x2, x3 with your possible predictors (as many as you like)
  • data = your data set
  • nvmax = the total number of coefficients you want R to consider
  • NOTE: For categorical variables, you have to make sure this number incorporates the number of levels. One categorical predictor with 4 levels would contribute 3 to the nvmax total.

plot(BSSOut, scale = "adjr2")

  • This creates a plot that lets you see the results of best subset selection.
  • You can change adjr2, which means \(R^2_{adj}\), to bic if you want to use the BIC as a metric, or to Cp if you want to use Mallows’ Cp.
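
For example, with the mtcars data and four possible quantitative predictors of mpg (so nvmax can be set to 4), the code might look like:

library(leaps)
BSSOut <- regsubsets(mpg ~ wt + hp + disp + qsec, data = mtcars, nvmax = 4)
# Plot the results using adjusted R-squared as the metric
plot(BSSOut, scale = "adjr2")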