Chapter 5 Linear Regression

  • When do we use this?: When \(Y\) is a numeric variable.

5.1 Fit a LSLR Line

m1 <- lm( y ~ x , data = dataset)

  • m1 = replace with any name you want for your model. However, names cannot contain spaces or special characters (*, &, etc)
  • y = replace with your response variable
  • x = replace with your explanatory variable
  • dataset = replace with the name of your data set

NOTE: This code does NOT plot the line. The job of this code is to do all the math we need for LSLR. It computes the estimated slope and intercept, the \(R^2\), the residuals, the fitted values, and a lot more!

5.2 To Draw the LSLR line on top of a scatter plot

library(ggplot2)
ggplot(  data  ,  aes(  x = , y =  )  )  +
  geom_point(  ) +
  stat_smooth(formula = y ~ x, method="lm",  se=FALSE)
  • data= replace with the name of your data set
  • x = your x variable (no quotes)
  • y = your y variable (no quotes)
  • DO NOT change anything in the geom_smooth part of the code. Leave it as y ~ x.

5.3 Obtain the Summary Table

summary(m1)
  • m1 = replace with the name of your LSLR model

5.4 Obtain the Estimated Slope and Intercept

m1$coefficients

-m1 = replace with the name of your LSLR model

5.5 Obtain the R-squared

summary(m1)$r.squared

  • m1 = replace with the name of your LSLR model

5.6 Obtain the residuals

m1$residuals

  • m1 = replace with the name of your LSLR model

5.7 Creating a residual plot

To create a residual plot, you just create a scatter plot with the residuals (above) on the Y axis.

5.8 Obtain the studentized residuals

library(MASS)

studres(m1)

  • m1 = replace with the name of your LSLR model

5.9 Obtain the \(\hat{Y}\) values

m1$fitted.values

  • m1 = replace with the name of your LSLR model

5.10 Find the Correlation between Two Variables

cor(dataset$var1, dataset$var2)

  • dataset = replace with the name of your data set
  • var1 = replace with the name of your first variable
  • var2 = replace with the name of your second variable

5.11 Creating a QQPlot

qqnorm(m1$residuals, main ="The Title you Want")`
qqline(m1$residuals)`
  • m1 needs to be replaced with the name of your model

5.12 Analysis of Variance With one model only

  • anova(model)

This gives you the breakdown of the sum of squares for one model.

5.13 Nested F-test

anova(model1, model2)

  • model1 = the smaller model
  • model2 = the larger model; model 1 must be nested in model2

5.14 Best Subset Selection

library(leaps)

BSSOut <- regsubsets( y ~ x1 + x2 + x3, data = , nvmax = )

  • Replace y with your response variable
  • Replace x1, x2, x3 with your possible predictors (as many as you like)
  • data = your data set
  • nvmax = the total number of coefficients you want R to consider
  • NOTE: For categorical variables, you have to make sure this number incorporates the number of levels. One categorical predictor with 4 levels would contribute 3 to the nvmax total.

plot(BSSOut, method = "adjr2")

  • This creates a plot to allow you to see your outcome.
  • You can change adjr2 which means \(R^2_{adj}\), to be bic if you want to use the BIC as a metric or Cp if you want to use Mallows’ Cp.