Chapter 5 Linear Regression

  • When do we use this? When the response variable \(Y\) is numeric.

5.1 Fit a LSLR Line (Build an LSLR model)

m1 <- lm( y ~ x , data = dataset)

  • m1 = replace with any name you want for your model. However, names cannot contain spaces or special characters (*, &, etc.)
  • y = replace with your response variable
  • x = replace with your explanatory variable
  • dataset = replace with the name of your data set

NOTE: This code does NOT plot the line. The job of this code is to do all the math we need for LSLR. It computes the estimated slope and intercept, the \(R^2\), the residuals, the fitted values, and a lot more!
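
For example, if you were working with the built-in mtcars data set (used here only as a stand-in for your own data), with mpg as the response and wt as the explanatory variable, the code would look like:

# Hypothetical example: mpg is Y, wt is X, mtcars is the data set
m1 <- lm(mpg ~ wt, data = mtcars)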

5.2 Obtain the Summary Table

summary(m1)$coefficients

  • m1 = replace with the name of your LSLR model
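
For the hypothetical mtcars model from Section 5.1, this prints one row per coefficient (the intercept and wt) with columns for the estimate, standard error, t statistic, and p-value:

# Summary table for the hypothetical mtcars model
summary(m1)$coefficients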

5.3 Obtain the Estimated Slope and Intercept

m1$coefficients

  • m1 = replace with the name of your LSLR model

5.4 Writing the LSLR line

In the white space in your Markdown file (NOT IN A CODE CHUNK!), put

$$\widehat{y} = intercept + slope x$$

  • Replace the intercept and slope with the values from your line!
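
For example, the hypothetical mtcars model from Section 5.1 has an estimated intercept of roughly 37.3 and an estimated slope of roughly -5.3, so the written line would look something like:

$$\widehat{mpg} = 37.3 - 5.3 wt$$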

5.5 To Draw the LSLR line on top of a scatter plot

library(ggplot2)
ggplot(data, aes(x = , y = )) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE)

  • data = replace with the name of your data set
  • x = your x variable (no quotes)
  • y = your y variable (no quotes)
  • DO NOT change anything in the stat_smooth part of the code. Leave it as y ~ x.
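
Filling in the template with the hypothetical mtcars example (wt on the x axis, mpg on the y axis):

library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(formula = y ~ x, method = "lm", se = FALSE)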

5.6 Obtain the R-squared

summary(m1)$r.squared

  • m1 = replace with the name of your LSLR model

5.7 Obtain the residuals

m1$residuals

  • m1 = replace with the name of your LSLR model

5.8 Creating a residual plot

To create a residual plot, you just create a scatter plot with the residuals (above) on the Y axis.

ggplot(dataset, aes(x = y, y = m1$residuals)) + geom_point()

  • m1: replace with the name of your model
  • dataset: replace with the name of your data set
  • y: replace with the name of the column holding your Y variable.
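
Continuing the hypothetical mtcars example, where mpg is the Y variable and m1 is the model from Section 5.1:

ggplot(mtcars, aes(x = mpg, y = m1$residuals)) + geom_point()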

5.9 Obtain the studentized residuals

library(MASS)

studres(m1)

  • m1 = replace with the name of your LSLR model

5.10 Outliers

To identify outliers, you can use:

library(MASS)

which( studres(m1)> 3 | studres(m1) < -3)

  • m1 = replace with the name of your LSLR model

This will print out the row numbers of the observations in the data that are flagged as outliers.

To create a data set holding the outliers,

outliers <- dataset[ which( studres(m1)> 3 | studres(m1) < -3) , ]

  • m1 : replace with the name of your LSLR model
  • dataset: replace with the name of your data set.

To fit a model without the outliers,

lm( y ~ x, data = dataset[ -which( studres(m1)> 3 | studres(m1) < -3) , ])

  • m1 = replace with the name of your LSLR model (the same model you used to compute the studentized residuals)
  • y = replace with your response variable
  • x = replace with your explanatory variable
  • dataset = replace with the name of your data set
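
Putting these steps together for the hypothetical mtcars model (the object names out_rows and m1_noOut are just examples; the refit step should only be run if at least one outlier was actually flagged, otherwise the subsetting removes every row):

library(MASS)
# Row numbers of any observations with studentized residuals beyond +/- 3
out_rows <- which(studres(m1) > 3 | studres(m1) < -3)
# Data set holding only the flagged rows
outliers <- mtcars[out_rows, ]
# Refit the model without the flagged rows (only if out_rows is not empty)
if (length(out_rows) > 0) {
  m1_noOut <- lm(mpg ~ wt, data = mtcars[-out_rows, ])
}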

5.11 Obtain the \(\hat{Y}\) values

m1$fitted.values

  • m1 = replace with the name of your LSLR model

5.12 Find the Correlation between Two Variables

cor(dataset$var1, dataset$var2)

  • dataset = replace with the name of your data set
  • var1 = replace with the name of your first variable
  • var2 = replace with the name of your second variable
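
For example, with the mtcars data set, the correlation between mpg and wt comes out strongly negative (roughly -0.87):

cor(mtcars$mpg, mtcars$wt)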

5.13 Creating a QQPlot

qqnorm(m1$residuals, main = "The Title You Want")
qqline(m1$residuals)

  • m1 = replace with the name of your LSLR model
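
For the hypothetical mtcars model from Section 5.1, this might look like:

qqnorm(m1$residuals, main = "Normal QQ Plot of the Residuals")
qqline(m1$residuals)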

5.14 Analysis of Variance with One Model Only

anova(model)

  • model = replace with the name of your model

This gives you the breakdown of the sum of squares for one model.

5.15 Nested F-test

anova(model1, model2)

  • model1 = the smaller model
  • model2 = the larger model; model1 must be nested in model2
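
For example, with the mtcars data, a model using only wt is nested in a model using both wt and hp, so a nested F-test comparing them could look like this (m_small and m_large are just example names):

m_small <- lm(mpg ~ wt, data = mtcars)
m_large <- lm(mpg ~ wt + hp, data = mtcars)
anova(m_small, m_large)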

5.16 Best Subset Selection

library(leaps)

BSSOut <- regsubsets( y ~ x1 + x2 + x3, data = , nvmax = )

  • Replace y with your response variable
  • Replace x1, x2, x3 with your possible predictors (as many as you like)
  • data = your data set
  • nvmax = the total number of coefficients you want R to consider
  • NOTE: For categorical variables, you have to make sure this number incorporates the number of levels. One categorical predictor with 4 levels would contribute 3 to the nvmax total.

plot(BSSOut, scale = "adjr2")

  • This creates a plot that lets you see the results of best subset selection.
  • You can change adjr2, which means \(R^2_{adj}\), to bic if you want to use the BIC as a metric, or to Cp if you want to use Mallows’ Cp.
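
For example, with the mtcars data and four possible quantitative predictors of mpg (so nvmax can be set to 4), the code might look like:

library(leaps)
BSSOut <- regsubsets(mpg ~ wt + hp + disp + qsec, data = mtcars, nvmax = 4)
# Plot the results using adjusted R-squared as the metric
plot(BSSOut, scale = "adjr2")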