# 1 Linear Regression

## 1.1 Introduction

You should be familiar with linear regression, so this section is likely a review of this model. Also, linear regression is a well established method and it is well studied, both from the theoretical and practical perspective. Therefore, there are many aspects that are referred in the textbook but we will not explore much in this section, such as, outliers, testing, heteroscedasticity, leverage power, but you should be familiar with these terms.

Read the following chapters of An introduction to statistical learning:

## 1.3 Practice session

### Task 1 - Fit a linear model

With the bmd.csv dataset, we want to fit a linear model to predict bone mineral density (BMD) based on AGE, SEX and BMI (BMI has to be computed) and we want to compute the $$R^2$$ and MSE for the models that were fitted.

Let’s first read the data and compute “BMI”

#libraries that we will need
library(psych) #for the function pairs.panels
set.seed(1974) #fix the random generator seed 
#read the dataset
bmd.data     <-
stringsAsFactors = TRUE)

bmd.data$bmi <- bmd.data$weight_kg / (bmd.data$height_cm/100)^2 summary(bmd.data) ## id age sex fracture weight_kg ## Min. : 35 Min. :35.81 F:83 fracture : 50 Min. :36.00 ## 1st Qu.: 2018 1st Qu.:54.42 M:86 no fracture:119 1st Qu.:56.00 ## Median : 6702 Median :63.49 Median :64.50 ## Mean : 9103 Mean :63.63 Mean :64.67 ## 3rd Qu.:17100 3rd Qu.:72.08 3rd Qu.:73.00 ## Max. :24208 Max. :88.75 Max. :96.00 ## height_cm medication waiting_time bmd ## Min. :142.0 Anticonvulsant : 9 Min. : 5.00 Min. :0.4076 ## 1st Qu.:154.0 Glucocorticoids: 24 1st Qu.: 9.00 1st Qu.:0.6708 ## Median :160.5 No medication :136 Median :14.00 Median :0.7861 ## Mean :160.3 Mean :19.74 Mean :0.7831 ## 3rd Qu.:166.0 3rd Qu.:24.00 3rd Qu.:0.8888 ## Max. :177.0 Max. :96.00 Max. :1.3624 ## bmi ## Min. :15.43 ## 1st Qu.:22.15 ## Median :24.96 ## Mean :25.20 ## 3rd Qu.:27.55 ## Max. :38.54 Before we model, let’s look at the correlation structure of the variables involved pairs.panels(bmd.data[c("bmd", "age","sex", "bmi")], method = "pearson", # correlation method hist.col = "#00AFBB", density = TRUE, # show density plots ellipses = TRUE # show correlation ellipses ) We fit a linear model for BMD and evaluate the R-squared #Fits a linear model with fixed effects only model1.bmd <- lm(bmd ~ age + sex + bmi, data = bmd.data) summary(model1.bmd) ## ## Call: ## lm(formula = bmd ~ age + sex + bmi, data = bmd.data) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.38207 -0.07669 -0.00654 0.07888 0.51256 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.6063945 0.0834051 7.270 1.36e-11 *** ## age -0.0041579 0.0008625 -4.821 3.23e-06 *** ## sexM 0.0949602 0.0213314 4.452 1.56e-05 *** ## bmi 0.0155913 0.0024239 6.432 1.30e-09 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.138 on 165 degrees of freedom ## Multiple R-squared: 0.3254, Adjusted R-squared: 0.3131 ## F-statistic: 26.53 on 3 and 165 DF, p-value: 4.677e-14  mean(model1.bmd$residuals^2) #MSE
## [1] 0.01859697

TRY IT YOURSELF:

1. Fit a linear model with the interaction age*sex - call it model 2
See the solution code
#Fits a linear model with an interaction age*sex
model2.bmd <- lm(bmd ~ age * sex + bmi, data = bmd.data)
summary(model2.bmd)
mean(model2.bmd$residuals^2) #MSE 1. Fit a linear model with the the interaction age*sex and cubic effect for BMI - call it model 3 See the solution code #Fits a linear model with an interaction and polynomial f model3.bmd <- lm(bmd ~ age*sex + bmi + I(bmi^2) + I(bmi^3), data = bmd.data) #You could use the poly() function to fit the same model #however, poly() will use orthogonal polynomials #so the coefficients will not be the same as above #summary(lm(bmd ~ age*sex + poly(bmi,3) , data = bmd.data)) summary(model3.bmd) mean(model3.bmd$residuals^2) #MSE

### Task 2 - Predicting from a linear model

We first plot the scatter for BMD and BMI, then get the predictions from model 3 in task 1, for a new data where age=50, sex=F and we let BMI vary from 15 to 40. We also compute the predictions for males with similar characteristics. Finally, we add the fitted lines to the plot.

#Scatter plot of BMD and BMI
plot(bmd.data$bmi, bmd.data$bmd,
col = ifelse(bmd.data$sex=="F", "red", "blue")) #prediction from model b) in task 1 bmd.f50 <- predict(model3.bmd, newdata = data.frame(age=50, sex="F", bmi=seq(15,40))) bmd.m50 <- predict(model3.bmd, newdata = data.frame(age=50, sex="M", bmi=seq(15,40))) lines(seq(15,40), bmd.f50, col="red") lines(seq(15,40), bmd.m50, col="blue") TRY IT YOURSELF: 1. Produce the scatter plot for BMD and AGE, only for women See the solution code #Scatter plot of BMD and AGE plot( bmd.data$age[bmd.data$sex=="F"], bmd.data$bmd[bmd.data$sex=="F"]) 1. Predict the BMD for women, with a BMI=25 and AGE between 40 and 90, using model 3 from task 1 and plot the prediction See the solution code #Scatter plot of BMD and AGE plot( bmd.data$age[bmd.data$sex=="F"], bmd.data$bmd[bmd.data\$sex=="F"])

#prediction from model 3 in task 1
#(the prediction line only )
bmd.bmi25 <- predict(model3.bmd,
newdata = data.frame(age=seq(40,90), sex="F", bmi=25))
lines(seq(40,90), bmd.bmi25)

## 1.4 Exercises

Solve the following exercises from the book:

1. Exercise 4 (page 120)

2. Exercise 13 (page 124)

3. With the fat dataset in the library(faraway), we want to fit a linear model to predict body fat (variable brozek) using the variable abdom and age. After loading the library, use the command data(fat) tp load the dataset.

1. recode the variable age into age_cat with the following categories: <30, 30-50 and >50

2. fit a linear model using abdom and age_cat and compute the mean squared error

3. fit a linear model using abdom , age_cat and the interaction between these two predictors. Comment on the change in the mean squared error of this model compared to the one without the interaction.