5 Exercise 1 - Transformation of Predictors
The data frame SIMDATAXT contains simulated data for the response, y, and predictors, x1, x2, and x3.
We will apply appropriate transformations to x1, x2, and x3 to linearise the relationships between the response and each predictor, one at a time.
Remember that when you use operators such as +, -, ^, or * inside a formula, you must wrap the term in the identity function, I(), so that the operator is interpreted as an arithmetic operator rather than as a formula operator.
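As a minimal sketch of why I() matters, the following compares the same formula with and without it. The data frame toy and its columns are made up for illustration and are not part of SIMDATAXT.

```r
# Hypothetical data, simulated only to illustrate the formula syntax
set.seed(1)
toy <- data.frame(x = runif(50, 1, 10))
toy$y <- 2 + 3 * toy$x^2 + rnorm(50)

# Without I(), ^ is a formula operator: x^2 expands to just x
fit_plain <- lm(y ~ x^2, data = toy)

# With I(), ^ is ordinary arithmetic, so the predictor is x squared
fit_squared <- lm(y ~ I(x^2), data = toy)

# Compare the terms that were actually fitted
names(coef(fit_plain))
names(coef(fit_squared))
```

The first model contains the term x, while the second contains I(x^2), so only the second fits the squared predictor.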
(a) Below are some questions to help you follow the right steps. Starting with x1:
i. Does the plot of x1 against Y appear linear?
ii. Which transformation produces a scatterplot that seems to show a linear relationship?
iii. Do both the residual plots and the coefficient of determination, R2, support this transformed model?
iv. If not, go back to ii.
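The trial-and-error loop in these steps can be sketched by fitting one model per candidate transformation and comparing R-squared. This is only an illustration under stated assumptions: the data frame sim is a simulated stand-in (not SIMDATAXT), and the list of candidates is an arbitrary choice.

```r
# Hypothetical stand-in for SIMDATAXT: y depends on the square root of x1
set.seed(42)
sim <- data.frame(x1 = runif(100, 1, 50))
sim$y <- sqrt(sim$x1) + rnorm(100, sd = 0.3)

# Candidate transformations to try in step ii (an assumed shortlist)
candidates <- list(
  none    = function(x) x,
  sqrt    = function(x) x^0.5,
  square  = function(x) x^2,
  inverse = function(x) x^-1
)

# Fit a simple linear model for each candidate and record R-squared (step iii)
r_squared <- sapply(candidates, function(f) {
  tx <- f(sim$x1)                      # transformed predictor
  summary(lm(sim$y ~ tx))$r.squared
})

# Candidates ordered from highest to lowest R-squared
round(sort(r_squared, decreasing = TRUE), 3)
```

R-squared alone does not settle the choice; the scatterplot and residual plots for the leading candidate still need to be inspected.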
# Plot of x1 against Y
plot(y ~ x1, data = SIMDATAXT) # This appears non-linear
# Plot of data with a square root transformation
plot(y ~ I(x1^0.5), data = SIMDATAXT) # This appears linear
# Define your linear model
Model1 <- lm(y ~ I(x1^0.5), data = SIMDATAXT)
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
qqnorm(rstandard(Model1))
# Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x1^0.5), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6885 -0.8014 -0.0047  0.6989  3.4644
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.44101    0.17178  -2.567    0.011 *
## I(x1^0.5)    1.07688    0.02769  38.890   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 196 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8853, Adjusted R-squared: 0.8847
## F-statistic: 1512 on 1 and 196 DF, p-value: < 2.2e-16
# An R-squared of 0.8853 is high enough to indicate a good fit, and the assumptions here are met.
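The quantities read off the summary can also be extracted programmatically, which helps when checking how many observations were dropped for missingness. The sketch below uses a simulated data frame, sim_demo, as a hypothetical stand-in for SIMDATAXT, with two missing values inserted on purpose; the reference lines added to the residual plots are an optional extra, not part of the original solution.

```r
# Hypothetical stand-in data with two missing x1 values
set.seed(123)
sim_demo <- data.frame(x1 = runif(200, 1, 80))
sim_demo$y <- sqrt(sim_demo$x1) + rnorm(200)
sim_demo$x1[c(5, 17)] <- NA

# lm() drops incomplete rows by default (na.action = na.omit)
fit_sqrt <- lm(y ~ I(x1^0.5), data = sim_demo)

# Number of observations actually used, and R-squared
n_used <- nobs(fit_sqrt)
r2 <- summary(fit_sqrt)$r.squared

# Residual plots with reference lines to aid the visual check
plot(rstandard(fit_sqrt) ~ fitted(fit_sqrt))
abline(h = 0, lty = 2)
qqnorm(rstandard(fit_sqrt))
qqline(rstandard(fit_sqrt))
```

Here n_used is 198 rather than 200 because the two rows with missing x1 are excluded, mirroring the "observations deleted due to missingness" note in the summary output.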
(b) Repeat for x2
# Plot of x2 against Y
plot(y ~ x2, data = SIMDATAXT) # This appears non-linear
# Plot of data with a squared transformation
plot(y ~ I(x2^2), data = SIMDATAXT) # This appears linear
# Define your linear model
Model1 <- lm(y ~ I(x2^2), data = SIMDATAXT)
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
qqnorm(rstandard(Model1))
# Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x2^2), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5088 -0.7309 -0.0139  0.7235  3.4310
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3336     0.1679  -1.987   0.0482 *
## I(x2^2)       1.0619     0.0272  39.040   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8844
## F-statistic: 1524 on 1 and 198 DF, p-value: < 2.2e-16
# An R-squared of 0.885 is high enough to indicate a good fit, and the assumptions here are met.
(c) Repeat for x3
# Plot of x3 against Y
plot(y ~ x3, data = SIMDATAXT) # This appears non-linear
# Plot of data with an inverse transformation
plot(y ~ I(x3^-1), data = SIMDATAXT) # This appears linear
# Define your linear model
Model1 <- lm(y ~ I(x3^-1), data = SIMDATAXT)
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
qqnorm(rstandard(Model1))
# Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x3^-1), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5088 -0.7309 -0.0139  0.7235  3.4310
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3336     0.1679  -1.987   0.0482 *
## I(x3^-1)      1.0619     0.0272  39.040   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8844
## F-statistic: 1524 on 1 and 198 DF, p-value: < 2.2e-16
# An R-squared of 0.885 is high enough to indicate a good fit, and the assumptions here are met.
(d) Which selection of variables therefore needed to be transformed?