# 5 Exercise 1 - Transformation of Predictors

The data frame SIMDATAXT contains simulated data for the response, y, and predictors, $$x_1$$, $$x_2$$, and $$x_3$$. We will apply appropriate transformations to $$x_1$$, $$x_1$$, and$$x_1$$ to linearise the relationships between the response and predictors one at a time.
Remember that when you use operators such as $$\text{+, -, ^, *, }$$ you must use the identity function, $$I()$$ , to inhibit the interpretation of your formula operator as an arithmetic operator

(a) Below are some questions to help you follow the right steps. Starting with $$x_1$$:

1. Does the plot of $$x_1$$ against $$Y$$ appear linear?

2. Which transformation produces a scatterplot that seem to present a linear relationship?

3. Do both the residual plots, as well as the and the coefficient of determination, $$R^2$$ support this transformed model?

If no, go back to ii.

# Plot of x1 against Y
plot(y ~ x1, data = SIMDATAXT) # This appears non-linear

# Plot of data with a square root transformation
plot(y ~ I(x1^0.5), data = SIMDATAXT) # This appears linear

# Define your linear model
Model1 <- lm(y ~ I(x1^0.5), data = SIMDATAXT)

# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))

qqnorm(rstandard(Model1))

#Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x1^0.5), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.6885 -0.8014 -0.0047  0.6989  3.4644
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.44101    0.17178  -2.567    0.011 *
## I(x1^0.5)    1.07688    0.02769  38.890   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 196 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8853, Adjusted R-squared:  0.8847
## F-statistic:  1512 on 1 and 196 DF,  p-value: < 2.2e-16
# 0.8853 is high enough to indicate a good fit and the assumpions here are met.

(b) Repeat for $$x_2$$

# Plot of x2 against Y
plot(y ~ x2, data = SIMDATAXT) # This appears non-linear

# Plot of data with a squared transformation
plot(y ~ I(x2^2), data = SIMDATAXT) # This appears linear

# Define your linear model
Model1 <- lm(y ~ I(x2^2), data = SIMDATAXT)

# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))

qqnorm(rstandard(Model1))

#Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x2^2), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5088 -0.7309 -0.0139  0.7235  3.4310
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3336     0.1679  -1.987   0.0482 *
## I(x2^2)       1.0619     0.0272  39.040   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.8844
## F-statistic:  1524 on 1 and 198 DF,  p-value: < 2.2e-16
# 0.885 is high enough to indicate a good fit and the assumpions here are met.

(c) Repeat for $$x_3$$

# Plot of x3 against Y
plot(y ~ x3, data = SIMDATAXT) # This appears non-linear

# Plot of data with an inverse transformation
plot(y ~ I(x3^-1), data = SIMDATAXT) # This appears linear

# Define your linear model
Model1 <- lm(y ~ I(x3^-1), data = SIMDATAXT)

# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))

qqnorm(rstandard(Model1))

#Get the coefficient of determination, R-squared
summary(Model1)
##
## Call:
## lm(formula = y ~ I(x3^-1), data = SIMDATAXT)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5088 -0.7309 -0.0139  0.7235  3.4310
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3336     0.1679  -1.987   0.0482 *
## I(x3^-1)      1.0619     0.0272  39.040   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared:  0.885,  Adjusted R-squared:  0.8844
## F-statistic:  1524 on 1 and 198 DF,  p-value: < 2.2e-16
# 0.885 is high enough to indicate a good fit and the assumpions here are met.

(d) Which selection of variables therefore needed to be transformed?