# 5 Exercise 1 - Transformation of Predictors

The data frame SIMDATAXT contains simulated data for the response, y, and predictors, \(x_1\), \(x_2\), and \(x_3\).
We will apply appropriate transformations to \(x_1\), \(x_1\), and\(x_1\) to linearise the relationships between the response and predictors one at a time.

*Remember that when you use operators such as* \(\text{+, -, ^, *, }\) *you must use the identity function, * \(I()\) *, to inhibit the interpretation of your formula operator as an arithmetic operator*

(a) Below are some questions to help you follow the right steps. Starting with \(x_1\):

Does the plot of \(x_1\) against \(Y\) appear linear?

Which transformation produces a scatterplot that seem to present a linear relationship?

Do both the residual plots, as well as the and the coefficient of determination, \(R^2\) support this transformed model?

If no, go back to

*ii.*

```
# Plot of x1 against Y
plot(y ~ x1, data = SIMDATAXT) # This appears non-linear
```

```
# Plot of data with a square root transformation
plot(y ~ I(x1^0.5), data = SIMDATAXT) # This appears linear
```

```
# Define your linear model
<- lm(y ~ I(x1^0.5), data = SIMDATAXT)
Model1
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
```

`qqnorm(rstandard(Model1))`

```
#Get the coefficient of determination, R-squared
summary(Model1)
```

```
##
## Call:
## lm(formula = y ~ I(x1^0.5), data = SIMDATAXT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6885 -0.8014 -0.0047 0.6989 3.4644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.44101 0.17178 -2.567 0.011 *
## I(x1^0.5) 1.07688 0.02769 38.890 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.083 on 196 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8853, Adjusted R-squared: 0.8847
## F-statistic: 1512 on 1 and 196 DF, p-value: < 2.2e-16
```

`# 0.8853 is high enough to indicate a good fit and the assumpions here are met.`

(b) Repeat for \(x_2\)

```
# Plot of x2 against Y
plot(y ~ x2, data = SIMDATAXT) # This appears non-linear
```

```
# Plot of data with a squared transformation
plot(y ~ I(x2^2), data = SIMDATAXT) # This appears linear
```

```
# Define your linear model
<- lm(y ~ I(x2^2), data = SIMDATAXT)
Model1
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
```

`qqnorm(rstandard(Model1))`

```
#Get the coefficient of determination, R-squared
summary(Model1)
```

```
##
## Call:
## lm(formula = y ~ I(x2^2), data = SIMDATAXT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5088 -0.7309 -0.0139 0.7235 3.4310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3336 0.1679 -1.987 0.0482 *
## I(x2^2) 1.0619 0.0272 39.040 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8844
## F-statistic: 1524 on 1 and 198 DF, p-value: < 2.2e-16
```

`# 0.885 is high enough to indicate a good fit and the assumpions here are met.`

(c) Repeat for \(x_3\)

```
# Plot of x3 against Y
plot(y ~ x3, data = SIMDATAXT) # This appears non-linear
```

```
# Plot of data with an inverse transformation
plot(y ~ I(x3^-1), data = SIMDATAXT) # This appears linear
```

```
# Define your linear model
<- lm(y ~ I(x3^-1), data = SIMDATAXT)
Model1
# Get the residual plots
plot(rstandard(Model1) ~ fitted(Model1))
```

`qqnorm(rstandard(Model1))`

```
#Get the coefficient of determination, R-squared
summary(Model1)
```

```
##
## Call:
## lm(formula = y ~ I(x3^-1), data = SIMDATAXT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5088 -0.7309 -0.0139 0.7235 3.4310
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3336 0.1679 -1.987 0.0482 *
## I(x3^-1) 1.0619 0.0272 39.040 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.087 on 198 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.8844
## F-statistic: 1524 on 1 and 198 DF, p-value: < 2.2e-16
```

`# 0.885 is high enough to indicate a good fit and the assumpions here are met.`

(d) Which selection of variables therefore needed to be transformed?