R code
Please find some R code used to fit and plot some models. You can find most data sets on Moodle. Please download data and try some of the R code as you read through these notes.
Example
The EC Water Framework Directive (WFD) 2000 states that there should be ‘good status’ in all shallow waters by 2016. Therefore, it is of interest for environmental management to investigate lake typology and identify lakes that are susceptible to large amounts of algae.
An ecologist has recorded the total biovolume of algae in lakes over the summer months. He also measured the altitude, mean depth, retention time, alkalinity, colour and total phosphorus (TP) of the lakes. He was interested in which of these lake typology variables were related to the total amount of algae in the lakes. All variables were transformed using natural logs originally to make the distribution of each variable more symmetric.
The variable names in the dataset are:
lAltitude
, lMeanDepth
, lRettime
, lAlkalinity
, lColour
, lTotalP
and ltotalbio
.
The question of interest is which of the lake typology variables are useful predictors of the log total biovolume, ltotalbio
of algae within lakes?
First we do a exploratory analysis of the lakes data. A first step would be to plot all possible pairs of variables. This is to identify possible relationship with our response variable, identify high correlations, and possible multicollinearites, between explanatory variables, and identify explanatory variables we think could play an important role in our regression model. In a formal report, i.e. if this was a project, then we would want to be able to pick out the interesting plots for the full set of plots.
Initially, log total biovolume has a positive linear relationship with log Alkalinity (0486) and log total phosphorus (0.426), and log Alkalinity and log total phosphorus have a relatively high correlation (0.488). Log mean depth appears to be related to most of the variables we have in the data set.
Now that we have explored the dataset and since we have not found any obvious multicollinearity we will use all the variables in the model to create the full model and explore the confidence intervals of the coefficients.
lakes.all <- lm(ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity + lColour +
lTotalP, data = lakes)
summary(lakes.all)
ggcoef(lakes.all,exclude_intercept = TRUE)
lakes.all <- lm(ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity + lColour +
lTotalP, data = lakes)
print(summary(lakes.all),concise=TRUE)
##
## Call:
## lm(formula = ltotalbio ~ lAltitude + lMeanDepth + lRettime +
## lAlkalinity + lColour + lTotalP, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4642 -0.9078 0.0429 1.0253 3.9294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.88317 0.59396 23.374 < 2e-16 ***
## lAltitude -0.09303 0.08978 -1.036 0.301355
## lMeanDepth -0.05890 0.11774 -0.500 0.617479
## lRettime 0.04063 0.07984 0.509 0.611378
## lAlkalinity 0.36380 0.09561 3.805 0.000188 ***
## lColour -0.14891 0.10861 -1.371 0.171910
## lTotalP 0.37189 0.09950 3.737 0.000243 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 200 degrees of freedom
## Multiple R-squared: 0.2966, Adjusted R-squared: 0.2754
## F-statistic: 14.05 on 6 and 200 DF, p-value: 2.51e-13
Backward selection with AIC
We will do a model selection to see if we can reduce the model. We use the step
function which by default uses the AIC
criterion for model selection.
## Start: AIC=167.91
## ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity +
## lColour + lTotalP
##
## Df Sum of Sq RSS AIC
## - lMeanDepth 1 0.5447 435.95 166.17
## - lRettime 1 0.5638 435.96 166.18
## - lAltitude 1 2.3375 437.74 167.02
## - lColour 1 4.0920 439.49 167.85
## <none> 435.40 167.91
## - lTotalP 1 30.4100 465.81 179.89
## - lAlkalinity 1 31.5182 466.92 180.38
##
## Step: AIC=166.17
## ltotalbio ~ lAltitude + lRettime + lAlkalinity + lColour + lTotalP
##
## Df Sum of Sq RSS AIC
## - lRettime 1 0.187 436.13 164.26
## - lAltitude 1 2.966 438.91 165.58
## - lColour 1 3.553 439.50 165.85
## <none> 435.95 166.17
## - lTotalP 1 31.416 467.36 178.58
## - lAlkalinity 1 39.566 475.51 182.16
##
## Step: AIC=164.26
## ltotalbio ~ lAltitude + lAlkalinity + lColour + lTotalP
##
## Df Sum of Sq RSS AIC
## - lAltitude 1 2.937 439.07 163.65
## - lColour 1 3.721 439.85 164.02
## <none> 436.13 164.26
## - lTotalP 1 32.960 469.09 177.34
## - lAlkalinity 1 39.386 475.52 180.16
##
## Step: AIC=163.65
## ltotalbio ~ lAlkalinity + lColour + lTotalP
##
## Df Sum of Sq RSS AIC
## <none> 439.07 163.65
## - lColour 1 4.939 444.01 163.97
## - lTotalP 1 32.902 471.97 176.61
## - lAlkalinity 1 62.350 501.42 189.14
##
## Call:
## lm(formula = ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes)
##
## Coefficients:
## (Intercept) lAlkalinity lColour lTotalP
## 13.3658 0.4261 -0.1476 0.3806
Here we have used the AIC criteria. At each step we have dropped a predictor and evaluated the AIC of the new model (without that predictor). Specifically,
we started with the full model that had an AIC value of 168.
Dropping one of the variables of the full model, let’s say lTotallP, had as a result an increase of the AIC to 180. If on the other hand the dropped variable was lMean Depth then the AIC would be 166. This is the lowest possible AIC value by removing one variable from the full model so this is the variable we decide to drop.
Continuing this procedure, we end up with a model with three variables.
In the final step we start with an AIC of 164 and the row that gives the lowest AIC is the one with
<none>
implying that we should not gain anything by dropping this variable.
So our final model is ltotalbio ~ lAlkalinity + lColour + lTotalP
. Note that if we had used a different criterion, we might have chosen a different model.
Now we look at the summary of the reduced model.
# Reduced Model
lakes.step <- lm(ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes)
summary(lakes.step)
##
## Call:
## lm(formula = ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6126 -0.9624 0.0269 0.9864 4.0390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.36578 0.42041 31.792 < 2e-16 ***
## lAlkalinity 0.42610 0.07936 5.369 2.15e-07 ***
## lColour -0.14764 0.09770 -1.511 0.132308
## lTotalP 0.38058 0.09758 3.900 0.000131 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.471 on 203 degrees of freedom
## Multiple R-squared: 0.2906, Adjusted R-squared: 0.2801
## F-statistic: 27.72 on 3 and 203 DF, p-value: 4.551e-15
We notice that the predictor lColour
would have been dropped based on the p-value criterion. Our objective here was not to find a final model with only signification effects.
Backward selection p-values
In this case, we begin with the full model with all possible explanatory variables and remove on at at time based on coefficient p-values.
##
##
## Stepwise Summary
## -------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## -------------------------------------------------------------------------
## 0 Full Model 757.355 784.017 170.402 0.29655 0.27545
## 1 lMeanDepth 755.614 778.943 168.575 0.29567 0.27815
## 2 lRettime 753.703 773.699 166.592 0.29537 0.28142
## -------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ---------------------------------------------------------------
## R 0.543 RMSE 1.452
## R-Squared 0.295 MSE 2.159
## Adj. R-Squared 0.281 Coef. Var 10.473
## Pred R-Squared 0.254 AIC 753.703
## MAE 1.141 SBC 773.699
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 182.819 4 45.705 21.169 0.0000
## Residual 436.133 202 2.159
## Total 618.951 206
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 13.715 0.516 26.594 0.000 12.698 14.732
## lAltitude -0.102 0.087 -0.080 -1.166 0.245 -0.274 0.070
## lAlkalinity 0.379 0.089 0.324 4.271 0.000 0.204 0.555
## lColour -0.130 0.099 -0.081 -1.313 0.191 -0.325 0.065
## lTotalP 0.381 0.097 0.271 3.907 0.000 0.189 0.573
## ----------------------------------------------------------------------------------------
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4321 -0.9007 0.0192 1.0319 3.9546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.71476 0.51571 26.594 < 2e-16 ***
## lAltitude -0.10194 0.08740 -1.166 0.244839
## lAlkalinity 0.37940 0.08883 4.271 2.99e-05 ***
## lColour -0.12972 0.09882 -1.313 0.190766
## lTotalP 0.38092 0.09749 3.907 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.469 on 202 degrees of freedom
## Multiple R-squared: 0.2954, Adjusted R-squared: 0.2814
## F-statistic: 21.17 on 4 and 202 DF, p-value: 1.359e-14
Notice that in this model, some variables still have p-values >0.05. This is because the default parameters specific a much higher cut off point of 0.3. While we have consistently used 0.05 to decide between signification and non-significant effects (or regression coefficients that are/are not significiantly different from zero), this is not a fixed number.
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.4321 -0.9007 0.0192 1.0319 3.9546
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.71476 0.51571 26.594 < 2e-16 ***
## lAltitude -0.10194 0.08740 -1.166 0.244839
## lAlkalinity 0.37940 0.08883 4.271 2.99e-05 ***
## lColour -0.12972 0.09882 -1.313 0.190766
## lTotalP 0.38092 0.09749 3.907 0.000127 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.469 on 202 degrees of freedom
## Multiple R-squared: 0.2954, Adjusted R-squared: 0.2814
## F-statistic: 21.17 on 4 and 202 DF, p-value: 1.359e-14
## step variable r2 adj_r2 aic sbc sbic mallows_cp rmse
## 1 1 lMeanDepth 0.2956707 0.2781501 755.6138 778.9428 168.5755 5.250208 1.451212
## 2 2 lRettime 0.2953684 0.2814153 753.7026 773.6989 166.5925 3.336156 1.451523
From the last chunk of output, we can see which variables were removed at each step, including the values of some selection criterion. Please note that we have strictly used p-values to add and remove variables. As such there are no constraints to say that the value removed satisfies any of the other selection criterion. For example removing lMeanDepth
at the first stage did not necessarily produce a model that minimised BIC.
Forward-Backward selection with BIC
The ols_step_both_bic
function which uses BIC criteria for adding or removing variables in the model starting from the null model. In other words, we have the option to add or remove variables (after at least the second step in the algorithm).
## Stepwise Selection Method
## -------------------------
##
## Candidate Terms:
##
## 1. lAltitude
## 2. lMeanDepth
## 3. lRettime
## 4. lAlkalinity
## 5. lColour
## 6. lTotalP
##
##
## Step => 0
## Model => ltotalbio ~ 1
## SBC => 824.8347
##
## Initiating stepwise selection...
##
## Table: Adding New Variables
## -------------------------------------------------------------------------
## Predictor DF AIC SBC SBIC R2 Adj. R2
## -------------------------------------------------------------------------
## lAltitude 1 800.745 810.743 212.423 0.08957 0.08513
## lMeanDepth 1 807.087 817.085 218.648 0.06125 0.05667
## lRettime 1 819.603 829.601 230.937 0.00273 -0.00213
## lAlkalinity 1 764.477 774.475 176.843 0.23589 0.23216
## lColour 1 820.137 830.136 231.462 0.00015 -0.00472
## lTotalP 1 778.809 788.808 190.901 0.18111 0.17712
## -------------------------------------------------------------------------
##
## Step => 1
## Added => lAlkalinity
## Model => ltotalbio ~ lAlkalinity
## SBC => 774.4748
##
## Table: Adding New Variables
## -----------------------------------------------------------------------
## Predictor DF AIC SBC SBIC R2 Adj. R2
## -----------------------------------------------------------------------
## lAltitude 1 765.001 778.332 177.315 0.24132 0.23388
## lMeanDepth 1 766.436 779.767 178.709 0.23604 0.22855
## lRettime 1 765.703 779.034 177.997 0.23874 0.23128
## lColour 1 766.050 779.381 178.334 0.23746 0.22999
## lTotalP 1 753.408 766.738 166.056 0.28264 0.27561
## -----------------------------------------------------------------------
##
## Step => 2
## Added => lTotalP
## Model => ltotalbio ~ lAlkalinity + lTotalP
## SBC => 766.7385
##
## Table: Removing Existing Variables
## ------------------------------------------------------------------------
## Predictor DF AIC SBC SBIC R2 Adj. R2
## ------------------------------------------------------------------------
## lAlkalinity 1 778.809 788.808 190.901 0.18111 0.17712
## lTotalP 1 764.477 774.475 176.843 0.23589 0.23216
## ------------------------------------------------------------------------
##
## Table: Adding New Variables
## -----------------------------------------------------------------------
## Predictor DF AIC SBC SBIC R2 Adj. R2
## -----------------------------------------------------------------------
## lAltitude 1 753.461 770.125 166.215 0.28936 0.27886
## lMeanDepth 1 755.401 772.065 168.080 0.28267 0.27207
## lRettime 1 755.250 771.913 167.934 0.28319 0.27260
## lColour 1 753.092 769.756 165.861 0.29062 0.28014
## -----------------------------------------------------------------------
##
##
## No more variables to be added or removed.
##
## Variables Selected:
##
## => lAlkalinity
## => lTotalP
##
##
## Stepwise Summary
## ------------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## ------------------------------------------------------------------------------
## 0 Base Model 818.169 824.835 230.037 0.00000 0.00000
## 1 lAlkalinity (+) 764.477 774.475 176.843 0.23589 0.23216
## 2 lTotalP (+) 753.408 766.738 166.056 0.28264 0.27561
## ------------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ---------------------------------------------------------------
## R 0.532 RMSE 1.465
## R-Squared 0.283 MSE 2.177
## Adj. R-Squared 0.276 Coef. Var 10.515
## Pred R-Squared 0.259 AIC 753.408
## MAE 1.149 SBC 766.738
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 174.942 2 87.471 40.189 0.0000
## Residual 444.009 204 2.177
## Total 618.951 206
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 13.083 0.378 34.632 0.000 12.339 13.828
## lAlkalinity 0.428 0.080 0.365 5.373 0.000 0.271 0.585
## lTotalP 0.348 0.095 0.248 3.646 0.000 0.160 0.536
## ----------------------------------------------------------------------------------------
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7557 -0.9364 0.0289 1.0249 3.8661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.08342 0.37779 34.632 < 2e-16 ***
## lAlkalinity 0.42775 0.07960 5.373 2.1e-07 ***
## lTotalP 0.34820 0.09549 3.646 0.000338 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 204 degrees of freedom
## Multiple R-squared: 0.2826, Adjusted R-squared: 0.2756
## F-statistic: 40.19 on 2 and 204 DF, p-value: 1.928e-15
## step variable method r2 adj_r2 aic sbc sbic
## 1 1 lAlkalinity addition 0.2358921 0.2321647 764.4766 774.4748 176.8430
## 2 2 lTotalP addition 0.2826431 0.2756102 753.4076 766.7385 166.0562
In this example, it just so happened that we added variables at each step, once we have at least one variable in the model (assuming we begin with no variables, i.e. the null model) we did have the option to add or remove a variable at each step.
We can see that we have ended up with a model with three predictors, i.e. ltotalbio ~ lAlkalinity + lColour + lTotalP
based on BIC. Let’s repeat using p-values.
##
##
## Stepwise Summary
## ------------------------------------------------------------------------------
## Step Variable AIC SBC SBIC R2 Adj. R2
## ------------------------------------------------------------------------------
## 0 Base Model 818.169 824.835 230.037 0.00000 0.00000
## 1 lAlkalinity (+) 764.477 774.475 176.843 0.23589 0.23216
## 2 lTotalP (+) 753.408 766.738 166.056 0.28264 0.27561
## ------------------------------------------------------------------------------
##
## Final Model Output
## ------------------
##
## Model Summary
## ---------------------------------------------------------------
## R 0.532 RMSE 1.465
## R-Squared 0.283 MSE 2.177
## Adj. R-Squared 0.276 Coef. Var 10.515
## Pred R-Squared 0.259 AIC 753.408
## MAE 1.149 SBC 766.738
## ---------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
## AIC: Akaike Information Criteria
## SBC: Schwarz Bayesian Criteria
##
## ANOVA
## --------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## --------------------------------------------------------------------
## Regression 174.942 2 87.471 40.189 0.0000
## Residual 444.009 204 2.177
## Total 618.951 206
## --------------------------------------------------------------------
##
## Parameter Estimates
## ----------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ----------------------------------------------------------------------------------------
## (Intercept) 13.083 0.378 34.632 0.000 12.339 13.828
## lAlkalinity 0.428 0.080 0.365 5.373 0.000 0.271 0.585
## lTotalP 0.348 0.095 0.248 3.646 0.000 0.160 0.536
## ----------------------------------------------------------------------------------------
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7557 -0.9364 0.0289 1.0249 3.8661
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.08342 0.37779 34.632 < 2e-16 ***
## lAlkalinity 0.42775 0.07960 5.373 2.1e-07 ***
## lTotalP 0.34820 0.09549 3.646 0.000338 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.475 on 204 degrees of freedom
## Multiple R-squared: 0.2826, Adjusted R-squared: 0.2756
## F-statistic: 40.19 on 2 and 204 DF, p-value: 1.928e-15
## step variable method r2 adj_r2 aic sbc sbic mallows_cp
## 1 1 lAlkalinity addition 0.2358921 0.2321647 764.4766 774.4748 176.8430 14.246072
## 2 2 lTotalP addition 0.2826431 0.2756102 753.4076 766.7385 166.0562 2.954129
## rmse
## 1 1.511543
## 2 1.464572
We can see that we have ended up with a model with two predictors, i.e. ltotalbio ~ lAlkalinity + TotalP
based on p-values with a entering cut-off value of 0.05.
Once you have a model, the next step is to check model assumptions before interpretation.