R code

Please find some R code used to fit and plot some models. You can find most data sets on Moodle. Please download data and try some of the R code as you read through these notes.

Example

The EC Water Framework Directive (WFD) 2000 states that there should be ‘good status’ in all shallow waters by 2016. Therefore, it is of interest for environmental management to investigate lake typology and identify lakes that are susceptible to large amounts of algae.

An ecologist has recorded the total biovolume of algae in lakes over the summer months. He also measured the altitude, mean depth, retention time, alkalinity, colour and total phosphorus (TP) of the lakes. He was interested in which of these lake typology variables were related to the total amount of algae in the lakes. All variables were transformed using natural logs originally to make the distribution of each variable more symmetric.

The variable names in the dataset are: lAltitude, lMeanDepth, lRettime, lAlkalinity, lColour, lTotalP and ltotalbio.

The question of interest is which of the lake typology variables are useful predictors of the log total biovolume, ltotalbio of algae within lakes?

First we do a exploratory analysis of the lakes data. A first step would be to plot all possible pairs of variables. This is to identify possible relationship with our response variable, identify high correlations, and possible multicollinearites, between explanatory variables, and identify explanatory variables we think could play an important role in our regression model. In a formal report, i.e. if this was a project, then we would want to be able to pick out the interesting plots for the full set of plots.

library(GGally)
ggpairs(lakes)

Initially, log total biovolume has a positive linear relationship with log Alkalinity (0486) and log total phosphorus (0.426), and log Alkalinity and log total phosphorus have a relatively high correlation (0.488). Log mean depth appears to be related to most of the variables we have in the data set.

Now that we have explored the dataset and since we have not found any obvious multicollinearity we will use all the variables in the model to create the full model and explore the confidence intervals of the coefficients.

lakes.all <- lm(ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity + lColour + 
                  lTotalP, data = lakes)
summary(lakes.all)
ggcoef(lakes.all,exclude_intercept = TRUE)
lakes.all <- lm(ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity + lColour + 
                  lTotalP, data = lakes)
print(summary(lakes.all),concise=TRUE)
## 
## Call:
## lm(formula = ltotalbio ~ lAltitude + lMeanDepth + lRettime + 
##     lAlkalinity + lColour + lTotalP, data = lakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4642 -0.9078  0.0429  1.0253  3.9294 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.88317    0.59396  23.374  < 2e-16 ***
## lAltitude   -0.09303    0.08978  -1.036 0.301355    
## lMeanDepth  -0.05890    0.11774  -0.500 0.617479    
## lRettime     0.04063    0.07984   0.509 0.611378    
## lAlkalinity  0.36380    0.09561   3.805 0.000188 ***
## lColour     -0.14891    0.10861  -1.371 0.171910    
## lTotalP      0.37189    0.09950   3.737 0.000243 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.475 on 200 degrees of freedom
## Multiple R-squared:  0.2966, Adjusted R-squared:  0.2754 
## F-statistic: 14.05 on 6 and 200 DF,  p-value: 2.51e-13
ggcoef(lakes.all,vline_color = "red",
  vline_linetype =  "solid",
  errorbar_color = "blue",
  errorbar_height = .25,exclude_intercept = TRUE)

Backward selection with AIC

We will do a model selection to see if we can reduce the model. We use the step function which by default uses the AIC criterion for model selection.

step(lakes.all)
## Start:  AIC=167.91
## ltotalbio ~ lAltitude + lMeanDepth + lRettime + lAlkalinity + 
##     lColour + lTotalP
## 
##               Df Sum of Sq    RSS    AIC
## - lMeanDepth   1    0.5447 435.95 166.17
## - lRettime     1    0.5638 435.96 166.18
## - lAltitude    1    2.3375 437.74 167.02
## - lColour      1    4.0920 439.49 167.85
## <none>                     435.40 167.91
## - lTotalP      1   30.4100 465.81 179.89
## - lAlkalinity  1   31.5182 466.92 180.38
## 
## Step:  AIC=166.17
## ltotalbio ~ lAltitude + lRettime + lAlkalinity + lColour + lTotalP
## 
##               Df Sum of Sq    RSS    AIC
## - lRettime     1     0.187 436.13 164.26
## - lAltitude    1     2.966 438.91 165.58
## - lColour      1     3.553 439.50 165.85
## <none>                     435.95 166.17
## - lTotalP      1    31.416 467.36 178.58
## - lAlkalinity  1    39.566 475.51 182.16
## 
## Step:  AIC=164.26
## ltotalbio ~ lAltitude + lAlkalinity + lColour + lTotalP
## 
##               Df Sum of Sq    RSS    AIC
## - lAltitude    1     2.937 439.07 163.65
## - lColour      1     3.721 439.85 164.02
## <none>                     436.13 164.26
## - lTotalP      1    32.960 469.09 177.34
## - lAlkalinity  1    39.386 475.52 180.16
## 
## Step:  AIC=163.65
## ltotalbio ~ lAlkalinity + lColour + lTotalP
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     439.07 163.65
## - lColour      1     4.939 444.01 163.97
## - lTotalP      1    32.902 471.97 176.61
## - lAlkalinity  1    62.350 501.42 189.14
## 
## Call:
## lm(formula = ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes)
## 
## Coefficients:
## (Intercept)  lAlkalinity      lColour      lTotalP  
##     13.3658       0.4261      -0.1476       0.3806

Here we have used the AIC criteria. At each step we have dropped a predictor and evaluated the AIC of the new model (without that predictor). Specifically,

  • we started with the full model that had an AIC value of 168.

  • Dropping one of the variables of the full model, let’s say lTotallP, had as a result an increase of the AIC to 180. If on the other hand the dropped variable was lMean Depth then the AIC would be 166. This is the lowest possible AIC value by removing one variable from the full model so this is the variable we decide to drop.

  • Continuing this procedure, we end up with a model with three variables.

  • In the final step we start with an AIC of 164 and the row that gives the lowest AIC is the one with <none> implying that we should not gain anything by dropping this variable.

So our final model is ltotalbio ~ lAlkalinity + lColour + lTotalP. Note that if we had used a different criterion, we might have chosen a different model.

Now we look at the summary of the reduced model.

# Reduced Model
lakes.step <- lm(ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes) 
summary(lakes.step)
## 
## Call:
## lm(formula = ltotalbio ~ lAlkalinity + lColour + lTotalP, data = lakes)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.6126 -0.9624  0.0269  0.9864  4.0390 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.36578    0.42041  31.792  < 2e-16 ***
## lAlkalinity  0.42610    0.07936   5.369 2.15e-07 ***
## lColour     -0.14764    0.09770  -1.511 0.132308    
## lTotalP      0.38058    0.09758   3.900 0.000131 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.471 on 203 degrees of freedom
## Multiple R-squared:  0.2906, Adjusted R-squared:  0.2801 
## F-statistic: 27.72 on 3 and 203 DF,  p-value: 4.551e-15

We notice that the predictor lColour would have been dropped based on the p-value criterion. Our objective here was not to find a final model with only signification effects.

Backward selection p-values

In this case, we begin with the full model with all possible explanatory variables and remove on at at time based on coefficient p-values.

library(olsrr)
ols_back_p<-ols_step_backward_p(lakes.all)
ols_back_p
## 
## 
##                              Stepwise Summary                              
## -------------------------------------------------------------------------
## Step    Variable        AIC        SBC       SBIC        R2       Adj. R2 
## -------------------------------------------------------------------------
##  0      Full Model    757.355    784.017    170.402    0.29655    0.27545 
##  1      lMeanDepth    755.614    778.943    168.575    0.29567    0.27815 
##  2      lRettime      753.703    773.699    166.592    0.29537    0.28142 
## -------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.543       RMSE                 1.452 
## R-Squared               0.295       MSE                  2.159 
## Adj. R-Squared          0.281       Coef. Var           10.473 
## Pred R-Squared          0.254       AIC                753.703 
## MAE                     1.141       SBC                773.699 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares         DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression    182.819          4         45.705    21.169    0.0000 
## Residual      436.133        202          2.159                     
## Total         618.951        206                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.715         0.516                 26.594    0.000    12.698    14.732 
##   lAltitude    -0.102         0.087       -0.080    -1.166    0.245    -0.274     0.070 
## lAlkalinity     0.379         0.089        0.324     4.271    0.000     0.204     0.555 
##     lColour    -0.130         0.099       -0.081    -1.313    0.191    -0.325     0.065 
##     lTotalP     0.381         0.097        0.271     3.907    0.000     0.189     0.573 
## ----------------------------------------------------------------------------------------
summary(ols_back_p$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4321 -0.9007  0.0192  1.0319  3.9546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.71476    0.51571  26.594  < 2e-16 ***
## lAltitude   -0.10194    0.08740  -1.166 0.244839    
## lAlkalinity  0.37940    0.08883   4.271 2.99e-05 ***
## lColour     -0.12972    0.09882  -1.313 0.190766    
## lTotalP      0.38092    0.09749   3.907 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 202 degrees of freedom
## Multiple R-squared:  0.2954, Adjusted R-squared:  0.2814 
## F-statistic: 21.17 on 4 and 202 DF,  p-value: 1.359e-14

Notice that in this model, some variables still have p-values >0.05. This is because the default parameters specific a much higher cut off point of 0.3. While we have consistently used 0.05 to decide between signification and non-significant effects (or regression coefficients that are/are not significiantly different from zero), this is not a fixed number.

ols_back_p2<-ols_step_backward_p(lakes.all,prem=0.05)
summary(ols_back_p2$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.4321 -0.9007  0.0192  1.0319  3.9546 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.71476    0.51571  26.594  < 2e-16 ***
## lAltitude   -0.10194    0.08740  -1.166 0.244839    
## lAlkalinity  0.37940    0.08883   4.271 2.99e-05 ***
## lColour     -0.12972    0.09882  -1.313 0.190766    
## lTotalP      0.38092    0.09749   3.907 0.000127 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.469 on 202 degrees of freedom
## Multiple R-squared:  0.2954, Adjusted R-squared:  0.2814 
## F-statistic: 21.17 on 4 and 202 DF,  p-value: 1.359e-14
ols_back_p2$metrics
##   step   variable        r2    adj_r2      aic      sbc     sbic mallows_cp     rmse
## 1    1 lMeanDepth 0.2956707 0.2781501 755.6138 778.9428 168.5755   5.250208 1.451212
## 2    2   lRettime 0.2953684 0.2814153 753.7026 773.6989 166.5925   3.336156 1.451523

From the last chunk of output, we can see which variables were removed at each step, including the values of some selection criterion. Please note that we have strictly used p-values to add and remove variables. As such there are no constraints to say that the value removed satisfies any of the other selection criterion. For example removing lMeanDepth at the first stage did not necessarily produce a model that minimised BIC.

Forward-Backward selection with BIC

The ols_step_both_bic function which uses BIC criteria for adding or removing variables in the model starting from the null model. In other words, we have the option to add or remove variables (after at least the second step in the algorithm).

ols_both_bic<-ols_step_both_sbc(lakes.all,details=TRUE)
## Stepwise Selection Method 
## -------------------------
## 
## Candidate Terms: 
## 
## 1. lAltitude 
## 2. lMeanDepth 
## 3. lRettime 
## 4. lAlkalinity 
## 5. lColour 
## 6. lTotalP 
## 
## 
## Step     => 0 
## Model    => ltotalbio ~ 1 
## SBC      => 824.8347 
## 
## Initiating stepwise selection... 
## 
##                        Table: Adding New Variables                        
## -------------------------------------------------------------------------
## Predictor      DF      AIC        SBC       SBIC        R2       Adj. R2  
## -------------------------------------------------------------------------
## lAltitude       1    800.745    810.743    212.423    0.08957     0.08513 
## lMeanDepth      1    807.087    817.085    218.648    0.06125     0.05667 
## lRettime        1    819.603    829.601    230.937    0.00273    -0.00213 
## lAlkalinity     1    764.477    774.475    176.843    0.23589     0.23216 
## lColour         1    820.137    830.136    231.462    0.00015    -0.00472 
## lTotalP         1    778.809    788.808    190.901    0.18111     0.17712 
## -------------------------------------------------------------------------
## 
## Step     => 1 
## Added    => lAlkalinity 
## Model    => ltotalbio ~ lAlkalinity 
## SBC      => 774.4748 
## 
##                       Table: Adding New Variables                       
## -----------------------------------------------------------------------
## Predictor     DF      AIC        SBC       SBIC        R2       Adj. R2 
## -----------------------------------------------------------------------
## lAltitude      1    765.001    778.332    177.315    0.24132    0.23388 
## lMeanDepth     1    766.436    779.767    178.709    0.23604    0.22855 
## lRettime       1    765.703    779.034    177.997    0.23874    0.23128 
## lColour        1    766.050    779.381    178.334    0.23746    0.22999 
## lTotalP        1    753.408    766.738    166.056    0.28264    0.27561 
## -----------------------------------------------------------------------
## 
## Step     => 2 
## Added    => lTotalP 
## Model    => ltotalbio ~ lAlkalinity + lTotalP 
## SBC      => 766.7385 
## 
##                    Table: Removing Existing Variables                    
## ------------------------------------------------------------------------
## Predictor      DF      AIC        SBC       SBIC        R2       Adj. R2 
## ------------------------------------------------------------------------
## lAlkalinity     1    778.809    788.808    190.901    0.18111    0.17712 
## lTotalP         1    764.477    774.475    176.843    0.23589    0.23216 
## ------------------------------------------------------------------------
## 
##                       Table: Adding New Variables                       
## -----------------------------------------------------------------------
## Predictor     DF      AIC        SBC       SBIC        R2       Adj. R2 
## -----------------------------------------------------------------------
## lAltitude      1    753.461    770.125    166.215    0.28936    0.27886 
## lMeanDepth     1    755.401    772.065    168.080    0.28267    0.27207 
## lRettime       1    755.250    771.913    167.934    0.28319    0.27260 
## lColour        1    753.092    769.756    165.861    0.29062    0.28014 
## -----------------------------------------------------------------------
## 
## 
## No more variables to be added or removed.
## 
## Variables Selected: 
## 
## => lAlkalinity 
## => lTotalP
ols_both_bic
## 
## 
##                                 Stepwise Summary                                
## ------------------------------------------------------------------------------
## Step    Variable             AIC        SBC       SBIC        R2       Adj. R2 
## ------------------------------------------------------------------------------
##  0      Base Model         818.169    824.835    230.037    0.00000    0.00000 
##  1      lAlkalinity (+)    764.477    774.475    176.843    0.23589    0.23216 
##  2      lTotalP (+)        753.408    766.738    166.056    0.28264    0.27561 
## ------------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.532       RMSE                 1.465 
## R-Squared               0.283       MSE                  2.177 
## Adj. R-Squared          0.276       Coef. Var           10.515 
## Pred R-Squared          0.259       AIC                753.408 
## MAE                     1.149       SBC                766.738 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares         DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression    174.942          2         87.471    40.189    0.0000 
## Residual      444.009        204          2.177                     
## Total         618.951        206                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.083         0.378                 34.632    0.000    12.339    13.828 
## lAlkalinity     0.428         0.080        0.365     5.373    0.000     0.271     0.585 
##     lTotalP     0.348         0.095        0.248     3.646    0.000     0.160     0.536 
## ----------------------------------------------------------------------------------------
summary(ols_both_bic$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7557 -0.9364  0.0289  1.0249  3.8661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.08342    0.37779  34.632  < 2e-16 ***
## lAlkalinity  0.42775    0.07960   5.373  2.1e-07 ***
## lTotalP      0.34820    0.09549   3.646 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.475 on 204 degrees of freedom
## Multiple R-squared:  0.2826, Adjusted R-squared:  0.2756 
## F-statistic: 40.19 on 2 and 204 DF,  p-value: 1.928e-15
ols_both_bic$metrics
##   step    variable   method        r2    adj_r2      aic      sbc     sbic
## 1    1 lAlkalinity addition 0.2358921 0.2321647 764.4766 774.4748 176.8430
## 2    2     lTotalP addition 0.2826431 0.2756102 753.4076 766.7385 166.0562

In this example, it just so happened that we added variables at each step, once we have at least one variable in the model (assuming we begin with no variables, i.e. the null model) we did have the option to add or remove a variable at each step.

We can see that we have ended up with a model with three predictors, i.e. ltotalbio ~ lAlkalinity + lColour + lTotalP based on BIC. Let’s repeat using p-values.

ols_both_p2<-ols_step_both_p(lakes.all,penter=0.05,prem=0.05)
ols_both_p2
## 
## 
##                                 Stepwise Summary                                
## ------------------------------------------------------------------------------
## Step    Variable             AIC        SBC       SBIC        R2       Adj. R2 
## ------------------------------------------------------------------------------
##  0      Base Model         818.169    824.835    230.037    0.00000    0.00000 
##  1      lAlkalinity (+)    764.477    774.475    176.843    0.23589    0.23216 
##  2      lTotalP (+)        753.408    766.738    166.056    0.28264    0.27561 
## ------------------------------------------------------------------------------
## 
## Final Model Output 
## ------------------
## 
##                          Model Summary                          
## ---------------------------------------------------------------
## R                       0.532       RMSE                 1.465 
## R-Squared               0.283       MSE                  2.177 
## Adj. R-Squared          0.276       Coef. Var           10.515 
## Pred R-Squared          0.259       AIC                753.408 
## MAE                     1.149       SBC                766.738 
## ---------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
##  AIC: Akaike Information Criteria 
##  SBC: Schwarz Bayesian Criteria 
## 
##                                ANOVA                                 
## --------------------------------------------------------------------
##                Sum of                                               
##               Squares         DF    Mean Square      F         Sig. 
## --------------------------------------------------------------------
## Regression    174.942          2         87.471    40.189    0.0000 
## Residual      444.009        204          2.177                     
## Total         618.951        206                                    
## --------------------------------------------------------------------
## 
##                                   Parameter Estimates                                    
## ----------------------------------------------------------------------------------------
##       model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
## ----------------------------------------------------------------------------------------
## (Intercept)    13.083         0.378                 34.632    0.000    12.339    13.828 
## lAlkalinity     0.428         0.080        0.365     5.373    0.000     0.271     0.585 
##     lTotalP     0.348         0.095        0.248     3.646    0.000     0.160     0.536 
## ----------------------------------------------------------------------------------------
summary(ols_both_p2$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7557 -0.9364  0.0289  1.0249  3.8661 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.08342    0.37779  34.632  < 2e-16 ***
## lAlkalinity  0.42775    0.07960   5.373  2.1e-07 ***
## lTotalP      0.34820    0.09549   3.646 0.000338 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.475 on 204 degrees of freedom
## Multiple R-squared:  0.2826, Adjusted R-squared:  0.2756 
## F-statistic: 40.19 on 2 and 204 DF,  p-value: 1.928e-15
ols_both_p2$metrics
##   step    variable   method        r2    adj_r2      aic      sbc     sbic mallows_cp
## 1    1 lAlkalinity addition 0.2358921 0.2321647 764.4766 774.4748 176.8430  14.246072
## 2    2     lTotalP addition 0.2826431 0.2756102 753.4076 766.7385 166.0562   2.954129
##       rmse
## 1 1.511543
## 2 1.464572

We can see that we have ended up with a model with two predictors, i.e. ltotalbio ~ lAlkalinity + TotalP based on p-values with a entering cut-off value of 0.05.

Once you have a model, the next step is to check model assumptions before interpretation.