18  Model evaluation

Author

David Carslaw

18.1 Background

The modStats function provides key model evaluation statistics for comparing models against measurements and models against other models.

There are a very wide range of evaluation statistics that can be used to assess model performance. There is, however, no single statistic that encapsulates all aspects of interest. For this reason it is useful to consider several performance statistics and also to understand the sort of information or insight they might provide.

In the following definitions, \(O_i\) represents the \(i\)th observed value and \(M_i\) represents the \(i\)th modelled value for a total of \(n\) observations.

Fraction of predictions within a factor of two, FAC2

The fraction of modelled values within a factor of two of the observed values is the fraction of model predictions that satisfy:

\[ 0.5 \leq \frac{M_i}{O_i} \leq 2.0 \tag{18.1}\]
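As a quick illustration, FAC2 can be computed directly in base R. This is a minimal sketch, not openair's implementation, and the obs and mod vectors are made-up values:

```r
## Hypothetical observed and modelled values
obs <- c(10, 20, 30, 40, 50)
mod <- c(12, 45, 28, 15, 55)

## Fraction of predictions within a factor of two of the observations
ratio <- mod / obs
fac2 <- mean(ratio >= 0.5 & ratio <= 2, na.rm = TRUE)
fac2
## 0.6
```

Here three of the five ratios (1.2, 0.93 and 1.1) fall within [0.5, 2], so FAC2 = 0.6.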

Mean bias, MB

The mean bias provides a good indication of the mean over- or under-estimate of predictions. The mean bias is in the same units as the quantities being considered.

\[ MB = \frac{1}{n}\sum_{i=1}^{n} (M_i - O_i) \tag{18.2}\]

Mean Gross Error, MGE

The mean gross error provides a good indication of the mean error regardless of whether it is an over or under estimate. Mean gross error is in the same units as the quantities being considered.

\[ MGE = \frac{1}{n}\sum_{i=1}^{n} |M_i - O_i| \tag{18.3}\]
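Both statistics take a few lines of base R to check. This sketch uses made-up data, not the modStats code itself:

```r
obs <- c(10, 20, 30, 40, 50)   # hypothetical observations
mod <- c(12, 18, 35, 38, 55)   # hypothetical predictions

MB  <- mean(mod - obs)         # mean bias, same units as obs
MGE <- mean(abs(mod - obs))    # mean gross error
c(MB = MB, MGE = MGE)
##  MB = 1.6, MGE = 3.2
```

Note how the positive and negative errors partly cancel in MB (1.6) but all contribute to MGE (3.2).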

Normalised mean bias, NMB

The normalised mean bias is useful for comparing pollutants that cover different concentration scales; the mean bias is normalised by dividing by the total observed concentration.

\[ NMB = \frac{\sum\limits_{i=1}^{n} (M_i - O_i)}{\sum\limits_{i=1}^{n} O_i} \tag{18.4}\]

Normalised mean gross error, NMGE

The normalised mean gross error is similar to the normalised mean bias but ignores whether each prediction is an over- or under-estimate.

\[ NMGE = \frac{\sum\limits_{i=1}^{n} |M_i - O_i|}{\sum\limits_{i=1}^{n} O_i} \tag{18.5}\]
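The normalised versions follow directly from the same made-up vectors (again a sketch, not openair's code):

```r
obs <- c(10, 20, 30, 40, 50)
mod <- c(12, 18, 35, 38, 55)

NMB  <- sum(mod - obs) / sum(obs)        # normalised mean bias
NMGE <- sum(abs(mod - obs)) / sum(obs)   # normalised mean gross error
c(NMB = NMB, NMGE = NMGE)
```

Because both are divided by the total observed concentration, they are dimensionless and comparable across pollutants.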

Root mean squared error, RMSE

The RMSE is a commonly used statistic that provides a good overall measure of how close the modelled values are to the observed values.

\[ RMSE = \left({\frac{\sum\limits_{i=1}^{n} (M_i - O_i)^2}{n}}\right)^{1/2} \tag{18.6}\]
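In base R this is a one-liner (sketch with the same made-up vectors as above):

```r
obs <- c(10, 20, 30, 40, 50)
mod <- c(12, 18, 35, 38, 55)

RMSE <- sqrt(mean((mod - obs)^2))   # Equation 18.6
RMSE
```

Squaring the errors before averaging means RMSE penalises large individual errors more heavily than MGE does.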

Correlation coefficient, r

The (Pearson) correlation coefficient is a measure of the strength of the linear relationship between two variables. If there is a perfect linear relationship with positive slope between the two variables, \(r\) = 1. If there is a perfect linear relationship with negative slope between the two variables, \(r\) = -1. A correlation coefficient of 0 means that there is no linear relationship between the variables. Note that modStats accepts an option method, which can be set to “kendall” or “spearman” for alternative calculations of \(r\).

\[ r = \frac{1}{(n-1)}\sum\limits_{i=1}^{n}\left(\frac{M_i - \overline{M}}{\sigma_M}\right)\left(\frac{O_i - \overline{O}}{\sigma_O}\right) \tag{18.7}\]
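Equation 18.7 can be written out directly and compared with R's built-in cor function; note that sd() uses the \(n - 1\) denominator, which matches the \(1/(n-1)\) factor in the formula:

```r
obs <- c(10, 20, 30, 40, 50)   # hypothetical data, as above
mod <- c(12, 18, 35, 38, 55)
n <- length(obs)

## Equation 18.7 written out term by term
r_manual <- sum((mod - mean(mod)) / sd(mod) *
                (obs - mean(obs)) / sd(obs)) / (n - 1)

all.equal(r_manual, cor(mod, obs))   # TRUE
```

The alternative methods mentioned above correspond to cor(mod, obs, method = "kendall") and method = "spearman".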

Coefficient of Efficiency, COE

The Coefficient of Efficiency is based on Legates and McCabe (2012) and Legates and McCabe Jr (1999). There have been many suggestions for measuring model performance over the years, but the \(COE\) is a simple formulation which is easy to interpret.

Legates, D. R., and G. J. McCabe. 2012. “A Refined Index of Model Performance: A Rejoinder.” International Journal of Climatology.
Legates, D. R., and G. J. McCabe Jr. 1999. “Evaluating the Use of ‘Goodness-of-Fit’ Measures in Hydrologic and Hydroclimatic Model Validation.” Water Resources Research 35 (1): 233–41.

A perfect model has a \(COE\) = 1. As noted by Legates and McCabe, although the \(COE\) has no lower bound, a value of \(COE\) = 0.0 has a fundamental meaning: it implies that the model is no better at predicting the observed values than the observed mean. Since such a model can explain no more of the variation in the observed values than the observed mean can, it has no predictive advantage.

For negative values of \(COE\), the model is less effective than the observed mean in predicting the variation in the observations.

\[ COE = 1.0 - \frac{\sum\limits_{i=1}^{n} |M_i -O_i|}{\sum\limits_{i=1}^{n} |O_i - \overline{O}|} \tag{18.8}\]
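A short base-R sketch makes the interpretation concrete: a reasonable model has \(COE\) well above zero, while a "model" that always predicts the observed mean gives \(COE\) exactly 0 (made-up data again, not openair's code):

```r
obs <- c(10, 20, 30, 40, 50)
mod <- c(12, 18, 35, 38, 55)

coe <- function(mod, obs) {
  1 - sum(abs(mod - obs)) / sum(abs(obs - mean(obs)))
}

coe(mod, obs)                       # a useful model: well above 0
coe(rep(mean(obs), length(obs)), obs)   # predicting the mean: exactly 0
```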

Index of Agreement, IOA

The Index of Agreement, \(IOA\), is commonly used in model evaluation (Willmott, Robeson, and Matsuura 2011). It spans between -1 and +1, with values approaching +1 representing better model performance. An \(IOA\) of 0.5, for example, indicates that the sum of the error-magnitudes is one half of the sum of the observed-deviation magnitudes. When \(IOA\) = 0.0, it signifies that the sum of the magnitudes of the errors and the sum of the observed-deviation magnitudes are equivalent. When \(IOA\) = -0.5, it indicates that the sum of the error-magnitudes is twice the sum of the perfect model-deviation and observed-deviation magnitudes. Values of \(IOA\) near -1.0 can mean that the model-estimated deviations about \(O\) are poor estimates of the observed deviations; but they can also mean that there simply is little observed variability, so some caution is needed when the \(IOA\) approaches -1. It is defined as follows (with \(c = 2\)):

Willmott, Cort J, Scott M Robeson, and Kenji Matsuura. 2011. “A Refined Index of Model Performance.” International Journal of Climatology.

\[ IOA = \left\{ \begin{array}{ll} 1.0 - \dfrac{\sum\limits_{i=1}^{n} |M_i - O_i|}{c\sum\limits_{i=1}^{n} |O_i - \overline{O}|}, & \text{when } \sum\limits_{i=1}^{n} |M_i - O_i| \le c\sum\limits_{i=1}^{n} |O_i - \overline{O}| \\[2ex] \dfrac{c\sum\limits_{i=1}^{n} |O_i - \overline{O}|}{\sum\limits_{i=1}^{n} |M_i - O_i|} - 1.0, & \text{when } \sum\limits_{i=1}^{n} |M_i - O_i| > c\sum\limits_{i=1}^{n} |O_i - \overline{O}| \end{array} \right. \tag{18.9}\]
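The piecewise definition is easy to express as a small base-R function (a sketch of Equation 18.9 with \(c = 2\), not openair's implementation; the data are made up):

```r
## Index of Agreement, Willmott et al. (2011) refined form
ioa <- function(mod, obs, cval = 2) {
  err <- sum(abs(mod - obs))                 # sum of error magnitudes
  dev <- cval * sum(abs(obs - mean(obs)))    # c * sum of observed deviations
  if (err <= dev) 1 - err / dev else dev / err - 1
}

obs <- c(10, 20, 30, 40, 50)
ioa(c(12, 18, 35, 38, 55), obs)   # close to the observations: near +1
ioa(obs + 1000, obs)              # grossly biased model: near -1
```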

18.2 Example of use

The function can be called very simply and only requires two numeric fields to compare. To show how the function works, some synthetic data will be generated for 5 models.

## observations; 100 random numbers
set.seed(10)
obs <- 100 * runif(100)
mod1 <- data.frame(obs, mod = obs + 10, model = "model 1")
mod2 <- data.frame(obs, mod = obs + 20 * rnorm(100), model = "model 2")
mod3 <- data.frame(obs, mod = obs - 10 * rnorm(100), model = "model 3")
mod4 <- data.frame(obs, mod = obs / 2 + 10 * rnorm(100), model = "model 4")
mod5 <- data.frame(obs, mod = obs * 1.5 + 3 * rnorm(100), model = "model 5")
modData <- rbind(mod1, mod2, mod3, mod4, mod5)
head(modData)
        obs      mod   model
1 50.747820 60.74782 model 1
2 30.676851 40.67685 model 1
3 42.690767 52.69077 model 1
4 69.310208 79.31021 model 1
5  8.513597 18.51360 model 1
6 22.543662 32.54366 model 1

We now have a data frame with observations and predictions for 5 models. The evaluation of the statistics is given by:

library(openair) # load package

modStats(modData, obs = "obs", mod = "mod", type = "model")
    model   n FAC2         MB       MGE         NMB      NMGE      RMSE
1 model 1 100 0.89  10.000000 10.000000  0.22455508 0.2245551 10.000000
2 model 2 100 0.79   0.922391 16.592192  0.02071276 0.3725861 19.318041
3 model 3 100 0.88   1.013647  7.887325  0.02276196 0.1771139  9.451354
4 model 4 100 0.56 -20.603684 21.860655 -0.46266619 0.4908921 25.758520
5 model 5 100 0.96  22.521685 22.569229  0.50573588 0.5068035 26.132866
          r          COE       IOA
1 1.0000000  0.538289594 0.7691448
2 0.8258202  0.233921216 0.6169606
3 0.9371237  0.635833987 0.8179170
4 0.8142735 -0.009329173 0.4953354
5 0.9963753 -0.042044773 0.4789776

It is possible to rank the statistics based on the \(COE\), which is a good general indicator of model performance.

modStats(modData, obs = "obs", mod = "mod", type = "model", 
         rank.name = "model")
    model   n FAC2         MB       MGE         NMB      NMGE      RMSE
3 model 3 100 0.88   1.013647  7.887325  0.02276196 0.1771139  9.451354
1 model 1 100 0.89  10.000000 10.000000  0.22455508 0.2245551 10.000000
2 model 2 100 0.79   0.922391 16.592192  0.02071276 0.3725861 19.318041
4 model 4 100 0.56 -20.603684 21.860655 -0.46266619 0.4908921 25.758520
5 model 5 100 0.96  22.521685 22.569229  0.50573588 0.5068035 26.132866
          r          COE       IOA
3 0.9371237  0.635833987 0.8179170
1 1.0000000  0.538289594 0.7691448
2 0.8258202  0.233921216 0.6169606
4 0.8142735 -0.009329173 0.4953354
5 0.9963753 -0.042044773 0.4789776

The modStats function is, however, much more flexible than indicated above. While it is useful to calculate model evaluation statistics in a straightforward way, it can be much more informative to consider the statistics split by different periods.

Data have been assembled from a Defra model evaluation exercise which consists of hourly O3 predictions at 15 receptor points around the UK for 2006. The aim here is not to identify a particular model that is ‘best’, and for this reason the models are simply referred to as model 1, model 2 and so on. We aim to make the data more widely available. The data set has this form:

load("book_data/modelData.RData")
head(modTest)
        site                date o3   mod   group
1 Aston.Hill 2006-01-01 00:00:00 NA    NA model 1
2 Aston.Hill 2006-01-01 01:00:00 74 65.28 model 1
3 Aston.Hill 2006-01-01 02:00:00 72 64.64 model 1
4 Aston.Hill 2006-01-01 03:00:00 72 64.46 model 1
5 Aston.Hill 2006-01-01 04:00:00 70 64.88 model 1
6 Aston.Hill 2006-01-01 05:00:00 66 65.80 model 1

There are columns representing the receptor location (site), the date, measured values (o3), model predictions (mod) and the model itself (group). There are numerous ways in which the statistics can be calculated. However, of interest here is how the models perform at a single receptor by season. The seasonal nature of O3 is a very important characteristic and it is worth considering in more detail. The statistics are easy enough to calculate as shown below. First, a subset of the data is selected to consider only the Harwell site. Second, the type option is used to split the calculations by season and model. Finally, the statistics are ranked by the \(IOA\) for each season. It is now very easy to see how model performance changes by season and which models perform best in each season.

options(digits = 2) ## don't display too many decimal places
modStats(subset(modTest, site == "Harwell"), 
         obs = "o3", mod = "mod",
         type = c("season", "group"), rank.name = "group")
# A tibble: 36 × 12
# Groups:   season [4]
   season      group     n  FAC2     MB   MGE     NMB  NMGE  RMSE     r      COE
   <ord>       <fct> <int> <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl> <dbl>    <dbl>
 1 spring (MA… mode…  1905 0.875   8.58  16.5  0.137  0.263  21.9 0.576  0.0747 
 2 spring (MA… mode…  1905 0.876  -2.04  16.8 -0.0325 0.268  21.8 0.452  0.0584 
 3 spring (MA… mode…  1905 0.872  11.9   17.7  0.190  0.282  23.7 0.519  0.00818
 4 spring (MA… mode…  1905 0.885   4.98  18.1  0.0793 0.289  23.9 0.361 -0.0151 
 5 spring (MA… mode…  1905 0.870  10.8   19.5  0.172  0.311  24.0 0.588 -0.0940 
 6 spring (MA… mode…  1825 0.821   7.79  20.0  0.123  0.317  25.8 0.480 -0.118  
 7 spring (MA… mode…  1905 0.786 -13.7   21.4 -0.218  0.340  26.1 0.531 -0.195  
 8 spring (MA… mode…  1905 0.732 -13.4   24.3 -0.213  0.388  29.8 0.312 -0.362  
 9 spring (MA… mode…  1905 0.743   1.01  24.9  0.0161 0.396  32.8 0.264 -0.392  
10 summer (JJ… mode…  2002 0.918   3.83  15.5  0.0632 0.256  20.9 0.750  0.317  
# … with 26 more rows, and 1 more variable: IOA <dbl>

Note that it is possible to read the results of the modStats function into a data frame, which then allows the results to be plotted. This is generally a good idea when there is a lot of numeric data to consider and plots will convey the information better.
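As a minimal sketch of that idea, the statistics can be held in an ordinary data frame and plotted with base R. The RMSE values below are copied from the earlier modStats output for the five synthetic models, so this runs without openair:

```r
## RMSE per model, taken from the modStats output shown earlier
stats <- data.frame(model = paste("model", 1:5),
                    RMSE  = c(10.0, 19.3, 9.5, 25.8, 26.1))
stats <- stats[order(stats$RMSE), ]   # best-performing model first

barplot(stats$RMSE, names.arg = stats$model, ylab = "RMSE", las = 2)
```

Even this simple bar chart makes the ranking (model 3 best, model 5 worst) far quicker to absorb than the table of numbers.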

The modStats function can also be used in lots of other interesting ways. The type option in particular makes it possible to split the statistics in numerous ways. For example, to summarise the performance of models by site, model and day of the week:

modStats(modTest, obs = "o3", mod = "mod",
         type = c("site", "weekday", "group"), 
         rank.name = "group")

Similarly, if other data are available e.g. meteorological data or other pollutant species then these variables can also be used to test models against ranges in their values. This capability is potentially very useful because it allows for a much more probing analysis into model evaluation. For example, with wind speed and direction it is easy to consider how model performance varies by wind speed intervals or wind sectors, both of which could reveal important performance characteristics.
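As a hedged sketch of this idea using synthetic data and base R only (the wd column here is made up, and this is not how openair does it internally), model error can be summarised by wind sector:

```r
set.seed(42)
n <- 500
df <- data.frame(obs = 100 * runif(n),   # synthetic observations
                 wd  = 360 * runif(n))   # hypothetical wind direction
df$mod <- df$obs + 8 * rnorm(n)          # synthetic model with random error

## assign each record to one of four 90-degree wind sectors
df$sector <- cut(df$wd, breaks = seq(0, 360, by = 90),
                 labels = c("NE", "SE", "SW", "NW"),
                 include.lowest = TRUE)

## RMSE within each sector
rmse <- function(m, o) sqrt(mean((m - o)^2))
sector_rmse <- sapply(split(df, df$sector),
                      function(d) rmse(d$mod, d$obs))
round(sector_rmse, 2)
```

With real data in openair itself, a similar split can be obtained by passing a suitable variable name (such as "wd" or "ws") to the type option of modStats, assuming that column is present in the data.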