3 Model Accuracy
3.1 Introduction
In machine learning there is a strong emphasis on the predictive ability of a model. We will see several measures of performance for different methods, but one commonly used measure (in particular when \(Y\) is continuous) is the mean squared error (MSE).
The MSE is defined as:
\[ MSE = E \big[ (\hat{Y} - Y)^2 \big] \]
where \(\hat{Y} = \hat f(\mathbf{X})\).
We can estimate the MSE using the same data (training data) that we have used to obtain \(\hat f(\mathbf{X})\) (the training MSE). If we have \(n\) observations,
\[ y= \begin{pmatrix} y_1 \\ y_2 \\ \vdots\\ y_n \end{pmatrix} \]
and
\[ \mathbf{x}= \begin{pmatrix} x_{11} & \dots & x_{p1} \\ x_{12} & \dots & x_{p2} \\ \vdots & \vdots & \vdots\\ x_{1n} & \dots & x_{pn} \end{pmatrix} \]
the estimate of MSE based on the data would then be:
\[ MSE = \frac{1}{n} \sum_{i=1}^n \big(y_i - \hat f(x_{1i},\dots, x_{pi})\big)^2 \]
However, this MSE tends to overestimate the true predictive ability, given that the model is optimised for the training data. Ideally, we would evaluate the performance of the model on an independent dataset (test data) with \(y^{new}\) and \(\mathbf{x}^{new}\).
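As a quick, hedged illustration (the simulated data and the object names below are ours, not part of the course material), we can compare the training and test MSE of a simple linear model in R:

set.seed(1)
n   <- 200
dat <- data.frame(x = runif(n))
dat$y <- 2 + 3 * dat$x + rnorm(n, sd = 0.5)    # simulate a simple linear relationship

train <- sample(n, n / 2)                      # indices of the training observations
fit   <- lm(y ~ x, data = dat[train, ])        # fit the model on the training data only

mse_train <- mean((dat$y[train]  - predict(fit, dat[train, ]))^2)    # training MSE
mse_test  <- mean((dat$y[-train] - predict(fit, dat[-train, ]))^2)   # test MSE
c(train = mse_train, test = mse_test)          # the training MSE is typically the smaller one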
One important concept associated with the MSE, which we will return to in the coming modules, is the bias-variance tradeoff. Treating \(Y\) as fixed, the MSE can be decomposed into bias and variance:
\[\begin{aligned} MSE &= \mathrm{E} \big[ (\hat{Y} - Y)^2 \big] \\ &= \mathrm{E} \big(\hat{Y}^2\big) + \mathrm{E}\big(Y^2\big) - 2\mathrm{E}(\hat{Y}Y) \\ &= \mathrm{E} \big(\hat{Y}^2\big) + Y^2 - 2Y\mathrm{E}(\hat{Y}) + \mathrm{E}^2(\hat{Y}) - \mathrm{E}^2(\hat{Y})\\ &= \underbrace{ \big[\mathrm{E}(\hat{Y}) - Y \big]^2}_{bias^2} + \underbrace{\mathrm{E} \big(\hat{Y}^2\big) - \mathrm{E}^2(\hat{Y})}_{var(\hat{Y})} \end{aligned}\]
If we use a method that produces unbiased predictions of \(Y\), such as ordinary least squares (OLS) for linear regression, then \(\mathrm{E}(\hat{Y}) - Y = 0\) and the MSE simplifies to \(var(\hat{Y})\).
We can see from the decomposition that an unbiased estimate (prediction) of \(Y\) does not necessarily lead to the lowest possible MSE. Once again, OLS is a good example of this. Under high collinearity of the predictors \(\mathbf x\), the OLS estimates become quite “unstable”, or in other words, they have a high variance. In this situation, it may be better to choose a different method that introduces some bias but has a much lower variance, resulting in a lower MSE. This is the case for the ridge estimator, an alternative to OLS when the latter has high variance (we will talk about ridge regression in module 4).
Several of the methods that we will discuss use this principle of accepting some bias in exchange for a lower variance.
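As a hedged illustration of this tradeoff (the simulation, the value lambda = 5 and the use of MASS::lm.ridge() are our own choices, not part of the course material), we can compare the variability of OLS and ridge estimates of a coefficient under strong collinearity:

library(MASS)                                  # provides lm.ridge()

set.seed(2)
sim_once <- function() {
  n  <- 50
  x1 <- rnorm(n)
  x2 <- x1 + rnorm(n, sd = 0.05)               # x2 is almost a copy of x1: strong collinearity
  y  <- 1 + x1 + x2 + rnorm(n)
  c(ols   = coef(lm(y ~ x1 + x2))["x1"],       # OLS estimate of the coefficient of x1
    ridge = coef(lm.ridge(y ~ x1 + x2, lambda = 5))["x1"])
}

est <- replicate(1000, sim_once())
apply(est, 1, mean)   # ridge is slightly biased (the true coefficient is 1); OLS is unbiased
apply(est, 1, var)    # but the ridge estimates vary far less than the OLS estimates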
3.3 R review
3.3.1 Task 1 - Using libraries
Currently, there are more than 16,000 packages (also called libraries) available on CRAN (the Comprehensive R Archive Network). Packages are the fundamental units of reproducible R code; they include reusable R functions, the documentation that describes how to use them, and sample data.
We will install and use a library that produces tables similar to the ones used for publication. This library is called tableone.
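The installation chunk is not shown in this extract; a minimal version would be:

install.packages("tableone")   # only needed once
library(tableone)              # load the package for this session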
Read the bmd.csv dataset into R and use the function CreateTableOne() from the package tableone to describe the variables age, bmd and sex.
bmd.data <- read.csv("https://www.dropbox.com/s/7wjsfdaf0wt2kg2/bmd.csv?dl=1")   # read the data directly from the URL
CreateTableOne(c("age", "bmd", "sex"), data = bmd.data)                          # overall descriptive table
##
## Overall
## n 169
## age (mean (SD)) 63.63 (12.36)
## bmd (mean (SD)) 0.78 (0.17)
## sex = M (%) 86 (50.9)
Let’s repeat the table above, but now stratified by fracture status (variable fracture):
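The code chunk for this call is not shown in this extract; one way to obtain the stratified table, using the strata argument of CreateTableOne(), would be:

CreateTableOne(c("age", "bmd", "sex"), strata = "fracture", data = bmd.data)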
## Stratified by fracture
## fracture no fracture p test
## n 50 119
## age (mean (SD)) 69.77 (13.38) 61.05 (10.97) <0.001
## bmd (mean (SD)) 0.62 (0.10) 0.85 (0.14) <0.001
## sex = M (%) 25 (50.0) 61 (51.3) 1.000
3.3.2 Task 2 - Using ggplot
ggplot2 is a powerful library that implements a “grammar of graphics” developed by Wilkinson in 1999.
There are seven grammatical elements:
* Data - The dataset
* Aesthetics - How the variables in the data are mapped to visual properties (aesthetics) of geoms
* Geometries - The visual element used for plotting the data
* Statistics - Representation of the data to help understand relationships
* Facets - Split into multiple plots
* Coordinates - Systems of coordinates
* Themes - Color schemes, font sizes, …
The combination of these elements, following certain rules, produces the plot.
Suppose we want to produce a scatterplot of bmd against age.
First, we install and load the library:
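As before, the installation chunk is not shown; a minimal version:

install.packages("ggplot2")   # only needed once
library(ggplot2)              # load the package for this session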
We start by defining the data and aesthetics:
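The plotting chunks are not shown in this extract; the sketches below assume the bmd.data object read in Task 1.

p <- ggplot(bmd.data, aes(x = age, y = bmd))   # data + aesthetics
p   # nothing is drawn yet: no geometry has been added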
Now, we add the geometry (in this case, points):
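p + geom_point()   # scatterplot of bmd against age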
Next, let’s add a smooth line (statistics):
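p + geom_point() + geom_smooth()   # geom_smooth() adds a smoothed conditional mean (loess by default here)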
Then we split by sex (facets):
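p + geom_point() + geom_smooth() + facet_wrap(~ sex)   # one panel per level of sex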
Finally, we change the theme:
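Any of the built-in themes could be used here; theme_minimal() is just one option:

p + geom_point() + geom_smooth() + facet_wrap(~ sex) + theme_minimal()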
3.3.3 Task 3 - Writing a function
Write a function that computes the Body Mass Index = weight (kg) / height (m)\(^2\), using weight and height as arguments.
bmi.func <- function(W, H){   # W = weight in kg, H = height in m
  bmi <- W/H^2
  bmi <- round(bmi, 1)        # round to 1 decimal place
  return(paste("The BMI is", bmi))
}
What is the BMI of an individual who is 1.83m tall and weighs 89kg?
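The call that produces the output below is not shown; with the function above it would be:

bmi.func(W = 89, H = 1.83)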
## [1] "The BMI is 26.6"