# Introduction

## What is Machine Learning?

Is Machine Learning (ML) the same as Statistics? Is it Data Science? Is it its own thing? Is it just a buzzword that is good for employment? Just what is Machine Learning?

Well, I don’t think there’s a single concrete definition of ML, but let’s have a look at several different ones…

Wikipedia used to state the following: “Machine Learning is the study of computer algorithms that can improve automatically through experience and by use of data”, but now states: “Machine learning (ML) is a field of inquiry devoted to understanding and building methods that ‘learn’, that is, methods that leverage data to improve performance on some set of tasks”. Statistical Learning (whatever that is) also redirects to the same page, and given that it has Statistic(s) in the title, this must mean the two are interlinked!

Another definition: “Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy” (Hurwitz and Kirsch 2018).

And one more: “Machine learning is the science of getting computers to act without being explicitly programmed”, which appears in quite a few places online, so I am not entirely sure of its origin.

Well, it seems that the general consensus is that ML is about getting the machine (most likely a computer in some sense) to learn as much as possible, with the least human interaction possible. This may be an admirable aim; however, I would say that there are elements of human interaction and interpretation of results that machines cannot perform. In my opinion, Statistics, Data Science and Machine Learning must be performed by machines and humans together, with the humans bringing an analytical mind as well as the know-how to use the machines.

With regards to this course specifically, in my opinion, my half of the course could be seen as Statistics, Data Science or ML, whereas Dr. Hailiang Du’s half will cover some topics that I more classically associate with ML. I will leave you to decide for yourself about these various names for our science as the course progresses! What I do know is that we are going to cover some important techniques if you want to get into Data Science!

## Motivation

### Recall: Linear Models

Consider the standard linear model: $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_p x_p +\epsilon.$ Here, $$y$$ is the response variable, the $$x$$’s are the predictor variables, and $$\epsilon$$ is an error term.
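To make the model concrete, we can simulate data from it. Here is a minimal sketch in Python (the coefficient values, sample size and noise level are made up purely for illustration; the course itself uses R, so treat this as a sketch only):

```python
import numpy as np

rng = np.random.default_rng(42)

n, p = 100, 3                            # training sample size and number of predictors
beta = np.array([2.0, 0.5, -1.0, 3.0])   # beta_0, beta_1, ..., beta_p (illustrative values)

X = rng.normal(size=(n, p))              # predictor variables x_1, ..., x_p
eps = rng.normal(scale=0.5, size=n)      # error term epsilon

# y = beta_0 + beta_1*x_1 + ... + beta_p*x_p + epsilon
y = beta[0] + X @ beta[1:] + eps
```

Each row of `X` holds one observation of the $$p$$ predictors, and `y` holds the corresponding $$n$$ responses.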

Linear models:

• Offer significant inferential tools.
• Can be very competitive (despite their simplicity) in comparison to more complicated models in terms of prediction.
• Unlike many non-linear models, they are also easy to interpret.

### Recall: Least Squares

We fit or train linear models by minimising the Residual Sum of Squares (RSS)
$RSS = {\sum_{i=1}^{n} \Bigg( y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigg)^2},$ where $$n$$ is the training sample size and the minimisation is with respect to the regression coefficients $$\beta_0,\, \beta_1,\, \beta_2,\, \ldots,\, \beta_p$$.
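To make the least-squares criterion concrete, here is a short Python sketch (illustrative only, on simulated data with made-up coefficients) that computes the RSS-minimising coefficients and checks that moving away from them can only increase the RSS:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Design matrix with a leading column of ones for the intercept beta_0
D = np.column_stack([np.ones(n), X])

# Least-squares estimates minimise RSS = sum_i (y_i - beta_0 - sum_j beta_j x_ij)^2
beta_hat, *_ = np.linalg.lstsq(D, y, rcond=None)

def rss(beta):
    return np.sum((y - D @ beta) ** 2)

# beta_hat is the global minimiser, so any perturbation gives a larger RSS
print(rss(beta_hat) <= rss(beta_hat + 0.1))   # prints True
```

Because the RSS is a convex function of the coefficients, the least-squares solution is the unique global minimiser here, which is what the final check illustrates.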

This gives us the best possible fit to the training data in the least-squares sense, and we can use the fitted model to perform inference.

Importantly, we can also use the estimated coefficients, which we denote as $$\hat{\beta}_j$$ for $$j=0,1,\ldots,p$$, for prediction. Given a new set of observations $$x_1^*,\, x_2^*,\, \ldots,\, x_p^*$$, our prediction will be $\hat{y}^* = \hat{\beta}_0+\hat{\beta}_1x_1^* + \hat{\beta}_2x_2^*+ \ldots + \hat{\beta}_p x_p^*.$
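As a tiny worked example of the prediction formula (the coefficient estimates and new observations below are hypothetical, chosen just to show the arithmetic):

```python
import numpy as np

# Hypothetical estimated coefficients: beta_hat_0, beta_hat_1, beta_hat_2
beta_hat = np.array([1.2, 0.8, -0.5])

# A new set of observations: x_1*, x_2*
x_star = np.array([2.0, 1.0])

# y_hat* = beta_hat_0 + beta_hat_1 * x_1* + beta_hat_2 * x_2*
y_hat_star = beta_hat[0] + beta_hat[1:] @ x_star
print(y_hat_star)   # 1.2 + 0.8*2.0 + (-0.5)*1.0 = 2.3
```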

### Issues

The model that considers all $$p$$ available predictor variables is commonly referred to as the full model.

On many occasions it is preferable to select a subset of predictor variables or consider extensions of the least squares (LS) solution of the full model, because of two important issues:

1. Predictive accuracy.
2. Model interpretability.

### In This Course…

In the first two weeks of this course, we look into strategies for addressing these issues…

• Part I - Variable Subset Selection - statistical algorithmic procedures which perform a model search and pick the subset of predictors most related to the response (discarding the remaining predictors as irrelevant).

• Part II - Shrinkage - Modern regression methods which impose constraints that shrink the regression coefficients of the full model towards zero.

• Part III - Transformation - Extensions to the regression model when the best straight line doesn’t quite work!

### Some Remarks

• Research on model-search approaches is limited nowadays because, as we will see, such strategies do not scale (problems quickly start to arise even when the number of predictors exceeds 10).
• In the era of “Big Data”, certain applications can have $$p$$ in the order of thousands or even millions; consider e.g. biomedical applications where the purpose is to associate a specific disease/phenotype with genes.
• Penalised regression methods, despite not being all that new (e.g. ridge (Hoerl and Kennard 1976) and lasso (Tibshirani 1996)), are extremely scalable. Many papers on variations/extensions of these methods are still being published to this day!

## A Note on These Notes

Since this is a Masters course, you are encouraged to read widely, expanding your knowledge beyond the notes contained here. That’s because that is precisely what these are - Notes! They are written to complement the lectures and give you a flavour of the topics covered in this course. They are not meant to be a comprehensive textbook, covering every detail that will be necessary to become a well-trained Statistician, Data Scientist or Machine Learner - that requires reading around! …and practice!

### Key References

The key references are the following:

James et al. (2013) - An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. This can be downloaded for free online.

Faraway (2009) - Linear Models with R, by Julian Faraway. A digital copy of this book is available on the course Ultra page, in the course details section.

You can explore the references contained within these books (and these lecture notes) for further information, and take yourselves on a journey of exploration.

### Some Acronyms

A reference to some of the acronyms used during the course:

• CV - Cross-Validation
• GAM - Generalised Additive Model
• LS - Least Squares
• ML - Machine Learning
• MSE - Mean Squared Error
• PC - Principal Component
• PCA - Principal Component Analysis
• PCR - Principal Component Regression
• RSS - Residual Sum of Squares
• TSS - Total Sum of Squares

### References

Faraway, J. J. 2009. Linear Models with R. Edited by F. Dominici, J. J. Faraway, M. Tanner, and J. Zidek. London: CRC Press.
Hoerl, Arthur E., and Robert W. Kennard. 1976. “Ridge Regression Iterative Estimation of the Biasing Parameter.” Communications in Statistics - Theory and Methods 5 (1): 77–88. https://doi.org/10.1080/03610927608827333.
Hurwitz, J., and D. Kirsch. 2018. Machine Learning for Dummies. New Jersey: John Wiley & Sons.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2013. An Introduction to Statistical Learning. Edited by G. Casella, S. Fienberg, and I. Olkin. New York: Springer.
Tibshirani, R. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.