How does machine learning work?
Machine learning is about more than just applying advanced mathematical and statistical algorithms to large data sets. A great deal of thought and care also needs to go into the choice and pre-processing of the data to be used by the selected machine learning algorithm.
Machine learning is also subjective, to a certain extent - as we will see, after running a machine learning algorithm there may be several competing results from which to choose, and your final choice may differ depending on the context.
2.1 Example Task: Predicting students’ exam scores
Let’s introduce some of the key terminology used in machine learning, via the following example task.
Suppose we would like to predict future students’ exam scores for a La Trobe University statistics subject, using data from the current cohort.
In order to complete this task, suppose we begin by collecting data from current students such as:
- a. The average amount of time they spend studying each day
- b. Their marks for a previous statistics subject
- c. Their sleep patterns
- d. Their diet
- e. Their exercise routines
Naturally, each student will have a different set of responses to these five topics (though some responses may coincide).
Each topic here can be thought of as a variable. When asked about these topics, the students’ responses will be the observed values for each variable. At the end of the semester, we will also record the students’ exam scores.
2.1.1 Feature and Outcome Variables
When conducting machine learning, we typically deal with two types of variables: feature and outcome variables. In simple terms, we use feature variables to model or predict our outcome variable(s).
Variables can signify different ‘characteristics’, ‘attributes’ or ‘features’ of the phenomenon of interest, and can be quantitative (continuous or discrete) or qualitative (categorical).
The feature variables are used to fit an ML model, with the aim of modelling or predicting the outcome variable. They are not the reason for conducting the ML process, but are vital to its success - if our feature variables are poorly chosen or defined, then our resultant model may not be very accurate.
The outcome variable is the variable in which we are predominantly interested, and motivates the ML process.
For our students example, the five topics listed above from a-e act as our feature variables, while the exam scores variable is our outcome variable.
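To make the feature/outcome distinction concrete, here is a rough Python sketch of how the student data might be organised. All variable names and values are invented for illustration; the actual data set would come from the survey described above.

```python
# Hypothetical training records: each student's feature values plus
# the observed outcome (letter grade). All numbers are invented.
students = [
    {"study_hours": 2.5, "prev_mark": 74, "sleep_hours": 7.0,
     "diet": "balanced", "exercise": "regular", "exam_grade": "B"},
    {"study_hours": 1.0, "prev_mark": 58, "sleep_hours": 5.5,
     "diet": "irregular", "exercise": "rare", "exam_grade": "C"},
]

feature_names = ["study_hours", "prev_mark", "sleep_hours", "diet", "exercise"]
outcome_name = "exam_grade"

# Separate the feature variables (model inputs) from the outcome
# variable (the quantity we want to predict) for each record.
X = [{k: s[k] for k in feature_names} for s in students]
y = [s[outcome_name] for s in students]
```

The split into `X` (features) and `y` (outcome) mirrors how most ML software expects the data to be supplied.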
2.2 Problem Class
Different machine learning models and methods can be applied to different types of tasks or problems. Broadly speaking, there are two main categories of problem class, namely classification problems and regression problems.
For our student exam scores example, our outcome variable takes values from a discrete set - i.e. the letter grades A, B, C, etc. Therefore, we would refer to this as an example of a classification problem. Since we have more than two classes (i.e. A, B, C, etc.), it is a multi-class classification problem.
If we simplified the exam scores results to pass or fail, our problem would become a binary classification problem, since there would now be only two possible classes (Pass/Fail) for our outcome variable.
We could also change the focus of our problem, and instead aim to predict a student’s overall numeric exam mark (out of 100), rather than their letter grade. This would then change our problem to being a regression problem, since we would be dealing with output values that were continuous rather than discrete.
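The three framings of the same task can be sketched as follows. The grade cut-offs below are assumptions for illustration only, not La Trobe's actual marking scheme.

```python
def to_letter_grade(mark):
    """Map a numeric exam mark (0-100) to a letter grade.
    The cut-offs here are illustrative assumptions."""
    if mark >= 80:
        return "A"
    if mark >= 70:
        return "B"
    if mark >= 60:
        return "C"
    if mark >= 50:
        return "D"
    return "F"

def to_pass_fail(mark):
    """Binary outcome: Pass (mark >= 50) or Fail."""
    return "Pass" if mark >= 50 else "Fail"

# The same raw mark gives three different outcome encodings:
mark = 73                      # regression target (continuous mark out of 100)
print(to_letter_grade(mark))   # multi-class target -> prints "B"
print(to_pass_fail(mark))      # binary target -> prints "Pass"
```

Note that the raw mark itself is the regression target, while the two functions derive the multi-class and binary classification targets from it.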
2.3 Supervised versus Unsupervised Learning
There are two main types of machine learning:
- Supervised learning
- Unsupervised learning
So far in our student exam scores example, we have considered using labelled data - i.e. data for which the outcome value of each observation is known, and we have a clear outcome variable. This is an example of supervised machine learning, where we train our machine learning algorithm to accurately classify or predict outcomes.
If we had a new data set which consisted solely of feature variables, and we had no clear outcome variable in mind, but rather wanted to search for hidden patterns, then this would be an example of unsupervised machine learning.
Problem classes for unsupervised learning include clustering problems - using clustering techniques such as k-means clustering, just like we looked at in Computer Lab 7B - as well as dimensionality reduction and association analysis.
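As a reminder of the idea from Computer Lab 7B, the sketch below runs a minimal k-means on unlabelled one-dimensional data (hypothetical daily study hours, with no outcome variable attached). It is a bare-bones illustration of the algorithm, not the RStudio implementation used in the labs.

```python
import random

def kmeans_1d(points, k=2, iters=20, seed=0):
    """Minimal k-means on 1-D data: alternately assign each point to its
    nearest centre, then move each centre to the mean of its points."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)       # pick k distinct starting centres
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # A centre with no points keeps its old position.
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return sorted(centres)

# Unlabelled study-time data (hours/day): the algorithm finds the hidden
# grouping itself - no outcome variable is involved.
hours = [0.5, 1.0, 1.2, 3.8, 4.0, 4.5]
print(kmeans_1d(hours))  # two cluster centres, roughly 0.9 and 4.1
```

Contrast this with the supervised setting above: here nothing tells the algorithm what the "right" groups are.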
In STM1001 we restrict our attention to supervised machine learning.
2.4 Training Phase
As we noted initially, the mathematical and statistical algorithms are not actually the most important aspect of machine learning. “In fact, the clever part of machine learning is in the training phase”, where we provide the data to be used to train our learning algorithm. We can think of this ‘training’ process as being similar to the way in which humans develop experience through exposure to stimuli.
Continuing our example, suppose the students surveyed now sit the exam for the statistics subject, and we record their results. Collectively, the feature variables’ values and the outcome variable’s values we have collected will form our training phase data set.
Note: While we want here to predict exam scores, we initially require some observed values for this outcome variable, in order to know whether or not our ML model will produce accurate predictions when presented with new data.
2.5 Pre-Processing
Before we begin training our ML model, it is important to carefully pre-process our data. This can involve (amongst other things) converting variables into more appropriate formats, checking for samples and variables that could have excessive influence on the model, and checking for variables that are highly correlated with other variables. Any samples or variables identified as being potentially problematic may then be removed as part of the pre-processing stage, depending on the machine learning model employed.
Note: Details on how to perform these pre-processing steps in RStudio are covered later in this document.
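One of the checks mentioned above - flagging pairs of highly correlated feature variables - can be sketched in a few lines. The Python below is illustrative only (the subject's own workflow uses RStudio), and the feature values and the 0.9 threshold are assumptions; in practice the threshold is a judgment call.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical values for two feature variables across five students.
study_hours = [1.0, 2.0, 2.5, 3.0, 4.0]
prev_marks  = [55, 62, 66, 71, 80]

r = pearson_r(study_hours, prev_marks)
if abs(r) > 0.9:  # illustrative cut-off, not a fixed rule
    print("highly correlated - consider dropping one of the two features")
```

Highly correlated features carry largely redundant information, which is why some models benefit from removing one of the pair during pre-processing.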
2.6 Training and Validation data
Once our pre-processing is complete, we typically split our data in two - the larger portion will be our training data, with the remainder to be used later as our validation data.
- Training data is used to train our ML model
- Validation data is used to check the predictive accuracy of our trained ML model
We provide our machine with the training data, and use the chosen ML algorithm to ‘learn’ parameter values (i.e. learn the relationship between the feature and outcome variables). Once this training is complete, we can then provide the machine with our validation data, to check if it produces an accurate prediction.
Generally speaking, the more (good quality) data used to train the ML model, the more accurate the ML model - therefore when we split our data into training and validation sets, the training data set will be much larger than the validation data set.
Continuing our student exam scores example, suppose we collected data from 500 students. Suppose we then randomly select 400 of these, and use their recorded information to train our ML model. We then verify the quality of our ML model using the information for the remaining 100 students (which up until this point the machine has not seen).
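The 400/100 split described above can be sketched as a simple random partition. This Python snippet is illustrative (the subject uses RStudio); the records are stand-in IDs rather than real student data.

```python
import random

def train_val_split(records, n_train, seed=42):
    """Randomly partition records into a training set and a validation set."""
    rng = random.Random(seed)
    shuffled = records[:]      # copy, so the original list is untouched
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# 500 hypothetical student records (represented here by IDs), split 400/100.
students = list(range(500))
train, val = train_val_split(students, n_train=400)
print(len(train), len(val))        # prints: 400 100
assert set(train).isdisjoint(val)  # no student appears in both sets
```

Shuffling before splitting matters: taking the first 400 rows as they happen to be ordered could introduce bias (for example, if the data were sorted by enrolment date).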
If our ML model produces accurate results using the validation data, that’s great! We can now feel confident in using our ML model to predict the exam scores of students in future years, by surveying them at the start of semester and recording their responses for the questions relating to the feature variables defined above.
2.7 Important Notes
It is important to note that an ML model should not be accurate solely for the training data, but also when provided with new data - after all, when dealing with supervised learning, we are interested in accurate predictions for the future.
As Grant and Wischik (2020) also remind us, an ML model can only learn from the data it is given.
Hence we need to carefully ensure the quality of the training data we provide to our ML model, since this data will be used by the model to determine what is ‘true’.
Simply put: garbage in, garbage out.