How does machine learning work?

Machine learning is about more than just applying advanced mathematical and statistical algorithms to large data sets. A great deal of thought and care also needs to go into the choice and pre-processing of the data to be used by the machine learning algorithm. Machine learning is also, to a certain extent, subjective - as we will see, after running a machine learning algorithm, there may be several competing results from which to choose, and the final choice may differ depending on the context.

Let’s introduce some of the key terminology used in machine learning, via the following example:

Suppose we would like to predict future students’ final grades for a La Trobe University statistics subject.

2.1 Feature Variables

To help us with this, we collect data from current students such as:

    1. The average amount of time they spend studying each day
    2. Their marks for a previous statistics subject
    3. Their sleep patterns
    4. Their diet
    5. Their exercise routines

Each student will have a different set of responses to these questions (with perhaps some matches). Each question here can be thought of as a variable, with the students’ responses being the observed values for that variable. Because the variables signify different ‘characteristics’, ‘attributes’ or ‘features’ of the students, in machine learning terminology these variables are referred to as feature variables5. They are not the reason for conducting the machine learning process, but are vital to its success - if our feature variables are poorly defined or chosen, then our resultant model may not be very accurate.
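A small sketch of how such data might be organised, with each feature variable as a column and each student's responses as a row. (Python and the column names are used purely for illustration here; the subject's own examples use R, and all values are invented.)

```python
import pandas as pd

# Hypothetical survey responses from three students.
# Each column is a feature variable; each row is one student's observed values.
students = pd.DataFrame({
    "study_hours_per_day":   [2.0, 0.5, 3.5],
    "previous_stats_mark":   [74, 55, 88],
    "sleep_hours_per_night": [7.5, 6.0, 8.0],
})

print(students.shape)  # (3, 3): three students, three feature variables
```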

2.2 Outcome Variables

It follows that the variable we would like to predict here, i.e. the final grade, is our outcome variable.

Continuing our example, suppose the students surveyed now sit the exam for the statistics subject, and we record their results. Collectively, the feature variables’ values and the outcome variable’s values will form our data set.

2.3 Problem Class

Different machine learning models and methods can be applied to different types of problems.

For our student grades example, our outcome variable takes values from a discrete set - the subject grades A, B, C and so on. Therefore, we would refer to this as an example of a classification problem6. Since there are more than two possible grades, it is a multi-class classification problem. If we reduced the outcome variable to pass or fail, the problem would become a binary classification problem, since there are now only two possible classes.

Suppose we decided to change the focus of our problem, and aimed instead to predict a student’s overall mark (out of 100) for the subject, rather than their grade. This would change our problem to being a regression problem, since we are now dealing with output values that are continuous rather than discrete.
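The three framings above can be sketched as follows. (Python for illustration only; the marks and grade boundaries below are invented for the example, not La Trobe's actual cut-offs.)

```python
# Hypothetical overall marks (out of 100) for five students.
marks = [92, 67, 45, 78, 51]

# Regression: predict the continuous mark itself.
regression_targets = marks

# Multi-class classification: map each mark to one of several grades
# (illustrative boundaries only).
def to_grade(mark):
    if mark >= 80: return "A"
    if mark >= 70: return "B"
    if mark >= 60: return "C"
    if mark >= 50: return "D"
    return "N"

grades = [to_grade(m) for m in marks]  # ['A', 'C', 'N', 'B', 'D']

# Binary classification: reduce the outcome to pass (mark >= 50) or fail.
pass_fail = ["pass" if m >= 50 else "fail" for m in marks]
```

Note that the underlying data is the same in all three cases; only the definition of the outcome variable changes the problem class.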

2.4 Supervised versus Unsupervised Learning

There are two main types of machine learning:

  • Supervised learning
  • Unsupervised learning

So far in our student grades example, we have considered labelled data - that is, data for which the value of a clearly defined outcome variable is recorded for each observation. This is an example of supervised machine learning, where we train our machine learning algorithm to accurately classify or predict outcomes7.

If we had a new data set which consisted solely of feature variables, and we had no clear outcome variable in mind, but rather wanted to search for hidden patterns, then this would be an example of unsupervised machine learning.

Problem classes for unsupervised learning include clustering problems8 - using clustering techniques such as k-means clustering, just like we looked at in Computer Lab 6B - as well as dimensionality reduction and association analysis.
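As a minimal sketch of the clustering idea (in Python with scikit-learn here, whereas Computer Lab 6B used R; the data is invented): we supply only feature variables, and the algorithm looks for groups of similar students.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical feature-only data: (study hours per day, previous mark),
# with no outcome variable attached.
X = np.array([
    [0.5, 50], [1.0, 55], [0.8, 52],   # one apparent group of students
    [3.0, 85], [3.5, 90], [2.8, 88],   # another
])

# k-means searches for k clusters of similar observations.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)  # the cluster labels themselves (0/1) are arbitrary
```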

In this subject, we restrict our attention to supervised machine learning.

2.5 Training Phase

As we noted at the beginning, the mathematical and statistical algorithms we use for our machine learning are not actually the most important aspect of machine learning. In fact, the clever part of machine learning is in the training phase9, where we provide the data to be used to train our learning algorithm. We can think of this ‘training’ process as being similar to the way in which humans develop experience through exposure to stimuli.

2.6 Pre-Processing

Before we begin training our machine learning model, it is important to carefully pre-process our data. This can involve (amongst other things) converting variables into more appropriate formats, checking for samples that could have excessive influence on the model, and possibly removing variables that are highly correlated with other variables (as this can sometimes cause problems, depending on the machine learning model employed).
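A language-agnostic sketch of these three checks, using Python and invented survey data (the R versions of these steps are covered in section 3.3):

```python
import pandas as pd

# Hypothetical raw survey data, before pre-processing.
raw = pd.DataFrame({
    "study_hours": ["2.0", "0.5", "3.5", "40"],   # recorded as text; one implausible value
    "diet": ["good", "poor", "good", "average"],  # categorical, stored as plain strings
    "sleep_hours": [7.5, 6.0, 8.0, 7.0],
})

# 1. Convert variables into more appropriate formats.
raw["study_hours"] = pd.to_numeric(raw["study_hours"])
raw["diet"] = raw["diet"].astype("category")

# 2. Check for samples that could have excessive influence on the model
#    (40 hours of study per day is clearly a recording error).
suspect = raw[raw["study_hours"] > 24]

# 3. Check whether any feature variables are highly correlated with each other.
corr = raw[["study_hours", "sleep_hours"]].corr()
```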

More details on how to perform these pre-processing steps in R are included in section 3.3 of this document.

2.7 Training and Validation data

Once our pre-processing is complete, we typically split our data into two - the larger portion will be our training data, with the remainder to be used later as our validation data.

We provide our machine with the training data, and use the chosen machine learning algorithm to ‘learn’ parameter values (i.e. learn the relationship between the feature and outcome variables). Once this training is complete, we can then provide the machine with our validation data, to check if it produces an accurate prediction. Generally speaking, the more data used to train the model, the more accurate the model - therefore when we split our data into training and validation sets, the training data set will be much larger than the validation data set.
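A toy illustration of this train-then-validate cycle, using a simple linear regression in Python (the data is invented and deliberately noise-free, so the learned relationship is exact):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: average study hours per day vs final mark.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([55.0, 65.0, 75.0, 85.0])

# Training phase: the algorithm 'learns' parameter values relating
# the feature variable to the outcome variable.
model = LinearRegression().fit(X_train, y_train)

# Validation: ask for a prediction on data the model has not seen.
X_val = np.array([[2.5]])
print(model.predict(X_val))  # [70.] for this perfectly linear toy data
```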

Continuing our student grades example, suppose our data set contains information from 500 students. We randomly select 400 of these, and use their recorded information to train our model. We then verify/validate the quality of our model using the information for the remaining 100 students (which up until this point the machine has not seen).
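The 400/100 random split described above can be sketched as follows (Python for illustration; student IDs are hypothetical):

```python
import random

random.seed(1)  # fixed seed so the split is reproducible

# Hypothetical IDs for the 500 surveyed students.
student_ids = list(range(500))

# Randomly select 400 students for training; hold out the remaining 100.
training_ids = set(random.sample(student_ids, 400))
validation_ids = [s for s in student_ids if s not in training_ids]

print(len(training_ids), len(validation_ids))  # 400 100
```

The key design point is that the validation students are chosen at random and kept entirely out of the training process, so they act as genuinely unseen data.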

If results are good using the validation data, that’s great! We can now use our model to predict the grades of students in future years, by surveying them at the start of semester and recording their responses for the questions relating to the feature variables defined above.

2.8 Important Notes

It is important to note that a machine learning model should be accurate not only for the training data, but also when provided with new data - after all, when dealing with supervised learning, we are interested in accurate predictions for the future.

As noted by Grant and Wischik (2020), it is also important to remember that:

“The patterns found by machine learning are not laws of nature like Newton’s laws of motion, and they are not precise stipulative rules in the sense of directives laid down in statutes. They are simply fitted curves; and if the data is noisy then the curves will not fit well.”10


Hence we need to carefully ensure the quality of the training data we provide to our model, since this data will be used by the model to determine what is ‘true’.

Simply put: garbage in, garbage out.

References

Delua, J. (IBM). 2021. “Supervised Vs. Unsupervised Learning: What’s the Difference?” 2021. https://www.ibm.com/cloud/blog/supervised-vs-unsupervised-learning.
Grant, Thomas D, and Damon J Wischik. 2020. On the Path to AI: Law’s Prophecies and the Conceptual Foundations of the Machine Learning Age. Cham: Springer International Publishing AG.
Zhou, Zhi-Hua. 2021. Machine Learning. Singapore: Springer.

  5. Also known as predictor variables.↩︎

  6. Zhou (2021), p.4↩︎

  7. See e.g. Delua (2021)↩︎

  8. See e.g. Zhou (2021), p.4↩︎

  9. Grant and Wischik (2020), p.35↩︎

  10. Grant and Wischik (2020), p.4↩︎