Chapter 23 Model Construction

One way we can increase the quality of a supervised machine learning model is to use a resampling strategy, which repeatedly draws samples from the training data to construct an estimate. For this tutorial, we will use the same resampling strategy over and over, so we can store it once using the trainControl() function. trainControl() is especially useful when you are applying several different models with the same arguments, as we will do here.

# Use bootstrap resampling for every model we train in this tutorial
trctrl <- trainControl(method = "boot")
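
To see how this object gets reused, here is a minimal sketch of passing the same trctrl to train() for two different methods. It uses R's built-in iris data purely for illustration rather than the text data in this tutorial, and the two methods ("knn" and "rpart") are just examples:

library(caret)

# The same resampling specification is handed to every model via trControl
knn_fit  <- train(Species ~ ., data = iris, method = "knn",   trControl = trctrl)
tree_fit <- train(Species ~ ., data = iris, method = "rpart", trControl = trctrl)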

For this tutorial, we will use the bootstrapping method of resampling, which you can learn about in this Towards Data Science tutorial.

We’re almost ready to start using some SML models! But first, you need to learn about hyperparameters.

23.1 (Hyper)parameter Optimization

In supervised machine learning, algorithms “learn” which parameters (features or combinations of features) are optimal for their classification task. In our standard linear model (y = a + bx), a and b are parameters (y is the outcome, and x is a feature). However, algorithms sometimes require additional parameters that the user (you!) is expected to provide. Parameters supplied by the user are called “hyperparameters.” In unsupervised machine learning, the most important hyperparameter is k (i.e., the number of topics or clusters you have). Because each SML algorithm is unique (as in, based on a unique body of mathematical logic), each algorithm has its own set of hyperparameters.
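
If you are curious which hyperparameters a given caret method expects, modelLookup() will list them. The two method names below are only illustrative:

library(caret)

modelLookup("knn")  # one hyperparameter: k, the number of neighbors
modelLookup("rf")   # one hyperparameter: mtry, the number of predictors sampled at each split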

Still struggling to distinguish hyperparameters from parameters? Check out this explanation.

The process of identifying the right hyperparameters for your model is tedious and time-consuming, just as identifying the right k for unsupervised machine learning is time-consuming. However, it is an important part of ensuring you produce a quality text classifier. Tuning the hyperparameters can greatly improve the quality of your algorithm, but there are no “tried and true” rules because hyperparameters are intentionally meant to be tuned to different types of data. In other words, no hyperparameter value is “right” all the time; it is only “right” for your specific data and your specific supervised machine learning task.

Because this tutorial is a broad overview of supervised machine learning algorithms, we will not be able to go into hyperparameter optimization in depth. However, the way we compare different algorithms at the end of this tutorial is similar to the way in which we would compare two text classifiers using the same algorithm but different hyperparameters (i.e., using percent agreement and F-scores). I also encourage you to play around with the hyperparameters in this tutorial (a brief sketch of how to do so follows) so you can see how they change the results of the analysis.
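
As a rough sketch of what that playing around looks like in caret (again using the built-in iris data only as a stand-in), you can hand train() an explicit grid of hyperparameter values via tuneGrid, or let it try a set of default candidates via tuneLength:

library(caret)

# Try specific values of k for a k-nearest-neighbors classifier
knn_grid  <- expand.grid(k = c(3, 5, 7, 9))
knn_tuned <- train(Species ~ ., data = iris, method = "knn",
                   trControl = trctrl, tuneGrid = knn_grid)

# Or let caret pick 5 candidate values on its own
knn_auto <- train(Species ~ ., data = iris, method = "knn",
                  trControl = trctrl, tuneLength = 5)

knn_tuned$bestTune  # the value of k that performed best under resampling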

For a full list of the hyperparameters for each model, check out the caret tutorial. If you intend to work with supervised machine learning in R, I encourage you to familiarize yourself with this tutorial, as it is extremely useful for any aspiring data scientist working in R.

Okay, on to the algorithms!