6 Lab 7 Instructions
You may use:
The FIA data that I have pre-processed and put in the Lab 7 folder, which I briefly describe at the end of the Bookdown section
Anything else you want that could be treated as a multiclass (3+) classification problem, regardless of whether it is at the “landscape scale”, e.g., it could be some fish morphology data, animal body condition, or whatnot.
Note: For simplicity, I am writing the lab instructions for use with ranger, but if you would like to try out Spark, talk to me separately, since it is hard to do in an RMarkdown notebook and the first-time setup is most of the challenge.
6.1 Data Selection and Pre-Processing (25 pts)
Please select a data set for multiclass classification with random forest.
First, describe what your classification target is, and describe the number of classes (5 pts) and the imbalance among classes in your data (5 pts):
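One way to summarize the number of classes and their imbalance is sketched below. It is only an illustration: the data are a deliberately imbalanced subset of R's built-in `iris` data standing in for your own data set, and the names `dat`, `class_counts`, `class_props`, and `imbalance_ratio` are hypothetical.

```r
# Stand-in data (assumption): an imbalanced subset of the built-in iris data,
# with 50 / 30 / 10 rows in the three classes.
dat <- iris[c(1:50, 51:80, 101:110), ]

class_counts <- table(dat$Species)        # rows per class
class_props  <- prop.table(class_counts)  # share of rows per class

# Ratio of the largest class to the smallest class
imbalance_ratio <- max(class_counts) / min(class_counts)

print(nlevels(dat$Species))   # number of classes
print(class_counts)
print(round(class_props, 3))
print(imbalance_ratio)
```

A ratio close to 1 suggests roughly balanced classes; larger values suggest that the weighted model later in the lab may matter more.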
Describe the variables you are including as predictors (features) in your classification and why (5 pts):
Ensure that your data for at least the variables you are using are complete with no missing values (5 pts):
Ensure that your response variable is a factor for classification - it has to be! (5 pts):
Convert any categorical predictors to factors; whether this applies will depend on the variable (5 pts):
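The completeness and factor checks above could look something like the following sketch. The toy data frame and its column names (`forest_type`, `elevation`, `soil_class`) are made up for illustration and are not the columns of the FIA data.

```r
# Hypothetical toy data; column names are illustrative only.
dat <- data.frame(
  forest_type = c("oak", "pine", "fir", "oak", "pine", "fir"),
  elevation   = c(120, 340, NA, 200, 410, 980),
  soil_class  = c(1, 2, 3, 1, 2, 3),  # numeric codes for a categorical variable
  stringsAsFactors = FALSE
)
vars <- c("forest_type", "elevation", "soil_class")

# Count missing values per variable, then keep only the complete rows
print(colSums(is.na(dat[, vars])))
dat <- dat[complete.cases(dat[, vars]), ]
stopifnot(!anyNA(dat[, vars]))

# The response must be a factor for classification
dat$forest_type <- factor(dat$forest_type)

# A categorical predictor stored as numeric codes becomes a factor;
# a genuinely continuous predictor such as elevation stays numeric
dat$soil_class <- factor(dat$soil_class)

str(dat)
```

Dropping incomplete rows is only one option; imputation is another, and which is appropriate depends on how much data would be lost.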
6.2 Basic vs. weighted random forest (40 pts)
Create a basic random forest model using ranger with a typical number of trees (100-1000) and without hyperparameter tuning (15 pts for successful run).
Calculate the balanced accuracy of the model (5 pts):
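A minimal sketch of an untuned model and its balanced accuracy follows. It assumes the imbalanced `iris` subset as stand-in data and takes balanced accuracy to be the mean of the per-class recalls; other definitions exist (for example, per-class averages of sensitivity and specificity), so check which one you are expected to report. The helper `balanced_accuracy()` is a hypothetical name, not a ranger function.

```r
library(ranger)

# Stand-in data (assumption): imbalanced subset of iris (50 / 30 / 10 rows)
dat <- iris[c(1:50, 51:80, 101:110), ]

# Balanced accuracy, taken here as the mean of per-class recall
balanced_accuracy <- function(truth, pred) {
  pred <- factor(pred, levels = levels(truth))  # align class levels
  cm <- table(truth, pred)                      # rows = observed, cols = predicted
  mean(diag(cm) / rowSums(cm))
}

# Untuned forest: only the number of trees is set explicitly
rf_basic <- ranger(Species ~ ., data = dat, num.trees = 500, seed = 42)

# rf_basic$predictions holds the out-of-bag class predictions
ba_basic <- balanced_accuracy(dat$Species, rf_basic$predictions)
print(ba_basic)
```

Using the out-of-bag predictions avoids scoring the model on the rows each tree was trained on; a separate held-out test set would be a reasonable alternative.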
Create a weighted random forest model using ranger with the same number of trees as before and without hyperparameter tuning (10 pts for successful run).
Calculate the balanced accuracy of the model (5 pts):
Interpret whether and how your balanced accuracy changed when weighting classes (5 pts):
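One possible way to weight classes is through ranger's `class.weights` argument, which takes one weight per class in the order of the factor levels. The sketch below assumes inverse-frequency weights and the imbalanced `iris` subset as stand-in data; the weighting scheme is an assumption, and others are equally defensible.

```r
library(ranger)

# Stand-in data (assumption): imbalanced subset of iris (50 / 30 / 10 rows)
dat <- iris[c(1:50, 51:80, 101:110), ]

# Inverse-frequency weights, one per class, in factor-level order:
# n_total / (n_classes * n_in_class), so rarer classes get larger weights
class_counts <- table(dat$Species)
class_wts <- as.numeric(sum(class_counts) / (length(class_counts) * class_counts))

rf_weighted <- ranger(Species ~ ., data = dat, num.trees = 500,
                      class.weights = class_wts, seed = 42)

# Balanced accuracy as mean per-class recall on the out-of-bag predictions
cm <- table(observed = dat$Species, predicted = rf_weighted$predictions)
ba_weighted <- mean(diag(cm) / rowSums(cm))
print(class_wts)
print(ba_weighted)
```

Keeping `num.trees` and the seed identical to the unweighted run makes the two balanced accuracies easier to compare.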
6.3 Hyperparameter tuning (35 pts)
Choose either a grid search or a random search method. Create a search where you vary tree number, minimum node size, and mtry with at least 12 different runs (10 pts):
Justify your choices for your search for each of the 3 hyperparameters being tuned (5 pts each).
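A grid search over the three hyperparameters might be sketched as follows. The candidate values are placeholders chosen only to make the example run on the stand-in `iris` subset; they are not recommendations, and the out-of-bag balanced accuracy (mean per-class recall) used to rank runs is an assumed choice of criterion.

```r
library(ranger)

# Stand-in data (assumption): imbalanced subset of iris (50 / 30 / 10 rows)
dat <- iris[c(1:50, 51:80, 101:110), ]
p <- ncol(dat) - 1  # number of predictors, the upper limit for mtry

# Full grid: 3 x 3 x 2 = 18 runs
grid <- expand.grid(
  num.trees     = c(250, 500, 1000),
  mtry          = c(1, 2, p),
  min.node.size = c(1, 5),
  KEEP.OUT.ATTRS = FALSE
)

grid$bal_acc <- NA_real_
for (i in seq_len(nrow(grid))) {
  fit <- ranger(Species ~ ., data = dat,
                num.trees     = grid$num.trees[i],
                mtry          = grid$mtry[i],
                min.node.size = grid$min.node.size[i],
                seed = 42)
  # Out-of-bag balanced accuracy (mean per-class recall) for this run
  cm <- table(dat$Species, fit$predictions)
  grid$bal_acc[i] <- mean(diag(cm) / rowSums(cm))
}

best <- grid[which.max(grid$bal_acc), ]
print(grid[order(-grid$bal_acc), ])
print(best)
```

For a random search, the same loop could run over rows sampled from a larger grid, e.g. `grid[sample(nrow(grid), 12), ]`.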
Create a variable importance plot for your best model from your hyperparameter tuning (10 pts).
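As a rough sketch of the importance plot, ranger only stores variable importance if the model is fitted with the `importance` argument set, so the best model may need to be refitted. The hyperparameter values below are placeholders rather than tuned results, the data are the stand-in `iris` subset, and permutation importance is an assumed choice ("impurity" is another option).

```r
library(ranger)

# Stand-in data (assumption): imbalanced subset of iris (50 / 30 / 10 rows)
dat <- iris[c(1:50, 51:80, 101:110), ]

# Refit with importance requested; hyperparameter values are placeholders
rf_best <- ranger(Species ~ ., data = dat,
                  num.trees = 500, mtry = 2, min.node.size = 1,
                  importance = "permutation", seed = 42)

# Named numeric vector, sorted so the most important variable plots on top
imp <- sort(rf_best$variable.importance)

par(mar = c(4, 8, 2, 1))  # widen the left margin for variable names
barplot(imp, horiz = TRUE, las = 1,
        xlab = "Permutation importance",
        main = "Variable importance (best model)")
```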