Chapter 22 Data Partitioning
As with our logistic regression tutorial, we will begin by partitioning the labeled dataset into a training set and a test set (in this tutorial, we will use a 70/30 split). We will fit each algorithm on the training set and then evaluate its predictions on the test set. We will do this four times: once for each algorithm. Then, we will compare how accurately each algorithm performed and select one that we can apply to the whole dataset.
trainIndex <- createDataPartition(y = conservative_data$conservative, p = 0.7, list = FALSE)
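To get a feel for what a 70/30 partition that respects the label balance looks like, here is a minimal base R sketch on made-up labels. It is not caret's implementation, only an illustration of the idea of sampling 70% of the rows within each class; the names `labels`, `train_idx`, and `test_idx` are hypothetical and do not appear elsewhere in this tutorial.

```r
# Toy labels (made-up data, not the tutorial's dataset): 40 "yes", 60 "no"
set.seed(381)  # set the seed BEFORE sampling so the split is reproducible
labels <- factor(rep(c("yes", "no"), times = c(40, 60)))

# Group the row numbers by class, then sample 70% of the rows in each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.7 * length(i)))]
}), use.names = FALSE))

# Every row that is not in the training set goes to the test set
test_idx <- setdiff(seq_along(labels), train_idx)

# The class proportions are the same in both sets
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
```

Because the sampling happens within each class, both sets keep the original 40/60 balance. Note that the seed only makes the split reproducible if it is set before the random sampling step.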
In the next lines, after setting a seed, I construct the training and test sets by splitting the document-term matrix (con_dtm) with the partition. But the dtm doesn't contain the labels, so I will also need to subset the conservative code in the original dataset (conservative_data) using the same partition.
set.seed(381)
tw_to_train <- con_dtm[trainIndex, ] %>% as.matrix() %>% as.data.frame()
tw_to_test <- con_dtm[-trainIndex, ] %>% as.matrix() %>% as.data.frame()
conservative_code <- conservative_data$conservative[trainIndex]
Now, we have our training set, our test set, and the labels for our supervised machine learning model. Yay!
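Since the features and the labels live in separate objects, it is worth checking that they still line up after the split. Below is a small sketch on toy data, assuming hand-picked row numbers in place of a sampled partition; `toy_features`, `toy_labels`, `toy_train_index`, and the `train_*`/`test_*` names are all hypothetical stand-ins. The code above only builds labels for the training set, so the `test_y` line shows one way a matching set of test labels could be made with the negated index.

```r
# Hypothetical toy stand-ins for con_dtm and conservative_data$conservative
toy_features <- data.frame(term_a = c(1, 0, 2, 0, 1, 3),
                           term_b = c(0, 1, 0, 2, 1, 0))
toy_labels <- factor(c("yes", "no", "yes", "no", "no", "yes"))

# Hand-picked training rows for illustration; normally these are sampled
toy_train_index <- c(1, 2, 4, 6)

# Use the SAME index for the features and the labels
train_x <- toy_features[toy_train_index, ]
test_x  <- toy_features[-toy_train_index, ]
train_y <- toy_labels[toy_train_index]
test_y  <- toy_labels[-toy_train_index]

# Sanity checks: one label per row, and no rows lost in the split
stopifnot(nrow(train_x) == length(train_y),
          nrow(test_x) == length(test_y),
          nrow(train_x) + nrow(test_x) == nrow(toy_features))
```

If any of these checks fail on real data, the most likely cause is that the features and labels were subset with different indices or come from differently ordered objects.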