Chapter 22 Data Partitioning

As with our logistic regression tutorial, we will begin by partitioning the labeled dataset into a training set and a test set (in this tutorial, we will use a 70/30 split). We will train each algorithm on the training set and then test its quality on the test set. We will do this four times, once for each algorithm. Then, we will compare how accurately each algorithm performed to select one that we can apply to the whole dataset.
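To make the idea of a 70/30 split concrete, here is a minimal base R sketch on made-up labels. It draws 70% of the rows within each class, which is roughly the kind of stratified sampling caret's createDataPartition() performs for a factor outcome. The names toy_labels, rows_by_class, and toy_train_index are hypothetical and are not part of the tutorial's data.

```r
# Made-up labels for 20 tweets: 10 coded conservative (1), 10 coded not (0)
toy_labels <- factor(rep(c(1, 0), each = 10))

set.seed(381)  # set the seed before sampling so the split is reproducible

# Within each class, draw 70% of that class's row numbers for training
rows_by_class <- split(seq_along(toy_labels), toy_labels)
toy_train_index <- sort(unlist(
  lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
  }),
  use.names = FALSE
))

length(toy_train_index)              # number of training rows
table(toy_labels[toy_train_index])   # class balance in the training set
table(toy_labels[-toy_train_index])  # class balance in the test set
```

Sampling within each class keeps the share of conservative and non-conservative tweets about the same in both sets, so neither set ends up dominated by one label by chance.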

set.seed(381) # set the seed before the random split so the partition is reproducible
trainIndex <- createDataPartition(y = conservative_data$conservative, p = 0.7, list = FALSE)

In the next three lines, I construct the training and test sets by splitting up the document-term matrix. But the DTM doesn’t contain the labels, so I will also need to subset the conservative code in the original dataset (conservative_data) using the same partition.

tw_to_train <- con_dtm[trainIndex, ] %>% as.matrix() %>% as.data.frame()
tw_to_test <- con_dtm[-trainIndex, ] %>% as.matrix() %>% as.data.frame()

conservative_code <- conservative_data$conservative[trainIndex]

Now, we have our training set, our test set, and the labels for our supervised machine learning model. Yay!
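One thing worth checking at this point is that the rows and the labels still line up. The sketch below uses made-up stand-ins (toy_dtm, toy_data, and toy_index are hypothetical, not objects from this tutorial) to show the same subsetting pattern, plus the matching test labels, which we would need later to score predictions if they are not created elsewhere.

```r
# Made-up stand-ins for con_dtm and conservative_data (6 tweets, 3 terms)
toy_dtm <- matrix(
  c(1, 0, 2,
    0, 1, 0,
    3, 0, 1,
    0, 0, 1,
    1, 1, 0,
    0, 2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("tax", "freedom", "vote"))
)
toy_data <- data.frame(conservative = c(1, 0, 1, 0, 1, 0))

# Hand-picked training rows, standing in for trainIndex
toy_index <- c(1, 2, 4, 5)

# Same pattern as above: positive index for training, negative for test
toy_train <- as.data.frame(toy_dtm[toy_index, ])
toy_test  <- as.data.frame(toy_dtm[-toy_index, ])

toy_train_code <- toy_data$conservative[toy_index]
toy_test_code  <- toy_data$conservative[-toy_index]  # labels held out for scoring

# Each set should have exactly one label per row
stopifnot(
  nrow(toy_train) == length(toy_train_code),
  nrow(toy_test) == length(toy_test_code)
)
```

If either check fails on real data, the document-term matrix and the labeled dataset probably do not have the same rows in the same order, and the partition should not be trusted until that is fixed.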