4.3 Binning Data:
A quick look at the distribution of the variables:
# one histogram per variable (the result is not assigned, so the name `hist` is not masked)
for (i in 1:19) {
  hist(sausage.processed[, i], breaks = 20,
       main = paste("Histogram of", colnames(sausage.processed)[i]),
       xlab = "Bins of Variables",
       ylab = "Ratings")
}
Binning using BinQuant:
We first allocate a matrix to hold the recoded (binned) data. It has the same dimensions as the original data and is filled with NA, so any column that was not binned successfully is easy to spot.
# create a space to store our binning results, pre-filled with NA
bin_data <- matrix(NA,
                   nrow = nrow(sausage.processed),
                   ncol = ncol(sausage.processed))
colnames(bin_data) <- colnames(sausage.processed)
rownames(bin_data) <- wk0$Product
We use this loop with a tryCatch to skip the variables that BinQuant cannot bin. The errors arise only when nClass is larger than 1. Here, at nClass = 4, three variables fail. We then inspect bin_data to find which ones they are and bin them manually.
# bin each variable into 4 classes; tryCatch skips the variables that BinQuant cannot bin
for (i in 1:19) {
  tryCatch({
    # column i of bin_data corresponds to column i of sausage.processed
    bin_data[, i] <- BinQuant(sausage.processed[, i], nClass = 4, stem = '')
  }, error = function(e) {
    cat("ERROR :", conditionMessage(e), "\n")
  })
}
## ERROR : 'breaks' are not unique
## ERROR : 'breaks' are not unique
## ERROR : 'breaks' are not unique
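Because the handler prints only the message, the failing variables have to be found afterwards by inspecting bin_data. A minimal sketch of recording the failing column names inside the loop follows. It uses made-up data and a hypothetical stand-in, `quantile_bin()`, built on base R's `cut()` and `quantile()`. It is not BinQuant and only illustrates the bookkeeping.

```r
# Hypothetical stand-in for BinQuant (assumption): quantile-based binning in base R.
quantile_bin <- function(x, n_class) {
  breaks <- quantile(x, probs = seq(0, 1, length.out = n_class + 1))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
}

# Toy data (made up): column "b" is heavily tied, so its quartiles coincide.
toy <- data.frame(a = 1:20, b = c(rep(0, 15), 1:5), c = (20:1) / 2)

binned <- matrix(NA, nrow = nrow(toy), ncol = ncol(toy),
                 dimnames = list(NULL, colnames(toy)))
failed <- character(0)

for (nm in colnames(toy)) {
  tryCatch({
    binned[, nm] <- quantile_bin(toy[[nm]], n_class = 4)
  }, error = function(e) {
    # Record the variable name together with the message.
    failed <<- c(failed, nm)
    cat("ERROR in", nm, ":", conditionMessage(e), "\n")
  })
}

print(failed)
```

With the names collected in `failed`, the manual re-binning step can index those columns directly instead of looking for all-NA columns.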
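Inspecting the newly filled bin_data shows that three variables were not binned: Bitter (column 6), Fatty (column 13), and HVP (column 14). We bin these manually with a smaller nClass. Suspected reason: these variables probably contain many tied values, so some of the quantile breakpoints coincide and the binning rejects them as non-unique breaks.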
# variable Bitter (column 6) - works only when nClass = 1
bin_data[, 6] <- BinQuant(sausage.processed[, 6], nClass = 1, stem = '')
# variable Fatty (column 13) - works when nClass = 3
bin_data[, 13] <- BinQuant(sausage.processed[, 13], nClass = 3, stem = '')
# variable HVP (column 14) - works when nClass = 3
bin_data[, 14] <- BinQuant(sausage.processed[, 14], nClass = 3, stem = '')
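The "'breaks' are not unique" message is the one base R's `cut()` raises when two or more breakpoints are identical, which is what quantile breakpoints produce on heavily tied data. Whether BinQuant calls `cut()` internally is an assumption here. The sketch below reproduces the error on made-up data and shows that dropping the duplicated breakpoints leaves fewer bins, which would be consistent with some variables working only at a smaller nClass.

```r
# Made-up ratings: 15 of 20 values are tied at 0.
x <- c(rep(0, 15), 1, 2, 3, 4, 5)

# Quartile breakpoints for 4 classes; the first three are all 0.
breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
print(breaks)

# cut() refuses duplicated breakpoints.
msg <- tryCatch(
  {
    cut(x, breaks = breaks, include.lowest = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping the duplicates gives valid breakpoints, but only 2 bins instead of 4.
bins <- cut(x, breaks = unique(breaks), include.lowest = TRUE)
print(table(bins))
```

Counting distinct values per column (for example with `length(unique(x))`) before choosing nClass is one way to anticipate which variables will hit this limit.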
Distribution, Sample size, and Spearman Correlation after Binning:
for (i in 1:19) {
  hist(bin_data[, i], breaks = 4,
       main = paste("Histogram of", colnames(sausage.processed)[i], "- After Binning"),
       xlab = "Bins of Variables",
       ylab = "Ratings",
       labels = TRUE,  # print the count (sample size) above each bar
       col = c("lightblue", "gold", "lightgreen", "chocolate1"))
  # Spearman correlation between the original variable and its own binned version
  spearman <- cor(sausage.processed[, i], bin_data[, i], method = "spearman")
  # single legend entry, rounded for readability
  legend("topright", legend = paste("Spearman =", round(spearman, 3)))
  abline(v = 2, col = "red")
}
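The Spearman coefficient in the legend indicates how well each binned variable preserves the rank order of the original ratings. A minimal sketch on made-up data, using a hypothetical quartile binning with `cut()` rather than BinQuant, shows the per-variable computation and one caveat: a variable collapsed into a single bin (as with Bitter at nClass = 1) is constant, so `cor()` returns NA with a warning.

```r
set.seed(1)
raw <- runif(50, min = 0, max = 10)  # made-up ratings, no ties

# Hypothetical quartile binning in base R (stand-in for BinQuant).
breaks <- quantile(raw, probs = seq(0, 1, length.out = 5))
binned <- as.integer(cut(raw, breaks = breaks, include.lowest = TRUE))

# Rank agreement between the raw values and their bins.
rho <- cor(raw, binned, method = "spearman")
print(round(rho, 3))

# A single bin has zero variance, so the correlation is undefined.
one_bin <- rep(1L, length(raw))
rho_one <- suppressWarnings(cor(raw, one_bin, method = "spearman"))
print(rho_one)
```

A coefficient close to 1 suggests little rank information was lost by binning. An NA in the legend indicates a single-bin variable rather than a failed computation.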