Chapter 16 K-Means Clustering

Hello! Today, we’ll be learning about k-means, a clustering strategy. Clustering algorithms are a family of algorithms used to group observations based on how similar they are across a set of variables. Clustering is considered an unsupervised machine learning strategy because the computer learns patterns from the variables themselves, not from labels or interpretations that you supply.

K-means is a strategy that accepts continuous (numeric) variables. I have included a tutorial on how to use categorical variables when constructing a k-means model in R, but we’ll stick with continuous variables for this tutorial.
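Before turning to the survey, here is a minimal sketch of what a k-means call on continuous variables can look like. It uses R's built-in iris data rather than the chapter's survey. The choice of three clusters, the use of scale(), and the nstart value are illustrative assumptions, not recommendations from this chapter.

```r
# Minimal sketch: k-means on the four numeric columns of the built-in iris data.
set.seed(123)  # k-means starts from random centers, so fix the seed

numeric_vars <- iris[, 1:4]          # keep only the continuous columns
scaled_vars  <- scale(numeric_vars)  # put the variables on a comparable scale

# centers = 3 and nstart = 25 are illustrative choices, not tuned values
km_fit <- kmeans(scaled_vars, centers = 3, nstart = 25)

table(km_fit$cluster)  # number of observations assigned to each cluster
km_fit$centers         # cluster centers, expressed in scaled units
```

Scaling matters here because k-means relies on distances, so a variable measured on a larger scale would otherwise dominate the grouping.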

Let us first download our data (the voting survey) and load the packages! The main function we will use, kmeans(), is a base R function. However, factoextra is a package that is especially useful for visualizing and interpreting cluster analysis. It supports a range of unsupervised strategies, including Principal Component Analysis, Hierarchical Clustering, and others. Learn more about factoextra in this STHDA tutorial.

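options(scipen = 999)   # prefer fixed notation over scientific notation in output
library(tidyverse)      # provides %>% and select()
library(factoextra)     # helpers for visualizing cluster analyses

# Read the survey and keep only the variables used in this chapter
survey_data <- read.csv("data/survey_2020-10_kaiser_voting_31118013.csv", stringsAsFactors = TRUE) %>%
  select(q1b, q1c, q1e, trumpapprove)
# str(survey_data)   # uncomment to inspect column types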
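Since kmeans() expects numeric input, it can help to check column types right after loading. The sketch below uses a small made-up data frame as a stand-in. The names toy_data, item_a, item_b, and group are hypothetical and the values are invented for illustration, so nothing here reflects the actual survey.

```r
# Hypothetical stand-in data; column names and values are invented.
toy_data <- data.frame(
  item_a = c(1, 3, 5, 2),
  item_b = c(2, 2, 4, 1),
  group  = factor(c("x", "y", "x", "y"))
)

# TRUE for columns that kmeans() can use directly
is_num <- sapply(toy_data, is.numeric)
is_num

# Keep only the numeric columns; drop = FALSE preserves the data frame
numeric_only <- toy_data[, is_num, drop = FALSE]
str(numeric_only)
```

The same sapply() check could be run on survey_data to see which columns, if any, were read in as factors.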