Chapter 19 K-Means Clustering

Hello! Today, we’ll be learning about k-means, a clustering strategy. Clustering algorithms are a family of algorithms that group observations based on how similar they are across a set of variables. Clustering is considered an unsupervised machine learning strategy because the computer learns patterns from the variables themselves, not from labels or interpretations that you supply.

K-means works with continuous (numeric) variables, because it groups observations by their distance to cluster means. I have included a tutorial for how to use categorical variables when constructing a k-means model in R, but we’ll stick with continuous variables for this tutorial.
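To make the idea concrete before we touch the survey, here is a minimal sketch on made-up numeric data. The `toy` data frame and its two groups are invented purely for illustration, and the choice of two clusters is an assumption that only makes sense because we built the data that way. It uses kmeans() from the stats package that ships with R.

```r
# Toy data (made up for illustration): two well-separated groups of 50 points
set.seed(123)
toy <- data.frame(
  x = c(rnorm(50, mean = 0), rnorm(50, mean = 10)),
  y = c(rnorm(50, mean = 0), rnorm(50, mean = 10))
)

# Ask for 2 clusters; nstart = 10 tries 10 random starts and keeps the best fit
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

toy_fit$centers         # one row of variable means per cluster
table(toy_fit$cluster)  # number of observations assigned to each cluster
```

Because the two groups sit far apart, each one should end up in its own cluster. Real survey data is rarely this tidy, so treat this only as a picture of what the algorithm is trying to do.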

Let us first download our data (the Pew survey) and load the packages! The main function we will use, kmeans(), is part of base R (it lives in the stats package, which loads automatically). However, factoextra is a package that is especially useful for visualizing and interpreting cluster analyses. It supports a range of unsupervised strategies, including Principal Component Analysis, Hierarchical Clustering, and others. Learn more about factoextra in this STHDA tutorial.

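options(scipen = 999) # strongly prefer fixed over scientific notation when printing
library(tidyverse)
# install.packages("factoextra") # run once if the package is not installed
library(factoextra)

# Read the Pew survey and keep only the variables used in this tutorial
survey_data <- read_csv("data/pew_AmTrendsPanel_110_2022_revised.csv") %>%
  select(gun_bill_approval, perceived_party_difference, age, income, personal_fin)
# str(survey_data) # uncomment to inspect the column types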

19.1 Data Cleaning/Wrangling

Importantly, k-means requires complete observations. In other words, if any value in your data is NA, kmeans() will stop with an error. We can remove observations that contain an NA using na.omit(). (This matters less for our dataset, as this subset has no NA values.)

survey_data <- na.omit(survey_data)
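If you want to see what na.omit() is doing, the sketch below uses a small made-up data frame (`toy_na`, with hypothetical `age` and `income` values) rather than the survey itself. It counts the missing values per column, shows that kmeans() refuses to run on incomplete data, and then confirms that the cleaned data has no NA values left.

```r
# Toy data (made up for illustration) with one NA in each column
toy_na <- data.frame(
  age    = c(25, 31, NA, 47, 52, 60),
  income = c(3, 4, 5, NA, 8, 9)
)

# Count missing values per column before cleaning
colSums(is.na(toy_na))

# kmeans() errors on data containing NA; tryCatch() captures the error
# object so the script keeps running
attempt <- tryCatch(kmeans(toy_na, centers = 2), error = function(e) e)
inherits(attempt, "error")

# Drop every row that contains at least one NA
toy_clean <- na.omit(toy_na)
nrow(toy_clean)    # rows remaining after removal
anyNA(toy_clean)   # FALSE once all incomplete rows are gone
```

Running colSums(is.na(survey_data)) on the survey before calling na.omit() is a quick way to check how many observations you would lose.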