Chapter 19 K-Means Clustering
Hello! Today, we’ll be learning about k-means, a clustering strategy. Clustering algorithms are a family of algorithms used to group observations based on how similar they are across a set of variables. Clustering is considered an unsupervised machine learning strategy because the computer learns patterns from the variables themselves, not from your interpretation of the variables.
K-means is a strategy that accepts continuous (numeric) variables. I have included a tutorial for how to use categorical variables when constructing a K-means model in R, but we’ll stick with continuous variables for this tutorial.
Let us first download our data (the Pew survey) and load the packages! The main function we will use, kmeans(), is a base R function. However, factoextra is a package that is especially useful for visualizing and interpreting cluster analysis. It is used for a range of unsupervised strategies, including Principal Component Analysis, Hierarchical Clustering, and others. Learn more about factoextra in this sthda tutorial.
options(scipen=999)
library(tidyverse)
#install.packages("factoextra")
library(factoextra)
survey_data <- read_csv("data/pew_AmTrendsPanel_110_2022_revised.csv") %>%
  select(gun_bill_approval, perceived_party_difference, age, income, personal_fin)
#str(survey_data)
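Before we turn to the survey data, here is a minimal sketch of what kmeans() takes and returns. The toy data frame below is invented for illustration (it is not the Pew data), and the choices of centers = 2 and nstart = 10 are assumptions made for this example only.

```r
# Toy data, invented for illustration: two well-separated groups of points
set.seed(42)
toy <- data.frame(
  x = c(rnorm(20, mean = 0), rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 0), rnorm(20, mean = 10))
)

# kmeans() comes from the stats package, which R attaches by default.
# centers = 2 asks for two clusters; nstart = 10 tries ten random starts
# and keeps the best solution.
toy_fit <- kmeans(toy, centers = 2, nstart = 10)

toy_fit$size      # number of observations in each cluster
toy_fit$centers   # mean of x and y within each cluster
head(toy_fit$cluster)  # cluster assignment for the first few rows
```

The returned object is a list, so the cluster assignments in toy_fit$cluster can be attached back to the data frame as a new column if you want to inspect the groups.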
19.1 Data Cleaning/Wrangling
Importantly, k-means requires complete observations to work. In other words, if you have any values that are NA, then kmeans() will not work. We can remove observations with an NA using na.omit(). (This is less relevant for our dataset, as we do not have NA values in this subset.)
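As a quick sketch of this step, the example below uses a small invented data frame (the column names echo our survey variables, but the values are made up) to show how to count NA values, what happens when kmeans() meets them, and how na.omit() drops the incomplete rows.

```r
# Toy data with missing values, invented for illustration
toy_na <- data.frame(
  age    = c(25, 40, NA, 61, 33),
  income = c(3, 5, 4, NA, 2)
)

# Count the NA values in each column
colSums(is.na(toy_na))

# kmeans() stops with an error when the data contain NA;
# try() catches the error so the script keeps running
na_attempt <- try(kmeans(toy_na, centers = 2), silent = TRUE)
inherits(na_attempt, "try-error")

# na.omit() drops every row that has at least one NA
toy_complete <- na.omit(toy_na)
nrow(toy_na)        # rows before
nrow(toy_complete)  # rows after
```

Running colSums(is.na(survey_data)) on your own data is a cheap way to confirm whether na.omit() will actually remove anything.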