4.2 Unsupervised Learning

Unsupervised learning involves discovering patterns in data without labeled responses.

Some key goals in this setting include:

Learn the structure of data: It is possible to learn if the data consists of clusters, or if it can be represented in a lower dimension.
Learn the probability distribution of data: By learning the probability distribution where the training data came from, it is possible to generate synthetic data which is “similar” to real data
Learn a representation for data: We can learn a representation that is useful in solving other tasks later. With this new representation, for example, we can reduce the need for labeled examples for classification.

4.2.1 Clustering

Clustering is one of the main tasks in unsupervised learning. It is the process of detecting clusters in the dataset.

Often, the membership of a cluster can replace the role of a label in the training dataset. In general, clusters reveal a lot of information about the underlying structure of the data.

There are different ways to cluster datapoints. Some attempts to define a “good cluster” are as follows:

A “good” cluster is a subset of points which are closer to each other than to all other points in the dataset.
A “good” cluster is a subset of points which are closer to the mean of their own cluster than to the mean of other clusters.

Centroid-Based Clustering

Also called K-means clustering.

# Load necessary library
library(cluster)

# Perform k-means clustering
set.seed(123)
kmeans_result <- kmeans(mtcars[, c("mpg", "hp")], centers = 3)

# Print cluster centers
print(kmeans_result$centers)

##        mpg        hp
## 1 14.62000 263.80000
## 2 15.80000 178.50000
## 3 24.22353  93.52941

Density-Based Clustering

Hierarchical Clustering

4.2.2 Language Models

In this chapter, we continue our investigation into unsupervised learning techniques and now turn our attention to language models. You may have heard of natural language processing (NLP) and models such as GPT-3 in the news lately. The latter is quite impressive, being able to write and publish its own opinion article on a reputable news website!

While most of these models are trained using state-of-the-art deep learning techniques this chapter explores a key idea, which is to view language as the output of a probabilistic process, which leads to an interesting measure of the “goodness” of the model. Specifically, we will investigate the so-called n-gram language model.

Probabilistic Model of Language

4.2.3 Recommender System

Recommender systems are a class of algorithms that help users discover items of interest. They are widely used in e-commerce, streaming services, and social networks to suggest products, movies, friends, and more.

While many recommender systems involve supervised learning (e.g., predicting ratings), they can also be formulated as unsupervised learning problems. A typical unsupervised approach involves finding similar users or items based on interaction patterns, without any labeled outcome.

References

Arora, S., Park, S., Jacob, D., Chen, D. (2024) Introduction to Machine Learning. Lecture notes for CS 324 at Princeton University.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013b). An Introduction to Statistical learning. In Springer texts in statistics. https://doi.org/10.1007/978-1-4614-7138-7