Module 4 Unsupervised Learning

Unsupervised learning is a fundamental paradigm in machine learning that deals with the exploration and extraction of patterns from unlabeled data. Unlike supervised learning, where the model is trained on labeled examples with explicit input-output pairs, unsupervised learning operates on data lacking predefined labels. The primary objective is to uncover inherent structures, relationships, or representations within the data without explicit guidance.

4.1 Clustering

Clustering is a fundamental unsupervised learning technique that involves grouping similar data points together based on certain criteria, aiming to discover intrinsic patterns within the dataset. The goal is to create homogeneous groups or clusters, where data points within the same cluster share more similarities with each other than with those in other clusters. This process helps in uncovering the inherent structure of the data, providing valuable insights for various applications such as customer segmentation, anomaly detection, and data organization.

4.1.1 Different clustering approaches

There are various types of clustering methods, each with its own approach and characteristics. Here are some commonly used types of clustering:

  1. Centroid-Based Clustering

  2. Hierarchical Clustering

  3. Density-Based Clustering

  4. Distribution-Based Clustering

  5. Graph-Based Clustering

  6. Fuzzy Clustering

  7. Model-Based Clustering

In this foundation course, we will discuss only the first two types of clustering algorithms.

4.1.2 1. Centroid-Based Clustering

Centroid-based clustering is a type of clustering algorithm that partitions the data into clusters, where each cluster is represented by a central point known as the centroid. The goal is to minimize the distance between data points in the same cluster and maximize the separation between different clusters. One of the most widely used centroid-based clustering algorithms is K-Means.

4.1.2.1 K-Means Algorithm:

4.1.2.2 Mathematical Formulation:

Given a dataset \(X = \{x_1, x_2, \ldots, x_n\}\) with \(n\) data points in a \(p\)-dimensional space, and a predefined number of clusters \(k\), the K-Means algorithm aims to find \(k\) centroids \(c_1, c_2,\ldots, c_k\) and assign each data point to the cluster with the nearest centroid.

  1. Initialization:
    • Randomly select \(k\) initial centroids \(c_1^{(0)}, c_2^{(0)}, ..., c_k^{(0)}\).
  2. Assignment Step:
    • For each data point \(x_i\), compute its distance to each centroid \(c_j\):

      \[d(x_i, c_j) = \sqrt{\sum_{l=1}^{p} (x_{il} - c_{jl})^2}\]

    • Assign \(x_i\) to the cluster with the nearest centroid:

      \[\text{Cluster}(x_i) = \arg \min_{j} d(x_i, c_j)\]

  3. Update Step:
    • Recalculate the centroids based on the data points in each cluster: \[ c_j^{(t+1)} = \frac{1}{|C_j|} \sum_{x \in C_j} x \]
    • Here, \(t\) represents the iteration, \(C_j\) is the set of data points assigned to cluster \(j\), and \(|C_j|\) is the number of data points in cluster \(j\).
  4. Convergence:
    • Repeat the Assignment and Update steps until convergence criteria are met (e.g., minimal change in centroids or a predefined number of iterations).

4.1.2.3 Objective Function (Cost Function):

The K-Means algorithm minimizes the following objective function, often referred to as the “within-cluster sum of squares” or “inertia”: \[J = \sum_{j=1}^{k} \sum_{x \in C_j} \|x - c_j\|^2\] Here, \(\|x - c_j\|^2\) represents the squared Euclidean distance between data point \(x\) and centroid \(c_j\).

4.1.2.3.1 Limitations:
  • K-Means is sensitive to the initial choice of centroids, and different initializations may lead to different final clusters.
  • The algorithm assumes spherical and equally-sized clusters, making it less suitable for non-spherical or unevenly sized clusters.

Despite its limitations, K-Means remains a widely used and computationally efficient method for centroid-based clustering in various applications.

4.1.2.4 Pseudocode:

# Input
X = {x_1, x_2, ..., x_n}
k = Number of clusters

# Initialization
Randomly select k initial centroids c_1^(0), c_2^(0), ..., c_k^(0)

# Repeat until convergence
while not converged:
  # Assignment Step
  for each data point x_i:
    Compute distances to all centroids: d(x_i, c_j) for j = 1 to k
    Assign x_i to the cluster with the nearest centroid: Cluster(x_i) = arg min_j d(x_i, c_j)

  # Update Step
  for each cluster j = 1 to k:
    Recalculate the centroid: c_j^(t+1) = (1/|C_j|) * sum of x over all x in C_j

  # Convergence criteria
  Check if centroids do not change significantly or reach a predefined number of iterations.
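
To make the pseudocode concrete, here is a minimal NumPy sketch of K-Means (a teaching illustration under simple assumptions, such as random initialization from the data points and Euclidean distance, rather than the optimized implementation found in scikit-learn; the names kmeans, max_iter and tol are chosen here for clarity):

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal K-Means: X is an (n, p) array, k is the number of clusters."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Convergence check: stop when the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids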

4.1.2.5 Simple example of K-means clustering using the Iris dataset.

In this example we will demonstrate how to use clustering to find an approximate classification of iris flowers using the input features only.

Step 1: Loading libraries

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

Loading the dataset

# Load the Iris dataset
iris = load_iris()
X = iris.data  # use only the input features, not the target labels

Visualizing the data point distribution:

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1])
#plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, c='red')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('Distribution of Iris data')
plt.show()

Finding the best value of k using elbow method:

# Initialize an empty list to store the within-cluster sum of squares
wcss = []

# Try different values of k
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k,n_init='auto', random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squared distances

Creating elbow plot:

# Plot the within-cluster sum of squares for different values of k
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.title('Elbow Method')
plt.show()

Using the GridSearchCV method:

# Define the parameter grid
from sklearn.model_selection import GridSearchCV

param_grid = {'n_clusters': [2, 3, 4, 5, 6]}

# Create a KMeans object
kmeans = KMeans(n_init='auto',random_state=42)

# Create a GridSearchCV object
grid_search = GridSearchCV(kmeans, param_grid, cv=5)

# Perform grid search
grid_search.fit(X)

# Get the best parameters and the best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

Showing best k-value:

print("Best Parameters:", best_params)
print("Best Score:", best_score)

Implementing K-means clustering:

# Perform k-means clustering
k = 6 # Number of clusters
kmeans = KMeans(n_clusters=k,n_init='auto', random_state=42)
kmeans.fit(X)

Extracting labels and cluster centers:

# Get the cluster labels and centroids
labels = kmeans.labels_
centroids = kmeans.cluster_centers_

# Store the features in a DataFrame and add the cluster labels
df = pd.DataFrame(X, columns=iris.feature_names)
df['Cluster'] = labels

Glimpse of cluster labels:

df.head()

Visualizing the clustering using first two features:

# Visualize the clusters
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, c='red')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-means Clustering')
plt.show()

Conclusion: The Iris dataset originally contains only three species of flowers, yet using all four input features the clustering model suggests six groups. So clustering, on its own, is not a reliable way to recover the original labels of the Iris dataset from the input features.
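
As an aside, GridSearchCV scores KMeans with its built-in score method (the negative inertia), which generally keeps improving as k grows, so the grid search above tends to favor the largest k in the grid. A cluster-quality metric such as the silhouette score is often a better guide for choosing k; the short sketch below reuses the X and KMeans objects loaded earlier (the loop bounds are illustrative):

# Optional: compare candidate values of k with the silhouette score
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init='auto', random_state=42)
    labels_k = km.fit_predict(X)
    print(k, silhouette_score(X, labels_k))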

4.1.3 2. Hierarchical Clustering

Hierarchical clustering is a type of clustering algorithm that organizes data points into a tree-like structure, known as a dendrogram. This method creates a hierarchy of clusters, revealing relationships and structures within the dataset. Hierarchical clustering can be either agglomerative (bottom-up) or divisive (top-down), and it does not require a predefined number of clusters.

4.1.3.1 Agglomerative Hierarchical Clustering:

Concept of Similarity and proximity-matrix:

To decide which clusters to merge, we take the distance between the clusters (for example, between their centroids). The pair of clusters with the smallest distance is considered the most similar and is merged first; in this sense hierarchical clustering is also a distance-based algorithm, since we are repeatedly calculating distances between clusters.

In hierarchical clustering, we use a concept called the proximity matrix, which stores the distance between each pair of points. Let’s take an example to understand this matrix as well as the steps to perform hierarchical clustering.

Example: Suppose a teacher wants to divide her students into different groups. She has the marks scored by each student in an assignment and based on these marks, she wants to segment them into groups. There’s no fixed target here as to how many groups to have. Since the teacher does not know what type of students should be assigned to which group, it cannot be solved as a supervised learning problem. So, we will try to apply hierarchical clustering here and segment the students into different groups.

Let’s take a sample of 5 students:

\[\begin{array}{|c|c|}\hline ID&Marks\\\hline 1&10\\\hline 2&7\\\hline 3&28\\\hline 4&20\\\hline 5&35\\\hline \end{array}\]

Creating a Proximity Matrix First, we will create a proximity matrix which will tell us the distance between each of these points. Since we are calculating the distance of each point from each of the other points, we will get a square matrix of shape n X n (where n is the number of observations).

Let’s make the 5 x 5 proximity matrix for our example:

\[\begin{array}{|c|c|c|c|c|c|}\hline &1&2&3&4&5\\\hline 1&0&3&18&10&25\\\hline 2&3&0&21&13&28\\\hline 3&18&21&0&8&7\\\hline 4&10&13&8&0&15\\\hline 5&25&28&7&15&0\\\hline \end{array}\]

The diagonal elements of this matrix will always be 0 as the distance of a point with itself is always 0. We will use the Euclidean distance formula to calculate the rest of the distances.
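
For readers who want to verify this matrix, it can be reproduced with SciPy's pairwise-distance utilities (a small sketch; because the marks are one-dimensional, the Euclidean distance reduces to the absolute difference between marks):

import numpy as np
from scipy.spatial.distance import pdist, squareform

marks = np.array([[10], [7], [28], [20], [35]])      # marks of the 5 students
proximity = squareform(pdist(marks, metric='euclidean'))
print(proximity)   # 5 x 5 symmetric matrix with zeros on the diagonal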

4.2 Steps to Perform Hierarchical Clustering

  1. Step 1: First, we assign all the points to an individual cluster:
  2. Step 2: Next, we will look at the smallest distance in the proximity matrix and merge the points with the smallest distance. We then update the proximity matrix:

Note: Here, the smallest distance is 3, so we merge points 1 and 2. We have taken the maximum of the two marks (7 and 10) as the representative mark for this cluster; we could instead take the minimum or the average. Now we recalculate the data and the proximity matrix for these clusters:

\[\begin{array}{|c|c|}\hline ID&Marks\\\hline (1,2)&10\\\hline 3&28\\\hline 4&20\\\hline 5&35\\\hline \end{array}\]

Updating proximity-matrix:

Now update the proximity matrix as:

\[\begin{array}{|c|c|c|c|c|}\hline ID&(1,2)&3&4&5\\\hline (1,2)&0&18&10&25\\\hline 3&18&0&8&7\\\hline 4&10&8&0&15\\\hline 5&25&7&15&0\\\hline \end{array}\]

Next level clusters:

Points 3 and 5 now have the smallest distance (7), so they form the next cluster. The updated data is:

\[\begin{array}{|c|c|}\hline ID&Marks\\\hline (1,2)&10\\\hline (3,5)&35\\\hline 4&20\\\hline \end{array}\]

New proximity-matrix:

\[\begin{array}{|c|c|c|c|}\hline ID&(1,2)&(3,5)&4\\\hline (1,2)&0&25&10\\\hline (3,5)&25&0&15\\\hline 4&10&15&0\\\hline \end{array}\]

The smallest distance in this matrix is 10, so cluster (1,2) merges with point 4 to form cluster (1,2,4). Updated data:

\[\begin{array}{|c|c|}\hline ID&Marks\\\hline (1,2,4)&20\\\hline (3,5)&35\\\hline \end{array}\]

Finally, the two remaining clusters, (1,2,4) and (3,5), merge into a single cluster containing all five students; the dendrogram records this entire sequence of merges.

4.2.1 Mathematical Formulation:

Given a dataset \(X = \{x_1, x_2, ..., x_n\}\) with \(n\) data points in a \(p\)-dimensional space, agglomerative hierarchical clustering proceeds as follows:

  1. Initialization:

    • Treat each data point as a singleton cluster, resulting in \(n\) initial clusters.
  2. Merge Step:

    • Identify the two clusters with the smallest dissimilarity or linkage and merge them into a new cluster.
    • The dissimilarity or linkage between two clusters can be measured using methods such as the following (a short code sketch of these criteria appears after this list):
      • Single Linkage: Minimum distance between any two points in the clusters.
      • Complete Linkage: Maximum distance between any two points in the clusters.
      • Average Linkage: Average distance between all pairs of points in the clusters.
  3. Update Dissimilarity Matrix:

    • Recalculate the dissimilarity or linkage between the new cluster and the remaining clusters.
  4. Repeat:

    • Repeat the Merge and Update steps until all data points belong to a single cluster, forming a dendrogram.
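
The short sketch below illustrates the three linkage criteria on two small, hypothetical 2-D clusters (the arrays A and B are made-up examples used only for illustration):

import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of points in 2-D space
A = np.array([[1.0, 2.0], [2.0, 1.5]])
B = np.array([[5.0, 6.0], [6.0, 5.5], [7.0, 6.5]])

D = cdist(A, B)            # all pairwise distances between the two clusters
single = D.min()           # single linkage: closest pair of points
complete = D.max()         # complete linkage: farthest pair of points
average = D.mean()         # average linkage: mean over all pairs
print(single, complete, average)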

4.2.2 Dendrogram:

A dendrogram visually represents the hierarchy of clusters and the order in which they are merged. The vertical lines in the dendrogram represent clusters, and the height at which two clusters merge indicates their dissimilarity.

4.2.3 Python code to implement Hierarchical Clustering.

Task: We return to the teacher’s problem described earlier: segment five students into groups based only on their assignment marks, with no predefined number of groups, using hierarchical clustering.

Let’s take a sample of 5 students:

\[\begin{array}{|c|c|}\hline ID&Marks\\\hline 1&10\\\hline 2&7\\\hline 3&28\\\hline 4&20\\\hline 5&35\\\hline \end{array}\]

#loading libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# marks of the 5 students
df = pd.DataFrame([10, 7, 28, 20, 35], columns=["Marks"])

# plot the dendrogram
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(df, method='ward'))
plt.axhline(y=3, color='r', linestyle='--')  # horizontal cut line across the dendrogram
plt.show()

# running agglomerative clustering with 2 clusters
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=2, linkage='ward')  # ward linkage uses Euclidean distances
print(cluster.fit_predict(df))  # cluster label for each student

4.2.4 Limitations:

  • The time complexity of agglomerative hierarchical clustering can be high, especially for large datasets.
  • It is sensitive to noise and outliers.

Despite its limitations, hierarchical clustering is valuable for revealing structures within the data, providing insights into the relationships between data points at different scales.

4.2.5 Pseudocode:

# Input
X = {x_1, x_2, ..., x_n}

# Initialization
Create n singleton clusters, one for each data point

# Repeat until a single cluster is formed
while number of clusters > 1:
  Find the two clusters with the minimum dissimilarity or linkage
  Merge the two clusters into a new cluster
  Update the dissimilarity matrix

# Dendrogram Visualization
Plot the dendrogram to visualize the hierarchy of clusters

4.3 Dimensionality Reduction

Dimensionality reduction is a critical technique in machine learning and data analysis, playing a pivotal role in feature engineering. As datasets grow in complexity with an increasing number of features, the curse of dimensionality becomes a significant challenge. Dimensionality reduction addresses this challenge by transforming high-dimensional data into a lower-dimensional representation, preserving essential information while improving computational efficiency and model performance.

4.3.1 The Challenge of High Dimensionality

In many real-world applications, datasets contain a large number of features, each representing a different aspect of the data. High-dimensional datasets can suffer from several issues, such as increased computational complexity, elevated risk of overfitting, and challenges in visualizing and interpreting data. Dimensionality reduction aims to mitigate these problems by extracting the most relevant information from the original features while reducing redundancy and noise.

4.3.2 Key Objectives of Dimensionality Reduction

  1. Computational Efficiency: High-dimensional datasets often require extensive computational resources. By reducing the number of features, dimensionality reduction accelerates the training and evaluation of machine learning models, making them more scalable and efficient.

  2. Overfitting Prevention: With an abundance of features, models can become overly complex and may fit noise in the data rather than capturing the underlying patterns. Dimensionality reduction helps prevent overfitting by focusing on the most significant features, enhancing a model’s generalization to new, unseen data.

  3. Improved Model Performance: Reducing dimensionality often results in a more concise and informative representation of the data. This can lead to improved model performance, especially in scenarios where the original feature space is noisy or redundant.

4.3.3 Role in Feature Engineering

Feature engineering involves the process of selecting, transforming, or creating features to enhance the performance of machine learning models. Dimensionality reduction is a crucial aspect of feature engineering as it allows practitioners to distill the most critical aspects of the data into a more manageable form. This distilled representation serves as a foundation for building more effective models, simplifying the feature space without sacrificing predictive power.

4.3.4 Common Dimensionality Reduction Techniques

  1. Principal Component Analysis (PCA): PCA is a widely-used linear technique that identifies orthogonal axes (principal components) along which the variance of the data is maximized. It projects the data onto a lower-dimensional subspace while retaining as much variance as possible.

  2. t-Distributed Stochastic Neighbor Embedding (t-SNE): t-SNE is a non-linear dimensionality reduction technique that excels in visualizing high-dimensional data in lower-dimensional spaces. It is particularly useful for exploring the local relationships between data points.

  3. Autoencoders: Autoencoders are neural network architectures that learn a compressed, meaningful representation of the input data. They consist of an encoder and a decoder, working together to reduce dimensionality while preserving important features.

Dimensionality reduction is a crucial tool in the realm of feature engineering. By transforming high-dimensional data into a more compact and informative representation, dimensionality reduction techniques contribute to more efficient, interpretable, and accurate machine learning models. This process is especially valuable when dealing with complex datasets where the sheer number of features poses computational and modeling challenges.

4.3.4.1 1. Principal Component Analysis

PCA transforms high-dimensional data into a lower-dimensional representation, capturing the most significant variance in the data. The resulting components, called principal components, provide a new basis for the data that emphasizes its essential features.

Mathematical Background: Given a dataset \(X\) with \(n\) data points, each of dimension \(p\), PCA aims to find a set of orthogonal vectors (principal components) that represent the directions of maximum variance in the data. The principal components are linear combinations of the original features. The steps involved in PCA are as follows:

Step 1: Data Mean-Centering

Mean-Centering: Subtract the mean of each feature from the dataset. \[\bar{x}_{ij} = x_{ij} - \mu_j, \qquad \mu_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}\] This ensures that the data is centered around the origin; the centered values \(\bar{x}_{ij}\) are the ones used in the covariance formula below.

Step 2: Covariance Matrix Calculation

Covariance Matrix (\(C\)): Calculate the covariance matrix for the mean-centered data. \[C_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} \bar{x}_{ij} \bar{x}_{ik}\]

Step 3: Eigen decomposition of Covariance Matrix

Eigendecomposition: Compute the eigenvectors (\(\mathbf{v}_i\)) and eigenvalues (\(\lambda_i\)) of the covariance matrix (\(C\)). \[C \mathbf{v}_i = \lambda_i \mathbf{v}_i\] Each \(\mathbf{v}_i\) represents a principal component direction.

Step 4: Principal Components Selection

Principal Components: Select the top \(k\) eigenvectors corresponding to the \(k\) largest eigenvalues to form the projection matrix (\(W\)).

  • Arrange the selected eigenvectors as columns in \(W\).

  • The data can be projected onto the lower-dimensional subspace using \(Z = X W\).

  • The eigenvalues \(\lambda_i\) represent the amount of variance captured by each principal component.

  • Larger eigenvalues correspond to more significant variability in the data.

  • Choosing \(k\) depends on the desired level of dimensionality reduction and the cumulative explained variance.
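
Putting the four steps together, here is a minimal NumPy sketch of PCA from scratch (a teaching illustration that follows the notation above; in practice one would normally use sklearn.decomposition.PCA, as in the demonstration that follows):

import numpy as np

def pca(X, k):
    """Project an (n, p) data matrix onto its top-k principal components."""
    # Step 1: mean-center each feature (column)
    X_centered = X - X.mean(axis=0)
    # Step 2: covariance matrix of the centered data, shape (p, p)
    C = np.cov(X_centered, rowvar=False)
    # Step 3: eigendecomposition (eigh is used because C is symmetric)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Step 4: keep the top-k eigenvectors and project: Z = X W
    W = eigvecs[:, :k]
    Z = X_centered @ W
    explained_ratio = eigvals[:k] / eigvals.sum()
    return Z, W, explained_ratio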

4.3.4.2 Demonstration of PCA on MNIST digit dataset

This section illustrates Principal Component Analysis on the MNIST handwritten digit dataset, fetched from OpenML. As a first step, load the necessary Python libraries and dependencies for this example.

from __future__ import print_function
import time
import numpy as np
import scipy
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

Next is loading the dataset from the repository and check the dimension.

mnist = fetch_openml('mnist_784', cache=False)
X = mnist.data / 255.0 # scaling of data
y = mnist.target
print(X.shape, y.shape)

We are going to convert the matrix and vector to a Pandas DataFrame. This is very similar to the DataFrames used in R and will make it easier for us to plot it later on.

df = pd.DataFrame(X)
#feat_cols = [ 'pixel'+str(i) for i in range(1,X.shape[1]) ]
feat_cols = df.columns.tolist()  # pixel column names, captured before the label columns are added
df['y'] =y
df['label'] = df['y'].apply(lambda i: str(i))
#X, y = None, None
print('Size of the dataframe: {}'.format(df.shape))

Also convert the target variable into int type.

df.y=df.y.astype(int)

Now just have a look at the cleaned dataset.

df.head()

Next stage is randomization and sampling (the systematic approach used in statistical analysis to avoid bias). Because we don’t want to use all 70,000 digits in every calculation, we’ll take a random subset of the digits. The randomization is important as the dataset is sorted by its label (i.e., the first seven thousand or so are zeros, etc.). To ensure randomization we’ll create a random permutation of the numbers 0 to 69,999, which allows us later to select the first five or ten thousand for our calculations and visualizations.

# For reproducibility of the results
np.random.seed(42)
rndperm = np.random.permutation(df.shape[0])
rndperm

We now have our DataFrame and our randomization vector. Let’s first check what these digits actually look like, as a quick sanity check of the sample. To do this we’ll generate 15 plots of randomly selected images.

plt.gray()
fig = plt.figure( figsize=(16,7) )
for i in range(1,16):
    ax = fig.add_subplot(3,5,i, title="Digit: {}".format(str(df.loc[rndperm[i],'label'])) )
    ax.matshow(df.loc[rndperm[i],feat_cols].values.reshape((28,28)).astype(float))
plt.show()

The images are all essentially 28-by-28 pixel images and therefore have a total of 784 dimensions, each holding the value of one specific pixel. What we can do is reduce the number of dimensions drastically whilst trying to retain as much of the ‘variation’ in the information as possible. This is where dimensionality reduction comes in. Let’s first take a look at Principal Component Analysis.

pca = PCA(n_components=3)
pca_result = pca.fit_transform(df[feat_cols].values)
df['pca-one'] = pca_result[:,0]
df['pca-two'] = pca_result[:,1] 
df['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

Now, given that the first two components capture only a modest share of the variation in the entire dataset (see the explained variance ratios printed above), let’s see whether that is enough to visually set the different digits apart. What we can do is create a scatterplot of the first and second principal component and color each of the different types of digits with a different color. If we are lucky the same type of digits will be positioned (i.e., clustered) together in groups, which would mean that the first two principal components actually tell us a great deal about the specific types of digits.

plt.figure(figsize=(16,10))
sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df.loc[rndperm,:],
    legend="full",
    alpha=0.3
)

From the graph we can see the two components definitely hold some information, especially for specific digits, but clearly not enough to set all of them apart. Luckily there is another technique that we can use to reduce the number of dimensions that may prove more helpful. For a 3D version of the same plot:

from mpl_toolkits import mplot3d
fig = plt.figure(figsize=(16,10))
ax = plt.axes(projection='3d')
ax.scatter(
    xs=df.loc[rndperm,:]["pca-one"], 
    ys=df.loc[rndperm,:]["pca-two"], 
    zs=df.loc[rndperm,:]["pca-three"], 
    c=df.loc[rndperm,:]["y"], 
    cmap='tab10'
)
ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.show()

In the next few paragraphs we are going to take a look at that technique and explore if it gives us a better way of reducing the dimensions for visualization. The method we will be exploring is known as t-SNE (t-Distributed Stochastic Neighbor Embedding).

4.3.4.3 2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

What is the power of t-SNE?

“t-Distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects and a distribution that measures pairwise similarities of the corresponding low-dimensional points in the embedding”.

Essentially, this means that t-SNE looks at the original data that is entered into the algorithm and determines how best to represent this data using fewer dimensions by matching the two distributions. The way it does this is computationally quite heavy, and therefore there are some (serious) limitations to the use of this technique. For example, one of the recommendations is that, in the case of very high-dimensional data, you may need to apply another dimensionality reduction technique before using t-SNE:

Drawback of t-SNE

“Since t-SNE scales quadratically in the number of objects N, its applicability is limited to data sets with only a few thousand input objects; beyond that, learning becomes too slow to be practical (and the memory requirements become too large)”.

4.3.4.4 Develop a t-SNE model in Python

Since t-SNE is computationally heavy, we first take a random subset of 10,000 of the 70,000 samples and then fit a two-dimensional t-SNE model on that subset.

# Take a random subset of the data; the full MNIST set is too slow for t-SNE
N = 10000
df_subset = df.loc[rndperm[:N],:].copy()
data_subset = df_subset[feat_cols].values

# Fit the t-SNE model
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne.fit_transform(data_subset)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

Now let’s compare t-SNE with a comparable PCA model fitted on the same subset. For that purpose, create a PCA instance too.

pca = PCA(n_components=3)
pca_result = pca.fit_transform(data_subset)
df_subset['pca-one'] = pca_result[:,0]
df_subset['pca-two'] = pca_result[:,1] 
df_subset['pca-three'] = pca_result[:,2]
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))

Now that we have the two resulting dimensions we can again visualise them by creating a scatter plot of the two dimensions and coloring each sample by its respective label.

df_subset['tsne-2d-one'] = tsne_results[:,0]
df_subset['tsne-2d-two'] = tsne_results[:,1]
plt.figure(figsize=(16,10))
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3
)

This is already a significant improvement over the PCA visualisation we used earlier. We can see that the digits are very clearly clustered in their own sub-groups. If we were now to use a clustering algorithm to pick out the separate clusters, we could probably quite accurately assign new points to a label. Just to compare PCA and t-SNE side by side:

plt.figure(figsize=(16,7))
ax1 = plt.subplot(1, 2, 1)
sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax1
)
ax2 = plt.subplot(1, 2, 2)
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax2
)

Model improvement stage: We’ll now take the recommendations to heart and actually reduce the number of dimensions before feeding the data into the t-SNE algorithm. For this we’ll use PCA again. We will first create a new dataset containing the fifty dimensions generated by the PCA reduction algorithm. We can then use this dataset to perform the t-SNE to improve performance.

pca_50 = PCA(n_components=50)
pca_result_50 = pca_50.fit_transform(data_subset)
print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))
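
Optionally, the cumulative explained-variance ratio can be plotted to see how the variance accumulates component by component (a small sketch reusing the pca_50 object fitted above):

plt.figure(figsize=(8, 4))
plt.plot(np.cumsum(pca_50.explained_variance_ratio_), marker='.')
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance ratio')
plt.title('Explained variance of the first 50 principal components')
plt.show()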

The first 50 components hold roughly 85% of the total variation in the data, so 50 dimensions are enough to capture most of the information. Now let’s feed this reduced data into the t-SNE algorithm. As before, we work with the 10,000-sample subset to make sure the algorithm does not take up too much memory and CPU.

time_start = time.time()
tsne = TSNE(n_components=2, verbose=0, perplexity=40, n_iter=300)
tsne_pca_results = tsne.fit_transform(pca_result_50)
print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

A final visualization comparing all of the models mentioned above is created using the following Python code.

df_subset['tsne-pca50-one'] = tsne_pca_results[:,0]
df_subset['tsne-pca50-two'] = tsne_pca_results[:,1]
plt.figure(figsize=(16,4))
ax1 = plt.subplot(1, 3, 1)
sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax1
)
ax2 = plt.subplot(1, 3, 2)
sns.scatterplot(
    x="tsne-2d-one", y="tsne-2d-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax2
)
ax3 = plt.subplot(1, 3, 3)
sns.scatterplot(
    x="tsne-pca50-one", y="tsne-pca50-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend="full",
    alpha=0.3,
    ax=ax3
)

In conclusion, a PCA + t-SNE combination performs well on many datasets with complex structure.

4.4 Anomaly Detection

Anomaly detection identifies data points that do not fit well with the rest of the data. It has a wide range of applications such as fraud detection, surveillance, diagnosis, data cleanup, and predictive maintenance. An interesting and comprehensive explanation of anomaly detection in a general setting is given in the following blog post:

https://iwringer.wordpress.com/2015/11/17/anomaly-detection-concepts-and-techniques/
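
As a minimal illustration of the idea (one possible approach among many, and not drawn from the blog post above), scikit-learn's IsolationForest can flag points that do not fit the bulk of the data; the synthetic data below is made up purely for demonstration:

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# 200 "normal" points around the origin plus a few obvious outliers
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_outliers = rng.uniform(low=6.0, high=9.0, size=(5, 2))
X_all = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.03, random_state=42)
pred = iso.fit_predict(X_all)      # +1 for inliers, -1 for detected anomalies
print("Number of detected anomalies:", (pred == -1).sum())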