Unsupervised Learning

Introduction

In machine learning, supervised learning refers to methods used to estimate the function \(f(X)\) that relates a group of predictors \(X\) to a measured outcome \(Y\). Unsupervised learning refers to methods that learn structure from the predictors \(X\) alone, because no outcome \(Y\) is observed.

In this module, we will cover several unsupervised learning methods, namely principal component analysis, k-means clustering, and hierarchical clustering.

By the end of this module you should be able to:

  1. Implement dimension reduction using principal component analysis
  2. Implement clustering methods

Dataset used in the examples


The dataset fat is available in the R package faraway. You have to install this package and then load it with library(faraway).

The data set contains several physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.

The data frame has 252 observations on the following 18 variables.

The data were generously supplied by Dr. A. Garth Fisher, Human Performance Research Center, Brigham Young University, Provo, Utah 84602, who gave permission to freely distribute the data and use them for non-commercial purposes. Reference to the data is made in Penrose, et al. (1985).

Variables:

  • brozek – Percent body fat using Brozek’s equation, 457/Density - 414.2
  • siri – Percent body fat using Siri’s equation, 495/Density - 450
  • density – Density (gm/cm^3)
  • age – Age (yrs)
  • weight – Weight (lbs)
  • height – Height (inches)
  • adipos – BMI Adiposity index = Weight/Height^2 (kg/m^2)
  • free – Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
  • neck – Neck circumference (cm)
  • chest – Chest circumference (cm)
  • abdom – Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
  • hip – Hip circumference (cm)
  • thigh – Thigh circumference (cm)
  • knee – Knee circumference (cm)
  • ankle – Ankle circumference (cm)
  • biceps – Extended biceps circumference (cm)
  • forearm – Forearm circumference (cm)
  • wrist – Wrist circumference (cm) “distal to the styloid processes”
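
The formulas in the list above can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the original dataset documentation: the function names and the density, weight and height values are hypothetical, chosen only for illustration. The unit conversions (1 lb = 0.45359237 kg, 1 inch = 0.0254 m) are needed because weight and height are recorded in pounds and inches while the adiposity index is in kg/m^2.

```python
LB_TO_KG = 0.45359237   # exact definition of the pound in kilograms
INCH_TO_M = 0.0254      # exact definition of the inch in metres


def brozek_percent_fat(density):
    """Percent body fat from body density, Brozek's equation."""
    return 457.0 / density - 414.2


def siri_percent_fat(density):
    """Percent body fat from body density, Siri's equation."""
    return 495.0 / density - 450.0


def fat_free_weight(weight_lbs, percent_fat):
    """Fat-free weight in lbs: (1 - fraction of body fat) * weight."""
    return (1.0 - percent_fat / 100.0) * weight_lbs


def adiposity_index(weight_lbs, height_inches):
    """BMI in kg/m^2 from weight in lbs and height in inches."""
    weight_kg = weight_lbs * LB_TO_KG
    height_m = height_inches * INCH_TO_M
    return weight_kg / height_m ** 2


# Hypothetical subject, for illustration only (not a row of the dataset).
density, weight, height = 1.05, 180.0, 70.0

brozek = brozek_percent_fat(density)
siri = siri_percent_fat(density)
free = fat_free_weight(weight, brozek)   # the dataset uses Brozek's formula here
adipos = adiposity_index(weight, height)

print(f"brozek={brozek:.2f} siri={siri:.2f} free={free:.2f} adipos={adipos:.2f}")
```

Because brozek, siri and free are all deterministic functions of density (and weight), they carry the same information as the outcome and would normally be left out when the body measurements are used as predictors.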


The dataset bdiag.csv contains quantitative information from digitized images of a diagnostic test (fine needle aspirate (FNA) test on breast mass) for the diagnosis of breast cancer. The variables describe characteristics of the cell nuclei present in the image.

Variables Information:

  • ID number
  • Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)
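
To make some of the geometric definitions above concrete, here is a minimal Python sketch that computes radius, perimeter, area and compactness for a contour given as a polygon. It is only an illustration under stated assumptions, not the feature-extraction code used to build the dataset: the function name contour_features is hypothetical, the centre is taken as the mean of the vertices, and the contour is a simple square rather than a real cell nucleus.

```python
import math


def contour_features(points):
    """Radius, perimeter, area and compactness of a closed polygonal contour.

    `points` is a list of (x, y) vertices in order; the contour is closed
    by joining the last vertex back to the first.
    """
    n = len(points)

    # Centre taken as the mean of the vertices (an assumption of this sketch).
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n

    # radius: mean of distances from the centre to points on the perimeter
    radius = sum(math.hypot(x - cx, y - cy) for x, y in points) / n

    perimeter = 0.0
    twice_area = 0.0
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        perimeter += math.hypot(x1 - x0, y1 - y0)   # length of this edge
        twice_area += x0 * y1 - x1 * y0             # shoelace formula term
    area = abs(twice_area) / 2.0

    # compactness: perimeter^2 / area - 1.0
    compactness = perimeter ** 2 / area - 1.0

    return {
        "radius": radius,
        "perimeter": perimeter,
        "area": area,
        "compactness": compactness,
    }


# Hypothetical contour: a square with side length 2.
square = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
features = contour_features(square)
print(features)
```

For a circle, perimeter^2 / area equals 4π, the smallest possible value for any closed shape, so compactness grows as the contour becomes less round.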

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
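
The field numbering described above can be written out explicitly. This is a minimal Python sketch that assumes field 1 is the ID number, field 2 is the diagnosis, and that within each block of ten the features appear in the order listed earlier; that ordering agrees with the three examples given (fields 3, 13 and 23) but is otherwise an assumption. The function name field_number and the labels "mean", "se" and "worst" are hypothetical and need not match the column names in bdiag.csv.

```python
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
STATISTICS = ["mean", "se", "worst"]

FIRST_FEATURE_FIELD = 3  # fields 1 and 2 are assumed to be ID and diagnosis


def field_number(feature, statistic):
    """1-based field number of a feature/statistic pair."""
    block = STATISTICS.index(statistic)   # 0 = mean, 1 = se, 2 = worst
    offset = FEATURES.index(feature)      # position within the block of ten
    return FIRST_FEATURE_FIELD + block * len(FEATURES) + offset


# Full layout: field number -> description, for all 30 features.
layout = {
    field_number(feature, statistic): f"{statistic} {feature}"
    for statistic in STATISTICS
    for feature in FEATURES
}

for number in sorted(layout):
    print(number, layout[number])
```

Under these assumptions the 30 features occupy fields 3 to 32, and the three versions of any one feature are always 10 fields apart.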

This database is also available through the UW CS ftp server (host ftp.cs.wisc.edu, directory math-prog/cpo-dataset/machine-learn/WDBC/).