Regularisation

Introduction

In this module we cover model selection and regularisation methods (also called penalisation methods), namely ridge and lasso. We start with classical algorithms for model selection, such as best subset selection and stepwise (backward and forward) selection. We then introduce the bias-variance trade-off and the motivation for ridge regression. Finally, we discuss lasso regression and some of its extensions.

By the end of this module you should be able to:

  1. Implement the best subset and stepwise algorithms
  2. Calculate the ridge estimator in generalised linear models (GLMs)
  3. Perform variable selection using lasso in GLMs
  4. Explain the differences between maximum likelihood estimation and regularisation

Datasets used in the examples


The dataset fat is available in the R package faraway, which must be installed before it can be loaded with library(faraway).

The data set contains several physical measurements of 252 males. Most of the variables can be measured with a scale or tape measure. Can they be used to predict the percentage of body fat? If so, this offers an easy alternative to an underwater weighing technique.

The data frame has 252 observations on the 18 variables listed below.

The data were generously supplied by Dr. A. Garth Fisher, Human Performance Research Center, Brigham Young University, Provo, Utah 84602, who gave permission to freely distribute the data and use them for non-commercial purposes. Reference to the data is made in Penrose, et al. (1985).

Variables:

  • brozek – Percent body fat using Brozek’s equation, 457/Density - 414.2
  • siri – Percent body fat using Siri’s equation, 495/Density - 450
  • density – Density (gm/cm^3)
  • age – Age (yrs)
  • weight – Weight (lbs)
  • height – Height (inches)
  • adipos – BMI Adiposity index = Weight/Height^2 (kg/m^2)
  • free – Fat Free Weight = (1 - fraction of body fat) * Weight, using Brozek’s formula (lbs)
  • neck – Neck circumference (cm)
  • chest – Chest circumference (cm)
  • abdom – Abdomen circumference (cm) “at the umbilicus and level with the iliac crest”
  • hip – Hip circumference (cm)
  • dthigh – Thigh circumference (cm)
  • knee – Knee circumference (cm)
  • ankle – Ankle circumference (cm)
  • biceps – Extended biceps circumference (cm)
  • forearm – Forearm circumference (cm)
  • wrist – Wrist circumference (cm) “distal to the styloid processes”
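
As a quick check of the two body fat equations and the fat free weight definition in the list above, the sketch below evaluates them in base R. The function names brozek_fat and siri_fat, the density of 1.05 and the weight of 180 lbs are assumptions chosen only for illustration; they are not part of the faraway package or taken from the data.

```r
# Percent body fat from body density, using the two equations
# given in the variable list. Function names are illustrative.
brozek_fat <- function(density) 457 / density - 414.2
siri_fat   <- function(density) 495 / density - 450

# Hypothetical density value, used only to show the arithmetic
density <- 1.05

bf_brozek <- brozek_fat(density)
bf_siri   <- siri_fat(density)

# Fat free weight as defined for the variable `free`,
# for a hypothetical weight of 180 lbs
weight <- 180
free_weight <- (1 - bf_brozek / 100) * weight

round(c(brozek = bf_brozek, siri = bf_siri, free = free_weight), 2)
```

For this density both equations give roughly 21% body fat, which shows why brozek and siri are close to interchangeable as response variables.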


The dataset Prostate is available in the package lasso2 and contains information on 97 men who were about to receive a radical prostatectomy. The level of prostate specific antigen (PSA) is used as a diagnostic test for prostate cancer. In addition to PSA, the dataset contains a number of other clinical measures.

The data frame has the following variables:

  • lcavol - log(cancer volume)
  • lweight - log(prostate weight)
  • age - age
  • lbph - log(benign prostatic hyperplasia amount)
  • svi - seminal vesicle invasion
  • lcp - log(capsular penetration)
  • gleason - Gleason score
  • pgg45 - percentage Gleason scores 4 or 5
  • lpsa - log(prostate specific antigen)


The dataset lowbwt.csv was collected in a study aiming to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Variables:

  • id - Identification Code
  • low - Low Birth Weight (0 = Birth Weight >= 2500g, 1 = Birth Weight < 2500g)
  • age - Age of the Mother in Years
  • lwt - Mother’s Weight in Pounds at the Last Menstrual Period
  • race - Race (1 = White, 2 = Black, 3 = Other)
  • smoke - Smoking Status During Pregnancy (1 = Yes, 0 = No)
  • plt - History of Premature Labor (0 = None, 1 = One, etc.)
  • ht - History of Hypertension (1 = Yes, 0 = No)
  • ui - Presence of Uterine Irritability (1 = Yes, 0 = No)
  • ftv - Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.)
  • bwt - Birth Weight in Grams
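
The coding of low and race described in the list above can be sketched in base R. The toy data frame below is made up for illustration and is not taken from lowbwt.csv; it only shows how low relates to bwt and how the numeric race codes could be turned into a labelled factor before modelling.

```r
# Toy data (hypothetical values, not from lowbwt.csv)
toy <- data.frame(
  bwt  = c(2400, 2500, 3100, 1900),
  race = c(1, 2, 3, 1)
)

# low = 1 if birth weight < 2500g, 0 otherwise
toy$low <- as.integer(toy$bwt < 2500)

# race is a code, not a quantity, so treat it as a factor
toy$race <- factor(toy$race, levels = 1:3,
                   labels = c("White", "Black", "Other"))

toy
```

Because low is a deterministic function of bwt, the two should not appear together in the same model, one as response and the other as predictor.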

SOURCE: Hosmer and Lemeshow (2000) Applied Logistic Regression: Second Edition. These data are copyrighted by John Wiley & Sons Inc. and must be acknowledged and used accordingly. Data were collected at Baystate Medical Center, Springfield, Massachusetts during 1986.


The dataset bdiag.csv contains quantitative information from digitized images of a diagnostic test (fine needle aspirate (FNA) test on breast mass) for the diagnosis of breast cancer. The variables describe characteristics of the cell nuclei present in the image.

Variables Information:

  • ID number
  • Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
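
The field numbering in the paragraph above can be reproduced with a short base R sketch. It assumes that fields 1 and 2 hold the ID and the diagnosis, and that the 30 features follow as ten means, then ten standard errors, then ten "worst" values, which is the order implied by the three examples given. The column names (for example radius_mean) are an assumed naming convention and may differ from the headers in bdiag.csv.

```r
# The ten features, in the order listed above
features <- c("radius", "texture", "perimeter", "area", "smoothness",
              "compactness", "concavity", "concave_points",
              "symmetry", "fractal_dimension")
stats <- c("mean", "se", "worst")

# expand.grid varies the first argument fastest, so rows 1-10 are the
# means, rows 11-20 the standard errors and rows 21-30 the "worst" values
layout <- expand.grid(feature = features, stat = stats,
                      stringsAsFactors = FALSE)

# Fields 1 and 2 are assumed to be the ID and the diagnosis
layout$field <- seq_len(nrow(layout)) + 2

# Assumed naming convention, e.g. "radius_mean"
layout$name <- paste(layout$feature, layout$stat, sep = "_")

layout[layout$feature == "radius", ]
```

The three rows printed for radius correspond to fields 3, 13 and 23, matching the example in the text.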

This database is also available through the UW CS ftp server (ftp ftp.cs.wisc.edu, then cd math-prog/cpo-dataset/machine-learn/WDBC/).

Slides

You can download the slides used in the videos for the regularisation module: