Regression and Classification

Introduction

This module covers traditional methods for predicting continuous and categorical outcomes. In machine learning, these are known as regression methods (for continuous outcomes) and classification methods (for categorical outcomes). The terminology can be confusing: despite its name, logistic regression is a classification method under this convention, because in statistics “regression” refers to a broad family of models relating any type of outcome to independent variables.

In this module we will review linear regression and describe k-nearest neighbours (KNN) regression. For classification, we will explore three widely used classifiers: logistic regression, the KNN classifier and linear discriminant analysis.

By the end of this module you should be able to:

  1. Use linear regression for prediction
  2. Estimate the mean squared error of a predictive model (see the sketch after this list)
  3. Use KNN regression and the KNN classifier
  4. Use logistic regression as a classification algorithm
  5. Calculate the confusion matrix and evaluate classification performance
  6. Implement linear and quadratic discriminant analyses
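
The short sketch below illustrates these objectives on simulated data using Python and scikit-learn: fitting a linear regression, estimating its test mean squared error, and computing confusion matrices for a few classifiers. It is only a minimal illustration under assumed toy data and variable names; the module's own examples use the datasets described below and may use a different software environment.

    # Minimal sketch of the module's workflow on simulated data (illustration only).
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import mean_squared_error, confusion_matrix

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 2))                           # two simulated predictors
    y_cont = 2 * X[:, 0] - X[:, 1] + rng.normal(size=300)   # continuous outcome (regression)
    y_cat = (X[:, 0] + X[:, 1] > 0).astype(int)             # categorical outcome (classification)

    # Objectives 1-2: fit a linear regression and estimate its MSE on held-out data
    X_tr, X_te, y_tr, y_te = train_test_split(X, y_cont, test_size=0.3, random_state=1)
    lin_reg = LinearRegression().fit(X_tr, y_tr)
    print("test MSE:", mean_squared_error(y_te, lin_reg.predict(X_te)))

    # Objectives 3-6: fit classifiers and compute a confusion matrix for each
    X_tr, X_te, c_tr, c_te = train_test_split(X, y_cat, test_size=0.3, random_state=1)
    classifiers = {
        "logistic regression": LogisticRegression(),
        "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
        "LDA": LinearDiscriminantAnalysis(),
    }
    for name, clf in classifiers.items():
        pred = clf.fit(X_tr, c_tr).predict(X_te)
        print(name, "\n", confusion_matrix(c_te, pred))     # rows: true class, columns: predicted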

Datasets used in the examples

The file bmd.csv contains 169 bone densitometry records (measurements of bone mineral density). The following variables were collected:

  • id – patient’s number
  • age – patient’s age
  • fracture – hip fracture (fracture / no fracture)
  • weight_kg – weight measured in kg
  • height_cm – height measured in cm
  • waiting_time – time the patient had to wait for the densitometry (in minutes)
  • bmd – bone mineral density measured at the hip
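
Assuming the file is available locally under the name above, a short pandas sketch like the following could be used to load and inspect it; the column names are taken from the list above, and the file path is an assumption.

    # Load and inspect bmd.csv (path assumed; column names as listed above).
    import pandas as pd

    bmd = pd.read_csv("bmd.csv")
    print(bmd.shape)                       # expected: 169 rows
    print(bmd.dtypes)                      # variable types
    print(bmd[["age", "weight_kg", "height_cm", "bmd"]].describe())   # numeric summaries
    print(bmd["fracture"].value_counts())  # fracture vs no fracture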


The file SBI.csv contains 2349 records of children admitted to the emergency room with fever and tested for serious bacterial infection (sbi). The following variables are included:

  • id – patient’s number
  • fever_hours – duration of the fever in hours
  • age – child’s age
  • sex – child’s sex (M / F)
  • wcc – white cell count
  • prevAB – previous antibiotics (Yes / No)
  • sbi – serious bacterial infection (Not Applicable / UTI / Pneum / Bact)
  • pct – procalcitonin
  • crp – C-reactive protein
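
As with the previous file, a minimal pandas sketch for loading SBI.csv and tabulating the outcome might look as follows; the path is assumed and the column names come from the list above.

    # Load SBI.csv and summarise the outcome (path assumed; columns as listed above).
    import pandas as pd

    sbi_df = pd.read_csv("SBI.csv")
    print(sbi_df.shape)                  # expected: 2349 rows
    print(sbi_df["sbi"].value_counts())  # Not Applicable / UTI / Pneum / Bact

    # For a binary classification example, the outcome could be dichotomised into
    # "any serious bacterial infection" vs "none" (an illustrative choice, not a
    # preprocessing step prescribed by the module).
    sbi_df["any_sbi"] = (sbi_df["sbi"] != "Not Applicable").astype(int)
    print(sbi_df["any_sbi"].mean())      # proportion with any serious bacterial infection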


The dataset bdiag.csv contains quantitative information from digitized images of fine needle aspirate (FNA) tests on breast masses, used for the diagnosis of breast cancer. The variables describe characteristics of the cell nuclei present in the images.

Variable information:

  • ID number
  • Diagnosis (M = malignant, B = benign)

and ten real-valued features are computed for each cell nucleus:

  • radius (mean of distances from center to points on the perimeter)
  • texture (standard deviation of gray-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area - 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” - 1)

The mean, standard error and “worst” or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.
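
The field layout described above (mean, standard error and worst value for each of the ten base features) can be made concrete with a small sketch. The actual column names in bdiag.csv may differ, so the positional slicing below is an assumption based on the field numbers given.

    # Illustrate the layout of the 30 features (column positions assumed from the
    # field numbers above: columns 1-2 are ID and Diagnosis, then 10 means,
    # 10 standard errors and 10 "worst" values).
    import pandas as pd

    bdiag = pd.read_csv("bdiag.csv")
    ids_and_outcome = bdiag.iloc[:, :2]    # ID number and Diagnosis (M / B)
    mean_features = bdiag.iloc[:, 2:12]    # e.g. field 3 = Mean Radius
    se_features = bdiag.iloc[:, 12:22]     # e.g. field 13 = Radius SE
    worst_features = bdiag.iloc[:, 22:32]  # e.g. field 23 = Worst Radius

    print(mean_features.columns.tolist())               # check which columns were captured
    print(ids_and_outcome.iloc[:, 1].value_counts())    # counts of M (malignant) and B (benign)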

This database is also available through the UW CS FTP server ftp.cs.wisc.edu, in the directory math-prog/cpo-dataset/machine-learn/WDBC/.

Slides from the videos

You can download the slides used in the videos for Regression and Classification:

Slides