Machine Learning for Biostatistics
Module 2
2022-07-23
Regression and Classification
Introduction
This module covers traditional methods for predicting continuous and categorical outcomes. In machine learning, these are known as regression methods (for continuous outcomes) and classification methods (for categorical outcomes). The terminology can be confusing: despite its name, logistic regression is a classification method in machine learning, whereas in statistics "regression" refers to a broad family of models that relate an outcome of any type to independent variables.
In this module we review linear regression and describe k-nearest neighbours (KNN) regression. For classification, we explore three widely used classifiers: logistic regression, k-nearest neighbours and linear discriminant analysis.
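To make the terminology concrete, the sketch below shows how a logistic model produces a continuous probability that only becomes a class label once a cut-off is applied. The coefficients, the 0.5 threshold and the function names are illustrative assumptions, not values estimated from any dataset in this module.

```python
import math


def predicted_probability(x, intercept, slope):
    """Logistic model: P(y = 1 | x) = 1 / (1 + exp(-(intercept + slope * x)))."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))


def classify(x, intercept, slope, threshold=0.5):
    """Turn the continuous probability into a 0/1 class label."""
    return 1 if predicted_probability(x, intercept, slope) >= threshold else 0


# Hypothetical coefficients, chosen only for illustration.
intercept, slope = -1.0, 2.0
for x in (0.0, 0.5, 1.0):
    p = predicted_probability(x, intercept, slope)
    print(f"x={x:.1f}  probability={p:.3f}  class={classify(x, intercept, slope)}")
```

The model itself is a regression on the probability scale; the classification step is the thresholding, and a different cut-off would give different labels from the same fitted model.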
By the end of this module you should be able to:
- Use linear regression for prediction
- Estimate the mean squared error of a predictive model
- Use KNN regression and the KNN classifier
- Use logistic regression as a classification algorithm
- Calculate the confusion matrix and evaluate classification performance
- Implement linear and quadratic discriminant analyses
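Two of the quantities named in these objectives can be sketched in a few lines. The example below is a minimal illustration, assuming plain Python lists of observed and predicted values; the function names and the toy values are hypothetical and are not taken from the module's datasets.

```python
from collections import Counter


def mean_squared_error(observed, predicted):
    """Average of the squared differences between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def confusion_matrix(observed, predicted):
    """Count how often each (observed, predicted) pair of labels occurs."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return Counter(zip(observed, predicted))


# Toy values, for illustration only.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])

cm = confusion_matrix(["yes", "no", "yes", "no"], ["yes", "no", "no", "no"])
# Accuracy is the share of cases on the diagonal of the confusion matrix.
accuracy = (cm[("yes", "yes")] + cm[("no", "no")]) / sum(cm.values())
print(mse, dict(cm), accuracy)
```

Keeping the confusion matrix as counts of label pairs makes it straightforward to derive other summaries, such as sensitivity and specificity, from the same object.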
Datasets used in the examples
The file bmd.csv contains 169 records of bone densitometries (measurement of bone mineral density). The following variables were collected:
- id – patient’s number
- age – patient’s age
- fracture – hip fracture (fracture / no fracture)
- weight_kg – weight measured in kg
- height_cm – height measured in cm
- waiting_time – time the patient had to wait for the densitometry (in minutes)
- bmd – bone mineral density measured at the hip
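As a rough sketch of how this file might be read, the function below loads the records with Python's standard csv module and checks that the variables listed above are present. It assumes the CSV header uses exactly these names and that the five numeric variables can be parsed as numbers; load_bmd is a hypothetical helper name, not part of the course material.

```python
import csv

# Variable names as listed above; assumed to match the CSV header exactly.
BMD_COLUMNS = ["id", "age", "fracture", "weight_kg", "height_cm", "waiting_time", "bmd"]
NUMERIC_COLUMNS = ["age", "weight_kg", "height_cm", "waiting_time", "bmd"]


def load_bmd(path):
    """Read the file into a list of dicts, converting numeric columns to float."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        # Extra columns are tolerated; only missing ones raise an error.
        missing = set(BMD_COLUMNS) - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        records = []
        for row in reader:
            for name in NUMERIC_COLUMNS:
                row[name] = float(row[name])
            records.append(row)
    return records
```

Calling load_bmd("bmd.csv") should return one dictionary per patient, so the length of the result can be compared with the 169 records stated above.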
The file SBI.csv contains 2349 records of children admitted to the emergency room with fever and tested for serious bacterial infection (sbi). The following variables are included:
- id – patient’s number
- fever_hours – duration of the fever in hours
- age – child’s age
- sex – child’s sex (M / F)
- wcc – white cell count
- prevAB – previous antibiotics (Yes / No)
- sbi – serious bacterial infection (Not Applicable / UTI / Pneum / Bact)
- pct – procalcitonin
- crp – C-reactive protein
The dataset bdiag.csv contains quantitative information from digitized images of a diagnostic test (fine needle aspirate (FNA) test on breast mass) for the diagnosis of breast cancer. The variables describe characteristics of the cell nuclei present in the image.
Variables Information:
- ID number
- Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension (“coastline approximation” - 1)
For each image, the mean, the standard error and the “worst” value (the mean of the three largest values) of each of these features were computed, giving 30 features in total. For instance, field 3 is Mean Radius, field 13 is Radius SE and field 23 is Worst Radius.
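The field numbering can be reconstructed with a short sketch. It assumes that fields 1 and 2 are the ID and the diagnosis, and that the remaining 30 fields come in three blocks of ten (means, standard errors, worst values) with the features in the order listed above, which is consistent with the three examples given. The column names it generates are hypothetical and may differ from the header of bdiag.csv.

```python
# Feature order follows the list above; names are illustrative assumptions.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry", "fractal_dimension",
]
SUMMARIES = ["mean", "se", "worst"]


def wdbc_field_names():
    """Return the 32 field names: ID, diagnosis, then 10 features x 3 summaries."""
    names = ["id", "diagnosis"]
    for summary in SUMMARIES:
        names.extend(f"{feature}_{summary}" for feature in FEATURES)
    return names


fields = wdbc_field_names()
# Field numbers in the text are 1-based, Python lists are 0-based.
for number in (3, 13, 23):
    print(number, fields[number - 1])
```

Under these assumptions the three printed names correspond to Mean Radius, Radius SE and Worst Radius, matching the field numbers quoted in the text.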
This database is also available through the UW CS FTP server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/