Machine Learning for Biostatistics
Module 3
2024-07-11
Resampling methods
Introduction
This module covers the bootstrap and cross-validation. These two resampling techniques are used to study sampling variability, evaluate model performance, and choose tuning parameters in many of the methods covered in this unit.
We reverse the order used in the book An Introduction to Statistical Learning: we start with the bootstrap and then proceed to cross-validation.
By the end of this module you should be able to:
- Compute standard errors for different statistics through bootstrapping
- Compute model performance statistics by cross-validation
- Use cross-validation to select tuning parameters such as the number of neighbours in KNN
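As a rough illustration of the first objective, the sketch below estimates bootstrap standard errors in Python using only the standard library. The function name `bootstrap_se`, its parameters, and the simulated values are assumptions made for this example. They are not part of the module materials, and the numbers do not come from bmd.csv.

```python
import random
import statistics


def bootstrap_se(values, statistic, n_boot=1000, seed=0):
    """Estimate the standard error of `statistic` by resampling with replacement.

    Hypothetical helper written for this illustration only.
    """
    rng = random.Random(seed)
    n = len(values)
    replicates = []
    for _ in range(n_boot):
        # Draw a bootstrap sample of the same size as the original data.
        resample = rng.choices(values, k=n)
        replicates.append(statistic(resample))
    # The standard deviation of the replicates approximates the standard error.
    return statistics.stdev(replicates)


# Simulated values for illustration only (NOT taken from bmd.csv).
data_rng = random.Random(42)
sample = [data_rng.gauss(0.8, 0.15) for _ in range(169)]

se_mean = bootstrap_se(sample, statistics.mean)
se_median = bootstrap_se(sample, statistics.median)
print(f"bootstrap SE of the mean:   {se_mean:.4f}")
print(f"bootstrap SE of the median: {se_median:.4f}")
```

For the mean, the bootstrap estimate should land close to the usual formula (sample standard deviation divided by the square root of n). The approach is more useful for statistics such as the median, where no simple formula is available.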
Dataset used in the examples
The file bmd.csv contains 169 records of bone densitometries (measurement of bone mineral density). The following variables were collected:
- id – patient’s number
- age – patient’s age
- fracture – hip fracture (fracture / no fracture)
- weight_kg – weight measured in kg
- height_cm – height measured in cm
- waiting_time – time the patient had to wait for the densitometry (in minutes)
- bmd – bone mineral density measured in the hip
The file SBI.csv contains the records of 2349 children admitted to the emergency room with fever and tested for serious bacterial infection (sbi). The following variables were collected:
- id – patient’s number
- fever_hours – duration of the fever in hours
- age – child’s age
- sex – child’s sex (M / F)
- wcc – white cell count
- prevAB – previous antibiotics (Yes / No)
- sbi – serious bacterial infection (Not Applicable / UTI / Pneum / Bact)
- pct – procalcitonin
- crp – C-reactive protein
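Before working through the examples, it can help to confirm that a downloaded copy of each file has the variables listed above. The following is a minimal sketch in Python. It assumes that each CSV file has a header row whose column names match the variable names exactly, which the actual files may not. The helper `missing_columns` and the dictionary `EXPECTED_COLUMNS` are hypothetical names introduced for this example.

```python
import csv
from pathlib import Path

# Variable names as listed in the descriptions above; that the CSV headers
# use exactly these names is an assumption.
EXPECTED_COLUMNS = {
    "bmd.csv": ["id", "age", "fracture", "weight_kg", "height_cm",
                "waiting_time", "bmd"],
    "SBI.csv": ["id", "fever_hours", "age", "sex", "wcc", "prevAB",
                "sbi", "pct", "crp"],
}


def missing_columns(path, expected):
    """Return the expected column names that are absent from the CSV header."""
    # utf-8-sig strips a byte-order mark if the file starts with one.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


for filename, expected in EXPECTED_COLUMNS.items():
    if Path(filename).exists():
        print(filename, "missing columns:", missing_columns(filename, expected))
    else:
        print(filename, "not found in the working directory")
```

An empty list means every listed variable was found. Extra columns in the file, such as an unnamed row index, are ignored by this check.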