Machine Learning for Biostatistics
Module 6
2022-10-05
Beyond Additivity
Introduction
So far, most of the methods that we have seen (with the exception of KNN) assume that the predictors have an additive effect on the outcome. We will now study tree-based, non-parametric methods to estimate \(f(\mathbf x)\) that do not impose additivity.
By the end of this module you should be able to:
- Construct regression and classification trees
- Use cross-validation to prune a tree
- Understand the advantages and disadvantages of trees
- Define bagging and boosting
- Fit random forests
Dataset used in the examples
The dataset triceps is available in the MultiKink package. You may run install.packages("MultiKink"), load the library with library(MultiKink), and then run data("triceps").
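The loading steps above can be sketched as a short script. This is a minimal sketch, assuming the MultiKink package has already been installed from CRAN; the requireNamespace() guard is an addition so that the script does not fail when the package is missing.

```r
# One-off installation (uncomment if the package is not yet installed):
# install.packages("MultiKink")

# Load the triceps data only if MultiKink is available on this machine
if (requireNamespace("MultiKink", quietly = TRUE)) {
  data("triceps", package = "MultiKink")

  # Inspect the structure: the variables age, lntriceps and triceps
  str(triceps)
}
```

Calling data() with the package argument loads the dataset without attaching the whole package, which is equivalent here to library(MultiKink) followed by data("triceps").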
The data are derived from an anthropometric study of 892 females under 50 years in three Gambian villages in West Africa. There are 892 observations on the following 3 variables:
- age - Age of respondents.
- lntriceps - Log of the triceps skinfold thickness.
- triceps - Triceps skinfold thickness.
The data in SA_heart.csv are a retrospective sample of males in a heart-disease high-risk region of the Western Cape, South Africa. There are roughly two controls per case of coronary heart disease (CHD).
Many of the CHD positive men have undergone blood pressure reduction treatment and other programs to reduce their risk factors after their CHD event. In some cases the measurements were made after these treatments. These data are taken from a larger dataset, described in Rousseauw et al, 1983, South African Medical Journal.
The data contain 462 observations on the following 10 variables:
- sbp - systolic blood pressure
- tobacco - cumulative tobacco (kg)
- ldl - low density lipoprotein cholesterol
- adiposity - a numeric vector
- famhist - family history of heart disease, a factor with levels Absent Present
- typea - type-A behavior
- obesity - a numeric vector
- alcohol - current alcohol consumption
- age - age at onset
- chd - response, coronary heart disease (1 - chd, 0 - no chd)
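Because chd is coded 0/1, it is usually convenient to recode it as a factor before fitting a classification tree, since tree-fitting functions in R commonly treat a numeric response as a regression problem. The following is a minimal sketch under stated assumptions: the file SA_heart.csv is assumed to sit in the working directory with column names matching the list above, and prepare_sa_heart() is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: recode the outcome and family history as factors
prepare_sa_heart <- function(df) {
  # 0 = no chd, 1 = chd, as in the variable description
  df$chd <- factor(df$chd, levels = c(0, 1), labels = c("no chd", "chd"))
  # famhist is described as a factor with levels Absent and Present
  df$famhist <- factor(df$famhist, levels = c("Absent", "Present"))
  df
}

path <- "SA_heart.csv"  # assumed location: the current working directory
if (file.exists(path)) {
  SA_heart <- prepare_sa_heart(read.csv(path))

  # Number of controls and cases
  print(table(SA_heart$chd))
}
```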
The file SBI.csv contains 2349 records of children admitted to the emergency room with fever and tested for serious bacterial infection (sbi). The following variables are included:
- id – patient’s number
- fever_hours – duration of the fever in hours
- age – child’s age
- sex – child’s sex (M / F)
- wcc – white cell count
- prevAB – previous antibiotics (Yes / No)
- sbi – serious bacterial infection (Not Applicable / UTI / Pneum / Bact)
- pct – procalcitonin
- crp – c-reactive protein