Prediction and Feature Assessment
Chapter 1 Introduction
This script was written for the course on Analysis of High-Dimensional Data held at the University of Bern and the ETH Zurich. Much of the content is based on the book from Hastie, Tibshirani, and Friedman (2001). The course has a focus on applications using R (R Core Team 2023). All data sets used throughout the script can be downloaded from github.
What are high-dimensional data and what is high-dimensional statistics? The Statistics Department of the University of California, Berkeley summarizes it as follows:
High-dimensional statistics focuses on data sets in which the number of features is of comparable size, or larger than the number of observations. Data sets of this type present a variety of new challenges, since classical theory and methodology can break down in surprising and unexpected ways.
High-dimensional statistics is often paraphrased with the expression \(p>>n\) which refers to a linear regression setting where the number of covariates \(p\) is much larger than the number of samples \(n\). Nevertheless, challenges with standard statistical approaches already appear when \(p\) is comparable to \(n\) and we will see that in more complex set-ups issues due to high-dimensionality often manifest itself in more subtle manners.
High-dimensional data are omnipresent and the approaches which we will discuss find applications in many disciplines. In this course we will explore examples from molecular biology, health care, speech recognition and finance.
In this script we distinguish between two typical tasks of high-dimensional statistics. The first task considers many explanatory variables \(X_1,\ldots,X_p\) and one response variable \(Y\). The goal is to predict the response and to identify the most relevant covariates. We refer to this task as Prediction and Feature Selection. The second task considers one (or few) explanatory variable \(X\) and many response variables \(Y_1, \ldots, Y_p\). The aim is to identify those variables which differ with respect to \(X\) (e.g. treatment groups A vs B). We refer to this task as Feature Assessment.
The script starts with the multiple linear regression model and the least squares estimator. We discuss the limitations of least squares in the \(p>>n\) scenario, explain the challenge of overfitting, introduce the generalization error and elaborate on the bias-variance dilemma. We then discuss methods designed to overcome challenges in high dimensions. We start with subset- and stepwise regression and then discuss in more detail regularization methods, including Ridge regression, Lasso regression and Elasticnet regression. Next, we turn our attention to binary endpoints. In particular we discuss regularization in the context of the logistic regression model. We then talk about classification trees, Random Forest and AdaBoost. We then move to time-to-event endpoints and show how to extend the previously introduced methods to survival data. We introduce the Brier score to assess the generalization error in the survival context. The last section is devoted to high-dimensional feature assessment. Based on a differential gene expression example we will discuss the issue of multiple testing, introduce methods for p-value adjustment, and finally we will touch upon variance shrinkage.