Chapter 1 Motivation, Introduction and Review

1.1 Modelling Functional Relationships

In many cases in statistics, we wish to describe a functional relationship between two quantities. These types of relationships are very common; indeed they are almost what we mean when we talk of cause and effect. Mathematically speaking, they mean that the value of one variable, \(Y\), taking values in a space \(\mathcal{Y}\), depends on the value of another variable, \(X\), taking values in a space \(\mathcal{X}\), only via the value of a function \(f\) of \(X\).

The extreme case of this situation is where the value \(f(x)\) determines the value of \(Y\) exactly, in which case we might as well say that \(y = f(x)\). Since we are doing statistics, we will write this in terms of a probability distribution. For discrete \(\mathcal{Y}\), \[\begin{equation} P_{}\left(y | x, f\right) = \begin{cases} 1 & \text{if $y = f(x)$} \\ 0 & \text{otherwise} \end{cases} \tag{1.1} \end{equation}\] If \(\mathcal{Y}\) is continuous, the idea is the same but a little more complicated.

In general, however, knowing \(f(x)\) will not be sufficient to determine the value of \(Y\) with certainty; there will be uncertainty due to other quantities, known or unknown, that affect the value. The uncertainty in our knowledge of the value of \(Y\) caused by these other effects is described by a probability distribution \[\begin{equation} P_{}\left(y |x, f, K\right) = P_{}\left(y |f(x), K\right) \tag{1.2} \end{equation}\] that nevertheless only depends on \(f(x)\).

The variable \(f(X)\) thus acts as a parameter for the distribution for \(Y\): different values \(x\) of \(X\) lead to different distributions. However, this does not seem sufficient to capture the idea of a functional relationship. For example, we could have a Gaussian distribution for \(Y\) whose variance was given by a function of \(X\), but we would not regard this as expressing a functional relationship between \(X\) and \(Y\). We therefore need some further constraint on the relationship between the value of the function and the probability distribution.

One way to think of such a constraint is that the function output should be a prediction of the value of \(Y\), based on the probability distribution. The natural concept of a prediction for the value of \(Y\) is simply the mean of the distribution, \({\mathrm E}[Y |f, x]\). This then leads to a very important equation, one that defines the models to be discussed in this course: we will say that a probability distribution for \(Y\), \(P_{}\left(y |f, x\right)\), describes a functional relationship between \(X\) and \(Y\) when \[\begin{equation} {\mathrm E}[Y |f, x] = f(x) \tag{1.3} \end{equation}\] In other words, instead of the value of \(Y\) being given by \(f(x)\) as in the deterministic case of Equation (1.1), rather the expectation of the value of \(Y\) under the probability distribution is given by \(f(x)\).
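To check that the deterministic case is consistent with this definition: for discrete \(\mathcal{Y}\), Equation (1.1) gives \[\begin{equation} {\mathrm E}[Y |f, x] = \sum_{y \in \mathcal{Y}} y \, P_{}\left(y | x, f\right) = f(x) \end{equation}\] since the only value of \(y\) receiving nonzero probability is \(y = f(x)\). Equation (1.3) thus strictly generalises exact functional dependence.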

1.1.1 Choice of Function \(f\)

The function \(f\) will come from some given space of functions \(\mathcal{F}\), whose elements map \(\mathcal{X}\) to the space containing possible values of \({\mathrm E}[Y |f, x]\), which in our case will always be \({\mathbb R}\): that is, the elements of \(\mathcal{F}\) are real-valued functions of \(X\). However, in most cases, we do not know which function in \(\mathcal{F}\) is involved; indeed the main task is usually to estimate this function given some examples of input/output pairs \(\left\{(x_{i}, y_{i})\right\}_{i \in [1..n]}\). Thus, the function too is a variable, \(F\), taking values in \(\mathcal{F}\).

There are thus several ingredients to this type of modelling:

  • the input predictor space \(\mathcal{X}\) and output response space \(\mathcal{Y}\);

  • the space of possible functions \(\mathcal{F}\);

  • the probability distribution, \(P_{}\left(y |f(x), K\right)\), satisfying \({\mathrm E}[Y |f, x] = f(x)\).

1.1.2 Example: Linear Models

Consider the standard linear model (reviewed in detail in Section 1.8): \[ y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \ldots + \beta_p x_p +\epsilon \] Here, \(y\) is the response variable, the \(x_j\)’s are the predictor variables and \(\epsilon\) is an error term.

Linear models correspond to the following ingredients.

  • The space of responses \(\mathcal{Y} = {\mathbb R}\); the space of predictors \(\mathcal{X} = {\mathbb R}^p\).

  • The space of possible functions \(\mathcal{F}\) used in linear models is generated by the linear maps \(f: \mathcal{X}\to{\mathbb R}\), i.e. from \({\mathbb R}^p\) to \({\mathbb R}\): \[\begin{equation} \mathcal{F} = \left\{f: \mathcal{X}\to {\mathbb R} :\,f(\boldsymbol{x}) = \boldsymbol{\beta}^{T}\boldsymbol{x}\right\} \end{equation}\] where \(\boldsymbol{x} = (x_1, x_2, \ldots, x_p)\) with \(x_1 = 1\) fixed to accommodate an intercept, and necessarily \(\boldsymbol{\beta}\in{\mathbb R}^p\).

  • The probability distribution is Gaussian: \[\begin{equation} P_{}\left(y |\boldsymbol{x}, f, K\right) = {\mathcal N}(y; f(\boldsymbol{x}), \sigma^{2}) \end{equation}\] Note that this guarantees that Equation (1.3) is satisfied.

While linear models offer significant inferential tools, the use of a Gaussian distribution is restrictive: we necessarily must have \(\mathcal{Y} = {\mathbb R}\), and the linear functions have a range which, in principle, is also all of \({\mathbb R}\). Many quantities must be represented by variables with more limited ranges, or even a discrete set of values, and linear models do not seem suited to these cases.

1.2 Example Datasets

We introduce some of the datasets used throughout the (first half of the) course.


  • Can we fit a standard linear model to these datasets?
  • If not, why not?
  • What might we do instead?

1.2.1 Dataset A: Birthweight Data

Consider the dataset bpd from the R library SemiPar (type ?bpd for information).

library( "SemiPar" )
data( "bpd" )

This dataset is from a study of low-birthweight infants in a neonatal intensive care unit, with data collected on \(n=223\) babies by day 28 of life. We model bpd as the response, with 1 indicating presence and 0 indicating absence of Bronchopulmonary dysplasia (BPD), taking birthweight as the covariate. First, let’s plot the data.

plot( bpd )

Figure 1.1: Plot of presence of BPD against birthweight for the bpd dataset.


  • How does the probability of the development of BPD depend on birthweight?

Response Type

  • bpd
  • \(\mathcal{Y} = \{0,1\}\)
  • \({\mathrm E}[Y |f, x] \in [0,1]\)
  • \(P_{}\left(y |f, x\right)\)……..Bernoulli?
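As a preview of the kind of model developed later in the course, a Bernoulli response whose mean must lie in \([0,1]\) can be fitted in R with glm and family = binomial. The sketch below uses simulated data as a stand-in for the bpd dataset; the variables x and y are illustrative, not the SemiPar column names.

```r
# Simulated stand-in: binary response whose success probability
# decreases with the covariate, as BPD risk does with birthweight
set.seed(1)
n <- 223
x <- runif(n, 400, 1700)            # covariate on a birthweight-like scale
p <- plogis(4 - 0.004 * x)          # true success probabilities in (0, 1)
y <- rbinom(n, size = 1, prob = p)  # Bernoulli response

# Logistic regression keeps every fitted mean inside [0, 1]
fit <- glm(y ~ x, family = binomial)
range(fitted(fit))
```

Fitting the same model to the real data would simply replace x and y by the corresponding columns of bpd.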

1.2.2 Dataset B: US Polio Data

Consider the polio data, loaded from its R library:

library( "" )
data( "polio" )

This dataset is a matrix of count data, giving the monthly number of polio cases in the United States from 1970 to 1983. We wish to convert it into a usable time series: a matrix with two columns, with

  • covariate time in the first column ranging from 1 to 168, starting with January 1970.
  • response cases in the second column, indicating the monthly number of polio cases.

We do this as follows

uspolio <- matrix( c( 1:168, t( polio ) ), ncol = 2 )
colnames( uspolio ) <- c("time", "cases")

First, let’s plot the data.

plot( uspolio, type = "h" )

Figure 1.2: Plot of monthly reported number of Polio cases in the US between 1970 and 1983.


  • How has Polio incidence changed over time?

Response Type

  • cases
  • \(\mathcal{Y} = \mathbb{N}\)
  • \({\mathrm E}[Y |f, x] \in {\mathbb R}_{\geq 0}\)
  • \(P_{}\left(y |f, x\right)\)……..Poisson?
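Similarly, for a count response, a Poisson model with a log link guarantees a non-negative fitted mean. A minimal sketch with simulated counts (illustrative only, not the polio data):

```r
# Simulated monthly counts with a slowly decaying positive mean
set.seed(1)
time  <- 1:168
mu    <- exp(1.5 - 0.01 * time)     # the mean of a count must be positive
cases <- rpois(length(time), mu)

# Poisson regression: E[Y | x] = exp(b0 + b1 * x) >= 0 by construction
fit <- glm(cases ~ time, family = poisson)
range(fitted(fit))
```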

1.2.3 Dataset C: Hospital Stay Data

The data, from R package npmlreg, are a subset of a larger data set collected on persons discharged from a selected Pennsylvania hospital as part of a retrospective chart review of antibiotic use in hospitals. Let’s load the data and then generate a plot.


Figure 1.3: Scatter plots of length of hospital stay against three of the most significant covariates.


  • How does the duration of a patient’s hospital stay vary, based on data available at the time of admission?

Response Type

  • duration
  • \(\mathcal{Y} = {\mathbb R}_{\geq 0}\)
  • \({\mathrm E}[Y |f, x] \in {\mathbb R}_{\geq 0}\)
  • \(P_{}\left(y |f, x\right)\)……..Gamma?

1.3 Types of Variables

Before going on, it is useful to consider the different types of variables we might encounter. Variables take values in sets with different structures:

  • pure sets,
  • topological algebras (such as the real numbers \({\mathbb R}\)),
  • subsets of such sets (such as the positive real numbers \({\mathbb R}^+\)).

The distinctions between these types of variable are important for modelling purposes, and so here we introduce some nomenclature, described in Figure 1.4.


Figure 1.4: Diagram of the different types of variables.

1.4 Course Content

In these lectures, we are going to broaden the range of functional relationships that we can model.

Categorical Data Analysis: we will study the use of

  • Contingency Tables: measuring association between categorical variables.

  • Log-Linear Models: modelling count data for categorical variables represented in contingency tables.

Generalised Linear Models: a broad class of models permitting

  • the space of responses \(\mathcal{Y}\) to be a subset of the real numbers, e.g. positive real numbers, including countable subsets such as \(\mathbb{N}\) or the set \(\left\{0, 1\right\}\).

  • the larger set \(\mathcal{F}\) of functions formed by composing the linear functions used in linear models with nonlinear maps from \({\mathbb R}\to{\mathbb R}\).

  • general families of probability distributions, the exponential dispersion families, chosen because the relationship between their parameters and their means allows Equation (1.3) to be imposed effectively.

We will then go on to develop the standard tools of classical statistics:

  1. Estimators for the parameters of the model: we will use maximum likelihood.

  2. Sampling distributions for these estimators, either exactly or approximately, and especially their expectations and variances.

  3. Statistical tests and confidence intervals; predictions with corresponding prediction intervals;

  4. Measures of “goodness-of-fit” for model selection.

1.5 Key References

You are encouraged to read widely, expanding your knowledge beyond the notes contained here.

Useful references for the first half of this course are the following:

Kateri (2014) - Contingency Table Analysis - Methods and Implementation Using R, by Maria Kateri.

Tutz (2012) - Regression for Categorical Data, by Gerhard Tutz.

Agresti (2019) - An Introduction to Categorical Data Analysis, by Alan Agresti.

Bilder and Loughin (2015) - Analysis of Categorical Data with R, by Christopher Bilder and Thomas Loughin.

Faraway (2016) - Extending The Linear Model with R - Generalized Linear, Mixed Effects and Nonparametric Regression Models, by Julian Faraway.

You can explore the references contained within these books (and these lecture notes) for further information, and take yourselves on a journey of exploration.

1.6 Acronyms

A reference to some of the acronyms used during the course:

  • AIC - Akaike Information Criterion
  • ANOVA - Analysis of Variance
  • BIC - Bayesian Information Criterion
  • BPD - Bronchopulmonary dysplasia (specific to bpd example dataset)
  • CDF - Cumulative Distribution Function
  • CI - Confidence Interval
  • CLT - Central Limit Theorem
  • DSSC - Data Science and Statistical Computing
  • EDF - Exponential Dispersion Family
  • EF - Exponential Family
  • GLM - Generalised Linear Model
  • GLR - Generalised Likelihood Ratio
  • HLLM - Hierarchical Log Linear Model
  • i.i.d. - independent and identically distributed
  • IWLS - Iterative Weighted Least Squares
  • LLM - Log Linear Model
  • LR - Likelihood Ratio
  • ML - Maximum Likelihood
  • MLE - Maximum Likelihood Estimator/Estimation
  • PDF - Probability Density Function
  • RSS - Residual Sum of Squares
  • SSE - Sum of Squares of the Error (equivalent to RSS)
  • SSR - Sum of Squares of the Regression
  • SST - Sum of Squares Total

1.7 Statistical Inference II: A Review

This section summarises some of the key topics from Statistical Inference II. Review the corresponding lecture notes (particularly the Chapter on Likelihood Inference) for further detail.

1.7.1 Statistical Inference

  • A statistical model (also known as a family of distributions) is a set of distributions (or densities) \(M\).

  • A parametric model is any model \(M\) that can be parameterised by a finite number of parameters.

1.7.2 Likelihood Inference

Maximum Likelihood Estimation

  • Suppose data \(\boldsymbol{X}=(X_1,...,X_n)^T\) has joint density \(f(\boldsymbol{x};\theta)\). Given a statistical model parameterised by a fixed and unknown scalar \(\theta \in \Omega \subseteq {\mathbb R}\), the likelihood function, \(L(\theta)\), is the probability (density) of the observed data considered as a function of \(\theta\) \[\begin{equation} L(\theta) = L(\theta; \boldsymbol{x}) = f(\boldsymbol{x}; \theta) \end{equation}\] and the log-likelihood function is its logarithm \[\begin{equation} l(\theta) \equiv l(\theta;\boldsymbol{x}) = \log L(\theta) = \log f(\boldsymbol{x};\theta) \end{equation}\]

  • The Maximum Likelihood Estimate (MLE) \(\hat{\theta}\) for \(\theta\) given the data \(\boldsymbol{x}\) is the value of \(\theta\) which maximises the likelihood over the parameter space \[\begin{equation} \hat{\theta} = {\arg \max}_{\theta \in \Omega} l(\theta) = {\arg \max}_{\theta \in \Omega} L(\theta) \end{equation}\]

  • If \(X_1,...,X_n\) are i.i.d. then \[\begin{align} L(\theta) & = f(x_1,...,x_n; \theta) = \prod_{i=1}^n f(x_i;\theta) \\ l(\theta) & = \log \prod_{i=1}^n f(x_i;\theta) = \sum_{i=1}^n \log f(x_i;\theta) \end{align}\]

Properties of the MLE

  • If \(\hat{\theta}\) is the MLE of \(\theta\) and \(\phi = g(\theta)\) is a function of \(\theta\), then \(\hat{\phi} = g(\hat{\theta})\).

  • A statistic \(S = S(\boldsymbol{X})\) is a sufficient statistic for \(\theta\) if the conditional distribution of \(\boldsymbol{X} |S = s\) does not depend on \(\theta\).

  • Factorisation Theorem: A statistic \(S = S(\boldsymbol{X})\) is sufficient for \(\theta\) if and only if there exist functions \(g(S,\theta)\) and \(h(\boldsymbol{X})\) such that for all \(\boldsymbol{X}\) and \(\theta\): \[\begin{equation} f(\boldsymbol{X};\theta) = g(S, \theta) \, h(\boldsymbol{X}) \end{equation}\] where \(h(\boldsymbol{X})\) does not depend on \(\theta\).

  • If \(S=S(\boldsymbol{X})\) is a sufficient statistic of \(\boldsymbol{x}\) for \(\theta\), and the MLE \(\hat{\theta}\) of \(\theta\) exists, then \[\begin{equation} \hat{\theta}(\boldsymbol{x}) = \hat{\theta}(s(\boldsymbol{x})) \end{equation}\]
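As a worked illustration of these definitions (a standard textbook example, not tied to a particular dataset), consider an i.i.d. sample \(x_1,\ldots,x_n\) from a Poisson distribution with parameter \(\theta\). The log-likelihood is \[\begin{equation} l(\theta) = \sum_{i=1}^n \log \frac{e^{-\theta}\theta^{x_i}}{x_i!} = -n\theta + \left(\sum_{i=1}^n x_i\right)\log\theta - \sum_{i=1}^n \log(x_i!) \end{equation}\] Setting \(l'(\theta) = -n + \frac{1}{\theta}\sum_{i=1}^n x_i = 0\) gives the MLE \(\hat{\theta} = \bar{x}\). Writing \(S = \sum_{i=1}^n x_i\), the factorisation \[\begin{equation} f(\boldsymbol{x};\theta) = \underbrace{e^{-n\theta}\theta^{S}}_{g(S,\theta)} \; \underbrace{\prod_{i=1}^n \frac{1}{x_i!}}_{h(\boldsymbol{x})} \end{equation}\] shows that \(S\) is sufficient for \(\theta\), and indeed \(\hat{\theta} = S/n\) depends on the data only through \(S\).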

1.7.3 (Frequentist) Hypothesis Testing

Hypotheses

  • A hypothesis is any statement about the probability model \(M = \{f(\boldsymbol{x} |\theta) |\theta \in \Omega \}\) generating the data. When the probability distribution of the model is fixed, then a hypothesis is simply a statement about the parameter \(\theta\).

  • A hypothesis is simple if it completely specifies the associated probability distribution, \(f(\boldsymbol{x} |\theta)\). Otherwise, a hypothesis is called composite (or compound).

  • The null hypothesis, \(\mathcal{H}_0\), is a conservative hypothesis, not to be rejected unless evidence from the data is clear. The alternative hypothesis, \(\mathcal{H}_1\), specifies a particular departure from the null hypothesis that is of interest.

  • A hypothesis test is a procedure for deciding whether to reject \(\mathcal{H}_0\) or not.

Hypothesis Test

  • We define a test by a critical or rejection region, \(\mathcal{R}\), a subset of the sample space \(\mathcal{X}\) such that:

    • If \(\boldsymbol{x} = (x_1,...,x_n) \in \mathcal{R}\), then we say that the test rejects \(\mathcal{H}_0\).
    • If \(\boldsymbol{x} = (x_1,...,x_n) \notin \mathcal{R}\), then the test fails to reject \(\mathcal{H}_0\).
  • A test statistic, \(T\), is a statistic derived from the sample used in hypothesis testing. The sampling distribution of \(T\) under the null hypothesis is known as the null distribution.

  • We can define the rejection region in terms of values of the test statistic \[\begin{equation} \mathcal{R} = \{ \boldsymbol{x} |T(\boldsymbol{x}) \in \mathcal{T} \} \end{equation}\] for some set of possible values \(\mathcal{T}\). Usually, once we have a test statistic, we ignore the original rejection region and simply focus on whether or not \(T\) lies in \(\mathcal{T}\).

  • Assuming \(\mathcal{H}_0\) is true, we can construct the sampling distribution for \(T\), \(f_T (t)\) – known as the null distribution for \(T\) – and define \(\mathcal{R}\) as corresponding to the set of values of \(T\) which are “sufficiently unlikely” to occur under \(\mathcal{H}_0\) to cause us to reject \(\mathcal{H}_0\).

Significance, Power and \(p\)-values

  • For a simple null hypothesis, the significance level of a test is the probability \[\begin{equation} \alpha = P(\textrm{reject} \, \mathcal{H}_0 |\mathcal{H}_0 \, \textrm{true}) \end{equation}\] Then \[\begin{equation} \mathcal{R} = \{\boldsymbol{x} |P(T(\boldsymbol{x}) \in \mathcal{T} |\mathcal{H}_0) \leq \alpha \} \end{equation}\]

  • Two errors can arise in hypothesis testing:

    1. Reject \(\mathcal{H}_0\) when \(\mathcal{H}_0\) true - a type I error, occurs with probability \(\alpha\). If \(\mathcal{H}_0\) is simple, this is the usual significance level.
    2. Do not reject \(\mathcal{H}_0\) when \(\mathcal{H}_1\) true - a type II error, occurs with probability \(\beta\).
  • Given an observed sample \(\boldsymbol{X} = \boldsymbol{x}\) and a simple null hypothesis \(\mathcal{H}_0\), the \(p\)-value or observed significance is the smallest value of \(\alpha\) for which we would reject the null hypothesis \(\mathcal{H}_0\) on the basis of the data \(\boldsymbol{x}\).

  • A \(p\)-value is the probability of observing a test statistic \(T\) “at least as extreme” as the one observed, \(t\), under the null distribution: \[\begin{equation} p=P(T \, \textrm{“at least as extreme as”} \, t |\mathcal{H}_0) \end{equation}\] What “extreme” is depends on the nature of the alternative hypothesis \(\mathcal{H}_1\) and the \(p\)-value quantifies the extent to which the null hypothesis is violated.

  • There is a duality of confidence intervals and hypothesis tests.

    • A \(100(1 - \alpha)\%\) confidence region for a parameter \(\theta\) consists of all of those values of \(\theta_0\) for which the hypothesis that \(\theta=\theta_0\) will not be rejected at level \(\alpha\).
    • Equivalently, the hypothesis that \(\theta = \theta_0\) is not rejected at level \(\alpha\) if \(\theta_0\) lies in the corresponding confidence region.

Likelihood Ratio Tests

  • Given a sample \(\boldsymbol{X} = (X_1,..., X_n)\) and the general hypotheses \(\mathcal{H}_0\) and \(\mathcal{H}_1\), the distribution of \(\boldsymbol{X}\) under \(\mathcal{H}_0\) is the null distribution denoted \(f_0(\boldsymbol{X})\) and the distribution under \(\mathcal{H}_1\) is the alternative distribution \(f_1(\boldsymbol{X})\).

  • The likelihood ratio (LR) for comparing two simple hypotheses \(\mathcal{H}_0: \boldsymbol{\theta} = \boldsymbol{\theta}_0\) and \(\mathcal{H}_1: \boldsymbol{\theta} = \boldsymbol{\theta}_1\) is \[\begin{equation} \lambda(\boldsymbol{X}) = \frac{L(\boldsymbol{\theta}_1;\boldsymbol{X})}{L(\boldsymbol{\theta}_0;\boldsymbol{X})} \end{equation}\]

  • The generalised likelihood ratio (GLR) test of \(\mathcal{H}_0: \boldsymbol{\theta} \in \Omega_0\) against \(\mathcal{H}_1: \boldsymbol{\theta} \in \Omega_1\), where \(\Omega_0 \cap \Omega_1 = \emptyset\) and \(\Omega_0 \cup \Omega_1 = \Omega\) computes \[\begin{equation} \Lambda(\boldsymbol{X}) = \frac{\max_{\boldsymbol{\theta} \in \Omega_0} L(\boldsymbol{\theta};\boldsymbol{X})}{\max_{\boldsymbol{\theta} \in \Omega} L(\boldsymbol{\theta};\boldsymbol{X})} \end{equation}\] and rejects \(\mathcal{H}_0\) when \(\Lambda(\boldsymbol{X})<c\) where \(c\) is chosen such that \(P(\Lambda(\boldsymbol{X}) \leq c |\mathcal{H}_0) = \alpha\).

  • Wilks’ Theorem: Consider an i.i.d. sample \(\boldsymbol{X} = (X_1,...,X_n)\) from \(f(X_i |\boldsymbol{\theta})\), where \(\boldsymbol{\theta}\) contains \(k\) unknown parameters. Suppose that \(\mathcal{H}_0\) specifies the values for \(\nu = k - k_0\) of the parameters, so that \(k_0\) parameters are still unknown. Then, under some regularity conditions and given \(\mathcal{H}_0\), as \(n \rightarrow \infty\): \[\begin{equation} W = -2 \log \Lambda(\boldsymbol{X}) \dot{\sim} \chi^2_\nu \end{equation}\] Thus for large \(n\) a GLR test will reject \(\mathcal{H}_0\) at approximate significance level \(\alpha\) if \[\begin{equation} -2 \log \Lambda(\boldsymbol{X}) \geq \chi^{2,\star}_{\nu, \alpha} \end{equation}\]
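The GLR recipe can be checked numerically. The sketch below (simulated data, illustrative only) tests \(\mathcal{H}_0: \lambda = 1\) for i.i.d. Poisson data, where the unrestricted MLE is \(\hat\lambda = \bar{x}\) and Wilks’ theorem gives a \(\chi^2_1\) reference distribution:

```r
set.seed(1)
x <- rpois(100, lambda = 1.3)                # simulated sample; H0: lambda = 1

loglik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

lambda_hat <- mean(x)                        # unrestricted MLE
W <- -2 * (loglik(1) - loglik(lambda_hat))   # -2 log Lambda(X), always >= 0

# Approximate p-value from the chi-squared(1) reference distribution
p_value <- pchisq(W, df = 1, lower.tail = FALSE)
c(W = W, p = p_value)
```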

1.8 Statistical Modelling II: A Review

This section summarises some of the key topics from Statistical Modelling II. Review the corresponding lecture notes for further detail.

1.8.1 Supervised and Unsupervised Learning

  • Unsupervised Learning: investigate properties of a data structure \(\boldsymbol{Z}\). Sections 2 and 3 may be seen as unsupervised learning.

  • Supervised Learning: investigate how \(\boldsymbol{Z}\) affects response variable \(\boldsymbol{Y}\). This is the main focus of this course.

1.8.2 Linear Models

  • The linear model in matrix form is given by \[\begin{equation} \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{1.4} \end{equation}\] where
    • \(\boldsymbol{Y}^T = (y_1,...,y_n)\) is the vector of responses,
    • \(\boldsymbol{X}\) is the design matrix,
    • \(\boldsymbol{\beta}^T = (\beta_1,...,\beta_p)\) is the \(p\)-dimensional parameter vector,
    • \(\boldsymbol{\epsilon} = (\epsilon_1,...,\epsilon_n)\) is the vector of errors.
  • Taking the \(i^{\textrm{th}}\) row of Equation (1.4), one can represent \[\begin{equation} y_i = \sum_{j=1}^p \beta_j x_{ij} + \epsilon_i = \boldsymbol{x}_i^T\boldsymbol{\beta} + \epsilon_i \end{equation}\]

Note that the predictor variables \(x_j,j = 1,...,p\) (each of which corresponds to a column of \(\boldsymbol{X}\)) may be transformations or functions of the covariate variables which constitute the data matrix \(\boldsymbol{Z}\). In other words, \(\boldsymbol{X}\) is not necessarily equal to \([1,\boldsymbol{Z}]\).

Assumptions

  • (A1): Linearity: \({\mathrm E}[\epsilon_i] = 0\), or \({\mathrm E}[y_i |\boldsymbol{x}_i] = \boldsymbol{x}_i^T \boldsymbol{\beta}\)
  • (A2’): Homoscedasticity and (A2”): Independence: \({\mathrm{Cov}}[\epsilon_i,\epsilon_j] = \mathcal{I}_{i=j} \sigma^2\)
  • (A3): Normality: \(\epsilon_i\) is normally distributed.

In other words: \[\begin{equation} \boldsymbol{Y} \sim {\mathcal N}_n(\boldsymbol{X} \boldsymbol{\beta}, \sigma^2 I) \end{equation}\]

These assumptions can be diagnosed using the diagnostics discussed in Section 1.8.6.

Estimation of Model Parameters

  • LS estimates \(\hat{\boldsymbol{\beta}}\) minimise \[\begin{equation} R(\boldsymbol{\beta}) = \boldsymbol{\epsilon}^T \boldsymbol{\epsilon} = (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta})^T (\boldsymbol{Y} - \boldsymbol{X}\boldsymbol{\beta}) \end{equation}\] and satisfy \[\begin{equation} \hat{\boldsymbol{\beta}} = (\boldsymbol{X}^T\boldsymbol{X})^{-1}(\boldsymbol{X}^T\boldsymbol{Y}) \end{equation}\]

  • An unbiased estimate of \(\sigma^2\), based on the residuals \(\hat{\epsilon}_i\), is \[\begin{equation} s^2 = \frac{\boldsymbol{\hat{\epsilon}}^T\boldsymbol{\hat{\epsilon}}}{n-p} \end{equation}\]

Properties of \(\hat{\boldsymbol{\beta}}\) and \(s^2\)

  • \({\mathrm E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}\)

  • \({\mathrm{Var}}[\hat{\boldsymbol{\beta}}] = \sigma^2(\boldsymbol{X}^T\boldsymbol{X})^{-1}\)

  • \({\mathrm{Var}}[\boldsymbol{c}^T\hat{\boldsymbol{\beta}}] = \sigma^2 \boldsymbol{c}^T(\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{c}\)

  • The standard deviation (SD) of \(\boldsymbol{c}^T\hat{\boldsymbol{\beta}}\), as a measure of the precision of the estimate \(\boldsymbol{c}^T\hat{\boldsymbol{\beta}}\), is only useful if the value of \(\sigma\) is known. When it is not known, we replace \(\sigma\) by its estimate \(s\), obtaining the standard error of \(\boldsymbol{c}^T\hat{\boldsymbol{\beta}}\): \[\begin{equation} \textrm{SE}[\boldsymbol{c}^T\hat{\boldsymbol{\beta}}] = s \sqrt{\boldsymbol{c}^T(\boldsymbol{X}^T\boldsymbol{X})^{-1} \boldsymbol{c}} \end{equation}\]

  • \({\mathrm E}[s^2] = \frac{{\mathrm E}[\boldsymbol{\hat{\epsilon}}^T\boldsymbol{\hat{\epsilon}}]}{n-p} = \sigma^2\)

Sampling Distribution

  • The sampling distribution of a parameter estimator is the probability distribution of the estimator when repeatedly drawing samples of the same size from the population and observing the value of the estimator for each sample.

  • Assuming (A1)-(A3), then \[\begin{equation} \boldsymbol{\hat\beta} \sim {\mathcal N}_p(\boldsymbol{\beta}, \sigma^2(\boldsymbol{X}^T\boldsymbol{X})^{-1}) \end{equation}\] and \[\begin{equation} \boldsymbol{c}^T \boldsymbol{\hat\beta} \sim {\mathcal N}_1(\boldsymbol{c}^T \boldsymbol{\beta}, \sigma^2\boldsymbol{c}^T(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{c}) \end{equation}\]

  • However, \(\sigma^2\) is usually unknown. We have that \[\begin{equation} \frac{1}{\sigma^2} (n-p) s^2 \sim \chi^2_{n-p} \end{equation}\] Therefore, \[\begin{equation} \frac{\boldsymbol{c}^T\boldsymbol{\hat\beta} - \boldsymbol{c}^T \boldsymbol{\beta}}{\textrm{SE}(\boldsymbol{c}^T\boldsymbol{\hat\beta})} \sim \frac{{\mathcal N}(0,1)}{\sqrt{\frac{\chi^2_{n-p}}{n-p}}} \equiv t_{n-p} \end{equation}\] In particular \[\begin{equation} \frac{\hat\beta_j - \beta_j}{\textrm{SE}(\hat\beta_j)} \sim t_{n-p} \end{equation}\]

Confidence Intervals

  • A \((1-\alpha)\) level confidence interval (CI) for population parameter \(\theta\) is an interval calculated from sample \(\boldsymbol{Y}^T = (y_1,...,y_n)\) which contains \(\theta\) with sampling probability \(1-\alpha\), in the sense that if we take very many samples, and form an interval from each sample, then proportion \(1-\alpha\) will contain \(\theta\).

  • A \(1-\alpha\) CI for \(\boldsymbol{c}^T\boldsymbol{\beta}\) is given by \[\begin{equation} \boldsymbol{c}^T\boldsymbol{\hat\beta} \pm t_{n-p,\frac{\alpha}{2}} \textrm{SE}(\boldsymbol{c}^T\boldsymbol{\hat\beta}) \end{equation}\]

Hypothesis Testing

  • Hypothesis Test: reject \(\mathcal{H}_0: \beta_j = \beta_j^0\) at significance level \(\alpha\) if \[\begin{equation} T = \left| \frac{\hat\beta_j - \beta_j^0}{\textrm{SE}(\hat\beta_j)} \right| > t_{n-p,\frac{\alpha}{2}} \end{equation}\] in favour of \(\mathcal{H}_1: \beta_j \neq \beta_j^0\), or equivalently if \(p<\alpha\), where \[\begin{equation} p = P(|t_{n-p}|>T_{\textrm{observed}}) \end{equation}\] is the \(p\)-value.

Prediction

  • For a new predictor \(\boldsymbol{x}_0^T = (x_{01},...,x_{0p})\), the predicted value \(\hat{y}_0\) is \[\begin{equation} \widehat{{\mathrm E}[y_0 |\boldsymbol{x}_0]} = \hat{y}_0 = \boldsymbol{x}_0^T \boldsymbol{\hat\beta} \end{equation}\] since the error term has expectation zero.

  • Since \(Y_0\) is independent of \(\hat{Y}_0\), we have that \[\begin{equation} {\mathrm{Var}}[Y_0 - \hat{Y}_0] = {\mathrm{Var}}[Y_0] + {\mathrm{Var}}[\hat{Y}_0] = \sigma^2(1 + \boldsymbol{x}_0^T(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{x}_0) \end{equation}\]
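The least-squares and standard-error formulae above can be checked numerically against R’s lm on simulated data (a sketch; the variable names are illustrative):

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)

X   <- cbind(1, x1, x2)                       # design matrix with intercept column
fit <- lm(y ~ x1 + x2)

# beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# s^2 = residual sum of squares / (n - p)
p  <- ncol(X)
s2 <- sum(residuals(fit)^2) / (n - p)

all.equal(as.numeric(beta_hat), unname(coef(fit)))   # TRUE
all.equal(s2, summary(fit)$sigma^2)                  # TRUE
```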

1.8.3 Factors

  • Categorical covariates are called factors.

  • For inclusion into a linear model, factors need to be coded. For factor \(\mathcal{A}\) with levels \(1,...,a\), we define \[\begin{equation} x_i^{\mathcal{A}} = \mathcal{I}_{\mathcal{A}=i} \end{equation}\]

  • Including all the indicators in the linear model, one gets the unconstrained model \[\begin{equation} {\mathrm E}[y |\mathcal{A}] = \beta_0 + \beta_1 x_1^{\mathcal{A}} + ... + \beta_a x_a^{\mathcal{A}} \end{equation}\]

  • We need constraints on the parameters to give us a constrained model. Popular constraints include:

    • set \(\beta_i = 0\) for a single \(i\); level \(i\) then takes the role of a reference category represented by the intercept. For example, if we set \(\beta_1 = 0\) we have \[\begin{equation} {\mathrm E}[y |\mathcal{A}] = \beta_0 + \beta_2 x_2^{\mathcal{A}} + ... + \beta_a x_a^{\mathcal{A}} \end{equation}\]
    • The zero-sum constraint: \(\sum_{i=1}^a \beta_i = 0\).
  • Under one of these constraints, \(X\) effectively becomes an \(n\times a\) matrix with rank \(a\), so \(X^TX\) is invertible, and \(\beta\) becomes an \(a\)-vector.

Additive Model

  • Additive effect of \(\mathcal{A}\) and \(\mathcal{B}\): \[\begin{equation} y_{ijk} = \mu + \tau_i^{\mathcal{A}} + \tau_j^{\mathcal{B}} + \epsilon_{ijk} \end{equation}\] where
    • \(\tau_i^{\mathcal{A}}\) is the main effect of level \(i\) of factor \(\mathcal{A}, i = 1,...,a\)
    • \(\tau_j^{\mathcal{B}}\) is the main effect of level \(j\) of factor \(\mathcal{B}, j = 1,...,b\)
    • \(\mu\) is the intercept term
    • \(\epsilon_{ijk}\) is the error of the \(k^{\textrm{th}}\) replicate of treatment \((i,j)\).
  • Constraints: \(\tau_1^{\mathcal{A}} = \tau_1^{\mathcal{B}} = 0\).

Interaction Model

  • We have \[\begin{equation} y_{ijk} = \mu + \tau_i^{\mathcal{A}} + \tau_j^{\mathcal{B}} + \tau_{i,j}^{\mathcal{A,B}} + \epsilon_{ijk} \end{equation}\] where \(\tau_{i,j}^{\mathcal{A,B}}\) is the interaction effect of level \(i\) of factor \(\mathcal{A}\) with level \(j\) of factor \(\mathcal{B}\).

  • In data there will virtually always be non-zero estimates of \(\tau_{i,j}^{\mathcal{A,B}}\), so we often wish to assess whether or not they are significantly different from zero (and thus worth including in the model).
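R constructs these codings automatically via model.matrix; by default, the first level of a factor is the reference category (\(\beta_1 = 0\)). A small sketch with two illustrative factors:

```r
# Two illustrative factors with the default treatment coding in R,
# which takes the first level as the reference category
A <- factor(rep(c("a1", "a2", "a3"), times = 4))
B <- factor(rep(c("b1", "b2"), each = 6))

# Additive model: intercept plus indicators for the non-reference levels
X_add <- model.matrix(~ A + B)
colnames(X_add)    # "(Intercept)" "Aa2" "Aa3" "Bb2"

# Interaction model adds products of those indicators
X_int <- model.matrix(~ A * B)
colnames(X_int)
```

Note that X_add has full column rank, so \(X^TX\) is invertible, as discussed above.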

1.8.4 Analysis of Variance (ANOVA)

  • We have, assuming (A1)-(A3) and that the model has an intercept, \[\begin{equation} SST = SSR + SSE \end{equation}\] where
    • \(SST = \sum_{i=1}^n (y_i - \bar{y})^2\)
    • \(SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2\)
    • \(SSR = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\)
  • \(R^2 = \frac{SSR}{SST}\) is the coefficient of determination, with \(0 \leq R^2 \leq 1\) since \(SSR \leq SST\). This will always increase as we add more terms to the model.

  • \(R_{\textrm{adj}}^2 = 1 - \frac{n-1}{n-p-1} (1 - R^2)\) is known as the adjusted \(R\)-squared. This quantity penalises large values of \(p\).

  • Sequential ANOVA involves comparing \(m\) nested models \(M_1 \subset ... \subset M_m\), with design matrices \(X_1,...,X_m\), where \(X_j\) is \(n \times p_j\) and \(X_{j+1}\) is obtained by adding \(p_{j+1} - p_j\) columns to \(X_j\), using a series of partial \(F\)-tests.
    Recall that model \(M_j\) is nested in \(M_{j+1}\) if they can be written as \[\begin{align} M_j: & \qquad {\mathrm E}[y |\mathcal{A}] = \beta_1 + \beta_2 x_2^{\mathcal{A}} + ... + \beta_{p_j}x_{p_j} \\ M_{j+1}: & \qquad {\mathrm E}[y |\mathcal{A}] = \beta_1 + \beta_2 x_2^{\mathcal{A}} + ... + \beta_{p_j}x_{p_j} + \beta_{p_{j+1}}x_{p_{j+1}} \end{align}\]
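In R, anova applied to a single fitted lm performs exactly this sequential comparison, adding terms in the order given. A sketch on simulated data:

```r
# Simulated data: only x1 has a real effect on y
set.seed(1)
n  <- 60
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.8 * x1 + rnorm(n)

# anova() on one fit gives the sequential (Type I) decomposition:
# x1 first, then x2 given x1
fit <- lm(y ~ x1 + x2)
anova(fit)

# The final comparison, as an explicit partial F-test of nested models
fit0 <- lm(y ~ x1)
anova(fit0, fit)
```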

1.8.5 Model Selection

  • Model selection involves finding a good submodel by selecting a subset of indices \(I\) of those terms to be included in the submodel, with \(D\) the remainder to be deleted, so that \[\begin{equation} \textrm{E}[y |x] = \boldsymbol{x}_I^T \beta_I + \boldsymbol{x}_D^T \beta_D \end{equation}\] and \[\begin{equation} \textrm{E}_I[y |x] = \boldsymbol{x}_I^T \beta_I \end{equation}\]

  • A good submodel will have

    • \(RSS_I\) “not much larger” than RSS.
    • \(p_I\) as small as possible.
  • Selection criteria include
    • Mallows’ \(C_I\), which seeks to minimise \[\begin{equation} C_I = \frac{RSS_I}{s^2} + 2p_I - n \end{equation}\]
    • AIC, which seeks to minimise \[\begin{equation} AIC_I = - 2 l(\boldsymbol{\hat\beta}_I;Y) + 2p_I \end{equation}\]
  • Selection methods include forward selection, backward elimination and stepwise selection.
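As an illustration of the AIC criterion, the sketch below (Python; the Gaussian log-likelihood profiled over \(\sigma^2\) stands in for \(l(\boldsymbol{\hat\beta}_I;Y)\), and the data are simulated — both assumptions, not part of the notes) compares an intercept-only submodel with a linear one.

```python
# Compare two nested submodels by AIC = -2*loglik + 2*p, using the
# Gaussian log-likelihood profiled over sigma^2 (an assumption here).
import numpy as np

def aic(X, y):
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta)**2)
    loglik = -n / 2 * (np.log(2 * np.pi * rss / n) + 1)
    return -2 * loglik + 2 * X.shape[1]

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 30)
y = 2 + 3 * x + rng.normal(scale=0.1, size=30)    # truth is linear

aic_const = aic(np.ones((30, 1)), y)              # intercept-only model
aic_lin = aic(np.column_stack([np.ones(30), x]), y)

assert aic_lin < aic_const                        # the linear model is preferred
```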

1.8.6 Diagnostics

  • Residuals are used to check linear model assumptions.

  • Influential observations are those cases having a particularly large impact on \(\boldsymbol{\hat\beta}\), \(s^2\) or both.

  • We have that \[\begin{equation} \hat{\boldsymbol{Y}} = \boldsymbol{H}\boldsymbol{Y} = \boldsymbol{X} \boldsymbol{\beta} + \boldsymbol{H} \boldsymbol{\epsilon} \end{equation}\] where \(\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{X}^T\) is the \(n \times n\) hat matrix.

Leverage Values and Studentised Residuals

  • \(h_i = H_{ii} = \boldsymbol{x}_i^T(\boldsymbol{X}^T\boldsymbol{X})^{-1}\boldsymbol{x}_i\) is called the leverage value of \(y_i\), and measures the impact of case \(i\) on its own fitted value.

  • We have that \[\begin{equation} \hat{\boldsymbol{\epsilon}} = (\boldsymbol{I}_n - \boldsymbol{H}) \boldsymbol{\epsilon} \end{equation}\] If (A1) - (A3) hold, then \(\hat{\boldsymbol{\epsilon}} \sim {\mathcal N}_n(0, \sigma^2(\boldsymbol{I}_n-\boldsymbol{H}))\).

  • Studentised residuals are given by \[\begin{equation} r_i = \frac{\hat{\epsilon}_i}{s \sqrt{1-h_i}} \end{equation}\]

  • Internally studentised residuals are when all data, including case \(i\), are used to estimate \(\sigma\), and externally studentised residuals are when case \(i\) is removed to estimate \(\sigma\) (in which case the estimate \(s\) is different for each \(i\)).

Influence Analysis

  • A case \(i\) is called influential if \(\boldsymbol{\hat\beta}\) or \(s^2\) change “substantially” when removing it.

  • Note that if \(h_i \rightarrow 1\), then \(\hat{\epsilon}_i \approx 0\); cases with \(h_i \approx 1\) may therefore be highly influential, and the \(h_i\) are sometimes called potential influence values.

  • We can test for influential values by comparing the estimate \(\boldsymbol{\hat\beta}_{(i)}\), which is the estimate with the \(i^{\textrm{th}}\) case omitted, with the fit using all of the data, \(\boldsymbol{\hat\beta}\).

  • Cook’s distances are given by \[\begin{equation} D_i = \frac{1}{p} r_i^2 \frac{h_i}{1-h_i} \end{equation}\] where \(r_i\) are internally studentised residuals.

  • An observation is often said to be

    • potentially influential if \(h_i \geq \frac{2p}{n}\)
    • outlying if \(|r_i| > 2\) (or 3).

Diagnostic Plots

  • Various plots can help diagnose the validity of assumptions (A1)-(A3).

  • Plot \(\hat{\epsilon}_i\) against \(x_{ij}\) and \(\hat{y}_i; \, i = 1,...,n; \, j = 1,...,p\).

    • if there are “patterns”, this indicates violation of (A1).
    • if the spread of \(\hat{\epsilon}_i\) varies, this indicates violation of (A2’).
  • Plot ordered values of \(\hat{\epsilon}_i\) (or \(r_i\)) versus the corresponding quantiles of a standard normal distribution \(u_i = \Phi^{-1}\left( \frac{i-0.5}{n} \right)\). Deviation from a straight line indicates non-normality (violation of (A3)).

  • For (A2”), we can plot \(\hat{\epsilon}_i\) against \(i\) (or \(t_i\) - the time at which case \(i\) is observed). If there is a pattern, this could indicate violation of (A2”), but also possibly indication of violation of (A1) instead/as well.
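The diagnostic quantities above (leverage values, internally studentised residuals and Cook's distances) can be computed directly from the hat matrix. A minimal sketch in Python (numpy), on made-up data with one deliberately extreme \(x\) value:

```python
# Leverage, internally studentised residuals and Cook's distances
# for a simple linear fit; the data are illustrative.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])    # last case has high leverage
y = np.array([1.0, 2.1, 2.9, 4.2, 9.5])
X = np.column_stack([np.ones_like(x), x])
n, p = X.shape

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)                              # leverage values h_i
resid = y - H @ y                           # hat(epsilon)
s2 = np.sum(resid**2) / (n - p)             # estimate of sigma^2
r = resid / np.sqrt(s2 * (1 - h))           # internally studentised residuals
D = r**2 / p * h / (1 - h)                  # Cook's distances

potentially_influential = h >= 2 * p / n
assert potentially_influential[-1]          # the case at x = 10 stands out
```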

1.9 Further Review Topics

1.9.1 Poisson Distribution

The Poisson distribution is a simple distribution for count data. For integer values \(x \in \{0,1,2,...\}\) and parameter \(\lambda>0\), it has the form \[\begin{equation} P(X=x) = \frac{\lambda^x e^{- \lambda}}{x!} \end{equation}\] We notate this \[\begin{equation} X \sim Poi(\lambda) \end{equation}\]

Mean and Variance

The mean and variance of \(X\), distributed according to \(Poi(\lambda)\), are given by \[\begin{equation} {\mathrm E}[X] = {\mathrm{Var}}[X] = \lambda \end{equation}\]

Sum of Poisson Variables

If \(X_1,...,X_K\) are independent Poisson variables, with parameters \(\lambda_1,...,\lambda_K\) respectively, then their sum \[\begin{equation} \sum_{k=1}^K X_k \sim Poi(\sum_{k=1}^K \lambda_k) \end{equation}\]
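This additivity can be checked numerically; the sketch below (Python/scipy, with illustrative parameters) convolves two Poisson pmfs and compares against the pmf with the summed rate.

```python
# The convolution of Poi(2) and Poi(3) pmfs should match the Poi(5) pmf.
import numpy as np
from scipy.stats import poisson

lam1, lam2 = 2.0, 3.0
xs = np.arange(0, 40)
p1 = poisson.pmf(xs, lam1)
p2 = poisson.pmf(xs, lam2)

conv = np.convolve(p1, p2)[:len(xs)]   # pmf of X1 + X2 on 0..39
assert np.allclose(conv, poisson.pmf(xs, lam1 + lam2))
```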

1.9.2 Multinomial Distribution

Assume random variables \(\boldsymbol{Y} = (Y_1,...,Y_k)^T\), such that \(Y_j \in \{0,...,n\}\) and \(\sum_{j=1}^k Y_j = n\). We say that \(\boldsymbol{Y}\) follows a Multinomial distribution with parameters

  • \(n>0\), the number of trials
  • \(\boldsymbol{\pi} = (\pi_1,...,\pi_k)^T\), the vector of success probabilities, such that \(\pi_j \in (0,1)\) and \(\sum_{j=1}^k \pi_j = 1\)

if \[\begin{equation} P(\boldsymbol{Y} = \boldsymbol{y}) = \frac{n!}{y_1!...y_k!} \pi_1^{y_1}...\pi_k^{y_k} \mathcal{I}_{\boldsymbol{y} \in \mathcal{Y}} \end{equation}\] where

  • \(\boldsymbol{y} = (y_1,...,y_k)^T\)
  • \(\mathcal{Y} = \{\boldsymbol{y} \in \{0,...,n\}^k | \sum_{j=1}^k y_j = n\}\)

We notate this as \[\begin{equation} \boldsymbol{Y} \sim Mult(n, \boldsymbol{\pi}) \end{equation}\]

Mean and Variance

The expectation is14 \[\begin{align} {\mathrm E}[Y_j] & = n \pi_j, \qquad j = 1,...,k \\ {\mathrm E}[\boldsymbol{Y}] & = n \boldsymbol{\pi} \end{align}\]
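This can be verified exactly in a small case by enumerating the support (a Python/scipy sketch with illustrative parameters):

```python
# Exact check of E[Y] = n*pi by enumerating the support of a small multinomial.
from itertools import product
import numpy as np
from scipy.stats import multinomial

n, pi = 4, np.array([0.2, 0.3, 0.5])

support = [y for y in product(range(n + 1), repeat=3) if sum(y) == n]
pmf = np.array([multinomial.pmf(y, n, pi) for y in support])
mean = np.sum(np.array(support) * pmf[:, None], axis=0)

assert np.isclose(pmf.sum(), 1.0)      # probabilities sum to one
assert np.allclose(mean, n * pi)       # E[Y] = n * pi
```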

The covariance structure is15 \[\begin{align} {\mathrm{Var}}[Y_j] & = n \pi_j (1-\pi_j) \\ {\mathrm{Cov}}[Y_j,Y_{j'}] & = -n \pi_j \pi_{j'} \qquad (j \neq j') \\ {\mathrm{Var}}[\boldsymbol{Y}] & = n( \textrm{diag} (\boldsymbol{\pi}) - \boldsymbol{\pi} \boldsymbol{\pi}^T ) \end{align}\]

Collapsing Categories

Collapsing categories of a Multinomial distribution leads to another Multinomial distribution with fewer categories.
For example, take a multinomial distribution with seven categories \(A_1,...,A_7\) such that \[\begin{equation} (N_1,...,N_7) \sim Mult(n;(\pi_1,...,\pi_7)) \end{equation}\] Now, if we combine the categories as follows: \(B_1 = \{A_1, A_2, A_3, A_4\}\), \(B_2 = A_5\), and \(B_3 = \{A_6, A_7\}\), then \((Y_1, Y_2, Y_3) = (N_1 + N_2 + N_3 + N_4, N_5, N_6 + N_7)\) follows a multinomial distribution \[\begin{equation} (Y_1, Y_2, Y_3) \sim Mult(n; (\pi_1^* , \pi_2^* , \pi_3^*)) \end{equation}\] with \((\pi_1^* , \pi_2^* , \pi_3^*) = (\pi_1 + \pi_2 + \pi_3 + \pi_4, \pi_5, \pi_6 + \pi_7)\).

Distribution of Outcome Subsets

Suppose we are interested in the first \(Q<K\) outcomes16. Conditional on their total \(n_Q = \sum_{i=1}^Q N_i\), we have \[\begin{equation} (N_1,...,N_Q) \sim Mult(n_Q; (\pi_1/\pi_Q ,...,\pi_Q/\pi_Q)) \end{equation}\] where \(\pi_Q = \sum_{i=1}^Q \pi_i\).

Poisson and Multinomial Distributions

The conditional distribution of independent Poisson variables \(X_1,...,X_K\) given that their sum \(\sum_{k=1}^K X_k = n\) is \[\begin{eqnarray} P( (X_1=n_1,...,X_K=n_K) | \sum_{k=1}^K X_k = n ) & = & \frac{P(X_1=n_1,...,X_K=n_K)}{P(\sum_{k=1}^K X_k = n)} \\ & = & \frac{\prod_{k=1}^K (e^{-\lambda_k} \lambda_k^{n_k})/n_k!}{(e^{-\lambda} \lambda^n)/n!} \end{eqnarray}\] where we have used the result of Section @ref(sum_poisson) in the denominator and defined \(\lambda = \sum_{k=1}^K \lambda_k\). Since \(\sum_{k=1}^K n_k = n\), and \(\prod_{k=1}^K e^{- \lambda_k} = e^{- \sum_{k=1}^K \lambda_k} = e^{-\lambda}\), we have that \[\begin{equation} P( (X_1=n_1,...,X_K=n_K) | \sum_{k=1}^K X_k = n ) = \frac{n!}{n_1!n_2!...n_K!} \prod_{k=1}^K \left( \frac{\lambda_k}{\lambda} \right)^{n_k} \end{equation}\] which is \(Mult(n,\boldsymbol{\pi})\) with \(\pi_k = \lambda_k/\lambda\), \(1 \leq k \leq K\), being the components of \(\boldsymbol{\pi}\).
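This identity is easy to confirm numerically (a Python/scipy sketch, with illustrative rates and counts):

```python
# Independent Poissons conditioned on their sum give a multinomial
# with probabilities lambda_k / lambda.
import numpy as np
from scipy.stats import multinomial, poisson

lam = np.array([1.0, 2.0, 3.0])
counts = np.array([2, 1, 4])
n = counts.sum()

joint = np.prod(poisson.pmf(counts, lam))   # P(X1=2, X2=1, X3=4)
total = poisson.pmf(n, lam.sum())           # P(sum = 7)
conditional = joint / total

assert np.isclose(conditional, multinomial.pmf(counts, n, lam / lam.sum()))
```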

1.9.3 Hypergeometric Distribution

Consider a population of \(N\) items, with \(M\) of them being of a particular type \(\mathcal{A}\) of interest.
If a sample of size \(q\) is selected from this population, then the number \(X\) of type \(\mathcal{A}\) items in the sample is modelled by the hypergeometric distribution, notated \[\begin{equation} X \sim \mathcal{H} g (N,M,q) \end{equation}\] according to which17 \[\begin{equation} P(X=x) = \frac{\binom{M}{x}\binom{N-M}{q-x}}{\binom{N}{q}} \qquad \max(0,q-(N-M)) \leq x \leq \min(q,M) \end{equation}\]

Mean and Variance

The mean and variance of \(X\), distributed according to the hypergeometric distribution, are \[\begin{align} {\mathrm E}[X] & = \frac{qM}{N} \\ {\mathrm{Var}}[X] & = \frac{qM(N-q)(N-M)}{N^2(N-1)} \end{align}\]

Practical Example

Hypergeometric probabilities in R are computed using dhyper, and cumulative probabilities using phyper. Note that R parametrises the distribution as dhyper(x, m, n, k), where m is the number of type \(\mathcal{A}\) items (our \(M\)), n is the number of remaining items (our \(N-M\)) and k is the sample size (our \(q\)). If \(X \sim \mathcal{H} g (12,7,3)\), then \(P(X=1)\) is computed as

dhyper(1, 7, 5, 3)

and \(P(X\leq 2)\) is computed as

phyper(2, 7, 5, 3)

Use the help file for further information.

1.9.4 Chi-Squared Distribution

The central chi-squared distribution with \(n\) degrees of freedom is defined as the distribution of the sum of squares of \(n\) independent standard normal variables \(Z_i \sim \mathcal{N}(0,1), i = 1,...,n\). It is denoted by \[\begin{equation} X^2 = \sum_{i=1}^n Z_i^2 \sim \chi^2(n) \end{equation}\]

Mean and Variance

If \(X^2\) has the distribution \(\chi^2(n)\), then18 \[\begin{align} {\mathrm E}[X^2] & = n \\ {\mathrm{Var}}[X^2] & = 2n \end{align}\]

Sum of Chi-Squareds

If \(X_1^2,...,X_m^2\) are \(m\) independent variables with chi-squared distributions \(X_i^2 \sim \chi^2(n_i)\), then19 \[\begin{equation} \sum_{i=1}^m X_i^2 \sim \chi^2 \left( \sum_{i=1}^m n_i \right) \end{equation}\]
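A quick numerical sanity check (Python/scipy, with the degrees of freedom chosen arbitrarily): the mean/variance formulas, and the identification of \(\chi^2(n)\) with a Gamma distribution of shape \(n/2\) and scale 2, from which the additivity result follows since independent Gammas with a common scale add their shape parameters.

```python
# Check chi-squared mean/variance and the chi2(n) = Gamma(n/2, scale 2)
# identification underlying the additivity result.
import numpy as np
from scipy.stats import chi2, gamma

n1, n2 = 3, 4
assert chi2.mean(n1) == n1 and chi2.var(n1) == 2 * n1

xs = np.linspace(0.1, 20, 50)
assert np.allclose(chi2.pdf(xs, n1), gamma.pdf(xs, n1 / 2, scale=2))
assert np.allclose(chi2.pdf(xs, n1 + n2), gamma.pdf(xs, (n1 + n2) / 2, scale=2))
```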

1.9.5 Central Limit Theorem

The Central Limit Theorem (CLT) tells us that the average \(\bar{X}\) of \(n\) i.i.d. variables with mean \(\mu\) and variance \(\sigma^2\) is, for large \(n\), approximately distributed as \[\begin{equation} \bar{X} \sim \mathcal{N}(\mu, \frac{\sigma^2}{n}) \end{equation}\]

In higher dimensions, the CLT states that the average \(\bar{\boldsymbol{X}}\) of \(n\) i.i.d. variables with mean vector \(\boldsymbol{\mu}\) and variance matrix \(\Sigma\) will be such that: \[\begin{equation} \sqrt{n} ( \bar{\boldsymbol{X}} - \boldsymbol{\mu} ) \sim {\mathcal N}( \boldsymbol{0}, \Sigma ) \end{equation}\] as \(n \rightarrow \infty\).
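A small seeded simulation illustrates the one-dimensional statement (a Python/numpy sketch; the exponential distribution is an arbitrary choice, with \(\mu = \sigma = 1\)):

```python
# Standardised means of exponential variables should be approximately
# standard normal for moderately large n.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, reps = 400, 5000
samples = rng.exponential(scale=1.0, size=(reps, n))  # mu = sigma = 1
z = np.sqrt(n) * (samples.mean(axis=1) - 1.0)         # sqrt(n)(xbar - mu)/sigma

# empirical cdf at 0 should be close to Phi(0) = 0.5
emp = np.mean(z <= 0)
assert abs(emp - norm.cdf(0.0)) < 0.03
```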

1.9.6 Lagrange Multipliers

We here review the essence of optimisation using Lagrange multipliers. For further detail, see the relevant notes from AMVII.

  • We want to optimise a function \(f(\boldsymbol{x}) \in {\mathbb R}\), with \(\boldsymbol{x} \in {\mathbb R}^t\), subject to the vector constraint \(\boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{0}\), with \(\boldsymbol{g}(\boldsymbol{x}) \in {\mathbb R}^s\) being \(s\) component constraints.

  • We construct a function, called the Lagrange function, given by \[\begin{equation} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) + \lambda_1 g_1(\boldsymbol{x}) + ... + \lambda_s g_s(\boldsymbol{x}) \end{equation}\] where \(\boldsymbol{g}(\boldsymbol{x}) = ( g_1(\boldsymbol{x}), ..., g_s(\boldsymbol{x}) )\) and \(\boldsymbol{\lambda} = (\lambda_1,...,\lambda_s)^T\) represents a vector of Lagrange multipliers.

Note that the Lagrange function can also be constructed by subtracting the constraint functions, that is \[\begin{equation} \mathcal{L}(\boldsymbol{x}, \boldsymbol{\lambda}) = f(\boldsymbol{x}) - ( \lambda_1 g_1(\boldsymbol{x}) + ... + \lambda_s g_s(\boldsymbol{x}) ) \end{equation}\] It doesn’t make any difference to solving the problem, as the solutions for the local optima of the parameters of interest \(\boldsymbol{x}\) will be the same20.

  • To find the points of local optimisation of \(f(\boldsymbol{x})\) subject to the equality constraints, we find the stationary points of Lagrange function \(\mathcal{L}(\boldsymbol{x}, \boldsymbol{\lambda})\), that is, we solve the following system of equations: \[\begin{eqnarray} \frac{\partial \mathcal{L}}{\partial x_i} & = & 0 \qquad i = 1,...,t \\ \frac{\partial \mathcal{L}}{\partial \lambda_j} & = & 0 \qquad j = 1,...,s \end{eqnarray}\]
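As a worked instance (with the count data made up), consider maximising the multinomial log-likelihood \(\sum_j n_j \log \pi_j\) subject to \(\sum_j \pi_j = 1\). Solving \(\partial \mathcal{L}/\partial \pi_j = n_j/\pi_j - \lambda = 0\) together with the constraint gives \(\hat{\pi}_j = n_j/n\); the sketch below confirms this against a constrained numerical optimiser (Python/scipy).

```python
# Maximise sum_j n_j log(pi_j) subject to sum_j pi_j = 1; the Lagrange
# solution is pi_j = n_j / n. The counts here are illustrative.
import numpy as np
from scipy.optimize import minimize

counts = np.array([12.0, 30.0, 18.0])
n = counts.sum()

neg_loglik = lambda p: -np.sum(counts * np.log(p))
constraint = {"type": "eq", "fun": lambda p: np.sum(p) - 1}

res = minimize(neg_loglik, x0=np.full(3, 1 / 3), method="SLSQP",
               constraints=[constraint], bounds=[(1e-9, 1)] * 3)

assert np.allclose(res.x, counts / n, atol=1e-3)   # matches n_j / n
```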


Agresti, A. 2019. An Introduction to Categorical Data Analysis. 3rd ed. New York: Wiley.
Bilder, C. R., and T. M. Loughin. 2015. Analysis of Categorical Data with R. Boca Raton: CRC Press.
Faraway, J. J. 2016. Extending the Linear Model with R. 2nd ed. London: CRC Press.
Kateri, M. 2014. Contingency Table Analysis - Methods and Implementation Using R. New York: Birkhäuser.
Tutz, G. 2012. Regression for Categorical Data. Cambridge: Cambridge University Press.
Wilks, S. S. 1938. “The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses.” Annals of Mathematical Statistics 9 (1): 60–62.

  1. A variable in statistics is the same as a variable in mathematics more generally: an unspecified element of a given set whose elements are the possible values of the variable. The difference is that in statistics we place a probability distribution on this space of values. A variable with a probability distribution on its space of values is sometimes called a random variable (although the usual mathematical definition is not in these terms). However, I consider this terminology redundant and confusing, for the following two reasons: Firstly, nothing has happened to the variable. Rather, an extra structure (the probability distribution) has been added to the space of values, and it is this structure that we must focus on for statistics. As a result, just as in mathematics more generally, we can, and will, get away with not distinguishing between a variable and its value, using \(x\), for example, for both. Nevertheless, just as in mathematics more generally, such a distinction is sometimes useful. If it is useful, then a variable will be denoted in upper case: the usual notation \(x\) will be replaced by the notation \(X = x\), both of them indicating that the variable \(X\) has the value \(x\). Secondly, the terminology encourages the notion that there are two types of quantity in the world: “random” quantities, to be represented by “random” variables; and “non-random” quantities, to be represented by “ordinary” variables. With the possible exception of quantum mechanics, which is irrelevant to our purposes, this is false, or, rather, meaningless: try thinking of an empirical test to decide whether a quantity is “random” or “non-random”; there are only quantities, represented by variables. The purpose of probability distributions is to describe our knowledge of the values of these variables.↩︎

  2. The notation \(P_{}\left(y | \ldots\right)\) will be used to mean the probability of \(Y = y\) in the case that \(\mathcal{Y}\) is countable, and to represent the density \(P_{}\left(Y\in[y, y + dy]\right)/dy\) in the case that \(\mathcal{Y}\) is continuous.↩︎

  3. Here \(K\) represents all the other knowledge, e.g. values of other parameters, that may be relevant. We will, however, often drop the \(K\) for notational convenience.↩︎

  4. The notation \([m..n]\) will indicate all natural numbers \(i\) such that \(m \leq i \leq n\).↩︎

  5. Recall that the space of predictors is not the same as the space of covariates. For example, there might only be one covariate, \(z \in {\mathbb R}\), but many predictors, corresponding, for example, to \(x_{a} = z^{a - 1}\), \(a\in [1..p]\).↩︎

  6. Note that we here denote vectors (and later matrices too) in bold print. I hope that the dimension of any objects will be clear from the context, but I will try to embolden vectors and matrices where possible (and where I remember!) for clarity. The variables in Section 1.1 could also have been vectors, but due to the vagueness of the discussion I didn’t bother.↩︎

  7. Note that this guarantees that \({\mathrm E}[Y |f, x]\in{\mathbb R}\), as we have assumed.↩︎

  8. We will here assume that \(\theta\) is a scalar parameter.↩︎

  9. See the corresponding lecture notes (Theorem 6.3) for the mathematical exposition of this theorem.↩︎

  10. For the purposes of this course, we will use this theorem without proof. A sketch of the proof was outlined in Statistical Inference II. Anyone who is interested in Wilks’ original paper on the theorem can consult Wilks (1938).↩︎

  11. \(\mathcal{I}_T\) is the indicator function taking value 1 if \(T\) holds and 0 otherwise.↩︎

  12. Terminology: The term fitted value is typically used for an estimate of a piece of data used to fit the model, and predicted value is typically used for an estimate of a piece of data not used to fit the model, but they are mathematically the same.↩︎

  13. Recall that \(Y_0\) and \(\hat{Y}_0\) are independent since \(\hat{Y}_0\) is a linear combination of \(y_1,...,y_n\).↩︎

  14. Q1-3 concerns proving the results stated in this section.↩︎

  15. Note that \(\textrm{diag}(\boldsymbol{\pi})\) is the \(k \times k\) diagonal matrix with diagonal elements taking the values of the vector \(\boldsymbol{\pi}\).↩︎

  16. or indeed, any collection of \(Q<K\) outcomes.↩︎

  17. Note that this is a counting problem. The denominator is the number of ways to sample \(q\) items from the total \(N\), and the numerator is the number of ways in which to sample \(x\) of the \(M\) type \(\mathcal{A}\) objects, and \(q-x\) of the \(N-M\) non-type \(\mathcal{A}\) objects.↩︎

  18. Q1-3c involves proving that the results of this section hold.↩︎

  19. Q1-3b involves showing that the result of this section holds.↩︎

  20. Indeed, I use the subtraction convention later on.↩︎