CHAPTER 1 Preliminaries
This chapter provides a review of the tools needed for regression analysis.
Concepts from matrix theory (Stat 135) are presented first, followed by important results from the theory of parametric statistical inference (Stat 131).
R code is also included to illustrate some of the results.
1.1 Review of Matrix Theory
WHY DO WE NEED MATRIX THEORY?
We are dealing with multiple variables. Matrices give us a compact representation of systems of equations and simplify calculations that would otherwise have to be written out with summations and other element-by-element operations.
Basic Concepts
A matrix is an array of numbers (constants or variables) containing \(r\) rows and \(c\) columns \[ \underset{3\times4} {\textbf A }= \begin{bmatrix} 0 & 9 & 2 &3 \\ 7 & 6 & 4 &5 \\ 11 & 2 & 1 & 8 \\ \end{bmatrix} \]
The dimension or order of a matrix is its size, i.e., the number of rows and columns.
A vector is an array of numbers (constants or variables) arranged in rows or columns \[ \underset{3\times1} {\textbf a }= \begin{bmatrix} 2 \\ 7 \\ 8 \end{bmatrix} \quad \underset{1\times3} {\textbf a' }= \begin{bmatrix} 2 & 7 & 8 \end{bmatrix} \]
A square matrix is a matrix that has equal number of rows and columns \[ \underset{n\times n}{\textbf{A}} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \]
The diagonal elements are the elements found in the diagonal of a square matrix while those elements other than the diagonal elements are the off-diagonal or nondiagonal elements.
A diagonal matrix is a square matrix that has zero for all of its off-diagonal elements. \[ \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix} \]
A triangular matrix is a square matrix with all elements above (or below) the diagonal being zero. \[ \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix} \]
Matrix Operations
- Transpose of a matrix: \(\textbf{A}'\)
- Trace of a matrix: \(tr(\textbf{A})=\sum_{i=1}^n a_{ii}\)
- Addition of conformable matrices: \(\underset{p \times q}{\textbf{A}}+\underset{p \times q}{\textbf{B}}\)
- Scalar Multiplication: \(c\textbf{A}=\{ca_{ij}\}\)
- Multiplication of conformable matrices: \(\underset{p \times q}{\textbf{A}}\times \underset{q \times r}{\textbf{C}}=\underset{p \times r}{\textbf{D}}\)
- Determinant of a matrix \(det(\textbf{A})=|\textbf{A}|\)
Special Matrices
Symmetric Matrices: If A is symmetric, then \(\textbf{A}=\textbf{A}'\)
Idempotent Matrix: \(\textbf{A}^2=\textbf{A}\)
Null Matrix: \(\textbf{0}=\begin{bmatrix} 0 & 0 & \cdots & 0 \\0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}\)
Identity matrix: \(\textbf{I}=\begin{bmatrix} 1 & 0 & \cdots & 0 \\0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\)
J matrix (matrix of ones): \[ \textbf{J}_2=\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\quad \textbf{J}_{2\times 5}=\begin{bmatrix} 1 & 1 &1 &1 &1 \\ 1 & 1 & 1 & 1 & 1\end{bmatrix}, \quad \textbf{J}=\textbf1\textbf{1}' \]
Let \(\textbf{J}\) be a square matrix of ones of order \(n\). Then \[ \bar{\textbf{J}}=\frac{1}{n}\textbf{J}= \begin{bmatrix} 1/n & 1/n & \cdots & 1/n \\ 1/n & 1/n & \cdots & 1/n \\ \vdots & \vdots & \ddots & \vdots \\ 1/n & 1/n & \cdots & 1/n \end{bmatrix} \]
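As a quick sketch in R (with \(n = 4\) chosen arbitrarily), \(\textbf{J}\) and \(\bar{\textbf{J}}\) can be built directly, and \(\bar{\textbf{J}}\) can be checked to be idempotent (a property defined above):

```r
# A sketch: J and Jbar of order n (n = 4 chosen arbitrarily)
n    <- 4
J    <- matrix(1, nrow = n, ncol = n)   # matrix of ones, J = 1 1'
Jbar <- J / n
all.equal(Jbar %*% Jbar, Jbar)          # TRUE: Jbar is idempotent
```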
Positive semi-definite: A symmetric matrix \(\textbf{M}\) such that \(\textbf{x}'\textbf{M}\textbf{x}\geq0 \quad \forall\,\textbf{x}\in\mathbb{R}^n\)
Invertibility and Singularity
An \(n \times n\) square matrix \(\textbf{A}\) is invertible if and only if \(|\textbf{A}|\neq 0\)
An \(n \times n\) square matrix \(\textbf{A}\) is singular if \(|\textbf{A}|= 0\) and nonsingular if \(|\textbf{A}|\neq 0\)
Results on Inverses
If a matrix has an inverse, then the inverse is unique
\((\textbf{A}^{-1})^{-1}=\textbf{A}\)
\(\textbf{A}^{-1}\textbf{A}=\textbf{I}\)
If \(\textbf{A}\) and \(\textbf{B}\) are nonsingular matrices and \(\textbf{AB}\) is defined, then \((\textbf{AB})^{-1}=\textbf{B}^{-1}\textbf{A}^{-1}\)
The transpose of an invertible matrix is also invertible.
A square matrix \(\textbf{A}\) is orthogonal if \(\textbf{A}'=\textbf{A}^{-1}\), or equivalently, \(\textbf{A}\textbf{A}'=\textbf{A}'\textbf{A}=\textbf{I}\)
Linear Dependence and Ranks
If \(\textbf{M}=[\textbf{m}_1, \textbf{m}_2, \ldots,\textbf{m}_n]\), where the \(\textbf{m}_i\) are vectors of dimension \(n\times 1\), the \(\textbf{m}_i\) are linearly dependent if there exist constants \(c_1, c_2,\ldots,c_n\), not all zero, such that \(c_1\textbf{m}_1 + c_2\textbf{m}_2 +\cdots +c_n\textbf{m}_n = \textbf{0}\). Otherwise, they are linearly independent.
The rank of matrix \(\textbf{M}\) is defined to be the largest number of linearly independent rows (columns) of \(\textbf{M}\).
Some results on ranks
\(rk(\textbf{A})=rk(\textbf{A}')\)
If \(\textbf{A}\) is idempotent, \(rk(\textbf{I}-\textbf{A})=rk(\textbf{I})-rk(\textbf{A})=tr(\textbf{I}-\textbf{A})\)
If two square matrices \(\textbf{A}\) and \(\textbf{B}\) , each of order \(n\), are nonsingular, then for any matrix \(\textbf{C}\) where multiplication with \(\textbf{A}\) and \(\textbf{B}\) are defined, the matrices \(\textbf{C}\), \(\textbf{AC}\), \(\textbf{CB}\) , and \(\textbf{ACB}\) all have the same rank.
The rank of the product of two matrices \(\textbf{A}\) and \(\textbf{B}\) is at most equal to the smaller of the ranks of \(\textbf{A}\) and \(\textbf{B}\). \[ rk(\textbf{AB})\leq \min\{rk(\textbf{A}),rk(\textbf{B})\} \]
Let \(\textbf{A}\) be a square matrix of order \(n\) . \(|\textbf{A}| = 0\) if and only if \(rk(\textbf{A}) < n\).
Let \(\textbf{A}\) and \(\textbf{B}\) be both \(m \times n\) matrices with ranks \(r_1\) and \(r_2\) respectively. Then \(rk(\textbf{A} + \textbf{B}) \leq r_1 + r_2\).
Eigenvalues and Eigenvectors
Let \(\textbf{A}\) be an \(n \times n\) matrix.
- A scalar \(\lambda\) is an eigenvalue of \(\textbf{A}\) if there \(\exists\) a nonzero vector \(\textbf{x}\in \mathbb{R}^n\) such that \(\textbf{Ax}=\lambda\textbf{x}\).
- Any \(\textbf{x}\neq\textbf{0}\) satisfying the above equation is called an eigenvector of \(\textbf{A}\) corresponding to eigenvalue \(\lambda\)
Remarks on Eigenvalues and Eigenvectors
The eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_n\) of \(\textbf{A}\) are the roots of the characteristic equation (of degree \(n\)) \(|\textbf{A}-\lambda \textbf{I}|=0\); for the symmetric matrices used in regression, these roots are real. The roots are sometimes called latent, proper, or characteristic roots.
\(\textbf{A}\) is singular if and only if 0 is an eigenvalue of \(\textbf{A}\)
The characteristic polynomials of \(\textbf{A}\) and \(\textbf{A}'\) are identical, so \(\textbf{A}\) and \(\textbf{A}'\) have the same eigenvalues. However, their eigenvectors are not identical.
If \(\textbf{A}\) has eigenvalues \(\lambda_1, \lambda_2,...,\lambda_n\), then
\[ tr(\textbf{A})=\sum_{i=1}^n\lambda_i \quad \text{and} \quad |\textbf{A}| = \prod_{i=1}^n\lambda_i \]
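These two identities can be checked numerically in R; the \(3 \times 3\) matrix below is the same \(\textbf{A}\) used in the R outputs at the end of this section:

```r
# Numerical check of the trace and determinant identities
A <- matrix(c(1, 3, 1,
              2, 5, 3,
              2, 8, 6), nrow = 3, byrow = TRUE)
lambda <- eigen(A)$values
sum(lambda);  sum(diag(A))    # both equal tr(A) = 12
prod(lambda); det(A)          # both equal |A| = -6
```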
Decomposition of Matrices
Spectral Decomposition Let \(\textbf{A}\) be an \(n \times n\) symmetric matrix. The matrix \(\textbf{A}\) can be decomposed as \(\textbf{P}\textbf{D}\textbf{P}'\), where \(\textbf{D}\) is a diagonal matrix with eigenvalues of \(\textbf{A}\) as its diagonal elements and \(\textbf{P}\) is an n-dimensional square matrix whose \(i^{th}\) column is the \(i^{th}\) eigenvector of \(\textbf{A}\). That is, \[ \textbf{A}=\sum_{i=1}^n\lambda_i\textbf{p}_i\textbf{p}_i'=\textbf{P}\textbf{D}\textbf{P}' \]
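A short sketch in R: `eigen()` applied to a symmetric matrix returns the eigenvalues and the matrix \(\textbf{P}\) of eigenvectors, from which \(\textbf{A}\) can be reconstructed (the symmetric matrix below is an arbitrary example):

```r
# Spectral decomposition sketch for an arbitrary symmetric matrix S
S  <- matrix(c(4, 1, 2,
               1, 3, 0,
               2, 0, 5), nrow = 3, byrow = TRUE)
es <- eigen(S)
P  <- es$vectors          # columns are the (orthonormal) eigenvectors
D  <- diag(es$values)     # eigenvalues on the diagonal
P %*% D %*% t(P)          # reconstructs S (up to rounding error)
```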
Singular Value Decomposition
Any \(n \times p\) matrix \(\textbf{X}\) can be decomposed as \[ \textbf{X} = \textbf{U}\textbf{D}\textbf{V}' \]
where
\(\textbf{U}\) is a (column) orthogonal \(n \times p\) matrix.
\(\textbf{D}\) is a diagonal matrix containing the singular values \(D_{ii}\) on the diagonal in decreasing order.
\(\textbf{V}\) is an orthogonal \(p \times p\) matrix.
\(\textbf{U}'\textbf{U}=\textbf{V}'\textbf{V}=\textbf{I}_p\)
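In R, `svd()` returns the three factors. A sketch with an arbitrary \(5 \times 4\) matrix:

```r
# SVD sketch for an arbitrary n x p matrix
set.seed(136)
X  <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)
sv <- svd(X)
U <- sv$u; D <- diag(sv$d); V <- sv$v
max(abs(X - U %*% D %*% t(V)))                  # ~ 0, so X = U D V'
round(t(U) %*% U, 10); round(t(V) %*% V, 10)    # both are I_p
```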
Matrix Calculus
Let \(f(\textbf{x})\) be a continuous function of the elements of the vector \(\textbf{x}′ = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}\) whose first and second partial derivatives \(\frac{\partial f(\textbf{x})}{\partial x_i}\), \(\frac{\partial^2 f(\textbf{x})}{\partial x_i \partial x_j}\) exist for all points \(\textbf{x}\) in some region of \(p\)-dimensional Euclidean space.
Derivative of \(f(\textbf{x})\) with respect to \(\textbf{x}\): \[\nabla f(\textbf{x})=\frac{\partial f(\textbf{x})}{\partial{\textbf{x}}} = \left[\frac{\partial f(\textbf{x})}{\partial{x_i}} \right], \quad i=1,2,...,p\]
Hessian matrix of \(f(\textbf{x})\): \[ H_f = \frac{\partial^2 f(\textbf{x})}{\partial{\textbf{x}}\partial{\textbf{x}'}} = \left[\frac{\partial^2 f(\textbf{x})}{\partial{x_i}\partial{x_j}} \right], \quad i,j=1,2,...,p \]
Some Results in Matrix Calculus
If \(f(\textbf{x})=c\), then \(\frac{\partial f(\textbf{x})}{\partial \textbf{x}}=\textbf{0}\)
If \(f(\textbf{x}) = \textbf{a}′\textbf{x}\), where \(\textbf{a}\) is a \(p \times 1\) vector of constants, then \(\frac{\partial f(\textbf{x})}{\partial \textbf{x}} = \textbf{a}\).
If \(f(\textbf{x}) = \textbf{x}′\textbf{Ax}\), where \(\textbf{A}\) is symmetric, then \(\frac{\partial f (\textbf{x})}{\partial \textbf{x}} = 2\textbf{Ax}\).
For the general quadratic form \(f (\textbf{x}) = (\textbf{a} ± \textbf{Bx})'\textbf{A}(\textbf{a} ± \textbf{Bx})\), where \(\textbf{a}\) is an \(m \times 1\) vector of constants, \(\textbf{B}_{m \times p}\) is a matrix of constants, \(\textbf{x}_{p \times 1}\) is a vector of variables, and \(\textbf{A}_{m \times m}\) is a symmetric matrix of constants, \[ \frac{\partial f (\textbf{x})}{\partial \textbf{x}} = ±2\textbf{B}'\textbf{A}(\textbf{a} ± \textbf{Bx}). \]
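The third result can be checked numerically with finite differences; the matrix \(\textbf{A}\) and the point \(\textbf{x}\) below are arbitrary illustrative values:

```r
# Numerical check that the gradient of f(x) = x'Ax is 2Ax for symmetric A
set.seed(136)
p <- 3
A <- crossprod(matrix(rnorm(p^2), p, p))    # an arbitrary symmetric matrix
x <- rnorm(p)
f <- function(x) as.numeric(t(x) %*% A %*% x)
eps <- 1e-6
num_grad <- sapply(1:p, function(i) {
  e <- replace(numeric(p), i, eps)
  (f(x + e) - f(x - e)) / (2 * eps)         # central difference
})
cbind(numerical = num_grad, analytic = as.numeric(2 * A %*% x))
```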
R outputs
Defining Matrices \(\textbf{A}\) and \(\textbf{B}\)
## [,1] [,2] [,3]
## [1,] 1 3 1
## [2,] 2 5 3
## [3,] 2 8 6
## [,1] [,2] [,3]
## [1,] 1 5 8
## [2,] 3 7 0
## [3,] 9 4 2
Sum of two conformable matrices
## [,1] [,2] [,3]
## [1,] 2 8 9
## [2,] 5 12 3
## [3,] 11 12 8
Difference of two conformable matrices
## [,1] [,2] [,3]
## [1,] 0 -2 -7
## [2,] -1 -2 3
## [3,] -7 4 4
Matrix multiplication
## [,1] [,2] [,3]
## [1,] 19 30 10
## [2,] 44 57 22
## [3,] 80 90 28
Transpose of a matrix: \(\textbf{A}'\)
## [,1] [,2] [,3]
## [1,] 1 2 2
## [2,] 3 5 8
## [3,] 1 3 6
Inner product of two matrices with same dimensions: \(\textbf{A}'\textbf{B}\)
## [,1] [,2] [,3]
## [1,] 25 27 12
## [2,] 90 82 40
## [3,] 64 50 20
Inverse of a matrix: \(A^{-1}\)
## [,1] [,2] [,3]
## [1,] -1 1.6666667 -0.6666667
## [2,] 1 -0.6666667 0.1666667
## [3,] -1 0.3333333 0.1666667
Determinant of a matrix
## [1] -6
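The outputs above can be reproduced with R code along the following lines:

```r
# Matrices A and B as printed above
A <- matrix(c(1, 3, 1,
              2, 5, 3,
              2, 8, 6), nrow = 3, byrow = TRUE)
B <- matrix(c(1, 5, 8,
              3, 7, 0,
              9, 4, 2), nrow = 3, byrow = TRUE)

A + B         # sum of conformable matrices
A - B         # difference
A %*% B       # matrix multiplication
t(A)          # transpose
t(A) %*% B    # A'B (equivalently crossprod(A, B))
solve(A)      # inverse
det(A)        # determinant
```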
1.2 Review of Statistical Inference
WHAT IS STATISTICAL INFERENCE?
Statistical Inference is an area in statistics that deals with the methods used to make generalizations or inference about some characteristics of the population based on information contained in the sample.
Approaches to inference
- Estimation: estimate the value of the parameter of interest.
- Point Estimation: calculate a single number as our guess to the unknown parameter.
- Confidence Interval Estimation: create an interval which we hope contains the unknown parameter with a specific level of confidence.
- Hypothesis Testing: make decisions on whether or not the sample agrees with the researcher’s assertion regarding some characteristic of the population
- Parametric test: test hypotheses concerning the specific distributional characteristics (parameter) of the population.
- Non-parametric test: make inferences about population without assuming a specific distribution
Example:
Objective: “How effective is Minoxidil in treating male pattern baldness?”
Specific Objectives:
- (point estimation) to estimate the population proportion of patients who will show new hair growth after being treated with Minoxidil
- (hypothesis testing) to determine whether treatment using Minoxidil is better than the existing treatment that is known to stimulate hair growth among 40% of patients with male pattern baldness
Basic Definitions
- Let the random variable \(Y\) have a probability density function \(f(y)\) (or probability mass function \(p(y)\) for discrete). The expected value or mean of Y, denoted by \(\mu_Y\) or \(E(Y)\) is defined as
\[ E(Y)=\int_{-\infty} ^\infty y f(y)\,dy \quad \text{(continuous)} \qquad E(Y) = \sum_{\forall y} y\, p(y) \quad \text{(discrete)} \]
The variance of a random variable \(Y\), denoted by \(\sigma^2_Y\) or \(Var(Y)\), is defined as \[ Var(Y) = E[(Y-\mu_Y)^2]=E(Y^2)-\mu_Y^2 \]
The covariance of Y and Z, denoted by \(Cov(Y,Z)\), or \(\sigma_{Y,Z}\), is defined by \[ Cov(Y,Z) = E[(Y-\mu_Y)(Z-\mu_Z)]=E(YZ)-\mu_Y\mu_Z \]
Independence of two random variables. Let \(Y\sim f_Y\) and \(Z \sim f_Z\). The random variables \(Y\) and \(Z\) are said to be independent if and only if \[ f_{Y,Z}(y,z) = f_Y(y) f_Z(z) \] where \(f_{Y,Z}(y,z)\) is the joint probability function of \(Y\) and \(Z\)
Random Vectors and Matrices
Expectation of a random vector. If \(\underset{n \times 1}{\textbf{y}}\) is a vector of random variables, then \[ E(\textbf{y})=E \begin{bmatrix} Y_1\\Y_2 \\ \vdots \\Y_n \end{bmatrix} = \begin{bmatrix} E(Y_1)\\ E(Y_2) \\ \vdots \\ E(Y_n) \end{bmatrix} \]
Expectation of a random matrix. Let \(\textbf{X}=\{X_{ij}\}\) be a matrix of random variables. Then \[ E(\textbf{X})=\{\mu_{ij}\} \]
- Remarks: \(E(\textbf{AXB}\pm\textbf{F})=\textbf{A}E(\textbf{X})\textbf{B}\pm\textbf{F}\) where \(\textbf{A}\) , \(\textbf{B}\), and \(\textbf{F}\) are constant matrices.
Variance-Covariance Matrix. Let \(\textbf{y}_{n\times 1}\) be a vector of random variables, \(\textbf{y}'=\begin{bmatrix} Y_1 & Y_2 & \cdots &Y_n \end{bmatrix}\). The variance-covariance matrix of \(\textbf{y}\) is defined as \[\begin{align} Var(\textbf{y})&= E \{ (\textbf{y} - E(\textbf{y}))(\textbf{y} - E(\textbf{y}))' \}\\ &= E \left\{ \begin{bmatrix} Y_1-E(Y_1) \\ Y_2-E(Y_2) \\ \vdots \\ Y_n-E(Y_n) \end{bmatrix} \begin{bmatrix} Y_1-E(Y_1) & Y_2-E(Y_2) & \cdots & Y_n-E(Y_n) \end{bmatrix} \right\}\\ &= \{\sigma_{ij}\}=\boldsymbol{\Sigma} \end{align}\]
where
the diagonal elements are the variances of \(Y_i\): \(\sigma_{ii}=\sigma_i^2=Var(Y_i)\)
the off-diagonal elements are the covariances of \(Y_i\) and \(Y_j\): \(\sigma_{ij}=Cov(Y_i,Y_j)\)
Correlation Matrix. Let \(\textbf{y}_{n\times 1}\) be a vector of random variables, \(\textbf{y}'=\begin{bmatrix} Y_1 & Y_2 & \cdots &Y_n \end{bmatrix}\), and let \(\sigma_i\) be the standard deviation of \(Y_i\). The correlation matrix of \(\textbf{y}\) denoted as \(\rho(\textbf{y})\) is defined as:
\[\begin{align}\rho(\textbf{y})&= diag^{-1}\{\sigma_1 \quad \sigma_2 \quad \cdots\sigma_n\}\boldsymbol{\Sigma}diag^{-1}\{\sigma_1 \quad \sigma_2 \quad \cdots\sigma_n\}\\&= \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots &0 \\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & \frac{1}{\sigma_n} \end{bmatrix} \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots &\vdots\\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots &0 \\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & \frac{1}{\sigma_n} \end{bmatrix}\\ &= \left\{\frac{\sigma_{ij}}{\sigma_i \sigma_j}\right\}, \quad i,j = 1,...,n\end{align}\]
Remarks on variance and correlation:
The Variance-Covariance matrix and the Correlation Matrix are always symmetric.
The diagonal elements of the correlation matrix are always equal to 1.
Let \(\textbf{C}\) be a matrix of constants and \(\textbf{y}\) be a random vector where \(Var(\textbf{y})=\boldsymbol{\Sigma}\). Then \(Var(\textbf{Cy})=\textbf{C}\boldsymbol{\Sigma}\textbf{C}'\)
Quadratic Form. Let \(\textbf{y}\) be a vector of \(n\) random variables with mean \(\boldsymbol{\mu}\) and \(Var(\textbf{y})=\boldsymbol{\Sigma}\) and let \(\textbf{A}\) be an \(n\)-dimensional symmetric matrix. The scalar quantity \(\textbf{y}'\textbf{A}\textbf{y}\) is known as a quadratic form in \(\textbf{y}\).
Remarks:
\(E(\textbf{y}'\textbf{A}\textbf{y}) = tr(\textbf{A}\boldsymbol{\Sigma})+\boldsymbol{\mu}'\textbf{A}\boldsymbol{\mu}\)
Under Multivariate Normality, \(Var(\textbf{y}'\textbf{A}\textbf{y})=2\,tr\{(\textbf{A}\boldsymbol{\Sigma})^2\}+4\boldsymbol{\mu}'\textbf{A}\boldsymbol{\Sigma}\textbf{A}\boldsymbol{\mu}\)
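The first remark can be checked with a small Monte Carlo sketch; the values of \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\), and \(\textbf{A}\) below are arbitrary, and the MASS package is assumed to be available:

```r
# Monte Carlo check of E(y'Ay) = tr(A Sigma) + mu'A mu
library(MASS)   # for mvrnorm()
set.seed(136)
mu    <- c(1, 2, 3)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)
A     <- diag(3)                                    # a simple symmetric choice of A
y     <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)
qf    <- rowSums((y %*% A) * y)                     # y'Ay for each simulated vector
mean(qf)                                            # Monte Carlo estimate
sum(diag(A %*% Sigma)) + drop(t(mu) %*% A %*% mu)   # theoretical value
```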
The Normal Distribution
We say that a random variable \(Y\) follows the normal distribution denoted by \(Y\sim Normal(\mu,\sigma^2)\) if and only if the pdf of \(Y\) is given by:
\[ f_Y(y)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}\right\} \]
Remarks:
- If \(Y\sim Normal(\mu,\sigma^2)\) then, \(E(Y)=\mu\), \(Var(Y)=\sigma^2\), \(m_Y(t)=\exp\{\mu t+\frac{1}{2}\sigma^2 t^2\}\)
- The normal distribution provides a reasonably good description of the relative frequency distributions of many random variables encountered in practice.
- A lot of procedures in inferential statistics assume that the population is normally distributed.
- In Stat 136, one of the most common assumptions is that the errors are normally distributed with mean 0.
Results in Sampling from the Normal Distribution
(Sample Mean) Let \(X_1,X_2,...,X_n \overset{iid}{\sim} Normal(\mu,\sigma^2)\) \[ \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\sim Normal\left(\mu,\frac{\sigma^2}{n}\right) \]
(Sum of Squares of Standard Normal) Let \(Y_1,...,Y_n \overset{iid}{\sim} N(0,1)\). Then \[ \sum_{i=1}^n Y_i^2 \sim \chi^2_{(\nu=n)} \]
(Sample Variance). Let \(S^2\) be the sample variance of \(X_1,X_2,...,X_n\overset{iid}{\sim}N(\mu,\sigma^2)\). Then \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(\nu = n-1)} \]
(T Statistic). Let \(Y\sim N(0,1), Z\sim \chi^2_\nu\), \(Y\) and \(Z\) are independent. Then \[ T = \frac{Y}{\sqrt{Z/\nu}} \sim t_{(\nu)} \]
Exercise: Let \(X_1,X_2,...,X_n \overset{iid}{\sim} N(\mu,\sigma^2)\). Show that \(\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{(n-1)}\)
Solution
Since \(\bar{X} \sim N(\mu,\sigma^2/n)\), then \(Y=\frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}}\sim N(0,1)\)
Let \(Z= \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(\nu = n-1)}\)
Now, let \(T=\frac{Y}{\sqrt{Z/(n-1)}}\). From the T Statistic result above, \(T\) follows the t distribution with \(\nu=n-1\).
Simplifying \(T\), we get
\[\begin{align} T&=\frac{Y}{\sqrt{Z/(n-1)}}\\ &= \frac{Y}{\sqrt{\frac{(n-1)S^2}{\sigma^2}/(n-1)}}\\ &=\frac{(\bar{X}-\mu)/\sqrt{\sigma^2/n}}{\sqrt{\frac{(n-1)S^2}{\sigma^2}/(n-1)}}\\ &=\frac{(\bar{X}-\mu)}{\sqrt{\sigma^2/n}\sqrt{\frac{S^2}{\sigma^2}}}\\ &=\frac{(\bar{X}-\mu)}{S/\sqrt{n}}\sim t_{(n-1)} \quad \blacksquare \end{align}\]
- (F-Statistic). Let \(U\sim\chi^2_{(\nu_1)}, V\sim\chi^2_{(\nu_2)}\), \(U\) and \(V\) are independent. Then \[ F = \frac{U/\nu_1}{V/\nu_2}\sim F(\nu_1,\nu_2) \] Remark: This will be useful in ANOVA outputs in regression.
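The result of the exercise above can also be checked by simulation; a minimal sketch, with \(n\), \(\mu\), and \(\sigma\) chosen arbitrarily:

```r
# Simulation sketch: (Xbar - mu)/(S/sqrt(n)) behaves like a t with n - 1 df
set.seed(136)
n <- 10; mu <- 5; sigma <- 2
tstat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
qqplot(qt(ppoints(10000), df = n - 1), tstat,
       xlab = "Theoretical t(n - 1) quantiles", ylab = "Simulated quantiles")
abline(0, 1)   # points close to this line support the t(n - 1) result
```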
Approaches to Inference
Point Estimation uses information in a sample to arrive at a single number that will serve as an estimate of the value of the target parameter. The following are important concepts in estimation:
- A point estimator \(\hat{\theta}\) is unbiased if \(E (\hat{\theta}) = \theta\).
- Let \(T_1,...,T_n\) be a sequence of estimators of \(\theta\), where \(T_n\) is the same estimator based on a random sample of size \(n\). The sequence \(\{T_n\}\) is said to be:
- MSE-consistent iff \[ MSE_\theta(T_n)\rightarrow 0 \quad \text{as} \quad n \rightarrow \infty,\quad \forall \theta \in \Omega \]
- weakly consistent iff: \[ P(|T_n-\theta|<\varepsilon) \rightarrow 1 \quad \text{as} \quad n \rightarrow \infty, \quad \forall \theta \in \Omega \]
- A statistic \(T\) is a sufficient statistic if the conditional probability function of the sample observations, given \(T\) , does not depend on the parameter \(\theta\). (Also check: Factorization Criterion for Sufficiency)
- A statistic \(T\) is said to be complete if and only if \(E(g(T)) = 0\) for all \(\theta\) implies \(g(T)\) is almost surely equal to 0.
- It is easy to find complete sufficient statistics when the distribution is a member of exponential family of distributions.
- An estimator \(\hat{\theta}\) is said to be a uniformly minimum variance unbiased estimator (UMVUE) for \(\theta\) if it is unbiased and, for any other unbiased estimator \(\hat{\theta}'\), \(Var(\hat{\theta}) \leq Var(\hat{\theta}')\).
- One of the most popular ways of finding the UMVUE is the Lehmann–Scheffé Theorem. It states that an unbiased estimator of \(\theta\) which is a function of a complete sufficient statistic is the UMVUE for \(\theta\).
Interval Estimation uses sample data to calculate the lower and upper bound of an interval such that the researcher can be highly confident that this interval contains the value of the target parameter.
- We usually construct a (1 − α)100% confidence interval for the unknown parameter.
- The confidence coefficient gives the coverage probability, i.e., the probability that the CI, before sampling, will enclose the true parameter value. Note, however, that once a sample has been observed, a CI ceases to be random and has probability of either 0 or 1 of trapping the true parameter value.
- \((1-\alpha)\) is the probability that you will obtain a sample such that if you compute a \((1-\alpha)\) CI, it will capture the parameter, NOT the probability that the parameter is within a specified interval.
- The interval is random, the parameter is not.
- The most popular way of constructing CIs is the pivotal quantity method (PQM). In the PQM, you manipulate a pivot: a random quantity that involves the unknown parameter being estimated but whose distribution does not depend on any unknown parameter.
Hypothesis testing uses sample data to evaluate the validity of a conjecture regarding unknown parameters.
The null hypothesis is the statement being tested; the conjecture the experimenter doubts to be true.
The alternative hypothesis is the operational statement of the theory that the experimenter believes to be true and wishes to prove.
Note: The null hypothesis and alternative hypothesis must be non-overlapping statements about the population.
The test statistic is a statistic computed from the sample data that is especially sensitive to the differences between the null and alternative hypotheses.
Note: The test statistic should tend to take on certain values when Ho is true and tend to take on different values when Ha is true. The decision to reject Ho depends on the value of the test statistic.
The region of rejection can be thought of as the set of values of the test statistic that will lead to the rejection of the null hypothesis.
Errors in Hypothesis Testing:
Type I error: incorrectly rejecting the null when it is true.
Type II error: incorrectly accepting the null when it is false.
Since the Type I error is usually the more drastic of the two errors in hypothesis testing, it is a common approach to set an upper bound to the probability of committing a Type I error (\(\alpha\)), then find the test with the lowest probability of committing a Type II error (\(\beta\)).
Both error probabilities in hypothesis testing can be reduced by increasing the sample size.
The level of significance (\(\alpha\)) is the maximum probability of committing a Type I error that the researcher is willing to tolerate.
The power of the test (\(1-\beta\)) is the probability of correctly rejecting the null hypothesis.
The power function \(K_\phi(\theta)\) gives the probability of rejecting the null hypothesis as a function of the value of the parameter.
Likelihood Ratio Test
- One of the most popular ways of constructing a test is the likelihood ratio test, which is based on the test statistic \(\lambda\) given by \[ \lambda = \frac{\underset{\Omega_0}{\sup}\mathcal{L}(\theta,\textbf{X})}{\underset{\Omega}{\sup}\mathcal{L}(\theta,\textbf{X})} \]
- The asymptotic distribution under \(H_0\) of the test statistic \(-2\ln(\lambda)\) is \(\chi^2_{(\nu)}\), where \(\nu\) is the number of unknown parameters in the full parameter space minus the number of unknown parameters under the null hypothesis.
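As a simple illustration, consider testing \(H_0: \mu = \mu_0\) for a normal mean with \(\sigma^2\) known. There \(-2\ln(\lambda) = n(\bar{X}-\mu_0)^2/\sigma^2\), which follows a \(\chi^2_{(1)}\) distribution under \(H_0\). A quick simulation sketch (values chosen arbitrarily):

```r
# LRT sketch for H0: mu = mu0 with sigma^2 known
set.seed(136)
n <- 25; mu0 <- 0; sigma <- 1
lrt <- replicate(10000, {
  x <- rnorm(n, mean = mu0, sd = sigma)
  n * (mean(x) - mu0)^2 / sigma^2     # -2 ln(lambda) for this problem
})
mean(lrt > qchisq(0.95, df = 1))      # rejection rate under H0; should be near 0.05
```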
1.3 The Model Building Process
WHAT ARE MODELS?
A model is a set of assumptions that summarizes the structure of a system.
Types of Models
Deterministic Models: models that produce exactly the same result for a particular set of inputs.
Example: income for a day as a function of items sold.
Stochastic Models: models that describe the unpredictable variation of the outcomes of a random experiment.
Example: Grade of Stat 136 students using their Stat 131 grades. Take note that Stat 136 grades may still vary due to other random factors.
In Statistics, we are focused on Stochastic Models.
General Classification of Stochastic Models
- structural models - explain the variability of the variable of interest by using the variability of other variables.
- nonstructural models - explain the variability of a variable using past values or observations.
- "black-box" models - models whose main concern is to simply predict values of the dependent variable using a set of independent variables. Their main characteristic is that the model itself has little to no interpretation.
Purpose of Modelling
to understand the mechanism that generates the data
to predict the values of the dependent variable given the independent variables
to optimize the response indexed by the dependent variable
Types of Variables in a Regression Problem
dependent (regressand, endogenous, target, output, response variable) - whose variability is being studied or explained within the system.
independent (regressor, exogenous, feature, input, explanatory variable) - used to explain the behavior of the dependent variable. The variability of this variable is explained outside of the system.
Examples
Can we predict the selling price of a house from certain characteristics of the house? (Sen and Srivastava, Regression Analysis)
- dependent variable - price of the house
- independent variables - number of bedrooms, floor space, garage size, etc.

Is a person's brain size and body size predictive of his or her intelligence? (Willerman et al., 1991)
- dependent variable - IQ level
- independent variables - brain size based on MRI scans, height, and weight of a person

What are the variables that affect the total expenditure of Filipino households based on the Family Income and Expenditure Survey (PSA, 2012)?
- dependent variable - total annual expenditure of the households
- independent variables - total household income, whether the household is agricultural, total number of household members

What are the determinants of a movie's box-office performance? (Scott, 2019)
- dependent variable - box office figure
- independent variables - production budget, marketing budget, critical reception, genre of the movie
Types of Data
Time-series data - a set of observations on the values that a variable takes at different times (example: daily, weekly, monthly, quarterly, annually, etc.)
Date | Passengers |
---|---|
1949-01-01 | 112 |
1949-01-31 | 118 |
1949-03-02 | 132 |
1949-04-02 | 129 |
1949-05-02 | 121 |
1949-06-02 | 135 |

(First rows shown; the full series consists of 144 monthly observations of international airline passengers from 1949 to 1960.)

Cross-section data - data on one or more variables collected at the same period in time
Car model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |

(First rows of the mtcars dataset: 11 characteristics of 32 car models, all recorded for the same period.)
Panel data - data on one or more variables collected at several time points and from several observations (or panel member)
Remark: Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).
In this course, we will focus on cross-sectional data. As early as now, try to find a cross-sectional dataset for your research project.
That is, find (or gather) a dataset with \(n\) observations and \(p\) variables.
Steps in the Model-Building Process
Planning
- define the problem
- identify the dependent/independent variables
- establish goals

Development of the model
- collect data
- preliminary description/exploration of the data
- specify the model
- fit the model
- validate assumptions
- remedy to regression problems
- obtain the best model

Verification and Maintenance
- check model adequacy
- check sign of coefficient
- check stability of parameters
- check forecasting ability
- update parameters
Here in Stat 136, we focus on the theory behind the second step of the model-building process, the development of the model. You will learn the other steps naturally as you go along in your BS Stat journey.
1.4 Measures of Correlation
CORRELATIONAL ANALYSIS vs REGRESSION ANALYSIS
Both correlational analysis and regression analysis are oftentimes used to describe the relationship of several variables. However, the two methodologies are different.
In correlational analysis, your main goal is to simply describe the relationship of (usually) two variables.
Regression analysis, on the other hand, gives more information than the relationship of the variables: you will be able to create an equation that lets you examine the structure that governs the random phenomenon being studied, and predict values of one variable using other variables.
Although different, correlational analysis is oftentimes done as a preliminary step to explore data before doing regression analysis.
Correlation Coefficient
- Measures the degree of association of 2 or more variables.
- The coefficient does not imply structural relationship and does not indicate causality between the variables.
- The value is usually between -1 and +1.
- Any value close to +1 implies strong direct relationship while a value close to -1 implies strong inverse relationship. A value close to 0 implies weak or no relationship.
- There are many types of correlation coefficients; six are summarized here.
Pearson’s \(r\)
The Pearson product-moment correlation coefficient measures the correlation between two continuous variables. This coefficient may underestimate the degree of association if the relationship is nonlinear.
\[ r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{(\sum_{i=1}^n(x_i-\bar{x})^2)(\sum_{i=1}^n(y_i-\bar{y})^2)}} \]
Example
A sample of 30 towns was drawn, and the mortality rate and the calcium concentration in drinking water were determined for each.
- \(Y\) = 7-year mortality rate (per 100,000)
- \(X\) = average calcium ion concentration in drinking water (ppm)
mortality | calcium |
---|---|
1247 | 105 |
1800 | 14 |
1807 | 15 |
1359 | 84 |
1307 | 78 |
1555 | 39 |
1260 | 21 |
1742 | 8 |
1569 | 91 |
1772 | 15 |
1668 | 17 |
1609 | 18 |
1299 | 78 |
1392 | 73 |
1254 | 96 |
1428 | 39 |
1723 | 44 |
1547 | 9 |
1591 | 16 |
1828 | 8 |
1466 | 5 |
1558 | 10 |
1637 | 10 |
1755 | 12 |
1491 | 20 |
1318 | 122 |
1379 | 94 |
1096 | 138 |
1402 | 37 |
1704 | 26 |
##
## Pearson's product-moment correlation
##
## data: calcium$mortality and calcium$calcium
## t = -6.0649, df = 28, p-value = 1.537e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8759845 -0.5397840
## sample estimates:
## cor
## -0.7535183
Conclusion: Reject the hypothesis that there is no correlation.
The data indicates that there is a strong inverse relationship between mortality and calcium level in drinking water.
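The output above can be reproduced along these lines, assuming the data are stored in a data frame named `calcium` with columns `mortality` and `calcium`:

```r
cor(calcium$mortality, calcium$calcium)        # sample Pearson r
cor.test(calcium$mortality, calcium$calcium)   # test of H0: rho = 0, with a 95% CI
```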
Spearman’s \(\rho\)
The Spearman Rank Correlation is a measure of correlation between rankings. Both variables must be measured on at least an ordinal scale.
\[
\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n (n^2-1)}
\]
where \(d_i\) is the difference between ranks of two variables of observation \(i\)
Example
Ten materials for artificial reef were evaluated.
- \(Y\) = ranking according to number of invertebrates attracted after 1 month.
- \(X\) = ranking according to cost and availability of materials.
Material | X | Y |
---|---|---|
1 | 1 | 3 |
2 | 4 | 2 |
3 | 2 | 4 |
4 | 3 | 5 |
5 | 5 | 1 |
6 | 6 | 7 |
7 | 7 | 8 |
8 | 8 | 6 |
9 | 9 | 9 |
10 | 10 | 10 |
##
## Spearman's rank correlation rho
##
## data: reef$X and reef$Y
## S = 38, p-value = 0.01367
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.769697
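The output above can be reproduced as follows, assuming the rankings are stored in a data frame named `reef`:

```r
cor.test(reef$X, reef$Y, method = "spearman")
```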
Kendall’s \(\tau\)
The Kendall Rank correlation coefficient is also used to measure the ordinal association between two measured quantities.
\[\begin{align} \tau &= \frac{\text{number of concordant pairs}-\text{number of discordant pairs}}{\text{number of pairs}} \\ &= 1-\frac{2(\text{number of discordant pairs})}{\binom{n}{2}} \end{align}\]
Example
We use the same example as in the Spearman correlation:
- \(Y\) = ranking according to number of invertebrates attracted after 1 month.
- \(X\) = ranking according to cost and availability of materials.
With respect to Material 6, there is only one material that is discordant to it, while the other eight are concordant to it.
##
## Kendall's rank correlation tau
##
## data: reef$X and reef$Y
## T = 36, p-value = 0.01667
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6
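Again, assuming the rankings are in the data frame `reef`, the output above comes from:

```r
cor.test(reef$X, reef$Y, method = "kendall")
```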
The three correlation coefficients above are the most commonly used, especially Pearson's r for continuous variables. Kendall's tau and Spearman's rho cannot be used directly on continuous measurements unless these are first converted to ranks.
The following coefficients measure the association of categorical variables.

\(\phi\) Coefficient
The Phi Coefficient is used if both variables are dichotomous.
X | 0 | 1 |
---|---|---|
0 | a | b |
1 | c | d |
\[ \phi=\frac{|ad-bc|}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]
Contingency Coefficient
Used if variables are both categorical.
X | category 1 | category 2 | category 3 |
---|---|---|---|
category 1 | a | d | g |
category 2 | b | e | h |
category 3 | c | f | i |
\[ C = \sqrt{\frac{\chi^2}{n+\chi^2}} \] where \(\chi^2\) is the chi-squared test statistic which can be computed using the contingency table.
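A small sketch of the computation in R; the counts in the table below are purely illustrative:

```r
# Contingency coefficient from a 3 x 3 table of counts (illustrative values)
tab  <- matrix(c(20, 15, 10,
                 12, 25,  8,
                  9, 11, 30), nrow = 3, byrow = TRUE)
chi2 <- unname(chisq.test(tab)$statistic)   # Pearson chi-squared statistic
n    <- sum(tab)
sqrt(chi2 / (n + chi2))                     # contingency coefficient C
```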
Other measures of association if at least one is categorical
- Biserial - one variable is continuous vs. another continuous variable which has been artificially dichotomized.
- Point-biserial - one continuous variable vs. another which is a true dichotomy; it is conservative and safe to use when in doubt.
- Tetrachoric - both variables are quantitative and both have been artificially dichotomized.
- Eta-coefficient - one variable is interval and one is nominal.
1.5 Overview of Regression
Regression analysis is a statistical tool which utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other(s).
The main objective of the analysis is to extract structural relationships among variables within a system. It is of interest to examine the effects that some variables exert (or appear to exert) on other variable(s).
Linear regression is used for a special class of relationships, namely, those that can be described by straight lines. The term simple linear regression refers to the case wherein only two variables are involved; otherwise, it is known as multiple linear regression.
A useful way of beginning a regression analysis is by drawing a graph of one variable against the other. This graph, called the scatter diagram, can serve both to suggest a relationship and to reveal possible inadequacies in it. It indicates the general tendency by which one variable varies with changes in another. The scatter diagram is especially useful in the simple linear regression case.
## Rows: 25 Columns: 2
PRICE | TAX |
---|---|
53 | 652 |
55 | 1000 |
56 | 897 |
58 | 964 |
64 | 1099 |
44 | 960 |
49 | 678 |
72 | 800 |
82 | 1038 |
85 | 1200 |
45 | 860 |
47 | 600 |
49 | 676 |
56 | 1287 |
60 | 834 |
62 | 734 |
64 | 551 |
66 | 1355 |
35 | 561 |
38 | 489 |
43 | 752 |
46 | 774 |
46 | 440 |
50 | 549 |
65 | 900 |
##
## Call:
## lm(formula = PRICE ~ TAX, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.527 -7.723 -1.681 4.165 20.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.39144 7.43543 4.222 0.000324 ***
## TAX 0.02931 0.00864 3.392 0.002505 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 23 degrees of freedom
## Multiple R-squared: 0.3335, Adjusted R-squared: 0.3045
## F-statistic: 11.51 on 1 and 23 DF, p-value: 0.002505
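The output above can be reproduced with code along these lines; the CSV file name is an assumption:

```r
# Scatter diagram and simple linear regression of PRICE on TAX
library(readr)
house <- read_csv("house.csv")          # 25 rows: PRICE and TAX
plot(house$TAX, house$PRICE,
     xlab = "TAX", ylab = "PRICE")      # scatter diagram
fit <- lm(PRICE ~ TAX, data = house)
summary(fit)
```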