CHAPTER 1 Preliminaries
This chapter provides a review of the tools needed for regression analysis.
Concepts from matrix theory (Stat 135) are presented first, followed by important results from the theory of parametric statistical inference (Stat 131).
R code is also included to illustrate some of the results.
1.1 Review of Matrix Theory
WHY DO WE NEED MATRIX THEORY?
We are dealing with multiple variables. Matrices give us a compact representation of systems of equations and simplify calculations that would otherwise have to be written out with summations and other element-by-element operations.
Basic Concepts
A matrix is an array of numbers (constants or variables) containing \(r\) rows and \(c\) columns \[ \underset{3\times4} {\textbf A }= \begin{bmatrix} 0 & 9 & 2 &3 \\ 7 & 6 & 4 &5 \\ 11 & 2 & 1 & 8 \\ \end{bmatrix} \]
The dimension or order of a matrix is its size, i.e., the number of rows and columns.
A vector is an array of numbers (constants or variables) arranged in rows or columns \[ \underset{3\times1} {\textbf a }= \begin{bmatrix} 2 \\ 7 \\ 8 \end{bmatrix} \quad \underset{1\times3} {\textbf a' }= \begin{bmatrix} 2 & 7 & 8 \end{bmatrix} \]
A square matrix is a matrix that has equal number of rows and columns \[ \underset{n\times n}{\textbf{A}} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \]
The diagonal elements are the elements found in the diagonal of a square matrix while those elements other than the diagonal elements are the off-diagonal or nondiagonal elements.
A diagonal matrix is a square matrix that has zero for all of its off-diagonal elements. \[ \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix} \]
A triangular matrix is a square matrix with all elements above (or below) the diagonal being zero. \[ \begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ a_{21} & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ 0 & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{bmatrix} \]
Matrix Operations
- Transpose of a matrix: \(\textbf{A}'\)
- Trace of a matrix: \(tr(\textbf{A})=\sum_{i=1}^n a_{ii}\)
- Addition of conformable matrices: \(\underset{p \times q}{\textbf{A}}+\underset{p \times q}{\textbf{B}}\)
- Scalar Multiplication: \(c\textbf{A}=\{ca_{ij}\}\)
- Multiplication of conformable matrices: \(\underset{p \times q}{\textbf{A}}\times \underset{q \times r}{\textbf{C}}=\underset{p \times r}{\textbf{D}}\)
- Determinant of a matrix \(det(\textbf{A})=|\textbf{A}|\)
Special Matrices
Symmetric Matrices: If A is symmetric, then \(\textbf{A}=\textbf{A}'\)
Idempotent Matrix: \(\textbf{A}^2=\textbf{A}\)
Null Matrix: \(\textbf{0}=\begin{bmatrix} 0 & 0 & \cdots & 0 \\0 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix}\)
Identity matrix: \(\textbf{I}=\begin{bmatrix} 1 & 0 & \cdots & 0 \\0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}\)
J matrix (matrix of ones): \[ \textbf{J}_2=\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix},\quad \textbf{J}_{2\times 5}=\begin{bmatrix} 1 & 1 &1 &1 &1 \\ 1 & 1 & 1 & 1 & 1\end{bmatrix}, \quad \textbf{J}=\textbf1\textbf{1}' \]
Let \(\textbf{J}\) be a square matrix of ones of order \(n\). Then \[ \bar{\textbf{J}}=\frac{1}{n}\textbf{J}= \begin{bmatrix} 1/n & 1/n & \cdots & 1/n \\ 1/n & 1/n & \cdots & 1/n \\ \vdots & \vdots & \ddots & \vdots \\ 1/n & 1/n & \cdots & 1/n \end{bmatrix} \]
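As a quick sketch in R (with \(n = 4\) chosen arbitrarily), \(\textbf{J}\) and \(\bar{\textbf{J}}\) can be built directly, and \(\bar{\textbf{J}}\) can be checked to be idempotent (a property defined above):

```r
# A sketch: J and Jbar of order n (n = 4 chosen arbitrarily)
n    <- 4
J    <- matrix(1, nrow = n, ncol = n)   # matrix of ones, J = 1 1'
Jbar <- J / n
all.equal(Jbar %*% Jbar, Jbar)          # TRUE: Jbar is idempotent
```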
Positive semi-definite: A symmetric matrix \(\textbf{M}\) such that \(\textbf{x}'\textbf{M}\textbf{x}\geq0 \quad \forall\,\textbf{x}\in\mathbb{R}^n\)
Invertibility and Singularity
An \(n \times n\) square matrix \(\textbf{A}\) is invertible if and only if \(|\textbf{A}|\neq 0\)
An \(n \times n\) square matrix \(\textbf{A}\) is singular if \(|\textbf{A}|= 0\) and nonsingular if \(|\textbf{A}|\neq 0\)
Results on Inverses
If a matrix has an inverse, then the inverse is unique
\((\textbf{A}^{-1})^{-1}=\textbf{A}\)
\(\textbf{A}^{-1}\textbf{A}=\textbf{I}\)
If \(\textbf{A}\) and \(\textbf{B}\) are nonsingular matrices and \(\textbf{AB}\) is defined, then \((\textbf{AB})^{-1}=\textbf{B}^{-1}\textbf{A}^{-1}\)
The transpose of an invertible matrix is also invertible.
A square matrix \(\textbf{A}\) is orthogonal if \(\textbf{A}'=\textbf{A}^{-1}\), or equivalently, \(\textbf{A}\textbf{A}'=\textbf{A}'\textbf{A}=\textbf{I}\)
Linear Dependence and Ranks
If \(\textbf{M}=[\textbf{m}_1, \textbf{m}_2, \ldots,\textbf{m}_n]\), where the \(\textbf{m}_i\) are vectors of dimension \(n\times 1\), the \(\textbf{m}_i\) are linearly dependent if there exist constants \(c_1, c_2,\ldots,c_n\), not all zero, such that \(c_1\textbf{m}_1 + c_2\textbf{m}_2 +\cdots +c_n\textbf{m}_n = \textbf{0}\). Otherwise, they are linearly independent.
The rank of matrix \(\textbf{M}\) is defined to be the largest number of linearly independent rows (columns) of \(\textbf{M}\).
Some results on ranks
\(rk(\textbf{A})=rk(\textbf{A}')\)
If \(\textbf{A}\) is idempotent, \(rk(\textbf{I}-\textbf{A})=rk(\textbf{I})-rk(\textbf{A})=tr(\textbf{I}-\textbf{A})\)
If two square matrices \(\textbf{A}\) and \(\textbf{B}\) , each of order \(n\), are nonsingular, then for any matrix \(\textbf{C}\) where multiplication with \(\textbf{A}\) and \(\textbf{B}\) are defined, the matrices \(\textbf{C}\), \(\textbf{AC}\), \(\textbf{CB}\) , and \(\textbf{ACB}\) all have the same rank.
The rank of the product of two matrices \(\textbf{A}\) and \(\textbf{B}\) is at most equal to the smaller of the ranks of \(\textbf{A}\) and \(\textbf{B}\). \[ rk(\textbf{AB})\leq \min\{rk(\textbf{A}),rk(\textbf{B})\} \]
Let \(\textbf{A}\) be a square matrix of order \(n\) . \(|\textbf{A}| = 0\) if and only if \(rk(\textbf{A}) < n\).
Let \(\textbf{A}\) and \(\textbf{B}\) be both \(m \times n\) matrices with ranks \(r_1\) and \(r_2\) respectively. Then \(rk(\textbf{A} + \textbf{B}) \leq r_1 + r_2\).
Eigenvalues and Eigenvectors
Let \(\textbf{A}\) be an \(n \times n\) matrix.
- A scalar \(\lambda\) is an eigenvalue of \(\textbf{A}\) if there \(\exists\) a nonzero vector \(\textbf{x}\in \mathbb{R}^n\) such that \(\textbf{Ax}=\lambda\textbf{x}\).
- Any \(\textbf{x}\neq\textbf{0}\) satisfying the above equation is called an eigenvector of \(\textbf{A}\) corresponding to eigenvalue \(\lambda\)
Remarks on Eigenvalues and Eigenvectors
The eigenvalues \(\lambda_1, \lambda_2, \ldots, \lambda_n\) of \(\textbf{A}\) are the roots of the characteristic equation (of degree \(n\)) \(|\textbf{A}-\lambda \textbf{I}|=0\); for the symmetric matrices used in regression, these roots are real. The roots are sometimes called latent, proper, or characteristic roots.
\(\textbf{A}\) is singular if and only if 0 is an eigenvalue of \(\textbf{A}\)
The characteristic polynomials of \(\textbf{A}\) and \(\textbf{A}'\) are identical, so \(\textbf{A}\) and \(\textbf{A}'\) have the same eigenvalues. However, their eigenvectors are not identical.
If \(\textbf{A}\) has eigenvalues \(\lambda_1, \lambda_2,...,\lambda_n\), then
\[ tr(\textbf{A})=\sum_{i=1}^n\lambda_i \quad \text{and} \quad |\textbf{A}| = \prod_{i=1}^n\lambda_i \]
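These two identities can be checked numerically in R; the \(3 \times 3\) matrix below is the same \(\textbf{A}\) used in the R outputs at the end of this section:

```r
# Numerical check of the trace and determinant identities
A <- matrix(c(1, 3, 1,
              2, 5, 3,
              2, 8, 6), nrow = 3, byrow = TRUE)
lambda <- eigen(A)$values
sum(lambda);  sum(diag(A))    # both equal tr(A) = 12
prod(lambda); det(A)          # both equal |A| = -6
```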
Decomposition of Matrices
Spectral Decomposition Let \(\textbf{A}\) be an \(n \times n\) symmetric matrix. The matrix \(\textbf{A}\) can be decomposed as \(\textbf{P}\textbf{D}\textbf{P}'\), where \(\textbf{D}\) is a diagonal matrix with eigenvalues of \(\textbf{A}\) as its diagonal elements and \(\textbf{P}\) is an n-dimensional square matrix whose \(i^{th}\) column is the \(i^{th}\) eigenvector of \(\textbf{A}\). That is, \[ \textbf{A}=\sum_{i=1}^n\lambda_i\textbf{p}_i\textbf{p}_i'=\textbf{P}\textbf{D}\textbf{P}' \]
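A short sketch in R: `eigen()` applied to a symmetric matrix returns the eigenvalues and the matrix \(\textbf{P}\) of eigenvectors, from which \(\textbf{A}\) can be reconstructed (the symmetric matrix below is an arbitrary example):

```r
# Spectral decomposition sketch for an arbitrary symmetric matrix S
S  <- matrix(c(4, 1, 2,
               1, 3, 0,
               2, 0, 5), nrow = 3, byrow = TRUE)
es <- eigen(S)
P  <- es$vectors          # columns are the (orthonormal) eigenvectors
D  <- diag(es$values)     # eigenvalues on the diagonal
P %*% D %*% t(P)          # reconstructs S (up to rounding error)
```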
Singular Value Decomposition
Any \(n \times p\) matrix \(\textbf{X}\) can be decomposed as \[ \textbf{X} = \textbf{U}\textbf{D}\textbf{V}' \]
where
\(\textbf{U}\) is a (column) orthogonal \(n \times p\) matrix.
\(\textbf{D}\) is a diagonal matrix containing the singular values \(D_{ii}\) on the diagonal in decreasing order.
\(\textbf{V}\) is an orthogonal \(p \times p\) matrix.
\(\textbf{U}'\textbf{U}=\textbf{V}'\textbf{V}=\textbf{I}_p\)
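In R, `svd()` returns the three factors. A sketch with an arbitrary \(5 \times 4\) matrix:

```r
# SVD sketch for an arbitrary n x p matrix
set.seed(136)
X  <- matrix(rnorm(5 * 4), nrow = 5, ncol = 4)
sv <- svd(X)
U <- sv$u; D <- diag(sv$d); V <- sv$v
max(abs(X - U %*% D %*% t(V)))                  # ~ 0, so X = U D V'
round(t(U) %*% U, 10); round(t(V) %*% V, 10)    # both are I_p
```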
Matrix Calculus
Let \(f(\textbf{x})\) be a continuous function of the elements of the vector \(\textbf{x}′ = \begin{bmatrix} x_1 & x_2 & \cdots & x_p \end{bmatrix}\) whose first and second partial derivatives \(\frac{\partial f(\textbf{x})}{\partial x_i}\), \(\frac{\partial^2 f(\textbf{x})}{\partial x_i \partial x_j}\) exist for all points \(\textbf{x}\) in some region of \(p\)-dimensional Euclidean space.
Derivative of \(f(\textbf{x})\) with respect to \(\textbf{x}\): \[\nabla f(\textbf{x})=\frac{\partial f(\textbf{x})}{\partial{\textbf{x}}} = \left[\frac{\partial f(\textbf{x})}{\partial{x_i}} \right], \quad i=1,2,...,p\]
Hessian matrix of \(f(\textbf{x})\): \[ H_f = \frac{\partial^2 f(\textbf{x})}{\partial{\textbf{x}}\partial{\textbf{x}'}} = \left[\frac{\partial^2 f(\textbf{x})}{\partial{x_i}\partial{x_j}} \right], \quad i,j=1,2,...,p \]
Some Results in Matrix Calculus
If \(f(\textbf{x})=c\), then \(\frac{\partial f(\textbf{x})}{\partial \textbf{x}}=\textbf{0}\)
If \(f(\textbf{x}) = \textbf{a}′\textbf{x}\), where \(\textbf{a}\) is a \(p \times 1\) vector of constants, then \(\frac{\partial f(\textbf{x})}{\partial \textbf{x}} = \textbf{a}\).
If \(f(\textbf{x}) = \textbf{x}′\textbf{Ax}\), where \(\textbf{A}\) is symmetric, then \(\frac{\partial f (\textbf{x})}{\partial \textbf{x}} = 2\textbf{Ax}\).
For the general quadratic form \(f (\textbf{x}) = (\textbf{a} ± \textbf{Bx})'\textbf{A}(\textbf{a} ± \textbf{Bx})\), where \(\textbf{a}\) is an \(m \times 1\) vector of constants, \(\textbf{B}_{m \times p}\) is a matrix of constants, \(\textbf{x}_{p \times 1}\) is a vector of variables, and \(\textbf{A}_{m \times m}\) is a symmetric matrix of constants, \[ \frac{\partial f (\textbf{x})}{\partial \textbf{x}} = ±2\textbf{B}'\textbf{A}(\textbf{a} ± \textbf{Bx}). \]
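The third result can be checked numerically with finite differences; the matrix \(\textbf{A}\) and the point \(\textbf{x}\) below are arbitrary illustrative values:

```r
# Numerical check that the gradient of f(x) = x'Ax is 2Ax for symmetric A
set.seed(136)
p <- 3
A <- crossprod(matrix(rnorm(p^2), p, p))    # an arbitrary symmetric matrix
x <- rnorm(p)
f <- function(x) as.numeric(t(x) %*% A %*% x)
eps <- 1e-6
num_grad <- sapply(1:p, function(i) {
  e <- replace(numeric(p), i, eps)
  (f(x + e) - f(x - e)) / (2 * eps)         # central difference
})
cbind(numerical = num_grad, analytic = as.numeric(2 * A %*% x))
```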
R outputs
Defining Matrices \(\textbf{A}\) and \(\textbf{B}\)
## [,1] [,2] [,3]
## [1,] 1 3 1
## [2,] 2 5 3
## [3,] 2 8 6
## [,1] [,2] [,3]
## [1,] 1 5 8
## [2,] 3 7 0
## [3,] 9 4 2
Sum of two conformable matrices
## [,1] [,2] [,3]
## [1,] 2 8 9
## [2,] 5 12 3
## [3,] 11 12 8
Difference of two conformable matrices
## [,1] [,2] [,3]
## [1,] 0 -2 -7
## [2,] -1 -2 3
## [3,] -7 4 4
Matrix multiplication
## [,1] [,2] [,3]
## [1,] 19 30 10
## [2,] 44 57 22
## [3,] 80 90 28
Transpose of a matrix: \(\textbf{A}'\)
## [,1] [,2] [,3]
## [1,] 1 2 2
## [2,] 3 5 8
## [3,] 1 3 6
Inner product of two matrices with same dimensions: \(\textbf{A}'\textbf{B}\)
## [,1] [,2] [,3]
## [1,] 25 27 12
## [2,] 90 82 40
## [3,] 64 50 20
Inverse of a matrix: \(A^{-1}\)
## [,1] [,2] [,3]
## [1,] -1 1.6666667 -0.6666667
## [2,] 1 -0.6666667 0.1666667
## [3,] -1 0.3333333 0.1666667
Determinant of a matrix
## [1] -6
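The outputs above can be reproduced with R code along the following lines:

```r
# Matrices A and B as printed above
A <- matrix(c(1, 3, 1,
              2, 5, 3,
              2, 8, 6), nrow = 3, byrow = TRUE)
B <- matrix(c(1, 5, 8,
              3, 7, 0,
              9, 4, 2), nrow = 3, byrow = TRUE)

A + B         # sum of conformable matrices
A - B         # difference
A %*% B       # matrix multiplication
t(A)          # transpose
t(A) %*% B    # A'B (equivalently crossprod(A, B))
solve(A)      # inverse
det(A)        # determinant
```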
1.2 Review of Statistical Inference
WHAT IS STATISTICAL INFERENCE?
Statistical Inference is an area in statistics that deals with the methods used to make generalizations or inference about some characteristics of the population based on information contained in the sample.
Approaches to inference
- Estimation: estimate the value of the parameter of interest.
- Point Estimation: calculate a single number as our guess to the unknown parameter.
- Confidence Interval Estimation: create an interval which we hope contains the unknown parameter with a specific level of confidence.
- Hypothesis Testing: make decisions on whether or not the sample agrees with the researcher’s assertion regarding some characteristic of the population
- Parametric test: test hypotheses concerning the specific distributional characteristics (parameter) of the population.
- Non-parametric test: make inferences about population without assuming a specific distribution
Example:
Objective: “How effective is Minoxidil in treating male pattern baldness?”
Specific Objectives:
- (point estimation) to estimate the population proportion of patients who will show new hair growth after being treated with Minoxidil
- (hypothesis testing) to determine whether treatment using Minoxidil is better than the existing treatment that is known to stimulate hair growth among 40% of patients with male pattern baldness
Basic Definitions
- Let the random variable \(Y\) have a probability density function \(f(y)\) (or probability mass function \(p(y)\) for discrete). The expected value or mean of Y, denoted by \(\mu_Y\) or \(E(Y)\) is defined as
\[ E(Y)=\int_{-\infty} ^\infty y f(y)\,dy \quad \text{(continuous)} \qquad E(Y) = \sum_{\forall y} y\, p(y) \quad \text{(discrete)} \]
The variance of a random variable \(Y\), denoted by \(\sigma^2_Y\) or \(Var(Y)\), is defined as \[ Var(Y) = E[(Y-\mu_Y)^2]=E(Y^2)-\mu_Y^2 \]
The covariance of Y and Z, denoted by \(Cov(Y,Z)\), or \(\sigma_{Y,Z}\), is defined by \[ Cov(Y,Z) = E[(Y-\mu_Y)(Z-\mu_Z)]=E(YZ)-\mu_Y\mu_Z \]
Independence of two random variables. Let \(Y\sim f_Y\) and \(Z \sim f_Z\). The random variables \(Y\) and \(Z\) are said to be independent if and only if \[ f_{Y,Z}(y,z) = f_Y(y) f_Z(z) \] where \(f_{Y,Z}(y,z)\) is the joint probability function of \(Y\) and \(Z\)
Random Vectors and Matrices
Expectation of a random vector. If \(\underset{n \times 1}{\textbf{y}}\) is a vector of random variables, then \[ E(\textbf{y})=E \begin{bmatrix} Y_1\\Y_2 \\ \vdots \\Y_n \end{bmatrix} = \begin{bmatrix} E(Y_1)\\ E(Y_2) \\ \vdots \\ E(Y_n) \end{bmatrix} \]
Expectation of a random matrix. Let \(\textbf{X}=\{X_{ij}\}\) be a matrix of random variables. Then \[ E(\textbf{X})=\{\mu_{ij}\} \]
- Remarks: \(E(\textbf{AXB}\pm\textbf{F})=\textbf{A}E(\textbf{X})\textbf{B}\pm\textbf{F}\) where \(\textbf{A}\) , \(\textbf{B}\), and \(\textbf{F}\) are constant matrices.
Variance-Covariance Matrix. Let \(\textbf{y}_{n\times 1}\) be a vector of random variables, \(\textbf{y}'=\begin{bmatrix} Y_1 & Y_2 & \cdots &Y_n \end{bmatrix}\). The variance-covariance matrix of \(\textbf{y}\) is defined as \[\begin{align} Var(\textbf{y})&= E \{ (\textbf{y} - E(\textbf{y}))(\textbf{y} - E(\textbf{y}))' \}\\ &= E \left\{ \begin{bmatrix} Y_1-E(Y_1) \\ Y_2-E(Y_2) \\ \vdots \\ Y_n-E(Y_n) \end{bmatrix} \begin{bmatrix} Y_1-E(Y_1) & Y_2-E(Y_2) & \cdots & Y_n-E(Y_n) \end{bmatrix} \right\}\\ &= \{\sigma_{ij}\}=\boldsymbol{\Sigma} \end{align}\]
where
the diagonal elements are the variances of \(Y_i\): \(\sigma_{ii}=\sigma_i^2=Var(Y_i)\)
the off-diagonal elements are the covariances of \(Y_i\) and \(Y_j\): \(\sigma_{ij}=Cov(Y_i,Y_j)\)
Correlation Matrix. Let \(\textbf{y}_{n\times 1}\) be a vector of random variables, \(\textbf{y}'=\begin{bmatrix} Y_1 & Y_2 & \cdots &Y_n \end{bmatrix}\), and let \(\sigma_i\) be the standard deviation of \(Y_i\). The correlation matrix of \(\textbf{y}\) denoted as \(\rho(\textbf{y})\) is defined as:
\[\begin{align}\rho(\textbf{y})&= diag^{-1}\{\sigma_1 \quad \sigma_2 \quad \cdots\sigma_n\}\boldsymbol{\Sigma}diag^{-1}\{\sigma_1 \quad \sigma_2 \quad \cdots\sigma_n\}\\&= \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots &0 \\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & \frac{1}{\sigma_n} \end{bmatrix} \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots &\vdots\\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma_1} & 0 & \cdots & 0 \\ 0 & \frac{1}{\sigma_2} & \cdots &0 \\ \vdots & \vdots & \ddots &\vdots\\ 0 & 0 & \cdots & \frac{1}{\sigma_n} \end{bmatrix}\\ &= \left\{\frac{\sigma_{ij}}{\sigma_i \sigma_j}\right\}, \quad i,j = 1,...,n\end{align}\]
Remarks on variance and correlation:
The Variance-Covariance matrix and the Correlation Matrix are always symmetric.
The diagonal elements of the correlation matrix are always equal to 1.
Let \(\textbf{C}\) be a matrix of constants and \(\textbf{y}\) be a random vector where \(Var(\textbf{y})=\boldsymbol{\Sigma}\). Then \(Var(\textbf{Cy})=\textbf{C}\boldsymbol{\Sigma}\textbf{C}'\)
Quadratic Form. Let \(\textbf{y}\) be a vector of \(n\) random variables with mean \(\boldsymbol{\mu}\) and \(Var(\textbf{y})=\boldsymbol{\Sigma}\) and let \(\textbf{A}\) be an \(n\)-dimensional symmetric matrix. The scalar quantity \(\textbf{y}'\textbf{A}\textbf{y}\) is known as a quadratic form in \(\textbf{y}\).
Remarks:
\(E(\textbf{y}'\textbf{A}\textbf{y}) = tr(\textbf{A}\boldsymbol{\Sigma})+\boldsymbol{\mu}'\textbf{A}\boldsymbol{\mu}\)
Under Multivariate Normality, \(Var(\textbf{y}'\textbf{A}\textbf{y})=2\,tr\{(\textbf{A}\boldsymbol{\Sigma})^2\}+4\boldsymbol{\mu}'\textbf{A}\boldsymbol{\Sigma}\textbf{A}\boldsymbol{\mu}\)
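The first remark can be checked with a small Monte Carlo sketch; the values of \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\), and \(\textbf{A}\) below are arbitrary, and the MASS package is assumed to be available:

```r
# Monte Carlo check of E(y'Ay) = tr(A Sigma) + mu'A mu
library(MASS)   # for mvrnorm()
set.seed(136)
mu    <- c(1, 2, 3)
Sigma <- matrix(c(2, 1, 0,
                  1, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)
A     <- diag(3)                                    # a simple symmetric choice of A
y     <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)
qf    <- rowSums((y %*% A) * y)                     # y'Ay for each simulated vector
mean(qf)                                            # Monte Carlo estimate
sum(diag(A %*% Sigma)) + drop(t(mu) %*% A %*% mu)   # theoretical value
```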
The Normal Distribution
We say that a random variable \(Y\) follows the normal distribution denoted by \(Y\sim Normal(\mu,\sigma^2)\) if and only if the pdf of \(Y\) is given by:
\[ f_Y(y)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{1}{2}\frac{(y-\mu)^2}{\sigma^2}\right\} \]
Remarks:
- If \(Y\sim Normal(\mu,\sigma^2)\) then, \(E(Y)=\mu\), \(Var(Y)=\sigma^2\), \(m_Y(t)=\exp\{\mu t+\frac{1}{2}\sigma^2 t^2\}\)
- The normal distribution provides a reasonably good description of the relative frequency distributions of many random variables encountered in practice.
- A lot of procedures in inferential statistics assume that the population is normally distributed.
- In Stat 136, one of the most common assumptions is that the errors are normally distributed with mean 0.
Results in Sampling from the Normal Distribution
(Sample Mean) Let \(X_1,X_2,...,X_n \overset{iid}{\sim} Normal(\mu,\sigma^2)\) \[ \bar{X}=\frac{1}{n}\sum_{i=1}^n X_i\sim Normal\left(\mu,\frac{\sigma^2}{n}\right) \]
(Sum of Squares of Standard Normal) Let \(Y_1,...,Y_n \overset{iid}{\sim} N(0,1)\). Then \[ \sum_{i=1}^n Y_i^2 \sim \chi^2_{(\nu=n)} \]
(Sample Variance). Let \(S^2\) be the sample variance of \(X_1,X_2,...,X_n\overset{iid}{\sim}N(\mu,\sigma^2)\). Then \[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(\nu = n-1)} \]
(T Statistic). Let \(Y\sim N(0,1), Z\sim \chi^2_\nu\), \(Y\) and \(Z\) are independent. Then \[ T = \frac{Y}{\sqrt{Z/\nu}} \sim t_{(\nu)} \]
Exercise: Let \(X_1,X_2,...,X_n \overset{iid}{\sim} N(\mu,\sigma^2)\). Show that \(\frac{\bar{X}-\mu}{S/\sqrt{n}} \sim t_{(n-1)}\)
Solution
Since \(\bar{X} \sim N(\mu,\sigma^2/n)\), then \(Y=\frac{\bar{X}-\mu}{\sqrt{\sigma^2/n}}\sim N(0,1)\)
Let \(Z= \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{(\nu = n-1)}\)
Now, let \(T=\frac{Y}{\sqrt{Z/(n-1)}}\). From the T Statistic result above, \(T\) follows the t distribution with \(\nu=n-1\).
Simplifying \(T\), we get
\[\begin{align} T&=\frac{Y}{\sqrt{Z/(n-1)}}\\ &= \frac{Y}{\sqrt{\frac{(n-1)S^2}{\sigma^2}/(n-1)}}\\ &=\frac{(\bar{X}-\mu)/\sqrt{\sigma^2/n}}{\sqrt{\frac{(n-1)S^2}{\sigma^2}/(n-1)}}\\ &=\frac{(\bar{X}-\mu)}{\sqrt{\sigma^2/n}\sqrt{\frac{S^2}{\sigma^2}}}\\ &=\frac{(\bar{X}-\mu)}{S/\sqrt{n}}\sim t_{(n-1)} \quad \blacksquare \end{align}\]
- (F-Statistic). Let \(U\sim\chi^2_{(\nu_1)}, V\sim\chi^2_{(\nu_2)}\), \(U\) and \(V\) are independent. Then \[ F = \frac{U/\nu_1}{V/\nu_2}\sim F(\nu_1,\nu_2) \] Remark: This will be useful in ANOVA outputs in regression.
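The result of the exercise above can also be checked by simulation; a minimal sketch, with \(n\), \(\mu\), and \(\sigma\) chosen arbitrarily:

```r
# Simulation sketch: (Xbar - mu)/(S/sqrt(n)) behaves like a t with n - 1 df
set.seed(136)
n <- 10; mu <- 5; sigma <- 2
tstat <- replicate(10000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(n))
})
qqplot(qt(ppoints(10000), df = n - 1), tstat,
       xlab = "Theoretical t(n - 1) quantiles", ylab = "Simulated quantiles")
abline(0, 1)   # points close to this line support the t(n - 1) result
```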
Approaches to Inference
Point Estimation uses information in a sample to arrive at a single number that will serve as an estimate of the value of the target parameter. The following are important concepts in estimation:
- A point estimator \(\hat{\theta}\) is unbiased if \(E (\hat{\theta}) = \theta\).
- Let \(T_1,...,T_n\) be a sequence of estimators of \(\theta\), where \(T_n\) is the same estimator based on a random sample of size \(n\). The sequence \(\{T_n\}\) is said to be:
- MSE-consistent iff \[ MSE_\theta(T_n)\rightarrow 0 \quad \text{as} \quad n \rightarrow \infty,\quad \forall \theta \in \Omega \]
- weakly consistent iff: \[ P(|T_n-\theta|<\varepsilon) \rightarrow 1 \quad \text{as} \quad n \rightarrow \infty, \quad \forall \theta \in \Omega \]
- A statistic \(T\) is a sufficient statistic if the conditional probability function of the sample observations, given \(T\) , does not depend on the parameter \(\theta\). (Also check: Factorization Criterion for Sufficiency)
- A statistic \(T\) is said to be complete if and only if \(E(g(T)) = 0\) for all \(\theta\) implies \(g(T)\) is almost surely equal to 0.
- It is easy to find complete sufficient statistics when the distribution is a member of exponential family of distributions.
- An estimator \(\hat{\theta}\) is said to be a uniformly minimum variance unbiased estimator (UMVUE) for \(\theta\) if it is unbiased and, for any other unbiased estimator \(\hat{\theta}'\), \(Var(\hat{\theta}) \leq Var(\hat{\theta}')\).
- One of the most popular ways of finding the UMVUE is the Lehmann–Scheffé Theorem. It states that an unbiased estimator of \(\theta\) which is a function of a complete sufficient statistic is the UMVUE for \(\theta\).
Interval Estimation uses sample data to calculate the lower and upper bound of an interval such that the researcher can be highly confident that this interval contains the value of the target parameter.
- We usually construct a (1 − α)100% confidence interval for the unknown parameter.
- The confidence coefficient gives the coverage probability, i.e., the probability that the CI, before sampling, will enclose the true parameter value. Note, however, that once a sample has been observed, a CI ceases to be random and has probability of either 0 or 1 of trapping the true parameter value.
- \((1-\alpha)\) is the probability that you will obtain a sample such that if you compute a \((1-\alpha)\) CI, it will capture the parameter, NOT the probability that the parameter is within a specified interval.
- The interval is random, the parameter is not.
- The most popular way of constructing CIs is the pivotal quantity method (PQM). In the PQM, you manipulate a pivot: a random quantity that involves the unknown parameter being estimated but whose distribution does not depend on any unknown parameter.
Hypothesis testing uses sample data to evaluate the validity of a conjecture regarding unknown parameters.
The null hypothesis is the statement being tested; the conjecture the experimenter doubts to be true.
The alternative hypothesis is the operational statement of the theory that the experimenter believes to be true and wishes to prove.
Note: The null hypothesis and alternative hypothesis must be non-overlapping statements about the population.
The test statistic is a statistic computed from the sample data that is especially sensitive to the differences between the null and alternative hypotheses.
Note: The test statistic should tend to take on certain values when Ho is true and tend to take on different values when Ha is true. The decision to reject Ho depends on the value of the test statistic.
The region of rejection can be thought of as the set of values of the test statistic that will lead to the rejection of the null hypothesis.
Errors in Hypothesis Testing:
Type I error: incorrectly rejecting the null when it is true.
Type II error: incorrectly accepting the null when it is false.
Since the Type I error is usually the more drastic of the two errors in hypothesis testing, it is a common approach to set an upper bound to the probability of committing a Type I error (\(\alpha\)), then find the test with the lowest probability of committing a Type II error (\(\beta\)).
Both error probabilities in hypothesis testing can be reduced by increasing the sample size.
The level of significance (\(\alpha\)) is the maximum probability of committing a Type I error that the researcher is willing to tolerate.
The power of the test (\(1-\beta\)) is the probability of correctly rejecting the null hypothesis.
The power function \(K_\phi(\theta)\) gives the probability of rejecting the null hypothesis as a function of the value of the parameter.
Likelihood Ratio Test
- One of the most popular ways of constructing a test is the likelihood ratio test, which is based on the test statistic \(\lambda\) given by \[ \lambda = \frac{\underset{\Omega_0}{\sup}\mathcal{L}(\theta,\textbf{X})}{\underset{\Omega}{\sup}\mathcal{L}(\theta,\textbf{X})} \]
- The asymptotic distribution under \(H_0\) of the test statistic \(-2\ln(\lambda)\) is \(\chi^2_{(\nu)}\), where \(\nu\) is the number of unknown parameters in the full parameter space minus the number of unknown parameters under the null hypothesis.
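As a simple illustration, consider testing \(H_0: \mu = \mu_0\) for a normal mean with \(\sigma^2\) known. There \(-2\ln(\lambda) = n(\bar{X}-\mu_0)^2/\sigma^2\), which follows a \(\chi^2_{(1)}\) distribution under \(H_0\). A quick simulation sketch (values chosen arbitrarily):

```r
# LRT sketch for H0: mu = mu0 with sigma^2 known
set.seed(136)
n <- 25; mu0 <- 0; sigma <- 1
lrt <- replicate(10000, {
  x <- rnorm(n, mean = mu0, sd = sigma)
  n * (mean(x) - mu0)^2 / sigma^2     # -2 ln(lambda) for this problem
})
mean(lrt > qchisq(0.95, df = 1))      # rejection rate under H0; should be near 0.05
```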
1.3 The Model Building Process
WHAT ARE MODELS?
A model is a set of assumptions that summarizes the structure of a system.
Types of Models
Deterministic Models: models that produce exactly the same result for a particular set of inputs.
Example: income for a day as a function of items sold.
Stochastic Models: models that describe the unpredictable variation of the outcomes of a random experiment.
Example: Grade of Stat 136 students using their Stat 131 grades. Take note that Stat 136 grades may still vary due to other random factors.
In Statistics, we are focused on Stochastic Models.
General Classification of Stochastic Models
- structural models - explain the variability of the variable of interest by using the variability of other variables.
- nonstructural models - explain the variability of a variable using past values or observations.
- "black-box" models - models whose main concern is to simply predict values of the dependent variable using a set of independent variables. Their main characteristic is that the model itself has little to no interpretation.
Purpose of Modelling
to understand the mechanism that generates the data
to predict the values of the dependent variable given the independent variables
to optimize the response indexed by the dependent variable
Types of Variables in a Regression Problem
dependent (regressand, endogenous, target, output, response variable) - whose variability is being studied or explained within the system.
independent (regressor, exogenous, feature, input, explanatory variable) - used to explain the behavior of the dependent variable. The variability of this variable is explained outside of the system.
Examples
Can we predict the selling price of a house from certain characteristics of the house? (Sen and Srivastava, Regression Analysis)
- dependent variable - price of the house
- independent variables - number of bedrooms, floor space, garage size, etc.

Is a person's brain size and body size predictive of his or her intelligence? (Willerman et al., 1991)
- dependent variable - IQ level
- independent variables - brain size based on MRI scans, height, and weight of a person

What are the variables that affect the total expenditure of Filipino households based on the Family Income and Expenditure Survey (PSA, 2012)?
- dependent variable - total annual expenditure of the households
- independent variables - total household income, whether the household is agricultural, total number of household members

What are the determinants of a movie's box-office performance? (Scott, 2019)
- dependent variable - box office figure
- independent variables - production budget, marketing budget, critical reception, genre of the movie
Types of Data
Time-series data - a set of observations on the values that a variable takes at different times (example: daily, weekly, monthly, quarterly, annually, etc.)
Date | Passengers |
---|---|
1949-01-01 | 112 |
1949-01-31 | 118 |
1949-03-02 | 132 |
1949-04-02 | 129 |
1949-05-02 | 121 |
1949-06-02 | 135 |

(First rows shown; the full series consists of 144 monthly observations of international airline passengers from 1949 to 1960.)

Cross-section data - data on one or more variables collected at the same period in time
Car model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |

(First rows of the mtcars dataset: 11 characteristics of 32 car models, all recorded for the same period.)
Panel data - data on one or more variables collected at several time points and from several observations (or panel member)
Remark: Time series and cross-sectional data can be thought of as special cases of panel data that are in one dimension only (one panel member or individual for the former, one time point for the latter).
In this course, we will focus on cross-sectional data. As early as now, try to find a cross-sectional dataset for your research project.
That is, find (or gather) a dataset with \(n\) observations and \(p\) variables.
Steps in the Model-Building Process
Planning
- define the problem
- identify the dependent/independent variables
- establish goals

Development of the model
- collect data
- preliminary description/exploration of the data
- specify the model
- fit the model
- validate assumptions
- remedy to regression problems
- obtain the best model

Verification and Maintenance
- check model adequacy
- check sign of coefficient
- check stability of parameters
- check forecasting ability
- update parameters
Here in Stat 136, we focus on the theory behind the second step of the model-building process, the development of the model. You will learn the other steps naturally as you go along in your BS Stat journey.
1.4 Measures of Correlation
CORRELATIONAL ANALYSIS vs REGRESSION ANALYSIS
Both correlational analysis and regression analysis are oftentimes used to describe the relationship of several variables. However, the two methodologies are different.
In correlational analysis, your main goal is to simply describe the relationship of (usually) two variables.
Regression analysis, on the other hand, gives more information than the relationship of the variables: you will be able to create an equation that lets you examine the structure that governs the random phenomenon being studied, and predict values of one variable using other variables.
Although different, correlational analysis is oftentimes done as a preliminary step to explore data before doing regression analysis.
Correlation Coefficient
- Measures the degree of association of 2 or more variables.
- The coefficient does not imply structural relationship and does not indicate causality between the variables.
- The value is usually between -1 and +1.
- Any value close to +1 implies strong direct relationship while a value close to -1 implies strong inverse relationship. A value close to 0 implies weak or no relationship.
- There are many types of correlation coefficients; six are summarized here.
Pearson’s \(r\)
The Pearson product-moment correlation coefficient measures the correlation between two continuous variables. This coefficient may underestimate the degree of association if the relationship is nonlinear.
\[ r = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{(\sum_{i=1}^n(x_i-\bar{x})^2)(\sum_{i=1}^n(y_i-\bar{y})^2)}} \]
Example
A sample of 30 towns was drawn, and the mortality rate and the calcium concentration in drinking water were determined for each.
- \(Y\) = 7-year mortality rate (per 100,000)
- \(X\) = average calcium ion concentration in drinking water (ppm)
mortality | calcium |
---|---|
1247 | 105 |
1800 | 14 |
1807 | 15 |
1359 | 84 |
1307 | 78 |
1555 | 39 |
1260 | 21 |
1742 | 8 |
1569 | 91 |
1772 | 15 |
1668 | 17 |
1609 | 18 |
1299 | 78 |
1392 | 73 |
1254 | 96 |
1428 | 39 |
1723 | 44 |
1547 | 9 |
1591 | 16 |
1828 | 8 |
1466 | 5 |
1558 | 10 |
1637 | 10 |
1755 | 12 |
1491 | 20 |
1318 | 122 |
1379 | 94 |
1096 | 138 |
1402 | 37 |
1704 | 26 |
##
## Pearson's product-moment correlation
##
## data: calcium$mortality and calcium$calcium
## t = -6.0649, df = 28, p-value = 1.537e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.8759845 -0.5397840
## sample estimates:
## cor
## -0.7535183
Conclusion: Reject the hypothesis that there is no correlation.
The data indicates that there is a strong inverse relationship between mortality and calcium level in drinking water.
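The output above can be reproduced along these lines, assuming the data are stored in a data frame named `calcium` with columns `mortality` and `calcium`:

```r
cor(calcium$mortality, calcium$calcium)        # sample Pearson r
cor.test(calcium$mortality, calcium$calcium)   # test of H0: rho = 0, with a 95% CI
```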
Spearman’s \(\rho\)
The Spearman Rank Correlation is a measure of correlation between rankings. Both variables must be measured on at least an ordinal scale.
\[
\rho = 1 - \frac{6 \sum_{i=1}^n d_i^2}{n (n^2-1)}
\]
where \(d_i\) is the difference between ranks of two variables of observation \(i\)
Example
Ten materials for artificial reef were evaluated.
- \(Y\) = ranking according to number of invertebrates attracted after 1 month.
- \(X\) = ranking according to cost and availability of materials.
Material | X | Y |
---|---|---|
1 | 1 | 3 |
2 | 4 | 2 |
3 | 2 | 4 |
4 | 3 | 5 |
5 | 5 | 1 |
6 | 6 | 7 |
7 | 7 | 8 |
8 | 8 | 6 |
9 | 9 | 9 |
10 | 10 | 10 |
##
## Spearman's rank correlation rho
##
## data: reef$X and reef$Y
## S = 38, p-value = 0.01367
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.769697
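The output above can be reproduced as follows, assuming the rankings are stored in a data frame named `reef`:

```r
cor.test(reef$X, reef$Y, method = "spearman")
```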
Kendall’s \(\tau\)
The Kendall Rank correlation coefficient is also used to measure the ordinal association between two measured quantities.
\[\begin{align} \tau &= \frac{\text{number of concordant pairs}-\text{number of discordant pairs}}{\text{number of pairs}} \\ &= 1-\frac{2(\text{number of discordant pairs})}{\binom{n}{2}} \end{align}\]
Example
We use the same example as in the Spearman correlation:
- \(Y\) = ranking according to number of invertebrates attracted after 1 month.
- \(X\) = ranking according to cost and availability of materials.
With respect to Material 6, there is only one material that is discordant to it, while the other eight are concordant to it.
##
## Kendall's rank correlation tau
##
## data: reef$X and reef$Y
## T = 36, p-value = 0.01667
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6
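Again, assuming the rankings are in the data frame `reef`, the output above comes from:

```r
cor.test(reef$X, reef$Y, method = "kendall")
```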
The three correlation coefficients above are the most commonly used, especially Pearson's r for continuous variables. Kendall's tau and Spearman's rho cannot be used directly on continuous measurements unless these are first converted to ranks.
The following coefficients measure the association of categorical variables.

\(\phi\) Coefficient
The Phi Coefficient is used if both variables are dichotomous.
X | 0 | 1 |
---|---|---|
0 | a | b |
1 | c | d |
\[ \phi=\frac{|ad-bc|}{\sqrt{(a+b)(c+d)(a+c)(b+d)}} \]
Contingency Coefficient
Used if variables are both categorical.
X | category 1 | category 2 | category 3 |
---|---|---|---|
category 1 | a | d | g |
category 2 | b | e | h |
category 3 | c | f | i |
\[ C = \sqrt{\frac{\chi^2}{n+\chi^2}} \] where \(\chi^2\) is the chi-squared test statistic which can be computed using the contingency table.
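A small sketch of the computation in R; the counts in the table below are purely illustrative:

```r
# Contingency coefficient from a 3 x 3 table of counts (illustrative values)
tab  <- matrix(c(20, 15, 10,
                 12, 25,  8,
                  9, 11, 30), nrow = 3, byrow = TRUE)
chi2 <- unname(chisq.test(tab)$statistic)   # Pearson chi-squared statistic
n    <- sum(tab)
sqrt(chi2 / (n + chi2))                     # contingency coefficient C
```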
Other measures of association if at least one is categorical
- Biserial - one variable is continuous vs. another continuous variable which has been artificially dichotomized.
- Point-biserial - one continuous variable vs. another which is a true dichotomy; it is conservative and safe to use when in doubt.
- Tetrachoric - both variables are quantitative and both have been artificially dichotomized.
- Eta-coefficient - one variable is interval and one is nominal.
1.5 Overview of Regression
Regression analysis is a statistical tool which utilizes the relation between two or more quantitative variables so that one variable can be predicted from the other(s).
The main objective of the analysis is to extract structural relationships among variables within a system. It is of interest to examine the effects that some variables exert (or appear to exert) on other variable(s).
Linear regression is used for a special class of relationships, namely, those that can be described by straight lines. The term simple linear regression refers to the case wherein only two variables are involved; otherwise, it is known as multiple linear regression.
A useful way of beginning a regression analysis is by drawing a graph of one variable against the other. This graph, called the scatter diagram, can serve both to suggest a relationship and to reveal possible inadequacies in it. It indicates the general tendency by which one variable varies with changes in another. The scatter diagram is especially useful in the simple linear regression case.
## Rows: 25 Columns: 2
PRICE | TAX |
---|---|
53 | 652 |
55 | 1000 |
56 | 897 |
58 | 964 |
64 | 1099 |
44 | 960 |
49 | 678 |
72 | 800 |
82 | 1038 |
85 | 1200 |
45 | 860 |
47 | 600 |
49 | 676 |
56 | 1287 |
60 | 834 |
62 | 734 |
64 | 551 |
66 | 1355 |
35 | 561 |
38 | 489 |
43 | 752 |
46 | 774 |
46 | 440 |
50 | 549 |
65 | 900 |
##
## Call:
## lm(formula = PRICE ~ TAX, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.527 -7.723 -1.681 4.165 20.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.39144 7.43543 4.222 0.000324 ***
## TAX 0.02931 0.00864 3.392 0.002505 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.44 on 23 degrees of freedom
## Multiple R-squared: 0.3335, Adjusted R-squared: 0.3045
## F-statistic: 11.51 on 1 and 23 DF, p-value: 0.002505
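The output above can be reproduced with code along these lines; the CSV file name is an assumption:

```r
# Scatter diagram and simple linear regression of PRICE on TAX
library(readr)
house <- read_csv("house.csv")          # 25 rows: PRICE and TAX
plot(house$TAX, house$PRICE,
     xlab = "TAX", ylab = "PRICE")      # scatter diagram
fit <- lm(PRICE ~ TAX, data = house)
summary(fit)
```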