CHAPTER 2 Introduction to Regression and Linear Models
Regression analysis is a statistical tool that utilizes the relation between a dependent variable and one or more independent variables so that the dependent variable can be predicted from the independent variable(s).
In this course, we will focus on datasets and models where the dependent and independent variables are numeric or quantitative in nature.
2.1 Historical Origin of the term “Regression”
Regression analysis was first developed by Sir Francis Galton in the latter part of the 19th century.
Galton had studied the heights of fathers and sons and noted that the heights of sons of both tall and short fathers appeared to revert or regress to the mean of height of the fathers. He considered this tendency to be a regression to mediocrity.
Galton developed a mathematical description of this regression tendency, the precursor of today’s regression models.
The term regression persists to this day to describe statistical relations between variables, but may be differently described from how Galton first used it.
Unfortunately, most of the statistical methodologies pioneered by Galton (regression, psychometrics, the use of questionnaires) were also employed by him in the service of eugenics and scientific racism.
2.2 Uses of Regression Analysis
- Data Description – summarize and describe the data; describes the relationship of the dependent variable against the independent variable
- Prediction – forecast the expected value of the variable of interest given the values of the other (independent) variables; very important and useful in planning
- Structural Analysis – the use of an estimated model for the quantitative measurement of the relationships of the variables (example: economic variables); it facilitates the comparison of rival theories of the same phenomena; quantifies the relationship between the variables
2.3 Classification of Regression Models
In terms of distributional assumptions
- Parametric – assumes a fixed structural form where the dependent variable (linearly) depends on the independent variables; the distribution is known and is indexed by unknown parameters
- Nonparametric – the dependent variable depends on the explanatory variables but the distribution is not specified (distribution-free) and not indexed by a parameter
- Semi-parametric – considers an unknown distribution but indexed by some parameter
In terms of types of dependent and independent variables
Dependent | Independent | Model |
---|---|---|
Continuous | Continuous | Classical Regression |
Continuous | Continuous + Categorical | Classical Regression with use of Dummy Variables |
Continuous | Categorical + Continuous | Analysis of Covariance (ANCOVA) |
Continuous | All Categorical | Analysis of Variance (ANOVA) |
Categorical | Any Combination | Logistic Regression |
Categorical | All Categorical | Log-Linear Models |
2.4 Introduction: Random Vectors
Definition 2.1 Suppose $\mathbf{Y}_{n\times 1}$ is a vector of $n$ random variables, $\mathbf{Y} = \begin{bmatrix} Y_1 & Y_2 & \cdots & Y_n \end{bmatrix}'$. Then $\mathbf{Y}_{n\times 1}$ is a random vector.
Mean Vector, Covariance Matrix, and Correlation Matrix
Definition 2.2 The expectation of $\mathbf{Y}$ is
$$E(\mathbf{Y}) = E\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{bmatrix}$$
This is also referred to as the mean vector of $\mathbf{Y}$, and can be denoted as:
$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}$$
Definition 2.3 The Variance of Y (also known as variance-covariance matrix or dispersion matrix of Y) is
$$Var(\mathbf{Y}) = E\left[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'\right] = E(\mathbf{Y}\mathbf{Y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$$
The variance-covariance matrix is often denoted by
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
where
- the diagonal elements are the variances of the $Y_i$: $\sigma_{ii} = \sigma_i^2 = Var(Y_i)$
- the off-diagonal elements are the covariances of $Y_i$ and $Y_j$: $\sigma_{ij} = cov(Y_i, Y_j)$
The variance-covariance matrix is sometimes also written as $V(\mathbf{Y})$ or $Cov(\mathbf{Y})$.
Theorem 2.1 For $n \times 1$ constant vectors $\mathbf{a}$ and $\mathbf{b}$, and random vector $\mathbf{Y}$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$,
$$E(\mathbf{Y} + \mathbf{a}) = \boldsymbol{\mu} + \mathbf{a}$$
$$E(\mathbf{a}'\mathbf{Y}) = \mathbf{a}'\boldsymbol{\mu}$$
$$Var(\mathbf{Y} + \mathbf{a}) = \Sigma$$
$$Var(\mathbf{a}'\mathbf{Y}) = \mathbf{a}'\Sigma\mathbf{a}$$
$$cov(\mathbf{a}'\mathbf{Y}, \mathbf{b}'\mathbf{Y}) = \mathbf{a}'\Sigma\mathbf{b}$$
Theorem 2.2 Let $A$ be a $k \times n$ matrix of constants, $B$ be an $m \times n$ matrix of constants, $\mathbf{b}$ be a $k \times 1$ vector of constants, and $\mathbf{Y}$ be an $n \times 1$ random vector with covariance matrix $\Sigma$. Then:
$$Var(A\mathbf{Y}) = A \Sigma A'$$
$$Var(A\mathbf{Y} + \mathbf{b}) = A \Sigma A'$$
$$Cov(A\mathbf{Y}, B\mathbf{Y}) = A \Sigma B'$$
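These identities are easy to check numerically. Below is a minimal sketch in Python with NumPy (the particular $\boldsymbol{\mu}$, $\Sigma$, $A$, and $B$ are arbitrary values chosen only for illustration): it simulates many draws of $\mathbf{Y}$ and compares the sample covariance of $A\mathbf{Y}$ and the sample cross-covariance of $A\mathbf{Y}$ and $B\mathbf{Y}$ against the formulas in Theorems 2.1 and 2.2.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Arbitrary mean vector and covariance matrix for Y (illustrative values only)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, -0.8],
                  [0.5, -0.8, 2.0]])

# Arbitrary constant matrices A (2x3) and B (2x3)
A = np.array([[1.0, 0.0, 2.0],
              [0.5, -1.0, 1.0]])
B = np.array([[0.0, 1.0, 1.0],
              [2.0, 0.0, -1.0]])

# Simulate many realizations of Y (multivariate normal used only for convenience)
Y = rng.multivariate_normal(mu, Sigma, size=200_000)   # shape (200000, 3)
AY = Y @ A.T
BY = Y @ B.T

print("Theoretical Var(AY) = A Sigma A':\n", A @ Sigma @ A.T)
print("Simulated  Var(AY):\n", np.cov(AY, rowvar=False))

print("Theoretical Cov(AY, BY) = A Sigma B':\n", A @ Sigma @ B.T)
# Cross-covariance block between AY and BY taken from the joint sample covariance
joint_cov = np.cov(np.hstack([AY, BY]), rowvar=False)
print("Simulated  Cov(AY, BY):\n", joint_cov[:2, 2:])
```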
Theorem 2.3 Let $A$ be an $n \times n$ symmetric matrix of constants. The scalar quantity $\mathbf{Y}'A\mathbf{Y}$ is known as a quadratic form in $\mathbf{Y}$. Then:
$$E(\mathbf{Y}'A\mathbf{Y}) = tr(A\Sigma) + \boldsymbol{\mu}'A\boldsymbol{\mu}$$
Under multivariate normality, $Var(\mathbf{Y}'A\mathbf{Y}) = 2\,tr\!\left[(A\Sigma)^2\right] + 4\boldsymbol{\mu}'A\Sigma A\boldsymbol{\mu}$.
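Theorem 2.3 can be checked the same way. The sketch below (again in Python, with arbitrary illustrative choices of $\boldsymbol{\mu}$, $\Sigma$, and a symmetric $A$) compares the simulated mean and variance of the quadratic form $\mathbf{Y}'A\mathbf{Y}$, computed from multivariate normal draws, against the two formulas above.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Illustrative values only, not from the text
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, -0.8],
                  [0.5, -0.8, 2.0]])
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, -0.3],
              [0.0, -0.3, 1.5]])          # symmetric matrix

Y = rng.multivariate_normal(mu, Sigma, size=500_000)
qf = np.einsum('ij,jk,ik->i', Y, A, Y)     # quadratic form Y'AY for each draw

print("E(Y'AY):   theory =", np.trace(A @ Sigma) + mu @ A @ mu,
      " simulated =", qf.mean())
print("Var(Y'AY): theory =",
      2 * np.trace(A @ Sigma @ A @ Sigma) + 4 * mu @ A @ Sigma @ A @ mu,
      " simulated =", qf.var())
```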
Definition 2.4 The correlation matrix of $\mathbf{Y}$ is defined as
$$P_\rho = [\rho_{ij}] = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{bmatrix}$$
where $\rho_{ij} = \sigma_{ij} / \sqrt{\sigma_{ii}\sigma_{jj}}$ is the correlation of $Y_i$ and $Y_j$.
If we define $D_\sigma = \left[\mathrm{diag}(\Sigma)\right]^{1/2} = \mathrm{diag}\left(\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{nn}}\right)$,
then we can obtain $P_\rho$ from $\Sigma$:
$$P_\rho = D_\sigma^{-1} \Sigma D_\sigma^{-1}$$
and vice versa:
$$\Sigma = D_\sigma P_\rho D_\sigma$$
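As a quick illustration of this relationship, the following Python/NumPy sketch (using an arbitrary $3 \times 3$ covariance matrix, not one from the text) computes $D_\sigma$, obtains the correlation matrix as $D_\sigma^{-1} \Sigma D_\sigma^{-1}$, and then recovers $\Sigma$ as $D_\sigma P_\rho D_\sigma$.

```python
import numpy as np

# An arbitrary variance-covariance matrix (illustrative values only)
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, -0.8],
                  [0.5, -0.8, 2.0]])

# D_sigma = diag(sqrt(sigma_11), ..., sqrt(sigma_nn))
D_sigma = np.diag(np.sqrt(np.diag(Sigma)))
D_inv = np.linalg.inv(D_sigma)

# Correlation matrix from the covariance matrix
P_rho = D_inv @ Sigma @ D_inv
print(P_rho)

# Recovering Sigma from the correlation matrix
print(np.allclose(D_sigma @ P_rho @ D_sigma, Sigma))   # True
```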
Remarks on variance and correlation:
The Variance-Covariance matrix and the Correlation Matrix are always symmetric.
The diagonal elements of the correlation matrix are always equal to 1.
The Multivariate Normal
This is just a quick introduction. Theory and more properties will be discussed in Stat 147.
If we are going to use the normal error assumption in our regression models, knowledge of the multivariate normal random variable is important.
Definition 2.5 (Multivariate Normal)
Let $\boldsymbol{\mu} \in \mathbb{R}^n$ and let $\Sigma$ be an $n \times n$ positive semidefinite matrix as defined in Equations (2.1) and (2.2), respectively.
The $n \times 1$ vector $\mathbf{Y}$ is said to have a multivariate normal distribution with parameters $\boldsymbol{\mu}$ and $\Sigma$, written as $\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \Sigma)$, if and only if $\boldsymbol{\ell}'\mathbf{Y} \sim N(\boldsymbol{\ell}'\boldsymbol{\mu}, \boldsymbol{\ell}'\Sigma\boldsymbol{\ell})$ for every $n \times 1$ vector $\boldsymbol{\ell}$.
The definition simply states that for $\mathbf{Y}$ to be multivariate normal, every linear combination of its components must be univariate normal with parameters $\boldsymbol{\ell}'\boldsymbol{\mu}$ and $\boldsymbol{\ell}'\Sigma\boldsymbol{\ell}$:
$$\ell_1 Y_1 + \ell_2 Y_2 + \cdots + \ell_n Y_n \sim N(\boldsymbol{\ell}'\boldsymbol{\mu}, \boldsymbol{\ell}'\Sigma\boldsymbol{\ell})$$
Properties of the Multivariate Normal
If $\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \Sigma)$, then the following properties hold:
- The mean and variance of $\mathbf{Y}$ are $E(\mathbf{Y}) = \boldsymbol{\mu}$ and $V(\mathbf{Y}) = \Sigma$.
- For any vector of constants $\mathbf{a}$, $\mathbf{Y} + \mathbf{a} \sim N_n(\boldsymbol{\mu} + \mathbf{a}, \Sigma)$.
- The marginal distributions of the components are univariate normal, i.e. $Y_i \sim N(\mu_i, \sigma_{ii})$, $i = 1, \ldots, n$, where $\mu_i$ and $\sigma_{ii}$ are the mean and variance, respectively, of component $Y_i$.
- For two components $Y_i$ and $Y_j$, $i \neq j$, their covariance can be found in the off-diagonal elements of $\Sigma$, i.e. $cov(Y_i, Y_j) = \sigma_{ij}$.
- If $L$ is a $(p \times n)$ matrix of rank $p$, then $L\mathbf{Y} \sim N_p(L\boldsymbol{\mu}, L\Sigma L')$.
- The joint PDF of $\mathbf{Y}$ is given by $$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(2\pi)^{n/2}|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\Sigma^{-1}(\mathbf{y} - \boldsymbol{\mu})\right\}, \quad \mathbf{y} \in \mathbb{R}^n$$
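To make the density formula concrete, the short sketch below evaluates it directly for a bivariate example and compares the result with SciPy's multivariate normal density. The values of $\boldsymbol{\mu}$, $\Sigma$, and the point $\mathbf{y}$ are made up purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative bivariate example (values not from the text)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
y = np.array([0.5, -1.0])

# Density from the closed-form expression above
diff = y - mu
quad_form = diff @ np.linalg.solve(Sigma, diff)          # (y - mu)' Sigma^{-1} (y - mu)
pdf_formula = np.exp(-0.5 * quad_form) / ((2 * np.pi) ** (len(y) / 2)
                                          * np.linalg.det(Sigma) ** 0.5)

# Density from SciPy, for comparison
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(y)

print(pdf_formula, pdf_scipy)   # the two values should agree
```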
Questions:
If two random variables are independent, are they uncorrelated?
Answer
Yes.
If two random variables are uncorrelated, are they independent?
Answer
Generally, No. But if they are jointly normally distributed (i.e. multivariate normal), then Yes.
This implies that if $\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \Sigma)$ where $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$, then the marginals are mutually independent, i.e. $Y_1, Y_2, \ldots, Y_n \overset{ind}{\sim} N(\mu_i, \sigma_i^2)$.
For a multivariate random vector $\mathbf{Y} = [Y_1 \; Y_2 \; \cdots \; Y_n]'$, if the marginal components are all univariate normal, i.e. $Y_i \sim N(\mu_i, \sigma_{ii})$ for all $i$, then does this imply that $\mathbf{Y}$ follows a multivariate normal distribution?
Answer
No. Not necessarily.
Again, all possible linear combinations of the components must be univariate normal.
As a counterexample, suppose $\mathbf{Y} = [Y_1 \; Y_2]'$ has the joint PDF
$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)} \times \left[1 + y_1 y_2\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\right], \quad \mathbf{y} \in \mathbb{R}^2$$
This is NOT the PDF of a bivariate normal. Therefore, $\mathbf{Y}$ does not follow a multivariate normal distribution.
However, if we derive the marginal distributions of Y1 and Y2, we will obtain univariate normal PDFs.
Proof:
$$\begin{aligned}
f_{Y_1}(y_1) &= \int_{-\infty}^{\infty} f(\mathbf{y})\, dy_2 \\
&= \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)} \left[1 + y_1 y_2\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\right] dy_2 \\
&= \underbrace{\int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, dy_2}_{(a)} + \underbrace{\int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, y_1 y_2\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, dy_2}_{(b)}
\end{aligned}$$

Aside (a):
$$(a) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2} \int_{-\infty}^{\infty} \underbrace{\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_2^2}}_{\text{pdf of } N(0,1)}\, dy_2 = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2}$$

Aside (b):
$$(b) = \int_{-\infty}^{\infty} \frac{1}{2\pi}\, y_1 y_2\, e^{-(y_1^2 + y_2^2)}\, dy_2 = \frac{1}{2\pi}\, y_1 e^{-y_1^2} \int_{-\infty}^{\infty} y_2 e^{-y_2^2}\, dy_2 = \frac{1}{2\pi}\, y_1 e^{-y_1^2} \left(\left. -\frac{e^{-y_2^2}}{2} \right|_{y_2 = -\infty}^{y_2 = +\infty}\right) = \frac{1}{2\pi}\, y_1 e^{-y_1^2}\,(0 - 0) = 0$$

Therefore, the marginal pdf of $Y_1$ is
$$f_{Y_1}(y_1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2}$$
which is a univariate Normal(0,1) PDF.
Using the same process, we can also see that Y2∼N(0,1)
Therefore, we have exhibited a random vector that does NOT follow the multivariate normal distribution, but whose marginal components each have univariate normal PDFs.
Having univariate normal as marginal distributions does not imply that the joint distribution is multivariate normal. ◼
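The marginal computation above can also be checked numerically: integrating the counterexample joint PDF over $y_2$ at a few fixed values of $y_1$ should reproduce the standard normal density. Here is a small sketch using SciPy's numerical integration (the grid of $y_1$ values is arbitrary).

```python
import numpy as np
from scipy.integrate import quad

def joint_pdf(y1, y2):
    """The counterexample joint density, which is not bivariate normal."""
    base = np.exp(-0.5 * (y1**2 + y2**2)) / (2 * np.pi)
    return base * (1 + y1 * y2 * np.exp(-0.5 * (y1**2 + y2**2)))

def std_normal_pdf(y):
    return np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)

# Integrate out y2 at a few fixed values of y1 and compare with the N(0,1) density
for y1 in [-1.5, 0.0, 0.7, 2.0]:
    marginal, _ = quad(lambda y2: joint_pdf(y1, y2), -np.inf, np.inf)
    print(y1, marginal, std_normal_pdf(y1))   # second and third columns should match
```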
2.5 The Linear Model
Any two points can be joined by a straight line. Recall the slope-intercept equation of a line
Y=a+mX
where a is the y-intercept, and m is the slope of the line.
Example 1: Some deterministic models can be represented by a straight line.
You are selling hotdogs at a unit price of 25 pesos per piece. Assuming there are no tips and you have no other items to sell, the daily sales have a deterministic relationship with the number of hotdogs sold.
sales=25×hotdogs
However, most phenomena are governed by some probability or randomness. What if we are modelling the expected net income, where we consider tips, possible spoilage of food, and other random scenarios?
That is why we add a random error term ε to characterize a stochastic linear model.
sales=25×hotdogs+ε where ε is a random value that may be positive or negative.
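A toy simulation makes the distinction concrete. In the sketch below (Python/NumPy), the daily hotdog counts and the error distribution (normal with an arbitrary spread) are made-up values purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

hotdogs = np.array([40, 55, 32, 60, 48])       # hotdogs sold on 5 days (made-up counts)
eps = rng.normal(loc=0, scale=20, size=5)      # random error: tips, spoilage, etc. (assumed N(0, 20^2))

deterministic_sales = 25 * hotdogs             # deterministic model: sales = 25 * hotdogs
stochastic_sales = 25 * hotdogs + eps          # stochastic model:    sales = 25 * hotdogs + error

print(deterministic_sales)
print(np.round(stochastic_sales, 2))
```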
Example 2: a variable can be a function of another variable, but the relationship may be stochastic, not deterministic.
Suppose we have the following variables collected from 10 students.
- X - high school exam score in algebra
- Y - Stat 101 Final Exam scores
Student | X | Y |
---|---|---|
1 | 90 | 85 |
2 | 87 | 87 |
3 | 85 | 89 |
4 | 85 | 90 |
5 | 95 | 92 |
6 | 96 | 94 |
7 | 82 | 80 |
8 | 78 | 75 |
9 | 75 | 60 |
10 | 84 | 78 |
There seems to be a linear relationship between X and Y (although not perfectly linear).
Our main goal is to represent this relationship using an equation.
In this graph:
- $Y = \beta_0 + \beta_1 X$ represents the straight line.
- $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ represents the points in the scatter plot
- The βs are the model parameters
- The εs are the error terms
- If we are asking “which equation best represents the data”, it is the same as asking “what are the values of β that best represent the data”.
- If there are $k$ independent variables, the points are represented by: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$
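As a rough preview of how the $\beta$'s can be chosen to "best represent the data" (formal estimation is taken up in later chapters), the sketch below fits a least-squares line to the ten students' scores with NumPy and inspects the residuals, which play the role of the error terms.

```python
import numpy as np

# High school algebra scores (X) and Stat 101 final exam scores (Y) from the table above
X = np.array([90, 87, 85, 85, 95, 96, 82, 78, 75, 84])
Y = np.array([85, 87, 89, 90, 92, 94, 80, 75, 60, 78])

# Least-squares line Y = b0 + b1*X (np.polyfit returns the slope first)
b1, b0 = np.polyfit(X, Y, deg=1)
print(f"fitted line: Y = {b0:.2f} + {b1:.2f} X")

# Fitted values and residuals (the residuals estimate the error terms)
fitted = b0 + b1 * X
residuals = Y - fitted
print(np.round(residuals, 2))
```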
Justification of the Error Term
- The error term represents the effect of many factors (apart from the X’s) not included in the model which do affect the response variable to some extent and which vary at random without reference to the independent variables.
- Even if we know that the relevant factors have significant effect on the response, there is still a basic and unpredictable element of randomness in responses which can be characterized by the inclusion of a random error term.
- ε accounts for errors of observations or measurements in recording Y.
- Errors can be positive or negative, but are expected to be very small (close to 0)
Linear Model in Matrix Form
Take note that we assume a model for every observation i=1,⋯,n, which implies that we are essentially handling n equations. To facilitate handling n equations, we can use the concepts in matrix theory.
Instead of Yi=β0+β1Xi1+β2Xi2+⋯+βkXik+εi for each observation i from 1 to n, we can make it compact using matrix notations.
Given the following vectors and matrices:
$$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad
\mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \quad
\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \quad
\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
Definition 2.6 (Matrix Form of the Linear Equation)
The k-variable, n-observations linear model can be written as
$$\mathbf{Y}_{n \times 1} = \mathbf{X}_{n \times (k+1)}\, \boldsymbol{\beta}_{(k+1) \times 1} + \boldsymbol{\varepsilon}_{n \times 1}$$
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_k X_{1k} + \varepsilon_1 \\ \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_k X_{2k} + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_k X_{nk} + \varepsilon_n \end{bmatrix}$$
where
- Yi is the value of the response variable on the ith trial. Collectively, Y is the response vector.
- Xij is a known constant, namely, the value of the jth independent variable on the ith trial. Collectively, X is the design matrix.
- β0,β1,⋯,βk are parameters. Collectively, β is the regression coefficients vector.
- εi is a random error term on trial i=1,...,n. Collectively, ε is the error term vector.
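The compactness of the matrix form is easy to appreciate in code. The sketch below builds a small design matrix with a leading column of 1's and computes all $n$ equations at once as $\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$; the sample size, X values, coefficients, and error variance are arbitrary illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(7)

n, k = 6, 2                                    # 6 observations, 2 independent variables (arbitrary)
X_raw = rng.uniform(-10, 10, size=(n, k))      # made-up values of X_i1 and X_i2
X = np.column_stack([np.ones(n), X_raw])       # design matrix: first column of 1's for the intercept

beta = np.array([2.0, 3.0, 4.0])               # (k+1)-vector of coefficients (illustrative values)
eps = rng.normal(0, 1, size=n)                 # error vector

Y = X @ beta + eps                             # all n equations at once: Y = X beta + eps
print(X.shape, beta.shape, Y.shape)            # (6, 3) (3,) (6,)
print(np.round(Y, 3))
```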
The statistical uses of the linear model:
- The model summarizes the linear relationship between Y and the X's.
- It can help explain how variability in Y is affected by the X's.
- It can help predict Y given prior knowledge of the X's.
The Normal Error Assumption
Recall our assumptions for the error term:
- the expected value of the unknown quantity εi is 0 for every observation i
- the variance of the error terms is the same for all observations
- the error terms are independent from each other
- the error terms follow a normal distribution
In other words, $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \overset{iid}{\sim} N(0, \sigma^2)$
We can express it as a random vector of errors $\boldsymbol{\varepsilon} = [\varepsilon_1 \; \varepsilon_2 \; \cdots \; \varepsilon_n]'$, and its distribution is assumed to be $n$-variate normal. That is,
$$\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2 I)$$
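A quick sanity check of this assumption in simulation: drawing many error vectors from $N_n(\mathbf{0}, \sigma^2 I)$ should give sample means near 0 and a sample covariance matrix near $\sigma^2 I$. The values of $n$ and $\sigma^2$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)

n, sigma2 = 4, 9.0
# Many draws of the n x 1 error vector eps ~ N_n(0, sigma^2 I)
eps = rng.multivariate_normal(np.zeros(n), sigma2 * np.eye(n), size=100_000)

print(np.round(eps.mean(axis=0), 3))            # approximately the zero vector
print(np.round(np.cov(eps, rowvar=False), 2))   # approximately 9 * I
```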
Illustrations for the Case of 2 Independent Variables
Suppose you have the following dataset:
Y | X1 | X2 |
---|---|---|
-44.420115 | -4.688959 | -6.3638502 |
45.491984 | 4.130522 | 8.3478034 |
22.410224 | -1.026449 | 7.9266283 |
-42.823099 | -9.147966 | -5.9889278 |
27.451357 | 2.231132 | 1.7787346 |
55.106047 | 7.741150 | 6.0452510 |
-61.114066 | -9.084008 | -4.9985867 |
48.783780 | 4.494911 | 6.4357302 |
-5.055165 | 4.547764 | -5.7041587 |
-24.917757 | -7.281202 | -2.1957758 |
-4.085608 | 5.741016 | -6.2462851 |
-39.714749 | -7.503980 | -3.8645170 |
-14.247992 | -2.965437 | -0.2934940 |
35.744026 | 7.974895 | 1.5049660 |
33.232878 | 9.520508 | -4.4449692 |
40.754880 | 7.409358 | 0.9830919 |
64.832420 | 8.690991 | 7.4928578 |
-10.740910 | 1.701707 | -2.8703850 |
27.173631 | -5.613791 | 9.9569232 |
-24.511849 | 6.436055 | -9.8713667 |
In a 2-dimensional plane, it can be visualized as follows
It can also be visualized using a 3-dimensional plot.
Now, we want a line (or, with two independent variables, a plane) that passes through the center of the points. Let's try this equation: $$Y = 2 + 3X_1 + 4X_2$$ This plane through the center of the points can be visualized in a 3D graph.
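One simple way to judge the candidate plane $Y = 2 + 3X_1 + 4X_2$ without any graphing is to compute its residuals on the data above. The sketch below does this for the first five rows of the table (the remaining rows follow the same pattern); it is only an illustration, not a formal fit.

```python
import numpy as np

# First five rows of the dataset above (the remaining rows follow the same pattern)
Y  = np.array([-44.420115, 45.491984, 22.410224, -42.823099, 27.451357])
X1 = np.array([-4.688959, 4.130522, -1.026449, -9.147966, 2.231132])
X2 = np.array([-6.3638502, 8.3478034, 7.9266283, -5.9889278, 1.7787346])

# Candidate plane Y = 2 + 3*X1 + 4*X2
predicted = 2 + 3 * X1 + 4 * X2
residuals = Y - predicted          # how far each point lies from the plane
print(np.round(predicted, 2))
print(np.round(residuals, 2))
```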
2.6 Assumptions
The multiple linear regression model is represented by the matrix equation Y=Xβ+ε
Given n observations, we want to fit this equation.
How do we strike a balance between summarizing the data and fitting the model well?
IMPOSE ASSUMPTIONS
Types of Error Assumptions
Classical Assumptions: $E(\varepsilon_i) = 0$, $Var(\varepsilon_i) = \sigma^2 \;\forall i$, $Cov(\varepsilon_i, \varepsilon_j) = 0 \;\forall i \neq j$
Normal Error Model Assumptions: $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
Linear Regression Variable Assumptions
Independent Variables: Assumed to be constant, predetermined, or already gathered; uncorrelated or linearly independent of each other.
Dependent Variable: Assumed to still be unknown and inherently random; the observations are independent of one another.
Important Features of the Model
(Yi as a random variable). The observed value of $Y_i$ in the $i$th trial is the sum of two components: $$Y_i = \underbrace{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}}_{\text{constant terms}} + \underbrace{\varepsilon_i}_{\text{random term}}$$
Hence, Yi is a random variable.
(Expectation of Yi). Since E(εi)=0, it follows that E(Yi)=β0+β1Xi1+β2Xi2+⋯+βkXik. Thus, the response Yi, when the level of the k independent variables (X’s) in the ith trial are known, comes from a probability distribution whose mean is β0+β1Xi1+β2Xi2+⋯+βkXik. This constant value is referred to as the regression function for the model.
(Error Term). The observed value of the Yi in the ith trial exceeds or falls short of the value of the regression function by the error term amount εi.
(Homoscedasticity). The error terms are assumed to have a constant variance. Thus, the responses Yi have the same constant variance.
(Independence of Observations). The error terms are assumed to be uncorrelated. Hence, the outcome in any one trial has no effect on the error term for any other trial – as to whether it is positive or negative, small or large. Since the error terms εi and εj are uncorrelated, so are the responses Yi and Yj.
(Distribution of Yi). The assumption of normality of the error terms implies that the $Y_i$ are also independent (but not necessarily identically distributed) normal random variables. That is: $$Y_i \overset{ind}{\sim} \mathrm{Normal}\left(\mu = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik},\; \sigma^2\right)$$
Is it justifiable to impose normality assumption on the error terms?
- A major reason why the normality assumption for the error terms is justifiable in many situations is that the error terms frequently represent the effects of many factors omitted explicitly in the model, that do affect the response to some extent and that vary at random without reference to the independent variables. Also, there might be random measurement errors in recording Y.
- Insofar as these random effects have a degree of mutual independence, the composite error term representing all these factors will tend to comply with the CLT and the error term distribution would approach normality as the number of factor effects becomes large.
- A second reason why the normality assumption for the error terms is frequently justifiable is that some of the estimation and testing procedures to be discussed in the next chapters are based on the t-distribution, which is not sensitive to moderate departures from normality.
- Thus, unless the departures from normality are serious, particularly with respect to skewness, the actual confidence coefficients and risks of errors will be close to the levels for exact normality.
In summary, the regression assumptions are the following:
- the expected value of the unknown quantity εi is 0 for every observation i
- the variance of the error terms is the same for all observations
- the error terms (and hence the observations) are uncorrelated with one another
- the error terms follow a normal distribution (under the Normal error model assumption)
- the independent variables are linearly independent from each other
- the independent variables are assumed to be constants and not random