CHAPTER 2 Introduction to Regression and Linear Models

Regression analysis is a statistical tool that uses the relationship between a dependent variable and one or more independent variables so that the dependent variable can be predicted from the independent variable(s).

In this course, we will focus on datasets and models where the dependent and independent variables are numeric or quantitative in nature.


2.1 Historical Origin of the term “Regression”

  • Regression analysis was first developed by Sir Francis Galton in the latter part of the 19th century.

  • Galton studied the heights of fathers and sons and noted that the heights of sons of both tall and short fathers appeared to revert, or "regress", toward the mean height of the fathers. He considered this tendency to be a regression to mediocrity.

  • Galton developed a mathematical description of this regression tendency, the precursor of today’s regression models.

  • The term regression persists to this day to describe statistical relations between variables, although it is now used more broadly than in Galton's original sense.

  • Unfortunately, many of the statistical methodologies pioneered by Galton (regression, psychometrics, the use of questionnaires) were employed by him in the service of eugenics and scientific racism.


2.2 Uses of Regression Analysis

  • Data Description – summarize and describe the data; describes the relationship of the dependent variable against the independent variable
  • Prediction – forecast the expected value of the variable of interest given the values of the other (independent) variables; very important and useful in planning
  • Structural Analysis – the use of an estimated model for the quantitative measurement of the relationships of the variables (example: economic variables); it facilitates the comparison of rival theories of the same phenomena; quantifies the relationship between the variables

2.3 Classification of Regression Models

In terms of distributional assumptions

  • Parametric – assumes a fixed structural form where the dependent variable (linearly) depends on the independent variables; the distribution is known and is indexed by unknown parameters
  • Nonparametric – the dependent variable depends on the explanatory variables but the distribution is not specified (distribution-free) and not indexed by a parameter
  • Semi-parametric – considers an unknown distribution but indexed by some parameter

In terms of types of dependent and independent variables

Dependent     Independent                          Model
-----------   ----------------------------------   -------------------------------------------------
Continuous    Continuous                           Classical Regression
Continuous    Continuous, with added Categorical   Classical Regression with use of Dummy Variables
Continuous    Categorical, with added Continuous   Analysis of Covariance (ANCOVA)
Continuous    All Categorical                      Analysis of Variance (ANOVA)
Categorical   Any Combination                      Logistic Regression
Categorical   All Categorical                      Log-Linear Models
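
As a side note, the models in this classification correspond to standard fitting functions in R. The sketch below is only a rough guide; the toy data frame is hypothetical and not part of the course materials.

# Toy data, only to make the calls below runnable (hypothetical values)
set.seed(1)
dat <- data.frame(
  y     = rnorm(40),
  x1    = rnorm(40),
  x2    = rnorm(40),
  group = factor(rep(c("A", "B"), each = 20))
)
dat$ybin <- as.integer(dat$y > median(dat$y))           # a categorical (binary) response

lm(y ~ x1 + x2, data = dat)                             # classical regression (all X's continuous)
lm(y ~ x1 + group, data = dat)                          # dummy variables: R codes the factor automatically
lm(y ~ group + x1, data = dat)                          # ANCOVA-type model (categorical + continuous X's)
aov(y ~ group, data = dat)                              # ANOVA (all X's categorical)
glm(ybin ~ x1 + group, family = binomial, data = dat)   # logistic regression (categorical Y)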

2.4 Introduction: Random Vectors

Definition 2.1 Suppose $\mathbf{Y}_{n\times 1}$ is a vector of $n$ random variables, $\mathbf{Y} = [Y_1 \; Y_2 \; \cdots \; Y_n]'$. Then $\mathbf{Y}_{n\times 1}$ is a random vector.

Mean Vector, Covariance Matrix, and Correlation Matrix

Definition 2.2 The expectation of $\mathbf{Y}$ is $E(\mathbf{Y}) = E\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} E(Y_1) \\ E(Y_2) \\ \vdots \\ E(Y_n) \end{bmatrix}$

This is also referred to as the mean vector of $\mathbf{Y}$, and can be denoted as:

$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_n \end{bmatrix}$

Definition 2.3 The Variance of Y (also known as variance-covariance matrix or dispersion matrix of Y) is

$Var(\mathbf{Y}) = E[(\mathbf{Y} - \boldsymbol{\mu})(\mathbf{Y} - \boldsymbol{\mu})'] = E(\mathbf{Y}\mathbf{Y}') - \boldsymbol{\mu}\boldsymbol{\mu}'$

The variance-covariance matrix is often denoted by

$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$

where

  • the diagonal elements are the variances of $Y_i$: $\sigma_{ii} = \sigma_i^2 = Var(Y_i)$

  • the off-diagonal elements are the covariances of $Y_i$ and $Y_j$: $\sigma_{ij} = cov(Y_i, Y_j)$

The variance-covariance matrix is sometimes also written as V(Y) or Cov(Y)

Theorem 2.1 For $n\times 1$ constant vectors $\mathbf{a}$ and $\mathbf{b}$, and random vector $\mathbf{Y}$ with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$,

  1. $E(\mathbf{Y} + \mathbf{a}) = \boldsymbol{\mu} + \mathbf{a}$

  2. $E(\mathbf{a}'\mathbf{Y}) = \mathbf{a}'\boldsymbol{\mu}$

  3. $Var(\mathbf{Y} + \mathbf{a}) = \boldsymbol{\Sigma}$

  4. $Var(\mathbf{a}'\mathbf{Y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{a}$

  5. $cov(\mathbf{a}'\mathbf{Y}, \mathbf{b}'\mathbf{Y}) = \mathbf{a}'\boldsymbol{\Sigma}\mathbf{b}$

Theorem 2.2 Let $\mathbf{A}$ be a $k\times n$ matrix of constants, $\mathbf{B}$ an $m\times n$ matrix of constants, $\mathbf{b}$ a $k\times 1$ vector of constants, and $\mathbf{Y}$ an $n\times 1$ random vector with covariance matrix $\boldsymbol{\Sigma}$. Then:

  1. $Var(\mathbf{A}\mathbf{Y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$

  2. $Var(\mathbf{A}\mathbf{Y} + \mathbf{b}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}'$

  3. $Cov(\mathbf{A}\mathbf{Y}, \mathbf{B}\mathbf{Y}) = \mathbf{A}\boldsymbol{\Sigma}\mathbf{B}'$

Theorem 2.3 Let $\mathbf{A}$ be an $n\times n$ symmetric matrix of constants. The scalar quantity $\mathbf{Y}'\mathbf{A}\mathbf{Y}$ is known as a quadratic form in $\mathbf{Y}$. Then:

  • $E(\mathbf{Y}'\mathbf{A}\mathbf{Y}) = tr(\mathbf{A}\boldsymbol{\Sigma}) + \boldsymbol{\mu}'\mathbf{A}\boldsymbol{\mu}$

  • Under multivariate normality, $Var(\mathbf{Y}'\mathbf{A}\mathbf{Y}) = 2\,tr[(\mathbf{A}\boldsymbol{\Sigma})^2] + 4\boldsymbol{\mu}'\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}\boldsymbol{\mu}$
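
These results can be checked by simulation. A minimal sketch, assuming the MASS package is available for mvrnorm(); the matrices and vectors used are arbitrary illustrative values, not taken from the course.

library(MASS)   # for mvrnorm()

set.seed(123)
mu    <- c(1, 2, 3)
Sigma <- matrix(c(4, 1, 0,
                  1, 3, 1,
                  0, 1, 2), nrow = 3, byrow = TRUE)
A     <- matrix(c(1, 0,  1,
                  0, 2, -1), nrow = 2, byrow = TRUE)   # a 2 x 3 constant matrix

Y <- mvrnorm(n = 100000, mu = mu, Sigma = Sigma)       # one draw of Y per row

# Theorem 2.2 (1): Var(AY) = A Sigma A'
AY <- Y %*% t(A)                                       # row i holds (A y_i)'
cov(AY)                                                # empirical covariance of AY
A %*% Sigma %*% t(A)                                   # theoretical A Sigma A'

# Theorem 2.3: E(Y'AY) = tr(A Sigma) + mu' A mu, for symmetric A
As <- diag(c(1, 2, 3))                                 # a symmetric 3 x 3 matrix
qf <- rowSums((Y %*% As) * Y)                          # y_i' As y_i for each draw
mean(qf)                                               # empirical mean of the quadratic form
sum(diag(As %*% Sigma)) + t(mu) %*% As %*% mu          # theoretical value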

Definition 2.4 The correlation matrix of Y is defined as

$\mathbf{P}_\rho = [\rho_{ij}] = \begin{bmatrix} 1 & \rho_{12} & \cdots & \rho_{1n} \\ \rho_{21} & 1 & \cdots & \rho_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \rho_{n1} & \rho_{n2} & \cdots & 1 \end{bmatrix}$

where $\rho_{ij} = \dfrac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$ is the correlation of $Y_i$ and $Y_j$.

If we define $\mathbf{D}_\sigma = [\text{diag}(\boldsymbol{\Sigma})]^{1/2} = \text{diag}(\sqrt{\sigma_{11}}, \sqrt{\sigma_{22}}, \ldots, \sqrt{\sigma_{nn}})$,

then we can obtain $\mathbf{P}_\rho$ from $\boldsymbol{\Sigma}$:

$\mathbf{P}_\rho = \mathbf{D}_\sigma^{-1} \boldsymbol{\Sigma} \mathbf{D}_\sigma^{-1}$

and vice versa:

$\boldsymbol{\Sigma} = \mathbf{D}_\sigma \mathbf{P}_\rho \mathbf{D}_\sigma$

Remarks on variance and correlation:

  • The Variance-Covariance matrix and the Correlation Matrix are always symmetric.

  • The diagonal elements of the correlation matrix are always equal to 1.
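
In R, the sample versions of the mean vector, variance-covariance matrix, and correlation matrix can be computed with colMeans(), cov(), and cor(), and the relationship between $\boldsymbol{\Sigma}$ and $\mathbf{P}_\rho$ can be verified by hand (base R's cov2cor() does the same conversion). A minimal sketch with simulated data:

set.seed(42)
Y1 <- rnorm(200, mean = 5, sd = 2)
Y2 <- rnorm(200, mean = 0, sd = 1)
Y3 <- 0.5 * Y1 + rnorm(200, mean = 0, sd = 1)   # correlated with Y1 by construction
Y  <- cbind(Y1, Y2, Y3)                         # 200 observations of a 3-component vector

colMeans(Y)      # sample mean vector (estimate of mu)
S <- cov(Y)      # sample variance-covariance matrix (estimate of Sigma)
R <- cor(Y)      # sample correlation matrix (estimate of P_rho)

# The Sigma <-> P_rho relationship, done by hand:
D_sigma <- diag(sqrt(diag(S)))                  # D_sigma = [diag(Sigma)]^(1/2)
solve(D_sigma) %*% S %*% solve(D_sigma)         # equals R   (cov2cor(S) gives the same)
D_sigma %*% R %*% D_sigma                       # recovers S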

The Multivariate Normal

This is just a quick introduction. Theory and more properties will be discussed in Stat 147.

If we are going to use the normal error assumption in our regression models, knowledge of the multivariate normal distribution is important.

Definition 2.5 (Multivariate Normal)

Let $\boldsymbol{\mu} \in \mathbb{R}^n$ and let $\boldsymbol{\Sigma}$ be an $n\times n$ positive semidefinite matrix as defined in Equations (2.1) and (2.2) respectively.

The $n\times 1$ vector $\mathbf{Y}$ is said to have a multivariate normal distribution with parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, written as $\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, if and only if $\mathbf{l}'\mathbf{Y} \sim N(\mathbf{l}'\boldsymbol{\mu}, \mathbf{l}'\boldsymbol{\Sigma}\mathbf{l})$ for every $n\times 1$ vector $\mathbf{l}$.

The definition simply states that for $\mathbf{Y}$ to be multivariate normal, every linear combination of its components must be univariate normal with mean $\mathbf{l}'\boldsymbol{\mu}$ and variance $\mathbf{l}'\boldsymbol{\Sigma}\mathbf{l}$:

$l_1 Y_1 + l_2 Y_2 + \cdots + l_n Y_n \sim N(\mathbf{l}'\boldsymbol{\mu}, \mathbf{l}'\boldsymbol{\Sigma}\mathbf{l})$
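
A quick simulation sketch of this definition, assuming the MASS package for mvrnorm(); the parameter values and the vector $\mathbf{l}$ are arbitrary illustrative choices. Draw from a bivariate normal, form the linear combination $\mathbf{l}'\mathbf{Y}$, and compare its mean and variance with $\mathbf{l}'\boldsymbol{\mu}$ and $\mathbf{l}'\boldsymbol{\Sigma}\mathbf{l}$.

library(MASS)   # for mvrnorm()

set.seed(7)
mu    <- c(0, 1)
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2)
l     <- c(3, -2)                               # an arbitrary vector of constants

Y  <- mvrnorm(50000, mu = mu, Sigma = Sigma)    # draws from N_2(mu, Sigma)
lY <- drop(Y %*% l)                             # the linear combination l'Y for each draw

c(mean(lY), t(l) %*% mu)                        # empirical vs theoretical mean  l'mu
c(var(lY),  t(l) %*% Sigma %*% l)               # empirical vs theoretical variance  l'Sigma l
qqnorm(lY); qqline(lY)                          # points should fall close to the line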

Properties of the Multivariate Normal

If YNn(μ,Σ), then the following properties hold:

  1. The mean and variance of Y are E(Y)=μ and V(Y)=Σ.

  2. For any vector of constants $\mathbf{a}$, $\mathbf{Y} + \mathbf{a} \sim N_n(\boldsymbol{\mu} + \mathbf{a}, \boldsymbol{\Sigma})$

  3. The marginal distributions of the components are univariate normal, i.e. $Y_i \sim N(\mu_i, \sigma_{ii})$, $i = 1, \ldots, n$, where $\mu_i$ and $\sigma_{ii}$ are the mean and variance, respectively, of component $Y_i$.

  4. For two components $Y_i$ and $Y_j$, $i \neq j$, their covariance can be found in the off-diagonal elements of $\boldsymbol{\Sigma}$, i.e. $cov(Y_i, Y_j) = \sigma_{ij}$

  5. If $\mathbf{L}$ is a $p\times n$ matrix of rank $p$, then $\mathbf{L}\mathbf{Y} \sim N_p(\mathbf{L}\boldsymbol{\mu}, \mathbf{L}\boldsymbol{\Sigma}\mathbf{L}')$

  6. The joint PDF of $\mathbf{Y}$ is given by $f_{\mathbf{Y}}(\mathbf{y}) = \dfrac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{-\dfrac{1}{2}(\mathbf{y} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{y} - \boldsymbol{\mu})\right\}, \quad \mathbf{y} \in \mathbb{R}^n$
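
Property 6 can be checked numerically; a minimal sketch assuming the mvtnorm package (its dmvnorm() evaluates this density), with arbitrary illustrative values of $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$, and $\mathbf{y}$:

library(mvtnorm)   # assumed available; provides dmvnorm()

mu    <- c(0, 0)
Sigma <- matrix(c(1.0, 0.5,
                  0.5, 2.0), nrow = 2)
y     <- c(0.3, -0.4)

dmvnorm(y, mean = mu, sigma = Sigma)            # density value from the package

# The same value computed directly from the formula in property 6:
n <- length(y)
(2 * pi)^(-n / 2) * det(Sigma)^(-1 / 2) *
  exp(-0.5 * t(y - mu) %*% solve(Sigma) %*% (y - mu))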

Questions:

  1. If two random variables are independent, are they uncorrelated?

    Answer

    Yes.

  2. If two random variables are uncorrelated, are they independent?

    Answer

    Generally, No. But if they are normally distributed, then Yes.

    This implies that if $\mathbf{Y} \sim N_n(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ where $\boldsymbol{\Sigma} = \text{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_n^2)$, then the marginals are mutually independent, i.e. $Y_1, Y_2, \ldots, Y_n \overset{ind}{\sim} N(\mu_i, \sigma_i^2)$.

  3. For a multivariate random vector $\mathbf{Y} = [Y_1 \; Y_2 \; \cdots \; Y_n]'$, if the marginal components are all univariate normal, i.e. $Y_i \sim N(\mu_i, \sigma_{ii})$ for all $i$, then does this imply that $\mathbf{Y}$ follows a multivariate normal distribution?

    Answer

    No. Not necessarily.

    Again, all possible linear combinations of the components must be univariate normal.

    As a counterexample, suppose $\mathbf{Y} = [Y_1 \; Y_2]'$ has joint PDF

    $f_{\mathbf{Y}}(\mathbf{y}) = \dfrac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)} \times \left[1 + y_1 y_2 e^{-\frac{1}{2}(y_1^2 + y_2^2)}\right], \quad \mathbf{y} \in \mathbb{R}^2$

    This is NOT the PDF of a bivariate normal. Therefore, $\mathbf{Y}$ does not follow a multivariate normal distribution.

    However, if we derive the marginal distributions of Y1 and Y2, we will obtain univariate normal PDFs.

    Proof:

    $f_{Y_1}(y_1) = \int_{-\infty}^{\infty} f(\mathbf{y})\, dy_2 = \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)} \times \left[1 + y_1 y_2 e^{-\frac{1}{2}(y_1^2 + y_2^2)}\right] dy_2$

    $= \underbrace{\int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, dy_2}_{(a)} + \underbrace{\int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, y_1 y_2\, e^{-\frac{1}{2}(y_1^2 + y_2^2)}\, dy_2}_{(b)}$

    Aside (a):

    $(a) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2} \underbrace{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_2^2}\, dy_2}_{\text{pdf of } N(0,1)} = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2}$

    Aside (b):

    $(b) = \frac{1}{2\pi} \int_{-\infty}^{\infty} y_1 y_2\, e^{-(y_1^2 + y_2^2)}\, dy_2 = \frac{1}{2\pi}\, y_1 e^{-y_1^2} \int_{-\infty}^{\infty} y_2\, e^{-y_2^2}\, dy_2 = \frac{1}{2\pi}\, y_1 e^{-y_1^2} \left(-\frac{e^{-y_2^2}}{2}\,\Big|_{y_2=-\infty}^{y_2=+\infty}\right) = \frac{1}{2\pi}\, y_1 e^{-y_1^2}\,(0 - 0) = 0$

    Therefore, the marginal pdf of $Y_1$ is

    $f_{Y_1}(y_1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}y_1^2}$

    which is a univariate Normal(0,1).

    Using the same process, we can also see that $Y_2 \sim N(0,1)$.

    Therefore, we have shown a multivariate random vector that DOES NOT follow the multivariate normal distribution, but whose marginal components each have a univariate normal PDF.

    Having univariate normal as marginal distributions does not imply that the joint distribution is multivariate normal.
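
The derivation above can also be checked numerically in R: integrating the (non-normal) joint PDF over $y_2$ should reproduce the N(0,1) density at any fixed $y_1$. A minimal sketch:

# Numerical check of the marginal derivation above
f_joint <- function(y1, y2) {
  (1 / (2 * pi)) * exp(-0.5 * (y1^2 + y2^2)) *
    (1 + y1 * y2 * exp(-0.5 * (y1^2 + y2^2)))
}

y1 <- 0.7                                                    # an arbitrary test point
marginal_num <- integrate(function(y2) f_joint(y1, y2),
                          lower = -Inf, upper = Inf)$value
c(numerical_integral = marginal_num, dnorm_y1 = dnorm(y1))   # the two values should agree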

2.5 The Linear Model

Any two points can be connected by a straight line. Recall the slope-intercept equation of a line

Y=a+mX

where a is the y-intercept, and m is the slope of the line.


Example 1: Some deterministic models can be represented by a straight line.

You are selling hotdogs at a unit price of 25 pesos per piece. Assuming there are no tips and you have no other items to sell, the daily sales have a deterministic relationship with the number of hotdogs sold.

sales=25×hotdogs

However, most phenomena are governed by some probability or randomness. What if we are modelling the expected net income, where we consider tips, possible spoilage of food, and other random scenarios?

That is why we add a random error term ε to characterize a stochastic linear model.

sales=25×hotdogs+ε where ε is a random value that may be positive or negative.
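
A minimal R sketch of this stochastic version; the hotdog counts and the error spread (a standard deviation of 20 pesos) are purely hypothetical values chosen for illustration:

set.seed(10)
hotdogs <- c(80, 95, 70, 110, 60, 90, 100)            # hypothetical daily counts sold
eps     <- rnorm(length(hotdogs), mean = 0, sd = 20)   # random error: tips, spoilage, etc.

deterministic_sales <- 25 * hotdogs                    # exact relationship
stochastic_sales    <- 25 * hotdogs + eps              # same relationship plus random error

cbind(hotdogs, deterministic_sales, stochastic_sales = round(stochastic_sales, 2))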


Example 2: a variable can be a function of another variable, but the relationship may be stochastic, not deterministic.

Suppose we have the following variables collected from 10 students.

student <- read.csv("student.csv")   # 10 students, with columns X and Y as described below
  • X - highschool exam score in algebra
  • Y - Stat 101 Final Exam scores
Student X Y
1 90 85
2 87 87
3 85 89
4 85 90
5 95 92
6 96 94
7 82 80
8 78 75
9 75 60
10 84 78
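
The scatter plot shown below can be reproduced with base R graphics; a minimal sketch, assuming student.csv contains the columns X and Y listed above:

# Scatter plot of high school algebra score (X) against Stat 101 final exam score (Y)
student <- read.csv("student.csv")
plot(student$X, student$Y,
     xlim = c(0, 100), ylim = c(0, 100),
     xlab = "Highschool Algebra Grade",
     ylab = "Stat 101 Final Exam Grade",
     main = "Highschool Algebra Grade vs Stat 101 Final Exam",
     pch  = 19)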

 

[Scatter plot: Highschool Algebra Grade vs Stat 101 Final Exam; x-axis: Highschool Algebra Grade (0-100), y-axis: Stat 101 Final Exam Grade (0-100)]

 

There seems to be a linear relationship between X and Y (although not perfectly linear).

Our main goal is to represent this relationship using an equation.

 

In this graph:

  • Y=β0+β1X represents the straight line.
  • Yi=β0+β1Xi+εi represents the points in the scatter plot
  • The βs are the model parameters
  • The εs are the error terms
  • If we are asking “which equation best represents the data”, it is the same as asking “what are the values of β that best represent the data”.
  • If there are $k$ independent variables, the points are represented by: $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$

Justification of the Error Term

  • The error term represents the effect of many factors (apart from the X's) not included in the model, which do affect the response variable to some extent and which vary at random without reference to the independent variables.
  • Even if we know that the relevant factors have significant effect on the response, there is still a basic and unpredictable element of randomness in responses which can be characterized by the inclusion of a random error term.
  • ε accounts for errors of observations or measurements in recording Y.
  • Errors can be positive or negative, but are expected to be very small (close to 0)

Linear Model in Matrix Form

Take note that we assume a model for every observation i=1,,n, which implies that we are essentially handling n equations. To facilitate handling n equations, we can use the concepts in matrix theory.

Instead of writing $Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik} + \varepsilon_i$ for each observation $i$ from 1 to $n$, we can make it compact using matrix notation.

Given the following vectors and matrices:

$\mathbf{Y} = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} \quad \mathbf{X} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

Definition 2.6 (Matrix Form of the Linear Equation)
The k-variable, n-observation linear model can be written as

$\mathbf{Y}_{n\times 1} = \mathbf{X}_{n\times(k+1)}\,\boldsymbol{\beta}_{(k+1)\times 1} + \boldsymbol{\varepsilon}_{n\times 1}$

$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} \beta_0 + \beta_1 X_{11} + \beta_2 X_{12} + \cdots + \beta_k X_{1k} + \varepsilon_1 \\ \beta_0 + \beta_1 X_{21} + \beta_2 X_{22} + \cdots + \beta_k X_{2k} + \varepsilon_2 \\ \vdots \\ \beta_0 + \beta_1 X_{n1} + \beta_2 X_{n2} + \cdots + \beta_k X_{nk} + \varepsilon_n \end{bmatrix}$

where

  • Yi is the value of the response variable on the ith trial. Collectively, Y is the response vector.
  • Xij is a known constant, namely, the value of the jth independent variable on the ith trial. Collectively, X is the design matrix.
  • β0,β1,,βk are parameters. Collectively, β is the regression coefficients vector.
  • εi is a random error term on trial i=1,...,n. Collectively, ε is the error term vector.
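
A minimal R sketch of this matrix form with two independent variables; the sample size, X values, and coefficient values are arbitrary illustrative choices:

set.seed(5)
n  <- 6
X1 <- round(runif(n, 0, 10), 1)
X2 <- round(runif(n, 0, 10), 1)

X    <- cbind(1, X1, X2)              # n x (k+1) design matrix (a column of 1's for the intercept)
beta <- c(2, 3, 4)                    # (k+1) x 1 vector of regression coefficients (assumed values)
eps  <- rnorm(n)                      # n x 1 vector of error terms

Y <- X %*% beta + eps                 # Y = X beta + epsilon: all n equations at once
cbind(Y, X)

model.matrix(~ X1 + X2)               # builds the same design matrix from a formula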

The statistical uses of the linear model:

  • The model summarizes the linear relationship between Y and the X s.
  • It can help explain how variability in Y is affected by the X s.
  • It can help determine Y given prior knowledge on X .

The Normal Error Assumption

Recall our assumptions for the error term:

  • the expected value of the unknown quantity εi is 0 for every observation i

  • the variance of the error terms is the same for all observations

  • the error terms are independent from each other

  • the error terms follow a normal distribution

In other words, $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n \overset{iid}{\sim} N(0, \sigma^2)$

We can express the errors as a random vector $\boldsymbol{\varepsilon} = [\varepsilon_1 \; \varepsilon_2 \; \cdots \; \varepsilon_n]'$, whose distribution is assumed to be n-variate normal. That is,
$\boldsymbol{\varepsilon} \sim N_n(\mathbf{0}, \sigma^2\mathbf{I})$

Illustrations for the Case of 2 Independent Variables

Suppose you have the following dataset:

Y X1 X2
-44.420115 -4.688959 -6.3638502
45.491984 4.130522 8.3478034
22.410224 -1.026449 7.9266283
-42.823099 -9.147966 -5.9889278
27.451357 2.231132 1.7787346
55.106047 7.741150 6.0452510
-61.114066 -9.084008 -4.9985867
48.783780 4.494911 6.4357302
-5.055165 4.547764 -5.7041587
-24.917757 -7.281202 -2.1957758
-4.085608 5.741016 -6.2462851
-39.714749 -7.503980 -3.8645170
-14.247992 -2.965437 -0.2934940
35.744026 7.974895 1.5049660
33.232878 9.520508 -4.4449692
40.754880 7.409358 0.9830919
64.832420 8.690991 7.4928578
-10.740910 1.701707 -2.8703850
27.173631 -5.613791 9.9569232
-24.511849 6.436055 -9.8713667

The data can be visualized with two-dimensional scatter plots (the response against each independent variable), or with a three-dimensional scatter plot of Y against X1 and X2.

Now, we want a line (or, with two independent variables, a plane or "sheet") that passes through the center of the points. Let's try this equation: $Y = 2 + 3X_1 + 4X_2$. This plane, passing through the center of the points, can be visualized in a 3D graph; a short numeric check of how well it fits the data is sketched below.
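
To make "trying this equation" concrete, the sketch below evaluates the plane at each observation and measures how far the observed Y values fall from it. Only the first five rows of the table above are typed in, to keep the example short:

# First five rows of the dataset above
dat <- data.frame(
  Y  = c(-44.420115, 45.491984, 22.410224, -42.823099, 27.451357),
  X1 = c(-4.688959, 4.130522, -1.026449, -9.147966, 2.231132),
  X2 = c(-6.3638502, 8.3478034, 7.9266283, -5.9889278, 1.7787346)
)

yhat  <- 2 + 3 * dat$X1 + 4 * dat$X2      # the candidate plane Y = 2 + 3*X1 + 4*X2
resid <- dat$Y - yhat                     # vertical distance of each point from the plane
cbind(observed = dat$Y, fitted = yhat, residual = round(resid, 3))
sum(resid^2)                              # total squared distance from the plane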


2.6 Assumptions

The multiple linear regression model is represented by the matrix equation Y=Xβ+ε

Given n observations, we want to fit this equation.

How do we manage the balance between summarizing the data and fitting the model well?

 

IMPOSE ASSUMPTIONS

Types of Error Assumptions

  • Classical Assumptions: $E(\varepsilon_i) = 0$, $Var(\varepsilon_i) = \sigma^2 \;\forall i$, $Cov(\varepsilon_i, \varepsilon_j) = 0 \;\forall i \neq j$

  • Normal Error Model Assumptions: $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$

Linear Regression Variable Assumptions

  • Independent Variables: Assumed to be constant, predetermined, or something that is already gathered; uncorrelated or linearly independent from each other.

  • Dependent Variable: Assumed to be unknown and inherently random; the observations are independent of one another.

Important Features of the Model

  1. ($Y_i$ as a random variable). The observed value of $Y_i$ in the $i$th trial is the sum of two components: $Y_i = \underbrace{\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}}_{\text{constant terms}} + \underbrace{\varepsilon_i}_{\text{random term}}$

    Hence, Yi is a random variable.

  2. (Expectation of $Y_i$). Since $E(\varepsilon_i) = 0$, it follows that $E(Y_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$. Thus, the response $Y_i$, when the levels of the $k$ independent variables (X's) in the $i$th trial are known, comes from a probability distribution whose mean is $\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik}$. This constant value is referred to as the regression function for the model.

  3. (Error Term). The observed value of the Yi in the ith trial exceeds or falls short of the value of the regression function by the error term amount εi.

  4. (Homoscedasticity). The error terms are assumed to have a constant variance. Thus, the responses Yi have the same constant variance.

  5. (Independence of Observations). The error terms are assumed to be uncorrelated. Hence, the outcome in any one trial has no effect on the error term for any other trial – as to whether it is positive or negative, small or large. Since the error terms εi and εj are uncorrelated, so are the responses Yi and Yj.

  6. (Distribution of $Y_i$). The assumption of normality of the error terms implies that the $Y_i$ are also independent (but not necessarily identically distributed) normal random variables. That is: $Y_i \overset{ind}{\sim} \text{Normal}(\mu = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \cdots + \beta_k X_{ik},\; \sigma^2)$

    Is it justifiable to impose normality assumption on the error terms?

    • A major reason why the normality assumption for the error terms is justifiable in many situations is that the error terms frequently represent the effects of many factors omitted explicitly in the model, that do affect the response to some extent and that vary at random without reference to the independent variables. Also, there might be random measurement errors in recording Y.
      • Insofar as these random effects have a degree of mutual independence, the composite error term representing all these factors will tend, by the Central Limit Theorem, toward normality as the number of factor effects becomes large.
    • A second reason why the normality assumption for the error terms is frequently justifiable is that some of the estimation and testing procedures to be discussed in the next chapters are based on the t-distribution, which is not sensitive to moderate departures from normality.
      • Thus, unless the departures from normality are serious, particularly with respect to skewness, the actual confidence coefficients and risks of errors will be close to the levels for exact normality.

In summary, the regression assumptions are the following:

  • the expected value of the unknown quantity εi is 0 for every observation i
  • the variance of the error terms is the same for all observations
  • the error terms (and hence the observations) are uncorrelated with one another
  • the error terms follow a normal distribution (under the Normal error model assumption)
  • the independent variables are linearly independent from each other
  • the independent variables are assumed to be constants and not random
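
To tie these assumptions together, the sketch below generates data that satisfy all of them, fits the model with lm(), and inspects the residuals with a normal Q-Q plot; all parameter values are arbitrary illustrative choices:

set.seed(2024)
n   <- 100
X1  <- runif(n, 0, 10)
X2  <- runif(n, 0, 10)
eps <- rnorm(n, mean = 0, sd = 2)        # errors satisfy the normal error model: iid N(0, sigma^2)
Y   <- 2 + 3 * X1 + 4 * X2 + eps         # data generated so that all assumptions hold

fit <- lm(Y ~ X1 + X2)                   # fit the multiple linear regression model
coef(fit)                                # estimates should land near (2, 3, 4)

qqnorm(resid(fit)); qqline(resid(fit))   # residuals hug the line when the normality assumption holds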

© 2024 Siegfred Roi Codia. All rights reserved.