Chair of Spatial Data Science and Statistical Learning

Lecture 3 Linear Regression

3.1 Overview

In this lecture we introduce linear regression. We start with simple linear regression in Section 3.2, focusing on its history, the model assumptions, and the properties of the least squares estimator. We then extend this model by adding more covariates to obtain multiple linear regression in Section 3.3, and in Section 3.4 we show that likelihood estimation yields the same result as OLS.

3.2 Simple Linear Regression

3.2.1 History on Linear Models

Linear models have a rich history, tracing back to early statistical work in the 19th century. One foundational study of the concept of “regression” was conducted by Sir Francis Galton, who explored the relationship between the heights of parents and their children: in the plot of his data it can be observed that an extreme value in the heights of the parents does not automatically lead to an extreme value for the child. Instead, there is a tendency to “regress to” the (conditional) mean.

3.2.2 Regression Equation

To capture this relationship between two variables, a straight line that represents the conditional mean of the data can be used.

We explain a so-called response variable $Y_i$ by the function $f(x_i)$, which models the average relationship between the covariates and the response. Moreover, we add $\varepsilon_i$ as an error term to capture the unexplained random noise.

$$\underbrace{Y_i}_{\substack{\text{response}\\\text{variable}}} = \underbrace{f(x_i)}_{\substack{\text{average}\\\text{relationship}}} + \underbrace{\varepsilon_i}_{\substack{\text{error}\\\text{term}}}$$

There are many different options for modelling the function $f(x_i)$. In our case with only one covariate, a straight line can be used to represent the conditional mean of the data: $$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad \text{i.e. } f(x_i) = \beta_0 + \beta_1 x_i,$$ where $\beta_0$ is the intercept, $\beta_1$ the slope coefficient, and $x_i$ a single covariate. The challenge now is determining how to find the best estimates for $\beta_0$ and $\beta_1$ that accurately fit the data.

3.2.3 Concept: Least Squares Estimation

Carl Friedrich Gauss’ development of the method of least squares provided an elegant solution to this problem. The core idea is to minimize the sum of squared differences between the regression line and the observed data:

$$(\hat{\beta}_0, \hat{\beta}_1) = \underset{\beta_0, \beta_1}{\operatorname{arg\,min}} \sum_{i=1}^{n} \bigl(Y_i - (\beta_0 + \beta_1 x_i)\bigr)^2, \qquad \text{where } Y_i - (\beta_0 + \beta_1 x_i) = \varepsilon_i.$$

Squaring the differences ensures that larger deviations are given more weight and eliminates the issue of residuals canceling each other out due to differing signs.

In the case of Sir Francis Galton’s height example the resulting regression line looks as follows:

3.2.4 Application: Least Squares Estimation

For a better understanding we provide an interactive shiny application. The example used in the app focuses on the influence of speed (in mph) on the braking distance (in feet). We mark the data points in blue and the regression line in red. Further, the plot shows three green squares, which represent the squared residuals of three randomly chosen observations. Summing these squares over all data points yields the residual sum of squares (SSR), which is shown in the top left corner. You can move the sliders and observe how the red regression line adapts. Since minimizing the SSR is the goal of OLS estimation, try to reduce it in order to find the best fit.

Click here for the full version of the ‘Linear Regression’ shiny app.
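The same quantity can be computed directly. The following minimal R sketch reproduces the SSR calculation behind the app, assuming it uses R's built-in `cars` data set (speed in mph, stopping distance in feet); the data set choice and the trial coefficients are illustrative assumptions.

```r
# SSR for candidate coefficients (beta0, beta1), assuming the built-in 'cars' data
ssr <- function(beta0, beta1, x = cars$speed, y = cars$dist) {
  sum((y - (beta0 + beta1 * x))^2)  # squared vertical distances to the line
}

ssr(0, 2)                              # a rough guess
ssr(-17.6, 3.9)                        # values close to the least squares solution
coef(lm(dist ~ speed, data = cars))    # OLS estimates for comparison
```

Moving the sliders in the app corresponds to changing the arguments of `ssr()`; the OLS fit is the pair of values for which this function is minimal.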

3.2.5 Model Assumptions

  • The error terms are independently and identically distributed (iid) with $E(\varepsilon_i) = 0$ and $\operatorname{Var}(\varepsilon_i) = \sigma^2$ (homoscedastic variance).

  • For deriving tests and confidence intervals we additionally assume the error terms $\varepsilon_i$ to be normally distributed, i.e. $\varepsilon_i \sim N(0, \sigma^2)$.

Thus, the resulting distribution of $Y_i$ is

  • $E(Y_i) = E(\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 x_i + E(\varepsilon_i) = \beta_0 + \beta_1 x_i$,
  • $\operatorname{Var}(Y_i) = \operatorname{Var}(\beta_0 + \beta_1 x_i + \varepsilon_i) = \operatorname{Var}(\varepsilon_i) = \sigma^2$,

i.e. under the normality assumption $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$, independently across observations.

3.2.6 Result: Least Squares Estimator

The least squares estimators for the intercept and the slope are:

  • $\hat{\beta}_1 = \dfrac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$
  • $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
We provide a full derivation of both results in the following dropdown:
Derivation

We start with the sum of squared residuals (SSR) for a simple linear regression model:

$$SSR = \sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr)^2$$

The goal is to minimize SSR with respect to $\beta_0$ and $\beta_1$. To do so, we take partial derivatives with respect to $\beta_0$ and $\beta_1$, and set them to zero:

  1. Partial Derivative with Respect to $\beta_0$: $$\frac{\partial SSR}{\partial \beta_0} = -2\sum_{i=1}^n \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr) = 0$$

    Simplify: $$\sum_{i=1}^n y_i = n\beta_0 + \beta_1 \sum_{i=1}^n x_i$$

    Rearrange to express the first equation: $$\beta_0 n + \beta_1 \sum_{i=1}^n x_i = \sum_{i=1}^n y_i \tag{1}$$

  2. Partial Derivative with Respect to $\beta_1$: $$\frac{\partial SSR}{\partial \beta_1} = -2\sum_{i=1}^n x_i \bigl(y_i - (\beta_0 + \beta_1 x_i)\bigr) = 0$$

    Simplify: $$\sum_{i=1}^n x_i y_i = \beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2$$

    Rearrange to express the second equation: $$\beta_0 \sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i \tag{2}$$


From equations (1) and (2), we now solve for $\beta_0$ and $\beta_1$:

  1. Rewrite equation (1) for $\beta_0$: $$\beta_0 = \frac{1}{n}\left(\sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i\right)$$

  2. Substitute this expression for $\beta_0$ into equation (2): $$\frac{1}{n}\left(\sum_{i=1}^n y_i - \beta_1 \sum_{i=1}^n x_i\right)\sum_{i=1}^n x_i + \beta_1 \sum_{i=1}^n x_i^2 = \sum_{i=1}^n x_i y_i$$

  3. Simplify to isolate $\hat{\beta}_1$: collecting the terms involving $\beta_1$ gives $\beta_1\left(\sum_{i=1}^n x_i^2 - n\bar{x}^2\right) = \sum_{i=1}^n x_i y_i - n\bar{x}\bar{y}$, and since $\sum_{i=1}^n x_i^2 - n\bar{x}^2 = \sum_{i=1}^n (x_i - \bar{x})^2$ and $\sum_{i=1}^n x_i y_i - n\bar{x}\bar{y} = \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$, we obtain $$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

    where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$.

  4. Use the value of $\hat{\beta}_1$ to compute $\hat{\beta}_0$: $$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
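These closed-form expressions can be checked numerically. Below is a small R sketch, using simulated data with illustrative (assumed) true values $\beta_0 = 1$ and $\beta_1 = 2$, that computes $\hat{\beta}_0$ and $\hat{\beta}_1$ from the formulas above and compares them with `lm()`.

```r
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)   # assumed true model: beta0 = 1, beta1 = 2

# Closed-form least squares estimates
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # identical results from R's built-in fitting routine
```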

3.2.7 Properties of the Least Squares Estimator

There are two important properties of the LS-estimator: unbiasedness and consistency.

  • Unbiased: The LS-estimator is unbiased, i.e. $E(\hat{\beta}_0) = \beta_0$ and $E(\hat{\beta}_1) = \beta_1$.

  • Variance: $\operatorname{Var}(\hat{\beta}_0) = \dfrac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar{x})^2}$, $\quad \operatorname{Var}(\hat{\beta}_1) = \dfrac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$.

  • If $\sum_{i=1}^n (x_i - \bar{x})^2 \to \infty$ as $n \to \infty$, then $\hat{\beta}_0$ and $\hat{\beta}_1$ are consistent as well, since their variances decrease towards zero.

In the following we provide proofs for these properties:

Proof: Unbiasedness

1. Estimator for $\hat{\beta}_1$

The formula for the slope estimator $\hat{\beta}_1$ is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

Substitute $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and break the numerator into terms:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \varepsilon_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 - \bar{y}) + \sum_{i=1}^n (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

The first term simplifies because:

$$\sum_{i=1}^n (x_i - \bar{x}) = 0 \qquad (\text{since } \bar{x} \text{ is the mean of the } x_i)$$

For the second term:

$$\sum_{i=1}^n (x_i - \bar{x}) x_i = \sum_{i=1}^n (x_i - \bar{x})\bigl(\bar{x} + (x_i - \bar{x})\bigr) = \sum_{i=1}^n (x_i - \bar{x})^2$$

So this term simplifies to:

$$\beta_1 \sum_{i=1}^n (x_i - \bar{x})^2$$

For the third term:

$$\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i$$

Since the $\varepsilon_i$ are iid with $E[\varepsilon_i] = 0$, the expectation of this term is:

$$E\left[\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i\right] = 0$$

Thus, the expectation of $\hat{\beta}_1$ is:

$$E[\hat{\beta}_1] = \frac{\beta_1 \sum_{i=1}^n (x_i - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta_1.$$


2. Estimator for $\hat{\beta}_0$

The formula for the intercept estimator $\hat{\beta}_0$ is:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$

Substitute $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$:

$$\bar{y} = \frac{1}{n}\sum_{i=1}^n (\beta_0 + \beta_1 x_i + \varepsilon_i) = \beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}$$

Here, $\bar{\varepsilon} = \frac{1}{n}\sum_{i=1}^n \varepsilon_i$, and since $E[\varepsilon_i] = 0$, we have $E[\bar{\varepsilon}] = 0$.

Now substitute into $\hat{\beta}_0$:

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = (\beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}) - \hat{\beta}_1 \bar{x}$$

Taking the expectation: $$E[\hat{\beta}_0] = \beta_0 + \beta_1 \bar{x} - E[\hat{\beta}_1]\bar{x}$$

From the previous result, $E[\hat{\beta}_1] = \beta_1$. Substituting:

$$E[\hat{\beta}_0] = \beta_0 + \beta_1 \bar{x} - \beta_1 \bar{x} = \beta_0$$

Proof: Variance and Consistency
  1. Variance of $\hat{\beta}_1$:

Using the definition of $\hat{\beta}_1$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

We can rewrite the numerator by substituting $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ and $\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{\varepsilon}$: $$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(\beta_0 + \beta_1 x_i + \varepsilon_i - \beta_0 - \beta_1 \bar{x} - \bar{\varepsilon})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$ The constant $\beta_0$ cancels, and the $\bar{\varepsilon}$ term drops out because $\bar{\varepsilon}\sum_{i=1}^n (x_i - \bar{x}) = 0$. Hence, the numerator can be summarized as: $$\hat{\beta}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})\bigl(\beta_1 (x_i - \bar{x}) + \varepsilon_i\bigr)}{\sum_{i=1}^n (x_i - \bar{x})^2} = \beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

We take the variance: $$\operatorname{Var}(\hat{\beta}_1) = \operatorname{Var}\left(\beta_1 + \frac{\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$ Since the coefficient $\beta_1$ is assumed to be fixed, it does not contribute to the variance:

$$\operatorname{Var}(\hat{\beta}_1) = \operatorname{Var}\left(\frac{\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)$$

The denominator $\sum_{i=1}^n (x_i - \bar{x})^2$ is a constant, so it can be pulled out of the variance as its square:

$$\operatorname{Var}(\hat{\beta}_1) = \frac{\operatorname{Var}\left(\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i\right)}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2}$$ Now the only random parts are the $\varepsilon_i$, which are independent and identically distributed with $\operatorname{Var}(\varepsilon_i) = \sigma^2$. The variance of this weighted sum of independent variables is:

$$\operatorname{Var}\left(\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i\right) = \sum_{i=1}^n (x_i - \bar{x})^2 \operatorname{Var}(\varepsilon_i) = \sigma^2 \sum_{i=1}^n (x_i - \bar{x})^2$$

Thus:

$$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2 \sum_{i=1}^n (x_i - \bar{x})^2}{\left(\sum_{i=1}^n (x_i - \bar{x})^2\right)^2}$$

Simplify: $$\operatorname{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$


2. Variance of $\hat{\beta}_0$:

Using $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, the variance is:

$$\operatorname{Var}(\hat{\beta}_0) = \operatorname{Var}(\bar{y}) + \bar{x}^2 \operatorname{Var}(\hat{\beta}_1) - 2\bar{x}\operatorname{Cov}(\bar{y}, \hat{\beta}_1)$$

Variance of $\bar{y}$: Since $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i$ and $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, the variance of $\bar{y}$ is:

$$\operatorname{Var}(\bar{y}) = \frac{\sigma^2}{n}$$

Covariance of $\bar{y}$ and $\hat{\beta}_1$: We include this term because both $\bar{y}$ and $\hat{\beta}_1$ involve the random errors $\varepsilon_i$. However, $\operatorname{Cov}(\bar{y}, \hat{\beta}_1) = 0$: only the error parts are random, so $$\operatorname{Cov}(\bar{y}, \hat{\beta}_1) = \operatorname{Cov}\left(\bar{\varepsilon}, \frac{\sum_{i=1}^n (x_i - \bar{x})\varepsilon_i}{\sum_{i=1}^n (x_i - \bar{x})^2}\right) = \frac{\sigma^2 \sum_{i=1}^n (x_i - \bar{x})}{n \sum_{i=1}^n (x_i - \bar{x})^2} = 0,$$ since $\sum_{i=1}^n (x_i - \bar{x}) = 0$.

Combine terms: Substitute into the formula:

$$\operatorname{Var}(\hat{\beta}_0) = \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

Bring both terms onto a common denominator and use $\sum_{i=1}^n x_i^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n\bar{x}^2$:

$$\operatorname{Var}(\hat{\beta}_0) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar{x})^2}$$

3. Consistency

Both variances tend towards zero as $n \to \infty$, provided $\sum_{i=1}^n (x_i - \bar{x})^2 \to \infty$:

For $\operatorname{Var}(\hat{\beta}_1)$: $$\lim_{n \to \infty} \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2} = 0$$

For $\operatorname{Var}(\hat{\beta}_0)$: $$\lim_{n \to \infty} \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar{x})^2} = 0$$

Unless the $x_i$ are constant (which they are not in most cases), the sums in both denominators keep increasing as more data points are added.

Hence, the estimators are consistent.
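A small Monte Carlo experiment in R can illustrate these properties. The sketch below, with illustrative (assumed) true values $\beta_0 = 1$, $\beta_1 = 2$ and $\sigma = 1$, repeatedly simulates data, estimates the slope with `lm()`, and compares the empirical mean and variance of $\hat{\beta}_1$ with $\beta_1$ and the theoretical variance $\sigma^2 / \sum_{i=1}^n (x_i - \bar{x})^2$.

```r
set.seed(42)
n     <- 50
x     <- runif(n, 0, 10)          # fixed design
sigma <- 1

# simulate many data sets and store the slope estimate of each fit
beta1_hat <- replicate(5000, {
  y <- 1 + 2 * x + rnorm(n, sd = sigma)   # assumed true model
  coef(lm(y ~ x))[2]
})

mean(beta1_hat)                     # close to the true slope 2 (unbiasedness)
var(beta1_hat)                      # close to the theoretical variance ...
sigma^2 / sum((x - mean(x))^2)      # ... sigma^2 / sum((x_i - xbar)^2)
```

Increasing `n` shrinks both the empirical and the theoretical variance, which is the consistency argument above in action.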

3.3 Multiple Linear Regression

Until now we only considered models with one covariate $X$. From now on we consider $p$ covariates $X_1, \dots, X_p$, each with $n$ observations. Extending the example from above, we might also be interested in the parents' income or the children's diet as factors influencing height.

General model assumption: $$\underbrace{Y_i}_{\substack{\text{response}\\\text{variable}}} = \underbrace{f(x_i)}_{\substack{\text{average}\\\text{relationship}}} + \underbrace{\varepsilon_i}_{\substack{\text{error}\\\text{term}}}$$

From now on we refer to this model simply as the linear model.

3.3.1 Model Formulation

The formula of the linear regression model is as follows:

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad \text{i.e., } f(x_{i1}, \dots, x_{ip}) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}$$

In matrix notation this can be rewritten as: $$y = X\beta + \varepsilon,$$

where:

  • $y$ is an $n \times 1$ vector of observed responses,
  • $X$ is an $n \times (p+1)$ design matrix (with rows corresponding to observations and columns to predictors, including a column of 1s for the intercept),
  • $\beta$ is a $(p+1) \times 1$ vector of coefficients to be estimated,
  • $\varepsilon$ is an $n \times 1$ vector of error terms (assumed to be i.i.d. with $\varepsilon \sim N(0, \sigma^2 I_n)$).
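To make these dimensions concrete, here is a brief R sketch that builds the design matrix for a model with two covariates; the data set (`mtcars`) and the chosen covariates are only illustrative assumptions.

```r
# Design matrix for mpg ~ wt + hp on the built-in 'mtcars' data (illustrative choice)
X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # includes the column of 1s
y <- mtcars$mpg

dim(X)     # n x (p + 1): 32 x 3
head(X)    # first column is the intercept column of 1s
length(y)  # n = 32
```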

3.3.2 Least Squares Estimation

Estimate the unknown regression coefficients by minimizing the squared differences between $Y_i$ and $\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}$:

$$\hat{\beta} = (\hat{\beta}_0, \dots, \hat{\beta}_p) = \underset{\beta_0, \dots, \beta_p}{\operatorname{arg\,min}} \; LS(\beta_0, \beta_1, \dots, \beta_p)$$

with $LS$ being the least squares criterion: $$LS(\beta_0, \beta_1, \dots, \beta_p) := \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_{i1} - \dots - \beta_p x_{ip})^2$$

The derivation of the coefficient vector is provided in the dropdown:
Derivation

Goal: Minimize the Residual Sum of Squares (SSR)

The least squares criterion equals the residual sum of squares (SSR). In matrix notation it is formulated as: $$SSR(\beta) = \|y - X\beta\|^2 = (y - X\beta)'(y - X\beta).$$

Expanding the quadratic form: $$SSR(\beta) = y'y - 2\beta'X'y + \beta'X'X\beta.$$


Step 1: First-order condition:

To minimize $SSR(\beta)$, take the derivative with respect to $\beta$ and set it equal to zero:

$$\frac{\partial SSR(\beta)}{\partial \beta} = -2X'y + 2X'X\beta = 0.$$

Simplify: $$X'X\beta = X'y.$$

Step 2: Solve for $\beta$:

If $X'X$ is invertible (i.e., $X$ has full column rank), we can solve for $\beta$ as: $$\hat{\beta} = (X'X)^{-1}X'y.$$ This is the OLS estimator for $\beta$.
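The matrix formula can be evaluated directly and compared with `lm()`. A minimal R sketch, continuing the illustrative `mtcars` example from above:

```r
# OLS coefficients via the normal equations, using the X and y from above
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y  # (X'X)^{-1} X'y
beta_hat

coef(lm(mpg ~ wt + hp, data = mtcars))        # same values from lm()
```

In practice, `lm()` solves the same problem via a QR decomposition rather than explicitly inverting $X'X$, which is numerically more stable, but the result is the same.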

3.3.3 Interpretation

For $\beta_0$: the average of $Y_i$ if all covariates $x_{i1}, \dots, x_{ip}$ are zero.

For $\beta_j$ with $j > 0$:

  • $\beta_j > 0$: If the covariate $X_j$ increases by one unit, the response variable $Y$ increases (ceteris paribus) by $\beta_j$ on average.
  • $\beta_j < 0$: If the covariate $X_j$ increases by one unit, the response variable $Y$ decreases (ceteris paribus) by $|\beta_j|$ on average.
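This ceteris paribus interpretation can be made tangible in R. In the following sketch (again using the illustrative `mtcars` model), the difference in predictions when one covariate increases by one unit while the other is held fixed equals the corresponding fitted coefficient:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# increase wt by one unit, hold hp fixed
low  <- data.frame(wt = 3, hp = 120)
high <- data.frame(wt = 4, hp = 120)

predict(fit, high) - predict(fit, low)  # equals coef(fit)["wt"]
coef(fit)["wt"]
```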

3.4 Likelihood Estimation

With the Gaussian error assumption we can also use likelihood inference:

$$Y_i \overset{\text{ind.}}{\sim} N(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}, \sigma^2)$$

This yields the same result as LS-estimation:
Proof

1. Likelihood Function

Given the normality of the errors $\varepsilon_i$, the likelihood of observing the data $y = (y_1, \dots, y_n)$ given the parameters $\beta$ and $\sigma^2$ is:

$$L(\beta, \sigma^2 \mid y) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - x_i'\beta)^2}{2\sigma^2}\right),$$

where $x_i' = (1, x_{i1}, x_{i2}, \dots, x_{ip})$ is the vector of predictors for the $i$-th observation.

The log-likelihood function is: $$\ell(\beta, \sigma^2 \mid y) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i'\beta)^2.$$

2. Maximizing the Log-Likelihood

To estimate $\beta$, we maximize the log-likelihood $\ell(\beta, \sigma^2 \mid y)$ with respect to $\beta$.

Since $-\frac{n}{2}\log(2\pi\sigma^2)$ does not depend on $\beta$, we focus on the second term:

$$Q(\beta) = -\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i'\beta)^2.$$

Maximizing $Q(\beta)$ with respect to $\beta$ is equivalent to minimizing:

$$SSR(\beta) = \sum_{i=1}^n (y_i - x_i'\beta)^2 = \|y - X\beta\|^2,$$ which is equivalent to the LS criterion (residual sum of squares, SSR). Consequently, the resulting estimators are equivalent, too.
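The equivalence can also be checked numerically. The R sketch below maximizes the log-likelihood with the general-purpose optimizer `optim()` and compares the result with the OLS fit; the `mtcars` model is again just an illustrative assumption.

```r
# Negative log-likelihood of the Gaussian linear model, parameterized by
# (beta, log(sigma)); uses the illustrative mtcars model from above
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

negloglik <- function(par) {
  beta  <- par[1:ncol(X)]
  sigma <- exp(par[length(par)])  # log-parameterization keeps sigma positive
  -sum(dnorm(y, mean = X %*% beta, sd = sigma, log = TRUE))
}

start  <- c(rep(0, ncol(X)), log(sd(y)))          # rough starting values
fit_ml <- optim(start, negloglik, method = "BFGS")

fit_ml$par[1:ncol(X)]                             # ML estimates of beta ...
coef(lm(mpg ~ wt + hp, data = mtcars))            # ... approximately match the OLS estimates
```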

Likelihood estimation has an advantage over OLS because it utilizes the full probability distribution of the data and thus more information. While OLS minimizes residuals and focuses on the relationship between predictors and the response, likelihood estimation incorporates information about the variance structure and the shape of the error distribution, enabling potentially more efficient parameter estimates and richer inference. It also allows estimating additional parameters (e.g., the error variance) and supports hypothesis testing, model comparison, and extensions to more complex models such as GLMs.