CHAPTER 3 Estimation Procedures
The Linear Model
\[ \underset{n\times1}{\textbf{Y}} = \underset{n\times (k+1)}{\textbf{X}}\underset{(k+1)\times1}{\boldsymbol{\beta}} + \underset{n\times 1}{\boldsymbol{\varepsilon}}\\\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots\\ Y_n \end{bmatrix} =\begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1k} \\ 1 & X_{21} & X_{22} & \cdots & X_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & X_{n1} & X_{n2} & \cdots & X_{nk} \end{bmatrix}\begin{bmatrix}\beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots\\ \beta_k \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots\\ \varepsilon_n\end{bmatrix} \]
where
- \(\textbf{Y}\) is the response vector.
- \(\textbf{X}\) is the design matrix.
- \(\boldsymbol{\beta}\) is the regression coefficients vector.
- \(\boldsymbol{\varepsilon}\) is the error term vector.
Assumptions
- The expected value of the error term \(\varepsilon_i\) is 0 for every observation \(i\).
- The variance of the error terms is the same for all observations.
- The error terms are uncorrelated across observations.
- The error terms follow a normal distribution.
- The independent variables are not strongly correlated with each other.
Reasons for Having Assumptions
- to make sure that the parameters are estimable
- to make sure that the solution to the estimation problem is unique
- to aid in the interpretation of the model parameters
- to derive more important results that can be used for further analyses
GENERAL PROBLEM:
Given the following model for observations \(i=1,\dots,n\) \[ Y_i=\beta_0+\beta_1X_{i1}+\beta_2 X_{i2}+\dots+\beta_kX_{ik}+\varepsilon_i \] or the matrix form \[ \textbf{Y} = \textbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon} \] how do we estimate the \(\beta\)'s?
There are several estimation procedures that can be used in linear models. The course will focus on least squares, but others are presented as well.
One criterion for a good fitted line is that all points should be close to it; that is, we want the errors \(\varepsilon_i\) to be small.
3.1 Method of Least Squares
Objective function: Minimize \(\sum_{i=1}^n\varepsilon_i^2\), or equivalently \(\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}\), the inner product of the error vector with itself.
In simple linear regression \(y_i=\beta_0 +\beta x_i+\varepsilon_i\), the estimators for \(\beta_0\) and \(\beta\) can be derived as follows:
\[
\hat{\beta_0}=\underset{\beta_0}{\arg \min}\sum_{i=1}^n \varepsilon_i^2 \quad
\hat{\beta}=\underset{\beta}{\arg \min}\sum_{i=1}^n \varepsilon_i^2
\]
Full solution
GOAL: Find the values of \(\beta_0\) and \(\beta\) that minimize \(\sum_{i=1}^n\varepsilon_i^2\), which are the solutions to the equations \[ \frac{d}{d\beta_0}\sum_{i=1}^n\varepsilon_i^2=0\quad\text{and}\quad\frac{d}{d\beta}\sum_{i=1}^n\varepsilon_i^2=0 \]
For the Intercept
\[\begin{align} \frac{d}{d\beta_0}\sum_{i=1}^n\varepsilon_i^2 &= \frac{d}{d\beta_0}\sum_{i=1}^n(y_i-\beta_0-\beta x_i)^2 \\ &= -2\sum_{i=1}^n(y_i-\beta_0-\beta x_i) \end{align}\]
Now, equate to \(0\) and solve for \(\hat{\beta_0}\)
\[\begin{align} 0&=-2\sum_{i=1}^n(y_i-\hat{\beta_0}-\hat{\beta} x_i)\\ 0&=\sum_{i=1}^ny_i - n\hat{\beta_0}-\hat{\beta} \sum_{i=1}^nx_i\\ n\hat{\beta_0} &= \sum_{i=1}^ny_i -\hat{\beta} \sum_{i=1}^nx_i \\ \hat{\beta_0} &= \bar{y} -\hat{\beta} \bar{x} \end{align}\]
For the Slope
Since there is already an estimate for \(\beta_0\), we can plug it into the equation before solving for \(\hat{\beta}\).
\[\begin{align} \frac{d}{d\beta}\sum_{i=1}^n\varepsilon_i^2 &= \frac{d}{d\beta}\sum_{i=1}^n[y_i-\hat{\beta_0}-\beta x_i]^2 \\ &= \frac{d}{d\beta}\sum_{i=1}^n[y_i-(\bar{y}-\beta\bar{x})-\beta x_i]^2 \\ &= \frac{d}{d\beta}\sum_{i=1}^n[(y_i-\bar{y})-\beta( x_i-\bar{x})]^2 \\ &= -2\sum_{i=1}^n\left\{[(y_i-\bar{y})-\beta( x_i-\bar{x})]( x_i-\bar{x})\right\} \\ &= -2\sum_{i=1}^n[(y_i-\bar{y})( x_i-\bar{x})-\beta( x_i-\bar{x})( x_i-\bar{x})] \end{align}\]
Equating to \(0\) and solving for \(\hat{\beta}\)
\[\begin{align} 0 &= -2\sum_{i=1}^n[(y_i-\bar{y})( x_i-\bar{x})-\hat{\beta}( x_i-\bar{x})( x_i-\bar{x})] \\ 0 &= \sum_{i=1}^n[(y_i-\bar{y})( x_i-\bar{x})]- \hat{\beta}\sum_{i=1}^n( x_i-\bar{x})^2 \\ \hat{\beta}&= \frac{\sum_{i=1}^n[(y_i-\bar{y})( x_i-\bar{x})]}{\sum_{i=1}^n( x_i-\bar{x})^2} \end{align}\]
Therefore, we have \(\hat{\beta_0} = \bar{y} -\hat{\beta} \bar{x}\) and \(\hat{\beta} = \frac{\sum_{i=1}^n(y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^n( x_i-\bar{x})^2}\) \(\blacksquare\)
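The closed-form formulas are easy to verify numerically. Below is a minimal R sketch on simulated data (all object names are illustrative) that computes \(\hat{\beta}\) and \(\hat{\beta_0}\) directly and compares them with the estimates from lm():

# simulated data for illustration only
set.seed(123)
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50, sd = 1.5)

# closed-form least squares estimates for simple linear regression
b1 <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2)   # slope
b0 <- mean(y) - b1 * mean(x)                                      # intercept
c(b0, b1)

coef(lm(y ~ x))   # should match the estimates above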
How do we generalize this for the multiple linear regression?
The Normal Equations
Definition 3.1 The normal equations are the system of equations obtained by setting the derivatives of the objective function to zero; their solution is a closed-form expression for the parameter estimates that minimize the objective function.
In the least squares method, the objective function (or the cost function) is \(\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}=(\textbf{Y}-\textbf{X}\boldsymbol{\beta})'(\textbf{Y}-\textbf{X}\boldsymbol{\beta})\).
If we minimize this with respect to the parameters \(\beta_j\), we will obtain the normal equations \(\frac{\partial\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{\partial\beta_j} = 0, \quad j =0,...,k\). Expanding this, we have the following normal equations:
In \((k+1)\) equations form: \(\frac{\partial\sum_{i=1}^n\varepsilon_i^2}{\partial\beta_j} = 0, \quad j =0,...,k\)
\[\begin{align}1st :&\quad \hat\beta_0n +\hat\beta_1\sum X_{i1} + \hat\beta_2\sum X_{i2} + \cdots + \hat\beta_k\sum X_{ik}&=\quad&\sum Y_i \\2nd :&\quad \hat\beta_0\sum X_{i1} + \hat\beta_1\sum X_{i1}^2 + \hat\beta_2\sum X_{i1}X_{i2} + \cdots + \hat\beta_k\sum X_{i1}X_{ik} &=\quad& \sum X_{i1} Y_i \\ \vdots\quad & &&\quad\vdots \\ (k+1)th :&\quad \hat\beta_0\sum X_{ik} + \hat\beta_1\sum X_{i1} X_{ik} + \hat\beta_2\sum X_{i2}X_{ik} + \cdots + \hat\beta_k\sum X_{ik}^2 &=\quad& \sum X_{ik} Y_i\end{align}\]
In matrix form: \(\frac{\partial\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}}{\partial{\boldsymbol{\beta}}} = \textbf{0}\)
\[ \textbf{X}'\textbf{X}\hat{\boldsymbol{\beta}} = \textbf{X}'\textbf{Y} \]
\[ \begin{bmatrix}n & \sum X_{i1} & \sum X_{i2} & \cdots & \sum X_{ik} \\\sum X_{i1} & \sum X_{i1}^2 & \sum X_{i1}X_{i2} & \cdots & \sum X_{i1}X_{ik} \\\vdots & \vdots & \vdots & \ddots & \vdots \\\sum X_{ik} & \sum X_{i1} X_{ik} & \sum X_{i2}X_{ik} & \cdots & \sum X_{ik}^2 \\\end{bmatrix}\begin{bmatrix} \hat{\beta_0}\\ \hat\beta_1\\ \hat\beta_2\\ \vdots \\ \hat\beta_k\end{bmatrix}= \begin{bmatrix} \sum Y_i \\ \sum X_{i1} Y_i \\ \vdots\\ \sum X_{ik} Y_i\end{bmatrix} \]
Provided that \(\textbf{X}'\textbf{X}\) is invertible, the solution to these equations is unique and is the ordinary least squares estimator for \(\boldsymbol{\beta}\).
The normal equations hold for the least squares fit by construction: they follow from the minimization itself and do not rely on the statistical assumptions of the model.
The OLS Estimator
Theorem 3.1 (Ordinary Least Squares Estimator)
\[ \boldsymbol{\hat{\beta}}=(\textbf{X}'\textbf{X})^{-1}(\textbf{X}'\textbf{Y}) \]
Remarks:
- \((\textbf{X}'\textbf{X})^{−1}\) exists when \(\textbf{X}\) is of full column rank.
- This means that no independent variable can be an exact linear combination of the other independent variables.
- Furthermore, if \(n < p\) (where \(p = k+1\) is the number of columns of \(\textbf{X}\)), then \(\textbf{X}\) cannot be of full column rank. Therefore, there should be at least as many observations as regression coefficients.
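As a numerical check of Theorem 3.1, the following R sketch (simulated data; names are illustrative) builds the design matrix explicitly and computes \((\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{Y}\), which matches the coefficients from lm():

set.seed(123)
n  <- 100
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(n)

X <- cbind(1, X1, X2)                            # design matrix: column of 1s, then the regressors
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y     # (X'X)^{-1} X'Y
# equivalently, solve(crossprod(X), crossprod(X, Y)) solves the normal equations directly
t(beta_hat)

coef(lm(Y ~ X1 + X2))                            # same estimates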
Since the estimator \(\hat{\boldsymbol{\beta}}\) is a random vector, we also want to explore its mean and variance.
Theorem 3.2 The OLS estimator \(\hat{\boldsymbol\beta}\) is an unbiased estimator for \(\boldsymbol{\beta}\)
Theorem 3.3 The variance-covariance matrix of \(\hat{\boldsymbol{\beta}}\) is given by \(\sigma^2(\textbf{X}'\textbf{X})^{-1}\)
Remark: \(Var(\hat{\beta}_j)\) is the \((j+1)^{th}\) diagonal element of \(\sigma^2(\textbf{X}'\textbf{X})^{-1}\)
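Continuing the sketch above: since \(\sigma^2\) is unknown, it can be replaced by the usual residual-based estimate (the sum of squared residuals divided by \(n - k - 1\)); the resulting matrix agrees with vcov() applied to the lm fit.

e  <- Y - X %*% beta_hat                # residuals from the sketch above
s2 <- sum(e^2) / (n - ncol(X))          # estimate of sigma^2 with n - (k+1) degrees of freedom
s2 * solve(t(X) %*% X)                  # estimated variance-covariance matrix of beta_hat
vcov(lm(Y ~ X1 + X2))                   # same matrix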
More properties regarding the method of least squares will be discussed in Section 4.5, where we also explain that the OLS estimator for \(\boldsymbol{\beta}\) is the BLUE for \(\boldsymbol{\beta}\) and, under the normality assumption, also its MLE and UMVUE.
Q: What does it mean when the variance of an estimator \(\hat{\beta}_j\) is large?
Exercises
Exercise 3.1 Prove Theorem 3.1
Assume that \((\textbf{X}'\textbf{X})^{-1}\) exists.
Hints
- Minimize \(\boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}\) with respect to \(\boldsymbol{\beta}\).
- Note that \(\boldsymbol{\varepsilon}=\textbf{Y}-\textbf{X}\boldsymbol{\beta}\).
- Recall matrix calculus.
- You will obtain the normal equations, then solve for \(\boldsymbol{\beta}\).
Exercise 3.2 Prove Theorem 3.2
Hint
- Show \(E(\hat{\boldsymbol{\beta}})=\boldsymbol{\beta}\)
Exercise 3.3 Prove Theorem 3.3
Hint
- Show \(Var(\hat{\boldsymbol{\beta}})= \sigma^2(\textbf{X}'\textbf{X})^{-1}\)
The following exercises will help us understand some requirements before obtaining the estimate \(\hat{\boldsymbol{\beta}}\)
Exercise 3.4 Let \(\textbf{X}\) be the \(n\times p\) design matrix. Prove that if \(\textbf{X}\) has linearly dependent columns, then \(\textbf{X}'\textbf{X}\) is not invertible
Hints
The following should be part of your proof. Do not forget to cite reasons.
- highest possible rank of \(\textbf{X}\)
- highest possible rank of \(\textbf{X}'\textbf{X}\)
- order of \(\textbf{X}'\textbf{X}\)
- determinant of \(\textbf{X}'\textbf{X}\)
Exercise 3.5 Prove that if \(n < p\), then \(\textbf{X}'\textbf{X}\) is not invertible.
Hints
Almost the same hints as in the previous exercise. (A numerical illustration of both exercises is given below.)
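The following R snippet gives a quick numerical illustration (not a proof) of these two exercises, using made-up data: when one column of \(\textbf{X}\) is an exact linear combination of another, \(\textbf{X}'\textbf{X}\) is singular.

X1 <- c(1, 2, 3, 4, 5)
X2 <- 2 * X1                   # exactly linearly dependent on X1
X  <- cbind(1, X1, X2)

XtX <- t(X) %*% X
det(XtX)                       # zero (up to rounding), so XtX is not invertible
# solve(XtX)                   # uncommenting this line throws an error: the matrix is singular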
3.2 Other Estimation Methods
As mentioned, we will focus on least squares estimation in this course. The following is only a quick introduction to other methods for estimating the coefficients, or the equation that predicts \(Y\).
Method of Least Absolute Deviations
Objective function: Minimize \(\sum_{i=1}^n|\varepsilon_i|\)
- The solution is the median regression model, a special case of quantile regression.
- Solution is robust to extreme observations. The method places less emphasis on outlying observations than does the method of least squares.
- This method is one of a variety of robust methods that have the property of being insensitive to both outlying data values and inadequacies of the model employed.
- The solution may not be unique.
Backfitting Method
Assuming an additive model \[ Y_i=\beta_0+ \beta_1X_{i1}+\beta_2X_{i2}+\cdots+\beta_kX_{ik}+\varepsilon_i \] with constraints \(\sum_{i=1}^n\beta_jX_{ij}=0\) for \(j=1,\dots,k\) and \(\beta_0=\frac{1}{n}\sum_{i=1}^nY_i\):
1. Fit the coefficient of the most important variable.
2. Compute the residuals: \(Y_1 = Y - \hat{\beta}_1 X_1\).
3. Fit the coefficient of the next most important variable.
4. Compute the residuals: \(Y_2 = Y_1 - \hat{\beta}_2 X_2\).
5. Continue until the last coefficient has been computed (a short sketch of this procedure follows).
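Here is a minimal R sketch of one pass of this sequential idea on simulated data (variable names are illustrative; the full backfitting algorithm iterates these steps until the coefficients stop changing):

set.seed(1)
n  <- 200
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 5 + 2 * X1 + 1 * X2 + rnorm(n)

b0 <- mean(Y)                      # beta_0 = (1/n) * sum(Y_i)
r  <- Y - b0                       # start from the centered response

b1 <- coef(lm(r ~ X1 - 1))         # fit the coefficient of the first (most important) variable
r  <- r - b1 * X1                  # residuals after removing its effect

b2 <- coef(lm(r ~ X2 - 1))         # fit the next coefficient on the residuals
r  <- r - b2 * X2

c(b0, b1, b2)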
Gradient Descent
Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. It is one of the first algorithms you will encounter if you study machine learning.
1. Initialize the parameters of the model randomly.
2. Compute the gradient of the cost function with respect to each parameter, i.e., take the partial derivatives of the cost function with respect to the parameters.
3. Update the parameters by taking a step in the opposite direction of the gradient. The step size is controlled by a learning rate hyperparameter, often denoted by \(\alpha\).
4. Repeat steps 2 and 3 iteratively to get the best parameters for the defined model.
If the cost function is the sum of squared errors, the closed-form OLS solution is preferable.
But if the cost function has no closed-form solution, gradient descent becomes useful, as the sketch below illustrates.
Source: https://www.geeksforgeeks.org/gradient-descent-in-linear-regression/
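Below is a minimal R sketch of gradient descent applied to the least squares cost on simulated data; the learning rate and the number of iterations are arbitrary illustrative choices.

set.seed(1)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)                            # design matrix

beta  <- c(0, 0)                            # step 1: initialize the parameters
alpha <- 0.05                               # learning rate (step-size hyperparameter)
for (iter in 1:2000) {
  grad <- -2 * t(X) %*% (y - X %*% beta)    # step 2: gradient of the sum of squared errors
  beta <- beta - alpha * grad / n           # step 3: move opposite the gradient
}                                           # step 4: repeat

t(beta)                                     # close to the OLS solution
coef(lm(y ~ x))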
Nonparametric Regression
Fit the model \(Y_i = f(X_{i1}, X_{i2}, \cdots, X_{ik}) + \varepsilon_i\), where the function \(f\) has no assumed parametric form.
Here it is the unknown function \(f\) itself that is estimated, not just a set of parameters as in parametric fitting. (A short kernel-smoothing sketch is given after the list of examples below.)
Some examples:
- Kernel Estimation
- Projection Pursuit
- Spline Smoothing
- Wavelets Fitting
- Iterative Backfitting
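As a small taste, the base R function ksmooth() produces a kernel estimate of \(f\) for a single predictor; the data and bandwidth below are illustrative only.

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)                       # a clearly nonlinear relationship

fit <- ksmooth(x, y, kernel = "normal", bandwidth = 1)   # kernel estimate of f
plot(x, y)
lines(fit, lwd = 2)                                      # estimated curve, no parametric form assumed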
Example (Anscombe dataset): predict a state's per capita income based on its budget for education.
State | education | income | young | urban |
---|---|---|---|---|
ME | 189 | 2824 | 350.7 | 508 |
NH | 169 | 3259 | 345.9 | 564 |
VT | 230 | 3072 | 348.5 | 322 |
MA | 168 | 3835 | 335.3 | 846 |
RI | 180 | 3549 | 327.1 | 871 |
CT | 193 | 4256 | 341.0 | 774 |
NY | 261 | 4151 | 326.2 | 856 |
NJ | 214 | 3954 | 333.5 | 889 |
PA | 201 | 3419 | 326.2 | 715 |
OH | 172 | 3509 | 354.5 | 753 |
IN | 194 | 3412 | 359.3 | 649 |
IL | 189 | 3981 | 348.9 | 830 |
MI | 233 | 3675 | 369.2 | 738 |
WI | 209 | 3363 | 360.7 | 659 |
MN | 262 | 3341 | 365.4 | 664 |
IO | 234 | 3265 | 343.8 | 572 |
MO | 177 | 3257 | 336.1 | 701 |
ND | 177 | 2730 | 369.1 | 443 |
SD | 187 | 2876 | 368.7 | 446 |
NE | 148 | 3239 | 349.9 | 615 |
KA | 196 | 3303 | 339.9 | 661 |
DE | 248 | 3795 | 375.9 | 722 |
MD | 247 | 3742 | 364.1 | 766 |
DC | 246 | 4425 | 352.1 | 1000 |
VA | 180 | 3068 | 353.0 | 631 |
WV | 149 | 2470 | 328.8 | 390 |
NC | 155 | 2664 | 354.1 | 450 |
SC | 149 | 2380 | 376.7 | 476 |
GA | 156 | 2781 | 370.6 | 603 |
FL | 191 | 3191 | 336.0 | 805 |
KY | 140 | 2645 | 349.3 | 523 |
TN | 137 | 2579 | 342.8 | 588 |
AL | 112 | 2337 | 362.2 | 584 |
MS | 130 | 2081 | 385.2 | 445 |
AR | 134 | 2322 | 351.9 | 500 |
LA | 162 | 2634 | 389.6 | 661 |
OK | 135 | 2880 | 329.8 | 680 |
TX | 155 | 3029 | 369.4 | 797 |
MT | 238 | 2942 | 368.9 | 534 |
ID | 170 | 2668 | 367.7 | 541 |
WY | 238 | 3190 | 365.6 | 605 |
CO | 192 | 3340 | 358.1 | 785 |
NM | 227 | 2651 | 421.5 | 698 |
AZ | 207 | 3027 | 387.5 | 796 |
UT | 201 | 2790 | 412.4 | 804 |
NV | 225 | 3957 | 385.1 | 809 |
WA | 215 | 3688 | 341.3 | 726 |
OR | 233 | 3317 | 332.7 | 671 |
CA | 273 | 3968 | 348.4 | 909 |
AK | 372 | 4146 | 439.7 | 484 |
HI | 212 | 3513 | 382.9 | 831 |
library(carData)   # the Anscombe data (assumed here to come from the carData package)
# Ordinary Least Squares
lm(income ~ education, data = Anscombe)
##
## Call:
## lm(formula = income ~ education, data = Anscombe)
##
## Coefficients:
## (Intercept) education
## 1645.383 8.048
# Least Absolute Deviation (Quantile Regression)
library(quantreg)
rq(income ~ education, data = Anscombe)
## Call:
## rq(formula = income ~ education, data = Anscombe)
##
## Coefficients:
## (Intercept) education
## 1209.67290 10.25234
##
## Degrees of freedom: 51 total; 49 residual
3.3 Interpretation of the Coefficients
In general, ceteris paribus (all other things being equal):
\(\hat{\beta_j}\) = the change in the estimated mean of \(Y\) per unit change in \(X_j\), holding the other independent variables constant.
\(\hat{\beta_0}\) = the estimated mean of \(Y\) when all independent variables are set to 0.
For example, in the least squares fit to the Anscombe data above, \(\hat{\beta}_1 \approx 8.05\) means that the estimated mean per capita income increases by about 8.05 for every one-unit increase in a state's education budget.
We don’t always interpret the intercept, but why include it in the estimation?
- Intercept is only interpretable if a value of 0 is logical for all independent variables included in the model.
- Regardless of its interpretability, we still include it in the estimation. That is why in the design matrix \(\textbf{X}\), the first column is a vector of \(1\)s
- If we force the intercept to be equal to 0, then the fitted line is constrained to pass through the origin; with an intercept included, the fitted line instead passes through the center of the points \((\bar{x}, \bar{y})\). (See the illustration below.)
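For illustration, in R the intercept can be dropped by adding - 1 (or + 0) to the model formula; this only demonstrates the syntax and is not a recommendation for these data:

lm(income ~ education, data = Anscombe)        # with intercept
lm(income ~ education - 1, data = Anscombe)    # fitted line forced through the origin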
Caution on the Interpretation of the Coefficients
- Coefficients are partial effects; they assume that the other variables in the model are held constant.
- The validity of this interpretation depends on whether the assumption that the \(X_j\)s are not strongly correlated with one another holds.
- The interpretation applies only within the range of \(X\) values used in the estimation.
Example