Steps for the calculus solution to the least squares estimate
In order to find the minimum of a least squares function, we may follow these steps:
Construct the sum-of-squares function. For the simple no-intercept model y_i = \beta x_i + \epsilon_i this is:
S(\beta) = \sum_{i=1}^n(y_i- \beta x_i)^2
Differentiate with respect to \beta:
\frac{dS}{d\beta}.
Set the derivative equal to zero and solve for \beta:
\frac{dS}{d\beta}=0.
Check that you have found a minimum by computing the second derivative.
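The steps above can be sketched numerically. The following Python sketch applies them to the no-intercept model: setting dS/d\beta = -2\sum x_i(y_i - \beta x_i) = 0 gives \hat{\beta} = \sum x_iy_i / \sum x_i^2, and the second derivative 2\sum x_i^2 is positive, confirming a minimum. The function name and data are illustrative, not from the source.

```python
def beta_hat_no_intercept(x, y):
    # Solution of dS/dbeta = -2 * sum(x_i * (y_i - beta * x_i)) = 0.
    # The second derivative, 2 * sum(x_i**2), is positive, so this is a minimum.
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Made-up illustrative data:
x = [1, 2, 3, 4]
y = [2.1, 3.9, 6.2, 7.8]
print(beta_hat_no_intercept(x, y))  # 59.7 / 30 = 1.99
```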
Common notation and sum-of-squares functions
In order to derive the least squares estimates, we need some notation.
Let \bar{x} and \bar{y} denote the sample means of the x_i’s and y_i’s respectively for i=1,\ldots,n.
Let S_{xx} = \sum_{i=1}^n(x_i-\bar{x})^2 which can be written as \sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_i)^2/n. This is often called the corrected sum of squares of x.
Similarly define S_{yy} = \sum_{i=1}^n(y_i-\bar{y})^2 the corrected sum of squares for y.
Let S_{xy} =\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^nx_iy_i-\bigg(\sum_{i=1}^nx_i\sum_{i=1}^ny_i\bigg)/n which is called the corrected sum of products of x and y.
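As a quick check of these identities, a short Python sketch (with made-up data; the function name is my own) computes each corrected sum from its definition and compares it with the computational shortcut:

```python
def corrected_sums(x, y):
    # Corrected sums of squares and products, computed from the definitions.
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_yy = sum((yi - ybar) ** 2 for yi in y)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return s_xx, s_yy, s_xy

# Made-up data to verify the shortcut forms, e.g. S_xx = sum(x_i^2) - (sum x_i)^2 / n:
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 5.0, 6.0]
n = len(x)
s_xx, s_yy, s_xy = corrected_sums(x, y)
assert abs(s_xx - (sum(xi ** 2 for xi in x) - sum(x) ** 2 / n)) < 1e-9
assert abs(s_xy - (sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n)) < 1e-9
```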
Least squares estimates for a simple linear regression
The least squares estimates for a simple linear regression are given by \begin{aligned} \hat{\beta} &= \frac{S_{xy}}{S_{xx}},\nonumber\\ \hat{\alpha} &= \bar{y}-\frac{S_{xy}}{S_{xx}}\bar{x}.\nonumber\end{aligned}
Now let us show this.
\newline
Model: y_i = \alpha+\beta x_i+\epsilon_i
S(\alpha,\beta) = \sum_{i=1}^n(y_i-\alpha-\beta x_i)^2
\newline
Here we have two parameters to estimate, \alpha and \beta.
\begin{aligned} S(\alpha,\beta) &= \sum_iy_i^2-2\sum (\alpha+\beta x_i)y_i+\sum_i(\alpha+\beta x_i)^2\nonumber\\ &=\sum_iy_i^2-2\alpha\sum y_i-2\beta\sum x_iy_i+n\alpha^2+2\alpha\beta\sum_ix_i+\beta^2\sum_i x_i^2\nonumber\end{aligned}
Thus
\begin{aligned} \frac{\partial S}{\partial \alpha}&=2n\alpha+2\beta\sum_ix_i-2\sum_iy_i\nonumber\\ &=2\left(n\alpha+\beta\sum_ix_i-\sum_iy_i\right)\nonumber\\ \frac{\partial S}{\partial \beta} &= -2\sum_ix_iy_i+2\alpha\sum_ix_i+2\beta\sum_ix_i^2\nonumber\\ &=2\left(\alpha\sum_ix_i+\beta\sum_ix_i^2-\sum_ix_iy_i\right)\nonumber\end{aligned}
i.e. we must solve
\begin{aligned} n\alpha+\beta\sum_ix_i &= \sum_iy_i\\ \alpha\sum_ix_i+\beta\sum_ix_i^2 &= \sum_ix_iy_i\end{aligned}
Dividing the first equation by n gives:
\alpha = \bar{y}-\bar{x}\beta
Substituting into the second equation:
\begin{aligned} \bar{y}\sum_ix_i-\bar{x}\beta\sum_ix_i+\beta\sum_ix_i^2 &=\sum_ix_iy_i\nonumber\\ n\bar{x}\bar{y}-\beta n\bar{x}^2+\beta\sum_ix_i^2 &=\sum_ix_iy_i\nonumber\\ \beta &= \frac{\sum_ix_iy_i-n\bar{x}\bar{y}}{\sum_ix_i^2-n\bar{x}^2}\nonumber\\ &=\frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sum_i(x_i-\bar{x})^2}\nonumber\end{aligned}
Once again we denote estimates with a hat (\hat{.}) symbol. Thus,
\hat{\beta} = \frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sum_i(x_i-\bar{x})^2}
and
\hat{\alpha} = \bar{y}-\hat{\beta}\bar{x}
Since S_{xx} = \sum_i(x_i-\bar{x})^2 and S_{xy} = \sum_i(x_i-\bar{x})(y_i-\bar{y}), we can write
\begin{aligned} \hat{\beta} &= \frac{S_{xy}}{S_{xx}}\nonumber\\ \hat{\alpha} &= \bar{y}-\frac{S_{xy}}{S_{xx}}\bar{x}.\nonumber\end{aligned}
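The closed-form estimates above translate directly into code. A minimal Python sketch (the function name is my own, not from the source):

```python
def least_squares_estimates(x, y):
    # beta_hat = S_xy / S_xx,  alpha_hat = ybar - beta_hat * xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    beta_hat = s_xy / s_xx
    alpha_hat = ybar - beta_hat * xbar
    return alpha_hat, beta_hat

# Points lying exactly on y = 1 + 2x recover alpha = 1, beta = 2:
print(least_squares_estimates([0, 1, 2], [1, 3, 5]))
```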
In the above approach, the second derivatives of the sum-of-squares function with respect to the parameters should be computed to verify that a minimum has been found.
\begin{aligned} \frac{\partial^2 S}{\partial \alpha^2}&=2n,\nonumber\\ \frac{\partial^2 S}{\partial \beta^2} &= 2\sum_ix_i^2.\nonumber \end{aligned}
Since n>0 and \sum_ix_i^2>0, both second derivatives are positive; moreover the determinant of the Hessian, 4\left(n\sum_ix_i^2-(\sum_ix_i)^2\right)=4nS_{xx}, is also positive, so the stationary point is indeed a minimum.
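As a cross-check on the derivation, the two normal equations can also be solved directly as a 2×2 linear system. A sketch using Cramer's rule (function name and data are my own):

```python
def solve_normal_equations(x, y):
    # Solve  n*alpha    + beta*sum(x)    = sum(y)
    #        sum(x)*alpha + beta*sum(x^2) = sum(x*y)   by Cramer's rule.
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx          # = n * S_xx, positive unless all x_i are equal
    beta = (n * sxy - sx * sy) / det
    alpha = (sy * sxx - sx * sxy) / det
    return alpha, beta

# Points lying exactly on y = 1 + 2x recover alpha = 1, beta = 2:
print(solve_normal_equations([0, 1, 2], [1, 3, 5]))
```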
Protein in pregnancy
Data were collected out of interest in whether (and, if so, in what way) protein levels change in expectant mothers throughout pregnancy. Observations were taken on 19 healthy women, each at a different stage of pregnancy (gestation). Fit a simple linear regression that describes the relationship between the mothers’ protein levels and gestation length, and then estimate the parameters of this model.
Protein level (mg ml^{-1}) | Gestation (weeks) |
---|---|
0.38 | 11 |
0.58 | 12 |
0.51 | 13 |
0.38 | 15 |
0.58 | 17 |
0.67 | 18 |
0.84 | 19 |
0.56 | 21 |
0.78 | 22 |
0.86 | 25 |
0.65 | 27 |
0.74 | 28 |
0.83 | 29 |
0.99 | 30 |
0.84 | 31 |
1.04 | 33 |
0.92 | 34 |
1.18 | 35 |
0.92 | 36 |
Data:(y_i,x_i), \quad i=1,\dots,n
y_i, protein level of mother i
x_i, gestation of baby i (in weeks)
Model: \mathrm{E}(Y_i)=\alpha + \beta x_i
\newline
Find the least squares estimates using the following summary statistics:
n=19, \sum x_i =456, \sum x_i^2 =12164, \sum y_i = 14.25, \sum x_i y_i =369.87 and \sum y_i^2 =11.55
\begin{aligned} S_{xx} &= \sum_{i=1}^nx_i^2-(\sum_{i=1}^nx_i)^2/n \\ &= 12164 - 456^2/19\\ &= 1220 \end{aligned} \quad \quad \quad \quad \quad \quad \begin{aligned} S_{xy} &= \sum_{i=1}^nx_iy_i-(\sum_{i=1}^nx_i\sum_{i=1}^ny_i)/n \\ &= 369.87 - (456 \times 14.25)/19\\ &= 27.87 \end{aligned}
\begin{aligned} \hat{\beta} &= \frac{S_{xy}}{S_{xx}}\\ &= \frac{27.87}{1220}\\ &= 0.02284 \end{aligned} \quad \quad \quad \quad \quad \quad \quad \quad \quad \begin{aligned} \hat{\alpha} &= \bar{y}-\frac{S_{xy}}{S_{xx}}\bar{x}\\ &= 14.25/19 - 27.87/1220 \times 456/19\\ &= 0.20174. \end{aligned}
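The worked example can be verified numerically. The following Python sketch recomputes the summary statistics and estimates from the 19 observations in the table above:

```python
# Protein level (y, mg/ml) and gestation (x, weeks) from the table above.
y = [0.38, 0.58, 0.51, 0.38, 0.58, 0.67, 0.84, 0.56, 0.78, 0.86,
     0.65, 0.74, 0.83, 0.99, 0.84, 1.04, 0.92, 1.18, 0.92]
x = [11, 12, 13, 15, 17, 18, 19, 21, 22, 25,
     27, 28, 29, 30, 31, 33, 34, 35, 36]

n = len(x)
s_xx = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n              # 1220
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n  # 27.87
beta_hat = s_xy / s_xx                                         # ~0.02284
alpha_hat = sum(y) / n - beta_hat * sum(x) / n                 # ~0.20174
print(round(beta_hat, 5), round(alpha_hat, 5))
```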