Chapter 10 Multiple regression

10.1 Formula

In multiple linear regression there is more than one independent (input, explanatory) variable \(X\). Let \(k\) denote the number of explanatory variables, which are written \(X_1\), …, \(X_k\). The \(i\)-th observation of variable \(X_j\) is denoted \(x_{ij}\) or \(x_{i,j}\); for example, the \(i\)-th observation of \(X_2\) is \(x_{i2}\) or \(x_{i,2}\).

The fitted multiple linear regression equation has the form:

\[\widehat{y_i} = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1} + \widehat{\beta}_2 x_{i2} + \cdots + \widehat{\beta}_k x_{ik}, \tag{10.1}\]

where \(\widehat{y_i}\) is the fitted value of the response for observation \(i\), and \(\widehat{\beta}_0\), \(\widehat{\beta}_1\), \(\ldots\), \(\widehat{\beta}_k\) are the estimated regression coefficients.
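As a minimal sketch of equation (10.1) in R, the code below fits a model with \(k = 2\) explanatory variables and then rebuilds the fitted values by hand from the estimated coefficients. The data are simulated purely for illustration; the names `x1`, `x2`, `y`, `fit`, `b` and `y_hat`, as well as the coefficients used to generate `y`, are assumptions and do not come from the text.

```r
set.seed(1)  # arbitrary seed, used only to make the sketch reproducible

# Hypothetical (simulated) data with k = 2 explanatory variables
n  <- 50
x1 <- rnorm(n)
x2 <- runif(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Fit the multiple linear regression
fit <- lm(y ~ x1 + x2)
b   <- coef(fit)  # b[1] is the intercept, b[2] and b[3] are the slopes

# Equation (10.1) applied to every observation i
y_hat <- b[1] + b[2] * x1 + b[3] * x2

# The hand-computed values agree with those stored by lm()
all.equal(unname(fitted(fit)), unname(y_hat))
```

Comparing against `fitted(fit)` is only a sanity check: it confirms that the fitted values are nothing more than the intercept plus each estimated coefficient times the corresponding \(x_{ij}\).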

If there are two explanatory variables (\(k=2\)), the regression equation (10.1) describes a plane in three-dimensional space (see Figure 10.1); with more than two explanatory variables, the equation describes a hyperplane in \((k+1)\)-dimensional space.

Figure 10.1: A 3D plot illustrating a multiple regression with two explanatory variables.

10.2 Interpretation

Example:

The data (R's built-in swiss data set) were collected around 1888 from 47 French-speaking “provinces” of Switzerland. The Fertility variable is a fertility index based on the number of children per woman, expressed as a percentage of the fertility of the highly fertile Hutterite group, which serves as the benchmark. The Education variable is the percentage of draftees with education beyond primary school. The Infant.Mortality variable is the percentage of live-born infants who die before reaching one year of age.

## 
## Call:
## lm(formula = Fertility ~ Education + Infant.Mortality, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3906  -6.0088  -0.9624   5.8808  21.0736 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)       48.8213     8.8904   5.491 0.000001875 ***
## Education         -0.8167     0.1298  -6.289 0.000000127 ***
## Infant.Mortality   1.5187     0.4287   3.543    0.000951 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.426 on 44 degrees of freedom
## Multiple R-squared:  0.5648, Adjusted R-squared:  0.545 
## F-statistic: 28.55 on 2 and 44 DF,  p-value: 0.00000001126
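
The output above can be read as follows. Each slope is the change in predicted Fertility associated with a one-unit increase in that variable while the other explanatory variable is held fixed. For instance, the Education coefficient of -0.8167 says that, for a given Infant.Mortality, a province with one percentage point more educated draftees is predicted to have a Fertility index about 0.82 points lower. The sketch below checks this reading numerically, assuming the model was fitted to R's built-in `swiss` data as shown in the call; the object names `fit`, `new_provinces` and `pred`, and the values 10, 11 and 20, are chosen here for illustration only.

```r
# Refit the model shown in the output above
fit <- lm(Fertility ~ Education + Infant.Mortality, data = swiss)

# Two hypothetical provinces that differ only in Education,
# by one percentage point; Infant.Mortality is held fixed
new_provinces <- data.frame(Education = c(10, 11),
                            Infant.Mortality = c(20, 20))
pred <- predict(fit, newdata = new_provinces)

# The difference between the two predictions ...
diff(pred)

# ... equals the estimated Education coefficient
coef(fit)[["Education"]]
```

The same check works for Infant.Mortality: holding Education fixed and raising Infant.Mortality by one percentage point changes the prediction by the Infant.Mortality coefficient (about 1.52).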

10.3 Binary variables in linear regression

## 
## Call:
## lm(formula = right_hand_span ~ height + gender, data = a)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5086 -0.9920  0.0785  1.0021  4.2563 
## 
## Coefficients:
##             Estimate Std. Error t value        Pr(>|t|)    
## (Intercept)  4.76530    2.11696   2.251           0.025 *  
## height       0.08472    0.01253   6.759 0.0000000000527 ***
## genderMale   1.32469    0.23267   5.693 0.0000000250411 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.518 on 378 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.4449, Adjusted R-squared:  0.442 
## F-statistic: 151.5 on 2 and 378 DF,  p-value: < 0.00000000000000022
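
One way to read this output is to write the fitted equation out for each group. R has turned `gender` into an indicator (dummy) variable: the row `genderMale` refers to a variable that equals 1 when gender is Male and 0 for the reference level (presumably Female; the output does not name it). The symbols \(\widehat{\text{span}}_i\) and \(\text{Male}_i\) are notation introduced here for illustration. The fitted equation is

\[\widehat{\text{span}}_i = 4.76530 + 0.08472\,\text{height}_i + 1.32469\,\text{Male}_i .\]

For the reference level (\(\text{Male}_i = 0\)) the last term vanishes:

\[\widehat{\text{span}}_i = 4.76530 + 0.08472\,\text{height}_i ,\]

while for males (\(\text{Male}_i = 1\)) it is absorbed into the intercept:

\[\widehat{\text{span}}_i = (4.76530 + 1.32469) + 0.08472\,\text{height}_i = 6.08999 + 0.08472\,\text{height}_i .\]

The two lines are parallel: they share the slope 0.08472 and differ only in the intercept. The coefficient 1.32469 is therefore the predicted difference in right hand span between a male and a person of the reference gender with the same height.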

10.4 Matrix notation