# 2 Introduction

Modeling is the development of mathematical expressions that describe the behavior of a random variable of interest.

This variable is called the **response** (or dependent) variable and denoted with \(Y\).

Other variables which are thought to provide information on the behavior of \(Y\) are incorporated into the model as **predictor** or **explanatory** variables (also called the independent variables). We denote them with \(X\).

Data consist of information taken from \(n\) units. Subscripts \(i = 1, \dots, n\) identify the particular unit from which the observations were taken.

Additional subscripts can be used to identify different predictors, e.g. \(x_{ij}\) for the value of the \(j\)th predictor on unit \(i\).

All models involve unknown constants, called **parameters**, which control the behavior of the model. These parameters are denoted by Greek letters (e.g. \(\beta\)) and are to be estimated from the data.

We denote **estimates** using hat notation, e.g. \(\hat{\beta}\).

In this module we will study **linear models**. Here the parameters enter the model as simple coefficients on the \(X\)s or functions of \(X\)s.
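To make "linear in the parameters" concrete, here is a small illustration. The error term \(\epsilon\) and the particular functional forms are assumptions chosen for illustration only. Both of the following are linear models, because each \(\beta\) multiplies a known function of \(x\):

\[y = \beta_{0} + \beta_{1}x + \epsilon, \qquad y = \beta_{0} + \beta_{1}x + \beta_{2}x^{2} + \epsilon.\]

By contrast, a model such as

\[y = \beta_{0}e^{\beta_{1}x} + \epsilon\]

is not linear in this sense, since \(\beta_{1}\) does not enter as a simple coefficient.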

## 2.1 Introductory Examples

A first look at how \(Y\) changes as \(X\) is varied is seen in a scatterplot.

### 2.1.1 Mother and daughter heights

Data from Pearson and Lee (1903).

- \(\bar{y}=\) 63.75, sd(\(y\)) = 2.6

- Taller mothers have taller daughters.
- Since most points fall above the line \(y=x\), most daughters are taller than their mothers.

- Do the data follow a linear pattern? If so, we can use the linear regression line to summarise the data.
- We can use the regression line to predict a daughter’s height based on her mother’s height.
- \(\hat{y} = 29.92 + 0.54x\)
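As a rough sketch of how the fitted line is used for prediction, the function below simply evaluates the rounded equation quoted above. The function name `predict_daughter` and the mother's height of 64 are hypothetical, and heights are assumed to be on the same scale as the data (inches).

```r
# Evaluate the fitted line y-hat = 29.92 + 0.54 x at a given mother's height.
# The coefficients are the rounded values quoted in the text.
predict_daughter <- function(mother_height) {
  29.92 + 0.54 * mother_height
}

# Hypothetical example: a mother who is 64 units tall
predict_daughter(64)
```

Because the slope is below 1, daughters of tall mothers are predicted to be closer to the average height than their mothers are.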

### 2.1.2 Bacterial count and storage temperature

- Points are jittered to avoid overprinting.
- It does not appear to be a linear relationship.
- Consider a transformation?

- Log transformed bacteria counts appear to have a linear relationship with temperature.
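The effect of a log transformation can be sketched with synthetic numbers. The values below are NOT the bacterial count data; they are generated from an assumed exponential curve purely to show why taking logs can turn a curved relationship into a straight line.

```r
# Synthetic illustration only: these are not the bacterial count data.
temp  <- seq(0, 20, by = 5)        # hypothetical storage temperatures
count <- 100 * exp(0.3 * temp)     # hypothetical counts on an exponential curve

# On the raw scale a straight line fits poorly ...
fit_raw <- lm(count ~ temp)
# ... but on the log scale the relationship is exactly linear:
# log(count) = log(100) + 0.3 * temp
fit_log <- lm(log(count) ~ temp)

max(abs(resid(fit_raw)))   # large residuals on the raw scale
coef(fit_log)              # intercept near log(100), slope near 0.3
```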

### 2.1.3 Yield and Rainfall

The dataset is from Ramsey and Schafer (2002). The data on corn yields and rainfall are in `ex0915` in `library(Sleuth3)`. Variables:

- Yield: corn yield (bushels/acre)
- Rainfall: rainfall (inches/year)
- Year: year.
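A minimal sketch for loading and plotting these data, assuming the `Sleuth3` package is installed. Passing the data frame through the `data` argument, rather than calling `attach()`, avoids name-masking messages. The object name `corn` is an arbitrary choice.

```r
# Load the corn yield data if the Sleuth3 package is available
if (requireNamespace("Sleuth3", quietly = TRUE)) {
  corn <- Sleuth3::ex0915

  # Scatterplot of yield against rainfall
  plot(Yield ~ Rainfall, data = corn,
       xlab = "Rainfall (inches/year)",
       ylab = "Corn yield (bushels/acre)")
}
```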



### 2.1.4 Driving

Example from Weisberg (2005). The aim is to study how fuel consumption varies across the 50 US states and the District of Columbia, and the effect of the state gasoline tax on consumption.

Variables:

- FuelC: Gasoline sold for road use, thousands of gallons
- Drivers: Number of licensed drivers in the state
- Income: Per person personal income for the year 2000, in thousands of dollars
- Miles: Miles of Federal-aid highway in the state
- Pop: 2001 population age 16 and over
- Tax: Gasoline state tax rate, cents per gallon
- State: State name

We will use a scatterplot matrix.

- Both Drivers and FuelC are state totals so will be larger in more populous states.
- Income is per person, so we rescale the state totals to make the variables comparable.

Transform variables:

- FuelC2: FuelC/Pop
- Drivers2: Drivers/Pop
- Miles2: \(\log_2\)(Miles)
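The transformations above could be computed along the following lines. This is a sketch: the helper name `add_rates` and the two-row data frame `toy` are hypothetical, and the toy values are made up to show the arithmetic rather than taken from the real data.

```r
# Add per-person and log-scale versions of the state totals.
# Assumes `d` has columns FuelC, Drivers, Pop and Miles as described above.
add_rates <- function(d) {
  d$FuelC2   <- d$FuelC / d$Pop      # fuel sold per person aged 16 and over
  d$Drivers2 <- d$Drivers / d$Pop    # licensed drivers per person aged 16 and over
  d$Miles2   <- log2(d$Miles)        # highway miles on a log base 2 scale
  d
}

# Made-up values, for illustration only
toy <- data.frame(FuelC   = c(2000, 500),
                  Drivers = c(3000, 800),
                  Pop     = c(4000, 1000),
                  Miles   = c(1024, 256))
toy2 <- add_rates(toy)
toy2
```

On the log base 2 scale, a one-unit increase in Miles2 corresponds to a doubling of highway miles.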

- FuelC decreases as tax increases but there is a lot of variation.
- Fuel is weakly related to a number of other variables.

Other graphical representations of the dataset:

```
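library(MASS)  # provides parcoord()
# Parallel coordinates plot of the transformed data, with columns reordered
parcoord(driving2[, c(2, 6, 1, 3, 4, 5)])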
```

### 2.1.5 Fuel Consumption

Information was recorded on fuel usage and average temperature (\(^\circ\)F) over the course of one week for eight office complexes of similar size. Data from Bowerman and Schafer (1990).

We expect fuel use to be driven by weather conditions.

**Fuel use:** response or dependent variable. Denoted by \(Y\).

**Temperature:** Explanatory or predictor variable. Denoted by \(X\).

We observe \(n = 8\) pairs: \((x_{i}, y_{i})\), \(i = 1, \dots, 8\).

Temp (°F) | Fuel |
---|---|
28.0 | 12.4 |
28.0 | 11.7 |
32.5 | 12.4 |
39.0 | 10.8 |
45.9 | 9.4 |
57.8 | 9.5 |
58.1 | 8.0 |
62.5 | 7.5 |

The scatterplot shows that fuel use decreases roughly linearly as temperature increases.
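A scatterplot of this kind can be drawn as in the sketch below, which transcribes the eight observations from the table. The object name `fuel_data` is an arbitrary choice, and the sample correlation is included only as a rough numerical check of the linear trend.

```r
# Fuel consumption data, transcribed from the table above
fuel_data <- data.frame(
  Temp = c(28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5),
  Fuel = c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)
)

# Scatterplot of fuel use against average temperature
plot(Fuel ~ Temp, data = fuel_data,
     xlab = "Average temperature (F)", ylab = "Fuel use")

# Sample correlation: a value close to -1 indicates a strong decreasing linear trend
r <- cor(fuel_data$Temp, fuel_data$Fuel)
r
```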

We assume there’s an underlying true line: \[\mbox{Fuel} =\beta_{0} + \beta_{1}\mbox{Temp} + \epsilon\]

or, more generally: \(y =\beta_{0} + \beta_{1}x + \epsilon.\)

The intercept (\(\beta_0\)) and slope (\(\beta_1\)) are unknown parameters and \(\epsilon\) is the random error component.

For each observation we have: \(y_i =\beta_{0} + \beta_{1}x_i + \epsilon_i\).

We can estimate \(\beta_0\) and \(\beta_1\) from the available data.

One method that can be used to do this is the method of ordinary least squares.
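As a sketch of what least squares does here: for simple linear regression the estimates have the standard closed form \(\hat{\beta}_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2\) and \(\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}\). The code below computes these directly for the eight observations and compares them with `lm()`. The variable names are arbitrary choices.

```r
# The eight (temperature, fuel) pairs from the table
temp <- c(28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5)
fuel <- c(12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5)

# Closed-form least squares estimates for simple linear regression
sxy <- sum((temp - mean(temp)) * (fuel - mean(fuel)))
sxx <- sum((temp - mean(temp))^2)
b1  <- sxy / sxx                      # estimated slope
b0  <- mean(fuel) - b1 * mean(temp)   # estimated intercept

# The same fit via lm(), which uses ordinary least squares
fit <- lm(fuel ~ temp)

c(b0 = b0, b1 = b1)
coef(fit)
```

Both approaches give an intercept of roughly 15.84 and a slope of roughly -0.128, so estimated fuel use falls by about 0.13 units for each one-degree rise in average temperature.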

NOTE: other models are possible.

### References

Pearson, K, and A Lee. 1903. “On the Laws of Inheritance in Man: I. Inheritance of Physical Characters.” *Biometrika* 2: 357–462.

Ramsey, Fred, and Daniel Schafer. 2002. *The Statistical Sleuth: A Course in Methods of Data Analysis*. 2nd ed. Duxbury Press.

Weisberg, Sanford. 2005. *Applied Linear Regression*. Wiley-Blackwell.

Bowerman, Bruce L., and Daniel Schafer. 1990. *Linear Statistical Models*. 2nd ed. Thomson Wadsworth.