5 Statistical Models

When we use theory and generate hypotheses, we are attempting to create a model of a real-world phenomenon. For example, consider the following theory proposed by Edwin Shneidman: suicide is caused by psychache (unbearable mental pain). Imagine this was truly how suicide worked in the real world. Consider the following three possible models:

Researcher 1: as psychache increases so does suicide risk
Researcher 2: as social connectedness decreases, suicide risk increases
Researcher 3: presence of gene XX4r2 causes suicide

Each researcher could collect data to test how well the model fits the real-world phenomenon. If the model fits the data well, it can accurately predict suicide. For example, statistical analyses might indicate that increased psychache is an accurate predictor of suicide (more to come on regression and causal experiments in later chapters!). If the model does not fit the data well, it likely does not accurately represent the real-world phenomenon of interest. For example, if Researcher 3 collected genetic data and the presence of the hypothetical gene did not lead to suicide in some individuals, it would indicate a poor model fit.

5.1 A basic model

Let’s try to model the mean height of psychology professors (in centimeters). You can’t measure all the psych professors in the world, so you go to the Arts and Sciences Building and ask four of your psychology professors. You get the following data.

Name	Height
Tyler	181
Steve	190
Jenny	173
Cindy	158

The average height of these professors is 175.5cm. This mean is a model. The model can be represented as:

$x_{i} = \bar{x} + e_{i}$

Here, $x_{i}$ represents the height of professor $i$, $\bar{x}$ represents the sample mean height of the professors, and $e_{i}$ represents the difference between that professor’s height and the mean, or the error.
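As a rough sketch of this model in code (Python is used here only for illustration; the names `heights`, `x_bar`, and `errors` are our own choices, not part of the chapter), we can compute the mean and confirm that each height equals the mean plus its error:

```python
from statistics import mean

# Heights (cm) of the four professors from the table above.
heights = {"Tyler": 181, "Steve": 190, "Jenny": 173, "Cindy": 158}

# The model: one number, the sample mean, is the prediction for every professor.
x_bar = mean(list(heights.values()))

# The error for each professor is the observed height minus the model's prediction.
errors = {name: x - x_bar for name, x in heights.items()}

for name, x in heights.items():
    # Each observation is recovered as mean + error.
    print(f"{name}: {x} = {x_bar} + ({errors[name]})")
```

The model itself is just `x_bar`; everything the mean cannot account for ends up in `errors`.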

We can assess how well the model fits the data we collected. For our model, it would make sense to calculate how large the $e_{i}$ values are, as these represent the model error.

5.2 Deviations

One method to assess the quality of the fit of the model, our mean, to the data is to compare how different our data are from the model. You now know that these differences are the model errors. We can subtract the mean from each value to create a numerical representation of this fit. For example, Tyler is 181 cm tall. Our model suggests that the average height is 175.5 cm. We can calculate the deviation here as:

$e_{i} = (x_{i} - \bar{x}) = (181 - 175.5) = 5.5$

If we calculate the deviation for each professor, we get:

Name	Deviation
Tyler	5.5
Steve	14.5
Jenny	-2.5
Cindy	-17.5

So, $\sum e_{i} = 5.5 + 14.5 + (- 2.5) + (- 17.5) = 0$ . What?? That can’t be right. Well, yes, it is. The sum of errors around a mean is zero.

$\sum_{i = 1}^{n} e_{i} = 0$
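A short derivation suggests why this is not a fluke of our four professors. It uses only the definition of the mean, $\bar{x} = \frac{1}{n} \sum_{i = 1}^{n} x_{i}$, which implies $\sum_{i = 1}^{n} x_{i} = n \bar{x}$:

$\sum_{i = 1}^{n} e_{i} = \sum_{i = 1}^{n} (x_{i} - \bar{x}) = \sum_{i = 1}^{n} x_{i} - n \bar{x} = n \bar{x} - n \bar{x} = 0$

The positive and negative deviations cancel exactly, so the raw sum of errors cannot tell us how well the mean fits.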

There is a way to bypass this statistical conundrum.

5.3 Variance and Standard Deviation

We may effectively model the fit of our mean model with the variance and standard deviation. These are extremely important in statistics so it’s imperative to become familiar with them.

Above we calculated the deviation of each score. The variance is, in essence, the average squared difference between a score and the mean. For a population of $N$ scores with mean $\mu$:

$\sigma^{2} = \frac{\sum_{i = 1}^{N} {(x_{i} - \mu)}^{2}}{N}$

But for a sample, our equation is (see last chapter for the rationale):

$s^{2} = \frac{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2}}{N - 1}$

This equation simply means we add up all the squared differences between a score and the mean and divide by one less than the number of scores. So, the squared deviations are:

Name	Squared Deviation
Tyler	30.25
Steve	210.25
Jenny	6.25
Cindy	306.25

We then add up the squared deviations, $30.25 + 210.25 + 6.25 + 306.25 = 553$ . And divide by the number of scores (with the sample adjustment to $N - 1$ ), $4 - 1 = 3$ , to get:

$s^{2} = \frac{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2}}{N - 1} = \frac{30.25 + 210.25 + 6.25 + 306.25}{4 - 1} = \frac{553}{3} = 184.33$

Thus, the sample variance of the heights of our psychology professors is $184.33$ (in squared centimeters). The standard deviation is simply the square root of the variance:

$s = \sqrt{\frac{\sum_{i = 1}^{N} {(x_{i} - \bar{x})}^{2}}{N - 1}}$

While you might think that the standard deviation (SD) is the average absolute difference between a score and the mean, it is not. For example, the SD of our heights is approximately 13.58. But the average absolute deviation is $\frac{| 5.5 | + | 14.5 | + | - 2.5 | + | - 17.5 |}{4} = \frac{40}{4} = 10$ . It is most likely helpful to think of the variance as the average squared deviation and the SD as the root of the variance.
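The hand calculation can be checked with a few lines of code. This is a minimal sketch in Python, assuming the four heights from our table; the variable names are illustrative only.

```python
from statistics import mean, stdev, variance

heights = [181, 190, 173, 158]  # professor heights in cm
n = len(heights)
x_bar = mean(heights)

# Squared deviations from the mean (30.25, 210.25, 6.25, 306.25).
squared = [(x - x_bar) ** 2 for x in heights]
sum_sq = sum(squared)

# Sample variance divides the sum of squares by N - 1; the SD is its square root.
s2 = sum_sq / (n - 1)
s = s2 ** 0.5

# Average absolute deviation, for comparison with the SD.
avg_abs_dev = sum(abs(x - x_bar) for x in heights) / n

print(f"sum of squares = {sum_sq}")
print(f"variance = {s2:.2f}, SD = {s:.2f}")
print(f"average absolute deviation = {avg_abs_dev:.2f}")

# Cross-check against the standard library, which also divides by N - 1.
print(f"library: variance = {variance(heights):.2f}, SD = {stdev(heights):.2f}")
```

Python's `statistics.variance` and `statistics.stdev` are the sample versions, so they should agree with the hand calculation; `statistics.pvariance` would divide by $N$ instead.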

5.4 Advanced Models

While above we have simply modeled a mean, later chapters will build up to more advanced models, such as:

$y_{i} = \beta_{0} + x_{1 i} \beta_{1} + x_{2 i} \beta_{2} + x_{3 i} \beta_{3} + e_{i}$

Don’t be intimidated: this is a whole lot like the classic $y = m x + b$ from high school, with an intercept and some slopes. More to come.
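To make the parallel with the mean model concrete, here is a small sketch in Python. Every number in it is hypothetical, invented only to show how the pieces of the equation combine; nothing here is estimated from data, and the function name `predict` is our own.

```python
# All values below are hypothetical, chosen only to illustrate the formula.
beta_0 = 2.0               # intercept
betas = [0.5, -1.0, 3.0]   # slopes beta_1, beta_2, beta_3


def predict(x):
    """Model prediction for one case with predictors x = [x1, x2, x3]."""
    return beta_0 + sum(b * x_j for b, x_j in zip(betas, x))


x_i = [4.0, 1.0, 2.0]  # predictor values for one hypothetical case
y_i = 10.0             # observed outcome for that case

y_hat = predict(x_i)   # what the model expects
e_i = y_i - y_hat      # error: observed minus predicted

print(f"predicted = {y_hat}, error = {e_i}")
```

Just as with the mean, the error is whatever remains after subtracting the model's prediction from the observed value.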

Practice Problems

Calculate the mean, variance, and standard deviation for both the height (in cm) and weight (in kg) of these NHL players.

Player	Height	Weight
Connor McDavid	185.4	99.0
Auston Matthews	190.5	93.0
Sidney Crosby	180.0	91.0
Alex Ovechkin	191.0	107.9

Write out the model for NHL height.
What are the $e_{i}$ values for each player when modeling their height?

Answers

Measure	Mean	Variance	SD
Height (cm)	186.725000	26.502500	5.148058
Weight (kg)	97.725000	57.569167	7.587435

Write out the model for NHL height.

$\text{height}_{i} = \bar{x}_{\text{height}} + e_{i}$

What are the $e_{i}$ values for each player when modeling their height?

Player	Height	e_i
Connor McDavid	185.4	-1.325
Auston Matthews	190.5	3.775
Sidney Crosby	180.0	-6.725
Alex Ovechkin	191.0	4.275
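If you want to check your own work, the answers can be reproduced with a short script. This is one possible sketch in Python using the values from the practice table; the helper `summarize` is our own name, not a library function.

```python
from statistics import mean, stdev, variance

# (height in cm, weight in kg) for each player, copied from the practice table.
players = {
    "Connor McDavid": (185.4, 99.0),
    "Auston Matthews": (190.5, 93.0),
    "Sidney Crosby": (180.0, 91.0),
    "Alex Ovechkin": (191.0, 107.9),
}

heights = [h for h, _ in players.values()]
weights = [w for _, w in players.values()]


def summarize(values):
    """Return (mean, sample variance, sample SD) for a list of numbers."""
    return mean(values), variance(values), stdev(values)


height_stats = summarize(heights)
weight_stats = summarize(weights)
print("height:", [round(v, 4) for v in height_stats])
print("weight:", [round(v, 4) for v in weight_stats])

# Errors under the mean model: each player's height minus the mean height.
e = {name: h - height_stats[0] for name, (h, _) in players.items()}
for name, err in e.items():
    print(f"{name}: {err:+.3f}")
```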