5  Statistical Models

When we use theory and generate hypotheses, we are attempting to create a model of a real-world phenomenon. For example, consider the following theory proposed by Edwin Shneidman: suicide is caused by psychache (unbearable mental pain). Imagine this was truly how suicide worked in the real world. Consider the following three possible models:

Each researcher could collect data to test how well the model fits the real-world phenomenon. If the model fits the data well, it can accurately predict suicide. For example, statistical analyses might indicate that increased psychache is an accurate predictor of suicide (more to come on regression and causal experiments in later chapters!). If it does not fit the data well, it likely does not accurately represent the real-world phenomenon of interest. For example, if researcher 3 collected genetic data and the presence of the hypothetical gene did not lead to suicide in some individuals, it would indicate a poor model fit.

5.1 A basic model

Let’s try to model the mean height of psychology professors (in centimeters). You can’t measure all the psychology professors in the world, so you go to the Arts and Sciences Building and ask four of your psychology professors. You get the following data.

Name Height
Tyler 181
Steve 190
Jenny 173
Cindy 158

The average height of these professors is 175.5cm. This mean is a model. The model can be represented as:

\(x_i = \overline{x} + e_i\)

Here, \(x_i\) represents the height of professor \(i\), \(\overline{x}\) represents the sample mean height of the professors, and \(e_i\) represents the difference between that professor's height and the mean, also called the error.

We can assess how well the model fits with the data we collected. For our model, it would make sense to try to calculate how large our \(e_i\)s are, as these represent the model error.
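As a minimal sketch of the mean model, the Python snippet below computes \(\overline{x}\) and each \(e_i\) for the four professors in the table. The variable names (`heights`, `mean_height`, `errors`) are illustrative choices, not something defined in this chapter.

```python
# Heights (cm) of the four professors from the table above.
heights = {"Tyler": 181, "Steve": 190, "Jenny": 173, "Cindy": 158}

# The model: every professor is predicted to be as tall as the sample mean.
mean_height = sum(heights.values()) / len(heights)

# Each error e_i is the observed height minus the model's prediction.
errors = {name: height - mean_height for name, height in heights.items()}

print("Model (mean height):", mean_height)  # 175.5
for name, error in errors.items():
    print(name, error)
```

A positive error means the professor is taller than the model predicts, and a negative error means shorter.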

5.2 Deviations

One method to assess how well our model, the mean, fits the data is to compare how different our data are from the model. You now know that these differences are the model errors. We can subtract the mean from each value to create a numerical representation of this fit. For example, Tyler is 181cm tall. Our model suggests that the average height is 175.5cm. We can calculate the deviation here as:

\(e_i = (x_i - \overline{x}) = (181 - 175.5) = 5.5\)

If we calculate the deviation for every professor in our data, we get:

Name Deviation
Tyler 5.5
Steve 14.5
Jenny -2.5
Cindy -17.5

So, \(\sum{e_i}=5.5 + 14.5 + (-2.5) + (-17.5) = 0\). What?? That can’t be right. Well, yes, it is. The sum of errors around a mean is zero.

\(\sum_{i=1}^n{e_i}=0\)
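To see why this happens for any data set, and not just for these four heights, the sum can be expanded using the definition of the mean, \(\overline{x} = \frac{1}{n}\sum_{i=1}^n{x_i}\), which implies \(\sum_{i=1}^n{x_i} = n\overline{x}\):

\(\sum_{i=1}^n{e_i} = \sum_{i=1}^n{(x_i - \overline{x})} = \sum_{i=1}^n{x_i} - n\overline{x} = n\overline{x} - n\overline{x} = 0\)

For our data, \(\sum{x_i} = 181 + 190 + 173 + 158 = 702\) and \(n\overline{x} = 4 \times 175.5 = 702\), so the positive and negative deviations cancel exactly.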

There is a way to bypass this statistical conundrum.

5.3 Variance and Standard Deviation

We may effectively model the fit of our mean model with the variance and standard deviation. These are extremely important in statistics so it’s imperative to become familiar with them.

Above we calculated the deviation of each score. The variance is, in essence, the average squared difference between a score and its mean.

\(\sigma^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N} }\)

But for a sample, our equation is (see last chapter for the rationale):

\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }\)

This equation simply means we add up all the squared differences between each score and the mean and divide by the number of scores minus one. So, the squared deviations are:

Name Squared
Tyler 30.25
Steve 210.25
Jenny 6.25
Cindy 306.25

We then add up the squared deviations, \(30.25+210.25+6.25+306.25=553\), and divide by the number of scores with the sample adjustment to \(N-1\), \(4-1=3\), to get:

\(s^2 = {\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} } = \frac{30.25+210.25+6.25+306.25}{4-1} = \frac{553}{3}=184.33\)

Thus, the variance of the heights of psychology professors is \(184.33\). The standard deviation is simply the square root of the variance:

\(s = \sqrt{{\frac{\sum\limits_{i = 1}^N {\left( {x_i - \bar x} \right)^2 }}{N-1} }}\)

While you might think that the standard deviation (SD) is the average absolute difference between a score and the mean, it is not. For example, the SD of our heights is 13.5769412. But the average absolute deviation is \(\frac{|5.5| + |14.5| + |-2.5| + |-17.5|}{4} = \frac{40}{4} = 10\). It is most likely helpful to think of the variance as the average squared deviation and the SD as the square root of the variance.
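The calculations in this section can be checked with a short Python sketch. It computes the sample variance and SD by hand from the formulas above, compares them with Python's standard `statistics` module, and then computes the average absolute deviation for contrast. The variable names are illustrative only.

```python
import math
import statistics

heights = [181, 190, 173, 158]  # professor heights in cm
n = len(heights)
mean_height = sum(heights) / n  # 175.5

# Deviation of each score from the mean.
deviations = [h - mean_height for h in heights]

# Sample variance: sum of squared deviations divided by N - 1.
sample_variance = sum(d ** 2 for d in deviations) / (n - 1)

# Sample standard deviation: square root of the sample variance.
sample_sd = math.sqrt(sample_variance)

# Average absolute deviation: mean of the absolute deviations (divides by N).
mean_abs_deviation = sum(abs(d) for d in deviations) / n

print(round(sample_variance, 2))  # 184.33
print(round(sample_sd, 4))        # 13.5769
print(mean_abs_deviation)         # 10.0

# The standard library uses the same N - 1 definitions.
print(math.isclose(sample_variance, statistics.variance(heights)))  # True
print(math.isclose(sample_sd, statistics.stdev(heights)))           # True
```

The SD and the average absolute deviation both summarize spread in centimeters, but squaring gives large deviations more weight, so the two values differ.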

5.4 Advanced Models

While above we have simply modeled a mean, later chapters will build up to more advanced models, such as:

\(y_i=\beta_0+x_{1i}\beta_1+x_{2i}\beta_2+x_{3i}\beta_3+e_i\)

Don’t be intimidated: this is a whole lot like the classic \(y=mx+b\) from high school, with an intercept and some slopes. More to come.
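One way to connect this equation to the mean model from section 5.1, offered as a sketch ahead of the fuller treatment in later chapters: if every slope (\(\beta_1\), \(\beta_2\), \(\beta_3\)) is set to zero, only the intercept and the error remain.

\(y_i=\beta_0+e_i\)

This has the same form as \(x_i = \overline{x} + e_i\), with the mean playing the role of the intercept \(\beta_0\). In that sense, the mean is the simplest version of this model: one with no predictors at all.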

  1. Calculate the mean, variance, and standard deviation for both the height (in cms) and weight (in kgs) of these NHL players.
Player Height Weight
Connor McDavid 185.4 99.0
Auston Matthews 190.5 93.0
Sidney Crosby 180.0 91.0
Alex Ovechkin 191.0 107.9
  2. Write out the model for NHL height.

  3. What are the \(e_i\) values for each player when modeling their height?

Mean_Height 186.725000
SD_Height 5.148058
var_Height 26.502500
Mean_Weight 97.725000
SD_Weight 7.587435
var_Weight 57.569167
  2. Write out the model for NHL height.

\(height_i=\overline{x}_{height}+e_i\)

  3. What are the \(e_i\) values for each player when modeling their height?
Player Height e_i
Connor McDavid 185.4 -1.325
Auston Matthews 190.5 3.775
Sidney Crosby 180.0 -6.725
Alex Ovechkin 191.0 4.275
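
If you want to check your hand calculations, the answers above can be reproduced with a short Python sketch using the standard `statistics` module, whose `variance` and `stdev` functions use the sample (\(N-1\)) formulas. The helper `summarize` and the other names are hypothetical, chosen only for this illustration.

```python
import statistics

# Height (cm) and weight (kg) for each player, from the exercise table.
players = {
    "Connor McDavid": (185.4, 99.0),
    "Auston Matthews": (190.5, 93.0),
    "Sidney Crosby": (180.0, 91.0),
    "Alex Ovechkin": (191.0, 107.9),
}

heights = [height for height, _ in players.values()]
weights = [weight for _, weight in players.values()]


def summarize(values):
    """Return the sample mean, sample variance (N - 1), and sample SD."""
    return (
        statistics.mean(values),
        statistics.variance(values),
        statistics.stdev(values),
    )


mean_h, var_h, sd_h = summarize(heights)
mean_w, var_w, sd_w = summarize(weights)

print("Height:", round(mean_h, 3), round(var_h, 4), round(sd_h, 6))
print("Weight:", round(mean_w, 3), round(var_w, 6), round(sd_w, 6))

# Errors for the height model: height_i = mean height + e_i.
height_errors = {name: height - mean_h for name, (height, _) in players.items()}
for name, error in height_errors.items():
    print(name, round(error, 3))
```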