5 Day 5

Announcements:

  • Stat Help Lab:

    • Monday - Friday \(\approx\) 9 AM - 6 PM
  • Homework point recovery

    • I’m writing this August 31

    • Hopefully past me has graded them

      • He did
    • Question 5 - Qualitative vs Quantitative

  • Supplemental materials

  • Extracurricular materials

Review

  • Which way is this histogram skewed?

  • Numerical summaries help us understand data sets faster

  • Notation is how we communicate statistical/mathematical equations:

    • Sample size (# of individuals in a sample)

      • Denoted with \(n\) (lower-case)
    • Population size

      • Denoted with \(N\) (capital)
    • Data values can be denoted as \(x_1,x_2,x_3,...\)

    • \(x_1\) refers to the observed value of the variable \(x\) from individual 1

    • Summation:

      • This is referring to the sum (addition) of everything contained in the expression

      • We denote this with the Greek letter \(\Sigma\)

      \[\sum\limits_{i=1}^nx_i=x_1+x_2+...+x_n\]

  • Mean

    • Balance point (fulcrum) of the dataset

    • Population mean (denoted \(\mu\)):

    \[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\] \[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]

\[N \approx 10000\]

\[\mu = {1 \over 10000}(x_1 + x_2 +x_3 + ...+x_{10000})\]

\[\mu = 1258.771\]

  • Is \(\mu\) a statistic?
1149 1577 1138 1319 1399 1074 1091 1324 1048 1462

\[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

\[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]













  • Properties of mean:

  • Common

  • Easy to interpret

  • Susceptible to outliers

  • A statistic is resistant if its value is not affected heavily by outliers

  • Median

    • Half of the data points are above the median, half are below
1149 1577 1138 1319 1399 1074 1091 1324 1048 1462
  • Sort your data in increasing order (low to high)
1048 1074 1091 1138 1149 1319 1324 1399 1462 1577
  • If \(n\) is odd: choose position \({(n+1)\over2}\) in the ordered dataset

  • If \(n\) is even after ordering:

    • Pick \(n\over 2\) and \({n \over 2}+1\)

    • Average the two data points

\[{10\over2}=5 \ , \ {10 \over2}+1=6\] \[{\ \ \ + \ \ \ \over2}= \ \ \]

  • Properties of median:

    • It doesn’t use all of the data directly

    • This makes it resistant

      • Outliers have little/no effect
  • Difference between median and mean depend on skew of the histogram

  • Mode

    • Where the peak is

    • What could be the mode here?

  • The most frequent observation

  • Useful for qualitative data

    • “Which credit card is most commonly used by our customers?”
  • Not as useful for quantitative data

    • “What’s the most common weight of cattle on our research farms?”
  • A data set can have any number of modes (0,1,2,…)

V1 8 10 7 11 14 12 11 9 14 11 8 2 8 10 7 12 11 12 12 10
V2 5 13 15 9 10 13 16 11 13 11 7 15 10 16 11 18 13 10 9 9
  • Whatever method works best for you is the one you use

  • Personally, I prefer to make a frequency table:

Value Freq
2 1
5 1
7 3
8 3
9 4
10 6
11 7
12 4
13 4
14 2
15 2
16 2
18 1

Measures of Spread

  • Statistics is roughly about making inference from data

  • Is it fair to make inference from mean/median/mode alone?

    • Is the mean resistant?

    • Is the median representative of all the data?

    • Does the mode tell us about outliers?

  • When we look at data there’s a certain spread to it:

Months Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San Francisco 51 54 55 56 58 60 60 61 63 62 58 52
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35

\[\mu_{sf} = {{51+54+55+56+58+60+60+61+63+62+58+52}\over 12} = 57.5\]

\[\mu_{sl} = {{30+35+44+57+66+75+79+78+70+59+45+35}\over 12} = 56.1\] - The means are similar

  • Are the temperatures in each city similar?

    • Common sense: Who’s been to San Francisco?
  • Note: The program I use to make this is bad at dotplots

    • There’s a reason for that

  • Are they different?

    • In what way?
  • Spread is an important metric for understanding the differences or variation in data

Range

  • Simplest measure of spread

  • Difference between the largest and smallest data value

\[Range = Max - Min\]

Months Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
San Francisco 51 54 55 56 58 60 60 61 63 62 58 52
St. Louis 30 35 44 57 66 75 79 78 70 59 45 35
  • Calculate the range of both San Francisco and St. Louis in this data set





  • What do our results mean?

  • With every measure/metric there’s good and bad

  • Range does let us look at spread

    • It uses two values so it’s not a perfect measure

  • These two data sets could have the same range and mean

    • Are they the same data set if that’s true?

  • Smaller spread is usually closer to the mean

  • Larger spread is usually further from the mean

Variance

  • Measure of how far, on average, values in a data set are from the mean

    • Population mean (denoted \(\mu\)):

    \[\mu = {1 \over N}\sum\limits_{i=1}^Nx_i\]

    \[\mu = {\sum\limits_{i=1}^Nx_i\over N}\]

    • Sample mean (denoted \(\bar{x}\))

    \[\bar{x} = {1 \over n}\sum\limits_{i=1}^nx_i\]

    \[\bar{x} = {\sum\limits_{i=1}^nx_i\over n}\]

  • Stay with me

  • Let \(x_1,x_2,...,x_N\) be values in a population of \(N\) size

  • The difference between \(i^{th}\) population value and the mean is:

\[x_i - \mu\]

  • We want to take these differences from \(1\) to \(i\) and divide them by \(N\)

  • There’s one problem though:

  • Positive and negative difference can cancel out

  • We can fix that by squaring the value of each difference:

\[(x_i - \mu)^2\]

  • Given this:

    • Variance should never be negative

    • Zero or positive

  • Larger variance means more variability





  • Population variance (denoted \(\sigma^2\)):

\[\sigma^2 = {{{\sum\limits_{i=1}^N}(x_i-\mu)^2}\over N}\]

  • Sample variance (denoted \(s^2\)):

\[s^2 = {{{\sum\limits_{i=1}^n}(x_i-\bar{x})^2}\over (n-1)}\]

  • Remember that statistics is roughly:

    • Making inference about a population parameter using sample statistics
  • In practice we almost never have a population variance

    • We use a sample variance to estimate population variance

Practice: Calculate Sample Variance

A company that manufactures batteries is testing a new type of battery designed for laptop computers. They measure the lifetimes, in hours, of six batteries, and the results are 3, 4, 6, 5, 4, and 2. Find the sample variance, including its unit, of the lifetimes











Standard Deviation

  • Variance is a squared unit of the data

    • It’s annoying to think in squares

  • The value of variance we calculated in the previous example is “\(Hours^2\)

  • This is an easy fix though:

\[\sqrt{Hours^2}=Hours\] - We call this standard deviation

  • It lets us work in the original units of the data

  • The notation is easy to understand as well

\[\sqrt{\sigma^2}=\sigma \rightarrow Population \ Standard \ Deviation\]

\[\sqrt{s^2}=s \rightarrow Sample \ Standard \ Deviation\]

  • In our previous practice example we calculated a variance in \(Hours^2\)

    • What is the standard deviation in hours?
Battery A B C D E F
Lifespan 3 4 6 5 4 2









Practice: Mean, Variance, and Standard Deviation

17 40 24 18 16
  • Compute:

    • Sample Mean

    • Sample Variance

    • Sample Standard Deviation









1149 1577 1138 1319 1399 1074 1091 1324 1048 1462
  • Compute:

    • Sample Range

    • Sample Standard Deviation

  • The population standard deviation of this data is \(188\)

    • What do you make of the sample standard deviation you computed?













Empirical Rule

  • Many data sets has a single peak in the center and an approximately symmetric shape

    • We call this a bell-shape

  • When we see this distribution we can use standard deviation to describe how much of the data is within a certain range of the mean

  • The Empirical Rule

  • For a population that has an approximately bell-shaped distribution:

  • \(\approx 68\%\) of the data is within ONE standard deviation of the mean

\[\approx 68\% = \begin{cases} \mu - \sigma \\ \mu + \sigma \end{cases}\]

  • \(\approx 95\%\)$ of the data is within TWO standard deviations of the mean

\[\approx 95\% = \begin{cases} \mu - 2\sigma \\ \mu + 2\sigma \end{cases}\]

  • \(\approx\) All or almost all of the data is within THREE standard deviations of the mean

\[\approx 100\% = \begin{cases} \mu - 3\sigma \\ \mu + 3\sigma \end{cases}\]

Practice: Empirical Rule

A large class of 200 students took an exam. The scores had sample mean \(\bar{x} = 65\) and sample standard deviation \(s = 10\). The histogram is approximately bell-shaped.

  1. Find an interval that is likely to contain approximately 68% of the scores.

  2. Approximately what percentage of the scores were between 45 and 85?

  3. Approximately how many students had scores between 45 and 85?













  • Go away

Extracurricular

I promised I’d explain the Kansas SAT test scores histogram:

set.seed(80)
empir_rule <- data.frame(x = rnorm(200,65,10))

ggplot(data=empir_rule,aes(x=x)) + geom_histogram(stat="bin",binwidth=6,color="black",fill="violet") + 
  labs(x="Student Exam Scores",y="Frequency",title="Empirical Rule Example")

  • Mean and standard deviation/variance of a sample/population can be used to simulate the data

    • Ideally you know the shape of the data too

    • Since I know this is approximately bell shaped:

      • I can make fake data using the sample statistics