5 Day 4

Announcements:

  • Supplemental Homework Review

    • One of the other instructors made this

    • Thanks Eli!

  • No office hours today

    • Thursday 9 AM - 10 AM

    • After 2:45 PM please do not talk to me

  • Stat Help Lab

    • Should open next week
  • Homework point recovery will open after this class

    • No due date but please respect my time

      • Don’t give me 10 homeworks on November 1st
    • I’m not going to grade them yet

      • Don’t stress email me, I won’t reply
  • Extracurricular questions and supplemental material

Review

  • When we have one quantitative variable we have several options:

Histograms

  • Histogram: visual representation of a frequency distribution

    • Not a bar graph
  • Start by making a frequency distribution:

    • Define classes (numeric intervals)

    • Record the frequency of occurrences

Class Frequency
0.00-0.99 9
1.00-1.99 26
2.00-2.99 11
3.00-3.99 13
4.00-4.99 3
5.00-5.99 1
6.00-6.99 2
  • Lower class limit: the smallest value that can appear in that class

  • Upper class limit: largest value that can appear in that class

  • Class width: the difference between consecutive lower class limits

  • General requirements for quantitative frequency distributions:

    • Every observation must fall into one of the classes

    • The classes must not overlap

    • The classes must be of equal width

    • There must be no gaps between classes

  • Bar height (y-axis) represents class frequency

  • Bar width (x-axis) represents class width

  • Left edge of each bar corresponds to the lower class limit

  • No gaps between classes so no gaps between bars of a histogram

  • We care about the shape of our data

  • The shape of our data can help us observe the distribution of our data

  • Vocabulary for describing the shape of data:

    • Symmetric - mirror image on both sides of it’s center

    • Positively-skewed - Long, narrow tail to the right

    • Negatively-skewed - Long, narrow tail to the left

    • Unimodal - One peak/hump

    • Bimodal - two peaks/humps

    • Uniform - b o x

  • Histograms can be used to summarize both small and large data sets

  • Sometimes we prefer more detailed visualizations for smaller data sets

Steam-and-leaf plots

  • Each observation should have at least two digits

    • The digit furthest to the right is the “leaf”

    • The digits to the left form the “stem”

  • The data:

  • The stem-and-leaf plot:

Dotplots

  • We can represent each observation by a dot above its value on a number line

  • Data:

3 6 2 5 1 2 3 4 3 4
  • Dotplot:

Numerical Descriptions

  • Graphics are good for taking data and making it easier to view

  • Numerical summaries are how we take data and make it easier to understand

  • We’re going to specifically cover:

    • Mean

    • Median

    • Mode

  • Refresher:

    • A parameter describes a population

    • A statistic describes a sample

  • “Use of sample statistics to describe population parameters”

    • Statistical Inference in a nutshell

  • Which way is this histogram skewed?

  • How would you describe the “center” of this data?

    • Mean or average: Balance point (fulcrum) of the dataset.

    • Median: Half of the data points are above the median, half are below.

    • Mode: Where the peak is.

Mean

  • Most commonly used metric for summarizing data

  • Sum all of the data then divide by the number of observations

7 3 12 3 5

Mean = 7+3+12+3+55=305=6 - If the data we calculated a mean for comes from a sample:

  • Sample mean

  • If the data we calculated a mean for comes from a population:

    • Population mean
  • Which one is a parameter?

    • A statistic?
  • Mathematical Notation Soapbox

    • “Letter math”

    • I resisted it forever

    • But trust me, it does end up being helpful

  • Data values can be denoted as x1,x2,x3,...

    • x1 refers to the observed value of the variable x from individual 1

    • x can be anything

      • It doesn’t even have to be x

      • It’s convention, not law

  • Sample size (the number of individuals in the sample)

    • Denoted with n (Note: lower-case)
  • Population size

    • Denoted with N (Note: capital)
  • Summation:

    • This is referring to the sum (addition) of everything contained in the expression

    • We denote this with the Greek letter Σ

  • With this notation we can describe:

ni=1xi=x1+x2+...+xn

  • “The summation of xi to the nth term, indexed by 1”

  • With sigma notation we can express the sample and population mean formulas:

    • Sample mean (denoted ˉx):

    1nni=1xi

    • Population mean (denoted μ):

    1NNi=1xi

  • Greek letters usually mean population parameters

  • Lower-case letters usually mean sample statistics

  • In practice:

Table 5.1: Sample of College Seniors
Student Absences
1 2
2 6
3 1
4 2
5 4
6 0
7 1
8 3
9 0
10 2

1nni=1xi 110(x1+x2+x3+x4+x5+x6+x7+x8+x9+x10)

110(2+6+1+2+4+0+1+3+0+2)

11021=2110=2.1 - Properties of mean:

  • Common

  • Easy to interpret

  • Susceptible to outliers

    • The average number of Super Bowl rings between me and Tom Brady is 3.5

    • (As of 2021) the top 1% of households in the United States hold 32.3% of the country’s wealth, while the bottom 50% hold 2.6%

  • A statistic is resistant if its value is not affected heavily by outliers

    • Is the mean resistant?

Median

  • Middle value, half the data are below and half are above
7 3 12 3 5
  • Sort your data in increasing order (low to high)
3 3 5 7 12
  • The median is 5

  • If n is odd: choose position (n+1)2 in the ordered dataset

  • So n=5

    (n+1)2=(5+1)2=3

    • We pick the 3rd data point after sorting
7 3 12 3 5 8
  • If n is even after ordering:

    • Pick n2 and n2+1

    • Average the two data points

3 3 5 7 8 12
  • So n=6

    62=3 , 62+1=4

    • 3rd data point: 5

    • 4th data point: 7

    (5+7)2=6

    • The median is 6
  • Properties of median:

    • It doesn’t use all of the data directly

    • This makes it resistant

      • Outliers have little/no effect
    • Sometimes a more realistic measurement:

      • Median Household Income (Kansas): $57,422

      • Average Household Income (Kansas): $77,509

      • Why does median make more sense than average here?

    • Difference between median and mean depend on skew of the histogram

Mode

  • The most frequent observation

  • Useful for qualitative data

    • “Which credit card is most commonly used by our customers?”
  • Not as useful for quantitative data

    • “What’s the most common weight of cattle on our research farms?”

    • Why isn’t this a helpful metric?

  • A data set can have any number of modes (0,1,2,…)

7 3 12 3 5
  • The most frequently observed value in this data set is 3

    • So the mode is 3
  • The below data set displays a sample of n=7 observations:

2 1 3 4 3 5 4
  1. Find the mean, round to two decimal places if necessary:

  2. Find the median, round to two decimal places if necessary:

  3. Find the mode(s):

A group of restaurants reported in a recent year that the mean price for their dinner was $18, and the median was $22. If a histogram was constructed for the price of dinner at the group of restaurants, would you expect it to be skewed to the right, skewed to the left, or approximately symmetric?