4 Day 3

Reminders:

  • My Office: Dickens 009C

  • Office Hours: 3-4 PM Monday/Wednesday

    • This specific Wednesday I’m moving office hours around

    • Check Canvas

  • Zoom Office Hours

    • On-request, I won’t sit in a Zoom room by default every week

    • Email me:

  • ALEKS issues

    • Please email me

    • Apparently we can fix more ALEKS issues than IT can

  • Homework policy

    • On Canvas
  • Classroom conduct:

    • If I’m prompting a question you can just reply

    • Outside of a prompted question please raise your hand

    • If you’re arriving late/leaving early, just be respectful with your entrance/exit

      • If you need to step out for whatever reason, same courtesies apply
    • Try not to disrupt class/break laws in general

    • You can take pictures of annotations on the screen anytime

    • You can record the lecture

    • You just can’t distribute the recordings, that ends badly

      • Technically everything I produce is owned by the University
  • Attendance policy:

    • You should attend class

    • I am obligated to take attendance at random

      \[{n \choose k}p^k(1-p)^{n-k}\] \[p = 0.5\]

    • If you need to/happened to miss class for whatever reason just email me

      • I have systems available for keeping you from falling behind

Review

  • Data set

    • Collected information
  • Individuals

    • Something the information is collected on
  • Variables

    • Characteristics about the individuals we collected information from
  • Qualitative (Categorical) variable

    • Values represent categories

      • Identifying labels/names

      • “Can I do math with this by default?”

        • “Is the math arbitrary?” (i.e., does the 1 in “rank 1 salesperson at Amazon” actually represent a quantity? or does it describe them?)
  • Quantitative variable

    • Values represent meaningful numbers
  • Qualitative variables can be ordinal or nominal

    • Ordinal variables

      • Categories/values of the variable have a natural ordering

        • Letter grade: A, B, C, D
    • Nominal variable

      • Categories/values of the variable cannot be ordered

        • Degree programs
  • Quantitative variables can be discrete or continuous

    • Discrete variable

      • A countable number of values (0, 1, 2, 3, 4, …)

        • Number of students in a classroom
    • Continuous variable

      • A continuous range of numbers (0, 0.1, 0.11, 0.111, …)

        • Temperature
  • Quantitative variables can be Interval or Ratio:

    • Interval level

      • Can’t make a meaningful ratio

      • Zero doesn’t mean absence

      • Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)

    • Ratio level

      • Ratios make sense

      • Zero represents absence of the variable

      • If you’re 0 inches tall do you have height?

  • Raw data isn’t entirely useful

  • Statistics is really good at summarizing and visualizing data

  • Bar graph: One or more categorical variables

  • Histogram: One numerical variable

  • Scatterplot: More than one numerical variable

  • Frequency distribution

    • Group data into categories

    • Record the number of observations that fall into each category

    • “How frequently does each category occur in my sample?”

  • Relative frequency distribution

    • Divide the number in each category by the total number of observations

    • This gives us the proportion of units in each category

    • “What percentage of my sample falls into this category?”

  • Count up how many times each category occurs in the sample

  • For each category, divide its count by the sample total

    • \(4\) customers use Visa

    • \(10\) customers total in the sample

    • \({4 \over 10}=0.4\)

    • \(0.4 \times 100\% = 40\%\)

  • Graphs are prettier than tables

Credit Card    Frequency    Relative Frequency
Master Card    11           0.22
Visa           23           0.46
Am. Express    9            0.18
Discover       7            0.14
  • One or more categorical variables

    • So we use a bar graph

  • We can make a bar graph using relative frequency too

  • We can also just flip the bars so they run horizontally

  • Side-by-side bar graphs can be used to compare two or more categorical variables with the same categories

  • Bar graphs showing frequency can be converted into pie charts

  • Generally a pie chart will show relative frequency

  • They’re very pretty

    • Not very useful
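The relative frequency column in the credit card table can be checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the table in these notes, while the function name `relative_frequencies` and the text "bars" are illustrative choices, not part of the course materials.

```python
# Frequencies copied from the credit card table in these notes.
frequencies = {
    "Master Card": 11,
    "Visa": 23,
    "Am. Express": 9,
    "Discover": 7,
}


def relative_frequencies(freqs):
    """Divide each category's count by the total number of observations."""
    total = sum(freqs.values())
    return {category: count / total for category, count in freqs.items()}


rel = relative_frequencies(frequencies)
for category, proportion in rel.items():
    # A rough sideways bar graph: one '#' per observation in the category.
    bar = "#" * frequencies[category]
    print(f"{category:<12} {frequencies[category]:>3} {proportion:.2f} {bar}")
```

The proportions should add up to 1, which is a quick sanity check on any relative frequency distribution.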

Graphical Summaries Continued

  • We’ve looked at some qualitative (categorical) visualizations

  • What about quantitative (numerical) visualizations?

  • When we have one quantitative variable we have several options:

    • Histograms

    • Stem-and-leaf plots

    • Dotplots

  • With two quantitative variables we generally use a scatterplot

    • We’ll talk about this at length in Chapter 4 (so not important right now)

    • Side Note: we can use more than two quantitative variables in a scatterplot

      • It’s not very useful

      • Why? Can you think in 3 dimensions? What about 4? 5?

Frequency Distribution

  • In a study of particulate matter emissions, the amounts of emissions for 65 vehicles were recorded in a table:

  • Population?

  • Sample?

  • Variables?

  • Variable types?

  • Quantitative variables can also be summarized with a frequency distribution

    • Define intervals for the data (each interval is referred to as a class)

    • Record the number of observations that fall into each class

Class        Frequency
0.00-0.99    9
1.00-1.99    26
2.00-2.99    11
3.00-3.99    13
4.00-4.99    3
5.00-5.99    1
6.00-6.99    2
  • Lower class limit: the smallest value that can appear in that class

    • Lower class limit of the third class?
  • Upper class limit: the largest value that can appear in that class

    • Upper class limit of the first class?
  • Class width: the difference between consecutive lower class limits

    • Class width of the above frequency distribution?
  • There’s no “one” right way to choose the number of classes or the width for a frequency distribution

  • There are general requirements however:

    • Every observation must fall into one of the classes

    • The classes must not overlap

    • The classes must be of equal width

    • There must be no gaps between classes

      • Even if there are no observations in a class it must be included in the frequency distribution
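The requirements above (every observation lands in exactly one class, equal widths, no gaps, empty classes still listed) can be sketched in Python. The raw emissions data is not reproduced in these notes, so the `sample` values below are hypothetical and invented purely for illustration. The function name and its parameters are also assumptions.

```python
def frequency_distribution(values, lowest_limit, class_width, num_classes):
    """Count how many values fall into each equal-width class.

    Class i starts at lowest_limit + i * class_width and runs up to, but
    not including, the next lower class limit, so classes cannot overlap
    and cannot leave gaps.
    """
    counts = [0] * num_classes
    for value in values:
        index = int((value - lowest_limit) // class_width)
        if not 0 <= index < num_classes:
            # Every observation must fall into one of the classes.
            raise ValueError(f"{value} does not fall into any class")
        counts[index] += 1
    lower_limits = [lowest_limit + i * class_width for i in range(num_classes)]
    return list(zip(lower_limits, counts))


# Hypothetical emission values, NOT the data from the study.
sample = [0.25, 1.10, 1.85, 1.40, 2.75, 4.20, 1.05, 0.90]
table = frequency_distribution(sample, lowest_limit=0.0, class_width=1.0, num_classes=5)
for lower, count in table:
    # Upper limit shown assumes the data is recorded to two decimal places.
    print(f"{lower:.2f}-{lower + 1.0 - 0.01:.2f}  {count}")
```

Notice that the 3.00-3.99 class has no observations in this made-up sample but still appears in the output with a frequency of 0.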

Histograms

  • Histogram: visual representation of a frequency distribution

    • Not a bar graph
  • Bar height (y-axis) represents class frequency

  • Bar width (x-axis) represents class width

  • Left edge of each bar corresponds to the lower class limit

  • No gaps between classes so no gaps between bars of a histogram

    • If there are gaps between the bars what is it?
  • This histogram has 7 classes

    • You can choose a different number of classes

    • You can choose different widths

    • Free will exists, there are no rules

  • It’s not wrong (it’s very unhinged behavior)

    • Is this interpretable though?

  • Is this one okay?

  • We care about the shape of our data

    • This is the primary purpose of a histogram

    • So we want to not fail at that task

  • The shape of our data can help us observe the distribution of our data

  • Vocabulary for describing the shape of data:

    • Symmetric - mirror image on both sides of its center

    • Positively-skewed - Long, narrow tail to the right

    • Negatively-skewed - Long, narrow tail to the left

    • Unimodal - One peak/hump

    • Bimodal - two peaks/humps

    • Uniform - b o x (every class has roughly the same frequency)

  • Histogram showing the GPAs of a sample of students at a certain college

  • Which class has the highest frequency?

  • How many students were in the sample?

  • What percentage of the students had GPAs between 2.0 and 3.0?

  • Describe the shape of the above histogram.

  • One of these histograms represents the age at death from natural causes (heart attack, cancer, etc.)

  • The other represents the age at death from accidents

  • Which represents the age at death from accidents?

    • Justify your answer
  • Histograms can be used to summarize both small and large data sets

  • Sometimes we prefer more detailed visualizations for smaller data sets

  • Stem-and-leaf plots and dotplots are alternative summaries that display the actual values
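Since the histogram images are not reproduced here, a rough text version can stand in for one. The sketch below uses the class frequencies from the emissions frequency distribution in the previous section. It draws the histogram turned on its side, so bar length (not bar height) represents class frequency. The function name `text_histogram` is an illustrative assumption.

```python
# Lower class limits and frequencies copied from the emissions
# frequency distribution in these notes.
lower_limits = [0.00, 1.00, 2.00, 3.00, 4.00, 5.00, 6.00]
frequencies = [9, 26, 11, 13, 3, 1, 2]


def text_histogram(lower_limits, frequencies):
    """Return one line per class: lower class limit, a bar, and the frequency."""
    lines = []
    for lower, freq in zip(lower_limits, frequencies):
        # One '#' per observation in the class.
        lines.append(f"{lower:4.2f} | {'#' * freq} {freq}")
    return lines


for line in text_histogram(lower_limits, frequencies):
    print(line)

# The class with the highest frequency corresponds to the longest bar.
tallest = lower_limits[frequencies.index(max(frequencies))]
print("Longest bar starts at lower class limit", tallest)
```

Reading the output top to bottom, most of the observations sit in the low classes and the bars thin out toward the larger values, which is the kind of pattern the shape vocabulary above is meant to describe.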

Stem-and-leaf plots

  • Each observation should have at least two digits

    • The digit furthest to the right is the “leaf”

    • The digits to the left form the “stem”

  • The data:

  • The stem-and-leaf plot:

  • Can you describe the shape of the data?
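The data and plot for this example are not reproduced in these notes, so here is a minimal sketch of the stem/leaf split using hypothetical two-digit observations (invented for illustration only). It assumes non-negative whole numbers, and it lists stems that have no leaves, which is a common convention but an assumption here.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Split each whole number into a stem (all digits but the last) and a leaf (last digit)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so empty stems still show up.
    for stem in range(min(stems), max(stems) + 1):
        leaves = " ".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines


# Hypothetical observations, NOT the data from the lecture.
data = [23, 25, 31, 31, 38, 42, 47, 47, 49, 64]
for line in stem_and_leaf(data):
    print(line)
```

Each row works like a sideways histogram bar, except the actual values can still be read back from the leaves.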

Dotplots

  • We can represent each observation by a dot above its value on a number line

  • Data:

3 6 2 5 1 2 3 4 3 4
  • Dotplot:
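A text version of the dotplot for the data above can be sketched in Python. This is a rough illustration that assumes single-digit whole-number values so the columns line up. The function name `text_dotplot` is an illustrative choice.

```python
from collections import Counter


def text_dotplot(values):
    """Stack one '*' per observation above its value on a number line."""
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1))
    lines = []
    # Build rows from the top of the tallest stack down to the number line.
    for level in range(max(counts.values()), 0, -1):
        row = " ".join("*" if counts[v] >= level else " " for v in axis)
        lines.append(row.rstrip())
    lines.append(" ".join(str(v) for v in axis))
    return lines


data = [3, 6, 2, 5, 1, 2, 3, 4, 3, 4]  # the data listed in these notes
for line in text_dotplot(data):
    print(line)
```

The height of each stack is the frequency of that value, so the value 3 (which appears three times) gets the tallest stack.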