3 Day 2

  • Hello!

  • About Me

  • My Office: Dickens 009C

  • Office Hours: 3-4 PM Monday/Wednesday

  • “Open Door” Policy

  • Zoom Office Hours

    • On request only; I won’t sit in a Zoom room by default every week

    • Email me:

  • These notes are always available

  • Burning questions/emergencies, and you couldn’t catch me in my office?

    • Just email me, \(\approx\) 72-business-hour response time

      • If I’m past that window, I’m either dead or on vacation

      • I will follow up when/if I return

  • The speech I wish I could have given Monday:

    • I was once, literally, right where you are

Review

  • Two general forms of statistics:

    • Descriptive

      • Quarterback Passer Rating
    • Inferential

      • Unemployment negatively affects GDP (Okun’s Law)
  • Population

    • Entire collection of individuals we’re seeking information on
  • Sample

    • A subset of that population

    • The actual observed group

  • How often do K-State students attend home football games?

    • Randomly survey 100 students who walk into the library on a random Monday
  • Why do we sample?

    • I want to determine which gas station in Manhattan, KS is the least popular

      • Do I ask every person in town?
  • Simple Random Sampling

    • A sample chosen by a method in which every possible sample of a given size from the population is equally likely to be selected
  • Stratified Sampling

    • Divide the population into similar groups (e.g., group students by college)

    • Randomly sample from those groups (strata)

  • Cluster Sampling

    • Divide the population into clusters (e.g., split Manhattan, KS by street block)

    • Randomly sample from the clusters

  • Systematic Sampling

    • Randomly choose a start point in a “lined-up” population

    • Sample every \(k^{th}\) item

    • e.g., starting from the \(4^{th}\) batch of ice cream produced on a given day, Call Hall will check the quality of every \(4^{th}\) batch that comes off the production line

  • Sample of Convenience

    • Class height
  • Voluntary Response Samples

    • Customer support reviews
  • Parameter

    • Describes an entire population
  • Statistic

    • Describes a sample
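
The four random sampling methods reviewed above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the 20-student population, the college names, and every function name are made up for the example. Cluster sampling is taken here to mean choosing whole clusters at random and keeping everyone in them, which is the usual textbook reading.

```python
import random

# Hypothetical population: 20 students, each tagged with a made-up college.
COLLEGES = ["Arts", "Business", "Engineering", "Agriculture"]
population = [{"id": i, "college": COLLEGES[i % 4]} for i in range(20)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def simple_random_sample(items, n):
    """Every possible sample of size n is equally likely."""
    return rng.sample(items, n)


def stratified_sample(items, key, n_per_stratum):
    """Group items into strata by `key`, then sample randomly within each stratum."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def cluster_sample(items, cluster_size, n_clusters):
    """Split items into clusters, pick whole clusters at random, keep all their members."""
    clusters = [items[i:i + cluster_size] for i in range(0, len(items), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]


def systematic_sample(items, k):
    """Pick a random start among the first k items, then take every k-th item."""
    start = rng.randrange(k)
    return items[start::k]


print(len(simple_random_sample(population, 5)))
print(len(stratified_sample(population, "college", 2)))
print(len(cluster_sample(population, 5, 2)))
print([s["id"] for s in systematic_sample(population, 4)])
```

Convenience and voluntary response samples are left out on purpose: no random mechanism selects them, which is exactly why they tend to be biased.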

Types of Data

  • Data set

    • Collected information
  • Individuals

    • Something the information is collected on

    • People/Places/Things/etc.

  • Variables

    • Characteristics about the individuals we collected information from

  • We collected information on students

  • The variables are major, exam score, and grade

  • The values of those variables are called data

  • How many individuals?

  • What are the variables?

  • What are the data for individual 3?

Variables

  • Qualitative (Categorical) variable

    • Values represent categories

      • Identifying labels/names

      • Can’t really do math with a label or name

      • We code these into numbers to fix that

        • e.g., Cat-owners = 0 | Dog-owners = 1 | Both = 2
  • Quantitative variable

    • Values represent meaningful numbers

      • Height of a person, sales of a product

      • We can do math with these
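
As a small sketch of the coding idea above, the snippet below applies the Cat = 0, Dog = 1, Both = 2 scheme to a made-up list of survey responses (the responses are hypothetical). Coding makes the labels easy to store and count. Arithmetic on the codes themselves still depends on which number was assigned to which category, so treat a result like the average code with caution.

```python
from collections import Counter

# Hypothetical survey responses (made up for illustration).
responses = ["Cat", "Dog", "Both", "Dog", "Cat", "Dog"]

# Coding scheme from the notes: Cat-owners = 0, Dog-owners = 1, Both = 2.
codes = {"Cat": 0, "Dog": 1, "Both": 2}
coded = [codes[r] for r in responses]

# Counting how often each code appears is meaningful.
counts = Counter(coded)

# The average of the codes is computable, but it would change if we
# relabeled the categories, so it does not describe the pet owners.
mean_of_codes = sum(coded) / len(coded)

print(coded)
print(dict(counts))
print(mean_of_codes)
```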

A lot of how we do statistics depends on what data we have

  • Qualitative variables can be ordinal or nominal

    • Ordinal variables

      • Categories/values of the variable have a natural ordering

        • Letter grade: A, B, C, D

        • Clothing size: S, M, L

    • Nominal variable

      • Categories/values of the variable cannot be ordered

        • State of residence

        • Gender

  • Quantitative variables can be discrete or continuous

    • Discrete variable

      • A countable number of values (0, 1, 2, 3, 4, …)

        • Number of students in a classroom

        • Population size of fish in a pond

        • How many times a coin flip was successfully called

    • Continuous variable

      • A continuous range of numbers (0, 0.1, 0.11, 0.111, …)

        • Temperature

        • Volume of liquid in a glass

        • Height/Weight

  • Quantitative variables can be categorized by level of measurement used for obtaining data values:

    • Interval level

      • Numerical measurement

      • Differences between values make sense

      • Ratios don’t make sense because zero is arbitrary; it does not mean “none of the quantity”

      • Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)

      • Dates (Is there a meaningful ratio you can make out of 1997 and 2020?)

    • Ratio level

      • Numerical measurement

      • Differences between values make sense

      • Ratios also make sense

      • Zero has meaning, it represents absence of the quantity

      • Height (If you’re 0 inches tall, do you have height? Is there a meaningful percentage difference in height between 64 and 67 inches?)

  • Categorize the variables:

    • Music

      • Qualitative
    • Food quality

      • Qualitative
    • Closing time

      • Quantitative
    • Own money spent

      • Quantitative
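
One way to see the interval/ratio distinction above is to change units and check whether a ratio survives. The sketch below uses the standard Celsius-to-Fahrenheit and inch-to-centimeter conversions; the temperatures (10 and 20 degrees) are made up for the example, and the heights are the 64 and 67 inches mentioned in the notes.

```python
def celsius_to_fahrenheit(c):
    """Standard conversion: F = C * 9/5 + 32."""
    return c * 9 / 5 + 32


def inches_to_cm(inches):
    """Standard conversion: 1 inch = 2.54 cm."""
    return inches * 2.54


# Interval level: is 20 degrees C "twice as hot" as 10 degrees C?
ratio_celsius = 20 / 10
ratio_fahrenheit = celsius_to_fahrenheit(20) / celsius_to_fahrenheit(10)

# Ratio level: how much taller is 67 inches than 64 inches?
ratio_inches = 67 / 64
ratio_cm = inches_to_cm(67) / inches_to_cm(64)

print(ratio_celsius, ratio_fahrenheit)  # the ratio changes with the unit
print(ratio_inches, ratio_cm)           # the ratio is the same in both units
```

The temperature ratio depends on where the scale puts its zero, so it carries no meaning. The height ratio is the same in any unit because zero height means no height.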

Communicating with Data

  • Raw data isn’t entirely useful

  • Statistics is really good at summarizing and visualizing data

  • This is the primary focus of the next chapter

  • Choosing the “best” graph for displaying our data depends on our data

    • What kind of data do we have?

      • Categorical?

      • Numerical?

    • What are we trying to do?

      • Look at the distribution of our data?

      • See how two or more variables are related?

  • Bar graph: One or more categorical variables

  • Histogram: One numerical variable

  • Scatterplot: More than one numerical variable
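
The three rules of thumb above can be written as a tiny lookup. This is only a sketch of the guidance in these notes: the function name `suggest_graph` and the string labels are hypothetical, and a mix of categorical and numerical variables returns `None` because the notes do not cover that case here.

```python
def suggest_graph(variable_kinds):
    """Suggest a graph from a list of "categorical" / "numerical" labels."""
    kinds = list(variable_kinds)
    if not kinds:
        raise ValueError("need at least one variable")
    if all(kind == "categorical" for kind in kinds):
        return "bar graph"        # one or more categorical variables
    if all(kind == "numerical" for kind in kinds):
        if len(kinds) == 1:
            return "histogram"    # one numerical variable
        return "scatterplot"      # more than one numerical variable
    return None                   # mixed types: not covered by these rules


print(suggest_graph(["categorical"]))
print(suggest_graph(["numerical"]))
print(suggest_graph(["numerical", "numerical"]))
```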

Summarizing Data

  • Even when clean, data is messy

  • Interpreting information is how we make decisions

  • Every decision we make is data driven

    • Even when it’s “emotional data”
  • Statistics gives us tools to summarize and interpret data rapidly

Frequency Distribution

  • Credit cards used by the last 10 customers at a store

    • Population?

    • Sample?

    • How many variables?

    • What type of variable?

  • Frequency distribution

    • Groups data into categories

    • Records the number of observations that fall into each category

    • “How frequently does each category occur in my sample?”

  • Relative frequency distribution

    • Divide the number in each category by the total number of observations

    • This gives us the proportion of units in each category

    • “What percentage of my sample falls into this category?”

  • Count up how many times each category occurs in the sample

  • For each category, divide its count by the sample total

    • \(4\) customers use Visa

    • \(10\) customers total in the sample

    • \(\frac{4}{10}=0.4\)

    • \(0.4 \times 100\%=40\%\)

  • How is this useful?

    • What percentage of the class drinks coffee?

    • What percentage of the class drinks tea?

    • What’s our sample?

    • Population?

    • What are the variables and variable types?
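
The counting and dividing steps above can be sketched with Python's `collections.Counter`. The list of ten cards below is hypothetical; it is built only so that 4 of the 10 customers use Visa, matching the worked example, and the other counts are made up.

```python
from collections import Counter

# Hypothetical cards used by the last 10 customers (4 of them Visa).
cards = [
    "Visa", "MasterCard", "Visa", "Discover", "Visa",
    "American Express", "MasterCard", "Visa", "Discover", "MasterCard",
]

# Frequency distribution: how many observations fall into each category.
frequency = Counter(cards)

# Relative frequency distribution: each count divided by the sample total.
total = len(cards)
relative_frequency = {card: count / total for card, count in frequency.items()}

for card, count in frequency.items():
    share = relative_frequency[card]
    print(f"{card}: {count} of {total} = {share:.2f} ({share * 100:.0f}%)")
```

The relative frequencies always add up to 1, which is a quick check that no observation was dropped or counted twice.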

Bar Graphs

  • Graphs are prettier than tables

    • Barely a subjective statement

  • How many customers does this business lose if they stop taking Discover?

  • Take the frequency and relative frequency distributions:

    Credit Card    Frequency    Relative Frequency
    Master Card    11           0.22
    Visa           23           0.46
    Am. Express     9           0.18
    Discover        7           0.14

  • One or more categorical variables

    • So we use a bar graph

  • We can make a bar graph using relative frequency too

  • We can also just flip this horizontally

  • This is useful for when you have longer category names

  • Side-by-side bar graphs can be used to compare two or more categorical variables with the same categories
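
As a rough sketch, the credit card table above can be turned into a horizontal text bar graph using only the standard library. The frequencies are copied from the table; the function name `text_bar_chart` is made up for this example, and a plotting library would be the usual choice for a real graph.

```python
# Frequencies copied from the credit card table in these notes.
frequency = {"Master Card": 11, "Visa": 23, "Am. Express": 9, "Discover": 7}

total = sum(frequency.values())
relative_frequency = {card: count / total for card, count in frequency.items()}


def text_bar_chart(freq):
    """Draw one horizontal bar of '#' characters per category."""
    width = max(len(name) for name in freq)
    lines = []
    for name, count in freq.items():
        lines.append(f"{name:<{width}} | {'#' * count} {count}")
    return "\n".join(lines)


print(text_bar_chart(frequency))
print(f"Discover share of the sample: {relative_frequency['Discover']:.2f}")
```

On the Discover question: in this sample, 7 of the 50 customers (14%) paid with Discover, so that is the most the business could lose among them, assuming none of those customers would switch to another card.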

Pie Charts

  • Bar graphs showing frequency can be converted into pie charts

  • Generally a pie chart will show relative frequency

    • “What’s my piece of the pie?”

  • They’re very pretty

    • Not very useful

    • Interpretability is everything

  • Approximately how large is Borneo?

  • Approximately how much larger is New Guinea than Sumatra?

  • Someone says that Madagascar and Baffin Island together are larger than New Guinea:

    • Is this correct?
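
To illustrate how a relative frequency becomes a “piece of the pie”, the sketch below converts the relative frequencies from the credit card table in the Bar Graphs section into slice angles. It assumes the usual convention that each slice takes its share of the full 360 degrees; the island chart discussed above is not reproduced because its data are not in these notes.

```python
# Relative frequencies copied from the credit card table in these notes.
relative_frequency = {
    "Master Card": 0.22,
    "Visa": 0.46,
    "Am. Express": 0.18,
    "Discover": 0.14,
}

# Each slice gets its share of the full circle (360 degrees).
slice_angles = {card: share * 360 for card, share in relative_frequency.items()}

for card, angle in slice_angles.items():
    print(f"{card}: {angle:.1f} degrees")
```

Reading values back from angles is the hard part of a pie chart, which is why the comparison questions above are difficult to answer by eye.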