4 Day 3

Reminders:

My Office: Dickens 009C
Office Hours: 3-4 PM Monday/Wednesday
- This specific Wednesday I’m moving office hours around
- Check Canvas
Zoom Office Hours
- On-request, I won’t sit in a Zoom room by default every week
- Email me: rmsholl@ksu.edu
ALEKS issues
- Please email me
- Apparently we can fix more ALEKS issues than IT can
Homework policy
- On Canvas
Classroom conduct:
- If I’m prompting a question you can just reply
- Outside of a prompted question please raise your hand
- If you’re arriving late/leaving early, just be respectful with your entrance/exit
  - If you need to step out for whatever reason, same courtesies apply
- Try not to disrupt class/break laws in general
- You can take pictures of annotations on the screen anytime
- You can record the lecture
- You just can’t distribute the recordings, that ends badly
  - Technically everything I produce is owned by the University
Attendance policy:
- You should attend class
- I am obligated to take attendance at random
  
  ${n \choose k}p^k(1-p)^{n-k}$ $p = 0.5$
- If you need to/happened to miss class for whatever reason just email me
  - I have systems available for keeping you from falling behind

Review

Data set
- Collected information
Individuals
- Something the information is collected on
Variables
- Characteristics about the individuals we collected information from
Qualitative (Categorical) variable
- Values represent categories
  - Identifying labels/names
  - “Can I do math with this by default?”
    - “Is the math arbitrary?” (i.e., does the 1 in “rank 1 salesperson at Amazon” actually represent a quantity? or does it describe them?)
Quantitative variable
- Values represent meaningful numbers
Qualitative variables can be ordinal or nominal
- Ordinal variables
  - Categories/values of the variable have a natural ordering
    - Letter grade: A, B, C, D
- Nominal variable
  - Categories/values of the variable cannot be ordered
    - Degree programs
Quantitative variables can be discrete or continuous
- Discrete variable
  - A countable number of values (0, 1, 2, 3, 4, …)
    - Number of students in a classroom
- Continuous variable
  - A continuous range of numbers (0, 0.1, 0.11, 0.111, …)
    - Temperature
Quantitative variables can be Interval or Ratio:
- Interval level
  - Can’t make a meaningful ratio
  - Zero doesn’t mean absence
  - Temperature in Celsius/Fahrenheit (Does 0 mean there’s no heat?)
- Ratio level
  - Ratios make sense
  - Zero represents absence of the variable
  - If you’re 0 inches tall do you have height?
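A quick sketch of why ratios break down at the interval level. The temperatures and heights below are made-up values, and the Kelvin conversion is an addition for illustration (Kelvin has a true zero, so it behaves like a ratio-level scale):

```python
def celsius_to_fahrenheit(c):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return c * 9 / 5 + 32

def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins (0 K means no thermal energy)."""
    return c + 273.15

# Hypothetical temperatures: is 20 C "twice as hot" as 10 C?
cool_c, warm_c = 10.0, 20.0
ratio_c = warm_c / cool_c
ratio_f = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)
print(f"Celsius ratio:    {ratio_c:.3f}")
print(f"Fahrenheit ratio: {ratio_f:.3f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")

# Hypothetical heights: ratio level, so the ratio survives a unit change.
short_in, tall_in = 30.0, 60.0
ratio_in = tall_in / short_in
ratio_cm = (tall_in * 2.54) / (short_in * 2.54)
print(f"Height ratio in inches: {ratio_in:.3f}, in cm: {ratio_cm:.3f}")
```

The "same" temperature ratio changes with the unit, so it isn't meaningful. The height ratio doesn't change, because 0 really means no height.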
Raw data isn’t entirely useful
Statistics is really good at summarizing and visualizing data
Bar graph: One or more categorical variables

Histogram: One numerical variable

Scatterplot: More than one numerical variable
Frequency distribution
- Group data into categories
- Record the number of observations that fall into each category
- “How frequently do these categories occur in my sample?”
Relative frequency distribution
- Divide the number in each category by the total number of observations
- This gives us the proportion of units in each category
- “What percentage of my sample is represented by this category?”

Count up how many times each category occurs in the sample
For each category, divide the occurrences of the category by the sample total
- $4$ customers use Visa
- $10$ customers total in the sample
- ${4 \over 10}=0.4$
- $0.4 \times 100\% = 40\%$
Graphs are prettier than tables

Credit Card	Frequency	Relative Frequency
Master Card	11	0.22
Visa	23	0.46
Am. Express	9	0.18
Discover	7	0.14
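
The relative frequency column can be reproduced from the frequency column. A minimal sketch, using the counts from the table above (the function name `relative_frequencies` is just an illustrative choice):

```python
# Frequencies copied from the credit card table above.
card_frequencies = {
    "Master Card": 11,
    "Visa": 23,
    "Am. Express": 9,
    "Discover": 7,
}

def relative_frequencies(counts):
    """Divide each category's count by the total number of observations."""
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}

rel_freq = relative_frequencies(card_frequencies)
for category, proportion in rel_freq.items():
    count = card_frequencies[category]
    print(f"{category:<12} {count:>3}  {proportion:.2f}  ({proportion:.0%})")
```

The relative frequencies should always add up to 1 (or 100%), which is a handy check on the arithmetic.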

One or more categorical variables
- So we use a bar graph

We can make a bar graph using relative frequency too

We can also just flip this horizontal

Side-by-side bar graphs can be used to compare two or more categorical variables with the same categories

Bar graphs showing frequency can be converted into pie charts
Generally a pie chart will show relative frequency

They’re very pretty
- Not very useful

Graphical Summaries Continued

We’ve looked at some qualitative (categorical) visualizations
What about quantitative (numerical) visualizations?
When we have one quantitative variable we have several options:
- Histograms
- Stem-and-leaf plots
- Dotplots
With two quantitative variables we generally use a scatterplot
- We’ll talk about this at length in Chapter 4 (so not important right now)
- Side Note: we can use more than two quantitative variables in a scatterplot
  - It’s not very useful
  - Why? Can you think in 3 dimensions? What about 4? 5?

Frequency Distribution

In a study of the emissions of particulate matter, the amounts of emissions for 65 vehicles were recorded in a table:

Population?
Sample?
Variables?
Variable types?
Quantitative variables can also be summarized with a frequency distribution
- Define interval(s) for the data (referred to as classes/class)
- Record the number of observations that fall into each class

Class	Frequency
0.00-0.99	9
1.00-1.99	26
2.00-2.99	11
3.00-3.99	13
4.00-4.99	3
5.00-5.99	1
6.00-6.99	2

Lower class limit: the smallest value that can appear in that class
- Lower class limit of the third class?
Upper class limit: largest value that can appear in that class
- Upper class limit of the first class?
Class width: the difference between consecutive lower class limits
- Class width of the above frequency distribution?
There’s no “one” right way to choose the number of classes or the width for a frequency distribution
There are general requirements however:
- Every observation must fall into one of the classes
- The classes must not overlap
- The classes must be of equal width
- There must be no gaps between classes
  - Even if there are no observations in a class it must be included in the frequency distribution
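
One way to sketch these requirements in code. The readings below are hypothetical (they are NOT the 65 values from the study), and the function and parameter names are illustrative assumptions:

```python
def frequency_distribution(values, lowest, width, num_classes):
    """Count how many values fall into each equal-width class.

    Class i runs from lowest + i*width up to (but not including)
    lowest + (i+1)*width, so classes cannot overlap or leave gaps.
    """
    counts = [0] * num_classes  # empty classes stay in the list as 0
    for value in values:
        index = int((value - lowest) // width)
        if not 0 <= index < num_classes:
            # Every observation must fall into one of the classes.
            raise ValueError(f"{value} does not fall into any class")
        counts[index] += 1
    return counts

# Hypothetical emission readings, made up for illustration only.
sample = [0.25, 1.10, 1.47, 1.95, 2.30, 4.80, 0.90, 1.62]
counts = frequency_distribution(sample, lowest=0.0, width=1.0, num_classes=5)
for i, count in enumerate(counts):
    lower = i * 1.0
    print(f"{lower:.2f}-{lower + 0.99:.2f}  {count}")
```

Note that the 3.00-3.99 class has no observations here but still shows up with a frequency of 0.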

Histograms

Histogram: visual representation of a frequency distribution
- Not a bar graph
Bar height (y-axis) represents class frequency
Bar width (x-axis) represents class width

Left edge of each bar corresponds to the lower class limit
No gaps between classes so no gaps between bars of a histogram
- If there are gaps between the bars what is it?
This histogram has 7 classes
- You can choose a different number of classes
- You can choose different widths
- Free will exists, there are no rules
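
A rough sketch of what changing the class width does, using the class frequencies from the emissions frequency distribution above. The text "bars" and the helper names `merge_classes` and `text_histogram` are illustrative assumptions, not a real plotting tool:

```python
# Lower class limits and frequencies copied from the emissions table above.
lower_limits = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
class_frequencies = [9, 26, 11, 13, 3, 1, 2]

def merge_classes(limits, counts, group_size):
    """Combine every `group_size` adjacent classes into one wider class."""
    merged_limits = limits[::group_size]
    merged_counts = [
        sum(counts[i:i + group_size]) for i in range(0, len(counts), group_size)
    ]
    return merged_limits, merged_counts

def text_histogram(limits, counts):
    """Return one line per class: lower limit, a bar of '#', and the count."""
    return [
        f"{limit:4.2f} | {'#' * count} {count}"
        for limit, count in zip(limits, counts)
    ]

print("Width 1.00 (7 classes):")
print("\n".join(text_histogram(lower_limits, class_frequencies)))

# Doubling the class width leaves 4 wider classes.
wide_limits, wide_counts = merge_classes(lower_limits, class_frequencies, 2)
print("Width 2.00 (4 classes):")
print("\n".join(text_histogram(wide_limits, wide_counts)))
```

Both versions describe the same 65 vehicles. The wider classes smooth out detail, so the choice of width changes how much of the shape you can see.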

It’s not wrong (it’s very unhinged behavior)
- Is this interpretable though?

Is this one okay?
We care about the shape of our data
- This is the primary purpose of a histogram
- So we want to not fail at that task
The shape of our data can help us observe the distribution of our data
Vocabulary for describing the shape of data:
- Symmetric - mirror image on both sides of its center
- Positively-skewed - Long, narrow tail to the right
- Negatively-skewed - Long, narrow tail to the left
- Unimodal - One peak/hump
- Bimodal - two peaks/humps
- Uniform - roughly flat, like a box (every class has about the same frequency)

Histogram showing the GPAs of a sample of students at a certain college

Which class has the highest frequency?
How many students were in the sample?
What percentage of the students had GPAs between 2.0 and 3.0?
Describe the shape of the above histogram.
One of these histograms represents the age at death from natural causes (heart attack, cancer, etc.)
The other represents the age at death from accidents

Which represents the age at death from accidents?
- Justify your answer
Histograms can be used to summarize both small and large data sets
Sometimes we prefer more detailed visualizations for smaller data sets
Stem-and-leaf plots and dotplots are alternative summaries that display the actual values

Stem-and-leaf plots

Each observation should have at least two digits
- The digit furthest to the right is the “leaf”
- The digits to the left form the “stem”
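
The stem/leaf split can be sketched like this. The scores are hypothetical, and this version assumes non-negative whole numbers, so the leaf is just the ones digit:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem to its sorted list of leaves (whole numbers only)."""
    plot = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # leaf = rightmost digit, stem = the rest
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical exam scores, made up for illustration only.
scores = [52, 67, 61, 75, 78, 83, 70, 88, 67, 91]
plot = stem_and_leaf(scores)

# Print every stem from smallest to largest, even if it has no leaves.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Turn the printed plot on its side and it reads like a histogram, except the actual values are still visible.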
The data:

The stem-and-leaf plot:

Can you describe the shape of the data?

Dotplots

We can represent each observation by a dot above its value on a number line
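
A minimal text sketch of the idea, with hypothetical data. It assumes small whole-number values (single digits) so the columns line up:

```python
from collections import Counter

def dotplot(values):
    """Return text rows with one '.' stacked above a value per occurrence."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)
    rows = []
    # Build from the tallest stack down to the number line.
    for level in range(max(counts.values()), 0, -1):
        rows.append(" ".join("." if counts[v] >= level else " " for v in axis))
    rows.append(" ".join(str(v) for v in axis))  # the number line itself
    return rows

# Hypothetical observations, made up for illustration only.
observations = [1, 2, 2, 3, 3, 3, 5]
rows = dotplot(observations)
print("\n".join(rows))
```

Values with no observations (4 in this example) still get a spot on the number line, just with no dots above them.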
Data:

Dotplot: