19 Day 18
Announcements
Exam next week
Study now, make this the easy exam
Review session Wednesday
Come with questions on the content of the exam
We’ll be reviewing different content that a few instructors have made
Cheat sheet design session as well
Day 19 content is partially shoved into today’s lecture
- That’s why we can do review on Wednesday
Review
Uncertainty and Statistics
“All models are wrong, but some are useful” - George E. P. Box
What is the primary goal of statistics?
- Can we achieve this goal without making any errors?
We define uncertainty from two central concepts:
For every population we can draw information from, there is a sample X
The mean of our sample is $\bar{x}$
$\bar{x}$ is assumed to be normally distributed under either of two circumstances
- The distribution of X, and thus our population, is normal
- Our sample size is n > 30
Why does point (1) make sense?
If we draw samples from a population that has any distribution
Its samples will have the same distribution
A condition where this doesn’t hold is when we transform our data somehow
- e.g., taking samples from a normally distributed population but transforming the samples to strictly positive, discrete values
Why does point (2) make sense?
In applied statistics we check our assumptions
Logically the only way to check this assumption would be to take a lot of samples
We’ve done that outside of this class, our assumptions do hold (refer to the CLT App in the bookdown)
Logically however:
We assume that the distribution of $\bar{x}$ is normal even when we observe only one sample mean, as long as it comes from a sample of n > 30
Empirical and mathematical proof are shockingly easy to obtain (and do by hand!)
I do recommend just seeking out a program to do it for you however
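As a rough illustration of letting a program do the checking, here is a minimal simulation sketch. The population (exponential with mean 1), the sample size of 40, and the number of repetitions are all assumptions chosen for illustration; they do not come from the CLT App.

```python
import random
import statistics

def simulate_sample_means(n=40, reps=5000, seed=1):
    """Draw `reps` samples of size n from a skewed population; return their means."""
    rng = random.Random(seed)
    # Exponential population with mean 1 and SD 1: strongly right-skewed, not normal.
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(reps)
    ]

means = simulate_sample_means()
center = statistics.fmean(means)
spread = statistics.stdev(means)

# CLT prediction: center near 1, spread near 1 / sqrt(40) (about 0.158)
print(f"mean of sample means: {center:.3f}")
print(f"SD of sample means:   {spread:.3f}")

# If the sample means are roughly normal, about 95% fall within 1.96 standard errors of 1
se = 1 / 40 ** 0.5
inside = sum(abs(m - 1) <= 1.96 * se for m in means) / len(means)
print(f"share within 1.96 SE: {inside:.3f}")
```

Even though the population is far from normal, the sample means should cluster around the population mean with a spread close to σ/√n.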
When we sample from a population and compute a statistic, we’re always slightly wrong
- Statistics accepts errors on the single condition that you can quantify those errors
Central Limit Theorem
Given we have an understanding of the CLT, most of this class should feel essentially intuitive
The Central Limit Theorem boils down to the statement that:
- “…a large, properly drawn sample will resemble the population from which it is drawn.” - Charles Wheelan
Think about sampling 40 defensive linemen from random college football teams
We can pretty intuitively assume that they’re going to be pretty heavy on average
If we got an average weight of 140 lbs. we’d be concerned we sampled wrong
Not because it’s impossible
But because it’s highly unlikely
That’s it.
Confidence Intervals are a way to quantify uncertainty
- Put simply, we’re saying “The true value is X, give or take a”
We have two methods for calculating confidence intervals:
- z-method (given σ is known)
$$\bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$
- t-method (given σ is unknown)
$$\bar{x} \pm t_{\alpha/2} \cdot \frac{s}{\sqrt{n}}$$
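A minimal sketch of the z-method in code may help make the formula concrete. The function name `z_interval` and the numbers in the usage line are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population SD (sigma) is known."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)              # margin of error
    return xbar - margin, xbar + margin

# Hypothetical numbers, for illustration only: xbar = 50, sigma = 10, n = 64
low, high = z_interval(50, 10, 64, confidence=0.95)
print(f"({low:.2f}, {high:.2f})")
```

The t-method has the same shape: swap σ for the sample standard deviation s, and the z critical value for a t critical value with n − 1 degrees of freedom.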
t-method refers directly to Student’s t-distribution
Similar to the standard normal distribution
Unimodal
Symmetric around 0
Wider (or heavier) tails than the standard normal
- Meaning it’s more spread out
Distinguished by degrees of freedom (df=n−1)
- As df increase the t-distribution converges to a normal distribution
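To see the heavier tails and the convergence numerically, the sketch below evaluates the t density far out in the tail (at 3) for several degrees of freedom and compares it with the standard normal density. The helper `t_pdf` is written here from the textbook density formula; it is not from the course materials.

```python
from math import exp, lgamma, log, pi
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    # lgamma keeps the ratio of gamma functions stable for large df
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))

normal_tail = NormalDist().pdf(3.0)
for df in (2, 5, 14, 30, 1000):
    print(f"df={df:>4}: t density at 3 = {t_pdf(3.0, df):.5f}")
print(f"normal : density at 3 = {normal_tail:.5f}")
```

The t density at 3 should sit above the normal density for small df and shrink toward it as df grows.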
In practice, we always use the t-method
- Why?
Confidence Intervals for Proportions
To get a confidence interval we need:
Point estimate
Margin of Error
Our point estimate for p (population proportion):
- $\hat{p}$
- Margin of error is given by:
$$z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
We need to justify being able to make a confidence interval
Given the CLT for proportions:
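$$n\hat{p} \ge 10 \quad \text{and} \quad n(1-\hat{p}) \ge 10$$
then the 100(1−α)% confidence interval for p is:
$$\hat{p} \pm \underbrace{z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}}_{\text{Margin of Error}}$$
- Recall that the critical value $z_{\alpha/2}$ comes from the standard normal distribution.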
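The recipe above (check the conditions, then point estimate plus or minus margin of error) can be sketched as a small function. The name `proportion_interval` is hypothetical; the usage line borrows the counts from Example 1.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Large-sample (z) confidence interval for a population proportion."""
    p_hat = successes / n
    # Require n*p_hat >= 10 and n*(1 - p_hat) >= 10 before using the normal approximation
    if n * p_hat < 10 or n * (1 - p_hat) < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # critical value z_{alpha/2}
    margin = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Sleep apnea counts from Example 1: 104 of 427 people, 99% confidence
p_hat, margin, (low, high) = proportion_interval(104, 427, confidence=0.99)
print(f"p_hat={p_hat:.3f}, margin={margin:.3f}, CI=({low:.3f}, {high:.3f})")
```

Because the code carries full precision instead of rounding at each step, the endpoints can differ from a hand calculation in the third decimal place.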
Example 1: Sleep Apnea
Sleep apnea is a disorder in which there are pauses in breathing during sleep. In a sample of 427 people aged 65 or over, 104 had sleep apnea.
a. Find a point estimate for the population proportion of those aged 65 and over who have sleep apnea. (Round your answer to 3 decimal places.)
$$\hat{p} = \frac{104}{427} \approx 0.244$$
b. Compute the margin of error for a 99% confidence level.
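For 99% confidence, $\alpha/2 = 0.005 \Rightarrow z_{\alpha/2} = 2.576$. The margin of error is:
$$z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 2.576 \times \sqrt{\frac{0.244(1-0.244)}{427}} \approx 0.054$$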
c. Construct a 99% confidence interval for the proportion of those aged 65 or over who have sleep apnea.
We construct the 99% confidence interval for p by:
$$\hat{p} \pm \text{Margin of Error} \Rightarrow 0.244 \pm 0.054$$
or (0.190, 0.298)
d. Does it appear that more than 9% of elderly people have sleep apnea?
Yes, all the values in the CI are greater than 0.09.
Example 2: Sleep Study
Researchers want to estimate the mean amount of sleep per night for students at Upper Midwest University. A sample of 15 students had an average of 6.4 hours per night and a standard deviation of 1.27 hours.
a. Should we use the z-method or t-method?
The t-method, because σ is unknown (we only have the sample standard deviation, s = 1.27).
b. Find a point estimate for the mean amount of sleep per night for UMU students.
$$\bar{x} = 6.4$$
c. Compute the margin of error for a 90% confidence level. (Round your answer to 3 decimal places.)
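$$\alpha = 1 - 0.90 = 0.10 \Rightarrow \alpha/2 = 0.05$$
With $df = n - 1 = 14$, $t_{\alpha/2} = 1.761$ from the t-table. Compute the margin of error:
$$t_{\alpha/2} \times \frac{s}{\sqrt{n}} = 1.761 \times \frac{1.27}{\sqrt{15}} \approx 0.577$$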
d. Construct a 90% confidence interval for the mean amount of sleep per night for students at UMU.
The 90% CI for the mean amount of sleep is:
$$\bar{x} \pm \text{Margin of Error} \Rightarrow 6.4 \pm 0.577$$
or (5.823, 6.977)
e. Does the confidence interval contradict a claim that the mean amount of sleep for UMU students is 6 hours? Explain.
No, because 6 is contained in the confidence interval.
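The arithmetic in this example is easy to check with a few lines of code. This is only a sketch: the critical value 1.761 is typed in from the t-table, as in the worked solution, not computed.

```python
from math import sqrt

n, xbar, s = 15, 6.4, 1.27     # sample size, sample mean, sample SD from the example
t_crit = 1.761                 # t_{alpha/2} with df = 14, read from the t-table

margin = t_crit * s / sqrt(n)  # margin of error
low, high = xbar - margin, xbar + margin

print(f"margin of error: {margin:.3f}")
print(f"90% CI: ({low:.3f}, {high:.3f})")
print("6 inside the interval:", low <= 6 <= high)
```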
Example 3: SAT Scores
A college admissions officer sampled 107 freshmen and found that 38 scored more than 510 on the math SAT.
a. Find a point estimate for the proportion of all entering freshmen at this college who scored more than 510 on the math SAT. (Round the answer to 3 decimal places.)
$$\hat{p} = \frac{38}{107} \approx 0.355$$
b. Compute the margin of error for a 95% confidence level. (Round your answer to 3 decimal places.)
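$$\alpha = 1 - 0.95 = 0.05 \Rightarrow \alpha/2 = 0.025 \Rightarrow z_{\alpha/2} = 1.96$$
Compute the margin of error:
$$z_{\alpha/2} \times \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 1.96 \times \sqrt{\frac{0.355(1-0.355)}{107}} \approx 0.091$$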
c. Construct a 95% confidence interval for the proportion of all entering freshmen who scored more than 510 on the math SAT.
$$\hat{p} \pm \text{Margin of Error} \Rightarrow 0.355 \pm 0.091$$
or (0.264, 0.446).
We are 95% confident that between 26.4% and 44.6% of the entering freshmen scored more than 510 on the math SAT.
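One way to read “95% confident” is as a statement about the method: if we repeated the sampling many times, about 95% of the intervals built this way would contain the true proportion. The simulation below is a sketch of that idea. It assumes, purely for illustration, that the true proportion is 0.355; in practice the true value is unknown.

```python
import random
from math import sqrt

def coverage(true_p=0.355, n=107, reps=2000, z=1.96, seed=7):
    """Fraction of simulated 95% intervals that contain the (assumed) true proportion."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Simulate one sample of n freshmen and count the "successes"
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        margin = z * sqrt(p_hat * (1 - p_hat) / n)
        hits += (p_hat - margin <= true_p <= p_hat + margin)
    return hits / reps

rate = coverage()
print(f"intervals covering the true p: {rate:.3f}")
```

The printed rate should land near 0.95, though not exactly on it, since the interval relies on a normal approximation and the simulation is itself random.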
Hypothesis Testing
My philosophy for scientific communication should be apparent by now:
Why should you care?
What does this do?
Why do we use this?
Making claims about statements is how science works
- We can convert statements into parameters, and make claims about them using statistics
We call this: “Hypothesis Testing”
In hypothesis testing, there are two competing statements about population parameters:
$$H_0 \equiv \text{null hypothesis} \quad \text{vs} \quad H_1 \equiv \text{alternate hypothesis}$$
Upon forming these statements:
We use data collected from a sample to decide whether we
- Reject $H_0$, thereby supporting $H_1$
- Fail to reject $H_0$
A statistical hypothesis is often a statement about population parameter(s)
The null hypothesis, $H_0$, states that the parameter is equal to a specific value
$$H_0: \mu = 35$$
The alternate hypothesis, $H_1$, states that the value of the parameter differs from the value specified by the null hypothesis
$$H_1: \mu < 35$$
$$H_1: \mu > 35$$
$$H_1: \mu \ne 35$$
There are three types of alternate hypothesis
- Consider $H_0: \mu = 35$
$H_1: \mu < 35$ ⇒ called a left-tailed alternate hypothesis
$H_1: \mu > 35$ ⇒ called a right-tailed alternate hypothesis
$H_1: \mu \ne 35$ ⇒ called a two-tailed alternate hypothesis
Left-tailed and right-tailed hypotheses are called one-tailed hypotheses
A null hypothesis is generally thought of as a default state of nature (e.g. existing knowledge)
An alternate hypothesis, on the other hand, contradicts the default state (e.g. new knowledge)
In most cases, whatever we wish to establish is placed in the alternate hypothesis
After developing $H_0$ and $H_1$, we collect a set of data
Based on the data, we construct a test statistic to reach one of the following decisions:
Reject $H_0$
Fail to reject $H_0$
If we reject $H_0$
- We conclude that the data support $H_1$
If we fail to reject $H_0$
- We conclude that the data do not provide enough evidence to reject $H_0$
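The decision step can be sketched in code. The notes above do not fix a particular test, so this example assumes a one-sample z-test with known σ and a significance level of 0.05; the function name `z_test_decision` and the data in the usage line are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_test_decision(xbar, mu0, sigma, n, alpha=0.05, tail="two"):
    """Return the z statistic and the decision for H0: mu = mu0."""
    z = (xbar - mu0) / (sigma / sqrt(n))   # test statistic
    std_normal = NormalDist()
    if tail == "left":       # H1: mu < mu0
        reject = z < std_normal.inv_cdf(alpha)
    elif tail == "right":    # H1: mu > mu0
        reject = z > std_normal.inv_cdf(1 - alpha)
    elif tail == "two":      # H1: mu != mu0
        reject = abs(z) > std_normal.inv_cdf(1 - alpha / 2)
    else:
        raise ValueError("tail must be 'left', 'right', or 'two'")
    return z, "Reject H0" if reject else "Fail to reject H0"

# Hypothetical data: xbar = 33.2 from n = 50 observations, sigma assumed to be 6
z, decision = z_test_decision(33.2, 35, 6, 50, alpha=0.05, tail="left")
print(f"z = {z:.2f}: {decision}")
```

Note that the function only ever returns one of the two decisions listed above; it never “accepts” the null hypothesis.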
Errors in Hypothesis Testing
We’re making inference from samples
We will be a certain amount of wrong
Depending on how wrong we are, and in what way, we define two types of errors
Type I error: $H_0$ is true in reality, but we reject $H_0$
Type II error: $H_1$ is true in reality, but we do not reject $H_0$

| Decision | $H_0$ True | $H_0$ False |
|---|---|---|
| Reject $H_0$ | Type I error | Correct decision |
| Don’t reject $H_0$ | Correct decision | Type II error |
The probability of making a Type I error is denoted by α
The probability of making a Type II error is denoted by β
For a fixed sample size, minimizing both α and β at once is impossible due to a trade-off between them
- If we decrease α, then β tends to increase
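The trade-off can be made concrete with a small calculation. This sketch assumes a right-tailed z-test of H0: μ = 35 with known σ, and a hypothetical true mean of 37; none of these numbers come from the notes beyond the value 35.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_error(alpha, mu0=35.0, mu_true=37.0, sigma=6.0, n=30):
    """Beta for a right-tailed z-test of H0: mu = mu0 when the true mean is mu_true."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)        # reject H0 when z > z_crit
    shift = (mu_true - mu0) / (sigma / sqrt(n))   # center of z under the true mean
    return std_normal.cdf(z_crit - shift)         # P(fail to reject H0 | H1 true)

betas = {alpha: type_ii_error(alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, beta in betas.items():
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}")
```

With the sample size held fixed, each tightening of α pushes the rejection cutoff further out, so β rises.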
In general, controlling the α-level (the chance of making a Type I error) is more important, since a Type I error leads to the acceptance of incorrect new knowledge