B References
Guided Practice problems are intended to stretch your thinking, and you can check yourself by reviewing the footnote solution for any Guided Practice.↩︎
Mine Çetinkaya-Rundel, David Diez, Andrew Bray, Albert Kim, Ben Baumer, Chester Ismay and Christopher Barr (2020). openintro: Data Sets and Supplemental Functions from ‘OpenIntro’ Textbooks and Labs. R package version 2.0.0. https://github.com/OpenIntroStat/openintro.↩︎
The proportion of the 224 patients who had a stroke within 365 days: \(45/224 = 0.20.\)↩︎
The loan’s grade is B, and the borrower rents their residence.↩︎
There are multiple strategies that can be followed. One common strategy is to have each student represented by a row, and then add a column for each assignment, quiz, or exam. Under this setup, it is easy to review a single line to understand the grade history of a student. There should also be columns to include student information, such as one column to list student names.↩︎
Each county may be viewed as a case, and there are eleven pieces of information recorded for each case. A table with 3,142 rows and 14 columns could hold these data, where each row represents a county and each column represents a particular piece of information.↩︎
The group variable can take just one of two group names, making it categorical. The num_migraines variable describes a count of the number of migraines, which is an outcome where basic arithmetic is sensible, which means this is a numerical outcome; more specifically, since it represents a count, num_migraines is a discrete numerical variable.↩︎
Two example questions: (1) What is the relationship between loan amount and total income? (2) If someone’s income is above the average, will their interest rate tend to be above or below the average?↩︎
In some disciplines, it’s customary to refer to the explanatory variable as the independent variable and the response variable as the dependent variable. However, this becomes confusing since a pair of variables might be independent or dependent, so we avoid this language.↩︎
The mcu_films data used in this exercise can be found in the openintro R package.↩︎
The run17 data used in this exercise can be found in the cherryblossom R package.↩︎
The migraine data used in this exercise can be found in the openintro R package.↩︎
The sinusitis data used in this exercise can be found in the openintro R package.↩︎
The daycare_fines data used in this exercise can be found in the openintro R package.↩︎
The biontech_adolescents data used in this exercise can be found in the openintro R package.↩︎
The penguins data used in this exercise can be found in the palmerpenguins R package.↩︎
Artwork by Allison Horst.↩︎
The smoking data used in this exercise can be found in the openintro R package.↩︎
The usairports data used in this exercise can be found in the airports R package.↩︎
The data used in this exercise can be found in the unvotes R package.↩︎
The ukbabynames data used in this exercise can be found in the ukbabynames R package.↩︎
The netflix_titles data used in this exercise can be found in the tidytuesdayR R package.↩︎
The data used in this exercise comes from the JSR Launch Vehicle Database, 2019 Feb 10 Edition.↩︎
The seattlepets data used in this exercise can be found in the openintro R package.↩︎
There are actually a few ways we could measure time within a list, and different ways result in different types of variables. We could observe the amount of time that has passed since the first item in the list, resulting in a continuous numerical variable. We could observe the number of items that have been presented so far, resulting in a discrete numerical variable. We could pick a point within the list that divides it into “early” and “late” periods, resulting in an ordinal categorical variable.↩︎
There are no “right” answers to these questions, but the purpose in asking them is to think about the fact that no single variable captures every aspect of a theoretical construct. Some variables relate to the amount of education someone received (e.g., number of years in school), while others relate to how well they did (grades or SAT). Some variables relate to how much someone earns (salary), while others relate to how much they own (net worth). Even if all of these variables have good construct validity, some of them may be better suited than others to addressing different research questions.↩︎
The question “Over the last five years, what is the average time to complete a degree for Duke undergrads?” is only relevant to students who complete their degree; the average cannot be computed using a student who never finished their degree. Thus, only Duke undergrads who graduated in the last five years represent cases in the population under consideration. Each such student is an individual case. For the question “Does a new drug reduce the number of deaths in patients with severe heart disease?”, a person with severe heart disease represents a case. The population includes all people with severe heart disease.↩︎
Answers will vary. From our own anecdotal experiences, we believe people tend to rant more about products that fell below expectations than rave about those that perform as expected. For this reason, we suspect there is a negative bias in product ratings on sites like Amazon. However, since our experiences may not be representative, we also keep an open mind.↩︎
This is a different concept than a control group, which we discuss in the second principle and in Section 2.3.2.↩︎
Also called a lurking variable, confounding factor, or a confounder.↩︎
Human subjects are often called patients, volunteers, or study participants.↩︎
There are always some researchers involved in the study who do know which patients are receiving which treatment. However, they do not interact with the study’s patients and do not tell the blinded health care professionals who is receiving which treatment.↩︎
The researchers assigned the patients into their treatment groups, so this study was an experiment. However, the patients could distinguish what treatment they received because a stent is a surgical procedure. There is no equivalent surgical placebo, so this study was not blind. The study could not be double-blind since it was not blind.↩︎
Ultimately, can we make patients believe they received a surgical treatment? In fact, we can, and some experiments use a sham surgery. In a sham surgery, the patient does undergo surgery, but the patient does not receive the full treatment, though they will still get a placebo effect.↩︎
No. See the paragraph following the question!↩︎
Answers will vary. Population density may be important. If a county is very dense, then this may require a larger percentage of residents to live in housing units that are in multi-unit structures. Additionally, the high density may contribute to increases in property value, making homeownership unfeasible for many residents.↩︎
Derived from similar figures in Chance and Rossman (2018) and Ramsey and Schafer (2012).↩︎
The data used in this exercise comes from the General Social Survey, 2018.↩︎
The cia_factbook data used in this exercise can be found in the openintro R package.↩︎
The county_complete data used in this exercise can be found in the openintro R package.↩︎
0.451 represents the proportion of individual applicants who have a mortgage. 0.802 represents the fraction of applicants with mortgages who applied as individuals.↩︎
0.122 represents the fraction of joint borrowers who own their home. 0.135 represents the fraction of home-owning borrowers who had a joint application for the loan.↩︎
Answers may vary a little. The counties with population gains tend to have higher income (median of about $45,000) versus counties without a gain (median of about $40,000). The variability is also slightly larger for the population gain group. This is evident in the IQR, which is about 50% bigger in the gain group. Both distributions show slight to moderate right skew and are unimodal. The box plots indicate there are many observations far above the median in each group, though we should anticipate that many observations will fall beyond the whiskers when examining any dataset that contains more than a few hundred data points.↩︎
Answers will vary. The side-by-side box plots are especially useful for comparing centers and spreads, while the hollow histograms are more useful for seeing distribution shape, skew, modes, and potential anomalies.↩︎
The ridge plot gives us a better sense of the shape, and especially modality, of the data.↩︎
The ridge plot gives us a better sense of the shape, and especially modality, of the data.↩︎
The antibiotics data used in this exercise can be found in the openintro R package.↩︎
The immigration data used in this exercise can be found in the openintro R package.↩︎
The heart_transplant data used in this exercise can be found in the openintro R package.↩︎
Answers may vary. Scatterplots are helpful in quickly spotting associations relating variables, whether those associations come in the form of simple trends or whether those relationships are more complex.↩︎
Consider the case where your vertical axis represents something “good” and your horizontal axis represents something that is only good in moderation. Health and water consumption fit this description: we require some water to survive, but consume too much and it becomes toxic and can kill a person.↩︎
\(x_1\) corresponds to the interest rate for the first loan in the sample, \(x_2\) to the second loan’s interest rate, and \(x_i\) corresponds to the interest rate for the \(i^{th}\) loan in the dataset. For example, if \(i = 4,\) then we’re examining \(x_4,\) which refers to the fourth observation in the dataset.↩︎
The sample size was \(n = 50.\)↩︎
Other ways to describe data that are right skewed: skewed to the right, skewed to the high end, or skewed to the positive end.↩︎
The interest rates for individual loans.↩︎
Remember that uni stands for 1 (think unicycles), and bi stands for 2 (think bicycles).↩︎
There might be two height groups visible in the dataset: one of the students and one of the adults. That is, the data are probably bimodal.↩︎
Figure 4.7 shows three distributions that look quite different, but all have the same mean, variance, and standard deviation. Using modality, we can distinguish between the first plot (bimodal) and the last two (unimodal). Using skewness, we can distinguish between the last plot (right skewed) and the first two. While a picture, like a histogram, tells a more complete story, we can use modality and shape (symmetry/skew) to characterize basic information about a distribution.↩︎
Since \(Q_1\) and \(Q_3\) capture the middle 50% of the data and the median splits the data in the middle, 25% of the data fall between \(Q_1\) and the median, and another 25% falls between the median and \(Q_3.\)↩︎
These visual estimates will vary a little from one person to the next: \(Q_1 \approx\) 8%, \(Q_3 \approx\) 14%, IQR \(\approx\) 14 - 8 = 6%.↩︎
Mean is affected more than the median. Standard deviation is affected more than the IQR.↩︎
If we are looking to simply understand what a typical individual loan looks like, the median is probably more useful. However, if the goal is to understand something that scales well, such as the total amount of money we might need to have on hand if we were to offer 1,000 loans, then the mean would be more useful.↩︎
The mammals data used in this exercise can be found in the openintro R package.↩︎
The cia_factbook data used in this exercise can be found in the openintro R package.↩︎
The pm25_2011_durham data used in this exercise can be found in the openintro R package.↩︎
The oscars data used in this exercise can be found in the openintro R package.↩︎
The county_complete data used in this exercise can be found in the usdata R package.↩︎
The county_complete data used in this exercise can be found in the usdata R package.↩︎
The nyc_marathon data used in this exercise can be found in the openintro R package.↩︎
If a model underestimates an observation, then the model estimate is below the actual. The residual, which is the actual observation value minus the model estimate, must then be positive. The opposite is true when the model overestimates the observation: the residual is negative.↩︎
Gray diamond: \(\hat{y} = 41+0.59x = 41+0.59\times 85.0 = 91.15 \rightarrow e = y - \hat{y} = 98.6-91.15=7.45.\) This is close to the earlier estimate of 7. Pink triangle: \(\hat{y} = 41+0.59x = 97.3 \rightarrow e = -3.3.\) This is also close to the estimate of -4.↩︎
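As a quick numerical check, the two residual calculations above can be reproduced in a few lines of R (a sketch; the intercept 41 and slope 0.59 are taken from the footnote, and the object names are ours):

```r
# fitted model from the footnote: y-hat = 41 + 0.59 * x
intercept <- 41
slope <- 0.59

# gray diamond: x = 85.0, observed y = 98.6
y_hat_diamond <- intercept + slope * 85.0   # 91.15
98.6 - y_hat_diamond                        # residual = 7.45

# pink triangle: the footnote gives y-hat = 97.3 and e = -3.3,
# which implies an observed value of 97.3 + (-3.3) = 94.0
94.0 - 97.3                                 # residual = -3.3
```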
Formally, we can compute the correlation for observations \((x_1, y_1),\) \((x_2, y_2),\) …, \((x_n, y_n)\) using the formula \[r = \frac{1}{n-1} \sum_{i=1}^{n} \frac{x_i-\bar{x}}{s_x}\frac{y_i-\bar{y}}{s_y}\] where \(\bar{x},\) \(\bar{y},\) \(s_x,\) and \(s_y\) are the sample means and standard deviations for each variable.↩︎
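To see the formula in action, here is a minimal R sketch (the tiny dataset is made up purely for illustration) showing that the sum of standardized products matches R’s built-in cor():

```r
# small made-up dataset, for illustration only
x <- c(1, 3, 4, 6, 8)
y <- c(2, 3, 6, 7, 12)

n <- length(x)
r_manual <- sum((x - mean(x)) / sd(x) * (y - mean(y)) / sd(y)) / (n - 1)

r_manual    # correlation computed from the formula
cor(x, y)   # built-in correlation; the two values agree
```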
We’ll leave it to you to draw the lines. In general, the lines you draw should be close to most points and reflect overall trends in the data.↩︎
Larger family incomes are associated with lower amounts of aid, so the correlation will be negative. Using a computer, the correlation can be computed: -0.499.↩︎
There are applications where the sum of residual magnitudes may be more useful, and there are plenty of other criteria we might consider. However, this book only applies the least squares criterion.↩︎
About \(R^2 = (-0.97)^2 = 0.94\) or 94% of the variation in the response variable is explained by the linear model.↩︎
The difference \(SST - SSE\) is called the regression sum of squares, \(SSR,\) and can also be calculated as \(SSR = (\hat{y}_1 - \bar{y})^2 + (\hat{y}_2 - \bar{y})^2 + \cdots + (\hat{y}_n - \bar{y})^2.\) \(SSR\) represents the variation in \(y\) that was accounted for in our model.↩︎
\(SST\) can be calculated by finding the sample variance of the response variable, \(s^2\) and multiplying by \(n-1.\)↩︎
The exam_grades data used in this exercise can be found in the openintro R package.↩︎
The husbands_wives data used in this exercise can be found in the openintro R package.↩︎
The corr_match data used in this exercise can be found in the openintro R package.↩︎
The corr_match data used in this exercise can be found in the openintro R package.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The coast_starlight data used in this exercise can be found in the openintro R package.↩︎
The babies_crawl data used in this exercise can be found in the openintro R package.↩︎
The starbucks data used in this exercise can be found in the openintro R package.↩︎
The starbucks data used in this exercise can be found in the openintro R package.↩︎
The coast_starlight data used in this exercise can be found in the openintro R package.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The county_2019 data used in this exercise can be found in the usdata R package.↩︎
The cats data used in this exercise can be found in the MASS R package.↩︎
The urban_owner data used in this exercise can be found in the usdata R package.↩︎
The babies_crawl data used in this exercise can be found in the openintro R package.↩︎
The trees data used in this exercise can be found in the datasets R package.↩︎
The corr_match data used in this exercise can be found in the openintro R package.↩︎
We would be assuming that these two variables are independent.↩︎
The study is an experiment, as subjects were randomly assigned a “male” file or a “female” file (remember, all the files were actually identical in content). Since this is an experiment, the results can be used to evaluate a causal relationship between the sex of a candidate and the promotion decision.↩︎
The test procedure we employ in this section is sometimes referred to as a randomization test, which is closely related to a permutation test. The difference between the two is how the explanatory variable was assigned. Permutation tests are used for observational studies, where the explanatory variable was not randomly assigned.↩︎
\(18/24 - 17/24=0.042\) or about 4.2% in favor of the male personnel. This difference due to chance is much smaller than the difference observed in the actual groups.↩︎
This reasoning does not generally extend to anecdotal observations. Each of us observes incredibly rare events every day, events we could not possibly hope to predict. However, in the non-rigorous setting of anecdotal evidence, almost anything may appear to be a rare event, so the idea of looking for rare events in day-to-day activities is treacherous. For example, we might look at the lottery: there was only a 1 in 176 million chance that the Mega Millions numbers for the largest jackpot in history (October 23, 2018) would be (5, 28, 62, 65, 70) with a Mega ball of (5), but nonetheless those numbers came up! However, no matter what numbers had turned up, they would have had the same incredibly rare odds. That is, any set of numbers we could have observed would ultimately be incredibly rare. This type of situation is typical of our daily lives: each possible event in itself seems incredibly rare, but if we consider every alternative, those outcomes are also incredibly rare. We should be cautious not to misinterpret such anecdotal evidence.↩︎
This context might feel strange if physical video stores predate you. If you’re curious about what those were like, look up “Blockbuster”.↩︎
Success is often defined in a study as the outcome of interest, and a “success” may or may not actually be a positive outcome. For example, researchers working on a study on COVID prevalence might define a “success” in the statistical sense as a patient who has COVID-19. A more complete discussion of the term success will be given in Chapter 11.↩︎
The avandia data used in this exercise can be found in the openintro R package.↩︎
The heart_transplant data used in this exercise can be found in the openintro R package.↩︎
About 4.8% of the patients (3 on average) in the simulation will have a complication, as this is what was seen in the sample. We will, however, see a little variation from one simulation to the next.↩︎
This case study is described in Made to Stick by Chip and Dan Heath. Little known fact: the teaching principles behind many OpenIntro resources are based on Made to Stick.↩︎
Because 50% is not in the interval estimate for the true parameter, we can say that there is convincing evidence against the hypothesis that 50% of listeners can guess the tune. Moreover, 50% is a substantial distance from the largest resample statistic, suggesting that there is very convincing evidence against this hypothesis.↩︎
If we want to be more certain we will capture the fish, we might use a wider net. Likewise, we use a wider confidence interval if we want to be more certain that we capture the parameter.↩︎
There are many choices for implementing a random selection of YouTube videos, but it isn’t clear how “random” they are.↩︎
In general, the distributions are reasonably symmetric. The case study for the medical consultant is the only distribution with any evident skew (the distribution is skewed right).↩︎
It is also introduced as the Gaussian distribution after Carl Friedrich Gauss, the first person to formalize its mathematical expression.↩︎
We use the standard deviation as a guide. Nel is 1 standard deviation above average on the SAT: \(1500 + 300 = 1800.\) Sian is 0.6 standard deviations above the mean on the ACT: \(21+0.6 \times 5 = 24.\) In Figure 8.5, we can see that Nel did better compared to other test takers than Sian did, so Nel's score was better.↩︎
\(Z_{Sian} = \frac{x_{Sian} - \mu_{ACT}}{\sigma_{ACT}} = \frac{24 - 21}{5} = 0.6\)↩︎
For \(x_1=95.4\) mm: \(Z_1 = \frac{x_1 - \mu}{\sigma} = \frac{95.4 - 92.6}{3.6} = 0.78.\) For \(x_2=85.8\) mm: \(Z_2 = \frac{85.8 - 92.6}{3.6} = -1.89.\)↩︎
Because the absolute value of the Z score for the second observation is larger than that of the first, the second observation has a more unusual head length.↩︎
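A brief R sketch reproducing these two Z scores (the mean of 92.6 mm and standard deviation of 3.6 mm come from the footnotes above; object names are ours):

```r
mu <- 92.6    # mean head length (mm)
sigma <- 3.6  # standard deviation (mm)

z1 <- (95.4 - mu) / sigma   # about 0.78
z2 <- (85.8 - mu) / sigma   # about -1.89

# the observation with the larger |Z| is the more unusual one
abs(z2) > abs(z1)           # TRUE: the second head length is more unusual
```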
If 84% had lower scores than Nel, the number of people who had better scores must be 16%. (Generally ties are ignored when the normal model, or any other continuous distribution, is used.)↩︎
We found the probability to be 0.6664. A picture for this exercise is represented by the shaded area below “0.6664”.↩︎
Numerical answers: (a) 0.9772. (b) 0.0228.↩︎
This sample was taken from the USDA Food Commodity Intake Database.↩︎
Remember: draw a picture first, then find the Z score. (We leave the pictures to you.) The Z score can be found by using the percentiles and the normal probability table. (a) We look for 0.95 in the probability portion (middle part) of the normal probability table, which leads us to row 1.6 and (about) column 0.05, i.e., \(Z_{95}=1.65.\) Knowing \(Z_{95}=1.65,\) \(\mu = 1500,\) and \(\sigma = 300,\) we set up the Z score formula: \(1.65 = \frac{x_{95} - 1500}{300}.\) We solve for \(x_{95}\): \(x_{95} = 1995.\) (b) Similarly, we find \(Z_{97.5} = 1.96,\) again set up the Z score formula for the heights, and calculate \(x_{97.5} = 76.5.\)↩︎
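If software is at hand, qnorm() produces the same cutoff without a table lookup; a small sketch for part (a), using the SAT parameters quoted above (mean 1500, SD 300):

```r
qnorm(0.95)                          # about 1.645, which rounds to the 1.65 used above
qnorm(0.95, mean = 1500, sd = 300)   # about 1993.5; using Z = 1.65 gives the 1995 above
```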
Numerical answers: (a) 0.1131. (b) 0.3821.↩︎
This is an abbreviated solution. (Be sure to draw a figure!) First find the percent who get below 1500 and the percent that get above 2000: \(Z_{1500} = 0.00 \to 0.5000\) (area below), \(Z_{2000} = 1.67 \to 0.0475\) (area above). Final answer: \(1.0000-0.5000 - 0.0475 = 0.4525.\)↩︎
5’5’’ is 65 inches. 5’7’’ is 67 inches. Numerical solution: \(1.000 - 0.0649 - 0.8183 = 0.1168,\) i.e., 11.68%.↩︎
First draw the pictures. To find the area between \(Z=-1\) and \(Z=1,\) use pnorm() or the normal probability table to determine the areas below \(Z=-1\) and above \(Z=1.\) Next verify the area between \(Z=-1\) and \(Z=1\) is about 0.68. Repeat this for \(Z=-2\) to \(Z=2\) and also for \(Z=-3\) to \(Z=3.\)↩︎
900 and 2100 represent two standard deviations above and below the mean, which means about 95% of test takers will score between 900 and 2100. Since the normal model is symmetric, then half of the test takers from part (a) (\(\frac{95\%}{2} = 47.5\%\) of all test takers) will score 900 to 1500 while 47.5% score between 1500 and 2100.↩︎
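A minimal sketch of that pnorm() check for the 68-95-99.7 rule:

```r
# area between -Z and Z under the standard normal curve
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997
```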
We will leave it to you to draw a picture. The Z scores are \(Z_{left} = -1.96\) and \(Z_{right} = 1.96.\) The area between these two Z scores is \(0.9750 - 0.0250 = 0.9500.\) This is where “1.96” comes from in the 95% confidence interval formula.↩︎
No. Just as some observations occur more than 1.96 standard deviations from the mean, some point estimates will be more than 1.96 standard errors from the parameter. A confidence interval only provides a plausible range of values for a parameter. While we might say other values are implausible based on the data, this does not mean they are impossible.↩︎
This exercise was inspired by discussion on Dr. Allan Rossman’s blog Ask Good Questions.↩︎
Making a Type 1 Error in this context would mean that reminding students that money not spent now can be spent later does not affect their buying habits, despite the strong evidence (the data suggesting otherwise) found in the experiment. Notice that this does not necessarily mean something was wrong with the data or that we made a computational mistake. Sometimes data simply point us to the wrong conclusion, which is why scientific studies are often repeated to check initial findings.↩︎
To lower the Type 2 Error rate, we want to convict more guilty people. We could lower the standards for conviction from “beyond a reasonable doubt” to “beyond a little doubt”. Lowering the bar for guilt will also result in more wrongful convictions, raising the Type 1 Error rate.↩︎
Observed control survival rate: \(\hat{p}_C = \frac{11}{50} = 0.22.\) Treatment survival rate: \(\hat{p}_T = \frac{14}{40} = 0.35.\) Observed difference: \(\hat{p}_T - \hat{p}_C = 0.35 - 0.22 = 0.13.\)↩︎
\(H_0:\) There is no association between the consultant’s contributions and the clients’ complication rate. In statistical language, \(p = 0.10.\) \(H_A:\) Patients who work with the consultant tend to have a complication rate lower than 10%, i.e., \(p < 0.10.\)↩︎
There is sufficiently strong evidence to reject the null hypothesis in favor of the alternative hypothesis. We would conclude that there is evidence that the consultant’s surgery complication rate is lower than the US standard rate of 10%.↩︎
No. Not necessarily. The evidence supports the alternative hypothesis that the consultant’s complication rate is lower, but the lower rate could be the result of many factors other than the consultant’s skill (including picking healthier clients).↩︎
Because the \(p\) is unknown but expected to be around 2/3, we will use 2/3 in place of \(p\) in the formula for the standard error. \(SE = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{2/3 (1 - 2/3)}{300}} = 0.027.\)↩︎
This is equivalent to asking how often the \(Z\) score will be larger than -2.58 but less than 2.58. (For a picture, see Figure 11.2.) To determine this probability, look up -2.58 and 2.58 in a normal probability table or use software (0.0049 and 0.9951). Thus, there is a \(0.9951-0.0049 \approx 0.99\) probability that the unobserved random variable \(X\) will be within 2.58 standard deviations of the mean.↩︎
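Both calculations in the two footnotes above can be reproduced with a few lines of R (2/3 and n = 300 come from the first footnote; -2.58 and 2.58 from the second):

```r
# standard error using 2/3 in place of p, with n = 300
p <- 2 / 3
n <- 300
sqrt(p * (1 - p) / n)        # about 0.027

# probability that Z falls between -2.58 and 2.58
pnorm(2.58) - pnorm(-2.58)   # about 0.99
```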
Since the necessary conditions for applying the normal model have already been checked for us, we can go straight to the construction of the confidence interval: \(\text{point estimate} \pm 2.58 \times SE,\) which gives an interval of (0.018, 0.162). We are 99% confident that implanting a stent in the brain of a patient who is at risk of stroke increases the risk of stroke within 30 days by a rate of 0.018 to 0.162 (assuming the patients are representative of the population).↩︎
We must find \(z^{\star}\) such that 90% of the distribution falls between -\(z^{\star}\) and \(z^{\star}\) in the standard normal model, \(N(\mu=0, \sigma=1).\) We can look up -\(z^{\star}\) in the normal probability table (or use software) by looking for a lower tail of 5% (the other 5% is in the upper tail), thus \(z^{\star} = 1.65.\) The 90% confidence interval can then be computed as \(\text{point estimate} \pm 1.65 \times SE \to (4.4\%, 13.6\%).\) (Note: the conditions for normality had earlier been confirmed for us.) That is, we are 90% confident that implanting a stent in a stroke patient’s brain increased the risk of stroke within 30 days by 4.4% to 13.6%. Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95% or 99%). A lower degree of confidence increases potential for error, but it also produces a more narrow interval.↩︎
\(H_0:\) there is not support for the regulation; \(H_0:\) \(p \leq 0.50.\) \(H_A:\) the majority of borrowers support the regulation; \(H_A:\) \(p > 0.50.\)↩︎
Independence holds since the poll is based on a random sample. The success-failure condition also holds, which is checked using the null value \((p_0 = 0.5)\) from \(H_0:\) \(np_0 = 826 \times 0.5 = 413,\) \(n(1 - p_0) = 826 \times 0.5 = 413.\) Recall that here, the best guess for \(p\) is \(p_0\) which comes from the null hypothesis (because we assume the null hypothesis is true when performing the testing procedure steps).↩︎
The study is an experiment, as patients were randomly assigned an experiment group. Since this is an experiment, the results can be used to evaluate a causal relationship between blood thinner use after CPR and whether patients survived.↩︎
Because the patients were randomized, the subjects are independent, both within and between the two groups. The success-failure condition is also met for both groups as all counts are at least 10. This satisfies the conditions necessary to model the difference in proportions using a normal distribution. Compute the sample proportions \((\hat{p}_{\text{fish oil}} = 0.0112,\) \(\hat{p}_{\text{placebo}} = 0.0155),\) point estimate of the difference \((0.0112 - 0.0155 = -0.0043),\) and standard error \(SE = \sqrt{\frac{0.0112 \times 0.9888}{12933} + \frac{0.0155 \times 0.9845}{12938}},\) \(SE = 0.00145.\) Next, plug the values into the general formula for a confidence interval, where we’ll use a 95% confidence level with \(z^{\star} = 1.96:\) \(-0.0043 \pm 1.96 \times 0.00145 = (-0.0071, -0.0015).\) We are 95% confident that fish oil decreases heart attacks by 0.15 to 0.71 percentage points (off of a baseline of about 1.55%) over a 5-year period for subjects who are similar to those in the study. Because the interval is entirely below 0, and the treatment was randomly assigned, the data provide strong evidence that fish oil supplements reduce heart attacks in patients like those in the study.↩︎
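A short R sketch of the interval computed in this footnote (the sample proportions and group sizes are the ones quoted above):

```r
p_fish    <- 0.0112   # proportion with a heart attack, fish oil group
p_placebo <- 0.0155   # proportion with a heart attack, placebo group
n_fish    <- 12933
n_placebo <- 12938

pt_est <- p_fish - p_placebo                          # -0.0043

se <- sqrt(p_fish * (1 - p_fish) / n_fish +
           p_placebo * (1 - p_placebo) / n_placebo)   # roughly 0.0014, close to the 0.00145 above

pt_est + c(-1, 1) * 1.96 * se                         # about (-0.0071, -0.0015)
```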
This is an experiment. Patients were randomized to receive mammograms or a standard breast cancer exam. We will be able to make causal conclusions based on this study.↩︎
\(H_0:\) the breast cancer death rate for patients screened using mammograms is the same as the breast cancer death rate for patients in the control, \(p_{MGM} - p_{C} = 0.\) \(H_A:\) the breast cancer death rate for patients screened using mammograms is different than the breast cancer death rate for patients in the control, \(p_{MGM} - p_{C} \neq 0.\)↩︎
For an example of a two-proportion hypothesis test that does not require the success-failure condition to be met, see Section 12.1.↩︎
The yawn data used in this exercise can be found in the openintro R package.↩︎
The heart_transplant data used in this exercise can be found in the openintro R package.↩︎
There is a large literature on understanding and improving bootstrap intervals; see Hesterberg (2015), titled “What Teachers Should Know About the Bootstrap”, and Hayden (2019), titled “Questionable Claims for Simple Versions of the Bootstrap”, for more information.↩︎
Using the formula for the bootstrap SE interval, we find the 95% confidence interval for \(\mu\) is: \(17,140 \pm 2 \cdot 2,891.87 \rightarrow\) ($11,356.26, $22,923.74). We are 95% confident that the true average car price at Awesome Auto is somewhere between $11,356.26 and $22,923.74.↩︎
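The arithmetic behind this interval, as a one-line R check (the point estimate and bootstrap SE are the values quoted in the footnote):

```r
# point estimate +/- 2 bootstrap standard errors
17140 + c(-2, 2) * 2891.87   # (11356.26, 22923.74)
```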
By looking at the percentile values in Figure 13.6, the middle 90% of the bootstrap standard deviations are given by the 5th percentile ($3,602.5) and 95th percentile ($8,737.2). That is, we are 90% confident that the true standard deviation of car prices is between $3,602.5 and $8,737.2. Note, the problem was set up as 90% to indicate that there was not a need for a high level of confidence (such as 95% or 99%). A lower degree of confidence increases potential for error, but it also produces a more narrow interval.↩︎
More nuanced guidelines would consider further relaxing the particularly extreme outlier check when the sample size is very large. However, we’ll leave further discussion here to a future course.↩︎
We want to find the shaded area above -1.79 (we leave the picture to you). The lower tail area has an area of 0.0447, so the upper area would have an area of \(1 - 0.0447 = 0.9553.\)↩︎
The sample size is under 30, so we check for obvious outliers: since all observations are within 2 standard deviations of the mean, there are no such clear outliers.↩︎
\(\bar{x} \ \pm\ t^{\star}_{14} \times SE \ \to\ 0.287 \ \pm\ 1.76 \times 0.0178 \ \to\ (0.256, 0.318).\) We are 90% confident that the average mercury content of croaker white fish (Pacific) is between 0.256 and 0.318 ppm.↩︎
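A brief R sketch of this interval; qt() supplies the t* multiplier for 14 degrees of freedom (the mean, SE, and degrees of freedom are taken from the footnote):

```r
x_bar  <- 0.287
se     <- 0.0178
t_star <- qt(0.95, df = 14)      # about 1.76 for a 90% interval

x_bar + c(-1, 1) * t_star * se   # about (0.256, 0.318)
```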
No, a confidence interval only provides a range of plausible values for a population parameter, in this case the population mean. It does not describe what we might observe for individual observations.↩︎
\(H_0:\) The average 10-mile run time was the same for 2006 and 2017. \(\mu = 93.29\) minutes. \(H_A:\) The average 10-mile run time for 2017 was different than that of 2006. \(\mu \neq 93.29\) minutes.↩︎
With a sample of 100, we should only be concerned if there are particularly extreme outliers. The histogram of the data doesn’t show any outliers of concern (and arguably, no outliers at all).↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The births14 data used in this exercise can be found in the openintro R package.↩︎
The births14 data used in this exercise can be found in the openintro R package.↩︎
\(H_0:\) the exams are equally difficult, on average. \(\mu_A - \mu_B = 0.\) \(H_A:\) one exam was more difficult than the other, on average. \(\mu_A - \mu_B \neq 0.\)↩︎
Since the exams were shuffled, the “treatment” in this case was randomly assigned, so independence within and between groups is satisfied. The summary statistics suggest the data are roughly symmetric about the mean, and the min/max values don’t suggest any outliers of concern.↩︎
The boundaries of the confidence interval are around 4.5 and 12. Since the confidence interval roughly spans 2 standard errors on either side of the mean, we can divide the width of the confidence interval by four to get a rough estimate of the standard error: \(\frac{12 - 4.5}{4} = 1.875\).↩︎
The point estimate of the population difference (\(\bar{x}_{n} - \bar{x}_{s}\)) is 0.593.↩︎
This condition is also called “homogeneity of variance”.↩︎
First compute the pooled standard deviation: \(s_{\textit{pool}} = \sqrt{\frac{(n_n - 1) s^2_n + (n_s - 1) s^2_s}{n_n + n_s - 2}}\\ = \sqrt{\frac{(867 - 1) 1.233^2 + (114 - 1) 1.597^2}{867 + 114 - 2}} = 1.28\). Then use \(s_{\textit{pool}}\) to find the standard error: \(SE(\bar{x}_{n} - \bar{x}_{s}) = s_{\textit{pool}} \sqrt{1 / n_{n} + 1 / n_{s}}\\ = 1.28 \sqrt{1 / 867 + 1 / 114} = 0.128\).↩︎
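A short R sketch of the pooled standard deviation and standard error above (the group sizes and standard deviations are the values quoted in the footnote):

```r
n_n <- 867
s_n <- 1.233
n_s <- 114
s_s <- 1.597

s_pool <- sqrt(((n_n - 1) * s_n^2 + (n_s - 1) * s_s^2) / (n_n + n_s - 2))
s_pool                            # about 1.28

s_pool * sqrt(1 / n_n + 1 / n_s)  # standard error, about 0.128
```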
You can watch an episode of John Oliver on Last Week Tonight to explore the present day offenses of the tobacco industry. Please be aware that there is some adult language.↩︎
The diamonds data used in this exercise can be found in the ggplot2 R package.↩︎
The lizard_run data used in this exercise can be found in the openintro R package.↩︎
The chickwts data used in this exercise can be found in the datasets R package.↩︎
The epa2021 data used in this exercise can be found in the openintro R package.↩︎
The bootstrapped differences in sample means vary roughly from 0.7 to 7.5, a range of $6.80. Although the bootstrap distribution is not symmetric, using the empirical rule (that with bell-shaped distributions, most observations are within two standard errors of the center), the standard error of the mean differences is approximately $1.70. You might note that the standard error calculation given in Section 15.3 is \(SE(\bar{x}_{diff}) = \sqrt{s^2_{diff}/n_{diff}} = \sqrt{13.4^2/68} = \$1.62\) (values from Section 15.3), very close to the bootstrap approximation.↩︎
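The formula-based standard error at the end of that footnote, reproduced in R (13.4 and 68 are the values quoted from Section 15.3):

```r
s_diff <- 13.4   # standard deviation of the price differences
n_diff <- 68     # number of paired observations

sqrt(s_diff^2 / n_diff)   # about 1.62, close to the bootstrap value of roughly 1.70
```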
The average price difference is only mildly useful for this question. Examine the distribution shown in Figure 15.10. There are certainly a handful of cases where Amazon prices are far below the UCLA Bookstore’s, which suggests it is worth checking Amazon (and probably other online sites) before purchasing. However, in many cases the Amazon price is above what the UCLA Bookstore charges, and most of the time the price isn’t that different. Ultimately, if getting a book immediately from the bookstore is notably more convenient, e.g., to get started on reading or homework, it’s likely a good idea to go with the UCLA Bookstore unless the price difference on a specific book happens to be quite large. For reference, this is a very different result from what we (the authors) had seen in a similar dataset from 2010. At that time, Amazon prices were almost uniformly lower than those of the UCLA Bookstore’s and by a large margin, making the case to use Amazon over the UCLA Bookstore quite compelling at that time. Now we frequently check multiple websites to find the best price.↩︎
The hsb2 data used in this exercise can be found in the openintro R package.↩︎
The climate70 data used in this exercise can be found in the openintro R package.↩︎
The friday data used in this exercise can be found in the openintro R package.↩︎
\(H_0:\) The average on-base percentage is equal across the four positions. \(H_A:\) The average on-base percentage varies across some (or all) groups.↩︎
See, for example, this blog post.↩︎
See additional details on ANOVA calculations for interested readers.↩︎
There are \(k = 3\) groups, so \(df_{G} = k - 1 = 2.\) There are \(n = n_1 + n_2 + n_3 = 429\) total observations, so \(df_{E} = n - k = 426.\) Then the \(F\)-statistic is computed as the ratio of \(MSG\) and \(MSE:\) \(F = \frac{MSG}{MSE} = \frac{0.00803}{0.00158} = 5.082 \approx 5.077.\) \((F = 5.077\) was computed by using values for \(MSG\) and \(MSE\) that were not rounded.)↩︎
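The degrees of freedom and F statistic in this footnote, as a quick R check (MSG and MSE are the rounded values quoted above):

```r
k <- 3          # number of groups
n <- 429        # total number of observations

k - 1           # df_G = 2
n - k           # df_E = 426

MSG <- 0.00803
MSE <- 0.00158
MSG / MSE       # about 5.08; unrounded MSG and MSE give 5.077
```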
The Cuckoo data used in this exercise can be found in the Stat2Data R package.↩︎
The Cuckoo data used in this exercise can be found in the Stat2Data R package.↩︎
The answer to this question relies on the idea that statistical data analysis is somewhat of an art. That is, in many situations, there is no “right” answer. As you do more and more analyses on your own, you will come to recognize the nuanced understanding which is needed for a particular dataset. In terms of the Great Depression, we will provide two contrasting considerations. Each of these points would have very high leverage on any least-squares regression line, and years with such high unemployment may not help us understand what would happen in other years where the unemployment is only modestly high. On the other hand, these are exceptional cases, and we would be discarding important information if we exclude them from a final analysis.↩︎
We look in the second row corresponding to the family income variable. We see the point estimate of the slope of the line is -0.0431, the standard error of this estimate is 0.0108, and the \(t\)-test statistic is \(T = -3.98\). The p-value corresponds exactly to the two-sided test we are interested in: 0.0002. The p-value is so small that we reject the null hypothesis and conclude that family income and financial aid at Elmhurst College for freshmen entering in the year 2011 are negatively correlated and the true slope parameter is indeed less than 0, just as we believed in our analysis of Figure 5.14.↩︎
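The test statistic quoted in this footnote is simply the estimate divided by its standard error; a one-line R check (values from the regression output described above):

```r
estimate <- -0.0431   # slope estimate for family income
se       <- 0.0108    # standard error of the slope

estimate / se         # about -3.99, matching the reported T = -3.98 up to rounding
```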
The trend appears to be linear, the data fall around the line with no obvious outliers, the variance is roughly constant. These are also not time series observations. Least squares regression can be applied to these data.↩︎
The bdims data used in this exercise can be found in the openintro R package.↩︎
The births14 data used in this exercise can be found in the openintro R package.↩︎
The cats data used in this exercise can be found in the MASS R package.↩︎
The bac data used in this exercise can be found in the openintro R package.↩︎
data used in this exercise can be found in the openintro R package.↩︎