3 Exercises
3.1 Exercise 1 - cheese dataset
- Choose the best two variables among (H2S, log(H2S), Lactic Acid, log(Lactic Acid)) to explain Taste and construct a multiple linear regression model using them.
In Practical 1, log(H2S) was chosen as the variable with the strongest linear relationship with Taste. To choose a second variable, we can take another look at the plots created in Practical 1.
# taste vs lactic acid
plot(Taste ~ Lactic.Acid, data = cheese, xlab = "Lactic acid concentration", ylab = "Taste score")
# taste vs H2S
plot(Taste ~ H2S, data = cheese, xlab = "H2S concentration", ylab = "Taste score")
# taste vs log lactic acid
plot(Taste ~ log(Lactic.Acid), data = cheese, xlab = "Log lactic acid concentration", ylab = "Taste score")
# taste vs log H2S
plot(Taste ~ log(H2S), data = cheese, xlab = "Log H2S concentration", ylab = "Taste score")
The two best variables to explain Taste are log(H2S) and Lactic Acid.
- Estimate the coefficients using the vector-matrix formulation and check they are same as the R output.
Remember to define the design matrix and the response vector correctly.
The R command to create the design matrix is ____.
Give your answers to 2 decimal places.
The intercept is roughly ____.
The coefficient describing the effect of log(H2S) is roughly ____.
The coefficient describing the effect of Lactic Acid is roughly ____.
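As a hedged illustration of the vector-matrix formulation, the sketch below computes the least squares estimate (X'X)^(-1) X'y by hand and compares it with the coefficients returned by lm(). It runs on simulated data rather than the cheese dataset: the names sim, x1, x2 and y, and the coefficients used to generate the data, are assumptions made purely for illustration.

```r
# Simulated stand-in data (NOT the cheese dataset; all names are hypothetical)
set.seed(1)
n <- 30
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.5 * sim$x2 + rnorm(n)

# Design matrix: a column of ones for the intercept, then one column per variable.
# model.matrix(~ x1 + x2, data = sim) would build the same columns.
X <- cbind(1, sim$x1, sim$x2)

# Response vector
y <- sim$y

# Least squares estimate: (X'X)^(-1) X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Same model fitted with lm() for comparison
fit <- lm(y ~ x1 + x2, data = sim)

print(round(as.vector(beta_hat), 2))
print(round(unname(coef(fit)), 2))
```

The same steps should carry over to the exercise by using Taste as the response vector and the two chosen variables as the non-intercept columns of the design matrix.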
- Interpret the estimated coefficients.
The model tells us that for every 1 unit increase in log(H2S), the Taste score goes up by roughly ____, holding Lactic Acid constant.
Similarly, for every 1 unit increase in Lactic Acid, the Taste score goes up by roughly ____, holding log(H2S) constant.
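To make the "per unit increase" reading concrete, the following sketch checks that two predictions which differ by one unit in a single explanatory variable, with the other variable held fixed, differ by exactly that variable's estimated coefficient. The data are simulated and the names sim, x1, x2 and y are hypothetical; nothing here uses the cheese dataset.

```r
# Simulated stand-in data (hypothetical names and coefficients)
set.seed(2)
sim <- data.frame(x1 = runif(40, 0, 10), x2 = runif(40, 0, 5))
sim$y <- 1 + 3 * sim$x1 + 2 * sim$x2 + rnorm(40)

fit <- lm(y ~ x1 + x2, data = sim)

# Two new observations: x1 differs by one unit, x2 is held at the same value
new_obs <- data.frame(x1 = c(4, 5), x2 = c(2, 2))
pred <- predict(fit, newdata = new_obs)

# The change in the fitted value equals the coefficient of x1
change_in_fit <- unname(diff(pred))
slope_x1 <- unname(coef(fit)["x1"])
print(c(change = change_in_fit, slope = slope_x1))
```

Note that when the explanatory variable is itself a log, such as log(H2S), a one unit increase is on the log scale rather than on the original concentration scale.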
3.2 Exercise 2 - Nicolas Cage data
Hollywood legend Nicolas Cage seems to have a problem. It appears that every time he releases a new film upon the world, many people drown by falling into pools of water. Coincidence? Or are some of his films that bad?
Data: Cage.csv
Read in the data using:
Cage <- read.csv("Cage.csv")
- Produce a scatterplot of NumDrowned (y) against NumFilms (x).
plot(NumDrowned ~ NumFilms, data = Cage, xlab = "Number of Nicolas Cage films released in a year", ylab = "Number of people who drowned falling into pools that year")
- Use the cor.test command to perform a correlation hypothesis test. What does this tell us about the relationship between NumDrowned and NumFilms?
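The sample correlation coefficient is 0.666 (to 3 decimal places).
According to the hypothesis test carried out, we reject the null hypothesis of zero correlation at the 5% significance level (p-value = 0.025). This means we conclude the true correlation between the two variables is not equal to zero.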
cor.test(Cage$NumDrowned, Cage$NumFilms)
##
## Pearson's product-moment correlation
##
## data: Cage$NumDrowned and Cage$NumFilms
## t = 2.6785, df = 9, p-value = 0.02527
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1101273 0.9045101
## sample estimates:
## cor
## 0.6660043
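The figures in this output can be reproduced from the sample correlation and the degrees of freedom alone. The sketch below is a cross-check using the standard formulas for Pearson's test (a t statistic on n - 2 degrees of freedom and a Fisher z confidence interval); it takes r = 0.6660043 and df = 9 from the output above, and the sample size n = 11 is inferred from df = n - 2 rather than read from the data file.

```r
# Values copied from the cor.test output
r  <- 0.6660043
df <- 9
n  <- df + 2  # inferred sample size, since df = n - 2

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# 95% confidence interval via the Fisher z transformation
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

print(round(c(t = t_stat, p = p_value), 4))
print(round(ci, 4))
```

The printed values should agree with the t statistic, p-value and confidence interval reported by cor.test, up to rounding.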
This is an example of spurious correlation, where two variables that have no real connection to each other nevertheless appear to be related in the sample. The number of Nicolas Cage films released in a year is clearly not related to the number of drowning accidents in that same year, but if we took our correlation hypothesis test at face value, we would think otherwise.
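One way to see how such a result can arise by chance is a small simulation. The sketch below repeatedly draws two variables that are independent by construction, each with 11 observations (matching df = 9 in the test above), and records how often cor.test gives a p-value below 0.05. The normal distributions, the number of simulations and all variable names are assumptions for illustration only.

```r
set.seed(123)
n_sims <- 2000  # number of simulated datasets (arbitrary choice)
n <- 11         # observations per dataset, matching df = 9 above

# For each simulated dataset, draw two independent variables and test their correlation
p_values <- replicate(n_sims, {
  x <- rnorm(n)
  y <- rnorm(n)
  cor.test(x, y)$p.value
})

# Proportion of tests that look "significant" at the 5% level
false_positive_rate <- mean(p_values < 0.05)
print(false_positive_rate)
```

The proportion should come out close to 0.05: even when no relationship exists, about one test in twenty is expected to look significant at the 5% level, so a small p-value on its own does not establish a meaningful relationship.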
3.3 Exercise 3 - Context: identifying relationships
For each of the following contexts, determine whether fitting a regression model would be appropriate. If a regression model is appropriate, identify which variable is the response variable and which is the explanatory variable.
- Is federal spending, on average, higher or lower in countries with high rates of poverty?
Regression model appropriate?
Federal spending: Poverty rates:
- A study was conducted to determine whether surgery or chemotherapy results in higher survival rates for a certain type of cancer.
Regression model appropriate?
Type of treatment: Survival rates:
- A study found that, overall, left-handed people die at a younger age than right-handed people.
Regression model appropriate?
Age of death: Left- or right-handed:
- Per capita cheese consumption is correlated with the number of people who died getting tangled in bed sheets.
Regression model appropriate?
Number of people who died getting tangled in bed sheets: Per capita cheese consumption:
- An experiment was conducted to test the effects of sleep deprivation on human reaction times.
Regression model appropriate?
Hours of sleep: Reaction times:
- A study was conducted in order to predict the GPA of university students given their high school GPA.
Regression model appropriate?
GPA of university students: High school GPA:
- A company wants to know if there is a significant relationship between its advertising expenditures and its sales volume.
Regression model appropriate?
Sales volume: Advertising expenditures:
- A sample of insured drivers with similar insurance policies was randomly selected. Interest is in determining whether there is a significant relationship between driving experience and insurance premium.
Regression model appropriate?
Driving experience: Insurance premium:
- Ice cream sales are correlated with murder rates in the US.
Regression model appropriate?
Murder rates: Ice cream sales: