# Chapter 3 Jan 20–26: T-Tests and Linear Regression

Important Request: As you read through this week’s chapter, please note down anything that you have questions about. Then we will address these questions when we have our video calls.

This week, our goals are to…

1. Apply the basic logic of hypothesis testing.

2. Conduct a t-test on RCT-style data.

3. Use simple linear regression to make a prediction about a dependent variable using one independent variable.

4. Calculate a residual error for each member of a dataset, based on regression results.

• On a Windows or Linux computer, to open a link in a new browser tab, hold down the control key and then click on the link.
• On a Mac computer, to open a link in a new browser tab, hold down the command key and then click on the link.

## 3.1 Hypothesis Tests

### 3.1.1 What is a Hypothesis Test?

Read the following, if it is not yet familiar to you:

### 3.1.2 Two-Sample Hypothesis Tests (T-Test)

T-tests are used when we want to evaluate the difference between two means. If you want to compare more than two means, you would use a different statistical test (ANOVA, which we will cover soon). We will focus on two forms of the t-test: the independent samples and dependent samples t-tests.

### 3.1.3 Independent Samples T-Test

An independent samples t-test is used when you have means from two separate groups that you want to compare. For example, you might have math exam scores from some men and some women, and you want to see if there is a gender difference. Because you have two independent groups—men and women—an independent samples t-test would be appropriate. This is considered a between-subjects design.

The null hypothesis for a t-test is that the difference between means (of the women’s scores and men’s scores) is 0:

$H_0: \mu_1 - \mu_2 = 0$

In other words, we start with the assumption that there is no difference between men and women. If we find evidence that there is a difference, we will reject the null hypothesis (which is the assumption that they’re the same). If we do not find any evidence that the math scores of the sampled men and the sampled women are different, we fail to reject the null hypothesis.

Why don’t we just say that we “accept the null hypothesis”? Because we don’t know for sure if the null hypothesis is true or not. All we know is that we didn’t find any evidence to say otherwise. So the null hypothesis might be true, but maybe there is truly a difference between the two groups and we just didn’t have enough data to detect it. This is tricky stuff to get used to at first. We can talk more about this and go through more examples together on our video calls.

If two samples come from the same population, we expect them to have equal means (although, with sampling variation, they may not be exactly equal). Under the null hypothesis, we expect that there are no differences between the groups (the experimental manipulation didn’t have an effect).

There are three assumptions of the independent samples t-test that must be met in order for the results to be valid and interpretable:

1. Homogeneity of variances: We assume that the two groups have the same variance.
2. Normality: The populations from which the samples are drawn are normally distributed.
3. Independence of scores: Each individual (or observation) contributes only one score (data point). If they submit more than one score, then those responses (scores) are correlated with each other and therefore not independent.
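For reference, when these assumptions hold, the independent samples t-statistic compares the difference between the sample means to a pooled estimate of variability:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}, \qquad s_p = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

where $\bar{x}_1, \bar{x}_2$ are the two sample means, $s_1^2, s_2^2$ the sample variances, and $n_1, n_2$ the sample sizes. The test has $n_1+n_2-2$ degrees of freedom. You will not need to compute this by hand this week; R does it for you.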

### 3.1.4 Dependent Samples T-test

A dependent samples (or paired) t-test is used when the two means you want to compare come from the same person. For example, you might have pre-test and post-test scores for individuals after they have undergone some sort of intervention. You want to know if their scores increased (or decreased) from pre-test to post-test. This is a within-subjects design.
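As a quick illustration (using made-up pre/post scores, not data from this chapter), a paired t-test in R looks like this:

```r
# Hypothetical pre-test and post-test scores for five people (made-up data)
pre  <- c(10, 12, 9, 14, 11)
post <- c(13, 14, 10, 16, 12)

# Paired t-test: is the mean of the differences (post - pre) equal to 0?
result <- t.test(post, pre, paired = TRUE)
result$p.value    # a small p-value means we reject the null of no change
mean(post - pre)  # average change in this sample
```

Note the `paired = TRUE` argument; without it, `t.test()` treats the two vectors as independent samples.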

### 3.1.5 Errors in Hypothesis Testing

There are two types of errors that occur in hypothesis testing:

• Type I error: rejecting the null hypothesis when it’s actually true.
• Type II error: failing to reject the null hypothesis when it’s actually false.
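To build intuition about Type I errors, here is a small simulation (not part of the assignment): when the null hypothesis is actually true, a t-test at the 5% significance level will still wrongly reject it about 5% of the time.

```r
set.seed(42)  # for reproducibility

# Draw both "groups" from the SAME population, so the null hypothesis is
# true by construction, and record the p-value of a t-test comparing them.
p_values <- replicate(2000, {
  g1 <- rnorm(10)
  g2 <- rnorm(10)
  t.test(g1, g2)$p.value
})

# Proportion of (false) rejections at alpha = 0.05 -- close to 0.05
mean(p_values < 0.05)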

Look through the following resources on these types of errors:

### 3.1.6 T-Tests in R

• Unpaired Two-Samples T-test in R – You can run the commands at this page if you wish, as you read through. In this week’s assignment, you will have to do a very similar procedure but on different data.

The following resources may also be useful:

## 3.2 Simple Linear Regression

This week we are going to explore simple linear regression, which is a method for estimating the linear association between a dependent variable and one independent variable.

Sometimes, the slope of the regression line is referred to with the Greek letter beta. Keep in mind that this simply means the slope of the line.[1]
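In symbols, the simple linear regression model for the population is usually written as:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ is the intercept, $\beta_1$ is the slope (the beta referred to above), and $\varepsilon$ is the error term.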

When we do a regression and get an estimate of a slope of the relationship between Y and X, there are two possibilities:

1. In the population from which the sample is drawn, there is no true relationship between Y and X. The slope is 0. As you have more or less of X, Y doesn’t change at all. This is called the null hypothesis.
2. In the population from which the sample is drawn, there is a true, non-zero relationship between Y and X. This is called the alternative hypothesis.

If the p-value of your regression estimate is less than 0.05 (or 5%), then (assuming your regression meets other conditions that we will discuss later) you can conclude that Scenario #2 above is correct and that the estimate is trustworthy for the population. In statistical jargon, this is called rejecting the null hypothesis, because our analysis had sufficient evidence to make us 95% certain that the alternative hypothesis is true. In general, (1 − p) is our level of certainty about the alternative hypothesis.

## 3.3 Assignment

Make sure that you carefully show all of the work you did to get to your answers. For tasks that require you to use R, make sure you put all of the R commands into a new R script file (just like last week) and you submit the commands you used as well as the output you get as part of your assignment.

### 3.3.1 T-Test Without R

For this part of the assignment, we will use the 2 Sample T-Test tool, which I will call the tool throughout this section of the assignment.

Imagine that we, some researchers, are trying to answer the following research question: How does fertilizer affect plant growth?

We conduct a randomized controlled trial in which some plants are given a fixed amount of fertilizer (treatment group) and other plants are given no fertilizer (control group). Then we measure how much each plant grows over the course of a month. Let’s say we have ten plants in each group and we find the following amounts of growth.

The 10 plants in the control group each grew this much (each number corresponds to one plant’s growth):

3.8641111
4.1185322
2.6598828
0.3559656
2.8639095
0.9020122
5.0527020
2.3293899
3.5117162
4.3417785

The 10 plants in the treatment group each grew this much:

7.532292
1.445972
6.875600
6.518691
1.193905
4.659153
3.512655
4.578366
8.791810
4.891557

Delete the numbers that are pre-populated in the tool. Copy and paste our control data in as Sample 1 and our treatment data in as Sample 2.

Task 1: What is the mean and standard deviation of the control data? What is the mean and standard deviation of the treatment data? Do not calculate these by hand; the tool reports them in its sample summary section.

You’ll see that the tool has drawn the distributions of the data for our treatment and control groups. That’s how you can visualize the effect size (impact) of an RCT. It has also given us a verdict at the bottom that the “Sample 2 mean is greater.” This means that this particular statistical test (a t-test) concludes that we are more than 95% certain that sample 1 (the control group) and sample 2 (the treatment group) are drawn from separate populations. In this case, the control group is sampled from the “population” of plants that didn’t get fertilizer and the treatment group is sampled from the “population” of those that did.

This process is called inference. We are making the inference, based on our 20-plant study, that in the broader population of plants, fertilizer is associated with more growth. The typical statistical threshold for inference is 95% certainty. In the difference of means section of the tool, you’ll see p = 0.0468 written. This is called a p-value. The following formula gives us the percentage of certainty we have in a statistical estimate, based on the p-value (which is written as p): $$\text{Level of Certainty} = (1-p)\times 100$$. To be 95% certain or higher, the p-value must be equal to 0.05 or lower. That’s why you will often see p<0.05 written in studies and/or results tables.
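Plugging our p-value into the formula above:

```r
# Convert a p-value into the chapter's "level of certainty" percentage
p <- 0.0468
(1 - p) * 100  # 95.32, which clears the 95% threshold
```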

With these particular results, our experiment found statistically significant evidence that fertilizer is associated with plant growth.

Task 2: What was the null hypothesis in this t-test that we conducted?

Task 3: What was the alternate hypothesis in this t-test that we conducted?

Now, click on the radio buttons next to ‘Sample 1 summary’ and ‘Sample 2 summary.’ This will allow you to compare different distributions to each other quickly, without having to change the numbers (raw data) above. Let’s imagine that the treatment group had not grown as much as it did. Change the Sample 2 mean from 5 to 4.5.

Task 4: What is the new p-value of this t-test, with the new mean for Sample 2? What is the conclusion of our experiment, with these new numbers? Use the proper statistical language to write your answer.

Task 5: Gradually reduce the standard deviation of Sample 2 until the results are statistically significant at the 95% certainty level. What is the relationship between the standard deviation of your samples and our ability to distinguish them from each other statistically?[2]

### 3.3.2 T-Test in R

Please make a new R script file for this week’s assignment, just like you did last week.

To complete this part of the assignment, you may need to install and load packages in R. For example, you will need to use the package ggpubr. Before you use any of the commands from this package, you need to load it into R, by adding the following line of code to your R script file:

library(ggpubr)

If this gives you an error message, it means the package is not installed yet (and therefore cannot be loaded). To install the package, run this command:

install.packages("ggpubr")

You should only have to run the above command once ever, not every time you use R.

Then once again run the following command to load the package (you DO need to run this command every time you use this package in R):

library(ggpubr)

Task 6: Load the plant growth data from the previous section of this assignment into R, using the code below. This code should go into your script file and not directly into the console.

control <- c(3.8641111,4.1185322,2.6598828,0.3559656,2.8639095,0.9020122,5.0527020,2.3293899,3.5117162,4.3417785)
treatment <- c(7.532292,1.445972,6.875600,6.518691,1.193905,4.659153,3.512655,4.578366,8.791810,4.891557)

my_data <- data.frame(
  group = rep(c("control", "treatment"), each = 10),
  growth = c(control, treatment)
)

Task 7: Now run View(my_data) to see the dataset you just created. The name of the dataset you created is my_data. Does my_data correctly match the data presented earlier in the assignment?

Now your data is loaded into R. You just learned one way to manually load data into R.

Task 8: How many observations (rows) are in my_data? Use R to figure this out. Hint: If your dataset’s name was plants, you would run the command nrow(plants). If your dataset’s name was cnepwhdld, then you would run the command nrow(cnepwhdld).

Task 9: What are the names of the variables in your dataset? Use R to figure this out. Hint: If your dataset’s name was plants, you would run the command names(plants).

Task 10: How many variables are in your dataset? Use R to figure this out. Hint: If your dataset’s name was plants, you would run the command ncol(plants) or length(names(plants)).

For the next few tasks, refer to the Unpaired Two-Samples T-test in R resource page that you read earlier. You should modify the code you find at this page to answer the questions below.
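As a syntax reference (a sketch, not a substitute for working through the resource page), R’s t.test() can take a formula relating a measurement column to a grouping column:

```r
# Rebuild my_data as in Task 6
control <- c(3.8641111,4.1185322,2.6598828,0.3559656,2.8639095,0.9020122,5.0527020,2.3293899,3.5117162,4.3417785)
treatment <- c(7.532292,1.445972,6.875600,6.518691,1.193905,4.659153,3.512655,4.578366,8.791810,4.891557)
my_data <- data.frame(
  group = rep(c("control", "treatment"), each = 10),
  growth = c(control, treatment)
)

# Unpaired t-test assuming equal variances (var.equal = TRUE);
# the default, var.equal = FALSE, runs Welch's t-test instead
res <- t.test(growth ~ group, data = my_data, var.equal = TRUE)
res
```

Whether to use the equal-variance version or Welch’s version depends on the F-test you will run in Task 15.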

We again return to the same research question from earlier in this assignment: How does fertilizer affect plant growth?

Task 11: Visualize the data using box plots, using the ggboxplot function from the ggpubr package. Include the plot in your submitted assignment. Which group has a higher mean?

Now let’s check the t-test assumptions.

Task 12: Are the two samples (treatment and control) independent? Just answer this using your brain and the example at the Unpaired Two-Samples T-test page (don’t use R).

Task 13: Are the data from the two samples normally distributed? Use the Shapiro-Wilk test as described at the resource page. Include the result you get and your interpretation of it. Be sure to do this separately for the treatment and control groups.

Task 14: Further assess the normality of the samples by making a density plot and a Q-Q plot, as described here: Normality Test in R. Make sure to do it separately for treatment and control. Include your plots with your submitted assignment.

Task 15: Do the two samples (treatment and control) have similar variances? Use the F-test demonstrated in the Unpaired Two-Samples T-test page to figure this out. Make sure you include your results as well as your interpretation of them.

Task 16: Conduct a t-test using the method described under Compute independent t-test - Method 2. Include the output in your assignment along with your interpretation. Is this result the same as it was in the previous section of this assignment?

### 3.3.3 Exploring Linear Relationships Some More

We’re going to start this section by looking at the correlation coefficient, another way to determine how related two variables[3] are to one another.

Look at the following fitness dataset containing five people:

• WeeklyWeightliftHours is the number of hours per week the person spends weightlifting.
• WeightLiftedKG is how much weight the person could lift on the day of the survey.
Name <- c("Person A","Person B","Person C","Person D","Person E")
WeeklyWeightliftHours <- c(3,4,4,2,6)
WeightLiftedKG <- c(20,30,21,25,40)

fitness <- data.frame(Name, WeeklyWeightliftHours, WeightLiftedKG)
fitness
      Name WeeklyWeightliftHours WeightLiftedKG
1 Person A                     3             20
2 Person B                     4             30
3 Person C                     4             21
4 Person D                     2             25
5 Person E                     6             40

Task 17: What is a reasonable research question that we could ask with this data?

Task 18: What is the dependent variable and independent variable for a quantitative analysis that we could do to answer this research question?

Task 19: What is the correlation coefficient for WeightLiftedKG and WeeklyWeightliftHours? Show all of your work/calculations.

Here’s the answer, but you still need to do and show the work correctly:

# Calculate correlation between two vectors/variables
CorrelationWeightHours <- cor(fitness$WeeklyWeightliftHours,fitness$WeightLiftedKG)
CorrelationWeightHours
## [1] 0.7677303

And here’s what it looks like visually:

plot(fitness$WeeklyWeightliftHours,fitness$WeightLiftedKG)
reg1 <-  lm(fitness$WeightLiftedKG~fitness$WeeklyWeightliftHours)
abline(reg1)

Now let’s look at the linear regression output for this data:

summary(reg1)
##
## Call:
## lm(formula = fitness$WeightLiftedKG ~ fitness$WeeklyWeightliftHours)
##
## Residuals:
##      1      2      3      4      5
## -3.818  1.955 -7.045  5.409  3.500
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     11.136      8.199   1.358    0.267
## fitness$WeeklyWeightliftHours    4.227      2.037   2.075    0.130
##
## Residual standard error: 6.043 on 3 degrees of freedom
## Multiple R-squared:  0.5894, Adjusted R-squared:  0.4525
## F-statistic: 4.307 on 1 and 3 DF,  p-value: 0.1296

Task 20: At the bottom of the regression output, you’ll see Multiple R-squared: 0.5894. What is the relationship between Multiple R-squared and the correlation coefficient that you calculated earlier on the same data? Hint: the correlation coefficient is often referred to as R.

Now let’s look at some other data which is less correlated:

Name2 <- c("Person F","Person G","Person H","Person I","Person J")
WeeklyWeightliftHours2 <- c(3,4,4,1,3)
WeightLiftedKG2 <- c(20,30,21,20,35)

fitness2 <- data.frame(Name2, WeeklyWeightliftHours2, WeightLiftedKG2)
fitness2
     Name2 WeeklyWeightliftHours2 WeightLiftedKG2
1 Person F                      3              20
2 Person G                      4              30
3 Person H                      4              21
4 Person I                      1              20
5 Person J                      3              35

# Correlation in dataset fitness2
cor(fitness2$WeeklyWeightliftHours2,fitness2$WeightLiftedKG2)
## [1] 0.3251082

plot(fitness2$WeeklyWeightliftHours2,fitness2$WeightLiftedKG2)
reg2 <- lm(fitness2$WeightLiftedKG2~fitness2$WeeklyWeightliftHours2)
abline(reg2)

Above, the two variables are much less correlated with each other.

Task 21: What would the R^2 be in the regression output for the linear regression on these two variables in the fitness2 data?

Now we’re going to learn more about the regression output. First, watch the following video for a review, if you would like:

Let’s step back and think about why we do regressions. Of course, we do them to see if the dependent variable and independent variables are associated with each other statistically, but we also do them to find out if the trends that we see in our data are (or are not) similar to those in the population at large. Consider the datasets above about weightlifting. Let’s say we wanted to know about the hours spent lifting and weight lifted in Boston. So the people in Boston would be our population of interest. The five people in the fitness dataset are five people that we surveyed out of this population. These five people are our sample. These are important terms to remember. Our goal is to use the sample (the data that we do have) to learn whatever we can about the population as a whole.

When we did the regression reg1[4] we found that an additional hour of weightlifting is associated with a predicted additional 4.227 kilograms of weight lifted. But that’s only for our sample of five people. What about all of Boston? That’s where inference comes in. Inference is when you use your sample to attempt to figure things out about your whole population. And this is what the Std. Error (standard error), t value, and Pr(>|t|) (p-value) columns in the regression output are all about.

To reiterate, we have our regression line for the five people. You saw this line drawn in the scatterplot earlier when we were talking about correlations. But what would the line look like for the entire population of Boston? Would it look the same, or would the slope be different? If we want to know the true slope of the regression line in the population (the statistical relationship between hours spent lifting and weight lifted in all of Boston), we have to look at the other columns of the regression output and use inferential statistics.

Now look again at this regression output:

summary(reg1)
##
## Call:
## lm(formula = fitness$WeightLiftedKG ~ fitness$WeeklyWeightliftHours)
##
## Residuals:
##      1      2      3      4      5
## -3.818  1.955 -7.045  5.409  3.500
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                     11.136      8.199   1.358    0.267
## fitness$WeeklyWeightliftHours    4.227      2.037   2.075    0.130
##
## Residual standard error: 6.043 on 3 degrees of freedom
## Multiple R-squared:  0.5894, Adjusted R-squared:  0.4525
## F-statistic: 4.307 on 1 and 3 DF,  p-value: 0.1296

Task 22: What is the equation of the regression line? Be sure to include a slope and an intercept,[5] and write the equation in the format $$y = mx+b$$.

Task 23: In the output above, what is the p-value for the WeeklyWeightliftHours coefficient (estimate) of 4.227? What does this p-value mean for the question of whether the true population regression line also has a slope of 4.227?

Task 24: In the output above, what do we get when we divide the coefficient for WeeklyWeightliftHours by its standard error? Do you see that number somewhere else in the same line of the regression table?

### 3.3.4 Predicted Values and Residuals

Task 25: Copy the fitness data into a table in Word or Excel.[6] Here’s a copy of it, so that you don’t have to go up and find it:

fitness
      Name WeeklyWeightliftHours WeightLiftedKG
1 Person A                     3             20
2 Person B                     4             30
3 Person C                     4             21
4 Person D                     2             25
5 Person E                     6             40

Task 26: Add new columns to this table for the following items: 1) Predicted WeightLiftedKG, 2) Residual, 3) Difference from mean. Plug each WeeklyWeightliftHours value into the regression equation to get the Predicted WeightLiftedKG for each person. These are called the predicted values or fitted values. This is how much weight the regression line “thinks” each person lifted, based on the data we gave it.

Task 27: Calculate a residual value for each person. This is the difference between the actual and predicted value of how much each person lifted.
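After you have done the hand calculations, R can confirm them: the fitted() and residuals() functions return the predicted values and residuals of a regression (a sketch to check your work, not a replacement for it):

```r
# Fit the same regression as reg1 earlier in the chapter
WeeklyWeightliftHours <- c(3, 4, 4, 2, 6)
WeightLiftedKG <- c(20, 30, 21, 25, 40)
reg1 <- lm(WeightLiftedKG ~ WeeklyWeightliftHours)

fitted(reg1)     # predicted WeightLiftedKG for each person
residuals(reg1)  # actual minus predicted, for each person
```

A useful sanity check: the residuals of a simple linear regression always sum to (essentially) zero.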

Task 28: Are predicted values and residuals calculated for the dependent variable or independent variable?

Task 29: Are predicted values and residuals calculated for the y values or the x values?[7]

Task 30: What is the sum of squares of the residuals? (Take each residual that you just calculated, square it, and then add up those five numbers.) This is called the SSR, or Sum of Squared Residuals.

Task 31: Calculate the difference from the mean of each Y value. Here’s how you do that:

1. Calculate the mean of WeightLiftedKG (across all five people in the data).

2. For each person (row), calculate the difference between the mean and WeightLiftedKG. This is the difference from the mean.

Task 32: Calculate the sum of squares of the difference from the mean. This is called SST, or Sum of Squares Total.

Task 33: Calculate $$1-\frac{SSR}{SST}$$. Show your work. This should be equal to the Multiple R-squared value from the regression output.

Remember: We want SSR to be as low as possible, which will make $$R^2$$ as high as possible. We want $$R^2$$ to be high. SSR is calculated from the residuals, and residuals are errors in the regression model’s prediction. SSR is all of the error totaled up together. So if SSR is low, then error is low. $$R^2$$ is one of the most commonly used metrics for determining how well a regression model fits your data.
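Once you have finished Tasks 30 through 33 by hand, this sketch verifies the identity in R (SSR here means the sum of squared residuals, as defined in Task 30):

```r
# Fit the same regression as reg1 earlier in the chapter
WeeklyWeightliftHours <- c(3, 4, 4, 2, 6)
WeightLiftedKG <- c(20, 30, 21, 25, 40)
reg1 <- lm(WeightLiftedKG ~ WeeklyWeightliftHours)

SSR <- sum(residuals(reg1)^2)                          # sum of squared residuals
SST <- sum((WeightLiftedKG - mean(WeightLiftedKG))^2)  # total sum of squares

1 - SSR/SST  # matches Multiple R-squared from summary(reg1), about 0.5894
```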

Task 34: At the start of this chapter, I requested that you write down any questions you have. Please include them with your submitted assignment. If you don’t have any, just state this in your answer.

Task 35: Please upload your Week 1 assignment (from last week) to the D2L Dropbox. I learned that we are required to use D2L for this, contrary to my instructions from last week. Sorry about any confusion! We will not be using the system I described last week. The dropbox is located at Assessments -> Assignments -> Dropbox for all assignments in D2L.

Task 36: Please upload this assignment that you just finished today to the very same D2L Dropbox used in the previous task. Please follow the same file naming convention that we used last week.

Task 37: Also e-mail your new completed assignment to me at .

1. We have also referred previously to the slope as m (as in mx+b), $$b_1$$ (as in $$b_1x + b_0$$), coefficient, and coefficient estimate.

2. Remember, when we are analyzing the data in an RCT, we are trying to figure out if the treatment and control groups had different or similar results. We are seeing if we can distinguish the two groups from each other in any way. The mean and standard deviation of the data in the two groups are the key parameters that help us tell the treatment and control groups apart, which is why you need to play around with the t-test tool to understand these relationships.

3. Remember that a variable is another name for a column of data, or a vector as R calls it. Just a set of numbers or characteristics.

4. We could call this regression whatever we want, not necessarily reg1.

5. Note that the intercept is also sometimes referred to as the constant.

6. You will have to submit this table with your homework, in case that helps you decide whether to use Word or Excel. You can also hand-write it and take a photo and submit that.

7. This is identical to the previous question but with different terminology.