R is flexible, open-source software for manipulating, visualizing, and processing data. Because the software is free, it is easy to access and download. It needs to be paired with a graphical user interface; we will use RStudio.
RStudio is the interface through which you will run R. This kind of program is known as a GUI, or graphical user interface. Both applications must be downloaded.
R can be downloaded through CRAN, the Comprehensive R Archive Network: https://cran.r-project.org/
(Note) A version number such as R-4.2.3 can be dissected as: major version 4, minor version 2, patch 3.
Be sure never to download applications through suspicious links. Because this application is open source, it should always be free. Here is a verified link for downloading RStudio: https://posit.co/download/rstudio-desktop/
The download process for R differs across Mac, Windows, and Linux, so first determine which operating system you are working on. Steps to download R and RStudio are attached below.
R has a few built-in data sets that you can play around with, which is perfect for learning some basic functions. Let’s look at a data set called ‘mtcars’. You could just type in ‘mtcars’ in your R console and execute it. That will show you every row and column of that data set. Let’s instead use the head() function to look at just the first few rows of the data set.
head(mtcars)
If you execute this line, you should see the following printout in your R console:
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
These are the first 6 rows of the mtcars data set. Each row represents a car model. The data set has 11 columns, each representing a characteristic of the cars: for example, "mpg" stands for "miles per gallon" and "cyl" stands for the number of cylinders.
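As an optional aside, if you want a quick overview of a data set's size and structure without printing any rows, two other base R functions are handy:
dim(mtcars) # the number of rows and columns: 32 and 11
str(mtcars) # a compact summary of each column and its data type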
If you want to examine only one variable at a time, say, “mpg”, then you can pull that out by using the “$” character to specify which column in a dataframe you want to “pull out”.
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
This pulls out and displays all 32 of the mpg values for the cars in this data set. There is a large variety of functions that give you basic descriptive statistics for a given column/variable/vector. Each of the examples in the following code snippets has parentheses where you can put one vector.
mean(mtcars$mpg) # The average mpg
## [1] 20.09062
median(mtcars$mpg) # The median
## [1] 19.2
sd(mtcars$mpg) # The standard deviation
## [1] 6.026948
You can get a histogram for a variable/vector by using the hist() function:
hist(mtcars$mpg)
This is a good time to stop and talk about arguments. All the function calls so far have had only one argument, "mtcars$mpg". For a simple function like mean() or sd(), that's all you need. However, many functions allow for lots of customization. For instance, I'm not 100% satisfied with the output we got from our hist() function call. My main gripe is the label for the x-axis. You can change this by adding a new argument in your function call. Just make sure to separate your arguments with commas:
hist(mtcars$mpg,xlab="Miles per gallon",main="My beautiful data!")
The "xlab" argument lets you change the label used on the x-axis and the "main" argument lets you change the main title atop the figure. If you want to see what a function does, what its arguments are set to by default (before you potentially change them), and what arguments you can include, execute the function's name with a "?" before it, e.g., "?hist". This will bring up a little help page going over this information. The "help" it provides may be a bit confusing to newcomers, but as you engross yourself in the world of R, you'll get the hang of it!
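For example, either of these lines will open that help page:
?hist # opens the help page for hist()
help("hist") # does the exact same thing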
Last thing I’ll go over in this section is how to create a scatter plot. To do this, you use the “plot()” function. You need to supply two arguments and, just like always, you have to separate each argument with a comma.
plot(mtcars$cyl,mtcars$mpg,ylab="Miles per gallon",xlab="# of cylinders")
I added just a little bit of customization at the end with "ylab" and "xlab". As you can see, the more cylinders a car has, the lower its MPG tends to be.
Probably the easiest data format to import into R is the "comma separated values" (.csv) format. Excel files and Google Sheets can be saved in this format using "Save As". You use the function "read.csv()" to import the data. The one mandatory argument for this function is a file path, put in quotes, telling R where to find the file on your computer. For instance, your import call might look something like this:
my_data <- read.csv("~/Desktop/my_data.csv")
You might need to specify a few other things, like whether the first row of the data are headings/labels and whatnot.
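For example, a minimal sketch (using the same hypothetical file path as above; header = TRUE is actually the default, so this just makes it explicit):
my_data <- read.csv("~/Desktop/my_data.csv", header = TRUE) # first row holds column names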
Really, though, the easiest way to import data is to use RStudio's "Import Dataset" dialogue:
If you're opening a CSV file, you'd select "From Text (base)" and follow the dialogue menus from there. You can also open Excel, SPSS, Stata, or SAS files. In each instance, you're prompted to navigate over to where the data set is on your computer, and there are a few options you can toggle on or off (e.g., specify whether the first row is headers/labels). Most of the time, the window will generate the code necessary to import your data so that you can copy and paste it into your script for next time. It might also append a line like "View(my_data)" at the end of this sample code. You probably don't want to keep that part in there. Every time you open your script back up and run it, it's going to show you your data, even if you don't want to look at it just yet.
For all of the examples in this document, I’m going to rely on made up (i.e., simulated) data. Some instructional R books and R documents have you download specific data sets from the web. To simplify things, we’ll just generate things within the document itself.
Let’s say you have some reaction time data paired with age:
set.seed(1987)
rt=rnorm(40,mean=1000,sd=200)
age=rnorm(40,mean=35,sd=4)
The first line, "set.seed(1987)", is used to make sure my "random" data and your "random" data end up matching. (It doesn't matter what number you put in the parentheses, just as long as it matches.) The next few lines use the rnorm() function to create simulated data sets of a certain size (40), with a certain mean and standard deviation. I've "randomly" created 40 reaction time data points ("rt") with a mean of 1000 (msec) and a standard deviation of 200. I've also created a "random" set of 40 ages with a mean of 35 and a standard deviation of 4. If you ran the second two lines, the ones beginning with "rt=" and "age=", without running the first line, then you would get slightly different data compared to the examples here. The "set.seed()" function forces the "random" numbers to be a specific, pre-determined set of random numbers.
You can use the “hist()” function to examine these data.
hist(rt)
This is a visualization of how our randomly generated reaction time data ended up looking. Note that real reaction time data aren’t usually normally distributed like this. They’re usually positively skewed.
Next, we examine the age data:
hist(age)
Now I’ll add some outlier data and show how you would normally remove such data:
rt=c(rt,8000) # Add a reaction time of 8000 msec (8 seconds)
age=c(age,99) # Add a 99 year old to the age data
In both lines, we’re taking a value and re-assigning it to be a concatenation “c()” of itself and one new thing. You could keep manually adding elements to these vectors/variables by comma-separating them.
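For instance, appending two more made-up reaction times at once would look like the following line (shown as a comment so it doesn't alter the data we use below):
# rt = c(rt, 950, 1100) # would append two more values to the end of rt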
Now let’s look at our new data:
hist(rt)
hist(age)
In both cases, there’s a pretty obvious outlier in the data.
Normally, when you import real data, you want to do some basic visualizations like this to try to spot outliers. Before we get to removing them, we have to address the fact that our "data" right now aren't formatted the way they would be had they been imported as a "data frame". Right now, these are just two loose vectors/variables. That's not how they'll come when you import a data frame. To fix this, I'll run the following code:
my_data=data.frame(rt,age)
Now we have a variable saved called "my_data" which is a 41 x 2 data frame. In other words, it has 41 rows (the original 40 plus the outlier we added) and 2 columns, one for "rt" and one for "age". Because of the way we created these data, you could run functions either like this: "hist(age)" or like this: "hist(my_data$age)". Note, however, that when you import data straight in as a data frame, you will have to go straight to using the "$" notation.
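You can peek at the new data frame the same way we did with mtcars:
head(my_data) # shows the first few rows: one "rt" column, one "age" column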
In the current scenario, we might make a rule that says “no one over 85 was included in the data”. One way to do this in your data is to use the “subset()” function. The first argument in this function is the data frame you want to create a subset of. The second argument is a condition for keeping data.
cleaned_data=subset(my_data,my_data$age < 85)
This line creates a new data frame called "cleaned_data" which keeps every row from "my_data" that meets the condition that the "age" for that row is less than ("<") 85.
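As an optional aside, the same subsetting can be done with base R's bracket notation; this is just an equivalent alternative, not something extra you need:
cleaned_data = my_data[my_data$age < 85, ] # rows where age < 85, all columns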
You won’t often need to use R to deal with z-scores. You’re also seldom going to use group z-tests or 1 sample t-tests in practice. Having said that, going through how to do these in R is a great way to learn more about R, as you’ll soon see.
If you have a z score, you might want to know how much of the “bottom” or “top” percent of the population that z score cuts off. In other words, you might want to know the percentile rank associated with a z score. One way to do this is to use the “pnorm()” function. By default, it assumes the mean and sd are 0 and 1, respectively. But if you want to use “raw scores” instead of standardized z-scores, you can specify a different mean and sd.
"pnorm()" is a probability function – hence the "p" – for the normal distribution – hence the "norm". For any value, x, that you plug in, "pnorm()" tells you the cumulative area under the curve associated with that score. So let's start real basic with an x = 0:
pnorm(0)
## [1] 0.5
The output is 0.5, or 50%. That means that a z-score of 0 divides the bottom 50% of the population from the top 50%. In the next code snippet, I’ll simultaneously enter scores ranging from -2 to +2.
pnorm(c(-2,-1.5,-1,-.5,0,.5,1,1.5,2))
## [1] 0.02275013 0.06680720 0.15865525 0.30853754 0.50000000 0.69146246 0.84134475
## [8] 0.93319280 0.97724987
These values show you that a z-score of -2 divides the bottom 2.28% of the population from the remaining 97.72% of the population. A z-score of -1.5 divides the bottom 6.68% of the population from the remaining 93.32% of the population. And so on.
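The same function works with raw scores, too. For example, on a scale with a mean of 100 and an sd of 15 (the IQ scale we're about to use):
pnorm(115, mean = 100, sd = 15) # about 0.84: a score of 115 sits one sd above the mean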
Let’s say you are drawing a sample of people from the general population and giving them IQ tests. We know that the population mean for IQ is 100 and the population sd is 15. So, if I want to simulate a sample of 20 people from the general population…
normal.people.sample=rnorm(20,mean=100,sd=15)
The "rnorm()" function ["random" + "NORMal"] generates random data drawn from a normal distribution. The first argument specifies n, the number of random draws you want to make. By default, the mean and sd are set to 0 and 1, but I've overridden these defaults and specified them to be equal to 100 and 15.
In this situation, we know the null hypothesis is true: There is no difference in IQ between the group of people I just sampled from and the general population. I know this because I literally told R to have the same population parameters as people in the general population.
Now we need to create a z-score that represents how far our sample mean has gotten away from the null-hypothesized sampling distribution of the sample mean. In other words, our null hypothesis says the z-scores for sample means of size 20 should hover around 0, though small deviations from 0 are possible too. The further a sample mean's z-score gets from 0, the less likely that sample mean is to have been observed, according to the null hypothesis.
Recall that the z-score for a sample mean is the distance between the sample mean and the population mean divided by the standard error of the mean (SEM). Here, the SEM is the population standard deviation divided by the square root of the sample size.
distance=mean(normal.people.sample)-100
SEM=15/sqrt(20)
z=distance/SEM
z
## [1] -1.459416
The first line, starting with "distance", calculates the difference between our sample mean of IQ and the known population mean of IQ, which is 100. The next line calculates the standard error of the mean and saves it as a variable called SEM. I use the "sqrt()" function to take the square root of the sample size, which is 20. For now, with a simple example like this, it's okay to "hard code" that 20 in there. In the future, though, you might want to replace the "20" with "the size of my variable": "length(normal.people.sample)". That way, if you re-use your code on new data, you won't need to remember to change the sample size.
The last two lines calculate the z-score associated with our sample mean. Again, this represents how far our sample mean has deviated from the population mean in standard error units: "How many standard errors of the mean away from 0 did we get?"
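Putting that earlier advice to work, here's the same calculation without the hard-coded sample size; it produces the exact same z-score:
distance = mean(normal.people.sample) - 100
SEM = 15/sqrt(length(normal.people.sample)) # length() returns the sample size, 20
z = distance/SEM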
We got a z-score of -1.459416
pnorm(-1.459416)
## [1] 0.07222532
As you can see from the code snippet above, a z-score of -1.459416 only separates roughly the bottom 7% of the area under the curve from the remaining 93%. In other words, observing a z-score of -1.459416 (or lower) has about a 7% chance of occurring when you assume the null hypothesis is true. Our p-value is .072. Since this isn't below .05, it is not statistically significant.
For our results to have been statistically significant (putting α = .05 in a single tail), our z-score would've had to be -1.645 or below (or, for the upper tail, +1.645 or above).
pnorm(-1.645) # 0.04998491
## [1] 0.04998491
pnorm(1.645) # 0.9500151
## [1] 0.9500151
Confidence intervals go hand-in-hand with null hypothesis significance testing, except a confidence interval both tells you whether the null hypothesis is plausible AND gives you extra information. Sample means drawn from a population with a known mean and standard deviation are relatively easy to work with. The interval is the sample mean plus-or-minus the product of the SEM and the z-score that would achieve statistical significance at a given alpha level. For us, our alpha is .05, and as I've just pointed out, a z-score with an absolute value of 1.645 gives you statistical significance.
mean(normal.people.sample) # Our sample mean
## [1] 95.10497
mean(normal.people.sample)-SEM*1.645 # The lower boundary of our 95% confidence interval
## [1] 89.58747
mean(normal.people.sample)+SEM*1.645 # The upper boundary of our 95% confidence interval
## [1] 100.6225
Thus, our sample mean was 95.10 with a 95% CI of 89.59 to 100.62. Since the population mean (100) is in that interval, we cannot reject the population mean as a plausible value. This is virtually identical to saying we cannot reject the null hypothesis.
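As a side note, you can compute both boundaries in one line by multiplying by c(-1, 1):
mean(normal.people.sample) + c(-1, 1)*SEM*1.645 # lower and upper boundaries at once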
Just to help solidify things, let’s redo the foregoing examples with statistically significant results. Let’s say we draw 20 random members from Mensa and give them IQ tests. Mensa is an organization of people whose IQs are in at least the 98th percentile of IQ. For simplicity’s sake, let’s just say that the population average for IQ of Mensa members is 145 with the same standard deviation as the general population, 15. This means that Mensa members have an IQ that is higher than the general population, but no more dispersed or “spread out” compared to the general population.
If you’re a researcher, you formulate a research hypothesis that Mensa members have a higher IQ on average compared to the general population. This isn’t a very bold hypothesis, but roll with me here. The null hypothesis would be: “The sample mean for Mensa members will be close to the same as the population mean of IQ, which we know is 100.”
mensa.members=rnorm(20,145,15) # generate 20 random Mensa IQs
distance=mean(mensa.members)-100 # distance between mean Mensa and population IQ
SEM=15/sqrt(length(mensa.members)) # Standard error of the mean
z=distance/SEM
z
## [1] 14.17585
Our sample mean this time is 14.17585 standard errors away from zero. This is very, very far away from where the null hypothesis says it should be. In fact, if the null hypothesis is true then this z score is in the…
pnorm(z)
## [1] 1
100th percentile. We would have a p-value equal to…
1-pnorm(z)
## [1] 0
…basically zero. If a number is really, really close to 0 or 1, R will just say it’s 0 or 1. Computers have limits.
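If you ever need that tiny tail probability with more precision, pnorm() has a "lower.tail" argument that computes the area above a score directly:
pnorm(z, lower.tail = FALSE) # same as 1 - pnorm(z), but computed more precisely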
For the confidence interval, we get…
mean(mensa.members) # Our sample mean
## [1] 147.5473
mean(mensa.members)-SEM*1.645 # The lower boundary of our interval
## [1] 142.0298
mean(mensa.members)+SEM*1.645 # The upper boundary of our interval
## [1] 153.0647
So, our sample mean of Mensa member IQ is 147.55 with a 95% CI of 142.03 to 153.06.
For the examples above, I put the known population sd in the equation for the SEM, which was 15 in the case of IQ scores. Most times, however, we don’t know the population sd. We have to estimate it from our data. Because of this, we have to use a t-test rather than a group z-test.
Let’s say we have a group of professional sports bettors. These 20 people may not all be 100% accurate at predicting the outcome of sports match-ups, but they claim that they tend to predict who wins correctly better than a coin flip. Let’s create some simulated data…
bettor.accuracy=rnorm(20,55,5)
bettor.accuracy
## [1] 56.79875 52.67903 61.91187 51.99226 55.80113 59.63107 57.56983 65.15500
## [9] 50.75511 57.12218 59.30275 55.42495 56.59515 45.67000 54.85909 56.84561
## [17] 52.05318 53.70589 60.65416 54.98009
I chose to give our bettors a population accuracy of 55%, with individual accuracy levels having an sd of 5%.
As you can see, most of the people in this group were slightly better than chance at their forecasts. One person was a bit worse than a coin flip. On average they were…
mean(bettor.accuracy)
## [1] 55.97536
55.98% accurate. So, as a group, they tend to just slightly outperform a random coin flip. We’re in a situation where we know the “ground truth”. We set the population mean to be 55% accurate, so we know it’s objectively better than a coin flip. As a researcher, though, you only ever have access to your data, so you have to use a statistical test to INFER whether a sample mean is (probably) different from a population mean.
distance=mean(bettor.accuracy)-50
SEM=sd(bettor.accuracy)/sqrt(length(bettor.accuracy))
t=distance/SEM
t
## [1] 6.183936
In the first line (“distance”), I calculate the distance between our sample mean (55.98) and our “population mean”, 50. This isn’t actually a population mean, per se. It’s more of a benchmark. We want to test whether our bettors are better than 50% accurate at forecasting the outcomes to games.
The line starting with "SEM" calculates the standard error of the mean. Rather than just putting "5" in the numerator, which is our population sd, I put the estimated sample standard deviation, "sd(bettor.accuracy)". Remember, in this scenario, because we're using simulations, we know the population sd. In the real world, though, you only ever have your data and must use that data to estimate the population parameters.
The last two lines calculate the t-score, which is 6.18.
t.test(bettor.accuracy,mu=50)
##
## One Sample t-test
##
## data: bettor.accuracy
## t = 6.1839, df = 19, p-value = 6.09e-06
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 53.95293 57.99778
## sample estimates:
## mean of x
## 55.97536
We see here that a t-value equal to 6.1839 with degrees of freedom (df) equal to 19 has a p-value of less than .05 associated with it. In fact, p is equal to .00000609. We reject the null hypothesis that the population bettor accuracy is only 50%. The R output also gives you a 95% confidence interval for the sample accuracy, which ranges from 53.95 to 58.00 (after rounding).
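A handy side note: if you save the output of "t.test()" under a name (the name "my_test" here is arbitrary), you can pull individual pieces out of it with the "$" notation, just like a column in a data frame:
my_test = t.test(bettor.accuracy, mu = 50)
my_test$p.value # just the p-value
my_test$conf.int # just the 95% confidence interval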
To create simulated, correlated data, I'm going to import a library called "faux". If the "faux" package is already installed on your version of R, you can just run the following command to load it:
library(faux)
##
## ************
## Welcome to faux. For support and examples visit:
## https://debruine.github.io/faux/
## - Get and set global package options with: faux_options()
## ************
However, if you don’t have “faux” on your version of R, you’ll need to install it first:
install.packages('faux')
Next, I’ll use the rnorm_multi() function from the “faux” library to create some simulated, correlated data:
my_fake_data=rnorm_multi(n = 30, mu = c(2.5,20), sd = c(0.6,4), r = 0.5, varnames = c("GPA","Self-esteem"))
For the first argument of this function (n = 30), I specified how large a sample I wanted. With the second argument (mu = c(2.5,20)), I set the population means of my two variables to 2.5 and 20. Similarly, (sd = c(0.6,4)) sets the standard deviations for these variables to 0.6 and 4. I set the correlation between the variables to r = 0.5, and I named them with (varnames = c("GPA","Self-esteem")).
Since we saved the results of this function call as “my_fake_data”, we can use the fake data in commands like these:
plot(my_fake_data$GPA~my_fake_data$`Self-esteem`,ylab="GPA",xlab="Self-esteem")
This gives you a scatterplot that visually represents the degree of covariation between the two variables. Remember, though, we set the “population” parameters for these two variables and then “sampled” from this population with only a small set of 30 observations. If you re-sampled, the data might not look like this again.
To get a good rundown on the correlation between two variables, run the “cor.test()” function:
cor.test(my_fake_data$GPA,my_fake_data$`Self-esteem`)
##
## Pearson's product-moment correlation
##
## data: my_fake_data$GPA and my_fake_data$`Self-esteem`
## t = 1.3263, df = 28, p-value = 0.1955
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1283969 0.5547942
## sample estimates:
## cor
## 0.2431183
This printout tells you that the t-value associated with these data is 1.33 with degrees of freedom of 28. The p-value is 0.196, so we would not be able to reject the null hypothesis with this sample (even though we know it's false in this instance!). The 95% confidence interval for the correlation is -0.13 to 0.55. Quite a large interval, and it contains 0, so no wonder we couldn't reject the null. Finally, at the very bottom, the sample correlation coefficient for these data is 0.24.
To run a regression model that predicts GPA based on self-esteem, you run the "lm()" function, which is short for "linear model". You use the "formula" syntax in this function. Basically, you say "y ~ x", where "~" means something like "as a function of" or "as predicted by".
my_model=lm(my_fake_data$GPA ~ my_fake_data$`Self-esteem`)
This creates a linear model (AKA regression model) predicting GPA based on Self-esteem. To see the results, use the “summary()” function:
summary(my_model)
##
## Call:
## lm(formula = my_fake_data$GPA ~ my_fake_data$`Self-esteem`)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82272 -0.46395 -0.03692 0.45995 1.07629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.80280 0.57257 3.149 0.00388 **
## my_fake_data$`Self-esteem` 0.03591 0.02707 1.326 0.19547
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5613 on 28 degrees of freedom
## Multiple R-squared: 0.05911, Adjusted R-squared: 0.0255
## F-statistic: 1.759 on 1 and 28 DF, p-value: 0.1955
The first couple of rows in the printout remind you what the formula was for your model. The "Residuals" section gives you a bit of information about how the model's residuals are distributed.
The real “meat” of the print out is in the “Coefficients:” section. Each row has the name of the parameter in the model, the estimated regression coefficient, the standard error of that coefficient, its t-value, and p value.
We see that the intercept (the predicted GPA for people with 0 self-esteem) is estimated to be 1.80 with a standard error of 0.57, a t-value of 3.15, and a p-value of .004. It’s statistically significant! …but it rarely matters if the intercept in a model is statistically significant.
The regression coefficient for "Self-esteem" is 0.04. This means that, when self-esteem goes up by 1 point, we predict that GPA will go up by 0.04 points. We fast-forward over to the p-value and see that this coefficient is not statistically significant. We cannot reject the null hypothesis that it is actually equal to 0.
The last few rows at the bottom of the printout give some information about general model fit. Probably the most important component is "Multiple R-squared:". This gives you the proportion of variance in the dependent variable "accounted for" by the regression model. In this case, \(R^2\) = 0.059. This means that self-esteem can "account for" about 6% of the variance in GPA.
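Side note: just like the t-test output earlier, the summary can be saved and mined with "$". For example:
summary(my_model)$r.squared # pulls out just the R-squared value (about 0.059)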
For an independent samples t-test, we need two sample means from two different populations. In the following code, I randomly generate GPAs for students in a chess club and students who are “normal”, i.e., not in a chess club.
chess.gpa=rnorm(20,3.5,.25)
normal.gpa=rnorm(20,2.5,.25)
I set the population GPA for the first group to 3.5, and for the second group I set it to 2.5. The SD in both groups is the same. I draw 20 random observations from each population.
t.test(chess.gpa,normal.gpa)
##
## Welch Two Sample t-test
##
## data: chess.gpa and normal.gpa
## t = 11.102, df = 37.854, p-value = 1.818e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8144165 1.1777158
## sample estimates:
## mean of x mean of y
## 3.536128 2.540062
To conduct a t-test on the two sample means, I use the “t.test()” function. Towards the top of the print out, it tells you the t-value, degrees of freedom and p-value. That’s pretty much everything you need for writing up the results in APA style.
Note that R uses "Welch's t-test" by default. This version of the independent samples t-test applies a correction when the two groups have unequal variances. The correction will help you reach the correct conclusion more often when the assumption of equal variances is violated, and it will change very little when the assumption is safe.
If you want to run a “normal” t-test, you can override the default with…
t.test(chess.gpa,normal.gpa,var.equal = TRUE)
##
## Two Sample t-test
##
## data: chess.gpa and normal.gpa
## t = 11.102, df = 38, p-value = 1.727e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.8144395 1.1776928
## sample estimates:
## mean of x mean of y
## 3.536128 2.540062
To do an ANOVA, you usually have at least 3 sample means, so I created a third group of honors students. Their population GPA is the same as the chess club students' (3.5).
# create a third group
honors.gpa=rnorm(20,3.5,.25)
The simplest/default syntax used to run an ANOVA assumes your data are in a “data.frame” format. Right now we just have loose vectors/lists of numbers. To fix that, I’ll create a new vector/list of labels. The idea here is that we’re going to turn all 60 GPAs into one column in a data frame and have labels for each of the GPAs identifying which group they came from.
group=c(
rep("chess",20),
rep("normal",20),
rep("honors",20)
)
This column of labels is saved as "group". It is equal to a concatenation of three elements, "c(first, second, third)". I set each element equal to a character string repeated 20 times. For instance, rep("chess",20) creates a list of the word "chess" repeated 20 times.
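A quick sanity check that the labels came out right:
table(group) # counts each label; "chess", "honors", and "normal" should each appear 20 times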
gpa_data=data.frame(
c(chess.gpa,normal.gpa,honors.gpa),
group
)
colnames(gpa_data)=c("gpa","group")
Next I create a variable called "gpa_data" and set it equal to a data frame whose columns are given as "data.frame(first_column, second_column)". Basically, I set the first column to be equal to all three sets of GPAs concatenated together, "c(chess.gpa,normal.gpa,honors.gpa)". The second column is just "group".
Now the data are set up in a proper data frame. Usually, when you are analyzing data in R, your data will ALREADY be formatted this way without all this painstaking code.
To run an ANOVA, you can use the "aov()" function. It's a good idea to save your ANOVA with a name. Here, I chose "my_first_anova":
my_first_anova=aov(gpa~group,data=gpa_data)
The "aov()" function uses "formula syntax" for its first argument. You basically put your dependent variable first, then the tilde character "~", and your independent variable after. "DV ~ IV" reads something like, "DV is distributed as IV" or "DV as a function of IV." The second argument specifies which data set the variables in the formula come from.
Now that the ANOVA is saved as “my_first_anova” we can summarize the results of that analysis with “summary()”.
summary(my_first_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 12.830 6.415 85.01 <2e-16 ***
## Residuals 57 4.301 0.075
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table gives you all the information you need to report your results in APA style. As usual with an ANOVA, though, the significant p-value only tells you that there's probably some difference somewhere among the group means, but it doesn't tell you which pairs are actually different from each other.
For follow-up comparisons, you can go with the “pairwise.t.test()” function:
pairwise.t.test(gpa_data$gpa,gpa_data$group,
p.adjust.method = "bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: gpa_data$gpa and gpa_data$group
##
## chess honors
## honors 1 -
## normal 5.9e-16 2.1e-15
##
## P value adjustment method: bonferroni
The first argument specifies the dependent variable, "gpa_data$gpa". The second specifies the grouping variable, "gpa_data$group". Lastly, you can (and often should) specify a method for adjusting p-values to correct for multiple comparisons. Here, I went with the Bonferroni method.
To do a repeated measures t-test, I will manually type out the data from the beginning of chapter 7:
before=c(200,190,187,175,240)
after=c(188,186,179,166,233)
To test whether there’s a difference between the average weight of the “before” group and the average weight of the “after” group, you use the “t.test()” function again, but with one difference:
t.test(before,after,paired=TRUE)
##
## Paired t-test
##
## data: before and after
## t = 6.1357, df = 4, p-value = 0.003576
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.379958 11.620042
## sample estimates:
## mean of the differences
## 8
There's a third argument in there specifying that each pair of observations comes from the same individual. If you don't set this equal to "TRUE", then the default is to treat these data as if the "before" group were a completely different set of people from the "after" group.
To do a repeated measures ANOVA, I’ll make up a third set of observations. We have “before”, “after”, and now we have “way after”:
way.after=before-c(1,2,3,4,5)
Call me cynical, but I just set the “way after” weights to be the “before weights” minus only a few pounds.
The next line of code is pretty ugly, and not even the most elegant way to do this, but it accomplishes the goal of putting the before, after, and way after data into one column in a data frame alongside subject number (1 through 5) in the first column and numbers indicating whether each observation came from 1 = before, 2 = after, or 3 = way after. NOTE: Normally, this ugly-looking code would be completely unnecessary because you’d be working with real data that are already formatted in the way we need.
anova_data=data.frame(
rbind(
c(1,before[1],1),
c(1,after[1],2),
c(1,way.after[1],3),
c(2,before[2],1),
c(2,after[2],2),
c(2,way.after[2],3),
c(3,before[3],1),
c(3,after[3],2),
c(3,way.after[3],3),
c(4,before[4],1),
c(4,after[4],2),
c(4,way.after[4],3),
c(5,before[5],1),
c(5,after[5],2),
c(5,way.after[5],3)
)
)
colnames(anova_data)=c("subject","weight","time")
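For what it's worth, here's a more compact way to build the exact same data frame. It leans on rep() and on the fact that c() flattens a matrix column by column; feel free to skip it if the long version is clearer to you:
anova_data = data.frame(
  subject = rep(1:5, each = 3), # 1,1,1,2,2,2,...
  weight = c(rbind(before, after, way.after)), # interleaves the three vectors, subject by subject
  time = rep(1:3, times = 5) # 1,2,3,1,2,3,...
)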
With our data formatted correctly, we turn to the "aov()" function again, but with a few extra twists:
my_second_anova=aov(weight~factor(time)+Error(factor(subject)),data=anova_data)
I've saved the results of this ANOVA as "my_second_anova". The syntax for the first argument is still a formula, essentially "weight as a function of time" or "weight ~ time". I did have to specify that "time" and "subject" were factors, not normal numbers like they might appear to be. So I put "factor(time)" and "factor(subject)" when I needed to refer to "time" and "subject". After "time", I specified that "subject" was a random effect with "+Error(factor(subject))". This just tells R that each "subject" in the data frame is the same person. The three rows where subject = 1 are the weights at 3 different time points for the same subject (the first one).
summary(my_second_anova)
##
## Error: factor(subject)
## Df Sum Sq Mean Sq F value Pr(>F)
## Residuals 4 7375 1844
##
## Error: Within
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(time) 2 163.33 81.67 20 0.000772 ***
## Residuals 8 32.67 4.08
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Just like before, you can get the ANOVA table with the “summary()” function.
To do follow-up contrasts, you can use “pairwise.t.test()” again. The syntax used here is exactly the same as before except there’s a new argument in there that says “paired=TRUE”. As you might guess, this tells R that these data are repeated measures.
pairwise.t.test(anova_data$weight,anova_data$time,paired=TRUE,p.adjust.method = 'bonferroni')
##
## Pairwise comparisons using paired t tests
##
## data: anova_data$weight and anova_data$time
##
## 1 2
## 2 0.011 -
## 3 0.040 0.115
##
## P value adjustment method: bonferroni
To do a factorial ANOVA, you need at least 2 factors with 2 levels apiece.
men.gpa=rnorm(20,2.5,.2)
wom.gpa=rnorm(20,2.75,.2)
men.honors=rnorm(20,3.5,.2)
wom.honors=rnorm(20,3.3,.2)
Above, I've created 4 groups: GPAs for men, GPAs for women, GPAs for men in honors, and GPAs for women in honors. That makes two factors: sex (male, female) and honors (not in honors, in honors).
barplot(
c(mean(men.gpa),mean(wom.gpa),mean(men.honors),mean(wom.honors)),
names.arg=c("Men","Women","H.Men","H.Women"),
ylim=c(0,4)
)
The “barplot()” function is nifty. The first argument is a series of quantities for the height of each bar. I concatenated the mean of our four groups. “names.arg” is a list of labels for each bar. “ylim = c(0,4)” tells R that I want the y-axis to range from 0 to 4 instead of the default range.
For this made up data, I intentionally created two main effects and an interaction: overall, being a woman is associated with having a higher GPA (main effect of sex), overall, being in honors is associated with having a higher GPA (main effect of honors), however, the effect of being in honors is different for men and women (sex x honors interaction).
Below, I set up these data in a proper data frame, the format needed to run the "aov()" function with ease. I saved the data frame as "factorial.data". Again, when you import real data, you won't have to do this kind of thing. The data will already be in a data frame.
factorial.data=data.frame(
gpa=c(men.gpa,wom.gpa,men.honors,wom.honors),
sex=c(rep("man",20),rep("woman",20),rep("man",20),rep("woman",20)),
honors=c(rep("not.honors",40),rep("honors",40))
)
Next, I create a 2 x 2 factorial ANOVA with the “aov()” function and save it as “mod”.
mod=aov(gpa~sex*honors,data=factorial.data)
If you were just doing an ANOVA contrasting gpa across the sexes, the formula would be "gpa~sex". If you were just contrasting gpa between honors and non-honors students, the formula would be "gpa~honors". If you wanted to analyze how sex and honors are associated with gpa, but without an interaction between the two, the formula would be "gpa~sex+honors". By writing "gpa~sex*honors", we are asking for both main effects and the interaction.
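Under the hood, the "*" is just shorthand: spelling out the main effects with "+" and the interaction with ":" produces an identical model (the name "mod.same" is arbitrary):
mod.same = aov(gpa~sex+honors+sex:honors, data=factorial.data) # identical to gpa~sex*honors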
summary(mod)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 0.055 0.055 1.199 0.277
## honors 1 11.880 11.880 260.933 < 2e-16 ***
## sex:honors 1 1.719 1.719 37.755 3.41e-08 ***
## Residuals 76 3.460 0.046
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
"summary(mod)" gives you all the information you need to report the main effects and interactions, but it doesn't give you specific group-by-group comparisons. Before, when we had only one factor, it was easy to just use "pairwise.t.test()". Since we have multiple factors, it's easier to use "TukeyHSD()". John Tukey was an important statistician, and he invented something called the Honest Significant Difference test: Tukey's HSD:
TukeyHSD(mod,which="sex:honors")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = gpa ~ sex * honors, data = factorial.data)
##
## $`sex:honors`
## diff lwr upr p adj
## woman:honors-man:honors -0.2409277 -0.4181679 -0.06368747 0.0034122
## man:not.honors-man:honors -1.0638632 -1.2411034 -0.88662299 0.0000000
## woman:not.honors-man:honors -0.7184647 -0.8957049 -0.54122453 0.0000000
## man:not.honors-woman:honors -0.8229355 -1.0001757 -0.64569533 0.0000000
## woman:not.honors-woman:honors -0.4775371 -0.6547773 -0.30029686 0.0000000
## woman:not.honors-man:not.honors 0.3453985 0.1681583 0.52263867 0.0000132
The first argument of "TukeyHSD()" is our factorial model, "mod". For the second argument, "which", we say which factor (or factors) we want to use to divvy up the means and contrast the groups. By putting "sex:honors", we're saying that we want to contrast both levels of sex and both levels of honors. The printout lists all 6 possible contrasts you can have between 4 group means. The first column gives the groups being contrasted; "diff" is the estimated difference in average GPAs between those two groups. "lwr" and "upr" are short for "lower" and "upper". They give you the lower and upper boundaries of a 95% confidence interval for the estimated difference between the groups. "p adj" is short for "p, adjusted". It gives the p-value associated with each contrast, adjusted with Tukey's method to protect against inflated type 1 errors.
In this section, I just show that some tests yield identical results because they are all instances of the general linear model.
just.honors=subset(factorial.data,honors=="honors")
In the above code, I isolate the honors students into their own data set, saved as "just.honors".
t.test(just.honors$gpa~just.honors$sex,
var.equal=TRUE)
##
## Two Sample t-test
##
## data: just.honors$gpa by just.honors$sex
## t = 3.8858, df = 38, p-value = 0.0003957
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.1154101 0.3664452
## sample estimates:
## mean in group man mean in group woman
## 3.527482 3.286555
I conduct an independent samples t-test contrasting men's GPAs and women's. The result is statistically significant, t(38) = 3.89, p < .001.
summary(lm(just.honors$gpa~just.honors$sex))
##
## Call:
## lm(formula = just.honors$gpa ~ just.honors$sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5274 -0.1338 0.0101 0.1406 0.3365
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52748 0.04384 80.458 < 2e-16 ***
## just.honors$sexwoman -0.24093 0.06200 -3.886 0.000396 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1961 on 38 degrees of freedom
## Multiple R-squared: 0.2844, Adjusted R-squared: 0.2655
## F-statistic: 15.1 on 1 and 38 DF, p-value: 0.0003957
In the above code, I create a regression model predicting GPA based on whether participants are men or women. There is a 0.24 point difference in predicted GPA between men and women. The regression coefficient associated with being a woman is statistically significant, t(38) = -3.89, p < .001. The results are exactly the same as the t-test because the t-test is an application of the general linear model.
summary(
aov(gpa~group,data=gpa_data)
)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 12.830 6.415 85.01 <2e-16 ***
## Residuals 57 4.301 0.075
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the code above, I run an ANOVA on some data we created earlier, with GPAs from students in the chess club, normal students, and students in honors. The output tells us that the grouping variable is statistically significant, F(2, 57) = 85.01, p < .001.
summary(
lm(gpa~group,data=gpa_data)
)
##
## Call:
## lm(formula = gpa ~ group, data = gpa_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.61196 -0.16038 0.03438 0.17839 0.61401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.53613 0.06142 57.569 <2e-16 ***
## grouphonors -0.03097 0.08687 -0.357 0.723
## groupnormal -0.99607 0.08687 -11.467 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2747 on 57 degrees of freedom
## Multiple R-squared: 0.7489, Adjusted R-squared: 0.7401
## F-statistic: 85.01 on 2 and 57 DF, p-value: < 2.2e-16
When I use the grouping variable to predict GPA, I get an overall result for the model that is statistically significant, F(2, 57) = 85.01, p < .001. The inferential statistics are the same as in the ANOVA because ANOVA is a specific application of the general linear model. Plus, doing this as a regression gives you two free independent samples t-tests (the coefficient tests), and you get the \(R^2\) for the model automatically.
summary(
lm(gpa~sex*honors,factorial.data)
)
##
## Call:
## lm(formula = gpa ~ sex * honors, data = factorial.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52741 -0.13511 0.01878 0.14057 0.46732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.52748 0.04771 73.934 < 2e-16 ***
## sexwoman -0.24093 0.06747 -3.571 0.000622 ***
## honorsnot.honors -1.06386 0.06747 -15.767 < 2e-16 ***
## sexwoman:honorsnot.honors 0.58633 0.09542 6.145 3.41e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2134 on 76 degrees of freedom
## Multiple R-squared: 0.7978, Adjusted R-squared: 0.7898
## F-statistic: 99.96 on 3 and 76 DF, p-value: < 2.2e-16
In the above model, I use the interaction syntax from our factorial ANOVA in a linear model. We end up with an intercept (the predicted value of the DV when all independent variables are 0), a dummy variable indicating whether someone is a woman, a dummy variable indicating whether someone is NOT in honors, and a dummy variable indicating whether someone is NOT in honors and IS a woman.
By implication, the "baseline" group is men who are in honors. All other coefficients are deviations from that baseline group. Women who are in honors have a GPA that is 0.24 points lower than men in honors. Men who are not in honors have a GPA that is 1.06 points lower than men who are in honors.
The last term is interesting. We already know that if someone is a woman, we subtract 0.24 from their predicted GPA, and if they're not in honors we subtract another 1.06. But those are just main effects. To capture the fact that there's an interaction between sex and honors, one final coefficient gets added in if someone is a woman who is not in honors: Overall, women who are not in honors have a GPA that is predicted to be 3.53 - 0.24 - 1.06 + 0.59 = 2.81 (rounding the coefficients to two decimals).
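You can check that arithmetic in R by summing the four coefficients from the equivalent linear model:
sum(coef(lm(gpa~sex*honors, data=factorial.data))) # predicted GPA for a woman not in honors, about 2.81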
I'll use the same data sets from the book for this section. The first one had frequency counts for men and women who voted either Republican, Democrat, or Independent.
vote_data=matrix(c(200,150,50,250,300,50),
ncol=3,byrow=TRUE)
vote_data
## [,1] [,2] [,3]
## [1,] 200 150 50
## [2,] 250 300 50
I save the data as "vote_data". Technically, I could've formatted it as a data frame with "data.frame()", but a matrix makes more sense here. The first argument to "matrix()" is "c(200,150,50,250,300,50)". These are the cell counts, starting at the top left and going left to right, top to bottom. The "byrow=TRUE" argument specifies the ordering I just outlined; the default is to enter them by column. "ncol=3" specifies the number of columns in the data set.
colnames(vote_data)=c("Republican","Democrat","Independent")
rownames(vote_data)=c("Male","Female")
These lines give names to the columns and rows of "vote_data".
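Printing the matrix again confirms the labels took:
vote_data # now displays with "Male"/"Female" row names and party column names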
my_chisq=chisq.test(vote_data)
my_chisq
##
## Pearson's Chi-squared test
##
## data: vote_data
## X-squared = 16.204, df = 2, p-value = 0.000303
To perform a chi-square test on a matrix of frequency counts, you can simply pass the name the matrix is saved under to the "chisq.test()" function.
According to the print out, chi-square(2) = 16.20, p < .001. So there is some overrepresentation (or underrepresentation) in some of these cells.
my_chisq$stdres
## Republican Democrat Independent
## Male 2.594996 -3.892495 2.151657
## Female -2.594996 3.892495 -2.151657
The code above takes the name I saved the chi-square analysis under and accesses its standardized residuals with the “$” symbol.
For Fisher’s Exact Test, I’ll recreate the table from the end of the chapter.
measles_data=matrix(c(12,2,6,7),byrow=TRUE,ncol=2)
colnames(measles_data)=c("No measles","Measles")
rownames(measles_data)=c("Vaccinated","Unvaccinated")
Run a Fisher’s test on this data with “fisher.test(measles_data)”:
fisher.test(measles_data)
##
## Fisher's Exact Test for Count Data
##
## data: measles_data
## p-value = 0.04607
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8701299 82.6095968
## sample estimates:
## odds ratio
## 6.464855