Chapter Zero

This chapter includes both material from the textbook and material that I have added to aid you in using R and R Studio for statistical and data science tasks.

0.1 What Is A Statistical Model? (Chapter Zero)

To both illustrate the 4-Step Process of statistical modeling and to review the two-sample \(t\)-test, I will test whether there was a statistically significant difference between two sections of a class on their final exam, where one section took the final on Monday morning and the other on Friday afternoon.

0.1.1 Choose the Model

##    exam class
## 1    74     F
## 2    72     F
## 3    65     F
## 4    96     F
## 5    45     F
## 6    62     F
## 7    82     F
## 8    67     F
## 9    63     F
## 10   93     F
## 11   29     F
## 12   68     F
## 13   47     F
## 14   80     F
## 15   87     F
## 16  100     M
## 17   86     M
## 18   87     M
## 19   89     M
## 20   75     M
## 21   88     M
## 22   81     M
## 23   71     M
## 24   87     M
## 25   97     M
## 26   83     M
## 27   81     M
## 28   49     M
## 29   71     M
## 30   63     M
## 31   53     M
## 32   77     M
## 33   71     M
## 34   86     M
## 35   78     M
##   class min   Q1 median Q3 max     mean       sd  n missing
## 1     F  29 62.5     68 81  96 68.66667 18.35626 15       0
## 2     M  49 71.0     81 87 100 78.65000 13.05565 20       0
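
Something like the following would reproduce the listing and summary statistics above (the data frame name exams is arbitrary; favstats and bwplot come from the mosaic package):

require(mosaic) # for favstats and bwplot

exams <- data.frame(
  exam = c(74,72,65,96,45,62,82,67,63,93,29,68,47,80,87,
           100,86,87,89,75,88,81,71,87,97,83,81,49,71,63,
           53,77,71,86,78),
  class = c(rep("F",15), rep("M",20))
)

favstats(exam ~ class, data = exams)
# side-by-side boxplots; the low Friday score shows up as an outlier
bwplot(exam ~ class, data = exams)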

Notice that there is an outlier in the Friday section (the lowest exam score). For now we will ignore it and use the \(t\)-test, even though the outlier makes the assumption of (nearly) normal data questionable. The variances are roughly equal. The parameters we will need to estimate are \(\mu_1, \sigma_1, \mu_2, \sigma_2\).

The model can be written as \[Y = \mu_i + \epsilon\] where \(\mu_i\) is the mean of group \(i\) and \(\epsilon \sim N(0,\sigma_i)\) is the random error. We have just two groups. Under the null hypothesis \(H_0: \mu_1 = \mu_2\) (where group #1 is Friday and group #2 is Monday), the means of the two groups are equal; we also assume the variances are equal, but we are primarily interested in the means.

0.1.2 Fit the Model

Our parameter estimates are: \[\bar{y}_1=68.67, \bar{y}_2=78.65, s_1=18.36, s_2=13.06 \]

The fitted model is \[\hat{y} = \bar{y}_i\] We are simply predicting that any student in the Monday section will have an exam score of 78.65 and any student in the Friday section a score of 68.67.

The error term does not appear in the fitted model, even though we know these predictions are not perfectly accurate for each student, because we do not know whether the random error will be positive or negative for any particular observation.

If this seems like a simplistic linear regression model, well, it is.

0.1.3 Assess the Model

We can compute the residual \(e\) for each observation (student) by subtracting the predicted value \(\hat{y}\) (the model prediction) from the actual response \(y\) (the real exam score).

\[e = y - \hat{y}\]

For example, student #1 got an exam score of \(y_1=74\) and has a predicted score of \(\hat{y}_1=68.67\) (the mean of the Friday group), so \(e_1=74-68.67=5.33\) (this student did 5.33 points better than predicted). Student #30 has \(y_{30}=63\) and \(\hat{y}_{30}=78.65\) (the mean of the Monday group), hence \(e_{30}=63-78.65=-15.65\) (this student did 15.65 points worse than predicted).

We really are assuming that these residuals are normally distributed with mean zero and variance equal to \(\sigma^2\) for each group, so a more thorough assessment of the normality assumption would involve an analysis of these residuals, which we’ll do with R on Thursday.
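
As a preview, that residual analysis might look something like this, fitting the two-group model with lm() and examining the residuals (this assumes the exams data frame created above):

# fit the two-group model (equivalent to estimating a mean for each section)
exam.model <- lm(exam ~ class, data = exams)
fitted(exam.model)  # predicted scores (the group means)
resid(exam.model)   # residuals for each student
# normal quantile plot of the residuals to assess the normality assumption
qqnorm(resid(exam.model))
qqline(resid(exam.model))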

We can now compute the usual test statistic and \(p\)-value for the \(t\)-test. For today, we will assume equal variances between the groups.

\[t= \frac{\bar{y}_1-\bar{y}_2}{\sqrt{s_p^2(1/n_1+1/n_2)}}\]

where \(s_p^2\) is the pooled variance, \[s_p^2=\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\].

Under the null hypothesis, \(t \sim t(n_1 + n_2 -2)\).
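
As a check on the arithmetic, we could compute the pooled variance and test statistic directly from the summary statistics above:

n1 <- 15; n2 <- 20
ybar1 <- 68.66667; ybar2 <- 78.65
s1 <- 18.35626; s2 <- 13.05565
# pooled variance
sp2 <- ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2)
# pooled t-statistic; should match the value quoted below
t.stat <- (ybar1-ybar2)/sqrt(sp2*(1/n1+1/n2))
t.stat
# two-sided p-value with n1+n2-2 degrees of freedom
2*pt(-abs(t.stat), df=n1+n2-2)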

Using the data in this problem, we obtain \(t=-1.8824\) with \(df=33\) and a \(p\)-value of \(p=0.0686\). If we are using a standard level of significance of \(\alpha=0.05\), we fail to reject the null hypothesis.
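
In practice, we would let R's t.test() function do this work. With the exams data frame from before, something like this reproduces the test statistic, degrees of freedom, and \(p\)-value just quoted (var.equal=TRUE requests the pooled-variance version):

t.test(exam ~ class, data = exams, var.equal = TRUE)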

0.1.4 Use the Model

Since we failed to reject the null hypothesis, we would conclude that the difference in mean exam scores between the Monday and Friday sections is not statistically significant.

There are some caveats to this conclusion:

  • the sample sizes were not large

  • the assumption of normality is questionable due to an outlier

  • were the students assigned to sections randomly?

Other options (such as randomization tests and non-parametrics) exist if we question the use of a \(t\)-test. We might also consider collecting more data and/or improving how the data was collected.
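
For example, one nonparametric alternative is the Wilcoxon rank-sum (Mann-Whitney) test, which would look something like this with the exams data frame from before:

wilcox.test(exam ~ class, data = exams)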

0.2 Introduction to R and R Studio

We are going to get used to using R and R Studio. Start by downloading R from CRAN:

http://cran.r-project.org

First, choose the link Download R for Linux/(Mac) OS X/Windows (choose your operating system)

If you chose Windows, then click on the install R for the first time link, then Download R 4.0.2 For Windows (4.0.2 is the latest version as of this writing)

If you chose (Mac) OS X, then click on R-4.0.2.pkg link.

Once you’ve downloaded the installer, click on it and follow the menu instructions as you would for other downloaded software.

If you chose Linux, I doubt you’ll need my help. I’m not a Linux user.

http://www.rstudio.com (Download R Studio)

Click on Download R Studio, then click on Download under R Studio Desktop (the Free version), and pick the link that corresponds to your operating system (Windows/macOS/several Linux distributions).

Some free online textbooks we’ll be using are:

https://r4ds.had.co.nz/ (R For Data Science, by Hadley Wickham & Garrett Grolemund)

Hadley’s probably the most prominent R guru and is largely responsible for the tidyverse, a collection of R packages designed for data science. We’ll discuss the tidyverse more in class, or go to https://www.tidyverse.org/. We’ll use R4DS as our main reference for the tidyverse.

https://moderndive.com/ (Modern Dive: Statistical Inference via Data Science, by Chester Ismay & Albert Kim)

We’ll use this book to supplement our main STAT 2 text, and as a guide for our “Data Story” project (see Section 1.1.1).

But first things first…

First, we can use R as a basic calculator. The stuff after the hashtag/pound sign is a comment that the computer ignores but is there for our benefit.

2+2
5+8*2^2
(5+8)*2^2
sqrt(12)
-2^4
(-2)^4
pi
pi/4
exp(1) #e
exp(0) #e^0
log(100) #natural log or ln
log10(100) #base-10 log
log(64,base=2) #base-2 log

0.3 The Tidyverse

There are literally thousands of packages created by R users to add statistical functions and built-in data sets. Let’s install and load the tidyverse and Stat2Data packages. The tidyverse package is a set of several packages, including ggplot2 (Grammar of Graphics), for using R in a modern fashion. Stat2Data is a package with data sets that accompanies our textbook.

In the lower right hand window of R Studio, go to the Packages tab and then Install Packages. (You can also do this from the R console if you aren’t using the R Studio interface; a console version is shown below.)
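
Installing only needs to happen once per machine, but loading with library (or require) is needed in every new R session. From the console, this looks something like:

install.packages(c("tidyverse","Stat2Data"))
library(tidyverse)
library(Stat2Data)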

Here is a bit more on using the tidyverse package for data manipulation, graphing, and analysis. Key resources are https://r4ds.hadley.co.nz, https://www.rstudio.com/resources/cheatsheets/, and http://www.cookbook-r.com/Graphs/.

require(tidyverse)
require(MASS) #cereal data
require(mosaic)

# R for Data Science text book
# http://r4ds.had.co.nz

#  http://genomicsclass.github.io/book/pages/dplyr_tutorial.html (mammal sleep example)
# and http://dplyr.tidyverse.org/ (Star Wars example)
# and http://r4ds.had.co.nz/transform.html (NYC Flights example)
# look there for some options I left off

# using the 'dplyr' package for the grammar of data manipulation
# using the 'ggplot2' package for the grammar of graphics: data visualization

data(UScereal)
head(UScereal)
str(UScereal)
UScereal$SHELF <- factor(UScereal$shelf,labels=c("bottom","middle","top"),ordered=TRUE)
UScereal$BRAND <- rownames(UScereal)
head(UScereal)
str(UScereal)
View(UScereal)

# note, there is a function named 'select' in both dplyr and MASS
# thus, I'm using package::function notation
# Selecting specific columns using select()

UScereal1 <- dplyr::select(UScereal,BRAND,SHELF,carbo,fibre,sugars)
head(UScereal1)
# to select all columns except one, use - operator
UScereal2 <- dplyr::select(UScereal, -vitamins)
head(UScereal2)
# range of columns with the : operator
UScereal3 <- dplyr::select(UScereal,calories:sodium,SHELF)
head(UScereal3)

# Selecting specific rows using filter()
UScereal11 <- filter(UScereal1, fibre >= 10)
head(UScereal11)
UScereal12 <- filter(UScereal1, sugars>=10, fibre<5)
head(UScereal12)
UScereal13 <- filter(UScereal1, fibre>5, SHELF %in% c("middle","bottom"))
head(UScereal13)

# more with piping

UScereal %>% 
  dplyr::select(SHELF, sugars) %>% 
  mutate(sugars=round(sugars,1)) %>%
  head

# arrange or re-arrange rows with arrange()
UScereal1 %>% 
  arrange(SHELF) 
UScereal1 %>% 
 arrange(SHELF,fibre)
UScereal1 %>% 
  arrange(desc(SHELF),desc(fibre))

# create new columns with mutate()
UScereal1 %>% 
  mutate(total_carbo=carbo+fibre) %>%
  head

UScereal1 %>% 
  mutate(sugars_oz = sugars/28.3495) %>%
  head

# Create summaries of a data frame with summarise()
UScereal %>%
  summarise(avg_fibre=mean(fibre))

UScereal %>%
  summarise(total = n(),
            MIN_calories=min(calories),
            Q1_calories=quantile(calories,probs=0.25),
            M_calories=median(calories),
            Q3_calories=quantile(calories,probs=0.75),
            MAX_calories=max(calories),
            SD_calories=sd(calories),
            IQR_calories=Q3_calories-Q1_calories,
            Range_calories=MAX_calories-MIN_calories)

# I like the favstats function from the `mosaic` package better; it is easier
favstats(~calories,data=UScereal)

# Group operations using group_by()
shelf_stats <- 
  UScereal %>%
  group_by(SHELF) %>%
  summarise(avg_fibre=mean(fibre),sd_fibre=sd(fibre),n=n(),se_fibre=sd_fibre/sqrt(n))
shelf_stats

favstats(calories~SHELF,data=UScereal)

# https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html
# Tibbles are a modern take on data frames. They keep the features that have 
# stood the test of time, and drop the features that used to be convenient but 
# are now frustrating (i.e. converting character vectors to factors).

# Let's turn UScereal from the traditional R data frame to the modern R tibble
UScerealT <- as_tibble(UScereal)
UScerealT
str(UScereal)
str(UScerealT)
options(tibble.print_min=15,tibble.width=Inf) #15 rows, all columns
UScerealT



ggplot(shelf_stats,aes(x=SHELF,y=avg_fibre,color=SHELF)) + 
  geom_point() +
  geom_errorbar(aes(ymin=avg_fibre-se_fibre,ymax=avg_fibre+se_fibre),width=0.1)

# Two-way interaction graph
UScereal4 <- 
  UScereal %>% 
  dplyr::select(BRAND,SHELF,sugars,fibre) %>%
  mutate(HighSugar=(sugars>=10))

head(UScereal4)
shelf_hisugar_stats <-
  UScereal4 %>%
  group_by(SHELF,HighSugar) %>%
  summarise(avg_fibre=mean(fibre),sd_fibre=sd(fibre),n=n(),se_fibre=sd_fibre/sqrt(n))

shelf_hisugar_stats

ggplot(shelf_hisugar_stats,aes(x=SHELF,y=avg_fibre,group=HighSugar,color=HighSugar)) +
  geom_point() + 
  geom_line() 

data(faithful)

# http://www.cookbook-r.com/Graphs/

ggplot(faithful, aes(x=eruptions,y=waiting))+
  geom_point(color="darkblue",pch=22,size=2) +  
  # scatterplot, points twice as big, choose color & shape of points
  theme_bw() + 
  # uses black & white theme rather than standard gray theme
  labs(title="Predicting Old Faithful",  
       x="Length of Previous Eruption",y="Waiting Time") + 
  # labels for graph and axes
  theme(title=element_text(size=16,face="bold"), 
        axis.title=element_text(size=12,face="italic")) +
  # changes font size and face
  theme(plot.title = element_text(hjust = 0.5)) + 
  # centers title, shouldn't be this hard
  geom_smooth(method=lm, se=TRUE,color="green3",fill="goldenrod") + 
  #fits linear model (regression line), change to method=loess
  geom_abline(intercept=30,slope=15) + 
  #fits a specified line, not the line of best fit
  geom_vline(xintercept=mean(faithful$eruptions),linetype="dashed") + 
  #vertical line
  geom_hline(yintercept=mean(faithful$waiting),linetype="dashed")
#horizontal line

data(KidsFeet)
favstats(length~sex,data=KidsFeet)
g1 <- ggplot(KidsFeet,aes(x=sex,y=length,color=sex))
g1 # you haven't picked a "geom" yet
g2 <- ggplot(KidsFeet,aes(x=length))
g2

g1 + geom_boxplot()
g1 + geom_dotplot(binaxis="y",stackdir="center",binwidth=0.1)
g1 + geom_violin()
g2 + geom_histogram(binwidth=0.5,col="blue",fill="lightblue")
g1 + stat_summary(fun.data=mean_se) 
#a way to do error bars, mean +/- standard error

# change colors, Hadley Wickham will tell you to usually not do this

g1 + stat_summary(fun.data=mean_se) + 
  scale_color_manual(values=c("dodgerblue3","hotpink3")) +
  theme_bw()

g3 <- ggplot(KidsFeet,aes(x=length,y=width,color=sex,shape=sex))
g3 + geom_point(size=2) + 
  scale_color_manual(values=c("dodgerblue3","hotpink3")) +
  theme_bw() +
  xlim(20,30) +
  ylim(7,11)

# what if you don't have a color printer
g1 + geom_boxplot() + 
  scale_color_grey() +
  theme_bw()


# lots more out there...
# http://environmentalcomputing.net/plotting-with-ggplot-colours-and-symbols/

0.4 Data Wrangling & Visualization

We are going to follow the lead of Chapters 3 & 4 of the Modern Dive online textbook, https://moderndive.com/2-viz.html and https://moderndive.com/3-wrangling.html, to work with a couple of data sets. In my opinion, while the Modern Dive book doesn’t have as much detail as R4DS, https://r4ds.hadley.co.nz, it is probably a better introduction to using R Studio for basic data analysis.

The first data set will be the flights data set in the nycflights13 package we talked about on Thursday, with some information about flights leaving from the three New York City area airports: Newark (EWR), LaGuardia (LGA), and Kennedy (JFK) in 2013.

I am going to suppose that I wish to fly from the New York City area to Denver or Salt Lake City for a ski trip in March. I wish to pick an airline with historically (at least in 2013) the least amount of delay because I don’t want to miss any ski time, and I don’t care about price or which of the three airports I fly out of.

The second data set will be the classic_rock_song_list data set from the fivethirtyeight package, with the songs played most frequently on a sample of classic rock radio stations in June 2014 (more info at https://fivethirtyeight.com/features/why-classic-rock-isnt-what-it-used-to-be/). I will look at some visualizations and statistics of the most popular classic rock songs of the 1960s.

require(tidyverse) # gives us access to ggplot2, dplyr, and others
require(mosaic) # I am using mainly for favstats
require(nycflights13) # for the flights data set
require(fivethirtyeight) # you might need to install this package
require(scales) # and this one as well

# I want to fly from the NYC area to DEN in March.
# I will create a data set limited to this.
# & is the logical operator AND, | is the logical operator OR

# airlines # tells us about the carrier codes, what's "B6"?

skitrip <- flights %>%
  filter(month==3 & (dest=="DEN" | dest=="SLC")) %>%
  mutate(carrier=recode(carrier,"DL"="Delta","F9"="Frontier","B6"="JetBlue",
                        "WN"="Southwest","UA"="United"))
# those are the only 5 airlines that fly direct from NYC to DEN or SLC
  
# str(skitrip)
# summary(skitrip)
# table(skitrip$dest)
xtabs(~dest+carrier,data=skitrip)
##      carrier
## dest  Delta Frontier JetBlue Southwest United
##   DEN    98       57      31       120    337
##   SLC   191        0      31         0      0
# barchart number of flights by airline
ggplot(skitrip,aes(x=carrier)) + 
  geom_bar()

# by airline and destination city
ggplot(skitrip,aes(x=carrier)) +
  geom_bar(aes(fill=dest))

ggplot(skitrip,aes(x=carrier,fill=dest)) + 
  geom_bar(position="dodge") 

# basic regression model
model1 <- lm(arr_delay~dep_delay,data=skitrip)
model1
## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = skitrip)
## 
## Coefficients:
## (Intercept)    dep_delay  
##     -11.028        1.032
summary(model1)
## 
## Call:
## lm(formula = arr_delay ~ dep_delay, data = skitrip)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.941 -11.845  -1.988  10.147  80.432 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.02758    0.64579  -17.08   <2e-16 ***
## dep_delay     1.03191    0.01519   67.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.01 on 848 degrees of freedom
##   (15 observations deleted due to missingness)
## Multiple R-squared:  0.8448,	Adjusted R-squared:  0.8446 
## F-statistic:  4614 on 1 and 848 DF,  p-value: < 2.2e-16
# scatterplot, departure delay vs arrival delay
ggplot(skitrip,aes(x=dep_delay,y=arr_delay,color=carrier)) +
  geom_point(pch=16) + 
  geom_abline(intercept=-11.028,slope=1.032)

# focus on arrival delay, statistics by airline
favstats(arr_delay~carrier,data=skitrip)
##     carrier min     Q1 median    Q3 max       mean       sd   n missing
## 1     Delta -61 -23.00   -9.5  6.75 387  0.3706294 46.05476 286       3
## 2  Frontier -32 -11.00   -1.0 16.00 410 12.6666667 62.90592  57       0
## 3   JetBlue -27 -16.75   -1.0 14.75 356 14.8870968 57.97600  62       0
## 4 Southwest -56 -24.00  -11.5  6.75 134 -2.8245614 35.09462 114       6
## 5    United -62 -21.50   -9.0  5.50 293  0.2296073 41.92506 331       6
ggplot(skitrip,aes(x=carrier,y=arr_delay)) +
  geom_boxplot()

# I suspect the missing data is canceled flights

# create a data set with summary statistics
# remove missing data
skitrip_stats <- skitrip %>%
  filter(!is.na(arr_delay)) %>%
  group_by(carrier) %>%
  summarize(mean_delay=mean(arr_delay),
            sd_delay=sd(arr_delay),
            n=n())

skitrip_stats
## # A tibble: 5 x 4
##   carrier   mean_delay sd_delay     n
##   <chr>          <dbl>    <dbl> <int>
## 1 Delta          0.371     46.1   286
## 2 Frontier      12.7       62.9    57
## 3 JetBlue       14.9       58.0    62
## 4 Southwest     -2.82      35.1   114
## 5 United         0.230     41.9   331
# error bar plot, not in Modern Dive
ggplot(skitrip_stats,aes(x=carrier)) + 
  geom_point(aes(y=mean_delay),color="red",size=3) +
  geom_errorbar(aes(ymin=mean_delay-sd_delay/sqrt(n),
                    ymax=mean_delay+sd_delay/sqrt(n)),width=0.1,lwd=1.5) +
  geom_hline(yintercept=0,linetype="dashed",color="red") +
  xlab("Airline") +
  ylab("Mean Arrival Delay (minutes)") +
  ggtitle("Error Bars (Mean +/- 1 Standard Error)") +
  theme_bw()

# basic inference, let's compare B6 (Jet Blue) vs DL (Delta)
# the only two airlines that fly into both Denver and Salt Lake from NYC
skitrip2 <- skitrip %>%
  filter(carrier %in% c("Delta","JetBlue"))

t.test(arr_delay~carrier,data=skitrip2)
## 
## 	Welch Two Sample t-test
## 
## data:  arr_delay by carrier
## t = -1.8491, df = 78.516, p-value = 0.0682
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -30.143873   1.110938
## sample estimates:
##   mean in group Delta mean in group JetBlue 
##             0.3706294            14.8870968
# compare Denver vs Salt Lake
t.test(arr_delay~dest,data=skitrip)
## 
## 	Welch Two Sample t-test
## 
## data:  arr_delay by dest
## t = 1.143, df = 349.65, p-value = 0.2538
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.112329 11.749971
## sample estimates:
## mean in group DEN mean in group SLC 
##          2.893482         -1.425339
## Classic Rock Songs, 1960s only
classic60s <- classic_rock_song_list %>%
  filter(release_year >= 1960 & release_year <= 1969) 
# View(classic60s)
# 264 different songs from this decade were played
# use the between function for the 1990s
# there are often two or more ways to accomplish the same task
classic90s <- classic_rock_song_list %>%
  filter(between(release_year,1990,1999)) 

View(classic90s)

# histogram
ggplot(classic60s,aes(x=release_year)) +
  geom_histogram(binwidth=1,col="white",fill="lightblue") +
  scale_x_continuous(breaks=1960:1969)

# most popular songs
classic60s %>%
  dplyr::select(song,artist,release_year,playcount) %>%
  arrange(desc(playcount)) %>%
  print(n=20)
## # A tibble: 264 x 4
##    song                               artist         release_year playcount
##    <chr>                              <chr>                 <int>     <int>
##  1 All Along the Watchtower           Jimi Hendrix           1968       141
##  2 Gimme Shelter                      Rolling Stones         1969        96
##  3 Ramble On                          Led Zeppelin           1969        90
##  4 Honky Tonk Women                   Rolling Stones         1969        87
##  5 Sympathy For The Devil             Rolling Stones         1968        81
##  6 Whole Lotta Love                   Led Zeppelin           1969        72
##  7 Jumpin' Jack Flash                 Rolling Stones         1968        72
##  8 Come Together                      The Beatles            1969        69
##  9 White Room                         Cream                  1968        68
## 10 Purple Haze                        Jimi Hendrix           1967        67
## 11 Magic Carpet Ride                  Steppenwolf            1968        65
## 12 Fire                               Jimi Hendrix           1967        63
## 13 You Can't Always Get What You Want Rolling Stones         1969        62
## 14 (I Can't Get No) Satisfaction      Rolling Stones         1965        60
## 15 Paint It Black                     Rolling Stones         1966        58
## 16 Revolution                         The Beatles            1968        55
## 17 Brown Eyed Girl                    Van Morrison           1967        55
## 18 Pinball Wizard                     The Who                1969        54
## 19 Born to Be Wild                    Steppenwolf            1968        53
## 20 People Are Strange                 The Doors              1967        53
## # ... with 244 more rows
# most popular artists
popular60s <- classic60s %>%
  group_by(artist) %>%
  summarize(songs=n()) %>%
  arrange(desc(songs),artist) %>%
  print(n=10)
## # A tibble: 57 x 2
##    artist                       songs
##    <chr>                        <int>
##  1 The Beatles                     91
##  2 Rolling Stones                  22
##  3 Led Zeppelin                    18
##  4 Jimi Hendrix                    14
##  5 The Doors                       13
##  6 The Who                         10
##  7 Creedence Clearwater Revival     7
##  8 Cream                            6
##  9 Pink Floyd                       5
## 10 The Kinks                        5
## # ... with 47 more rows
# barchart of the most popular artists
popular60s %>%
  filter(songs>=5) %>%
  ggplot(aes(x=reorder(artist,songs),y=songs)) +
  geom_col(fill="purple") +
  coord_flip() +
  xlab("Number of Songs") +
  ylab("Artist") +
  ggtitle("Popular 1960s Classic Rock Artists") +
  theme(plot.title=element_text(size=10,face="bold",color="purple",
        family="serif"))

# we aren't going to fuss with reordering alphabetically or by most songs 
# at least today

# what percentage of songs come from various decades?

# View(classic_rock_song_list), sort by release_year
# why am I using right=FALSE below?

classic_rock_song_list %>%
  filter(release_year>=1960) %>% 
  # there is an Elton John song listed as from the year 1071, typo
  # the line below is probably the most complex in this example
  mutate(decade=cut(release_year,breaks=c(1960,1970,1980,1990,2000,2020),
                    right=FALSE,
                    labels=c("60s","70s","80s","90s","2000s"))) ->
  classic_rock_decade

ggplot(classic_rock_decade,aes(x=decade,y=..prop..,group=1)) +
  geom_bar(color="blue",fill="lightblue") +
  scale_y_continuous(labels=scales::percent) +
  ylab("Percent") + xlab("Decade") +
  ggtitle("What Decade's Songs Are Still Played \n on Classic Rock Stations?") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle=element_text(hjust = 0.5)) # centers

0.5 Data Importing

  1. Direct Entry

Basically, just manually type in a bunch of vectors and then bind them together into a data frame. This is really only feasible for a small example.

name <- c("Amy","Brad","Carrie","Diego","Emily")
exam <- c(95,72,88,92,65)
grades <- data.frame(name,exam)
str(grades)
## 'data.frame':	5 obs. of  2 variables:
##  $ name: chr  "Amy" "Brad" "Carrie" "Diego" ...
##  $ exam: num  95 72 88 92 65
summary(grades)
##      name                exam     
##  Length:5           Min.   :65.0  
##  Class :character   1st Qu.:72.0  
##  Mode  :character   Median :88.0  
##                     Mean   :82.4  
##                     3rd Qu.:92.0  
##                     Max.   :95.0
  2. Use built-in datasets

This is very handy for classroom use and if a built-in data set happens to be what you want or need to analyze. Base R comes with many data sets built into the datasets package, including classics such as the Old Faithful data, faithful, Fisher’s flower measurements iris, and Anscombe’s famous example anscombe of 4 data sets with the same correlation and regression equation but very different trends. Notice you do not have to “load” the datasets package with a require or library command.

require(mosaic)
# help(faithful) use this to get help on any data set
head(faithful)
##   eruptions waiting
## 1     3.600      79
## 2     1.800      54
## 3     3.333      74
## 4     2.283      62
## 5     4.533      85
## 6     2.883      55
histogram(~waiting,data=faithful)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
favstats(Petal.Length~Species,data=iris)
##      Species min  Q1 median    Q3 max  mean        sd  n missing
## 1     setosa 1.0 1.4   1.50 1.575 1.9 1.462 0.1736640 50       0
## 2 versicolor 3.0 4.0   4.35 4.600 5.1 4.260 0.4699110 50       0
## 3  virginica 4.5 5.1   5.55 5.875 6.9 5.552 0.5518947 50       0
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
cor(y1~x1,data=anscombe)
## [1] 0.8164205
cor(y2~x2,data=anscombe)
## [1] 0.8162365
cor(y3~x3,data=anscombe)
## [1] 0.8162867
cor(y4~x4,data=anscombe)
## [1] 0.8165214
par(mfrow=c(2,2))
plot(y1~x1,data=anscombe)
plot(y2~x2,data=anscombe)
plot(y3~x3,data=anscombe)
plot(y4~x4,data=anscombe)

  3. Use data sets in other R packages

Many of the thousands of R packages come with built-in data sets as well. For example, your textbook has a package called Stat2Data that contains the data sets used in this book. One such data set is Backpack, with the results of a student research project looking at factors related to the weight of backpacks and whether backpack weight was associated with back problems.

The package MASS contains several data sets that I have used in classroom situations. UScereal has nutritional information on several brands of breakfast cereal and is a favorite of mine. In my biostatistics course for nursing students, I have used a data set called Pima.tr which has data regarding possible predictors of Type II diabetes in women on the Pima reservation in Arizona.

require(Stat2Data)
## Loading required package: Stat2Data
data(Backpack)
head(Backpack)
##   BackpackWeight BodyWeight     Ratio BackProblems      Major Year    Sex
## 1              9        125 0.0720000            1        Bio    3 Female
## 2              8        195 0.0410256            0 Philosophy    5   Male
## 3             10        120 0.0833333            1        GRC    4 Female
## 4              6        155 0.0387097            0        CSC    6   Male
## 5              8        180 0.0444444            0         EE    2 Female
## 6              5        240 0.0208333            0    History    0   Male
##   Status Units
## 1      U    13
## 2      U    12
## 3      U    14
## 4      G     0
## 5      U    14
## 6      G     0
require(MASS)
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
data(UScereal)
summary(UScereal)
##  mfr       calories        protein             fat            sodium     
##  G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0  
##  K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0  
##  N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0  
##  P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8  
##  Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0  
##  R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9  
##      fibre            carbo           sugars          shelf      
##  Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000  
##  Median : 2.000   Median :18.67   Median :12.00   Median :2.000  
##  Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169  
##  3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000  
##  Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000  
##    potassium          vitamins 
##  Min.   : 15.00   100%    : 5  
##  1st Qu.: 45.00   enriched:57  
##  Median : 96.59   none    : 3  
##  Mean   :159.12                
##  3rd Qu.:220.00                
##  Max.   :969.70
data(Pima.tr)
help(Pima.tr)
## starting httpd help server ...
##  done
head(Pima.tr)
##   npreg glu bp skin  bmi   ped age type
## 1     5  86 68   28 30.2 0.364  24   No
## 2     7 195 70   33 25.1 0.163  55  Yes
## 3     5  77 82   41 35.8 0.156  35   No
## 4     0 165 76   43 47.9 0.259  26   No
## 5     0 107 60   25 26.4 0.133  23   No
## 6     5  97 76   27 35.6 0.378  52  Yes
  4. Internet URL

If someone has uploaded a data set onto a webpage, you can use read.table (if it is a .txt file) or read.csv (if it is a .csv comma-delimited file) to read in the data. You must be absolutely precise with the URL.
require(mosaic)
URL <- "http://campus.murraystate.edu/academic/faculty/cmecklin/STA135ClassDataFall2017.txt"
class.data <- read.table(file=URL,header=TRUE)
head(class.data)
##   Gender  Color Texts Chocolate HSAlgebra Pizza Sushi Tacos Temp Height
## 1      F yellow     3        10         8     3     1     2   85     62
## 2      F yellow     5         9         6     1     3     2   87     63
## 3      M orange    21         9         3     3     1     2   88     66
## 4      F yellow     4         9         7     2     3     1   84     61
## 5      M   blue     0         7         5     1     3     2   80     74
## 6      F   gray     4        10         4     1     2     3   72     65
str(class.data)
## 'data.frame':	45 obs. of  10 variables:
##  $ Gender   : chr  "F" "F" "M" "F" ...
##  $ Color    : chr  "yellow" "yellow" "orange" "yellow" ...
##  $ Texts    : int  3 5 21 4 0 4 4 8 5 20 ...
##  $ Chocolate: int  10 9 9 9 7 10 2 7 10 4 ...
##  $ HSAlgebra: int  8 6 3 7 5 4 8 5 6 6 ...
##  $ Pizza    : int  3 1 3 2 1 1 3 2 2 2 ...
##  $ Sushi    : int  1 3 1 3 3 2 1 3 1 3 ...
##  $ Tacos    : int  2 2 2 1 2 3 2 1 3 1 ...
##  $ Temp     : int  85 87 88 84 80 72 83 87 93 84 ...
##  $ Height   : int  62 63 66 61 74 65 71 68 62 70 ...
favstats(Texts~Gender,data=class.data)
##   Gender min Q1 median   Q3 max  mean       sd  n missing
## 1      F   0  4      6 16.0 100 15.32 22.04866 25       0
## 2      M   0  0      4 14.5  46  9.00 12.12894 19       1
bwplot(Texts~Gender,data=class.data)

t.test(Texts~Gender,data=class.data,alt="greater")
## 
## 	Welch Two Sample t-test
## 
## data:  Texts by Gender
## t = 1.2121, df = 38.73, p-value = 0.1164
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -2.466853       Inf
## sample estimates:
## mean in group F mean in group M 
##           15.32            9.00
grad.school <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
# note, unlike read.table, header=TRUE is the default for read.csv
head(grad.school)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
str(grad.school)
## 'data.frame':	400 obs. of  4 variables:
##  $ admit: int  0 1 1 1 0 1 1 0 1 0 ...
##  $ gre  : int  380 660 800 640 520 760 560 400 540 700 ...
##  $ gpa  : num  3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
##  $ rank : int  3 3 1 4 4 2 1 2 3 2 ...
summary(grad.school)
##      admit             gre             gpa             rank      
##  Min.   :0.0000   Min.   :220.0   Min.   :2.260   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:520.0   1st Qu.:3.130   1st Qu.:2.000  
##  Median :0.0000   Median :580.0   Median :3.395   Median :2.000  
##  Mean   :0.3175   Mean   :587.7   Mean   :3.390   Mean   :2.485  
##  3rd Qu.:1.0000   3rd Qu.:660.0   3rd Qu.:3.670   3rd Qu.:3.000  
##  Max.   :1.0000   Max.   :800.0   Max.   :4.000   Max.   :4.000
  5. EXCEL file

You can save it as .csv rather than .xls or .xlsx, then specify its location on your machine when using read.csv. I believe there are also specialized packages for directly importing EXCEL files (one is sketched after the code below), or go below to see an easier way with the help of R Studio.

# EXCEL file, save in .csv format, stored on machine

class.csv <- read.csv(file="C:/Users/Chris Mecklin/Desktop/MAT135Fall2012/STA135ClassDataFall2017.csv")
head(class.csv)
##   Gender  Color Texts Chocolate HSAlgebra Pizza Sushi Tacos Temp Height
## 1      F yellow     3        10         8     3     1     2   85     62
## 2      F yellow     5         9         6     1     3     2   87     63
## 3      M orange    21         9         3     3     1     2   88     66
## 4      F yellow     4         9         7     2     3     1   84     61
## 5      M   blue     0         7         5     1     3     2   80     74
## 6      F   gray     4        10         4     1     2     3   72     65
#or, hunt for it locally with file.choose
#uncomment the lines below to do so

# class.find <- read.csv(file=file.choose())
# tail(class.find)
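
One such specialized package is readxl, which can read .xls and .xlsx files directly; here is a minimal sketch (the file path is just a placeholder for wherever the file lives on your machine):

# install.packages("readxl") if you don't already have it
require(readxl)
class.xlsx <- read_excel("STA135ClassDataFall2017.xlsx")  # placeholder path
head(class.xlsx)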
  6. Import Data (Environment tab in upper right hand panel)

In the upper right hand corner, in the Environment tab, click on Import Dataset, which will allow you to import data from five different formats: .csv files, EXCEL files, SPSS files, SAS files, and Stata files (the last three are various statistical software packages).

  7. Google Sheets

Maybe you like to work with a Google Sheet that you have stored on your Google Drive, or you have the link for someone else’s Google Sheet. A package called googlesheets can help.

However, the syntax is very cumbersome, and another option would be to save the Google Sheet as a .csv file on your machine and then import it as discussed above. (Honestly, this would be my choice.)

0.6 Working with Probability Distributions

As always, use the help command to learn about the various distributions available or a specific distribution.

help(Distributions) 
help(dnorm)

In general, functions that begin with d-, such as dbinom and dnorm, evaluate the probability density (or mass) function and are equivalent to such TI functions as binompdf and normalpdf.

For a discrete distribution such as the binomial, this finds the probability that \(X\) is equal to a specific value \(x\). \[f(x)=P(X=x)={n \choose x} p^x (1-p)^{n-x}\] For example, if \(X \sim BIN(n=10,p=0.3)\), find \(P(X=4)\).

dbinom(x=4,size=10,prob=0.3)   #same as binompdf(10,0.3,4)
## [1] 0.2001209
# find and graph the probabilities of all values of X
x <- 0:10
P.x <- dbinom(x=x,size=10,prob=0.3)
plot(P.x~x,type="h",
     xlab="Successes",ylab="Probability",main="Binomial, n=10, p=0.3")

For a continuous distribution such as the normal, this finds the height of the curve \(f(x)\) at a specific value \(x\). \[f(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}\] You never needed to know this in a class like STA 135; it is mainly useful for graphing the function.

Suppose we have the normal curve \(X \sim N(\mu=100, \sigma=15)\)

dnorm(x=110,mean=100,sd=15)  #same as normalpdf(110,100,15), you probably never did this in STA 135
## [1] 0.02129653
## Graph the normal curve
x <- seq(50,150,by=0.01)
f.x <- dnorm(x=x,mean=100,sd=15)
plot(f.x~x,type="l",col="blue3",
     ylab="f(x)",main="Normal Dist. mu=100,sigma=15")

Functions that begin with p-, such as pbinom and pnorm, evaluate the cumulative distribution function, or \(P(X \leq x)\). They are equivalent to such TI functions as binomcdf and normalcdf.

To find the \(P(X \leq 4)\) for the binomial distribution with \(n=10\), \(p=0.3\)

pbinom(q=4,size=10,prob=0.3)  #same as binomcdf(10,0.3,4)
## [1] 0.8497317

This avoids having to repeatedly use the binomial formula and add the results together.
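
As a quick check, adding up the individual dbinom() probabilities gives the same answer as the single pbinom() call:

sum(dbinom(x=0:4, size=10, prob=0.3)) # adds P(X=0) through P(X=4)
## [1] 0.8497317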

To find \(P(X \leq 110)\) from the normal curve \(X \sim N(\mu=100,\sigma=15)\)

pnorm(q=110,mean=100,sd=15)  #same as normalcdf(-1E99,110,100,15)
## [1] 0.7475075

This avoids having to either numerically approximate a difficult integral or find \(z\)-scores and use the standard normal table.
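
To see the connection with the \(z\)-score approach, standardizing first and then using the standard normal curve gives the same probability:

z <- (110-100)/15
pnorm(q=z) # same as pnorm(q=110, mean=100, sd=15)
## [1] 0.7475075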

In statistics, we frequently find probabilities involving either the lower tail, upper tail, or both of a distribution, as this will be the \(p\)-value of a hypothesis test.

To find the \(p\)-value of a two-sided \(t\)-test with test statistic \(t=-1.72\) with \(df=15\) (if the alternative was one-sided, don’t multiply by 2):

2*pt(q=-1.72,df=15) #same as 2*tcdf(-1E99,-1.72,15)
## [1] 0.1059907

To find the \(p\)-value for a chi-square test or an \(F\) test, note that unlike the normal and \(t\) distributions, these only take on positive values, and the \(p\)-value will just be the area above the test statistic.

pchisq(q=2.59,df=2,lower.tail=FALSE) #same as  X^2cdf(2.59,1E99,2)
## [1] 0.2738979
pf(q=4.83,df1=1,df2=20,lower.tail=FALSE) #same as Fcdf(4.83,1E99,1,20)
## [1] 0.03991306

Functions that begin with q-, such as qnorm and qt, evaluate the quantile function (i.e. find percentiles) and are equivalent to such TI functions as invNorm and invT. We use these functions to find critical values that we used to look up in tables, typically as part of a formula for a confidence interval.

##Find the 75th percentile of X~N(100,15)
qnorm(p=0.75,mean=100,sd=15) #same as invNorm(0.75,100,15)
## [1] 110.1173
##Find the 2.5th and 97.5th percentiles of X~N(0,1), i.e. critical values for a 95% CI
qnorm(p=c(0.025,0.975)) #same as invNorm(0.025) and invNorm(0.975)
## [1] -1.959964  1.959964
##Find the critical value for a t-test with alpha=0.05, df=20
alpha<-0.05
qt(p=c(alpha/2,1-alpha/2),df=20) #two-sided, divide alpha in half
## [1] -2.085963  2.085963
qt(p=alpha,df=20) #lower one-sided
## [1] -1.724718
qt(p=1-alpha,df=20) #upper one-sided
## [1] 1.724718
##Find the critical value for an F-test with alpha=0.01, df=1,20
alpha<-0.01
qf(p=1-alpha,df1=1,df2=20)
## [1] 8.095958

Functions that begin with r-, such as rbinom and rnorm, generate random numbers from the appropriate distribution.

require(mosaic)
## Generate 20 random numbers from a normal curve
samp1 <- rnorm(n=20,mean=100,sd=15)
samp1
##  [1] 106.87258  94.09079 108.80870  86.18941  96.37385  79.82250 104.22389
##  [8] 107.17572  90.06373 104.31413  95.63439 101.61674  92.82288 109.49456
## [15] 103.58920 101.47943  79.61186  96.58949 108.26635 117.37099
favstats(samp1)
##       min       Q1   median       Q3     max     mean      sd  n missing
##  79.61186 93.77381 101.5481 106.9484 117.371 99.22056 10.0553 20       0
## Repeat an experiment 25 times where you flip a fair coin 100 times 
## and count the number of heads
samp2 <- rbinom(n=25,size=100,prob=0.5)
samp2
##  [1] 53 47 45 50 53 46 50 39 51 53 48 58 45 44 45 53 46 54 50 48 45 36 49 55 52
favstats(samp2)
##  min Q1 median Q3 max mean sd  n missing
##   36 45     49 53  58 48.6  5 25       0

0.7 Webscraping

This material is to come…