6 Summary Statistics

There are two functions in base R that I use to quickly calculate summary statistics. The first is summary(), which for a numeric variable reports the minimum, quartiles, median, mean, and maximum, along with the number of missing values.

summary(flights$dep_delay)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255

The other function is table(), which creates a basic frequency table.

table(flights$carrier)
##
##    9E    AA    AS    B6    DL    EV    F9    FL    HA    MQ    OO    UA    US    VX    WN    YV
## 18460 32729   714 54635 48110 54173   685  3260   342 26397    32 58665 20536  5162 12275   601
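
One thing worth knowing about table() is that it silently drops missing values unless asked otherwise. The following is a minimal sketch on a made-up vector (the `carrier` values here are invented for illustration and are not the flights data); it also shows prop.table(), which turns counts into proportions.

```r
# Toy vector standing in for a column such as flights$carrier (values are made up).
carrier <- c("AA", "UA", "AA", NA, "DL", "UA", "AA")

# By default, table() ignores NA values.
counts <- table(carrier)

# useNA = "ifany" adds a cell for NA when at least one is present.
counts_with_na <- table(carrier, useNA = "ifany")

# prop.table() divides each count by the total of the table.
shares <- prop.table(counts)

print(counts)
print(counts_with_na)
print(round(shares, 2))
```

If the variable may contain missing values, comparing the two tables is a quick way to see how many observations the default output leaves out.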

6.1 The tidyverse approach: summarize

The tidyverse approach to calculating summary statistics is a bit more involved, but it offers a lot of flexibility. The key function is summarize(), which collapses your dataset into a single row and creates new “variables” that are functions of the whole data. For example, I’m going to calculate the mean departure delay.

flights %>% summarize(delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6
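
You can also calculate several statistics in one call by giving summarize() more than one named argument. Here I add the standard deviation and the median.

flights %>% summarize(dep.delay = mean(dep_delay, na.rm = TRUE),
                      dep.delay.sd = sd(dep_delay, na.rm = TRUE),
                      dep.delay.med = median(dep_delay, na.rm = TRUE))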
## # A tibble: 1 x 3
##   dep.delay dep.delay.sd dep.delay.med
##       <dbl>        <dbl>         <dbl>
## 1      12.6         40.2            -2
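
The na.rm argument matters in these calls because dep_delay contains missing values. As a small sketch of what happens without it, here is the same pattern on a made-up tibble (the `toy` data and the `n_missing` column are assumptions for illustration, not part of the flights example).

```r
library(dplyr)

# Made-up delays in minutes; the NA stands in for a missing departure delay.
toy <- tibble(dep_delay = c(-5, 0, 10, NA, 35))

# Without na.rm = TRUE, a single NA makes the mean NA.
with_na <- toy %>% summarize(delay = mean(dep_delay))

# With na.rm = TRUE, missing values are dropped before averaging.
# n_missing counts how many values were missing.
clean <- toy %>%
  summarize(delay = mean(dep_delay, na.rm = TRUE),
            n_missing = sum(is.na(dep_delay)))

print(with_na)
print(clean)
```

Reporting the number of missing values next to the mean is one way to keep track of how much data each summary is actually based on.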