Chapter 3 Stats in the Wild: Good Data Practices

3.1 Introduction

I will introduce some additional R commands and packages by reproducing Table 2.1, Table 2.2, Table 2.5, and Figure 2.3. In addition, we will go through the Computing Corner.

3.2 Table and Figure Reproduction

3.2.1 Table 2.1

Since we saved the data frame we created in Chapter 1 as donuts.RData, we will load the file into the global environment. We are only interested in the summary statistics for Weight and Donuts. We can get a basic set of summary statistics by calling summary on the data frame, but stargazer from the stargazer package produces nicer output. stargazer produces well-formatted tables in LaTeX code, HTML code, and ASCII text.

We will make use of the pipe operator from the magrittr package (also part of the tidyverse) as well. The pipe operator %>% (Ctrl+Shift+M shortcut in RStudio) allows for more intuitive reading of code, especially when commands are nested inside of each other. Take the simple example of calling the str command on a data frame, df. Without the pipe operator, we would call the command like this: str(df), and you might read it aloud as "find the structure of df." With the pipe operator, we call the command like this: df %>% str(). Read aloud, it might be something like "take the df data and find its structure." The pipe operator really shines when functions are nested together, as we shall see below.
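For instance, here is the same operation written both ways (a minimal sketch using the built-in mtcars data; the pipe requires the tidyverse, or magrittr, to be loaded, as in the next code block):

# nested: read from the inside out
summary(head(mtcars))
# piped: read from left to right, top to bottom
mtcars %>% 
  head() %>% 
  summary()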

load("donuts.RData")
library(tidyverse)
library(stargazer)
donuts %>% 
  # choose and rename the columns
  select("Weight" = weight, "Donuts" = donuts_per_week) %>% 
  # stargazer doesn't play nice with tibbles so we coerce donuts to a data frame
  as.data.frame %>%
  # tell stargazer to produce an ASCII text version of the table
  stargazer(type = "text", 
              title = "Table 2.1", 
  # omit the default 1st and 3rd quartiles
              omit.summary.stat = c("p25", "p75"))

Table 2.1
=====================================
Statistic N   Mean   St. Dev. Min Max
-------------------------------------
Weight    13 172.000  76.200  70  310
Donuts    13  5.450   6.750    0  20 
-------------------------------------

We can return similar summary statistics with descr from the summarytools package.

donuts %>% 
  # choose the variables we'd like to summarize
  # here we select "away" from observation_number
  # descr will ignore non-numeric vectors
  select(Weight = weight, Donuts = donuts_per_week) %>% 
  summarytools::descr(stats = c("n.valid", "mean", "sd", "min", "max"),
                      round.digits = 3,
                      transpose = TRUE) %>% 
  DT::datatable(caption = htmltools::tags$caption(
                style = "caption-side: bottom; text-align: left;",
                "Table 2.1"),
                class = "cell-border stripe",
                colnames = c("N", "Mean", "St. Dev.", "Min", "Max"),
                options = list(dom = "t"))

3.2.2 Table 2.2

To reproduce Table 2.2 we will need to add a variable named male which will take on the value 1 for each observation in the data that represents a male and a 0 otherwise.

\[male = \begin{cases} 1 & \text{if observation is male} \\ 0 & \text{if observation is female} \end{cases}\]

We will make use of mutate() from the dplyr package. We use mutate() when we want to create a new variable from existing variables in the data frame.

load("donuts.RData")
donuts$name # this syntax reads "print the variable name from the donuts data frame".
 [1] "Homer"             "Marge"             "Lisa"             
 [4] "Bart"              "Comic Book Guy"    "Mr. Burns"        
 [7] "Smithers"          "Chief Wiggum"      "Principle Skinner"
[10] "Rev. Lovejoy"      "Ned Flanders"      "Patty"            
[13] "Selma"            

The numbers in square brackets [] indicate the position in the vector of the first observation on each line. Counting from [7], we see that Principle Skinner is in position 9.

Making use of donuts$name we see that the observations 1, 4, 5, 6, 7, 8, 9, 10, 11 are male and observations 2, 3, 12, 13 are not. We add the variable male to the donuts data frame as follows:

donuts$male <- c(1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0)
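
The same variable could instead be built with mutate(), as mentioned above. Here is a sketch, with the female names taken from the list above:

# a sketch: flag everyone not in the list of female names as male
donuts <- donuts %>% 
  mutate(male = if_else(name %in% c("Marge", "Lisa", "Patty", "Selma"), 0, 1))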

Call table to create a rudimentary version of Table 2.2:

donuts$male %>% 
  table()
.
0 1 
4 9 
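
If we want proportions rather than counts, we can pipe the table into prop.table:

donuts$male %>% 
  table() %>% 
  prop.table()
.
        0         1 
0.3076923 0.6923077 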

3.2.3 Table 2.5

To reproduce Table 2.5 we must first retrieve the data, which we will do directly from the agencies responsible for its collection. You can retrieve the data as a comma-separated values (csv) file, a plain text file in which the data are stored in a tabular format with values separated by commas.10

The crime data can be found on the U.S. Department of Justice Federal Bureau of Investigation Uniform Crime Reporting Statistics website. The single parent, urban, and poverty data can be found on the U.S. Census website.

An investigation of the CrimeOneYearofData.csv file shows that there is meta-information contained in the file along with the data. We could open the csv file in Excel and edit it to remove the information, or we could read it directly into R using read_csv from the readr package11 with options to ignore the meta-information. The readr package has many advantages over the base R read functions; see vignette("readr") for more information.12

We can view the file by clicking on it in the Files tab of the pane in the lower right corner. Choose the view file option and a new tab will open in the source pane.13 We see that the first nine rows are either blank or contain information about the data and that rows 63 to 183 contain footnotes and other additional information about the data. The names of the variables are in row ten of the csv file, so we will skip the first nine rows using the skip argument. We will choose the rows that contain the states and Washington, D.C., with the n_max option.

crime_one_year_of_data <- 
  # read the data from its location
  read_csv("Data/CrimeOneYearofData.csv", 
                               skip = 9, # start at row 10
                               n_max = 51) # use only 51 records
crime_one_year_of_data
# A tibble: 51 x 9
   State Population `Violent Crime ~ `Murder and non~ `Legacy rape ra~
   <chr>      <dbl>            <dbl>            <dbl>            <dbl>
 1 Alab~    4708708             450.              6.8             31.9
 2 Alas~     698473             633.              3.1             73.4
 3 Ariz~    6595778             426.              5.8             34.6
 4 Arka~    2889450             516.              6.2             47.7
 5 Cali~   36961664             473.              5.3             23.6
 6 Colo~    5024748             339.              3.2             45.2
 7 Conn~    3518288             301.              3               18.7
 8 Dela~     885122             645.              4.6             44.6
 9 Dist~     599657            1349.             24.2             25  
10 Flor~   18537969             612.              5.5             29.7
# ... with 41 more rows, and 4 more variables: `Revised rape rate /2` <lgl>,
#   `Robbery rate` <dbl>, `Aggravated assault rate` <dbl>, X9 <chr>

Similar to ggplot being based on the grammar of graphics, dplyr is a grammar of data manipulation. dplyr consists of a set of "verbs" to help solve common data manipulation problems. To learn more about dplyr read vignette("dplyr"), visit the dplyr website, or, for a good introduction, visit the data transformation chapter in R for Data Science.

Using glimpse from dplyr, we see that we have a tibble with 51 observations and 9 variables. glimpse is similar to str.

crime_one_year_of_data %>% 
  glimpse
Rows: 51
Columns: 9
$ State                                       <chr> "Alabama", "Alaska", "A...
$ Population                                  <dbl> 4708708, 698473, 659577...
$ `Violent Crime rate`                        <dbl> 450, 633, 426, 516, 473...
$ `Murder and nonnegligent manslaughter rate` <dbl> 6.8, 3.1, 5.8, 6.2, 5.3...
$ `Legacy rape rate /1`                       <dbl> 31.9, 73.4, 34.6, 47.7,...
$ `Revised rape rate /2`                      <lgl> NA, NA, NA, NA, NA, NA,...
$ `Robbery rate`                              <dbl> 133.1, 93.6, 124.3, 89....
$ `Aggravated assault rate`                   <dbl> 278.3, 463.2, 261.8, 37...
$ X9                                          <chr> NA, NA, NA, NA, NA, NA,...

We see that State is a character vector and Violent Crime rate is a numeric vector. Looking at the names of the variables, we can see they do not adhere to the stylistic guidelines discussed above: the State variable begins with a capital letter, and the Violent Crime rate variable has capital letters and spaces in its name (the spaces are why you see the backtick "`" before and after the name). The state names are spelled out, but to reproduce Figure 2.3 we need to change them to two-letter abbreviations.

To bring the names into line with the stylistic guidelines, we can use clean_names from the janitor package; snake case is the default conversion. Note the versatility of the %>% operator: had we not used it, the code below would have been written as glimpse(crime_one_year_of_data <- clean_names(crime_one_year_of_data))

library(janitor)
crime_one_year_of_data <- 
  crime_one_year_of_data %>% 
  clean_names() 
crime_one_year_of_data %>% 
  glimpse
Rows: 51
Columns: 9
$ state                                     <chr> "Alabama", "Alaska", "Ari...
$ population                                <dbl> 4708708, 698473, 6595778,...
$ violent_crime_rate                        <dbl> 450, 633, 426, 516, 473, ...
$ murder_and_nonnegligent_manslaughter_rate <dbl> 6.8, 3.1, 5.8, 6.2, 5.3, ...
$ legacy_rape_rate_1                        <dbl> 31.9, 73.4, 34.6, 47.7, 2...
$ revised_rape_rate_2                       <lgl> NA, NA, NA, NA, NA, NA, N...
$ robbery_rate                              <dbl> 133.1, 93.6, 124.3, 89.2,...
$ aggravated_assault_rate                   <dbl> 278.3, 463.2, 261.8, 372....
$ x9                                        <chr> NA, NA, NA, NA, NA, NA, N...

The variable names now follow good practice.

We will read the other data in a similar fashion.

# Source: U.S. Census Bureau, 2009 American Community Survey, Table C23008
acs_09_1yr_c23008_with_ann <- read_csv("Data/ACS_09_1YR_C23008_with_ann.csv", 
    skip = 1,
    n_max = 51) %>% 
  clean_names() 

acs_09_1yr_c23008_with_ann
# A tibble: 51 x 8
   id    id2   geography estimate_total estimate_under_~ estimate_under_~
   <chr> <chr> <chr>              <dbl>            <dbl>            <dbl>
 1 0400~ 01    Alabama          1059528           357122           141977
 2 0400~ 02    Alaska            174634            61489            20676
 3 0400~ 04    Arizona          1645424           591862           226729
 4 0400~ 05    Arkansas          670246           231937            91509
 5 0400~ 06    Californ~        8980424          3122142          1061375
 6 0400~ 08    Colorado         1177386           419960           116316
 7 0400~ 09    Connecti~         780490           244527            73174
 8 0400~ 10    Delaware          195738            67400            26102
 9 0400~ 11    District~         106701            40852            20741
10 0400~ 12    Florida          3853587          1311773           507149
# ... with 41 more rows, and 2 more variables: estimate_6_to_17_years <dbl>,
#   estimate_6_to_17_years_living_with_one_parent <dbl>

To create the percentage of children with single parents, we add those under 6 living with one parent to those between 6 and 17 living with one parent and divide by the estimated total. We create the new variable with the mutate verb from dplyr and select geography (renamed state) and percent_single_parents into a new data frame.

single_parents <- 
acs_09_1yr_c23008_with_ann %>% 
  mutate(percent_single_parents = 
           (estimate_under_6_years_living_with_one_parent + 
              estimate_6_to_17_years_living_with_one_parent) / 
           estimate_total) %>% 
  select(state = geography, percent_single_parents) 

single_parents
# A tibble: 51 x 2
   state                percent_single_parents
   <chr>                                 <dbl>
 1 Alabama                               0.389
 2 Alaska                                0.303
 3 Arizona                               0.365
 4 Arkansas                              0.378
 5 California                            0.324
 6 Colorado                              0.280
 7 Connecticut                           0.299
 8 Delaware                              0.376
 9 District of Columbia                  0.608
10 Florida                               0.377
# ... with 41 more rows
# Source: U.S. Census Bureau, 2009 American Community Survey, Table S1701
acs_09_1yr_s1701_with_ann <- 
  read_csv("Data/ACS_09_1YR_S1701_with_ann.csv", skip = 1, n_max = 51) %>% 
  clean_names() %>% 
  select("state" = geography, # directly name the variables when selected
         "percent_poverty" =
           percent_below_poverty_level_estimate_population_for_whom_poverty_status_is_determined)

acs_09_1yr_s1701_with_ann
# A tibble: 51 x 2
   state                percent_poverty
   <chr>                          <dbl>
 1 Alabama                         17.5
 2 Alaska                           9  
 3 Arizona                         16.5
 4 Arkansas                        18.8
 5 California                      14.2
 6 Colorado                        12.9
 7 Connecticut                      9.4
 8 Delaware                        10.8
 9 District of Columbia            18.4
10 Florida                         14.9
# ... with 41 more rows

To create the percent urban in 2009, we need to interpolate using the 2000 and 2010 censuses. After reading each set of data we will combine them into one data frame using right_join from the dplyr package. More on that below. First, let’s read the data.

# Source: U.S. Census Bureau, Table P002
dec_00_sf1_p002_with_ann <- read_csv("Data/DEC_00_SF1_P002_with_ann.csv", skip = 1) %>% 
  clean_names() %>%
  select(state = geography, total_00 = total , urban_00 = urban)

dec_00_sf1_p002_with_ann
# A tibble: 51 x 3
   state                total_00 urban_00
   <chr>                   <dbl>    <dbl>
 1 Alabama               4447100  2465673
 2 Alaska                 626932   411257
 3 Arizona               5130632  4523535
 4 Arkansas              2673400  1404179
 5 California           33871648 31989663
 6 Colorado              4301261  3633185
 7 Connecticut           3405565  2988059
 8 Delaware               783600   627758
 9 District of Columbia   572059   572059
10 Florida              15982378 14270020
# ... with 41 more rows
# Source: U.S. Census Bureau, Table H2
dec_10_sf1_p2_with_ann <- read_csv("Data/DEC_10_SF1_P2_with_ann.csv", skip = 1) %>% 
  clean_names() %>%
  select("state" = geography, "total_10" = total , "urban_10" = urban) 

dec_10_sf1_p2_with_ann
# A tibble: 51 x 3
   state                total_10         urban_10
   <chr>                <chr>               <dbl>
 1 Alabama              4779736(r38235)   2821804
 2 Alaska               710231(r38823)     468893
 3 Arizona              6392017           5740659
 4 Arkansas             2915918(r39193)   1637589
 5 California           37253956         35373606
 6 Colorado             5029196           4332761
 7 Connecticut          3574097           3144942
 8 Delaware             897934             747949
 9 District of Columbia 601723(r39494)     601723
10 Florida              18801310(r40184) 17139844
# ... with 41 more rows

Note the message returned by read_csv telling us how it parsed each column into an atomic vector type. In particular, note that Total in the 2010 data was parsed as a character vector. This means that at least one observation includes characters. In fact, we can see that at least three of the observations include parenthetical notes.

dec_10_sf1_p2_with_ann %>% 
  # pull extracts the column as a vector
  pull(total_10)
 [1] "4779736(r38235)"  "710231(r38823)"   "6392017"          "2915918(r39193)" 
 [5] "37253956"         "5029196"          "3574097"          "897934"          
 [9] "601723(r39494)"   "18801310(r40184)" "9687653(r41102)"  "1360301"         
[13] "1567582(r41542)"  "12830632"         "6483802"          "3046355"         
[17] "2853118"          "4339367"          "4533372"          "1328361"         
[21] "5773552(r42264)"  "6547629"          "9883640(r45127)"  "5303925"         
[25] "2967297"          "5988927"          "989415"           "1826341"         
[29] "2700551"          "1316470"          "8791894(r46246)"  "2059179(r46748)" 
[33] "19378102"         "9535483"          "672591"           "11536504"        
[37] "3751351"          "3831074"          "12702379"         "1052567"         
[41] "4625364"          "814180(r48166)"   "6346105"          "25145561(r48514)"
[45] "2763885"          "625741"           "8001024"          "6724540"         
[49] "1852994"          "5686986"          "563626"          

We have confirmed that the undesirable string has the same form in each position where it occurs: a parenthetical note appended to the number. We must remove those notes and coerce the variable to numeric to proceed. We can count how many of these notes occur using str_detect from the stringr package. str_detect returns a logical vector, so we need only sum the vector to count the occurrences. We need to remove the parentheses and everything between them.

When calling sum on a logical vector, TRUE is treated as 1 and FALSE as 0, so summing the vector "counts" the number of TRUE occurrences. A regular expression (regex or regexp) is a sequence of characters that defines a search pattern; to learn more, visit regexr.com. The pattern we are looking for here is \(.+\). Since the parenthesis is a special character in a regex, it must be escaped with \. The . is a wild card, and the + means one or more occurrences of the preceding character, so .+ matches anything that appears one or more times. The expression can thus be read: start with (, match anything that occurs one or more times, and end with ).
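
In R code the backslash itself must also be escaped inside a string, which is why the pattern below is written "\\(.+\\)". A minimal sketch on one of the values we saw above:

str_detect(c("601723(r39494)", "601723"), "\\(.+\\)")
[1]  TRUE FALSE
str_replace_all("601723(r39494)", "\\(.+\\)", "")
[1] "601723"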

str_detect(dec_10_sf1_p2_with_ann$total_10, "\\(.+\\)") %>% 
  sum()
[1] 13

The pattern occurs 13 times. We need to remove the string and coerce the character vector to a numeric vector. str_replace_all will replace all occurrences of the string (here with the empty string, removing them), and as.numeric will coerce the character vector to a numeric vector. We will make use of the "two-way" pipe operator %<>% (magrittr calls it the compound assignment pipe) in each function call. This operator takes the left-hand side, passes it to the function, and assigns the result back to the original vector, effectively overwriting it.
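
Because %<>% comes from the magrittr package and is not attached by library(tidyverse), we load magrittr explicitly. A minimal sketch of the operator:

library(magrittr)
x <- c(1, 4, 9)
x %<>% sqrt() # equivalent to x <- x %>% sqrt()
x
[1] 1 2 3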

# the  %<>% operator is a "two way" pipe that sends the result back to the left hand side.
dec_10_sf1_p2_with_ann$total_10 %<>% str_replace_all("\\(.+\\)", "") # "" replaces the string with blank
dec_10_sf1_p2_with_ann$total_10 %<>% as.numeric()
dec_10_sf1_p2_with_ann %>% 
 glimpse()
Rows: 51
Columns: 3
$ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", ...
$ total_10 <dbl> 4779736, 710231, 6392017, 2915918, 37253956, 5029196, 3574...
$ urban_10 <dbl> 2821804, 468893, 5740659, 1637589, 35373606, 4332761, 3144...

We see that total_10 is now a numeric vector.

We can now combine the two data frames using right_join from the dplyr package. Since each data frame contains state, right_join will join the columns from the 2010 census to the right of the 2000 census columns, matching observations by state. We will assign the result to urban.
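
To see the matching behavior on a small scale, here is a toy sketch with made-up data:

x <- tibble(state = c("A", "B", "C"), total_00 = 1:3)
y <- tibble(state = c("B", "C", "D"), total_10 = 4:6)
x %>% right_join(y, by = "state") # all rows of y are kept; unmatched values become NA
# A tibble: 3 x 3
  state total_00 total_10
  <chr>    <int>    <int>
1 B            2        4
2 C            3        5
3 D           NA        6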

urban <- dec_00_sf1_p002_with_ann %>% 
  right_join(dec_10_sf1_p2_with_ann)

urban 
# A tibble: 51 x 5
   state                total_00 urban_00 total_10 urban_10
   <chr>                   <dbl>    <dbl>    <dbl>    <dbl>
 1 Alabama               4447100  2465673  4779736  2821804
 2 Alaska                 626932   411257   710231   468893
 3 Arizona               5130632  4523535  6392017  5740659
 4 Arkansas              2673400  1404179  2915918  1637589
 5 California           33871648 31989663 37253956 35373606
 6 Colorado              4301261  3633185  5029196  4332761
 7 Connecticut           3405565  2988059  3574097  3144942
 8 Delaware               783600   627758   897934   747949
 9 District of Columbia   572059   572059   601723   601723
10 Florida              15982378 14270020 18801310 17139844
# ... with 41 more rows

We can now interpolate the 2009 observations from the 2000 and 2010 observations. Since 2009 is nine-tenths of the distance from 2000 to 2010, we add 9/10 of the difference between the two observations to the 2000 observation.
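
For example, using Alabama's numbers from the urban data frame above, the calculation works out as follows (this evaluates to roughly 58.7, matching the first row of the result below):

(0.9 * (2821804 - 2465673) + 2465673) /   # interpolated 2009 urban population
  (0.9 * (4779736 - 4447100) + 4447100) * # divided by interpolated 2009 total
  100                                     # expressed as a percent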

urban %<>% 
  mutate(percent_urban = (.9 * (urban_10 - urban_00) + urban_00) / 
           (.9 * (total_10 - total_00) + total_00) * 100) %>% 
  select(state, percent_urban)

urban
# A tibble: 51 x 2
   state                percent_urban
   <chr>                        <dbl>
 1 Alabama                       58.7
 2 Alaska                        66.0
 3 Arizona                       89.7
 4 Arkansas                      55.8
 5 California                    94.9
 6 Colorado                      86.0
 7 Connecticut                   88.0
 8 Delaware                      83.0
 9 District of Columbia         100  
10 Florida                       91.0
# ... with 41 more rows

We now have 4 data frames containing the information we need to create Table 2.5 and Figure 2.3. We will create a single data frame by joining the four data frames with the dplyr package.

crime_df <- crime_one_year_of_data %>% 
  right_join(single_parents) %>% 
  right_join(urban) %>% 
  right_join(acs_09_1yr_s1701_with_ann) %>% 
  select(state, violent_crime_rate, percent_single_parents,
         percent_urban, percent_poverty)

crime_df
# A tibble: 51 x 5
   state       violent_crime_r~ percent_single_pa~ percent_urban percent_poverty
   <chr>                  <dbl>              <dbl>         <dbl>           <dbl>
 1 Alabama                 450.              0.389          58.7            17.5
 2 Alaska                  633.              0.303          66.0             9  
 3 Arizona                 426.              0.365          89.7            16.5
 4 Arkansas                516.              0.378          55.8            18.8
 5 California              473.              0.324          94.9            14.2
 6 Colorado                339.              0.280          86.0            12.9
 7 Connecticut             301.              0.299          88.0             9.4
 8 Delaware                645.              0.376          83.0            10.8
 9 District o~            1349.              0.608         100              18.4
10 Florida                 612.              0.377          91.0            14.9
# ... with 41 more rows

Figure 2.3 includes state abbreviations rather than state names. We will change the names into abbreviations with the help of a built-in character vector: state.name is a character vector of state names, excluding Washington, D.C., built into R. We can concatenate that vector with the character string "District of Columbia", sort the new character vector alphabetically, convert the names to abbreviations with state2abbr from the openintro package, and assign the result to the state vector in the crime_df data frame.

library(openintro)
state_abb <- c(state.name, "District of Columbia") %>% 
  sort() %>% 
  state2abbr()
crime_df$state <- state_abb  
crime_df %>% glimpse
Rows: 51
Columns: 5
$ state                  <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "D...
$ violent_crime_rate     <dbl> 450, 633, 426, 516, 473, 339, 301, 645, 1349...
$ percent_single_parents <dbl> 0.389, 0.303, 0.365, 0.378, 0.324, 0.280, 0....
$ percent_urban          <dbl> 58.7, 66.0, 89.7, 55.8, 94.9, 86.0, 88.0, 83...
$ percent_poverty        <dbl> 17.5, 9.0, 16.5, 18.8, 14.2, 12.9, 9.4, 10.8...

We proceed as we did with Table 2.1 to reproduce Table 2.5. This time we will use the gt package.

crime_df %>% 
  select("Percent poverty" = percent_poverty,
         "Percent single parents" = percent_single_parents,
         "Percent urban" = percent_urban,
         "Violent crime rate (per 100,00 people)" = violent_crime_rate) %>%
  summarytools::descr(stats = c("n.valid", "mean", "sd", "min", "max"),
                      round.digits = 2,
                      transpose = TRUE) %>%
  as_tibble() %>% 
  mutate(Variable = c("Percent poverty",
         "Percent single parents",
         "Percent urban",
         "Violent crime rate (per 100,00 people)"),
         .before = N.Valid) %>% 
  rename(N = N.Valid) %>% 
  arrange(desc(Variable)) %>% 
  gt() %>% 
  tab_header(title = "Table 2.5 Descriptive Statistics for State Crime Data") %>% 
  opt_align_table_header(align = "left") 
Table 2.5 Descriptive Statistics for State Crime Data
Variable                                  N   Mean     Std.Dev    Min      Max
Violent crime rate (per 100,000 people)   51  406.535  205.6164   119.900  1348.900
Percent urban                             51   73.932   14.9185    38.811   100.000
Percent single parents                    51    0.332    0.0642     0.185     0.608
Percent poverty                           51   13.855    3.1106     8.500    21.900

3.2.4 Figure 2.3

We will use ggplot from the ggplot2 package to reproduce Figure 2.3. We will use plot_grid from the cowplot package to combine the three plots into a grid after we create them individually.

plot_urban <- 
  crime_df %>% 
  ggplot(aes(x = percent_urban, y = violent_crime_rate)) +
  labs(x = "Percent urban\n(0-to-100 scale)", # \n creates a new line
       y = "Violent\ncrime\nrate\n(per\n100,000\npeople)") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) + # creates a sequence from 200 to 1200 by 200
  scale_x_continuous(breaks = seq(40, 100, 10)) + # creates a sequence from 40 to 100 by 10
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

plot_single <- 
  crime_df %>% 
  ggplot(aes(x = percent_single_parents, y = violent_crime_rate)) +
  labs(x = "Percent single parent\n(0-to-1 scale)", # \n creates a new line
       y = "") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) +
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

plot_poverty <- 
  crime_df %>% 
  ggplot(aes(x = percent_poverty, y = violent_crime_rate)) +
  labs(x = "Percent poverty\n(0-to-100 scale)", # \n creates a new line
       y = "") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) +
  scale_x_continuous(breaks = seq(8, 22, 2)) +
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

library(cowplot)
plot_grid(plot_urban, plot_single, plot_poverty,  ncol = 3)

FIGURE 2.3: Scatterplots of Violent Crime against Percent Urban, Single Parent, and Poverty

3.3 Computing Corner

3.3.1 Reading Data

There are packages available to read data formatted in a variety of ways into R. A partial list:

Package    Type of Data
haven      SPSS, Stata, & SAS
httr       Web APIs
readxl     .xls & .xlsx sheets
rvest      web scraping
jsonlite   JSON
DBI        relational databases
xml2       XML

Data can also be imported using the Import Dataset icon in the Environment/History pane. When learning to import data, this method can be useful because it generates the code necessary to import the data, which you can then paste into your source file.

3.3.2 Manually Entering Data

In Chapter 1 we saw that we can also enter data into R directly (manually). Below is the appropriate syntax for doing so. The function c() combines, or concatenates, its arguments into a vector.

name <- c("Homer", "Marge", "Lisa", "Bart", "Comic Book Guy", "Mr. Burns", "Smithers", "Chief Wiggum", "Principle Skinner", "Rev. Lovejoy", "Ned Flanders", "Patty", "Selma")
donuts_per_week <- c(14, 0, 0, 5, 20, 0.75, 0.25, 16, 3, 2, 0.8, 5, 4)
weight <- c(275, 141, 70, 75, 310, 80, 160, 263, 205, 185, 170, 155, 145)

We can combine these vectors into a single "file" called a data frame as follows:

Donuts <- data.frame(name, donuts_per_week, weight)
Donuts %>% 
  str()
'data.frame':	13 obs. of  3 variables:
 $ name           : chr  "Homer" "Marge" "Lisa" "Bart" ...
 $ donuts_per_week: num  14 0 0 5 20 0.75 0.25 16 3 2 ...
 $ weight         : num  275 141 70 75 310 80 160 263 205 185 ...
# could also use the modern version of data.frame
tibble(name, donuts_per_week, weight) %>% 
  str()
tibble [13 x 3] (S3: tbl_df/tbl/data.frame)
 $ name           : chr [1:13] "Homer" "Marge" "Lisa" "Bart" ...
 $ donuts_per_week: num [1:13] 14 0 0 5 20 0.75 0.25 16 3 2 ...
 $ weight         : num [1:13] 275 141 70 75 310 80 160 263 205 185 ...

The character vector “name” is coerced to a Factor by default. Factors in R are store as a vector of integer values with a corresponding set of character values. We will see that Factors are very useful, however, in this case we want name to remain a character vector. If we add the option stringsAsFactors = FALSE to our call of data.frame we can prevent the coercion. Our we can call tibble as described in Chapter 1.

Supplying the argument explicitly works in any version of R:

Donuts <- data.frame(name, donuts_per_week, weight, stringsAsFactors = FALSE)
Donuts %>% str()
'data.frame':	13 obs. of  3 variables:
 $ name           : chr  "Homer" "Marge" "Lisa" "Bart" ...
 $ donuts_per_week: num  14 0 0 5 20 0.75 0.25 16 3 2 ...
 $ weight         : num  275 141 70 75 310 80 160 263 205 185 ...

You can see in the Global Environment tab of the Environment/History pane that you have an object named Donuts with 13 observations of 3 variables. In addition, you can see each of the individual vectors listed under Values as well.

3.3.3 Simple Statistics

R has many built-in functions for getting basic statistics on data. For example, to get the mean of a variable, call mean(). Be aware that if there are missing values in the data, the function call will return NA as its result. NA is contagious, i.e., if any of the observations are NA, function calls will generally return NA. The upshot is that this forces you to have a strategy for dealing with NAs. The donuts data frame contains no missing values (NA), so this won't be a problem here. Some simple exploratory data analysis will let you know if you have any issues in the data. We saw one of those problems above when we had a variable that we thought was numeric but was read in as a character vector. summary is a good place to start.
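
For example:

mean(c(1, 2, NA)) # NA is contagious
[1] NA
mean(c(1, 2, NA), na.rm = TRUE) # remove NAs before computing
[1] 1.5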

donuts %>% 
  summary()
 observation_number     name           donuts_per_week     weight   
 Min.   : 1         Length:13          Min.   : 0.00   Min.   : 70  
 1st Qu.: 4         Class :character   1st Qu.: 0.75   1st Qu.:141  
 Median : 7         Mode  :character   Median : 3.00   Median :160  
 Mean   : 7                            Mean   : 5.45   Mean   :172  
 3rd Qu.:10                            3rd Qu.: 5.00   3rd Qu.:205  
 Max.   :13                            Max.   :20.00   Max.   :310  
      male      
 Min.   :0.000  
 1st Qu.:0.000  
 Median :1.000  
 Mean   :0.692  
 3rd Qu.:1.000  
 Max.   :1.000  

The summarytools package has a variety of tools to expand on the capabilities of summary.

summarytools::descr(donuts)
Descriptive Statistics  

                    donuts_per_week     male   observation_number   weight
----------------- ----------------- -------- -------------------- --------
             Mean              5.45     0.69                 7.00   171.85
          Std.Dev              6.75     0.48                 3.89    76.16
              Min              0.00     0.00                 1.00    70.00
               Q1              0.75     0.00                 4.00   141.00
           Median              3.00     1.00                 7.00   160.00
               Q3              5.00     1.00                10.00   205.00
              Max             20.00     1.00                13.00   310.00
              MAD              3.34     0.00                 4.45    66.72
              IQR              4.25     1.00                 6.00    64.00
               CV              1.24     0.69                 0.56     0.44
         Skewness              1.04    -0.74                 0.00     0.30
      SE.Skewness              0.62     0.62                 0.62     0.62
         Kurtosis             -0.54    -1.56                -1.48    -1.13
          N.Valid             13.00    13.00                13.00    13.00
        Pct.Valid            100.00   100.00               100.00   100.00

Note that descr ignores any character vectors.

We can get a table formatted for R Markdown with the argument style = "rmarkdown":

donuts %>% 
  summarytools::descr(style = "rmarkdown")
### Descriptive Statistics  
#### donuts  
**N:** 13  

|          &nbsp; | donuts_per_week |   male | observation_number | weight |
|----------------:|----------------:|-------:|-------------------:|-------:|
|        **Mean** |            5.45 |   0.69 |               7.00 | 171.85 |
|     **Std.Dev** |            6.75 |   0.48 |               3.89 |  76.16 |
|         **Min** |            0.00 |   0.00 |               1.00 |  70.00 |
|          **Q1** |            0.75 |   0.00 |               4.00 | 141.00 |
|      **Median** |            3.00 |   1.00 |               7.00 | 160.00 |
|          **Q3** |            5.00 |   1.00 |              10.00 | 205.00 |
|         **Max** |           20.00 |   1.00 |              13.00 | 310.00 |
|         **MAD** |            3.34 |   0.00 |               4.45 |  66.72 |
|         **IQR** |            4.25 |   1.00 |               6.00 |  64.00 |
|          **CV** |            1.24 |   0.69 |               0.56 |   0.44 |
|    **Skewness** |            1.04 |  -0.74 |               0.00 |   0.30 |
| **SE.Skewness** |            0.62 |   0.62 |               0.62 |   0.62 |
|    **Kurtosis** |           -0.54 |  -1.56 |              -1.48 |  -1.13 |
|     **N.Valid** |           13.00 |  13.00 |              13.00 |  13.00 |
|   **Pct.Valid** |          100.00 | 100.00 |             100.00 | 100.00 |

We confirmed that there are no missing values in our data. If there were, we could easily deal with them with the argument na.rm = TRUE.

donuts$weight %>% 
  mean(na.rm = TRUE)  
[1] 172

Call var or sd to return the sample variance or sample standard deviation. Of course, the standard deviation can also be calculated by calling sqrt on the result of the var call. The minimum and maximum values are returned by calling min and max. There are multiple ways to retrieve the number of observations of a variable or data frame: as described in the text, we can call sum on the result of is.finite; nrow returns the number of observations in a data frame, while NROW returns the number of observations of a vector or single variable. A sketch of the is.finite approach follows the output below.

donuts$weight %>% var()
[1] 5800
donuts$weight %>% var() %>% sqrt()
[1] 76.2
donuts$weight %>% sd()
[1] 76.2
donuts$weight %>% min()
[1] 70
donuts$weight %>% max()
[1] 310
donuts$weight %>% NROW()
[1] 13
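
And, as described above, counting observations by summing the result of is.finite:

donuts$weight %>% is.finite() %>% sum() # counts the non-missing observations
[1] 13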

3.3.4 Subsetting

Subsetting in R can be accomplished in a variety of ways. In base R, use the [] syntax. Brackets can be used to select specific rows and columns from a matrix or data frame. To return the observation in the 12th row and 3rd column, call donuts[12,3]. To return all of the observations in a specific row or column, leave the row or column number out of the call: to return all of the observations from the 3rd column, call donuts[,3]; to return the observations for an individual record, say the 4th row, call donuts[4,]. To choose (subset) all of those records where, e.g., donuts eaten per week is 0, call donuts[donuts$donuts_per_week == 0,]; to choose all those records where donuts eaten are not equal to 0, call donuts[donuts$donuts_per_week != 0,]. We can also subset using filter from dplyr. An advantage of subsetting with dplyr is that the resulting tibble can be piped into another function call.

donuts[12,3]
# A tibble: 1 x 1
  donuts_per_week
            <dbl>
1               5
donuts[,3]
# A tibble: 13 x 1
   donuts_per_week
             <dbl>
 1           14   
 2            0   
 3            0   
 4            5   
 5           20   
 6            0.75
 7            0.25
 8           16   
 9            3   
10            2   
11            0.8 
12            5   
13            4   
donuts[4,]
# A tibble: 1 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  4 Bart                5     75     1
donuts[donuts$donuts_per_week == 0,]
# A tibble: 2 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  2 Marge               0    141     0
2                  3 Lisa                0     70     0
donuts[donuts$donuts_per_week != 0,]
# A tibble: 11 x 5
   observation_number name              donuts_per_week weight  male
                <int> <chr>                       <dbl>  <dbl> <dbl>
 1                  1 Homer                       14       275     1
 2                  4 Bart                         5        75     1
 3                  5 Comic Book Guy              20       310     1
 4                  6 Mr. Burns                    0.75     80     1
 5                  7 Smithers                     0.25    160     1
 6                  8 Chief Wiggum                16       263     1
 7                  9 Principle Skinner            3       205     1
 8                 10 Rev. Lovejoy                 2       185     1
 9                 11 Ned Flanders                 0.8     170     1
10                 12 Patty                        5       155     0
11                 13 Selma                        4       145     0
donuts %>% 
  filter(donuts_per_week == 0)
# A tibble: 2 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  2 Marge               0    141     0
2                  3 Lisa                0     70     0
donuts %>% 
  filter(donuts_per_week != 0)
# A tibble: 11 x 5
   observation_number name              donuts_per_week weight  male
                <int> <chr>                       <dbl>  <dbl> <dbl>
 1                  1 Homer                       14       275     1
 2                  4 Bart                         5        75     1
 3                  5 Comic Book Guy              20       310     1
 4                  6 Mr. Burns                    0.75     80     1
 5                  7 Smithers                     0.25    160     1
 6                  8 Chief Wiggum                16       263     1
 7                  9 Principle Skinner            3       205     1
 8                 10 Rev. Lovejoy                 2       185     1
 9                 11 Ned Flanders                 0.8     170     1
10                 12 Patty                        5       155     0
11                 13 Selma                        4       145     0

We can subset on more than one variable as well. Using base R, we can choose all those males who consumed some donuts per week by calling donuts[donuts$donuts_per_week != 0 & donuts$male == 1,]. We can choose all those observations where donut consumption per week is more than 15 or the person is female by using the or operator | in the call donuts[donuts$donuts_per_week > 15 | donuts$male != 1,]. filter can be used as well.

donuts[donuts$donuts_per_week != 0 & donuts$male == 1,]
# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
donuts %>% 
  filter(donuts_per_week != 0 & male == 1)
# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
# a slightly more intuitive alternative is:
donuts %>% 
  filter(donuts_per_week != 0, male == 1) 
# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
donuts[donuts$donuts_per_week > 15 | donuts$male != 1,]
# A tibble: 6 x 5
  observation_number name           donuts_per_week weight  male
               <int> <chr>                    <dbl>  <dbl> <dbl>
1                  2 Marge                        0    141     0
2                  3 Lisa                         0     70     0
3                  5 Comic Book Guy              20    310     1
4                  8 Chief Wiggum                16    263     1
5                 12 Patty                        5    155     0
6                 13 Selma                        4    145     0
donuts %>% 
  filter(donuts_per_week > 15 | male != 1)
# A tibble: 6 x 5
  observation_number name           donuts_per_week weight  male
               <int> <chr>                    <dbl>  <dbl> <dbl>
1                  2 Marge                        0    141     0
2                  3 Lisa                         0     70     0
3                  5 Comic Book Guy              20    310     1
4                  8 Chief Wiggum                16    263     1
5                 12 Patty                        5    155     0
6                 13 Selma                        4    145     0

We can modify Figure 2.2 to include only males by piping the filtered results into ggplot.

library(ggrepel)
donuts %>% 
  filter(male == 1) %>% 
  ggplot(mapping = aes(x = donuts_per_week, y = weight)) + 
  geom_vline(xintercept = 0, color = "gray80", size = 1) +
  geom_point(color = "blue", size = 2) + 
  labs(x = "Donuts", 
       y = "Weight\n(in pounds)", # \n creates a new line
       caption = "Figure 2.2: Weight and Donuts in Springfield") + 
  geom_text_repel(aes(label = name), color = "blue") +
  theme(axis.title.y = element_text(angle = 0), 
          panel.grid = element_blank(), 
          panel.background = element_blank(), 
          axis.line = element_line(), 
          plot.caption = element_text(hjust = 0)) 


  10. These types of files are often referred to as flat files.

  11. readr is part of the Tidyverse.

  12. All of the text’s data files are available in csv format, so we will make repeated use of read_csv.

  13. We could also open the file with either Excel or a text editor.