Chapter 4 Stats in the Wild: Good Data Practices

4.1 Introduction

I will introduce some additional R commands and packages through reproducing Table 2.1, Table 2.2, Table 2.5, and Figure 2.3. In addition, we will go through the Computing Corner.

4.2 Table and Figure Reproduction

4.2.1 Table 2.1

Since we saved the data frame we created in Chapter 1 as donuts.RData, we will load the file into global environment. We are only interested in the summary statistics for Weight and Donuts. We can get a basic set of summary statistics by calling summary on the data frame. But, stargazer from the stargazer package. The stargazer produces well formatted tables in LaTex code, HTML code, and ASCII text.

We will make use of the pipe operator from the magrittr package (also part of the tidyverse), as well. The pipe operator %>% (ctr-shift-m shortcut in R Studio) allows for more intuitive reading of code especially when nesting commands inside of each other. Take a simple example of finding calling the str command on a data frame, df. Without the pipe operator %>%, we would call the command like this str(df) and you might read this aloud alike this find the structure of df. With the pipe operator, call the command like this df %>% str(). Read aloud it might be something like this “take the df data and find its structure.” The pipe operator really shines when functions are nested together, as we shall see below.


Table 2.1
=====================================
Statistic N   Mean   St. Dev. Min Max
-------------------------------------
Weight    13 172.000  76.200  70  310
Donuts    13  5.450   6.750    0  20 
-------------------------------------

4.2.2 Table 2.2

To reproduce Table 2.2 we will need to add a variable named male which will take on the value 1 for each observation in the data that represents a male and a 0 otherwise.

 [1] "Homer"             "Marge"             "Lisa"             
 [4] "Bart"              "Comic Book Guy"    "Mr. Burns"        
 [7] "Smithers"          "Chief Wiggum"      "Principle Skinner"
[10] "Rev. Lovejoy"      "Ned Flanders"      "Patty"            
[13] "Selma"            

Making use of donuts$name we see that the observations 1, 4, 5, 6, 7, 8, 9, 10, 11 are male and observations 2, 3, 12, 13 are not. We add the variable male to the donuts data frame as follows:

Call table to create a rudimentary version of Table 2.2

.
0 1 
4 9 

4.2.3 Table 2.5

To reproduce Table 2.5 we must first retrieve the data. We will retrieve the data directly from the agencies responsible for their collection. You can retrieve the data as a comma-separated values (csv) file. A csv file is a plain text file in which the data are stored in a tabular format with values separated by a comma.

The crime data can be found on the U.S. Department of Justice Federal Bureau of Investigation Uniform Crime Reporting Statistics website. The single parent, urban, and poverty data can found on the U.S. Census website.

An investigation of the CrimeOneYearofData.csv file shows that there is meta information contained in the file along with the data. We could open the csv file in Excel and edited to remove the information or we could read it directly into R using read_csv from the readr package with options to ignore the meta information. The readr package has many advantages over the base R read functions, see vignette("readr") for more information. All of the text’s data files are available in csv format, so we will make repeated use of read-csv. Investigation of the file with either Excel or text editor, shows the first nine rows are either blank or contain information about the data. Rows 63 to 183 contain footnotes and other additional information about the data. The names of the variables are in row ten of the csv file; so, we will skip the first nine rows using the option skip. We can choose the rows that contain the states and Washington, D.C., with the n_max option. Also, we need only the columns that contain the state names and the violent crime numbers. After reading in the data we will use select from the dplyr package.

Similar to ggplot being based on the grammar of graphics, dplyr is a grammar of data manipulation. dplyr consists of a set of “verbs” to help solve common data manipulation problems. To learn more about dplyr read vignette("dplyr"), visit dplyr, or for a good introduction visit the data import chapter in R for Data Science. We will make use of the pipe operator from the magrittr package (also part of the tidyverse), as well.

Using glimpse from dplyr, we see that we have a tibble with 51 observations and 2 variables. glimpse is similar to str.

Observations: 51
Variables: 2
$ State                <chr> "Alabama", "Alaska", "Arizona", "Arkansas...
$ `Violent Crime rate` <dbl> 450, 633, 426, 516, 473, 339, 301, 645, 1...

We can see that State is a character vector and Violent Crime Rate is a numeric vector. Looking at the names of the variables we can see they do not adhere to the stylistic guidelines discussed above. The State variable begins with a capital letter and the Violent Crime Variable has capital letters and spaces in its name (the spaces are why you see the tick mark “`” before and after the name). The state names are spelled out, but to reproduce Figure 2.3 we need to change those to two-letter abbreviations.

To bring the names into stylistic guidelines we can use clean_names from the janitor package, snake case is the default conversion. Note, the versatility of the %>% operator. If we did not use the %>% operator, the code would have been written as glimpse(crime_one_year_of_data <- clean_names(crime_one_year_of_data))

Observations: 51
Variables: 2
$ state              <chr> "Alabama", "Alaska", "Arizona", "Arkansas",...
$ violent_crime_rate <dbl> 450, 633, 426, 516, 473, 339, 301, 645, 134...

We will read the other data in a similar fashion.

Observations: 51
Variables: 8
$ id                                            <chr> "0400000US01", "...
$ id2                                           <chr> "01", "02", "04"...
$ state                                         <chr> "Alabama", "Alas...
$ estimate_total                                <dbl> 1059528, 174634,...
$ estimate_under_6_years                        <dbl> 357122, 61489, 5...
$ estimate_under_6_years_living_with_one_parent <dbl> 141977, 20676, 2...
$ estimate_6_to_17_years                        <dbl> 702406, 113145, ...
$ estimate_6_to_17_years_living_with_one_parent <dbl> 270077, 32250, 3...

To create the percentage of children with single parents, add those under 6 living with one parent to those between 6 and 17 living with one parent and divide by the estimated total. We create the new variable with the mutate verb from dplyr and select state and percent with single parents into a new data frame.

Observations: 51
Variables: 2
$ state                  <chr> "Alabama", "Alaska", "Arizona", "Arkans...
$ percent_single_parents <dbl> 0.389, 0.303, 0.365, 0.378, 0.324, 0.28...
Observations: 51
Variables: 2
$ state           <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "C...
$ percent_poverty <dbl> 17.5, 9.0, 16.5, 18.8, 14.2, 12.9, 9.4, 10.8, ...

To create the percent urban in 2009, we need to interpolate using the 2000 and 2010 censuses. After reading each set of data we will combine them into one data frame using right_join from the dplyr package.

Observations: 51
Variables: 3
$ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californ...
$ total_00 <dbl> 4447100, 626932, 5130632, 2673400, 33871648, 4301261,...
$ urban_00 <dbl> 2465673, 411257, 4523535, 1404179, 31989663, 3633185,...
Observations: 51
Variables: 3
$ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californ...
$ total_10 <chr> "4779736(r38235)", "710231(r38823)", "6392017", "2915...
$ urban_10 <dbl> 2821804, 468893, 5740659, 1637589, 35373606, 4332761,...

Note that total_10 from the 2010 census is a character vector. This means that there is at least one observation that includes characters. In fact, we can see at least 3 of the observations include parenthetical notes. print the variable to the screen to confirm each of the patterns is of the form “(some text)”.

 [1] "4779736(r38235)"  "710231(r38823)"   "6392017"         
 [4] "2915918(r39193)"  "37253956"         "5029196"         
 [7] "3574097"          "897934"           "601723(r39494)"  
[10] "18801310(r40184)" "9687653(r41102)"  "1360301"         
[13] "1567582(r41542)"  "12830632"         "6483802"         
[16] "3046355"          "2853118"          "4339367"         
[19] "4533372"          "1328361"          "5773552(r42264)" 
[22] "6547629"          "9883640(r45127)"  "5303925"         
[25] "2967297"          "5988927"          "989415"          
[28] "1826341"          "2700551"          "1316470"         
[31] "8791894(r46246)"  "2059179(r46748)"  "19378102"        
[34] "9535483"          "672591"           "11536504"        
[37] "3751351"          "3831074"          "12702379"        
[40] "1052567"          "4625364"          "814180(r48166)"  
[43] "6346105"          "25145561(r48514)" "2763885"         
[46] "625741"           "8001024"          "6724540"         
[49] "1852994"          "5686986"          "563626"          

We have confirmed that undesirable string has the same form in each position it exists. We must remove those comments and coerce the variable to numeric to proceed. We can determine how many instances of these comments occur using str_detect from the stringr package. str_detect will return a logical vector, so we need only sum the vector to count the number of times this occurs.

When calling sum on a logical vector, TRUE is treated as 1 and FALSE as 0, so summing the vector “counts” the number of TRUE occurrences. A regular expression, regex or regexp, is a sequence of characters that define a search pattern, to learn more visit regexr.com. The pattern we are looking for here is given by “\(.+\)”. Since the parenthesis is a special character, it must be escaped with \, the . is a wild card, the + means 1 or more, so the .+ means find anything that appears 1 or more times. So the expression can be read as start with ( find anything which occurs one or more times and end with ).

[1] 13

The pattern occurs 13 times. We need to remove the string and coerce the character vector to a numeric vector. str_replace_all will remove all occurrences of the string. as.numeric will coerce the character vector to a numeric vector. We will make use of the “two-way” pipe operator %<>% in each function call. This operator takes the left hand side passes it to to the function and returns the result back to the original vector effectively overwriting it.

Observations: 51
Variables: 3
$ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californ...
$ total_10 <dbl> 4779736, 710231, 6392017, 2915918, 37253956, 5029196,...
$ urban_10 <dbl> 2821804, 468893, 5740659, 1637589, 35373606, 4332761,...

We see that total_10 is now a numeric vector.

We can now combine the two data frames using right_join from the dplyr package. Since each data frame has the state variable, right_join will add the columns from the 2010 census to the end (right) of the 2000 census matching observations by state. We will assign the result to percent_urban.

Observations: 51
Variables: 5
$ state    <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californ...
$ total_00 <dbl> 4447100, 626932, 5130632, 2673400, 33871648, 4301261,...
$ urban_00 <dbl> 2465673, 411257, 4523535, 1404179, 31989663, 3633185,...
$ total_10 <dbl> 4779736, 710231, 6392017, 2915918, 37253956, 5029196,...
$ urban_10 <dbl> 2821804, 468893, 5740659, 1637589, 35373606, 4332761,...

We can, now, interpolate the 2009 observations from the 2000 and 2010 observations. Since 2009 is nine tenths of the distance to 2010 from 2000, we will add 9/10 of the difference between the two observations to the 2000 observation.

We now have 4 data frames containing the information we need to create Table 2.5 and Figure 2.3. We will create a one data frame by joining the four data frames using the dplyr package.

Observations: 51
Variables: 5
$ state                  <chr> "Alabama", "Alaska", "Arizona", "Arkans...
$ violent_crime_rate     <dbl> 450, 633, 426, 516, 473, 339, 301, 645,...
$ percent_single_parents <dbl> 0.389, 0.303, 0.365, 0.378, 0.324, 0.28...
$ percent_urban          <dbl> 58.7, 66.0, 89.7, 55.8, 94.9, 86.0, 88....
$ percent_poverty        <dbl> 17.5, 9.0, 16.5, 18.8, 14.2, 12.9, 9.4,...

Figure 2.3 includes state abbreviations rather than state names. We will change the names into abbreviations with the help of a built in character vector. state.name is character vector of state names, excluding Washington DC, built into R. We can concatenate that vector with the character string “District of Columbia”, sort the new character vector alphabetically, convert the names to abbreviations with state2abbr from the openintro package, and assign the result to the state vector in the crime_one_year_of_data data frame.

Observations: 51
Variables: 5
$ state                  <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT...
$ violent_crime_rate     <dbl> 450, 633, 426, 516, 473, 339, 301, 645,...
$ percent_single_parents <dbl> 0.389, 0.303, 0.365, 0.378, 0.324, 0.28...
$ percent_urban          <dbl> 58.7, 66.0, 89.7, 55.8, 94.9, 86.0, 88....
$ percent_poverty        <dbl> 17.5, 9.0, 16.5, 18.8, 14.2, 12.9, 9.4,...

We proceed as we did with Table 2.1 to reproduce Table 2.3.


Table 2.3
============================================================================
Statistic                              N   Mean   St. Dev.   Min      Max   
----------------------------------------------------------------------------
Violent crime rate (per 100,00 people) 51 407.000 206.000  120.000 1,349.000
Percent single parents                 51  0.332   0.064    0.185    0.608  
Percent urban                          51 73.900   14.900  38.800   100.000 
Percent poverty                        51 13.900   3.110    8.500   21.900  
----------------------------------------------------------------------------

4.2.4 Figure 2.3

We will use ggplot from the ggplot2 package to reproduce Figure 2.3. We will use the plot_grd from cowplot package to create a grid of the three individual plots after we create them individually.

plot_urban <- 
  crime_df %>% 
  ggplot(aes(x = percent_urban, y = violent_crime_rate)) +
  labs(x = "Percent urban\n(0-to-100 scale)", # \n creates a new line
       y = "Violent\ncrime\nrate\n(per\n100,000\npeople)") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) + # creates a sequence from 200 to 1200 by 200
  scale_x_continuous(breaks = seq(40, 100, 10)) + # creates a sequence from 40 to 100 by 10
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

plot_single <- 
  crime_df %>% 
  ggplot(aes(x = percent_single_parents, y = violent_crime_rate)) +
  labs(x = "Percent single parent\n(0-to-1 scale)", # \n creates a new line
       y = "") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) +
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

plot_poverty <- 
  crime_df %>% 
  ggplot(aes(x = percent_poverty, y = violent_crime_rate)) +
  labs(x = "Percent poverty\n(0-to-100 scale)", # \n creates a new line
       y = "") +
  geom_text(aes(label = state), color = "blue") +
  scale_y_continuous(breaks = seq(200, 1200, 200)) +
  scale_x_continuous(breaks = seq(8, 22, 2)) +
  theme(axis.title.y = element_text(angle = 0), 
        panel.grid = element_blank(), 
        panel.background = element_blank(), 
        axis.line = element_line())

library(cowplot)
plot_grid(plot_urban, plot_single, plot_poverty,  ncol = 3)

FIGURE 2.3: Scatterplots of Violent Crime against Percent Urban, Single Parent, and Poverty

4.3 Computing Center

4.3.1 Reading Data

There are packages available to read data formatted in a variety of ways into R. Data can also be imported using the Import Dataset icon in the Environment/History pane. When learning to import data, this method can be useful as it will create the command line necessary to import the data, which you can then paste into your R Script of R Markdown file.

4.3.2 Manually Entering Data

In Chapter 1 we saw that we can directly (manually) enter data into R as well. Below is the appropriate syntax for doing so.

We can combine this into a single “file” called a data frame as follows:

'data.frame':   13 obs. of  3 variables:
 $ name           : Factor w/ 13 levels "Bart","Chief Wiggum",..: 4 6 5 1 3 7 13 2 10 11 ...
 $ donuts_per_week: num  14 0 0 5 20 0.75 0.25 16 3 2 ...
 $ weight         : num  275 141 70 75 310 80 160 263 205 185 ...

The character vector “name” is coerced to a Factor by default. Factors in R are store as a vector of integer values with a corresponding set of character values. We will see that Factors are very useful, however, in this case we want name to remain a character vector. If we add the option stringsAsFactors = FALSE to our call of data.frame we can prevent the coercion. Our we can call tibble as described in Chapter 1.

'data.frame':   13 obs. of  3 variables:
 $ name           : chr  "Homer" "Marge" "Lisa" "Bart" ...
 $ donuts_per_week: num  14 0 0 5 20 0.75 0.25 16 3 2 ...
 $ weight         : num  275 141 70 75 310 80 160 263 205 185 ...

You can see in the Global Environment tab of the Environment/History pane that you have an object named Donuts that has 13 observations on 3 variables. In addition, you can see under values you have each variable as well.

4.3.3 Simple Statistics

R has many built in calls to get basic statistics on data. For example, to get the mean of a variable call mean(). Be aware, that if there are missing values in the data the function call will return NA as it’s result. The donuts data frame contains to missing values “NA”, so it won’t be a problem. Some simple exploratory data analysis will let you know if you have any issues in the data. We saw one of those problems above when we had a variable that we thought was numeric, but was read in as a character vector. summary is a good place to start.

 observation_number     name           donuts_per_week     weight   
 Min.   : 1         Length:13          Min.   : 0.00   Min.   : 70  
 1st Qu.: 4         Class :character   1st Qu.: 0.75   1st Qu.:141  
 Median : 7         Mode  :character   Median : 3.00   Median :160  
 Mean   : 7                            Mean   : 5.45   Mean   :172  
 3rd Qu.:10                            3rd Qu.: 5.00   3rd Qu.:205  
 Max.   :13                            Max.   :20.00   Max.   :310  
      male      
 Min.   :0.000  
 1st Qu.:0.000  
 Median :1.000  
 Mean   :0.692  
 3rd Qu.:1.000  
 Max.   :1.000  

We confirmed that there are no missing values in our data. If there were, we can easily deal with them with a option in the function call (I include the option below.)

[1] 172

Call var or sd to return the sample variance or sample standard deviation. Of course the standard deviation can also be calculated by calling sqrt on the result of the var call. There are multiple ways to retrieve the number of observations of a variable or data frame. The minimum and maximum values are returned by calling min and max. As described in the text, we can call sum on the result of the call is.finite. nrow will return the number of observations in a data frame, NROW will return the number of observations of a vector or single variable.

[1] 5800
[1] 76.2
[1] 76.2
[1] 70
[1] 310
[1] 13

To return a variety of descriptive statistics on a data frame or variable we can call stargazer from the stargazer package.


================================================================
Statistic          N   Mean   St. Dev. Min Pctl(25) Pctl(75) Max
----------------------------------------------------------------
observation_number 13  7.000   3.890    1     4        10    13 
donuts_per_week    13  5.450   6.750    0    0.8       5     20 
weight             13 172.000  76.200  70    141      205    310
male               13  0.692   0.480    0     0        1      1 
----------------------------------------------------------------

Subsetting in R can be accomplished in a variety of ways. In Base R, use [] syntax. Use brackets can be used to call specific rows and columns from a matrix or data frame. To return the observation in the 12th row and 3rd column call donuts[12,3]. To return all of the observations in a specific row or column, leave the row or column number out of the call. To return all of the observations from the 3rd column call donuts[,3]. To return the observations for an individual record, say the 4th row, call donuts[4,]. To choose (subset) all of those records where, e.g,, donuts eaten per week is 0, call donuts[donuts$donuts_per_week == 0,]; to choose all those records where donuts donuts are not equal to 0, call donuts[donuts$donuts_per_week != 0,]. We can also subset using filter from dplyr. An advantage of subsetting with dplyr is that the resulting tibble can be piped into the another function call.

# A tibble: 1 x 1
  donuts_per_week
            <dbl>
1               5
# A tibble: 13 x 1
   donuts_per_week
             <dbl>
 1           14   
 2            0   
 3            0   
 4            5   
 5           20   
 6            0.75
 7            0.25
 8           16   
 9            3   
10            2   
11            0.8 
12            5   
13            4   
# A tibble: 1 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  4 Bart                5     75     1
# A tibble: 2 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  2 Marge               0    141     0
2                  3 Lisa                0     70     0
# A tibble: 11 x 5
   observation_number name              donuts_per_week weight  male
                <int> <chr>                       <dbl>  <dbl> <dbl>
 1                  1 Homer                       14       275     1
 2                  4 Bart                         5        75     1
 3                  5 Comic Book Guy              20       310     1
 4                  6 Mr. Burns                    0.75     80     1
 5                  7 Smithers                     0.25    160     1
 6                  8 Chief Wiggum                16       263     1
 7                  9 Principle Skinner            3       205     1
 8                 10 Rev. Lovejoy                 2       185     1
 9                 11 Ned Flanders                 0.8     170     1
10                 12 Patty                        5       155     0
11                 13 Selma                        4       145     0
# A tibble: 2 x 5
  observation_number name  donuts_per_week weight  male
               <int> <chr>           <dbl>  <dbl> <dbl>
1                  2 Marge               0    141     0
2                  3 Lisa                0     70     0
# A tibble: 11 x 5
   observation_number name              donuts_per_week weight  male
                <int> <chr>                       <dbl>  <dbl> <dbl>
 1                  1 Homer                       14       275     1
 2                  4 Bart                         5        75     1
 3                  5 Comic Book Guy              20       310     1
 4                  6 Mr. Burns                    0.75     80     1
 5                  7 Smithers                     0.25    160     1
 6                  8 Chief Wiggum                16       263     1
 7                  9 Principle Skinner            3       205     1
 8                 10 Rev. Lovejoy                 2       185     1
 9                 11 Ned Flanders                 0.8     170     1
10                 12 Patty                        5       155     0
11                 13 Selma                        4       145     0

We can subset on more than one variable as well. Using Base R we can choose all those males who consumed some donuts per week by calling donuts[donuts$donuts_per_week != 0 & donuts$male == 1]. We can choose all those observations where donut consumption per week is more than 15 or the person is female by using the or operator | in call donuts[donuts$donuts_per_week > 15 | donuts$male != 1,]. filter can be used as well.

# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
# A tibble: 9 x 5
  observation_number name              donuts_per_week weight  male
               <int> <chr>                       <dbl>  <dbl> <dbl>
1                  1 Homer                       14       275     1
2                  4 Bart                         5        75     1
3                  5 Comic Book Guy              20       310     1
4                  6 Mr. Burns                    0.75     80     1
5                  7 Smithers                     0.25    160     1
6                  8 Chief Wiggum                16       263     1
7                  9 Principle Skinner            3       205     1
8                 10 Rev. Lovejoy                 2       185     1
9                 11 Ned Flanders                 0.8     170     1
# A tibble: 6 x 5
  observation_number name           donuts_per_week weight  male
               <int> <chr>                    <dbl>  <dbl> <dbl>
1                  2 Marge                        0    141     0
2                  3 Lisa                         0     70     0
3                  5 Comic Book Guy              20    310     1
4                  8 Chief Wiggum                16    263     1
5                 12 Patty                        5    155     0
6                 13 Selma                        4    145     0
# A tibble: 6 x 5
  observation_number name           donuts_per_week weight  male
               <int> <chr>                    <dbl>  <dbl> <dbl>
1                  2 Marge                        0    141     0
2                  3 Lisa                         0     70     0
3                  5 Comic Book Guy              20    310     1
4                  8 Chief Wiggum                16    263     1
5                 12 Patty                        5    155     0
6                 13 Selma                        4    145     0

We can modify Figure 2.2 to include only males by modifying our original code by piping the filtered results into ggplot.