4.6 ABC: Always be Checking Your “n”s

Counting things is usually a good way to figure out whether anything is wrong. In the simplest case, if you’re expecting 1,000 observations and it turns out there are only 20, you know something must have gone wrong somewhere. But there are other counts you can check, depending on your application. To do this properly, you need to identify some landmarks against which your data can be checked. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people are in your study. That’s something you should check in your dataset, to make sure that you have data on all the people you thought you would have data on.
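A check like this takes only a line or two of code. Here is a minimal sketch, where survey and participant_id are hypothetical names standing in for your own data and a known enrollment of 1,000:

> ## Hypothetical: 'survey' has one row per participant and we
> ## know 1,000 people were enrolled
> n_enrolled <- 1000
> stopifnot(nrow(survey) == n_enrolled)
> ## Duplicated IDs could hide missing people even if row counts match
> stopifnot(!any(duplicated(survey$participant_id)))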

In this example, we will use two claims about the dataset: that it contains hourly data, and that it covers the entire country. These will be our two landmarks for comparison.
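To make the first landmark concrete (a back-of-the-envelope figure, assuming roughly a full year of data), a monitor that reports every hour contributes about

> 24 * 365
[1] 8760

observations per year, so the total row count should be roughly 8,760 times the number of monitors.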

Here, we have hourly ozone data that comes from monitors across the country. The monitors should be recording continuously throughout the day, so every hour should be represented in the data. We can take a look at the Time.Local variable to see what times the measurements are recorded as being taken.

> head(table(ozone$Time.Local))

 00:00  00:01  01:00  01:02  02:00  02:03 
288698      2 290871      2 283709      2 

One thing we notice here is that while almost all measurements in the dataset are recorded as being taken on the hour, some are taken at slightly different times. So few readings are taken at these off times that we might be tempted to ignore them. But it does seem a bit odd, so it’s worth a quick check.

We can take a look at which observations were measured at one of these off times, for example “13:14”.

> library(dplyr)
> filter(ozone, Time.Local == "13:14") %>% 
+         select(State.Name, County.Name, Date.Local, 
+                Time.Local, Sample.Measurement)
# A tibble: 2 × 5
  State.Name County.Name Date.Local Time.Local
       <chr>       <chr>      <chr>      <chr>
1   New York    Franklin 2014-09-30      13:14
2   New York    Franklin 2014-09-30      13:14
# ... with 1 more variables:
#   Sample.Measurement <dbl>

We can see that it’s a monitor in Franklin County, New York and that the measurements were taken on September 30, 2014. What if we just pulled all of the measurements taken at this monitor on this date?

> filter(ozone, State.Code == "36" 
+        & County.Code == "033" 
+        & Date.Local == "2014-09-30") %>%
+         select(Date.Local, Time.Local, 
+                Sample.Measurement) %>% 
+         as.data.frame
   Date.Local Time.Local Sample.Measurement
1  2014-09-30      00:01              0.011
2  2014-09-30      01:02              0.012
3  2014-09-30      02:03              0.012
4  2014-09-30      03:04              0.011
5  2014-09-30      04:05              0.011
6  2014-09-30      05:06              0.011
7  2014-09-30      06:07              0.010
8  2014-09-30      07:08              0.010
9  2014-09-30      08:09              0.010
10 2014-09-30      09:10              0.010
11 2014-09-30      10:11              0.010
12 2014-09-30      11:12              0.012
13 2014-09-30      12:13              0.011
14 2014-09-30      13:14              0.013
15 2014-09-30      14:15              0.016
16 2014-09-30      15:16              0.017
17 2014-09-30      16:17              0.017
18 2014-09-30      17:18              0.015
19 2014-09-30      18:19              0.017
20 2014-09-30      19:20              0.014
21 2014-09-30      20:21              0.014
22 2014-09-30      21:22              0.011
23 2014-09-30      22:23              0.010
24 2014-09-30      23:24              0.010
25 2014-09-30      00:01              0.010
26 2014-09-30      01:02              0.011
27 2014-09-30      02:03              0.011
28 2014-09-30      03:04              0.010
29 2014-09-30      04:05              0.010
30 2014-09-30      05:06              0.010
31 2014-09-30      06:07              0.009
32 2014-09-30      07:08              0.008
33 2014-09-30      08:09              0.009
34 2014-09-30      09:10              0.009
35 2014-09-30      10:11              0.009
36 2014-09-30      11:12              0.011
37 2014-09-30      12:13              0.010
38 2014-09-30      13:14              0.012
39 2014-09-30      14:15              0.015
40 2014-09-30      15:16              0.016
41 2014-09-30      16:17              0.016
42 2014-09-30      17:18              0.014
43 2014-09-30      18:19              0.016
44 2014-09-30      19:20              0.013
45 2014-09-30      20:21              0.013
46 2014-09-30      21:22              0.010
47 2014-09-30      22:23              0.009
48 2014-09-30      23:24              0.009

Now we can see that this monitor simply records its values at odd times, rather than on the hour. It also appears, from the earlier table of counts, that this is the only monitor in the country that does this: each off time occurred exactly twice, and this site records exactly two measurements per time. So it’s probably not something we should worry about.
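If we wanted to verify that directly rather than eyeball it, a quick sketch (assuming the off times are exactly those whose minute field is not “00”) is to list every monitor location with an off-hour measurement:

> filter(ozone, substring(Time.Local, 4, 5) != "00") %>%
+         distinct(State.Name, County.Name)

If the conclusion above is correct, this should return a single row, for Franklin County, New York.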

Because the EPA monitors pollution across the country, every state should be represented in the data. We should check exactly how many states appear in this dataset.

> select(ozone, State.Name) %>% unique %>% nrow
[1] 52

So it seems the representation is a bit too good—there are 52 states in the dataset, but only 50 states in the U.S.!

We can take a look at the unique elements of the State.Name variable to see what’s going on.

> unique(ozone$State.Name)
 [1] "Alabama"              "Alaska"              
 [3] "Arizona"              "Arkansas"            
 [5] "California"           "Colorado"            
 [7] "Connecticut"          "Delaware"            
 [9] "District Of Columbia" "Florida"             
[11] "Georgia"              "Hawaii"              
[13] "Idaho"                "Illinois"            
[15] "Indiana"              "Iowa"                
[17] "Kansas"               "Kentucky"            
[19] "Louisiana"            "Maine"               
[21] "Maryland"             "Massachusetts"       
[23] "Michigan"             "Minnesota"           
[25] "Mississippi"          "Missouri"            
[27] "Montana"              "Nebraska"            
[29] "Nevada"               "New Hampshire"       
[31] "New Jersey"           "New Mexico"          
[33] "New York"             "North Carolina"      
[35] "North Dakota"         "Ohio"                
[37] "Oklahoma"             "Oregon"              
[39] "Pennsylvania"         "Rhode Island"        
[41] "South Carolina"       "South Dakota"        
[43] "Tennessee"            "Texas"               
[45] "Utah"                 "Vermont"             
[47] "Virginia"             "Washington"          
[49] "West Virginia"        "Wisconsin"           
[51] "Wyoming"              "Puerto Rico"         

Now we can see that Washington, D.C. (District of Columbia) and Puerto Rico are the “extra” states included in the dataset. Since they are clearly part of the U.S. (though not official states of the union), that all seems okay.
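Checks like this can even be automated against a reference list. For example, R ships with the 50 official state names in the built-in state.name vector, so a one-line sketch of this comparison is:

> setdiff(unique(ozone$State.Name), state.name)
[1] "District Of Columbia" "Puerto Rico"

Anything returned here is a name in the data that is not one of the 50 states, which immediately surfaces the two extras.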

This last bit of analysis made use of something we will discuss in the next section: external data. We knew that there are only 50 states in the U.S., so seeing 52 state names was an immediate trigger that something might be off. In this case, all was well, but validating your data with an external data source can be very useful. Which brings us to….