4.6 ABC: Always be Checking Your “n”s

In general, counting things is usually a good way to figure out if anything is wrong or not. In the simplest case, if you’re expecting there to be 1,000 observations and it turns out there’s only 20, you know something must have gone wrong somewhere. But there are other areas that you can check depending on your application. To do this properly, you need to identify some landmarks that can be used to check against your data. For example, if you are collecting data on people, such as in a survey or clinical trial, then you should know how many people there are in your study. That’s something you should check in your dataset, to make sure that you have data on all the people you thought you would have data on.

In this example, we will use the fact that the dataset purportedly contains hourly data for the entire country. These will be our two landmarks for comparison.

Here, we have hourly ozone data that comes from monitors across the country. The monitors should be monitoring continuously during the day, so all hours should be represented. We can take a look at the Time.Local variable to see what time measurements are recorded as being taken.

> head(table(ozone$Time.Local)) 00:00 00:01 01:00 01:02 02:00 02:03 288698 2 290871 2 283709 2  One thing we notice here is that while almost all measurements in the dataset are recorded as being taken on the hour, some are taken at slightly different times. Such a small number of readings are taken at these off times that we might not want to care. But it does seem a bit odd, so it might be worth a quick check. We can take a look at which observations were measured at time “00:01”. > library(dplyr) > filter(ozone, Time.Local == "13:14") %>% + select(State.Name, County.Name, Date.Local, + Time.Local, Sample.Measurement) # A tibble: 2 × 5 State.Name County.Name Date.Local Time.Local <chr> <chr> <chr> <chr> 1 New York Franklin 2014-09-30 13:14 2 New York Franklin 2014-09-30 13:14 # ... with 1 more variables: # Sample.Measurement <dbl> We can see that it’s a monitor in Franklin County, New York and that the measurements were taken on September 30, 2014. What if we just pulled all of the measurements taken at this monitor on this date? > filter(ozone, State.Code == "36" + & County.Code == "033" + & Date.Local == "2014-09-30") %>% + select(Date.Local, Time.Local, + Sample.Measurement) %>% + as.data.frame Date.Local Time.Local Sample.Measurement 1 2014-09-30 00:01 0.011 2 2014-09-30 01:02 0.012 3 2014-09-30 02:03 0.012 4 2014-09-30 03:04 0.011 5 2014-09-30 04:05 0.011 6 2014-09-30 05:06 0.011 7 2014-09-30 06:07 0.010 8 2014-09-30 07:08 0.010 9 2014-09-30 08:09 0.010 10 2014-09-30 09:10 0.010 11 2014-09-30 10:11 0.010 12 2014-09-30 11:12 0.012 13 2014-09-30 12:13 0.011 14 2014-09-30 13:14 0.013 15 2014-09-30 14:15 0.016 16 2014-09-30 15:16 0.017 17 2014-09-30 16:17 0.017 18 2014-09-30 17:18 0.015 19 2014-09-30 18:19 0.017 20 2014-09-30 19:20 0.014 21 2014-09-30 20:21 0.014 22 2014-09-30 21:22 0.011 23 2014-09-30 22:23 0.010 24 2014-09-30 23:24 0.010 25 2014-09-30 00:01 0.010 26 2014-09-30 01:02 0.011 27 2014-09-30 02:03 0.011 28 2014-09-30 03:04 0.010 29 2014-09-30 04:05 0.010 30 2014-09-30 05:06 0.010 31 2014-09-30 06:07 0.009 32 2014-09-30 07:08 0.008 33 2014-09-30 08:09 0.009 34 2014-09-30 09:10 0.009 35 2014-09-30 10:11 0.009 36 2014-09-30 11:12 0.011 37 2014-09-30 12:13 0.010 38 2014-09-30 13:14 0.012 39 2014-09-30 14:15 0.015 40 2014-09-30 15:16 0.016 41 2014-09-30 16:17 0.016 42 2014-09-30 17:18 0.014 43 2014-09-30 18:19 0.016 44 2014-09-30 19:20 0.013 45 2014-09-30 20:21 0.013 46 2014-09-30 21:22 0.010 47 2014-09-30 22:23 0.009 48 2014-09-30 23:24 0.009 Now we can see that this monitor just records its values at odd times, rather than on the hour. It seems, from looking at the previous output, that this is the only monitor in the country that does this, so it’s probably not something we should worry about. Because the EPA monitors pollution across the country, there should be a good representation of states. Perhaps we should see exactly how many states are represented in this dataset. > select(ozone, State.Name) %>% unique %>% nrow [1] 52 So it seems the representation is a bit too good—there are 52 states in the dataset, but only 50 states in the U.S.! We can take a look at the unique elements of the State.Name variable to see what’s going on. > unique(ozone$State.Name)
[3] "Arizona"              "Arkansas"
[7] "Connecticut"          "Delaware"
[9] "District Of Columbia" "Florida"
[11] "Georgia"              "Hawaii"
[13] "Idaho"                "Illinois"
[15] "Indiana"              "Iowa"
[17] "Kansas"               "Kentucky"
[19] "Louisiana"            "Maine"
[21] "Maryland"             "Massachusetts"
[23] "Michigan"             "Minnesota"
[25] "Mississippi"          "Missouri"
[31] "New Jersey"           "New Mexico"
[33] "New York"             "North Carolina"
[35] "North Dakota"         "Ohio"
[37] "Oklahoma"             "Oregon"
[39] "Pennsylvania"         "Rhode Island"
[41] "South Carolina"       "South Dakota"
[43] "Tennessee"            "Texas"
[45] "Utah"                 "Vermont"
[47] "Virginia"             "Washington"
[49] "West Virginia"        "Wisconsin"
[51] "Wyoming"              "Puerto Rico"         

Now we can see that Washington, D.C. (District of Columbia) and Puerto Rico are the “extra” states included in the dataset. Since they are clearly part of the U.S. (but not official states of the union) that all seems okay.

This last bit of analysis made use of something we will discuss in the next section: external data. We knew that there are only 50 states in the U.S., so seeing 52 state names was an immediate trigger that something might be off. In this case, all was well, but validating your data with an external data source can be very useful. Which brings us to….