4.7 Validate With at Least One External Data Source

Making sure your data matches something outside of the dataset is very important. It allows you to ensure that the measurements are roughly in line with what they should be and it serves as a check on what other things might be wrong in your dataset. External validation can often be as simple as checking your data against a single number, as we will do here.

In the U.S. we have national ambient air quality standards, and for ozone, the current standard set in 2008 is that the “annual fourth-highest daily maximum 8-hr concentration, averaged over 3 years” should not exceed 0.075 parts per million (ppm). The exact details of how to calculate this are not important for this analysis, but roughly speaking, the 8-hour average concentration should not be too much higher than 0.075 ppm (it can be higher because of the way the standard is worded).

Let’s take a look at the hourly measurements of ozone.

> summary(ozone$Sample.Measurement)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00000 0.02000 0.03200 0.03123 0.04200 0.34900 

From the summary we can see that the maximum hourly concentration is quite high (0.349 ppm) but that in general, the bulk of the distribution is far below 0.075.

We can get a bit more detail on the distribution by looking at deciles of the data.

> quantile(ozone$Sample.Measurement, seq(0, 1, 0.1))
   0%   10%   20%   30%   40%   50%   60%   70% 
0.000 0.010 0.018 0.023 0.028 0.032 0.036 0.040 
  80%   90%  100% 
0.044 0.051 0.349 

Knowing that the national standard for ozone is something like 0.075, we can see from the data that

  • The data are at least of the right order of magnitude (i.e. the units are correct)

  • The range of the distribution is roughly what we’d expect, given the regulation around ambient pollution levels

  • Some hourly levels (less than 10%) are above 0.075 but this may be reasonable given the wording of the standard and the averaging involved.