The Art of Data Science

4.3 Read in your data

The next task in any exploratory data analysis is to read in some data. Sometimes the data will come in a very messy format and you’ll need to do some cleaning. Other times, someone else will have cleaned up that data for you so you’ll be spared the pain of having to do the cleaning.

We won’t go through the pain of cleaning up a dataset here, not because it’s not important, but rather because there’s often not much generalizable knowledge to obtain from going through it. Every dataset has its unique quirks and so for now it’s probably best to not get bogged down in the details.

Here we have a relatively clean dataset from the U.S. EPA on hourly ozone measurements in the entire U.S. for the year 2014. The data are available from the EPA’s Air Quality System web page. I’ve simply downloaded the zip file from the web site, unzipped the archive, and put the resulting file in a directory called “data”. If you want to run this code you’ll have to use the same directory structure.

The dataset is a comma-separated value (CSV) file, where each row of the file contains one hourly measurement of ozone at some location in the country.

NOTE: Running the code below may take a few minutes. There are 7,147,884 rows in the CSV file. If it takes too long, you can read in a subset by specifying a value for the n_max argument to read_csv() that is greater than 0.

> library(readr)
> ozone <- read_csv("data/hourly_44201_2014.csv", 
+                   col_types = "ccccinnccccccncnncccccc")

The readr package by Hadley Wickham is a nice package for reading in flat files (like CSV files) very fast, or at least much faster than R’s built-in functions. It makes some tradeoffs to obtain that speed, so these functions are not always appropriate, but they serve our purposes here.

The character string provided to the col_types argument specifies the class of each column in the dataset. Each letter represents the class of a column: “c” for character, “n” for numeric“, and”i" for integer. No, I didn’t magically know the classes of each column—I just looked quickly at the file to see what the column classes were. If there are too many columns, you can not specify col_types and read_csv() will try to figure it out for you.

Just as a convenience for later, we can rewrite the names of the columns to remove any spaces.

> names(ozone) <- make.names(names(ozone))