Chapter 7 Data extraction

  • Reproducible extraction of data from its source location may be complicated by access protocols:

    • access tokens and APIs
    • raw data from GitHub for private repos
    • databases
    • HTTP requests from R, e.g. with the httr package
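
As one illustration of the items above, here is a minimal sketch of pulling a raw file from a private GitHub repository with httr. The helper name `fetch_github_raw`, the environment variable `GITHUB_PAT`, and the owner/repo/path values in the usage comment are assumptions made for illustration, not part of the original notes. The point is that the token is read from the environment, so it never appears in the script itself.

```r
# Hypothetical helper: download one file from a private GitHub repository
# through the GitHub contents API, authenticating with a personal access token.
fetch_github_raw <- function(owner, repo, path,
                             token = Sys.getenv("GITHUB_PAT")) {
  # Fail early, with a clear message, if no token is available.
  if (!nzchar(token)) {
    stop("No token found: set GITHUB_PAT (for example in ~/.Renviron).")
  }
  url <- sprintf("https://api.github.com/repos/%s/%s/contents/%s",
                 owner, repo, path)
  resp <- httr::GET(
    url,
    httr::add_headers(
      Authorization = paste("token", token),
      # Ask for the raw file body rather than JSON metadata.
      Accept = "application/vnd.github.v3.raw"
    )
  )
  httr::stop_for_status(resp)  # turn HTTP errors into R errors
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Usage (placeholder names, not a real repository):
# csv_text <- fetch_github_raw("my-org", "my-private-repo", "data/raw.csv")
# writeLines(csv_text, "data/raw.csv")
```

Anyone with their own token and access to the repository can rerun the same code unchanged.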

Make your extraction code “as reproducible as possible”, subject to these access constraints. At minimum, document clearly how you obtained the data, so that others can follow your path even if they cannot do so via pure code.
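
One way to document how the data were obtained is to write a small provenance record next to each raw file. The sketch below is only a suggestion: the helper name `record_provenance`, the `.provenance.csv` suffix, and the example file are all hypothetical.

```r
# Hypothetical helper: save where a file came from, when it was retrieved,
# and a checksum so later readers can tell whether the file has changed.
record_provenance <- function(data_path, source_url, notes = "") {
  stopifnot(file.exists(data_path))
  record <- data.frame(
    file         = basename(data_path),
    source_url   = source_url,
    retrieved_at = format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z"),
    md5          = unname(tools::md5sum(data_path)),
    notes        = notes,
    stringsAsFactors = FALSE
  )
  out_path <- paste0(data_path, ".provenance.csv")
  write.csv(record, out_path, row.names = FALSE)
  invisible(record)
}

# Usage with a throwaway file (synthetic contents, for illustration only):
demo_path <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10"), demo_path)
prov <- record_provenance(
  demo_path,
  source_url = "https://www.eia.gov/petroleum/supply/weekly/",
  notes = "Downloaded manually from the web page."
)
```

The `notes` field is the place to describe any manual steps, such as logging in or clicking a download button, that code alone cannot reproduce.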

Reminder: Keep your raw data in read-only mode. Don’t edit these files. Instead, write code that transforms the raw data into the form you will use for analysis.
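
A minimal sketch of this workflow, assuming a Unix-like file system (on Windows, `Sys.chmod()` has more limited effect). The directory names and the tiny data set are invented for illustration.

```r
# Hypothetical layout: raw files in one directory, derived files in another.
raw_dir     <- file.path(tempdir(), "data-raw")
derived_dir <- file.path(tempdir(), "data")
dir.create(raw_dir, showWarnings = FALSE)
dir.create(derived_dir, showWarnings = FALSE)

# Stand-in for a downloaded raw file (synthetic contents).
raw_file <- file.path(raw_dir, "example.csv")
writeLines(c("id,value", "1,10", "2,20"), raw_file)

# Remove write permission so the raw file cannot be edited by accident.
Sys.chmod(raw_file, mode = "0444")

# All changes happen in code, and the result goes to a separate file.
raw   <- read.csv(raw_file)
clean <- transform(raw, value_scaled = value / 10)
clean_file <- file.path(derived_dir, "example_clean.csv")
write.csv(clean, clean_file, row.names = FALSE)
```

The raw file stays exactly as it was obtained, and the derived file can be regenerated at any time by rerunning the script.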

---- Forwarded Message -----
From: GitHub <noreply@github.com>
To: Arthur Small <asmall@virginia.edu>
Sent: Sunday, February 21, 2021, 6:20:58 AM EST
Subject: [GitHub] Deprecation Notice

Hi @arthursmalliii,

You recently used a password to access the repository at uva-eng-time-series-sp21/coronato-nicholas with git using git/2.30.0.

Basic authentication using a password to Git is deprecated and will soon no longer work. Visit https://github.blog/2020-12-15-token-authentication-requirements-for-git-operations/ for more information around suggested workarounds and removal dates.

Thanks,
The GitHub Team

7.1 Reading in data from source files

  • Declaring data types as you read in data.

7.1.1 Example using readr::read_csv

# Data Source: https://www.eia.gov/petroleum/supply/weekly/

# Original code: read everything in, then repair the header by hand
PetroStocksData <- readr::read_csv("data/PetroStocks.csv")
names(PetroStocksData) <- PetroStocksData[2,] # use the second data row as the header
PetroStocksData <- PetroStocksData[-1,] # drop the first leftover row
PetroStocksData <- PetroStocksData[-1,] # drop the next leftover row (the header row itself)

### Revised: use readr::read_csv() to skip the two leading non-data lines and declare data types.
### Declaring column types within the read_csv() call saves trouble later:
ps_tbl <- readr::read_csv(
  "data/PetroStocks.csv",
  skip = 2,
  col_types = readr::cols(
    .default = readr::col_double(),
    Date     = readr::col_character()
  )
)

### Reference: https://readr.tidyverse.org/reference/read_delim.html
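
Since `data/PetroStocks.csv` is not reproduced here, the following self-contained sketch applies the same pattern to a small synthetic file with the assumed layout (two non-data lines, then the header). The column names and values are invented, and the date format of the real file may differ. That is one reason to read `Date` as character first and parse it in a separate step. Here the format is known, so the sketch declares it as a date directly.

```r
# Synthetic stand-in for the real file: two non-data lines, then the header.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Weekly report (title line)",
  "Units: thousand barrels",
  "Date,Crude,Gasoline",
  "2021-01-01,485.5,241.1",
  "2021-01-08,482.2,245.5"
), tmp)

demo_tbl <- readr::read_csv(
  tmp,
  skip = 2,                                  # skip the two non-data lines
  col_types = readr::cols(
    .default = readr::col_double(),          # every other column is numeric
    Date     = readr::col_date(format = "%Y-%m-%d")
  )
)

# List any values that did not match the declared types.
readr::problems(demo_tbl)
```

Declaring types up front means that a value which does not fit its declared type is reported by `problems()`, rather than silently changing the type that would otherwise be guessed for the column.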