Chapter 7 Data acquisition and extraction

Readings:

TSDS, Chapter 2

7.1 Access protocols and permissions

  • Reproducible extraction of data from its source location may be complicated by access protocols and permissions. Common cases:

    • access tokens and APIs
    • raw data from private GitHub repositories
    • databases
    • the httr package, for accessing data from websites
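
The access-token case above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the token is stored in an environment variable named GITHUB_PAT, and the function name fetch_private_csv and its arguments are hypothetical. It requests the file through the GitHub REST "contents" endpoint with the raw media type, using httr.

```r
# Sketch: download a CSV from a private GitHub repository with httr.
# Assumptions (not from the notes): the token lives in the GITHUB_PAT
# environment variable; fetch_private_csv() and its arguments are hypothetical.
fetch_private_csv <- function(owner, repo, path,
                              token = Sys.getenv("GITHUB_PAT")) {
  # Fail early, before any network call, if no token is available
  if (!nzchar(token)) {
    stop("No access token found; set the GITHUB_PAT environment variable.")
  }
  url <- sprintf("https://api.github.com/repos/%s/%s/contents/%s",
                 owner, repo, path)
  resp <- httr::GET(
    url,
    httr::add_headers(
      Authorization = paste("Bearer", token),
      Accept = "application/vnd.github.raw+json"  # ask for the raw file body
    )
  )
  httr::stop_for_status(resp)  # turn HTTP errors (401, 404, ...) into R errors
  txt <- httr::content(resp, as = "text", encoding = "UTF-8")
  read.csv(text = txt, stringsAsFactors = FALSE)
}
```

Keeping the token in an environment variable (for example in ~/.Renviron) rather than in the script means the extraction code can be shared without exposing credentials. A call would look like fetch_private_csv("some-owner", "some-repo", "data/raw.csv"), where all three values are placeholders.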

7.2 Accessing databases

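library(DBI)    # dbGetQuery(), dbDisconnect(), dbUnloadDriver()
library(tibble) # as_tibble()
# Reference: https://arrow.apache.org/docs/r/
# if (!requireNamespace('arrow', quietly = TRUE)) install.packages('arrow')
library(arrow)  # write_feather()

# NOTE: `db` (an open DBI connection) and `db_driver` (its driver object) are assumed to have been created earlier.
esales <- dbGetQuery(db, 'SELECT * FROM eia_elec_sales_va_all_m') # SQL code to retrieve data from a table in the remote database
# str(esales)
esales <- as_tibble(esales) # Convert the data frame to a 'tibble' for tidyverse work
# str(esales)
# Save a local copy of the extracted table in Feather format
write_feather(esales, "esales.feather")
# Close connection -- this is good practice
dbDisconnect(db)
dbUnloadDriver(db_driver)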
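
The code above uses db and db_driver without showing how they are created, since that depends on the database and its credentials. As a self-contained illustration of the same connect, query, disconnect pattern, the sketch below uses an in-memory SQLite database (via the RSQLite package) as a stand-in for the remote server. The table name toy_sales and its values are made up for the example.

```r
library(DBI)

# Stand-in for the remote database: an in-memory SQLite database.
# (Assumption: RSQLite is installed; a real remote database would need its
# own driver, host, and credentials in dbConnect().)
db <- dbConnect(RSQLite::SQLite(), ":memory:")

# Toy table with made-up values, only so that there is something to query
dbWriteTable(db, "toy_sales",
             data.frame(year = c(2020L, 2021L), sales = c(1.5, 2.5)))

# Same pattern as above: send SQL, get a data frame back
toy <- dbGetQuery(db, "SELECT * FROM toy_sales WHERE year >= 2021")

# Close the connection when done
dbDisconnect(db)
```

Filtering in the SQL query (the WHERE clause) rather than in R keeps the transfer small, which matters more for a remote database than for this toy one.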

7.3 Other comments

Make your extraction code “as reproducible as possible”, subject to these access constraints. At a minimum, document clearly how you obtained the data so that others can follow the same path, even if some steps cannot be done in code.
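
One way to document a manual step is to store a small provenance record next to the data file. The sketch below is only a suggestion: the function name record_provenance, the ".provenance.txt" suffix, and the fields recorded are all hypothetical choices. It uses tools::md5sum() from base R so that others can check they have the same file.

```r
# Sketch: write a small provenance record next to a raw data file.
# record_provenance() and the ".provenance.txt" suffix are hypothetical choices.
record_provenance <- function(data_path, source, how) {
  stopifnot(file.exists(data_path))
  lines <- c(
    paste("file:", basename(data_path)),
    paste("source:", source),
    paste("how_obtained:", how),
    paste("retrieved:", format(Sys.time(), "%Y-%m-%d %H:%M:%S %Z")),
    paste("md5:", unname(tools::md5sum(data_path)))
  )
  out <- paste0(data_path, ".provenance.txt")
  writeLines(lines, out)
  invisible(out)
}

# Usage with a throwaway file standing in for a manually downloaded dataset
raw_file <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,2"), raw_file)
prov_file <- record_provenance(
  raw_file,
  source = "agency web portal (placeholder description)",
  how = "manual download after login; no API available"
)
```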

Keep your raw data in read-only mode. Don’t edit these files.
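
Read-only mode can be set from R itself, so that the step is part of the scripted workflow. The following is a minimal sketch using base R's Sys.chmod(); the directory and file names are placeholders, and the permission bits shown apply to Unix-alike systems (on Windows, Sys.chmod() has more limited effect).

```r
# Sketch: mark a raw data file as read-only (mode 0444 on Unix-alikes).
# The temporary directory and file name below are placeholders.
raw_dir <- tempfile("data-raw-")
dir.create(raw_dir)
raw_path <- file.path(raw_dir, "example_raw.csv")
writeLines(c("id,value", "1,10"), raw_path)

# Remove write permission for owner, group, and others
Sys.chmod(raw_path, mode = "0444", use_umask = FALSE)

# Inspect the resulting permission bits
raw_mode <- format(file.info(raw_path)$mode)
```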

Write code to transform the raw data into the form you will use for analysis. Don’t do it manually.
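
As an illustration of a scripted transformation, the sketch below reads a raw file, cleans and aggregates it, and writes the result to a separate file, leaving the raw file untouched. It uses base R only; the file locations, column names, and values are toy placeholders, not real data.

```r
# Sketch: a scripted raw -> analysis transformation in base R.
# File names, column names, and values are toy placeholders.
raw_path <- tempfile(fileext = ".csv")
writeLines(c("date,sector,sales_mwh",
             "2021-01-01,residential,100",
             "2021-01-01,commercial,80",
             "2021-02-01,residential,90",
             "2021-02-01,commercial,NA"), raw_path)

# 1. Read the raw file without modifying it
raw <- read.csv(raw_path, stringsAsFactors = FALSE)

# 2. Transform: parse dates, drop missing values, total sales per month
raw$date <- as.Date(raw$date)
clean <- raw[!is.na(raw$sales_mwh), ]
monthly <- aggregate(sales_mwh ~ date, data = clean, FUN = sum)

# 3. Write the derived data to a separate file; the raw file is left untouched
derived_path <- tempfile(fileext = ".csv")
write.csv(monthly, derived_path, row.names = FALSE)
```

Because every step is in code, rerunning the script on an updated raw file reproduces the analysis dataset exactly, and the decisions made (such as dropping missing values) are visible to a reader.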