Chapter 19 API Data Queries

library(tidyverse)
library(censusapi)
library(jsonlite) 
library(tidycensus)

CHAPTER IS STILL BEING EDITED

19.1 Introduction

Dr. Sonderegger’s Video Companion: Video Lecture.

With a standard database connection, there is quite a lot we can do. For example we could insert incorrect rows into tables, or even delete whole tables. Many organizations that deliver data to clients require a way to minimize the types of data base actions that are allowed. For example, consider Twitter. Twitter clients need to connect to the Twitter database, sign in, and download the latest tweets from whomever they follow and accept a database input that adds a tweet from the signed in user. However, the client must not be able to update or insert tweets from somebody else, and to prevent Denial-Of-Service attacks, there should be some limit to the number of rows of information that we ask for. Furthermore, the client shouldn’t have to remember the details of how the data is stored and changes to the database configuration should be completely invisible to clients.

Application Program Interfaces (APIs) are the specification for how two programs will interface. An API that is well thought out and documented is wonderful to use. In a data query situation, the API will define how we submit a query and the manner in which the result will be returned.

As the internet has become more sophisticated and companies have begun to understand the economics and risks associated with releasing their data, APIs have generally become more restrictive and Tom Scott has a great video about this change, titled “This Video Has XXX,XXX,XXX Views”.

The US Census Bureau has a really nice Web Page interface to their data and arguable, these is the easiest way to get data from the Census Bureau.

The tool we’ll be using next is still in beta version, so it might change, but using the search function at https://data.census.gov/cedsci/ we can search for whatever we want. For example we might be interested in the percent of residents that have health insurance and so we’ll search for ‘health insurance’. Several tables show up and we can look through all of the results for a table that gives us what we’d like. Note the table name!

From there we could customize the table and download it.

19.2 Census Bureau API

The US Census Bureau’s API interface works by having users visit websites with extremely carfully craften URL strings. The following web links will cause a query on the Census web site, and then result in some data. Go ahead and click on these!

http://api.census.gov/data/2018/pep/population?get=DATE_CODE,DATE_DESC,DENSITY,POP,GEONAME,STATE&for=state:*&DATE_CODE=1

http://api.census.gov/data/2018/pep/population?get=DATE_CODE,DATE_DESC,DENSITY,POP,GEONAME,STATE&in=state:01&for=county:*&DATE_CODE=1

  1. The base website is http://api.census.gov/data/2018/pep/population. This is effectively specifying which table we want to query from. The pep part stands for the Population Estimation Program, which is one division of the Census Bureau. The 2018 part of the base address defines the vintage of the estimate. This page will produce estimates for the years 2010-2018, but the Census Bureau is constantly updating those estimates based on new information. So the this is specifying that we are to use the Census’ 2018 estimate of the population.
  2. Modifiers are included after the ? and different modifiers are separated by &
  3. get= section defines the variables that you want
  4. The for=state:* denotes that we want all of the states. for=state:01 would have been just Alabama. If we want all the county populations we can use for=county:*. If we just want county populations within a particular state, we would use in=state:01&for=county:*
  5. The DATE_CODE=1 indicates that I just want the first estimate in the decadal time series of estimates. If I didn’t include this, we’d end up with estimates for each year between 2010 and 2018.

When you go to this website, the database will read the modifier list, do the appropriate database query, and return the result via a webpage that has a very simple structure that is easy to parse into a table.

The hard part about Web APIs is understanding which tables are available and what each covariate means. For the US Census Bureau, the developers page is a great place to start.

19.3 Package censusapi

While it is helpful to understand how the web API works, it would be nice to not have to worry about some of the fiddly aspects of parsing the result into a data frame. There are many R packages that provide a convenient interface to some database API. For our US Census Bureau example, we’ll use the R package censusapi. You should read the documentation as well. It looks like there is another package, tidycensus that might be even better.

The Census Bureau wants to identify which developers are accessing their data and you are required to sign up for a Census Key. It is easy to give them your email and they’ll email you a character string that you’ll use for the rest of these examples.

# I got a Census API key from https://api.census.gov/data/key_signup.html 
# and saved it as Census_API_Key in my .Rprofile file in my home directory. 
# source('~/.Rprofile')  
# CENSUS_API_KEY = Sys.getenv('census_api_key')
# This query is the example query first given in the censusapi vignette.
censusapi::getCensus(name = "timeseries/healthins/sahie",
    vars = c("NAME", "IPRCAT", "IPR_DESC", "PCTUI_PT"),      # Define the gets=
    region = "state:01",                                     # Define the for=
    time = 2017,                                         
    key = CENSUS_API_KEY)
##   time state    NAME IPRCAT                IPR_DESC PCTUI_PT
## 1 2017    01 Alabama      0             All Incomes     11.0
## 2 2017    01 Alabama      1      <= 200% of Poverty     18.3
## 3 2017    01 Alabama      2      <= 250% of Poverty     17.3
## 4 2017    01 Alabama      3      <= 138% of Poverty     19.4
## 5 2017    01 Alabama      4      <= 400% of Poverty     14.5
## 6 2017    01 Alabama      5 138% to 400% of Poverty     11.5

This is now super easy to query the Census database, except that I have NO IDEA what API names (ie tables) are available and I have no clue what variables I just downloaded. We need to get a better sense of what data sets are available.

A good place to start is the developer datasets. In particular I’m interested in both county are municipality level population estimates over time as well as information from the American Community Survey (ACS).

19.3.1 Population Estimates

The Census Bureau’s Population Estimation Program (PEP) is responsible for population estimates. On the Census Population API page, it looks like I need to use the pep/population tables.

# Code to grab county level population levels.
County_Populations <- getCensus(name = "pep/population",
  vars = c('STATE','COUNTY','GEONAME','DATE_CODE','DATE_DESC','POP'),
  vintage = '2018',
  regionin = 'state:04',  # Just Arizona, which is coded as 04. I don't know why...
  region = 'county:*',    # All the counties
  DATE_CODE=1,            # 2010, Leave this out to get each year 2010-2018
  key = CENSUS_API_KEY)   # If key is missing, it will look in System Environment 

County_Populations %>% head(6)  
##   state county STATE COUNTY                  GEONAME DATE_CODE
## 1    04    001    04    001   Apache County, Arizona         1
## 2    04    003    04    003  Cochise County, Arizona         1
## 3    04    005    04    005 Coconino County, Arizona         1
## 4    04    007    04    007     Gila County, Arizona         1
## 5    04    009    04    009   Graham County, Arizona         1
## 6    04    011    04    011 Greenlee County, Arizona         1
##                    DATE_DESC    POP DATE_CODE_1
## 1 4/1/2010 Census population  71518           1
## 2 4/1/2010 Census population 131346           1
## 3 4/1/2010 Census population 134421           1
## 4 4/1/2010 Census population  53597           1
## 5 4/1/2010 Census population  37220           1
## 6 4/1/2010 Census population   8437           1

I was looking for population divided up by Age and Sex and it took awhile to figure out that I want to use PEP’s Demographic Characteristics Estimates by Age Groups tables pep/charagegroups. From there I looked at some of the examples and variables.

County_Populations_by_AgeGender <- getCensus(
  name = "pep/charagegroups",
  vars = c('GEONAME','DATE_CODE','AGEGROUP','SEX','DATE_DESC','POP'),
  vintage = '2018',
  region = 'state:04', 
    key = CENSUS_API_KEY)

County_Populations_by_AgeGender %>% head(6)
##   state GEONAME DATE_CODE AGEGROUP SEX                  DATE_DESC     POP
## 1    04 Arizona         1        0   0 4/1/2010 Census population 6392017
## 2    04 Arizona         1        0   1 4/1/2010 Census population 3175823
## 3    04 Arizona         1        0   2 4/1/2010 Census population 3216194
## 4    04 Arizona         1        1   0 4/1/2010 Census population  455715
## 5    04 Arizona         1        1   1 4/1/2010 Census population  232562
## 6    04 Arizona         1        1   2 4/1/2010 Census population  223153

As I played around with it, it seems that I can grab Race and Sex information as well. But unfortunately the categories are numerically coded so somehow we have to figure out which are which. It looks like SEX=0 is both but I have no idea which is men and which is women. Still looking at the Census Population API page and following the link about the variables for demographic characteristics, we can click through each variable to see the .json file that defines the factor levels.

We can also import those into R directly

# What variables are Available?
censusapi::listCensusMetadata(name='pep/charagegroups', vintage=2018)
##          name                                      label
## 1         for               Census API FIPS 'for' clause
## 2          in                Census API FIPS 'in' clause
## 3       ucgid Uniform Census Geography Identifier clause
## 4   DATE_CODE                                       Date
## 5      SUMLEV                              Summary Level
## 6         SEX                                        Sex
## 7       STATE                            State FIPS code
## 8     GEONAME          All geo names seperated by commas
## 9    DIVISION                       Census Division Code
## 10     REGION                       Census Regional Code
## 11     GEO_ID                 Geographic identifier code
## 12        POP                                 Population
## 13 LASTUPDATE                                Last Update
## 14     NATION                                     Nation
## 15   AGEGROUP                                  Age Group
## 16       RACE                                       Race
## 17  DATE_DESC                 Description of DATE values
## 18   UNIVERSE                                   Universe
## 19     COUNTY                           County FIPS code
## 20       HISP                            Hispanic Origin
##                               concept predicateType group limit predicateOnly
## 1  Census API Geography Specification      fips-for   N/A     0          TRUE
## 2  Census API Geography Specification       fips-in   N/A     0          TRUE
## 3  Census API Geography Specification         ucgid   N/A     0          TRUE
## 4                                <NA>           int   N/A     0          <NA>
## 5                                <NA>        string   N/A     0          <NA>
## 6                                <NA>           int   N/A     0          <NA>
## 7                                <NA>          <NA>   N/A     0          <NA>
## 8                                <NA>          <NA>   N/A     0          <NA>
## 9                                <NA>           int   N/A     0          <NA>
## 10                               <NA>           int   N/A     0          <NA>
## 11                               <NA>        string   N/A     0          <NA>
## 12                               <NA>          <NA>   N/A     0          <NA>
## 13                               <NA>        string   N/A     0          <NA>
## 14                               <NA>          <NA>   N/A     0          <NA>
## 15                               <NA>           int   N/A     0          <NA>
## 16                               <NA>           int   N/A     0          <NA>
## 17                               <NA>        string   N/A     0          <NA>
## 18                               <NA>        string   N/A     0          <NA>
## 19                               <NA>          <NA>   N/A     0          <NA>
## 20                               <NA>           int   N/A     0          <NA>
##             required
## 1               <NA>
## 2               <NA>
## 3               <NA>
## 4  default displayed
## 5               <NA>
## 6  default displayed
## 7               <NA>
## 8               <NA>
## 9               <NA>
## 10              <NA>
## 11              <NA>
## 12              <NA>
## 13              <NA>
## 14              <NA>
## 15 default displayed
## 16 default displayed
## 17              <NA>
## 18              <NA>
## 19              <NA>
## 20 default displayed
# I can't figure out how to make the censusapi package import the factor levels. 
# So I'll just import the levels directly from the webpage .json file and then 
# clean it up into a nice data frame.
CensusFactorLevels <- function(name, vintage, variable){
  file <- str_c('https://api.census.gov/data/',vintage,'/',name,
                '/variables/',variable,'.json')
  Meta <- jsonlite::read_json(file) %>% 
    .[['values']] %>% .[['item']] %>% 
    unlist() %>% tibble::enframe() 
  colnames(Meta) <- c(variable, str_c(variable,'_DESC'))
  return(Meta)
}

CensusFactorLevels('pep/charagegroups', 2018, 'SEX') 
## # A tibble: 3 × 2
##   SEX   SEX_DESC  
##   <chr> <chr>     
## 1 0     Both Sexes
## 2 1     Male      
## 3 2     Female
CensusFactorLevels('pep/charagegroups', 2018, 'RACE') 
## # A tibble: 12 × 2
##    RACE  RACE_DESC                                                         
##    <chr> <chr>                                                             
##  1 11    Native Hawaiian and Other Pacific Islander alone or in combination
##  2 10    Asian alone or in combination                                     
##  3 9     American Indian and Alaska Native alone or in combination         
##  4 8     Black alone or in combination                                     
##  5 7     White alone or in combination                                     
##  6 6     Two or more races                                                 
##  7 5     Native Hawaiian and Other Pacific Islander alone                  
##  8 4     Asian alone                                                       
##  9 3     American Indian and Alaska Native alone                           
## 10 2     Black alone                                                       
## 11 1     White alone                                                       
## 12 0     All races
CensusFactorLevels('pep/charagegroups', 2018, 'AGEGROUP') 
## # A tibble: 32 × 2
##    AGEGROUP AGEGROUP_DESC    
##    <chr>    <chr>            
##  1 31       Median age       
##  2 30       15 to 44 years   
##  3 29       18 years and over
##  4 28       16 years and over
##  5 27       85 years and over
##  6 26       65 years and over
##  7 25       45 to 64 years   
##  8 24       25 to 44 years   
##  9 23       18 to 24 years   
## 10 22       18 to 64 years   
## # ℹ 22 more rows

Using these factor levels, we can add the description onto our county populations by AGE and SEX by simply doing some table joins.

County_Populations_by_AgeGender %>%
  left_join(CensusFactorLevels('pep/charagegroups', 2018, 'SEX')) %>%
  left_join(CensusFactorLevels('pep/charagegroups', 2018, 'AGEGROUP')) %>%
  head(6)
## Joining with `by = join_by(SEX)`
## Joining with `by = join_by(AGEGROUP)`
##   state GEONAME DATE_CODE AGEGROUP SEX                  DATE_DESC     POP
## 1    04 Arizona         1        0   0 4/1/2010 Census population 6392017
## 2    04 Arizona         1        0   1 4/1/2010 Census population 3175823
## 3    04 Arizona         1        0   2 4/1/2010 Census population 3216194
## 4    04 Arizona         1        1   0 4/1/2010 Census population  455715
## 5    04 Arizona         1        1   1 4/1/2010 Census population  232562
## 6    04 Arizona         1        1   2 4/1/2010 Census population  223153
##     SEX_DESC    AGEGROUP_DESC
## 1 Both Sexes         All Ages
## 2       Male         All Ages
## 3     Female         All Ages
## 4 Both Sexes Age 0 to 4 years
## 5       Male Age 0 to 4 years
## 6     Female Age 0 to 4 years

19.4 Package tidycensus

The tidycensus package is a little easier to work with. By default, tidycensus functions will look for the API Key in the System Environment.

# The tidycensus package has a nice way of installing the 
# API key in your .Rprofile file. 
tidycensus::census_api_key('adoiYOURadsKEYmcvGOESsdljHERE', install=TRUE)
## Your original .Renviron will be backed up and stored in your R HOME directory if needed.
## Your API key has been stored in your .Renviron and can be accessed by Sys.getenv("CENSUS_API_KEY"). 
## To use now, restart R or run `readRenviron("~/.Renviron")`
## [1] "2b7acec9009601032e6605909c4901d632f541d6"

There are three major functions that will get used. The

Function Description
get_estimates() Gives information from the Population Estimates Program.
get_acs() Gives information from the American Community Survey
load_variables() Gives the mapping between the variable code and the description

For our first example, we’ll grab the county population totals in Arizona broken down by sex and ethnicity.

AZ_County_Populations <- tidycensus::get_estimates(
  geography = "county",
  state=c('AZ'),  # leave this out for all state/county combinations
  product = "characteristics",
  breakdown = c('SEX','HISP'),
  breakdown_labels = TRUE)      # Give the variable labels, not the Census Variable code 
## Warning: For post-2020 Census estimates, `get_estimates()` now uses the `vintage` argument to specify the PEP vintage, and the `year` argument to isolate a year within that vintage.
## ! This may be a breaking change in your code
## ! Omitting `vintage` may lead to incorrect or unexpected results.
## Using the Vintage 2022 Population Estimates
head(AZ_County_Populations)
## # A tibble: 6 × 6
##   GEOID NAME                    year SEX    HISP                  value
##   <chr> <chr>                  <int> <chr>  <chr>                 <dbl>
## 1 04001 Apache County, Arizona  2022 Male   Both Hispanic Origins 32405
## 2 04001 Apache County, Arizona  2022 Male   Non-Hispanic          29942
## 3 04001 Apache County, Arizona  2022 Male   Hispanic               2463
## 4 04001 Apache County, Arizona  2022 Female Both Hispanic Origins 33027
## 5 04001 Apache County, Arizona  2022 Female Non-Hispanic          30762
## 6 04001 Apache County, Arizona  2022 Female Hispanic               2265

For our next example, we’ll grab the number of people with or without Insurance in each state.

# We want to get the Number of people with or without Insurance in each state
US_State_Health_Insurance <- 
  tidycensus::get_acs(
    table = 'B27003',        # Found the table from Census Website
    geography = "state",
    cache_table=TRUE)        # On multiple runs, don't ask Census Bureau over and over.
## Getting data from the 2018-2022 5-year ACS
## Loading ACS5 variables for 2022 from table B27003 and caching the dataset for faster future access.
# Grab the Variable Names so we can convert from the variable codes
# to variable labels, i.e., something I can understand by reading.  
ACS_Variables <- tidycensus::load_variables(2018, 'acs1', cache = TRUE)
head(ACS_Variables)
## # A tibble: 6 × 3
##   name        label                               
##   <chr>       <chr>                               
## 1 B00001_001  Estimate!!Total                     
## 2 B00002_001  Estimate!!Total                     
## 3 B01001A_001 Estimate!!Total                     
## 4 B01001A_002 Estimate!!Total!!Male               
## 5 B01001A_003 Estimate!!Total!!Male!!Under 5 years
## 6 B01001A_004 Estimate!!Total!!Male!!5 to 9 years 
##   concept                                  
##   <chr>                                    
## 1 UNWEIGHTED SAMPLE COUNT OF THE POPULATION
## 2 UNWEIGHTED SAMPLE HOUSING UNITS          
## 3 SEX BY AGE (WHITE ALONE)                 
## 4 SEX BY AGE (WHITE ALONE)                 
## 5 SEX BY AGE (WHITE ALONE)                 
## 6 SEX BY AGE (WHITE ALONE)
# Join the data with the readable variable names
US_State_Health_Insurance <- US_State_Health_Insurance %>%
  left_join( ACS_Variables, by=c('variable' = 'name') )

# Now for some cleaning
US_State_Health_Insurance %>%
  tidyr::separate(label, c('Estimate','Total','Gender','Age','Insurance'), sep ='!!') %>%
  tidyr::drop_na() %>%
  select(NAME, Gender, Age, Insurance, estimate, moe)
## Warning: Expected 5 pieces. Missing pieces filled with `NA` in 1092 rows [1,
## 2, 3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 31, 34, 37, 40, 43, 46, 49,
## 52, ...].
## # A tibble: 1,872 × 6
##    NAME    Gender Age            Insurance            estimate   moe
##    <chr>   <chr>  <chr>          <chr>                   <dbl> <dbl>
##  1 Alabama Male   Under 6 years  With public coverage    89886  1949
##  2 Alabama Male   Under 6 years  No public coverage      89396  1747
##  3 Alabama Male   6 to 18 years  With public coverage   179616  3262
##  4 Alabama Male   6 to 18 years  No public coverage     243764  3866
##  5 Alabama Male   19 to 25 years With public coverage    19864  1391
##  6 Alabama Male   19 to 25 years No public coverage     217562  2455
##  7 Alabama Male   26 to 34 years With public coverage    29513  1467
##  8 Alabama Male   26 to 34 years No public coverage     244350  2361
##  9 Alabama Male   35 to 44 years With public coverage    36926  1493
## 10 Alabama Male   35 to 44 years No public coverage     250553  2134
## # ℹ 1,862 more rows

19.5 Exercises

  1. The Census Bureau is constantly running many different surveys and compiling the results. One of the most comprehensive and interesting of these is the American Community Survey. I’m interested in using the ACS to get information about educational attainment. Use the Census Bureau’s Data search page, find a table that gives you information about educational attainment. With this table information, use the tidycensus package to download the latest information about educational attainment. Create map or graph summarizing educational attainment in either counties in Arizona or across states within the US. Or if you are feeling ambitious, create a graph or graphs summarizing this across all counties in the US. Perhaps you could break the educational level into high school, some college, bachelors, advanced degree. Then maybe make a map of counties/states colored by percent with BS or higher. Or maybe stacked barcharts ordered by percent BS or higher.

  2. Pick some API to investigate how to use. Utilizing your interests, pick an API and figure out how to use it. Using the API, download some data and produce an interesting graphic. Many government agencies have data API as well. For example, Centers for Disease Control mortality and disease information at the county level, the National Oceanic and Atmospheric Administration has weather data accessible. If you are a genomics person, the R interface to Kegg would be a fun choice. Many social media apps such as Twitter, Reddit, and Facebook have APIs. Many municipalities are starting to create Open Data and some of them have web APIs. Explore your interests and see if there is an interface to that data!