Chapter 48 Using the {tabulapdf} package tp extract tables from PDFs
This chapter is part of the Data Import pathway.
Packages needed for this chapter include {tabulapdf}, {tidyverse}, and a working version of rJava. If you don’t have these packages installed, you can install them by copying the code chunk below and running it in your local RStudio session. You will also likely need the {rJavaEnv} package,
which will help you install rJava and connect it to your R environment.
This chapter takes a bit of setup to get rJava working, which is needed for the {tabulapdf} package to work and to interactively extract tables from PDFs.
To get rJava working on your computer can sometimes be a bear. The {rJavaEnv}
package has made this a lot more manageable.
Let’s walk through the steps to get rJava set up.
start by loading {rJavaEnv} with a library command.
Then run the java_check_version_cmd()
function in your local RStudio session
to see if you already have a working version of Java on your computer.
## JAVA_HOME:
## '/Users/peterhiggins/Library/Caches/org.R-project.R/R/rJavaEnv/installed/macos/aarch64/21'
## Java path:
## '/Users/peterhiggins/Library/Caches/org.R-project.R/R/rJavaEnv/installed/macos/aarch64/21/bin/java'
## Java version: "openjdk version \"21.0.6\" 2025-01-21 LTS OpenJDK Runtime
## Environment Corretto-21.0.6.7.1 (build 21.0.6+7-LTS) OpenJDK 64-Bit Server VM
## Corretto-21.0.6.7.1 (build 21.0.6+7-LTS, mixed mode, sharing)"
## [1] "21"
If your Java version is confirmed with a path to JAVA HOME and a version number (as it does on my home computer as seen here), you can skip the next step.
If not, you will probably get a message like
! JAVA_HOME is not set. [1] FALSE
If this is the case, copy and run the java_quick_install()
function from
the chunk below in your local RStudio session to install the
latest version of Java.
Now check that this worked by running the java_check_version_cmd()
function in your local RStudio session again.
This should now tell you the path to JAVA HOME, the Java path, and the Java version. Once this is set, you are ready to get started.
48.1 Why Tables from PDFs?
Many organizations publish reports as PDFs, with tables of data in them. It would be much more valuable for data anlysis to have these data publicly available in a format that can be used for analysis (like downloadable csv files). This chapter will show you how to do extract these tables from PDFs.
48.2 Extracting Tables from PDFs
The {tabulapdf} package is a great tool for extracting tables from PDFs. It uses the Java library Tabula (the equivalent of R packages in the Java computer language are called libraries
) to do this, and it works well with many different types of tables.
To use the {tabulapdf} package, you need to have Java installed on your computer. If you don’t have Java installed, you can follow the instructions in the previous section to install it.
Run the code chunk below to load the {tabulapdf} package into your R session, along with {tidyverse} to clean up the data tables we will extract (there will be lots of cleaning up).
Now let’s start with a PDF filled with health-related data tables. This is a nice example from KFF, the Kaiser Family Foundation, which is a non-profit organization that provides information on health issues at www.kff.org.
Unfortunately, they publish a lot of interesting health and health insurance data in tables in PDFs.
Let’s link this pdf to an object in R, by running the code chunk below. Copy and run this code chunk in your local RStudio session.
48.3 Extracting Tables from PDFs
Now we have to extract some data tables from this PDF. The nice folks at KFF have a habit of including lots of summary rows and subset totals as distinct rows in their tables, so we will have to select the columns and rows that we want, and frequently rename columns.
The {tabulapdf} package has a function called extract_tables()
that will help you pull out the tables from a PDF.
To get started, we can simply grab ALL the tables, or do this in a targeted way.
Take a moment to go the the website for this PDF, here, and scroll through the PDF to see what tables are in it. Keep this PDF window open for future reference during this chapter.
Let’s get a quick overview of this PDF by running the commands in the code chunk below.
## [1] 40
## $pages
## [1] 40
##
## $title
## [1] "KFF Survey of Consumer Experiences with Health Insurance"
##
## $author
## NULL
##
## $subject
## NULL
##
## $keywords
## [1] "Topline-Survey-of-Consumer-Experiences-with-Health-Insurance"
##
## $creator
## [1] "Microsoft® Word for Microsoft 365"
##
## $producer
## [1] "Microsoft® Word for Microsoft 365"
##
## $created
## [1] "Thu Jun 08 15:15:33 EDT 2023"
##
## $modified
## [1] "Mon Jun 12 16:07:40 EDT 2023"
##
## $trapped
## NULL
## [[1]]
## [1] 612 792
##
## [[2]]
## [1] 612 792
Now we know that there are 40 pages in this PDF, it was exported to PDF from Microsoft Word on June 8th 2023, and we have a title and some keywords.
We also learned that each page is 612 pixels wide and 792 pixels tall (the pages are all the same size, 8.5 inches wide by 11 inches tall).
48.4 Extracting Tables from the PDF
Let’s go whole hog. Let’s see how many tables there are. This will take a minute, but run the code chunk belowin your local R session.
# Extract all tables from the PDF
kff_tables <- extract_tables(kff_pdf, pages = 1:40, output = "tibble")
If you take a look in your Environment tab, you will now find the kff_tables object. It is a list of 38 tables. Click on the object in the Environment tab, and you will see a list of 38 tables in the Viewer tab. Each table has its own blue dot with a white triangle next to the table number. Click on one or two of these to expand them, and compare these to the PDF. See if you can figure out which table object corresponds to which table in the PDF.
You will probably find that there are actually more than 38 tables in the PDF. Several of these are stuck together in the table extraction process, and some of them are empty.
To take a look at one of the tables in the list, run the code chunk below, which will pull out the 17th table in the list and show it in the Viewer tab.
The result is a bit wonky, but recognizable as all of the tables in response to questions 17 through 19b, all bound together in a single table. Scroll through the PDF to find these.
48.5 Extracting a Specific Table
You can use slice()
to select specific rows to get the table that you actually want.
Let’s try to get the data in response to question 19a. This was: “In the past 12
months, did you end up paying more for treatment or services than you expected
to pay as a direct result of the problems youhad with your current health
insurance?”
This is a yes/no question, and the table shows the percentage of people who said yes, and the percentage of people who said no.
To do this, we can use the extract_tables()
function again, but this time we
will pull this one table
and slice out the rows that we want.
Copy and run the code below in your local R session. Review each line to make sure you know what each function is doing.
# Extract the 17th table from the list of tables
pay_more <- kff_tables[[17]] |>
slice(20:21) |>
rename(response = `...1`) |>
separate(`Total sponsored`, into = c("overall", "emp_sponsored"), sep = "\\s") |>
relocate(overall, .after = Marketplace) |>
janitor::clean_names()
pay_more |>
gt::gt()
response | emp_sponsored | medicare | medicaid | marketplace | overall |
---|---|---|---|---|---|
Yes | 33 | 19 | 12 | 35 | 28 |
No | 67 | 80 | 88 | 65 | 72 |
Despite the common assumption that employer sponsored health insurance in the US is the best, satisfaction seems to be higher for medicaid and medicare.
You can even remove the response column (the chi-squared test will not work with a character column), convert the percentages from character to numeric, and run a chi-squared test to show that the different types of insurance are significantly different from each other by this metric.
##
## Pearson's Chi-squared test
##
## data: mutate_if(select(pay_more, -response), is.character, as.numeric)
## X-squared = 19.732, df = 4, p-value = 0.0005641
48.6 Extracting a Specific Table
But this approach is a bit cumbersome, and involves a fair amount of hunting to
find the table you want, and to clean it up.
Instead, we can use the extract_tables()
function to extract a specific page.
This narrows it down a lot.
## [[1]]
## # A tibble: 20 × 10
## ...1 Excellent ...3 ...4 ...5 ...6 ...7 ...8 ...9 ...10
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> / Very <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 <NA> good/ <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 3 <NA> Good <NA> Very <NA> Fair… <NA> <NA> No <NA>
## 4 <NA> (NET) Exce… good Good (NET) Fair Poor answ… <NA>
## 5 a. physical health <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 6 Total 84 12 35 38 16 14 2 * n=3,…
## 7 Employer 89 14 37 39 11 10 1 - n=978
## 8 sponsored <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 9 Medicare 77 8 34 36 23 18 4 * n=885
## 10 Medicaid 68 8 24 36 32 27 5 - n=815
## 11 Marketplace 84 15 36 33 16 14 2 - n=880
## 12 b. mental health a… <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 13 emotional well- <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 14 being <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 15 Total 82 21 31 30 18 14 3 * n=3,…
## 16 Employer 84 19 35 30 16 13 3 - n=978
## 17 sponsored <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 18 Medicare 89 28 31 29 11 10 2 * n=885
## 19 Medicaid 64 12 17 35 36 29 7 - n=815
## 20 Marketplace 80 22 31 28 20 17 3 - n=880
Now there is only one table to look at, and it is a bit easier to see what is going on. Note that kff_tables is still a list (take a look in the Viewer tab), and we still have to pluck out the tibble we want, even though there is only one. Our usual data cleaning functions from {dplyr} will work on a tibgle, but not on a list. We need to pluck out the tibble we want. There are two ways to do this:
- kff_tables[[1]], or
- kff_tables %>% pluck(1)
For our purposes, on page 5 of this PDF, we want the mental health and emotional well-being table, so we can slice rows 15 to 20, and clean up the column names to get only the raw data columns.
mental_health <- kff_tables[[1]] |>
slice(15:20) |>
select(1, 3:5, 7:8) |>
rename(
insurance = `...1`,
excellent = `...3`,
very_good = `...4`,
good = `...5`,
fair = `...7`,
poor = `...8`
) |>
pivot_longer(
cols = c(excellent, very_good, good, fair, poor),
names_to = "rating",
values_to = "percent"
) |>
mutate(insurance = case_when(
insurance == "Total" ~ "overall",
TRUE ~ insurance
)) |>
filter(insurance != "sponsored") |>
pivot_wider(
names_from = insurance,
values_from = percent
) |>
janitor::clean_names() |>
relocate(overall, .after = marketplace)
mental_health
## # A tibble: 5 × 6
## rating employer medicare medicaid marketplace overall
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 excellent 19 28 12 22 21
## 2 very_good 35 31 17 31 31
## 3 good 30 29 35 28 30
## 4 fair 13 10 29 17 14
## 5 poor 3 2 7 3 3
As you can see, removing subtotals and rearranging to get clean data takes a bit of wrangling to get it into a tidy clean format of raw data. But now we can use this data to make plots or perform analyses.
48.7 Viewing and Targeting the PDF
The {tabula_pdf} package can go even farther. It can let you view thumbnails of the pages in the RStudio Viewer window, and even pick out the particular table area that you want to extract. This can only be done in an interactive R session, as it will open a shiny app in the Viewer tab and require a human to draw a rectangle around the desired section of the PDF to extract a table from.
## [1] 40
## Error in locate_areas(kff_pdf, pages = 22): locate_areas() is only available in an interactive session
# manually select the area of the table
# using the shiny widget in the Viewer pane
# then click the Done button at the top right
extract_tables(kff_pdf,
pages = 22,
guess = FALSE,
col_names = FALSE,
area = subtable
)
## Error: object 'subtable' not found
You can also combine the steps and just use the extract_areas()
function to
pull data from a specific area of the PDF. This is useful if you know exactly
which table you want to extract.
Let’s try this for question 15h (deny or delay drug coverage) on page 19 of this PDF in the code chunk below. Copy and run this in your local RStudio session. Click Done when you have completed your rectangle.
This should produce a 5 x 6 tibble,
with some goofy column names, a total row that you probably don’t want,
and some extra columns for no answer
and N that you don’t need.
If you did not draw the rectangle around the table quite right,
just run the code chunk again and adjust your area.
The deny
object that you selected with your rectangle is still a list,
so you need to pluck out the first table
with deny[[1]]
, or pluck(1) to get a single tibble instead of a list.
Then use the select()
and slice()
functions to get down to the raw data,
and rename the columns with the rename()
function to fix these one at a time,
or with the purrr:set_names()
function for renaming all the names at once.
An example of data cleaning is shown below.
I managed to cut off the b
in before
with my rectangle, so I had to shorten
the insurance column name in the rename function.
# Clean up the table
deny2 <- deny |>
pluck(1) |> # you could use deny[[1]] for the first 2 lines
rename(
insurance = `efore you received it`,
yes = `...2`,
no = `...3`,
not_tried = `...4`
) |>
slice(2:5) |>
select(-`...5`, -`...6`)
deny2
Now try it yourself. Try to grab an interesting section from the big table on page 22 of the PDF. Pick a question that interests you, grab it with a rectangle, and extract and clean the raw data into an analysis-ready tibble.
The example below is one possible solution for question 20d
# note that I excluded the last 2 columns by drawing a smaller rectangle
extract_areas(kff_pdf, pages = 22, output = "tibble") |>
pluck(1) |>
slice(3:6) |>
purrr::set_names(c("insurance", "yes", "no"))
## Error in locate_areas(file = file, pages = pages, copy = copy): locate_areas() is only available in an interactive session
Note that the interactive portion of drawing the rectangle actually makes this process less reproducible, because a key piece is not hard-coded, but it is a pretty convenient tool for quickly getting the exact data you want, especially when it is a difficult table structure.
This sort of interactive step can not be scheduled or run automatically, as it requires a person to draw the rectangle in real time, so it is fine for one-off interactive coding, but not for a reproducible workflow.
48.8 Your Turn
Let’s try to extract some data from a PDF from the Centers for Disease Control (CDC). Get the pdf from the code chunk below. Tables are on pages 26-51. You will want to find & inspect the PDF on the web at https://www.cdc.gov/nchs/data/hus/hus19-508.pdf to help you pick out an interesting table. Note that the page numbers within the pdf do not correspond to the actual page numbers of the document. You may have to add ~11 to the listed page number the get the actual page/table.
Pick out a table that you find interesting, and try to extract it and clean it up to raw data. Try doing one table programmatically, and (a harder one) interactively. The side-by-side ones on pages 47-48 (which are really 58-59) may be good examples for interactive extraction. See if you can get all 50 states into one table.
The example below is one possible table extraction for health care expenditures by an individual.
## New names:
## • `` -> `...1`
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...7`
## • `` -> `...8`
health_dollars |>
pluck(1) |>
slice(5:15) |>
separate(`Source of funds`, into = c("medicaid_f", "medicaid_s"), sep = " ", extra = "merge") |>
purrr::set_names(c("year", "hc_spend_total", "private_ins", "medicare", "medicaid_total", "medicaid_federal", "medicaid_state", "out_of_pocket", "other")) |>
mutate(year = parse_number(year)) |>
mutate(across(
.cols = where(is.character),
parse_number
))
## # A tibble: 11 × 9
## year hc_spend_total private_ins medicare medicaid_total medicaid_federal
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2008 2009. 698. 442 318. 188.
## 2 2009 2111. 731. 470. 346. 231.
## 3 2010 2191. 751. 489. 366. 247.
## 4 2011 2267. 780 512 374. 228.
## 5 2012 2361. 810. 534. 388. 224.
## 6 2013 2431. 824. 554. 406. 234.
## 7 2014 2556 870. 580. 447. 274.
## 8 2015 2710. 936. 607. 484. 304.
## 9 2016 2838. 990. 630. 504 318.
## 10 2017 2954. 1032. 659. 520. 323.
## 11 2018 3076. 1079. 697. 533. 331.
## # ℹ 3 more variables: medicaid_state <dbl>, out_of_pocket <dbl>, other <dbl>