16 February 2021
This is a basic rundown of how to interact with Google Storage programmatically in R. You need Google Storage in order to use Document AI (DAI) and other Google APIs at scale, because these services do not accept bulk file submissions directly. Instead they use Google Storage as an intermediary, so you need to know how to get files in and out of Google Storage.
It is possible to bulk upload and download files to Google Storage in the Google Cloud Console. In fact, for uploads it can sometimes be easier than doing it programmatically. But downloads and deletions are cumbersome if you have a lot of files. And since bulk processing in DAI can only be done with code, you might as well keep the whole workflow in R.
The biggest hurdle to using any Google API is authentication. It’s daunting for several reasons. For one, it involves abstract new concepts like “service accounts”, “OAuth 2.0”, and “scopes”. For another, the Google Cloud Console is so crowded that it’s an absolute nightmare to navigate as a beginner. In addition, different R packages have different procedures for authenticating with Google Cloud Services (GCS).
A full explanation of Google API authentication would fill a small book, but suffice it to say here that there are several different ways to authenticate to GCS from R. In the following I will walk you through one such way, the one I think is the simplest and most robust if you are primarily planning to use Google Storage and Google Document AI.
The first thing you need is a Gmail account. If you have one already, you can use that, or you can create a burner account for your GCS work.
While logged in to your Gmail account, go to the Google Cloud Console. Agree to the terms of service and click “Try for free”.
Accept the terms again, and add an address and a credit card. This last part is a prerequisite for using GCS.
The largest “unit” of your GCS activities is your project. You can think of it as your root folder, since you will most likely only ever need one unless you are a business or a developer (in principle, though, you can have as many projects as you like).
When you activate GCS, you are assigned a project named “My first project”. Click on “My first project” in the top blue bar, just to the right of “Google cloud services”.
Note that your project has an ID, usually consisting of an adjective, a noun, and a number. You’ll need this soon, so I recommend opening RStudio and storing it as a string:
my_project_id <- "<your project id>"
Return to the Google Cloud Console and look at the left column. Toward the top you see an entry called “Billing”. Click it. You’ll get to a screen saying “This project has no billing account”. Click “link a billing account” and set the billing account to “My billing account”.
All this is necessary for you to be able to access Google Storage and other Google tools programmatically. Both Google Storage and DAI are paid services, although for Google Storage the cost is negligible unless you plan to keep very large amounts of data there for a long time. For DAI, you’re looking at around EUR 0.06 per processed page, though at the time of writing, you get $300 worth of free credits.
Bring out the navigation menu on the left-hand side by clicking the icon with the three horizontal lines in the top left of the screen. Click on “APIs and services”. Scroll down and you’ll see a list of the APIs that are enabled by default. These include the Google Storage API and a few others. Now is a good time to activate any other APIs that you know you are going to use in conjunction with Google Storage, such as Document AI. Click on “Enable APIs and Services”, type “document ai” in the search field, click on “Cloud Document AI API” and then “Enable”. Repeat for any other APIs you are interested in.
Now open the navigation menu on the left again, and click on “APIs and services”. Then click on “Credentials” in the left pane.
Now we want to create a service account. Click on “Create credentials” in the top middle, then choose “Service account”. Give it any name you like (e.g. “my_rstudio_service_account”) and a description (e.g. “Interacting with GCS through R”), and click “Create”.
In section 2 titled “Grant this service account access to project”, add “Basic > Owner” to the service account’s roles.
Click “continue”, then “done” at the bottom. You should now see your service account listed at the bottom.
Now we need to generate a JSON file containing the login details for this service account. Click the small edit icon on the bottom right. On the next page, click “Add key”, choose “Create new key”, select the JSON format, and click “Create”. This should prompt a save-file window. Save the file to your hard drive. You can change the name to something more memorable if you like (but keep the “.json” extension). Also, take note of where you stored it. Now we are done in the Google Cloud Console and can finally start working in RStudio.
The last step is to store the path to the JSON file in your .Renviron file so that RStudio can authenticate you whenever you are working with GCS from R. Start by writing the following in the console:
usethis::edit_r_environ()
This will open a pane with your .Renviron file. If you haven’t modified it before, it is probably empty.
All you need to do is add a line with the following: GCS_AUTH_FILE='<full path to the JSON file you stored earlier>'
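For reference, the finished line might look something like this (with a made-up path; use your own):
# made-up example path – point this at your own key file
GCS_AUTH_FILE='C:/Users/yourname/keys/my_rstudio_service_account.json'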
Make sure all the slashes in the file path are forward slashes. Save the file, close it, and restart RStudio.
Now, when you load the googleCloudStorageR library, you will be auto-authenticated and ready to communicate with your Google Storage account from within R.
library(googleCloudStorageR)
googleCloudStorageR is a so-called wrapper for the Google Storage API, which means it translates your R input into HTTP requests that the API can understand. When you execute googleCloudStorageR functions, you are really sending GET and POST requests to Google and receiving responses in return.
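If the auto-authentication should fail (for example because the .Renviron variable was not picked up), you can usually also authenticate manually by pointing gcs_auth() at the same JSON key file. Treat the call below as a sketch and check ?gcs_auth for the exact arguments in your package version:
# manual authentication fallback – argument layout may differ across versions
gcs_auth("<full path to the json file you stored earlier>")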
Google Storage is a file repository, and it keeps your files in so-called “buckets”. You need at least one bucket to store files. To inspect your Storage account, first bring out your project ID. If you did not store it earlier, you can get it from the Google Cloud Console or from the JSON file with your service account key.
my_project_id <- "<your project id>"
Now let’s see how many buckets we have:
gcs_list_buckets(my_project_id)
Answer: zero, because we haven’t created one yet. We can do this with gcs_create_bucket(). Note that the bucket name has to be globally unique (“my_bucket” won’t work because someone’s already taken it). For this example, let’s call it “superbucket_2021”. Also add your location (“EU” or “US”).
gcs_create_bucket("superbucket_2021", my_project_id, location = "EU")
Now we can see the bucket listed:
gcs_list_buckets(my_project_id)
At this point you may want to tell R that this is your default bucket. It saves you from having to specify the bucket in every subsequent call.
gcs_global_bucket("superbucket_2021")
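If you want to double-check which bucket is currently set as the default, gcs_get_global_bucket() should return it:
# returns the bucket set with gcs_global_bucket()
gcs_get_global_bucket()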
We can get more details about the bucket with gcs_get_bucket():
gcs_get_bucket()
To get the bucket’s file inventory, we use gcs_list_objects():
gcs_list_objects()
At this point it’s obviously empty, so let’s upload something.
This we do with gcs_upload(). If the file is in your working directory, just write the filename; otherwise provide the full file path. If you want, you can store the file under another name in Google Storage with the name parameter; otherwise, just leave the parameter out.
write.csv(mtcars, "mtcars.csv")
gcs_upload("mtcars.csv", name = "overused_tutorial_dataset.csv")
Now let’s check the contents:
gcs_list_objects()
The Google Storage API handles only one file at a time, so for bulk uploads you need to use a loop or an apply function. To test this, let’s download two random PDFs.
library(purrr)
download.file("https://cran.r-project.org/web/packages/googleCloudStorageR/googleCloudStorageR.pdf",
"storager_doc.pdf")
download.file("https://cran.r-project.org/web/packages/gargle/gargle.pdf",
"gargle_doc.pdf")
my_pdfs <- list.files(pattern = "\\.pdf$")
map(my_pdfs, function(x) gcs_upload(x, name = x))
Let’s check the contents again:
gcs_list_objects()
Note that there’s a file size limit of 5 MB, but you can change it with gcs_upload_set_limit().
gcs_upload_set_limit(upload_limit = 20000000L)
Downloads are performed with gcs_get_object(). Here, too, you can save the file under a different name, but this time the parameter is saveToDisk.
gcs_get_object("overused_tutorial_dataset.csv", saveToDisk = "mtcars_duplicate.csv")
To download multiple files we need to loop or map. Let’s say we wanted to download all the PDFs in the bucket:
contents <- gcs_list_objects()
pdfs_to_download <- grep("\\.pdf$", contents$name, value = TRUE)
map(pdfs_to_download, function(x) gcs_get_object(x, saveToDisk = x, overwrite = TRUE))
We can delete files in the bucket with gcs_delete_object():
gcs_delete_object("overused_tutorial_dataset.csv")
To delete several, we again need to loop or map. Let’s try to delete everything in the bucket:
contents <- gcs_list_objects()
map(contents$name, gcs_delete_object)
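Should you also want to remove the bucket itself once it is empty, googleCloudStorageR provides gcs_delete_bucket(). We won’t run it here, since we still need the bucket below, but the call would look something like this:
# not run here – we still need the bucket for the next steps
gcs_delete_bucket("superbucket_2021")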
It is not possible to have folders in Google Storage. All files in a bucket are kept side by side in a flat structure. We can, however, imitate a folder structure by adding prefixes with forward slashes to the filenames. This can be useful for keeping files organized and for situations where you want to process only a subset of the files in the bucket.
To illustrate, let’s start by creating two folders in our working directory on our computer. We can make one for csv files and another for pdfs, and copy our files there.
dir.create("csvs")
dir.create("pdfs")
csv_files <- list.files(pattern = "\\.csv$")
pdf_files <- list.files(pattern = "\\.pdf$")
file.copy(csv_files, "./csvs")
file.copy(pdf_files, "./pdfs")
Now let’s upload both folders. We start by making vectors of the files in each folder. We add full.names = TRUE to the list.files() call to include the folder name in the file paths.
csvs_to_upload <- list.files("./csvs", full.names = TRUE)
pdfs_to_upload <- list.files("./pdfs", full.names = TRUE)
When we upload, we want to remove the “./” part of the file path before saving to Google Storage. This we do with gsub(). Note that “.” is a special character in regex, so we need to escape it with a backslash (escaping the “/” as well, as below, does no harm but isn’t strictly necessary).
map(csvs_to_upload, function(x) gcs_upload(x, name = gsub("\\.\\/", "", x)))
map(pdfs_to_upload, function(x) gcs_upload(x, name = gsub("\\.\\/", "", x)))
If we now check the bucket contents, we see that the files are in “folders”.
gcs_list_objects()
Bear in mind, though, that this is an optical illusion; the files are technically still on the same level.
We can now download the contents of one bucket “folder” as follows:
contents <- gcs_list_objects()
folder_to_download <- grep("^csvs/", contents$name, value = TRUE)
map(folder_to_download, function(x) gcs_get_object(x, saveToDisk = x, overwrite = TRUE))
Note that this script only worked because there already was a csvs folder in our working directory. If there hadn’t been, R would have returned an error, because the gcs_get_object() function cannot create new folders on your hard drive.
This means that if you have a tree of subfolders in the bucket, the only way to keep the tree structure when you download is to have a destination folder with the same folder structure ready. When this is not feasible, one workaround is to change the forward slashes in the filenames to something else, like underscores, and reconstruct the folder structure later.
contents <- gcs_list_objects()
map(contents$name, function(x) gcs_get_object(x, saveToDisk = gsub("/", "_", x), overwrite = TRUE))
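If you later want to rebuild the folder structure on your hard drive, one option is to reuse the bucket inventory as a mapping between the underscored local names and the original “paths”. The following is only a rough sketch: it is meant to run right after the download above, while contents is still in memory, and assumes the downloaded files have not been renamed in the meantime.
# sketch: recreate the "folders" locally, then move the files into place
local_names <- gsub("/", "_", contents$name)
invisible(lapply(unique(dirname(contents$name)), dir.create, recursive = TRUE, showWarnings = FALSE))
file.rename(local_names, contents$name)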
What if we want to upload a folder with lots of subfolders? To illustrate, let’s first make two subfolders in the csvs folder and put some files in there.
dir.create("./csvs/folder1")
dir.create("./csvs/folder2")
write.csv(AirPassengers, "./csvs/folder1/airpassengers.csv")
write.csv(Titanic, "./csvs/folder2/titanic.csv")
To upload the csvs folder with its subfolders, we just add recursive = TRUE to the list.files() call. We also clean the file path as we did earlier.
files_and_folders <- list.files("./csvs", full.names = TRUE, recursive = TRUE)
map(files_and_folders, function(x) gcs_upload(x, name = gsub("\\.\\/", "", x)))
Let’s inspect the bucket again.
gcs_list_objects()
And there they are, nicely organized in (fake) folders.
With this you should be able to work comfortably with Google Storage buckets in R.