Chapter 13 Iterations
Hello! Today, we’re going to be talking about more data structures and some ways to iterate your code. This tutorial builds on Week 2’s material, so you may want to go back and review that tutorial.
In this tutorial, we will be learning about tools in R that allow you to iterate, that is, to apply a function repeatedly. This is very useful when you need your code to do something over and over (and over and over) again. Iteration is one of the most important principles in programming. If you can iterate a task, you can hand the repetition to the computer and save a lot of time.
Iteration is incredibly useful when collecting data (e.g., scraping text from multiple web pages) and processing data (e.g., conditionally adjusting a variable).
For this tutorial, we will use a dataset of tweets that I collected from the #academictwitter hashtag.
Let’s bring this dataset in using the read_csv() function from readr.
This dataset has about 20,000 tweets. That’s quite a bit to work with, so let’s focus on the tweets talking about graduate school by searching for the “grad school” substring. If you recall from last week’s tutorial, we discussed how to subset a dataset using a substring. Let’s do that now with the mutate() %>% filter() combination.
gradschool_tweets <- tweet_df %>%
  mutate(grads = str_detect(text, "(?i)grad school")) %>% #(?i) makes the match case-insensitive
  filter(grads == TRUE) #keep only the rows that matched
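To see what the (?i) flag is doing in that pattern, here is a tiny sketch on a few made-up strings (these are hypothetical examples, not tweets from the dataset). It assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical example text, not taken from the dataset
example_text <- c("Applying to Grad School!", "grad school is hard", "graduate seminar today")

# (?i) makes the rest of the pattern case-insensitive
matches <- str_detect(example_text, "(?i)grad school")
print(matches)
```

The first two strings contain the phrase (in different capitalizations), while the third only contains "graduate", so it does not match.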
Now, let’s focus on the tweets that actually have profile URLs (the variable we will be using in this exercise). We’ll store these in grad_tweet_urls.
70 tweets!
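The code for this filtering step is not shown here. One way it might look, assuming that missing profile links are stored as NA in the url column (that is an assumption about the data), is sketched below on a toy dataset with hypothetical tweet text.

```r
library(dplyr)

# Toy stand-in for gradschool_tweets (the text values are hypothetical)
toy_tweets <- tibble(
  text = c("grad school tips", "Grad School life", "grad school funding"),
  url  = c("https://t.co/5UFWxMJvo0", NA, "https://t.co/tc35om5yvh")
)

# Keep only the rows where the profile URL is present
toy_urls <- toy_tweets %>%
  filter(!is.na(url))

nrow(toy_urls)
```

If the real data marks missing links differently (for example, with an empty string), the condition inside filter() would need to change accordingly.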
13.1 Loops
So in this dataset, we have a few URL columns. One such column is the url variable. If the user provides a link in their profile, this is the t.co link to that site.
## [1] "https://t.co/5UFWxMJvo0" "https://t.co/tc35om5yvh" "https://t.co/tc35om5yvh" "https://t.co/PgfMZC8rgS" "https://t.co/3wrdD8W9d6" "https://t.co/3wrdD8W9d6"
But what if we wanted to see the original URL? We can use the longurl package’s function expand_urls().
#install.packages("longurl") #here is the longurl package
library(longurl)
#?expand_urls #remove the # sign in front of this line to learn more about this function
expand_urls(grad_tweet_urls$url[1])
## # A tibble: 1 x 3
## orig_url expanded_url status_code
## <chr> <chr> <int>
## 1 https://t.co/5UFWxMJvo0 https://www.oacommunity.org/ 200
expand_urls(grad_tweet_urls$url[2])
## # A tibble: 1 x 3
## orig_url expanded_url status_code
## <chr> <chr> <int>
## 1 https://t.co/tc35om5yvh https://phdvoice.org/ 200
What if we wanted to apply this to the whole data frame? We can do so using a loop. Loops have a very specific structure:
for (i in start_number:end_number) { #i takes each value from start_number to end_number, which sets the number of times you want it to loop
  #do cool iterative stuff here
}
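Here is a minimal, self-contained sketch of that structure in base R. The vector name squares is just an illustration, not something from the dataset.

```r
# Preallocate a numeric vector to hold the results
squares <- numeric(5)

for (i in 1:5) {      # i takes the values 1, 2, 3, 4, 5 in turn
  squares[i] <- i^2   # store the square of i in position i
}

print(squares)
```

Creating the results vector before the loop and filling it by position is the same pattern we use with the data frame column in the next step.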
Let’s now fill in that information. First, we create a new empty character variable named profile_url_expanded. Then, we fill it using the loop. We want to cover the entire data frame, so we’ll go from row 1 to the last row (70, which is the number of rows, or nrow(), in the dataset).
A note before proceeding with your code: This code may take some time (we are trying to get URL data from 70 rows). To figure out how long it takes, I’m going to show you how to time a chunk of code (there are a couple of ways to do this, but this one is easy because it uses the tictoc library).
#install.packages("tictoc")
library(tictoc)
tic() #start the timer
grad_tweet_urls$profile_url_expanded <- "" #creates a new, "empty" character variable
for (i in seq_len(nrow(grad_tweet_urls))) { #nrow() returns the number of rows in grad_tweet_urls
  result <- expand_urls(grad_tweet_urls$url[i]) #run the function on the URL in row i
  grad_tweet_urls$profile_url_expanded[i] <- result$expanded_url[1] #keep the expanded URL (the second column of the result)
}
toc() #stop the timer and print the elapsed time
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/PgfMZC8rgS]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/3wrdD8W9d6]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/3wrdD8W9d6]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/ee2tXHPABW]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Method Not Allowed (HTTP 405).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/SCoIvd05WO]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/Znoy1BhmJU]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/CoOgdi3SwY]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/2iv87TwlKy]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/bHWgL8hIVx]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Not Found (HTTP 404).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/WZ1ai3Kxjk]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## 51.33 sec elapsed
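If you would rather not install a package, base R’s system.time() is another way to time code. Below is a small sketch; slow_task() is a hypothetical stand-in for any slow piece of code, not a function from this tutorial.

```r
# Hypothetical slow task: pause briefly on each iteration
slow_task <- function(n) {
  for (i in seq_len(n)) {
    Sys.sleep(0.01)
  }
  n
}

timing <- system.time(slow_task(5))

# "elapsed" is the wall-clock time in seconds
timing[["elapsed"]]
```

The exact number will differ from run to run, just as the tictoc result will.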
If you have empty rows, you may get a warning such as “number of items to replace is not a multiple of replacement length”.
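Messages like this tend to show up when the value being assigned into a single slot does not contain exactly one element. One defensive pattern, sketched here with a hypothetical lookup() function standing in for the real call, is to fill the column with NA first and only assign when exactly one value comes back.

```r
# Hypothetical stand-in for a lookup that may return nothing
lookup <- function(x) {
  if (is.na(x)) character(0) else toupper(x)
}

inputs  <- c("a", NA, "c")
outputs <- rep(NA_character_, length(inputs))  # preallocate with NA

for (i in seq_along(inputs)) {
  result <- lookup(inputs[i])
  # Only assign when exactly one value came back; otherwise keep NA
  if (length(result) == 1) {
    outputs[i] <- result
  }
}

print(outputs)
```

The second input returns nothing, so its slot simply stays NA instead of interrupting the loop.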
Let us now turn to the apply() functions.
13.2 apply()
apply() is a function in base R that allows you to apply a function across the rows or columns of a data frame, array, or matrix. It is usually more compact than a loop and can sometimes be faster, so you will see many people recommending apply() as opposed to the for-loop.
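As a quick illustration of apply() itself, here is a sketch on a small made-up matrix (the scores are hypothetical). The second argument, MARGIN, controls whether the function runs over rows (1) or columns (2).

```r
# A small 2 x 3 matrix of hypothetical scores
scores <- matrix(1:6, nrow = 2)

row_totals <- apply(scores, 1, sum)  # MARGIN = 1: apply sum() to each row
col_totals <- apply(scores, 2, sum)  # MARGIN = 2: apply sum() to each column

print(row_totals)
print(col_totals)
```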
One of the apply functions is lapply(), which applies a function to each element of a vector or list and returns the results as a list.
tic()
grad_tweet_urls$profile_url_expanded <- lapply(grad_tweet_urls$url, expand_urls) #returns a list with one tibble per URL
toc()
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/PgfMZC8rgS]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/3wrdD8W9d6]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/3wrdD8W9d6]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/ee2tXHPABW]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Method Not Allowed (HTTP 405).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/SCoIvd05WO]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/Znoy1BhmJU]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/CoOgdi3SwY]
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/2iv87TwlKy]
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/bHWgL8hIVx]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Forbidden (HTTP 403).
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## Warning in doTryCatch(return(expr), name, parentenv, handler): Not Found (HTTP 404).
## Warning in FUN(X[[i]], ...): Invalid URL: [https://t.co/WZ1ai3Kxjk]
## Warning in FUN(X[[i]], ...): httr::warn_for_status() on HEAD request result
## 41.28 sec elapsed
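In the above chunks, expand_urls() returns a tibble with 3 variables (orig_url, expanded_url, and status_code), so grad_tweet_urls$profile_url_expanded is now a list of tibbles. Functions such as tidyr’s unnest() can turn a list column like this into 3 different variables in the main dataset. Here, we only need the expanded URL, so the code below pulls the second element (expanded_url) out of each tibble and flattens the result into a character vector.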
grad_tweet_urls$profile_url_expanded <- lapply(grad_tweet_urls$profile_url_expanded, `[[`, 2) %>% unlist()
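To try this two-step pattern without waiting on the network, here is a sketch that uses a hypothetical fake_expand() function in place of expand_urls(). The short links and the example.org addresses are made up for illustration.

```r
# Hypothetical stand-in for expand_urls(): returns a one-row data frame
fake_expand <- function(u) {
  data.frame(
    orig_url     = u,
    expanded_url = paste0("https://example.org/", basename(u)),
    status_code  = 200L,
    stringsAsFactors = FALSE
  )
}

short_urls <- c("https://t.co/abc", "https://t.co/xyz")  # made-up links

results  <- lapply(short_urls, fake_expand)   # a list of data frames
expanded <- unlist(lapply(results, `[[`, 2))  # second column of each one

print(expanded)
```

The first lapply() gives back one data frame per URL, and the second one uses `[[` to pick column 2 from each before unlist() collapses everything into a plain character vector.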
(We’re going to delete these columns for now and re-create them using purrr::map().)
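As a rough preview of the purrr approach, here is a sketch using map_chr(), a variant of map() that returns a character vector instead of a list. It assumes purrr is installed, and fake_expand_one() is a hypothetical stand-in for a real URL-expanding function.

```r
library(purrr)

# Hypothetical stand-in for a function that expands one URL
fake_expand_one <- function(u) {
  paste0("https://example.org/", basename(u))
}

short_urls <- c("https://t.co/abc", "https://t.co/xyz")  # made-up links

# map_chr() applies the function to each element and returns a character vector
expanded <- map_chr(short_urls, fake_expand_one)

print(expanded)
```

Because map_chr() promises a character vector, there is no need for a separate unlist() step afterwards.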
13.3 Other Resources
Learn about the speed of different functions here
Review our course textbook chapter on iteration, chapter 21
Practice purrr::map() with Game of Thrones data here
Learn about apply vs. loops here
Here is an example of a more complex loop/apply() from Stack Overflow: link
Hadley Wickham’s argument for why we should use purrr::map() is here
A more advanced tutorial about purrr from Tom Mock can be found here (Tom Mock runs TidyTuesday).