Chapter 15 Data Scraping

Hello! In this tutorial, we will be learning how to scrape data from a website. We will rely on a range of packages, but the one we’ll really focus on is rvest (which is maintained by Hadley Wickham, one of the co-authors of the book). You can check out the functions of the rvest package in the documentation here.

Note: To use the rJava and boilerpipeR packages loaded below, you will need to have Java installed on your computer.

#newpackages <- c("RcppRoll", "boilerpipeR", "rJava", "rvest", "xml2")
#install.packages(newpackages) #uncomment and run these two lines once to install the packages
library(tidyverse)
library(RcppRoll)
library(rJava)
library(boilerpipeR)
library(rvest)
library(xml2)

If you recall from class, data scraping is a strategy for downloading data directly from a website. Normally, you would have to go to a page, manually copy its content, and then paste it somewhere else (e.g., into a spreadsheet). Using computational methods, you can do all of this without manually going through each page.

This process takes the following steps:

  1. Direct the code to the appropriate website (using the url).
  2. Get the html code for that website.
  3. Isolate the parts of the html that you want to collect.
  4. Save those parts in your environment (i.e., assign them to an object).
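Before we walk through each step, here is a compact sketch of the whole pipeline. Note that the url and the "p" selector below are placeholders for illustration, not the ones we will use in this tutorial.

page_url <- "https://example.com" #1. direct the code to the website
page_html <- read_html(page_url) #2. get the html code for that website
page_node <- html_nodes(page_html, "p") #3. isolate the parts you want
page_text <- html_text(page_node) #4. save those parts in an object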

Let’s begin with a link to a forum, which you can also go to here. At this link there is a forum thread with multiple posts, and you may want to collect the text of each post.

The first thing you’ll want to do is assign it to an object.

forum_url <- "https://stackoverflow.com/questions/14737773/replacing-occurrences-of-a-number-in-multiple-columns-of-data-frame-with-another"

Next, you need a function to read the html code for that website. To do this, we’ll use read_html(), which is in the xml2 package.

forum_html <- read_html(forum_url)

You’ll notice this produces a quirky list with two elements (the page’s head and body), each of which is itself a list (in fact, it’s a list of lists of lists). If you just print the object (without any functions), it will tell you that it is an {html_document} document type.

forum_html
## {html_document}
## <html itemscope="" itemtype="https://schema.org/QAPage" class="html__responsive " lang="en">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<title>replace - Replacing occurrences of a number in multiple columns of data  ...
## [2] <body class="question-page unified-theme">\r\n    <div id="notify-container"></div>\r\n    <div id="custom-header"></div>\r\n        \r\n<header class="s-to ...

In order to read this complex {html_document}, you’ll want to use functions that isolate the specific parts of the website that you are interested in. And, because each website is built differently, the data scraping strategy you use for one website (e.g., Stack Overflow) will be different from the strategy you use for another website (e.g., RStudio).

It’s worth repeating this: How you scrape content in one website will be different from how you scrape content in another.

Okay! Now that we have this disclaimer, let’s get on with the data scraping.

In HTML, the counterparts of R “objects” are called nodes (or “elements”). An HTML page has a branching, tree-like structure called the DOM (or “Document Object Model”), and nodes are like its branches, leaves, and flowers. For more on HTML jargon, visit this site.

For this tutorial, the main thing you need to remember is that, to identify the parts of the website you want to scrape, you need to know the node (websites very rarely have the same node names, so the nodes in The New York Times won’t be the same as the nodes in The Texas Tribune).
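To get a feel for nodes before we scrape a real page, you can parse a toy snippet of HTML yourself. The markup below is made up for illustration, and it uses the html_nodes() function that we will meet properly in a moment.

mini_html <- read_html("<html><body>
  <div id='question'>What is a node?</div>
  <div class='answer'>A branch of the DOM tree.</div>
  </body></html>")
html_nodes(mini_html, "#question") #select the node with the id 'question'
html_nodes(mini_html, ".answer") #select the nodes with the class 'answer'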

To find the nodes you want, it can be useful to have a CSS selector tool. I recommend SelectorGadget for Chrome or ScrapeMate for Firefox.

The rvest package has a couple of neat functions that allow you to then get the information from a specific node. In our example, we may notice that there are two nodes of text: #question and .answer (in CSS selectors, a # prefix marks an id and a . prefix marks a class). To pull the right nodes out of your {html_document}, you can use the html_nodes() function in rvest!

question_node <- html_nodes(forum_html, '#question') #the node with the id "question"
answer_node <- html_nodes(forum_html, '.answer') #the nodes with the class "answer"

Okay, so now you’ve pulled out the nodes, which are list-like objects (nodesets). The number of elements corresponds to the number of nodes that you have pulled. The object question_node holds just one node, because the page has a single #question node; by contrast, there are four (4) .answer nodes (as a result, answer_node is a nodeset of 4).
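If you want to double-check these counts, length() works on a nodeset:

length(question_node) #returns 1: there is one #question node
length(answer_node) #returns 4: there are four .answer nodes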

answer_node
## {xml_nodeset (4)}
## [1] <div id="answer-14737832" class="answer js-answer accepted-answer js-accepted-answer" data-answerid="14737832" data-parentid="14737773" data-score="73" data ...
## [2] <div id="answer-14737883" class="answer js-answer" data-answerid="14737883" data-parentid="14737773" data-score="5" data-position-on-page="2" data-highest-s ...
## [3] <div id="answer-14738286" class="answer js-answer" data-answerid="14738286" data-parentid="14737773" data-score="2" data-position-on-page="3" data-highest-s ...
## [4] <div id="answer-14737890" class="answer js-answer" data-answerid="14737890" data-parentid="14737773" data-score="1" data-position-on-page="4" data-highest-s ...

But pulling out the node is not the same as pulling out the content of the node. For that, you will need to extract the text of the node using the rvest function html_text(). The html_text() function takes a node as its main argument.

answer_text <- html_text(answer_node)
#answer_text <- html_text(html_nodes(forum_html, '.answer')) #you can put the html_nodes() function within html_text()
#answer_text <- html_nodes(forum_html, '.answer') %>% html_text()

question_text <- html_text(question_node)

answer_text[1]
## [1] "\r\n    \r\n        \r\n            \r\n        \r\n            \r\n        \r\n            73\r\n        \r\n        \r\n            \r\n\r\n\r\n        \r\n\r\n    \r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n            \r\n                \r\n                    \r\n            \r\n\r\n    \r\n    \r\n\r\n\r\n\r\n        \r\n\r\n        \r\n\r\n\r\n        \r\n\r\n\r\n    \r\n    \r\nyou want to search through the whole data frame for any value that matches the value you're trying to replace.  the same way you can run a logical test like replacing all missing values with 10..\n\ndata[ is.na( data ) ] <- 10\n\n\nyou can also replace all 4s with 10s.\n\ndata[ data == 4 ] <- 10\n\n\nat least i think that's what you're after?\n\nand let's say you wanted to ignore the first row (since it's all letters)\n\n# identify which columns contain the values you might want to replace\ndata[ , 2:3 ]\n\n# subset it with extended bracketing..\ndata[ , 2:3 ][ data[ , 2:3 ] == 4 ]\n# ..those were the values you're going to replace\n\n# now overwrite 'em with tens\ndata[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10\n\n# look at the final data\ndata\n\n    \r\n    \r\n        \r\n            \r\n                \r\n\r\n\r\n\r\n\r\n    \r\n\r\n        \r\n            Share\r\n        \r\n\r\n\r\n                    \r\n                        Improve this answer\r\n                    \r\n\r\n                \r\n                    \r\n                        Follow\r\n                    \r\n                \r\n\r\n\r\n\r\n\r\n\r\n\r\n    \r\n    \r\n\r\n            \r\n            \r\n\r\n    \r\n        edited Feb 6, 2013 at 20:14\r\n    \r\n    \r\n        \r\n    \r\n    \r\n        \r\n        \r\n            \r\n        \r\n    \r\n\r\n            \r\n\r\n\r\n            \r\n                \r\n    \r\n        answered Feb 6, 2013 at 20:09\r\n    \r\n    \r\n        \r\n    \r\n    \r\n        Anthony DamicoAnthony Damico\r\n        \r\n            5,78977 gold badges4646 silver badges7777 bronze badges\r\n        \r\n    \r\n\r\n\r\n\r\n            \r\n        \r\n        \r\n    \r\n    \r\n    \r\n\r\n\r\n\r\n\r\n\r\n            1 \r\n    \r\n        \r\n            \r\n        \r\n            \r\n                    1\r\n            \r\n        \r\n        \r\n            \r\n                \r\n                I flipping swear I tried this and it wasn't working for me before. I hope to get to the point where I don't kick myself everytime I post to SO... By the way -- you're the 1min R video guy, aren't you!? Those rock.\r\n                \r\n                \r\n<U+2013><U+00A0>Hendy\r\n                \r\n                Feb 6, 2013 at 21:57\r\n            \r\n        \r\n    \r\n\r\n            \r\n\r\n        \r\n                    Add a comment\r\n                <U+00A0>|<U+00A0>\r\n            \r\n                 \r\n    \r\n    \r\n"

The object you created (answer_text) is a vector of 4 character values. Each value is pretty messy though, so you may want to clean up some of the residual html formatting. In particular, you likely see a lot of \r (the carriage return) and \n (the newline). You can learn more about these two here.
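If you want to see what these characters actually do, cat() renders them rather than printing the escape codes (a small, self-contained illustration):

cat("first line\nsecond line") #"\n" starts a new line; "\r" returns the cursor to the start of a line
## first line
## second line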

If you want to get rid of them, you’ll need to use a package that can manipulate character values (“strings”). Luckily, in tidyverse, there is such a package: stringr! In that package, you’ll be able to use the function str_replace_all(), which replaces one substring with another. (A substring is part of a character object or string. For example, in the character object “A banana walked into town,” the word “banana” is a substring.)
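For instance, applying str_replace_all() to the sentence from above:

str_replace_all("A banana walked into town", "banana", "mango") #replace the substring "banana"
## [1] "A mango walked into town"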

In the example below, I take the answer_text object and replace any instances of \n with nothing ("").

answer_text <- answer_text %>%
  str_replace_all("\n", "") %>%
  str_replace_all("\r", "") #you can also use str_remove_all()
answer_text[1]
## [1] "                                                                73                                                                                                                                            you want to search through the whole data frame for any value that matches the value you're trying to replace.  the same way you can run a logical test like replacing all missing values with 10..data[ is.na( data ) ] <- 10you can also replace all 4s with 10s.data[ data == 4 ] <- 10at least i think that's what you're after?and let's say you wanted to ignore the first row (since it's all letters)# identify which columns contain the values you might want to replacedata[ , 2:3 ]# subset it with extended bracketing..data[ , 2:3 ][ data[ , 2:3 ] == 4 ]# ..those were the values you're going to replace# now overwrite 'em with tensdata[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10# look at the final datadata                                                                    Share                                                    Improve this answer                                                                                Follow                                                                                edited Feb 6, 2013 at 20:14                                                                                                                    answered Feb 6, 2013 at 20:09                                Anthony DamicoAnthony Damico                    5,78977 gold badges4646 silver badges7777 bronze badges                                                                1                                                                 1                                                                        I flipping swear I tried this and it wasn't working for me before. I hope to get to the point where I don't kick myself everytime I post to SO... By the way -- you're the 1min R video guy, aren't you!? Those rock.                                <U+2013><U+00A0>Hendy                                Feb 6, 2013 at 21:57                                                                Add a comment                <U+00A0>|<U+00A0>                                     "

You can replace them with something else, of course, but whatever you substitute in should not make the text harder to read (like below).

question_text %>% 
  str_replace_all("\n", "RESEARCH TIME") %>%
  str_replace_all("\r", "NAP TIME")
## [1] "NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME\t\tNAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            31NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME                NAP TIMERESEARCH TIMEETA: the point of the below, by the way, is to not have to iterate through my entire set of column vectors, just in case that was a proposed solution (just do what is known to work once at a time).RESEARCH TIMERESEARCH TIMEThere's plenty of examples of replacing values in a single vector of a data frame in R with some other value.RESEARCH TIMERESEARCH TIMEReplace a value in a data frame based on a conditional (if) statement in RRESEARCH TIMEreplace numbers in data frame column in r [duplicate]RESEARCH TIMEAnd also how to replace all values of NA with something else:RESEARCH TIMERESEARCH TIMEHow to replace all  values in a data.frame with another ( not 0) valueRESEARCH TIMEWhat I'm looking for is analogous to the last question, but basically trying to replace one value with another. I'm having trouble generating a data frame of logical values mapped to my actual data frame for cases where multiple columns meet a criteria, or simply trying to do the actions from the first two questions on more than one column.RESEARCH TIMERESEARCH TIMEAn example:RESEARCH TIMERESEARCH TIMEdata <- data.frame(name = rep(letters[1:3], each = 3), var1 = rep(1:9), var2 = rep(3:5, each = 3))RESEARCH TIMERESEARCH TIMEdataRESEARCH TIME  name var1 var2RESEARCH TIME1    a    1    3RESEARCH TIME2    a    2    3RESEARCH TIME3    a    3    3RESEARCH TIME4    b    4    4RESEARCH TIME5    b    5    4RESEARCH TIME6    b    6    4RESEARCH TIME7    c    7    5RESEARCH TIME8    c    8    5RESEARCH TIME9    c    9    5RESEARCH TIMERESEARCH TIMERESEARCH TIMEAnd say I want all of the values of 4 in var1 and var2 to be 10.RESEARCH TIMERESEARCH TIMEI'm sure this is elementary and I'm just not thinking through it properly. I have been trying things like:RESEARCH TIMERESEARCH TIMEdata[data[, 2:3] == 4, ]RESEARCH TIMERESEARCH TIMERESEARCH TIMEThat doesn't work, but if I do the same with data[, 2] instead of data[, 2:3], things work fine. 
It seems that logical test (like is.na()) work on multiple rows/columns, but that numerical comparisons aren't playing as nicely?RESEARCH TIMERESEARCH TIMEThanks for any suggestions!RESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME                NAP TIMERESEARCH TIME                    NAP TIMERESEARCH TIME                    rreplaceindexingNAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME                NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            ShareNAP TIMERESEARCH TIME        NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME                    NAP TIMERESEARCH TIME                        Improve this questionNAP TIMERESEARCH TIME                    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME                NAP TIMERESEARCH TIME                    NAP TIMERESEARCH TIME                        FollowNAP TIMERESEARCH TIME                    NAP TIMERESEARCH TIME                NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME            NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME                NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        edited May 23, 2017 at 12:10NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        CommunityBotNAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            111 silver badgeNAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME                NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME                NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        asked Feb 6, 2013 at 20:05NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        HendyHendyNAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            10.2k1515 gold badges6565 silver badges7171 bronze badgesNAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIMENAP TIMERESEARCH TIME             NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME        NAP TIMERESEARCH TIME                    Add a commentNAP TIMERESEARCH TIME                <U+00A0>|<U+00A0>NAP TIMERESEARCH TIME            NAP TIMERESEARCH TIME                 NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIME    NAP TIMERESEARCH TIMENAP TIMERESEARCH TIME"

So let’s string these steps together: first we’ll identify the node, then we’ll pull the text, and finally, we’ll replace some of the substrings.

answer_data <- html_nodes(forum_html, '.answer') %>% #select the node
  html_text() %>% #pull the text
  str_replace_all("\n", "") %>% #remove \n
  str_replace_all("\r", "") #remove \r

question_data <- html_nodes(forum_html, '#question') %>% #select the node
  html_text() %>% #pull the text
  str_replace_all("\n", "") %>% #remove \n
  str_replace_all("\r", "") #remove \r

You can do this with other nodes too, like the number of upvotes (.fc-black-200).

user_data <- html_nodes(forum_html, '.fc-black-200') %>% #select the node
  html_text() %>% #pull the text
  str_replace_all("\n", "") %>% #remove \n
  str_replace_all("\r", "") #remove \r
user_data
## [1] ""                                                                                                                                                      
## [2] "                                        31                                                "                                                            
## [3] "                                        73                                                                                                            "
## [4] "                                        5                                                                                                            " 
## [5] "                                        2                                                                                                            " 
## [6] "                                        1                                                                                                            "

One thing you’ve probably noticed with all of the scraped data thus far is that the text is quite messy. Compared to collecting from an API, scraped data often requires substantially more cleaning. At the same time, there is often data that you cannot get using an API. In these situations, data scraping is necessary.

15.1 Bonus Content

Thus far, we’ve talked about scraping from one webpage. But what if you wanted to scrape from multiple web pages (like a list of news stories)?

One of the first functions I built in R did just this (see the last chapter for how to build functions). Learning how to write functions is one of the key steps to becoming a computational scholar. Let’s see how this works using a list of URLs for Vox articles about Elon Musk.

news_urls <- read_csv("data/mediacloud_musk_2023.csv")
## Rows: 50 Columns: 16
## -- Column specification -------------------------------------------------------------------------------------------------------------------------------------------
## Delimiter: ","
## chr  (8): guid, language, media_name, media_url, metadata, story_tags, title, url
## dbl  (3): media_id, processed_stories_id, stories_id
## lgl  (3): ap_syndicated, feeds, word_count
## dttm (2): collect_date, publish_date
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

The first thing we have to do with this dataset is identify the variable with the urls (for us, this is the url column). Then, we need to extract the html code of each url (our dataset has 50 urls, so we should expect 50 html documents). To do this, we’ll use the lapply() function. lapply() is an extremely versatile function that applies another function to every element of a list or vector and returns a list. There are many, many tutorials that teach the apply() family, but here are some that I like: 1, 2 (from our textbook), 3 (a YouTube tutorial).
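As a quick, trivial illustration of the idea (not part of our scraping), lapply() returns a list with one element per input value:

lapply(c(1, 4, 9), sqrt) #apply sqrt() to each element of the vector
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3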

urls <- news_urls$url #identify the column with the urls

read_html(urls[13]) #test read_html() on a single url first
## {html_document}
## <html lang="en">
## [1] <head>\n<title>Bluesky, the Twitter replacement that AOC and Chrissy Teigen just joined, explained - Vox</title>\n<meta http-equiv="Content-Type" content="t ...
## [2] <body class="entry_key_unison_standard entry_layout_unison_main entry_template_standard" data-entry-id="23467020">\n    <a class="sr-only link-skip" href="# ...
url_xml <- lapply(urls, function(u) try(read_html(u))) #apply read_html() to each url, wrapped in try() so one bad url cannot stop the loop

You’ll notice that I also used the try() function (which may be a different color from the functions you normally use). try() is a special type of function that lets you attempt an expression without stopping at an error. If you are using lapply() to apply a function across a vector, the call may “break” in the middle of the dataset if there is an error (e.g., if value 32 is empty, you cannot read_html() that value). Wrapping each call in try() allows you to skip over the error and move on to the next url.
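Here is a minimal, generic illustration of try() catching an error that would otherwise halt everything:

bad_result <- try(log("not a number"), silent = TRUE) #log() on a character string would normally throw an error
class(bad_result) #the error is captured in a "try-error" object instead
## [1] "try-error"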

Now that we’ve done these two lines, you should have a list of 50 html documents (url_xml). Now, what we need to do is create a function that allows us to pull the right content out of each html document. You can then apply your new function to the whole list. Below, I call my function textScraper.

textScraper <- function(x) {
  html_nodes(x, ".c-entry-content") %>% #select the node that holds the Vox article body
    html_text() %>% #pull the text
    str_replace_all("\n", "") %>% #removes "\n"
    paste(collapse = '') #collapse multiple matches into a single string
}

#textScraper(url_xml[[1]]) #use this to see the output of this function on *one* url

Now that we have our function, we are ready to apply it!

article_text <- lapply(url_xml, textScraper)
article_text[1]
## [[1]]
## [1] "  When Elon Musk took over Twitter, he said he wanted to protect its place as a <U+201C>digital town square,<U+201D> where ideas from all corners of the internet could flourish. But soon, if you want your voice to really be heard in the town square, you<U+2019>ll need to pay.Musk tweeted that, starting April 15, Twitter will only recommend content from paid accounts in the For You feed, the first screen users see when they open the app. It<U+2019>s just one of several seemingly random changes Musk has been making to Twitter<U+2019>s core user experience without explanation. He changed the Twitter homepage<U+2019>s icon from its classic blue bird logo to <U+201C>doge<U+201D> <U+2014> the cartoonish Shiba Inu dog meme linked to the cryptocurrency dogecoin <U+2014> and for some users, the app started seemingly inserting tweets from accounts people didn<U+2019>t follow into the their Following feed.We don<U+2019>t know exactly why the doge logo suddenly appeared at the top of the homepage, but there is one relevant piece of news people are pointing to: Elon Musk is currently facing a $258 billion lawsuit alleging that he ran a pyramid scheme to support dogecoin. Musk<U+2019>s legal team asked a court to dismiss the dogecoin suit a few days before doge appeared on Twitter<U+2019>s site.As promised pic.twitter.com/Jc1TnAqxAV<U+2014> Elon Musk (@elonmusk) April 3, 2023It<U+2019>s hard to make any real sense of Musk<U+2019>s constant changes to Twitter, but one general trend, is that if you don<U+2019>t start paying $8 a month for Twitter<U+2019>s subscription plan, Twitter Blue, you<U+2019>ll have a harder time on the app. For people tweeting, you<U+2019>ll have less of a chance that your tweets will actually get seen, and for people viewing but not posting on Twitter, you<U+2019>ll be seeing a lot more content from paid accounts, which currently make up only 0.2 percent of all users. After Twitter users started complaining about the new plan, Elon clarified that people you follow will also show up in the For You feed, but the main point still stands: Musk wants to turn your Twitter feed into a pay-to-play arena.      Related        The ridiculous but important Twitter check mark fiasco, explained.  The introduction of random accounts in the Following tab added insult to injury. Users used to be able to escape the randomness of the For You feed by using the Following tab, which showed you accounts of people you followed ranked chronologically. The For You offered users an approximation of the old, pre-Musk Twitter experience, but now, even that<U+2019>s not the same. The sudden appearance of random accounts in the Following tab may have an explanation: Twitter seemed to stop showing some users whether tweets were directly from people they followed, or retweets of other users<U+2019> tweets. Since Twitter didn<U+2019>t confirm the changes, it<U+2019>s unclear if this was a bug or intentional. Either way, Musk<U+2019>s plan is to fill your Twitter feed with a higher ratio of paid accounts, and is pressuring more free users to pay for what was once considered a given. This move is the next step in Musk<U+2019>s plan to try to get more people to subscribe to Twitter Blue. Musk said that on April 1 he<U+2019>d remove <U+201C>legacy<U+201D> verification checkmarks from notable accounts that had them for free, including news organizations, politicians, and researchers. 
On March 31, some major accounts like the White House and LeBron James said they would not be paying for a checkmark <U+2014> not a good sign for the impending rollout. Many are concerned that it could become even easier for public figures who don<U+2019>t pay for a checkmark to be impersonated. The checkmark part of Musk<U+2019>s plan has received a lot of attention <U+2014> in part because it involves famous people <U+2014> but it<U+2019>s the changes to Twitter<U+2019>s feed that are potentially just as, if not more, impactful.That<U+2019>s because Musk is changing the incentives to Twitter<U+2019>s core product, its recommendation algorithms, to an extent that it could potentially fill the average user<U+2019>s experience with lower-quality content.<U+201C>The notion that by virtue of being willing to pay $8 a month means that you are a higher-quality account or worthy of being verified is a really reductive analysis,<U+201D> said Jason Goldman, a VP of product at Twitter from 2007 to 2010. <U+201C>There<U+2019>s plenty of people who are complete trolls and are looking to just get attention for ridiculous behavior for whom $8 a month is a pittance to pay.<U+201D>In his explanation of the upcoming feed change, Musk said that Twitter has to charge users to make sure people aren<U+2019>t actually spam bots. But there<U+2019>s a simpler reason that<U+2019>s also driving this push: Twitter needs to make more money. The company, which is now valued at half of what it was when Musk bought it, is still bleeding advertisers that are put off by Musk<U+2019>s antics. Not enough people have subscribed to Twitter Blue: There are only about 180,000 subscribers, according to the Information. They bring in roughly $28 million in annual revenue, less than 1 percent of the $3 billion Musk aimed to make in 2022. Now, in an effort to get more people to sign up for Twitter Blue, Musk is essentially threatening to make using the app harder for Twitter users who don<U+2019>t pay. Moreover, the fact that Musk is seriously proposing turning your Twitter homepage into a place where you don<U+2019>t see tweets from the users you care about and only see the people who spent money shows how much he<U+2019>s willing to compromise the basic utility of the app. He<U+2019>s pushing an extreme version of an increasingly popular <U+201C>pay-to-play<U+201D> model for social media, one that goes against some of the basic ideas that made apps like Twitter popular in the first place.Early signs that people are buying into Musk<U+2019>s vision for social media are not looking good.First of all, the company is already planning major exceptions: Twitter<U+2019>s top 500 advertisers and 10,000 most-followed organizations keep to their checkmarks for free, according to a recent report in the New York Times. That eliminates a major pool of potential customers that Twitter may have wisely realized were not going to pay.Some of the largest newsrooms in the country, like the New York Times, the Los Angeles Times, and Politico, have said they will not be buying a Twitter Blue verification for their company accounts (a one-year subscription for a company costs $12,000), nor do they intend to subsidize individual reporters<U+2019> subscriptions. 
In its rationale, the LA Times said that <U+201C>verification no longer establishes authority or credibility.<U+201D> A few celebrities, like Seinfeld star Jason Alexander, William Shatner, and Ice-T have recently joined other actors, writers, and comedians who previously threatened to leave if Musk took away their checkmark. If more famous people refuse to buy Twitter verification and subsequently find less value in Twitter, they could leave for other platforms. Meanwhile, Twitter<U+2019>s technical quality has been degrading since Musk took over. Features have been more frequently buggy, the site has had embarrassing outages, and source code has been leaked online. <U+201C>I think [changes to the For You feed and verification] are only going to expedite that decline and demise of a platform that is really in its death rattle right now,<U+201D> said social media consultant Matt Navarra.Even though Musk acquired Twitter to democratize it from the hands of elite users, in many ways his actions are doing the opposite.A major part of social media<U+2019>s appeal in the past two decades of its existence is the idea that anyone, from anywhere, at any time, could go viral <U+2014> for better or worse. And in turn, users see the most compelling, <U+201C>engagement<U+201D>-worthy media. Companies like Meta, TikTok, and YouTube are in the business of carefully fine-tuning algorithms that recommend the content they know we<U+2019>ll want to click, whether that<U+2019>s cat videos, political debates, or beauty tutorials. A major part of Twitter<U+2019>s appeal was about seeing random interactions between powerful people and everyday citizens, like someone seeing a tweet from a senator, replying to it, and actually getting a reply back.If Musk starts making it harder for an average user to stumble on and participate in viral exchanges, he<U+2019>s taking away from the basic democratic promise of social media.Already, under Musk<U+2019>s leadership, Twitter has been promoting certain content according to the whims of the company<U+2019>s new owner. Twitter has recently boosted Musk<U+2019>s own tweets, and for months it has boosted those of certain people the company designated as VIPs, like LeBron James, Ben Shapiro, and (somewhat surprisingly, since she<U+2019>s a known foe of Musk) Rep. Alexandria Ocasio Cortez, according to recent reports in Platformer.It<U+2019>s important to note here that there<U+2019>s a good chance Musk will not go through with this, given his track record of missing deadlines for major changes at Twitter. In the few months since he took over, Musk has promised to share revenue with creators (hasn<U+2019>t happened). He<U+2019>s warned for months that Twitter will remove blue checkmarks, but he hasn<U+2019>t actually done it yet.  As of Monday, April 3, Twitter still hasn<U+2019>t seemed to remove checkmarks for legacy accounts, which could be because it<U+2019>s reportedly a slow and manual process. There<U+2019>s one exception: Twitter removed the checkmark on the account of the the New York Times, a frequent target of Musk<U+2019>s media criticism.Regardless of whether Musk executes his plans, he is to some extent doing what many social media platforms have often done in private: tinker with secretive algorithms and give special treatment to high-profile users. TikTok was found to be <U+201C>heating<U+201D> certain VIP user content, showing it more in people<U+2019>s For You feeds. 
Facebook and Instagram have let celebrities get away with breaking the company<U+2019>s policies. The two apps, which are owned by Meta, also recently started charging users for verification and some basic services like access to customer support. But even if these companies give certain users benefits over others, they<U+2019>re doing it within reason. Musk is pushing pay-to-play to the extreme. If he goes too far, celebrities and the everyday users who follow them could leave Twitter in a mass exodus. So far, though, they haven<U+2019>t. Twitter<U+2019>s biggest benefit is that there is no good Twitter alternative. The most viable contender, Mastodon, while popular with some journalists, hasn<U+2019>t reached nearly the same level of mainstream appeal as Twitter. Regardless, if Musk wants Twitter Blue to succeed, he<U+2019>ll need to get celebrities and everyday people not just to stay on Twitter, but to pay for an $8-a-month subscription service.We<U+2019>ll see if his plan to turn Twitter into a for-sale popularity contest will work.Update, April 3, 6:30 pm ET: This story, originally published on March 31, has been updated with new details about changes to Twitter<U+2019>s Following tab.                            We're here to shed some clarity              One of our core beliefs here at Vox is that everyone needs and deserves access to the information that helps them understand the world, regardless of whether they can pay for a subscription. With the 2024 election on the horizon, more people are turning to us for clear and balanced explanations of the issues and policies at stake. We<U+2019>re so grateful that we<U+2019>re on track to hit 85,000 contributions to the Vox Contributions program before the end of the year, which in turn helps us keep this work free. We need to add 2,500 contributions this month to hit that goal.\rWill you make a contribution today to help us hit this goal and support our policy coverage? Any amount helps.                                                  One-Time                                Monthly                                Annual                                                                                                  $5/month                                                                                          $10/month                                                                                          $25/month                                                                                          $50/month                                                                    Other                                                            $                                                  Yes, I'll give $5/month                                    Yes, I'll give $5/month                                                                    We accept credit card, Apple Pay, and                                            Google Pay. You can also contribute via                                                                                      "

Now you have a list of articles! To make sure you can re-attach the text to its metadata, assign it to a new column in news_urls:

news_urls$full_article <- unlist(article_text) #unlist() flattens the list into a character vector

Hurrah! Let’s make sure to save it.

news_urls |>
  select(collect_date, language, media_name, processed_stories_id, stories_id, title, url, full_article) |>
  write.csv("mc_musk.csv")

To get better at using rvest, it is important to practice, practice, practice. You can use some of the tutorials below to help, too:

  1. RStudio Blog Post
  2. Data Quest