Chapter 8 Google News API

Bernhard Clemm von Hohenberg

You will need to install the following packages for this chapter (run the code):

# install.packages('pacman')
library(pacman)
p_load('httr', 'dplyr')

With the News API (formerly Google News API), you can get article snippets and news headlines, both up to four years old and real-time, from over 80,000 news sources worldwide.

8.1 Prerequisites

What are the prerequisites to access the API (authentication)?

You need an API key, which can be requested via https://newsapi.org/register.

One big drawback of the News API is that the free version (“Developer”) does not get you very far. There are some serious limitations: you can only get articles that are up to a month old, you are restricted to 100 requests per day, and the article content is truncated to the first 200 characters. In addition, the Developer key expires after a while even if you stick to those limits, although it is easy to sign up for a new Developer account (with a Gmail or any other email address). The “Business” version costs $449 per month and allows searching for articles up to four years old as well as 250,000 requests per month. More details can be found on the pricing page (https://newsapi.org/pricing).

Up until at least 2019, the Business version also allowed you to get the entire news article. The documentation is not clear whether this is still the case.
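
Whichever plan you use, the API key is a credential and is better kept out of scripts that you share. The following is a minimal sketch of one way to do this, assuming the key has been stored in an environment variable. The variable name NEWS_API_KEY and the helper get_news_api_key() are hypothetical and not part of the News API or of httr.

```r
# Hypothetical helper: read the key from an environment variable instead of
# typing it into the script. The variable name NEWS_API_KEY is an assumption.
get_news_api_key <- function(var = "NEWS_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("No API key found: set ", var, " (for example in ~/.Renviron).")
  }
  key
}

# Usage, once the variable is set:
# my_api_key <- get_news_api_key()
```

The environment variable can be set for the current session with Sys.setenv(), or permanently by adding a line to the .Renviron file.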

8.2 Simple API call

What does a simple API call look like?

The documentation of the API is available at https://newsapi.org/docs. A couple of good examples can be found on the landing page of the API. For instance, we could get all articles mentioning Biden from the last four weeks, sorted by recency, with the following call:

library(httr)

endpoint_url <- "https://newsapi.org/v2/everything?"
my_query <- "biden"
my_start_date <- Sys.Date() - 28
my_api_key <- "<YOUR_API_KEY>" # replace with your own key

params <- list(
  "q" = my_query,
  "from" = my_start_date,
  "language" = "en",
  "sortBy" = "publishedAt")

news <- httr::GET(url = endpoint_url, 
               httr::add_headers(Authorization = my_api_key),
               query = params)

content(news) # the resulting articles[[1]]$content shows that the article content is truncated
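
The parsed response is a nested list, which is awkward to inspect by hand. As a rough sketch, the helper below flattens it into a data frame with one row per article. It assumes the response has an articles element whose entries contain source (with id and name), publishedAt, title and content, as in the call above. The function name articles_to_df() is hypothetical.

```r
# Hypothetical helper: turn the parsed response of the everything endpoint
# into a data frame with one row per article.
articles_to_df <- function(parsed) {
  # JSON null values arrive as NULL; replace them by NA so that every
  # field has length one
  or_na <- function(x) if (is.null(x)) NA_character_ else as.character(x)
  if (length(parsed$articles) == 0) {
    return(data.frame())
  }
  rows <- lapply(parsed$articles, function(a) {
    data.frame(
      source_id   = or_na(a$source$id),
      source_name = or_na(a$source$name),
      publishedAt = or_na(a$publishedAt),
      title       = or_na(a$title),
      content     = or_na(a$content),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Usage with the response from above:
# news_df <- articles_to_df(httr::content(news))
# nchar(news_df$content) # lengths of the (truncated) article texts
```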

8.3 API access in R

How can we access the API from R (httr + other packages)?

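To date, there is no R package facilitating access, but the API structure is simple enough to rely on httr. The API has three main endpoints: everything (https://newsapi.org/v2/everything), sources (https://newsapi.org/v2/top-headlines/sources) and headlines (https://newsapi.org/v2/top-headlines).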

We have already explored the everything endpoint. Additional parameters to use are, for example, searchIn (specifying whether you want to search in the title, the description or the main text), to (specifying until what date to search) or pageSize (how many results to return per page).
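
To illustrate how these parameters fit together, here is a minimal sketch of a function that assembles the query list for the everything endpoint. The function everything_params() and its argument names are hypothetical. The sketch further assumes, based on my reading of the documentation, that searchIn accepts a comma-separated list, that pageSize is capped at 100, and that a page parameter selects the page of results.

```r
# Hypothetical helper: build the query list for the everything endpoint.
everything_params <- function(q, from = NULL, to = NULL, search_in = NULL,
                              page_size = 100, page = 1) {
  # Assumption: the API accepts at most 100 results per page
  stopifnot(page_size >= 1, page_size <= 100)
  params <- list(
    q        = q,
    from     = if (!is.null(from)) format(as.Date(from), "%Y-%m-%d"),
    to       = if (!is.null(to)) format(as.Date(to), "%Y-%m-%d"),
    searchIn = if (!is.null(search_in)) paste(search_in, collapse = ","),
    pageSize = page_size,
    page     = page
  )
  # Drop the parameters that were not set
  Filter(Negate(is.null), params)
}

# Usage: second page of 50 results, searching titles and descriptions only
# params <- everything_params("biden", from = Sys.Date() - 28,
#                             search_in = c("title", "description"),
#                             page_size = 50, page = 2)
```

The resulting list can be passed as the query argument of httr::GET(), in the same way as params in the call above.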

Though perhaps not so interesting from a research perspective, the sources endpoint is useful because it allows you to explore the list of sources in each country (not really documented anywhere). Let’s get all sources from Germany. We can see that there are ten from which the News API draws content:

library(dplyr)
library(httr)

endpoint_url <- "https://newsapi.org/v2/top-headlines/sources?"
my_country <- "de"
my_api_key <- "<YOUR_API_KEY>" # replace with your own key

params <- list("country" = my_country)

sources <- httr::GET(url = endpoint_url, 
               httr::add_headers(Authorization = my_api_key),
               query = params)

sources_df <- bind_rows(content(sources)$sources)
sources_df[,c("id", "name", "url", "category")]
## # A tibble: 10 x 4
##    id                name              url                         category  
##    <chr>             <chr>             <chr>                       <chr>     
##  1 bild              Bild              http://www.bild.de          general   
##  2 der-tagesspiegel  Der Tagesspiegel  http://www.tagesspiegel.de  general   
##  3 die-zeit          Die Zeit          http://www.zeit.de/index    business  
##  4 focus             Focus             http://www.focus.de         general   
##  5 gruenderszene     Gruenderszene     http://www.gruenderszene.de technology
##  6 handelsblatt      Handelsblatt      http://www.handelsblatt.com business  
##  7 spiegel-online    Spiegel Online    http://www.spiegel.de       general   
##  8 t3n               T3n               https://t3n.de              technology
##  9 wired-de          Wired.de          https://www.wired.de        technology
## 10 wirtschafts-woche Wirtschafts Woche http://www.wiwo.de          business

This illustrates another weakness of the News API: The selection of sources is neither comprehensive nor transparent. In any case, let’s use this information to try out the headlines endpoint, getting breaking headlines from Bild (via its id), with 5 results per page. Note that in the Developer version, these headlines are not really breaking, but are actually one hour old.

endpoint_url <- "https://newsapi.org/v2/top-headlines?"
my_source <- "bild"
my_api_key <- "<YOUR_API_KEY>" # replace with your own key

params <- list(
  "sources" = my_source,
  "pageSize" = 5)

headlines <- httr::GET(url = endpoint_url, 
               httr::add_headers(Authorization = my_api_key),
               query = params)

headlines_df <- bind_rows(content(headlines)$articles) %>% 
  mutate(source = tolower(source)) %>% unique()
headlines_df[,c("publishedAt","title")]
## # A tibble: 5 x 2
##   publishedAt                  title                                                                        
##   <chr>                        <chr>                                                                        
## 1 2022-02-16T15:52:26.3782532Z Köln: Erweiterte Anklage gegen Pfarrer Hans U. – 85 weitere Missbrauchsfälle!
## 2 2022-02-16T15:46:52Z         Hai-Attacke in Australien: Schwimmer vor der Küste Sydneys zerfetzt          
## 3 2022-02-16T15:37:24.2222797Z Essen: Polizei befreit Hundewelpen aus Kofferraum                            
## 4 2022-02-16T15:20:45Z         BVB – Erling Haaland: Idol verrät brisante Interna vom Adidas-Geheimtreffen  
## 5 2022-02-16T15:07:32Z         Texas, USA: Raubopfer ballert um sich, um Dieb aufzuhalten – Mädchen (9) tot
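
The three calls in this chapter differ only in the endpoint and the query parameters, so they could be wrapped in a small function. The following is a sketch under that assumption. The names news_api_url() and news_api_get() are hypothetical, while httr::GET(), httr::add_headers(), httr::stop_for_status() and httr::content() are existing httr functions.

```r
# Hypothetical helper: build the URL of a News API (v2) endpoint.
news_api_url <- function(endpoint) {
  paste0("https://newsapi.org/v2/", endpoint)
}

# Hypothetical wrapper around the GET calls used in this chapter.
news_api_get <- function(endpoint, params, api_key) {
  response <- httr::GET(
    url = news_api_url(endpoint),
    httr::add_headers(Authorization = api_key),
    query = params
  )
  # Stop with an error if the HTTP status code is 400 or above,
  # for example when the key is invalid
  httr::stop_for_status(response)
  httr::content(response)
}

# Usage (requires a valid key):
# headlines <- news_api_get("top-headlines",
#                           list(sources = "bild", pageSize = 5),
#                           my_api_key)
```

Checking the status before parsing means that a failed request raises an error instead of silently returning an error message as content.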

8.4 Social science examples

Are there social science research examples using the API?

A search on Google Scholar (queries “Google News API” and “News API”) reveals that surprisingly few social-science studies make use of the News API, although many rely on the Google News website for research (e.g., Haim, Graefe, and Brosius (2018)). One example from the social sciences is Chrisinger et al. (2021), who ask how the discourse on food stamps in the United States has changed over time. Through the News API, they collected 13,987 newspaper articles using keyword queries, and ran a structural topic model. In one of my papers, I ask whether US conservatives and liberals differ in their ability to discern true from false information, and in their tendency to give more credit to information that is ideologically congruent (Clemm von Hohenberg 2022). As I argue that these questions can best be answered if the news items used in a survey represent the universe of news well, the News API helps me get a decent approximation of this universe. Note, however, that at the time I was still able to get complete articles through the Business version of the API, and it is unclear whether this is still the case.

References

Chrisinger, Benjamin W., Eliza W. Kinsey, Ellie Pavlick, and Chris Callison-Burch. 2021. “SNAP Judgments into the Digital Age: Reporting on Food Stamps Varies Significantly with Time, Publication Type, and Political Leaning.” Commun. Methods Meas., November, 1–18.
Clemm von Hohenberg, Bernhard. 2022. “Truth and Bias, Left and Right.”
Haim, Mario, Andreas Graefe, and Hans-Bernd Brosius. 2018. “Burst of the Filter Bubble?” Digital Journalism 6 (3): 330–43. https://doi.org/10.1080/21670811.2017.1338145.