With the News API (formerly Google News API), you can get article snippets and news headlines, both up to four years old and real-time, from over 80,000 news sources worldwide.
You will need to install the following packages for this chapter (run the code):
# install.packages('pacman')
library(pacman)
p_load('httr', 'dplyr')
What are the prerequisites to access the API (authentication)?
You need an API key, which can be requested via https://newsapi.org/register.
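The calls in this chapter read the key from an environment variable rather than hard-coding it. The following is a minimal sketch of that pattern; `get_news_api_key()` is a hypothetical helper (not part of httr or the News API), and the default variable name `GoogleNews_token` is simply the one used in this chapter's examples.

```r
# Hypothetical helper: read the API key from an environment variable
# and fail early with a readable message if it is not set.
get_news_api_key <- function(var = "GoogleNews_token") {
  key <- Sys.getenv(var)  # returns "" when the variable is unset
  if (!nzchar(key)) {
    stop("No API key found in environment variable '", var, "'. ",
         "Set it with Sys.setenv() or in your .Renviron file.")
  }
  key
}

# Demonstration with a throwaway variable, so no real key is needed
Sys.setenv(NEWSAPI_DEMO_KEY = "not-a-real-key")
demo_key <- get_news_api_key("NEWSAPI_DEMO_KEY")
Sys.unsetenv("NEWSAPI_DEMO_KEY")
```

Keeping the key outside the script means the code can be shared without exposing the key.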
One big drawback of the News API is that the free version (“Developer”) does not get you very far. Its most serious limitations are that you can only get articles that are up to a month old, that you are restricted to 100 requests per day, and that the article content is truncated to the first 200 characters. In addition, the Developer key expires after a while even if you stick to those limits, although it is easy to sign up for a new Developer account (with a Gmail or any other email address). The “Business” version costs $449 per month and allows searching for articles up to four years old, with up to 250,000 requests per month. More details can be found on the News API pricing page.
Up until at least 2019, the Business version also allowed you to get the entire news article. The documentation is not clear whether this is still the case.
What does a simple API call look like?
The documentation of the API is available at https://newsapi.org/docs. A couple of good examples can be found on the landing page of the API. For instance, we could get all articles mentioning Biden published in the last four weeks, sorted by recency, with the following call:
library(httr)

endpoint_url <- "https://newsapi.org/v2/everything?"
my_query <- "biden"
my_start_date <- Sys.Date() - 28
my_api_key <- Sys.getenv("GoogleNews_token") # <YOUR_API_KEY>

params <- list(
  "q" = my_query,
  "from" = my_start_date,
  "language" = "en",
  "sortBy" = "publishedAt")

news <- httr::GET(url = endpoint_url,
                  httr::add_headers(Authorization = my_api_key),
                  query = params)

httr::content(news) # the resulting articles[]$content shows that the article content is truncated
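The parsed result of the call above is a nested list, which is awkward to inspect. The sketch below flattens it into a tibble with one row per article. `articles_to_df()` is a hypothetical helper, and the response fields `status` and `message` are assumptions about the shape of the response body; only `articles`, `publishedAt`, `title` and `content` appear elsewhere in this chapter. A mock list stands in for the real response so that the example runs without an API key.

```r
library(dplyr)

# Hypothetical helper: turn the parsed body of an `everything` response
# into a tibble with one row per article.
articles_to_df <- function(parsed) {
  # Assumption: the body has a `status` field that equals "ok" on success
  # and a `message` field describing the problem otherwise.
  if (!identical(parsed$status, "ok")) {
    stop("News API request failed: ", parsed$message)
  }
  bind_rows(lapply(parsed$articles, function(a) {
    tibble(
      source      = a$source$name,
      publishedAt = a$publishedAt,
      title       = a$title,
      content     = a$content
    )
  }))
}

# Mock body with the assumed shape (invented values, not real API output)
mock_body <- list(
  status = "ok",
  articles = list(
    list(source = list(name = "Example News"),
         publishedAt = "2022-02-16T15:46:52Z",
         title = "First headline",
         content = "Truncated text of the first article"),
    list(source = list(name = "Example Daily"),
         publishedAt = "2022-02-16T15:20:45Z",
         title = "Second headline",
         content = "Truncated text of the second article")
  )
)

news_df <- articles_to_df(mock_body)
```

With a real response, the same helper would be applied as `articles_to_df(httr::content(news))`.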
How can we access the API from R (httr + other packages)?
To date, there is no R package facilitating access, but the API structure is simple enough to rely on httr. The API has three main endpoints:
- https://newsapi.org/v2/everything?, documented at https://newsapi.org/docs/endpoints/everything
- https://newsapi.org/v2/top-headlines/sources?, documented at https://newsapi.org/docs/endpoints/sources
- https://newsapi.org/v2/top-headlines?, documented at https://newsapi.org/docs/endpoints/top-headlines
We have already explored the everything endpoint. Additional parameters to use are, for example, searchIn (specifying whether you want to search in the title, the description or the main text), to (specifying until what date to search) or pageSize (how many results to return per page).
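The additional parameters just described can be collected in the same kind of list that is passed to `httr::GET()` via `query`. The sketch below shows one way to do this. `build_everything_query()` is a hypothetical helper, and joining several `searchIn` fields with commas (using the values "title", "description" and "content") is an assumption that should be checked against the endpoint documentation.

```r
# Hypothetical helper: assemble the query list for the `everything` endpoint,
# leaving out any optional parameter that was not supplied.
build_everything_query <- function(q, from = NULL, to = NULL,
                                   search_in = NULL, page_size = NULL) {
  params <- list(
    q        = q,
    from     = if (!is.null(from)) format(as.Date(from), "%Y-%m-%d"),
    to       = if (!is.null(to)) format(as.Date(to), "%Y-%m-%d"),
    # Assumption: multiple fields are passed as one comma-separated string
    searchIn = if (!is.null(search_in)) paste(search_in, collapse = ","),
    pageSize = page_size
  )
  # Drop the entries that are still NULL
  params[!vapply(params, is.null, logical(1))]
}

params <- build_everything_query(
  q         = "biden",
  from      = "2022-01-19",
  to        = "2022-02-16",
  search_in = c("title", "description"),
  page_size = 20
)
```

The resulting `params` list can be passed to `httr::GET()` as `query = params`, in the same way as in the earlier call.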
Though perhaps not so interesting from a research perspective, the sources endpoint is useful because it lets you explore the list of sources in each country (which is not really documented anywhere). Let’s get all sources from Germany. We can see that there are ten from which the News API draws content:
library(dplyr)
library(httr)

endpoint_url <- "https://newsapi.org/v2/top-headlines/sources?"
my_country <- "de"
my_api_key <- Sys.getenv("GoogleNews_token") # <YOUR_API_KEY>

params <- list("country" = my_country)

sources <- httr::GET(url = endpoint_url,
                     httr::add_headers(Authorization = my_api_key),
                     query = params)

sources_df <- bind_rows(httr::content(sources)$sources)
sources_df[, c("id", "name", "url", "category")]
# A tibble: 10 × 4
   id                name              url                          category
   <chr>             <chr>             <chr>                        <chr>
 1 bild              Bild              http://www.bild.de           general
 2 der-tagesspiegel  Der Tagesspiegel  http://www.tagesspiegel.de   general
 3 die-zeit          Die Zeit          http://www.zeit.de/index     business
 4 focus             Focus             http://www.focus.de          general
 5 gruenderszene     Gruenderszene     http://www.gruenderszene.de  technology
 6 handelsblatt      Handelsblatt      http://www.handelsblatt.com  business
 7 spiegel-online    Spiegel Online    http://www.spiegel.de        general
 8 t3n               T3n               https://t3n.de               technology
 9 wired-de          Wired.de          https://www.wired.de         technology
10 wirtschafts-woche Wirtschafts Woche http://www.wiwo.de           business
This illustrates another weakness of the News API: the selection of sources is neither comprehensive nor transparent. In any case, let’s use this information to try out the headlines endpoint, getting breaking headlines from Bild (via its id), with 5 results per page. Note that in the Developer version, these headlines are not really breaking, but are actually from one hour ago.
endpoint_url <- "https://newsapi.org/v2/top-headlines?"
my_source <- "bild"
my_api_key <- Sys.getenv("GoogleNews_token") # <YOUR_API_KEY>

params <- list(
  "sources" = my_source,
  "pageSize" = 5)

headlines <- httr::GET(url = endpoint_url,
                       httr::add_headers(Authorization = my_api_key),
                       query = params)

headlines_df <- bind_rows(httr::content(headlines)$articles) %>%
  mutate(source = tolower(source)) %>%
  unique()
headlines_df[, c("publishedAt","title")]
# A tibble: 5 × 2
  publishedAt                  title
  <chr>                        <chr>
1 2022-02-16T15:52:26.3782532Z Köln: Erweiterte Anklage gegen Pfarrer Hans U. – 85 weitere Missbrauchsfälle!
2 2022-02-16T15:46:52Z         Hai-Attacke in Australien: Schwimmer vor der Küste Sydneys zerfetzt
3 2022-02-16T15:37:24.2222797Z Essen: Polizei befreit Hundewelpen aus Kofferraum
4 2022-02-16T15:20:45Z         BVB – Erling Haaland: Idol verrät brisante Interna vom Adidas-Geheimtreffen
5 2022-02-16T15:07:32Z         Texas, USA: Raubopfer ballert um sich, um Dieb aufzuhalten – Mädchen (9) tot
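The publishedAt column comes back as character, and the timestamps are not uniform: some carry fractional seconds and some do not. For sorting or filtering by time it helps to convert them to date-times. The following is a minimal sketch using two of the rows shown above; the column name `published` is introduced here for illustration only.

```r
library(dplyr)

# Two rows copied from the output above: one timestamp with fractional
# seconds and one without
headlines_df <- tibble(
  publishedAt = c("2022-02-16T15:37:24.2222797Z", "2022-02-16T15:46:52Z"),
  title = c("Essen: Polizei befreit Hundewelpen aus Kofferraum",
            "Hai-Attacke in Australien: Schwimmer vor der Küste Sydneys zerfetzt")
)

headlines_parsed <- headlines_df %>%
  # %OS reads seconds with or without a fractional part; "Z" marks UTC
  mutate(published = as.POSIXct(publishedAt,
                                format = "%Y-%m-%dT%H:%M:%OSZ",
                                tz = "UTC")) %>%
  arrange(desc(published))  # most recent headline first
```

Once converted, the column can be used in comparisons such as `filter(published >= as.POSIXct("2022-02-16 15:40:00", tz = "UTC"))`.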