πŸ“Œ Graded task in R

The graded task in R is due on December 6th, 2023 (23:59). Please upload:

  • your R script with the solutions
  • in this R script, the names of all team members using # to annotate code
  • in this R script, a clear indication of which task was addressed where

You can also use the template for this R script from Moodle, where dedicated spaces to indicate team members and tasks are already included.

Image: Submission via Moodle

Task 1 (1 Point)

In Moodle, you will find the file β€œclimate_corpus.csv” (via Moodle/Data for R). Please download the file. Use R code to load it into your R environment as an object titled climate_corpus.

Information on the data: The corpus contains news media texts on climate change by outlets from different countries across the world.

Solution:

#Adjust this path to wherever you saved the file
climate_corpus <- read.csv("H:/Lehre/WS2023_24/R Tutorials/Github/data/climate_corpus.csv")
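The absolute path above only works on the machine it was written on. A minimal sketch, assuming you saved the file in your current working directory (check it with getwd()); the toy object below is made-up data used only to show that read.csv() behaves the same on any CSV source:

```r
# With the file in your working directory (check getwd()), this is enough:
#   climate_corpus <- read.csv("climate_corpus.csv")

# read.csv() also accepts inline text, handy for a quick self-contained check:
toy <- read.csv(text = "doc_id,year\na.txt,2007\nb.txt,2018")
nrow(toy)  # 2
```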

Task 2 (1 Point)

If you could not read in the CSV file in Task 1, simply load the workspace including the data provided via Moodle. Only use this option if you could not finish Task 1.

Image: Download workspace via Moodle

Now, use R code to understand how many variables and how many observations climate_corpus contains.

Solution:

#Number of variables
ncol(climate_corpus)
## [1] 6
#Number of observations
nrow(climate_corpus)
## [1] 150
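Not required by the task, but dim() returns both counts in one call; a sketch on a small made-up data frame (the toy object stands in for climate_corpus):

```r
# Toy data frame standing in for climate_corpus
toy <- data.frame(doc_id = c("a.txt", "b.txt", "c.txt"),
                  year   = c(2007, 2018, 2016))

# dim() reports observations first, then variables
dim(toy)  # 3 2
```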

Task 3 (1 Point)

Use R code to understand which types of data the variables doc_id, year, and outlet contain.

Solution:

library("tidyverse")

climate_corpus %>%
  
  #create tibble
  as_tibble() %>%
  
  # take only relevant variables
  select(doc_id, year, outlet) %>%
  
  #show first observations and type of data
  head()
## # A tibble: 6 x 3
##   doc_id                                          year outlet                   
##   <chr>                                          <int> <chr>                    
## 1 GlobeMail_2007-09-22_570.txt                    2007 GlobeMail                
## 2 The Australian_2009-12-11_11834.txt             2009 The Australian           
## 3 Guardian_2018-2-115.txt                         2018 Guardian                 
## 4 The Sydney Morning Herald_2016-10-19_38945.txt  2016 The Sydney Morning Herald
## 5 The Sunday Times_2011-11-27_32947.txt           2011 Sunday Times             
## 6 The Globe and Mail_2016-2-23_21717.txt          2016 GlobeMail
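If you prefer base R, sapply() with class() reports one data type per variable. A sketch on a made-up data frame mirroring the three columns (values are illustrative, not from the corpus):

```r
# Toy data frame with the same three column types as climate_corpus
toy <- data.frame(doc_id = c("a.txt", "b.txt"),
                  year   = c(2007L, 2018L),
                  outlet = c("Guardian", "GlobeMail"))

# One type label per variable
unname(sapply(toy, class))  # "character" "integer" "character"
```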

Task 4 (1 Point)

Use R code to understand how many texts are missing in the corpus (i.e., how many times text is NA).

Solution:

climate_corpus %>%
  
  #Use the summarize function to count the number of missing values
  summarize(missing = sum(is.na(text)))
##   missing
## 1       5
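The same count also works without the pipe, since sum(is.na(...)) operates directly on a vector; a self-contained sketch on a made-up text vector:

```r
# Toy vector with two missing texts
toy_text <- c("first text", NA, "third text", NA)

# is.na() gives TRUE/FALSE per element; sum() counts the TRUEs
sum(is.na(toy_text))  # 2
```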

Task 5 (1.5 Points)

Use R code to reduce the data set only to cases where texts were published by outlets in Canada or the UK and exclude the variable month. Save the result in a new object called climate_corpus_reduced.

Solution:

climate_corpus_reduced <- climate_corpus %>%
  
  #exclude irrelevant observations
  filter(country %in% c("Canada", "UK")) %>%
  
  #exclude irrelevant variables
  select(-month)
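The same two steps can be written in base R with logical subsetting and setdiff(); a sketch on made-up data (the toy object and its values are illustrative):

```r
# Toy data frame with the columns the task refers to
toy <- data.frame(country = c("Canada", "UK", "USA"),
                  month   = c(1, 5, 9),
                  year    = c(2020, 2021, 2022))

# Keep only Canada/UK rows and drop the month column
toy_reduced <- toy[toy$country %in% c("Canada", "UK"),
                   setdiff(names(toy), "month")]
nrow(toy_reduced)  # 2
```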

Task 6 (1 Point)

Use R code to further reduce the data set: Only include observations where all texts are there (i.e., text is not NA). Save the result in a new object called climate_corpus_reduced_complete.

Important: If you could not solve Task 5 and create the object climate_corpus_reduced, you can also do this with the full data set climate_corpus.

Solution:

climate_corpus_reduced_complete <- climate_corpus_reduced %>%
  
  #exclude empty cases
  filter(!is.na(text))
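As a quick sanity check, the same filter can be reproduced in base R; a sketch on a made-up data frame:

```r
# Toy data frame with one missing text
toy <- data.frame(text = c("first text", NA, "third text"))

# Keep only rows where text is not NA (drop = FALSE keeps it a data frame)
toy_complete <- toy[!is.na(toy$text), , drop = FALSE]
nrow(toy_complete)  # 2
```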

Task 7 (1.5 Points)

In the full data set, climate_corpus, create a new variable called word_count. It should include the number of words each text contains (i.e., the number of words in the first text for the first text, the number of words in the second text for the second text, etc.). Use R code to identify the doc_id of the longest text in the corpus!

Solution:

library("stringi")
climate_corpus <- climate_corpus %>%
  
  #create new variable
  mutate(word_count = stri_count_words(text))
#Identify the longest text
climate_corpus %>%
  filter(word_count == max(word_count, na.rm = TRUE)) %>%
  pull(doc_id)
## [1] "FAZ_2007-4-16_2028.txt_english.txt"
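A base-R alternative uses which.max(), which returns the position of the largest value and skips NAs; a sketch on made-up data:

```r
# Toy data frame standing in for climate_corpus
toy <- data.frame(doc_id     = c("a.txt", "b.txt", "c.txt"),
                  word_count = c(10, 250, NA))

# Index the doc_id vector at the position of the maximum word count
toy$doc_id[which.max(toy$word_count)]  # "b.txt"
```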

Task 8 (1.5 Points)

Writing the corresponding R code, write your own custom function with the name stats_helper.

The function should only need one argument called x for which it should be able to execute the following task:

Given a vector x with numeric data, the function should paste the following sentence: β€œThis variable has a mean of M = XY with a standard deviation of SD = XY. In total, XY out of N = XY observations are missing.”

Important: The values XY should be replaced with whatever mean, standard deviation, number of missing values, and total number of observations x has (rounded to two decimals). That is, the function should calculate these values on its own for any x it is given.

Below, you see what the function should do when tested for year and word_count in climate_corpus:

Solution:

stats_helper <- function (x) 
{
  mean <- round(mean(x, na.rm = TRUE),2)
  sd <- round(sd(x, na.rm = TRUE),2)
  na <- sum(is.na(x))
  n <- length(x)
  result <- paste0("This variable has a mean of M = ", 
                   mean, 
                   " with a standard deviation of SD = ", 
                   sd, ". In total, ", 
                   na, 
                   " out of N = ", 
                   n, 
                   " observations are missing.")
  return(result)
}
#Testing the function
stats_helper(x = climate_corpus$year)
## [1] "This variable has a mean of M = 2011.91 with a standard deviation of SD = 3.99. In total, 0 out of N = 150 observations are missing."
stats_helper(x = climate_corpus$word_count)
## [1] "This variable has a mean of M = 1209 with a standard deviation of SD = 4747.58. In total, 5 out of N = 150 observations are missing."
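A more compact variant uses sprintf(), which fills %s placeholders in one template string instead of chaining paste0() arguments. The function name stats_sprintf and the test vector below are illustrative, not part of the task:

```r
# Same logic as stats_helper, written with sprintf()
stats_sprintf <- function(x) {
  sprintf(paste0("This variable has a mean of M = %s with a standard ",
                 "deviation of SD = %s. In total, %s out of N = %s ",
                 "observations are missing."),
          round(mean(x, na.rm = TRUE), 2),
          round(sd(x, na.rm = TRUE), 2),
          sum(is.na(x)),
          length(x))
}

# Toy vector: mean 2.33, sd 1.53, 1 missing out of 4
stats_sprintf(c(1, 2, NA, 4))
```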

Please note: This is a pretty tricky task and requires you to google (a lot of) things you do not know yet. Just try to get as close as you can. If you have an idea of how to solve this in principle but cannot write the exact code, write down your thinking.