📌 Graded task in R

The graded task in R is due on December 4th, 2024 (23:59). Please upload:

your R script with the solutions
in this R script, the names of all team members using # to annotate code
in this R script, a clear indication of which task was addressed where

You can also use the template for this R script from Moodle, where dedicated spaces to indicate team members and tasks are already included.

Image: Submission via Moodle

Task 1 (1 Point)

In Moodle, you will find the file “climate_corpus.csv” (via Moodle/Data for R). Please download the file. Use R code to load it into your R environment as an object titled climate_corpus.

Information on the data: The corpus contains news media texts on climate change by outlets from different countries across the world.

climate_corpus <- read.csv("climate_corpus.csv")

Task 2 (1 Point)

If you could not read in the CSV file in Task 1, simply load the workspace including the data provided via Moodle. Only use this if you could not finish Task 1.

Image: Download working space via Moodle

Now, use R code to understand how many variables and how many observations climate_corpus contains.

#Number of variables
ncol(climate_corpus)

## [1] 6

#Number of observations
nrow(climate_corpus)

## [1] 150
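Both numbers can also be read off in a single call. The following is a minimal sketch in base R, assuming a small made-up data frame called toy_corpus (hypothetical, standing in for climate_corpus):

```r
# Hypothetical toy data standing in for climate_corpus
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt"),
  year = c(2007L, 2009L, 2018L),
  outlet = c("GlobeMail", "The Australian", "Guardian"),
  stringsAsFactors = FALSE
)

# dim() returns the number of observations (rows) first,
# then the number of variables (columns)
toy_dim <- dim(toy_corpus)
toy_dim

# str() additionally lists every variable together with its data type
str(toy_corpus)
```

Applied to climate_corpus, dim() should return the same two numbers as nrow() and ncol() above, in that order.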

Task 3 (1 Point)

Use R code to understand which types of data the variables text, month, and outlet contain.

library("tidyverse")
climate_corpus %>%
  
  #create tibble
  as_tibble() %>%
  
  # take only relevant variables
  select(doc_id, year, outlet) %>%
  
  #show first observations and type of data
  head()

## # A tibble: 6 x 3
##   doc_id                                          year outlet                   
##   <chr>                                          <int> <chr>                    
## 1 GlobeMail_2007-09-22_570.txt                    2007 GlobeMail                
## 2 The Australian_2009-12-11_11834.txt             2009 The Australian           
## 3 Guardian_2018-2-115.txt                         2018 Guardian                 
## 4 The Sydney Morning Herald_2016-10-19_38945.txt  2016 The Sydney Morning Herald
## 5 The Sunday Times_2011-11-27_32947.txt           2011 Sunday Times             
## 6 The Globe and Mail_2016-2-23_21717.txt          2016 GlobeMail
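The solution above inspects doc_id, year, and outlet. The same check for the variables named in the task (text, month, and outlet) could be sketched in base R as follows. The data frame toy_corpus and its values are made up for illustration:

```r
# Hypothetical toy data with the three variables named in the task
toy_corpus <- data.frame(
  text = c("First text.", "Second text."),
  month = c(9L, 12L),
  outlet = c("GlobeMail", "The Australian"),
  stringsAsFactors = FALSE
)

# class() applied to each selected column returns its data type
toy_types <- sapply(toy_corpus[, c("text", "month", "outlet")], class)
toy_types
```

The result is a named vector with one data type per variable, which avoids printing the (potentially very long) texts themselves.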

Task 4 (1 Point)

Use R code to understand how many articles were published in either Canada or Australia.

climate_corpus %>%
  
  #Use the filter function to filter the corpus
  filter(country %in% c("Canada", "Australia")) %>%
  
  #Count observations per group
  count(country)

##     country  n
## 1 Australia 29
## 2    Canada 22
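The two group counts shown above add up to 51 articles in total. If a single total is wanted directly, one option is to count the matching rows instead of grouping them. This is a minimal sketch in base R, assuming a made-up data frame toy_corpus (hypothetical):

```r
# Hypothetical toy data standing in for climate_corpus
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt", "d.txt"),
  country = c("Canada", "Australia", "UK", "Canada"),
  stringsAsFactors = FALSE
)

# Logical vector: TRUE where the article comes from Canada or Australia
in_either <- toy_corpus$country %in% c("Canada", "Australia")

# Summing a logical vector counts its TRUE values
total_either <- sum(in_either)
total_either

# table() gives the count per country, comparable to count(country)
per_country <- table(toy_corpus$country[in_either])
per_country
```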

Task 5 (1.5 Points)

Use R code to reduce the data set to texts published between 2006 and 2010 (inclusive) and keep only observations where the text is available (i.e., text is not NA). Save the result in a new object called climate_corpus_reduced.

climate_corpus_reduced <- climate_corpus %>%
  
  #exclude irrelevant observations
  filter(year %in% c(2006:2010)) %>%
  
  #exclude empty cases
  filter(!is.na(text))
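Both conditions can also be combined into a single logical index. The following base R sketch assumes a made-up data frame toy_corpus (hypothetical) and is meant as an illustration of the logic, not as the required solution:

```r
# Hypothetical toy data standing in for climate_corpus
toy_corpus <- data.frame(
  year = c(2005L, 2006L, 2010L, 2011L, 2008L),
  text = c("a", "b", NA, "d", "e"),
  stringsAsFactors = FALSE
)

# TRUE only where the year lies in 2006-2010 AND the text is not missing
keep <- toy_corpus$year >= 2006 &
  toy_corpus$year <= 2010 &
  !is.na(toy_corpus$text)

# Keep only the rows that satisfy both conditions
toy_reduced <- toy_corpus[keep, ]
toy_reduced
```

In this toy example, the 2010 row is dropped because its text is NA, and the 2005 and 2011 rows are dropped because they fall outside the range.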

Task 6 (1 Point)

Using the full data set climate_corpus, we want to store months as names instead of numbers. For month = 1, replace the value 1 with January; for month = 2, replace the value 2 with February; etc. Do this for all months!

climate_corpus <- climate_corpus %>%
  
  mutate(month = case_match(month,
                           1 ~ "January",
                           2 ~ "February",
                           3 ~ "March",
                           4 ~ "April",
                           5 ~ "May",
                           6 ~ "June",
                           7 ~ "July",
                           8 ~ "August",
                           9 ~ "September",
                           10 ~ "October",
                           11 ~ "November",
                           12 ~ "December"))
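A shorter alternative relies on the built-in constant month.name, which holds the twelve English month names in order. This is a minimal sketch, assuming a made-up numeric vector toy_months (hypothetical):

```r
# Hypothetical month numbers, including one missing value
toy_months <- c(1, 2, 12, NA)

# Using the month number as an index into month.name returns the name;
# missing month numbers stay missing
toy_month_names <- month.name[toy_months]
toy_month_names
```

Inside mutate(), the same idea would read month = month.name[month], assuming month only contains the values 1 to 12 or NA.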

Task 7 (1.5 Points)

In the full data set, climate_corpus, create a new variable called word_count. It should include the number of words each text contains (i.e., the number of words in the first text for the first text, the number of words in the second text for the second text, etc.). Use R code to identify the doc_id of the shortest text in the corpus!

library("stringi")
climate_corpus <- climate_corpus %>%
  
  #create new variable
  mutate(word_count = stri_count_words(text))

#Identify the shortest text
climate_corpus %>%
  filter(word_count == min(climate_corpus$word_count, na.rm = TRUE)) %>%
  pull(doc_id)

## [1] "The Times_2009-9-8_7484.txt"
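The position of the shortest text can also be looked up with which.min(). The following base R sketch assumes a made-up data frame toy_corpus with invented word counts (hypothetical):

```r
# Hypothetical toy data with word counts that are already computed
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt"),
  word_count = c(310L, 57L, NA),
  stringsAsFactors = FALSE
)

# which.min() returns the position of the smallest value and skips NA
shortest_position <- which.min(toy_corpus$word_count)

# Use that position to look up the matching doc_id
shortest_id <- toy_corpus$doc_id[shortest_position]
shortest_id
```

Note that which.min() only returns the first minimum. If several texts share the lowest word count, the filter() approach above returns all of them.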

Task 8 (1.5 Points)

Write R code that defines your own custom function named text_summary.

The function should only need one argument called x for which it should be able to execute the following task:

Given a text x, the function should paste the following sentence: “This text has a total of XY words. It was published in the year XY. It is XY words shorter/longer than the average text in the corpus Climate_Corpus”.

Important: The values XY should be replaced with the word count, the publication year, and the difference to the average word count of x (rounded to two decimals). That is, the function should calculate these values on its own for any x it is given. The word difference should be negative if the text is shorter than the average text and positive if it is longer than the average text. You can ignore texts that are missing.

Below, you see what the function should do when tested for the second and the 119th text in climate_corpus:

text_summary <- function(x) 
{
  #define word count for text
  word_count <- stri_count_words(x)
  
  #define year of text
  year <- climate_corpus %>%
    filter(text == x) %>%
    pull(year)
  
  #define average word count in the corpus (ignoring missing texts)
  word_count_mean <- round(mean(stri_count_words(climate_corpus$text), na.rm = TRUE), 2)
  
  #define difference from average word count (rounded to two decimals)
  word_count_diff <- round(word_count - word_count_mean, 2)
  
  #paste the summary sentence
  result <- paste0("This text has a total of ", 
                   word_count, 
                   " words. It was published in the year ", 
                   year, 
                   ". It is ", 
                   word_count_diff, 
                   " words shorter/longer than the average text in the corpus Climate_Corpus")
  return(result)
}

#Testing the function
text_summary(x = climate_corpus$text[2])

## [1] "This text has a total of 310 words. It was published in the year 2009. It is -899.05 words shorter/longer than the average text in the corpus Climate_Corpus"

text_summary(x = climate_corpus$text[119])

## [1] "This text has a total of 57226 words. It was published in the year 2007. It is 56016.95 words shorter/longer than the average text in the corpus Climate_Corpus"

Please note: This is a pretty tricky task and requires you to google (a lot of) stuff you do not know. Just try to get as close as you can. If you have an idea of how to theoretically solve this but cannot write the exact code, write down your thinking.