Graded task in R
The graded task in R is due on December 4th, 2024 (23:59). Please upload:
- your R script with the solutions
- in this R script, the names of all team members, using # to annotate code
- in this R script, a clear indication of which task was addressed where
You can also use the template for this R script from Moodle, where dedicated spaces to indicate team members and tasks are already included.
Image: Submission via Moodle
Task 1 (1 Point)
In Moodle, you will find the file "climate_corpus.csv" (via Moodle/Data for R). Please download the file. Use R code to load it into your R environment as an object titled climate_corpus.
Information on the data: The corpus contains news media texts on climate change by outlets from different countries across the world.
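A minimal sketch of loading a CSV file into an object named climate_corpus, using base R's read.csv(). The real file is not available here, so the example first writes a small stand-in file to a temporary folder. The toy columns and values are assumptions for illustration only. In practice you would pass the path of the downloaded climate_corpus.csv instead of csv_path.

```r
# Toy stand-in for the downloaded file (values are made up for illustration)
toy <- data.frame(
  doc_id = c("a.txt", "b.txt"),
  year   = c(2007L, 2009L),
  text   = c("Some text.", "Another text."),
  stringsAsFactors = FALSE
)
csv_path <- file.path(tempdir(), "climate_corpus.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Load the CSV into the R environment as an object called climate_corpus
# (replace csv_path with the location of the file you downloaded)
climate_corpus <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick check that the import worked: shows columns and their types
str(climate_corpus)
```

If a file uses semicolons as separators and commas as decimal marks, read.csv2() is the base R counterpart.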
Task 2 (1 Point)
If you could not read in the CSV file in Task 1, simply load the workspace including the data provided via Moodle. Only use this if you could not finish Task 1.
Image: Download workspace via Moodle
Now, use R code to find out how many variables and how many observations climate_corpus contains.
## [1] 6
## [1] 150
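One possible way to count variables and observations is sketched below with base R's ncol() and nrow(). The data frame toy_corpus is a made-up stand-in for climate_corpus, so its numbers differ from the output shown for the real data.

```r
# Toy data frame standing in for climate_corpus (made-up values)
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt"),
  year   = c(2007L, 2009L, 2016L),
  stringsAsFactors = FALSE
)

n_variables    <- ncol(toy_corpus)  # number of columns (variables)
n_observations <- nrow(toy_corpus)  # number of rows (observations)

print(n_variables)
print(n_observations)

# dim() returns both at once: rows first, then columns
dim(toy_corpus)
```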
Task 3 (1 Point)
Use R code to find out which types of data the variables text, month, and outlet contain.
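library("dplyr")
climate_corpus %>%
  # create tibble
  as_tibble() %>%
  # take only relevant variables (shown here: doc_id, year, outlet;
  # for the task, select text, month, and outlet instead)
  select(doc_id, year, outlet) %>%
  # show first observations and type of data
  head()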
## # A tibble: 6 x 3
## doc_id year outlet
## <chr> <int> <chr>
## 1 GlobeMail_2007-09-22_570.txt 2007 GlobeMail
## 2 The Australian_2009-12-11_11834.txt 2009 The Australian
## 3 Guardian_2018-2-115.txt 2018 Guardian
## 4 The Sydney Morning Herald_2016-10-19_38945.txt 2016 The Sydney Morning Herald
## 5 The Sunday Times_2011-11-27_32947.txt 2011 Sunday Times
## 6 The Globe and Mail_2016-2-23_21717.txt 2016 GlobeMail
Task 4 (1 Point)
Use R code to understand how many articles were published in either Canada or Australia.
climate_corpus %>%
  # use the filter function to filter the corpus
  filter(country %in% c("Canada", "Australia")) %>%
  # count observations per group
  count(country)
## country n
## 1 Australia 29
## 2 Canada 22
Task 5 (1.5 Points)
Use R code to reduce the data set to cases where texts were published between 2006 and 2010, and only include observations where the text is present (i.e., text is not NA). Save the result in a new object called climate_corpus_reduced.
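A minimal sketch of this kind of filtering, using base R's subset() on a made-up data frame. It assumes that "between 2006 and 2010" includes both end years and that the publication year is stored in a numeric column called year. The object names toy_corpus and toy_corpus_reduced are hypothetical.

```r
# Toy data frame standing in for climate_corpus (made-up values)
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt", "d.txt"),
  year   = c(2005L, 2006L, 2008L, 2010L),
  text   = c("Too early.", "In range.", NA, "Also in range."),
  stringsAsFactors = FALSE
)

# Keep rows from 2006 through 2010 (inclusive: an assumption)
# and drop rows where the text is missing
toy_corpus_reduced <- subset(
  toy_corpus,
  year >= 2006 & year <= 2010 & !is.na(text)
)

toy_corpus_reduced$doc_id
```

With dplyr, the same logical condition can be passed to filter() inside a pipeline.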
Task 6 (1 Point)
Using the full data set climate_corpus, we want to store months in a non-numeric way. For month = 1, replace the value 1 with January; for month = 2, replace the value 2 with February; etc. Do this for all months!
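One way to recode month numbers without writing twelve separate replacements is to index R's built-in month.name vector. The sketch below uses a made-up data frame and assumes month is stored as integers from 1 to 12. If it were stored as text, it would first need converting with as.integer().

```r
# Toy data frame standing in for climate_corpus (made-up values)
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt"),
  month  = c(1L, 2L, 12L),
  stringsAsFactors = FALSE
)

# month.name is a built-in character vector: "January", ..., "December".
# Indexing it with the month number looks up the matching name.
toy_corpus$month <- month.name[toy_corpus$month]

toy_corpus$month
```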
Task 7 (1.5 Points)
In the full data set, climate_corpus, create a new variable called word_count. It should contain the number of words in each text (i.e., the word count of the first text for the first text, the word count of the second text for the second text, etc.). Use R code to identify the doc_id of the shortest text in the corpus!
library("stringi")
climate_corpus <- climate_corpus %>%
  # create new variable
  mutate(word_count = stri_count_words(text))
# identify the shortest text
climate_corpus %>%
  filter(word_count == min(climate_corpus$word_count, na.rm = TRUE)) %>%
  pull(doc_id)
## [1] "The Times_2009-9-8_7484.txt"
Task 8 (1.5 Points)
Write your own custom R function with the name text_summary.
The function should only need one argument called x, for which it should be able to execute the following task:
Given a text x, the function should paste the following sentence: "This text has a total of XY words. It was published in the year XY. It is XY words shorter/longer than the average text in the corpus Climate_Corpus".
Important: The values XY should be replaced with the word count of x, its publication year, and its difference from the average word count (rounded to two decimals). That is, the function should calculate these values on its own for any x it is given. The word difference should be negative if the text is shorter than the average text and positive if it is longer than the average text. You can ignore texts that are missing.
Below, you see what the function should do when tested for the first text in climate_corpus:
library("dplyr")
library("stringi")
text_summary <- function(x) {
  # define word count for text
  word_count <- stri_count_words(x)
  # define year of text
  year <- climate_corpus %>%
    filter(text == x) %>%
    pull(year)
  # define average word count in the corpus (rounded to two decimals)
  word_count_mean <- round(mean(stri_count_words(climate_corpus$text), na.rm = TRUE), 2)
  # build the sentence; the difference is negative for shorter-than-average texts
  result <- paste0("This text has a total of ",
                   word_count,
                   " words. It was published in the year ",
                   year,
                   ". It is ",
                   word_count - word_count_mean,
                   " words shorter/longer than the average text in the corpus Climate_Corpus")
  return(result)
}
## [1] "This text has a total of 310 words. It was published in the year 2009. It is -899.05 words shorter/longer than the average text in the corpus Climate_Corpus"
## [1] "This text has a total of 57226 words. It was published in the year 2007. It is 56016.95 words shorter/longer than the average text in the corpus Climate_Corpus"
Please note: This is a pretty tricky task and requires you to google (a lot of) things you do not know yet. Just try to get as close as you can. If you have an idea of how to solve this in theory but cannot write the exact code, write down your thinking.