Graded task in R
The graded task in R is due on December 6th, 2023 (23:59). Please upload:
- your R script with the solutions
- in this R script, the names of all team members, using # to annotate code
- in this R script, a clear indication of which task was addressed where
You can also use the template for this R script from Moodle, where dedicated spaces to indicate team members and tasks are already included.
Image: Submission via Moodle
Task 1 (1 Point)
In Moodle, you will find the file "climate_corpus.csv" (via Moodle/Data for R). Please download the file. Use R code to load it into your R environment as an object titled climate_corpus.
Information on the data: The corpus contains news media texts on climate change by outlets from different countries across the world.
Solution:
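A minimal sketch of one way to approach this, not the official solution. It assumes the file is comma-separated; if the columns come out merged into one, read.csv2() or a different sep argument may be needed. So that the sketch runs anywhere, it first writes a tiny stand-in file to a temporary folder. With the real data, path would instead point to wherever the downloaded climate_corpus.csv was saved.

```r
# Stand-in file so the sketch is runnable; the two columns and their values are
# illustrative only and are NOT the real corpus.
path <- file.path(tempdir(), "climate_corpus.csv")
write.csv(
  data.frame(doc_id = c("a.txt", "b.txt"), year = c(2007L, 2009L)),
  path,
  row.names = FALSE
)

# Load the CSV file into an object called climate_corpus.
# Assumption: comma-separated values with a header row.
climate_corpus <- read.csv(path, stringsAsFactors = FALSE)

# Quick check that the import worked
str(climate_corpus)
```

If R cannot find the file, getwd() shows the folder in which R is currently looking.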
Task 2 (1 Point)
If you could not read in the CSV file in Task 1, simply load the workspace including the data provided via Moodle. Only use this option if you could not finish Task 1.
Image: Download working space via Moodle
Now, use R code to find out how many variables and how many observations climate_corpus contains.
Solution:
## [1] 6
## [1] 150
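The two values above are what you would expect from counting columns (variables) and rows (observations). A minimal sketch of how such counts can be obtained, shown on a toy stand-in data frame (toy_corpus and its contents are hypothetical, not the real corpus):

```r
# Toy stand-in: 3 variables, 4 observations (illustrative only)
toy_corpus <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt", "d.txt"),
  year   = c(2007L, 2009L, 2018L, 2016L),
  outlet = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Number of variables (columns)
n_variables <- ncol(toy_corpus)

# Number of observations (rows)
n_observations <- nrow(toy_corpus)

# dim() returns both at once: rows first, then columns
dim(toy_corpus)
```

Applied to climate_corpus instead of toy_corpus, the same calls report the size of the real data set.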
Task 3 (1 Point)
Use R code to find out which types of data the variables doc_id, year, and outlet contain.
Solution:
# dplyr provides %>%, as_tibble(), and select()
library("dplyr")
climate_corpus %>%
  # create tibble
  as_tibble() %>%
  # take only relevant variables
  select(doc_id, year, outlet) %>%
  # show first observations and type of data
  head()
## # A tibble: 6 x 3
## doc_id year outlet
## <chr> <int> <chr>
## 1 GlobeMail_2007-09-22_570.txt 2007 GlobeMail
## 2 The Australian_2009-12-11_11834.txt 2009 The Australian
## 3 Guardian_2018-2-115.txt 2018 Guardian
## 4 The Sydney Morning Herald_2016-10-19_38945.txt 2016 The Sydney Morning Herald
## 5 The Sunday Times_2011-11-27_32947.txt 2011 Sunday Times
## 6 The Globe and Mail_2016-2-23_21717.txt 2016 GlobeMail
Task 4 (1 Point)
Use R code to find out how many texts are missing in the corpus (i.e., how many times text is NA).
Solution:
climate_corpus %>%
  # use the summarize function to count the number of missing values
  summarize(missing = sum(is.na(text)))
## missing
## 1 5
Task 5 (1.5 Points)
Use R code to reduce the data set to cases where texts were published by outlets in Canada or the UK, and exclude the variable month. Save the result in a new object called climate_corpus_reduced.
Solution:
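A minimal sketch of the filter-and-drop pattern, not the official solution. It assumes the data set has a variable holding the outlet's country; the name country and the values "Canada" and "UK" are assumptions, so check names(climate_corpus) and the actual values first. If no such variable exists, the same pattern can be applied to outlet with the relevant outlet names. The toy data frame only stands in for the real corpus.

```r
library(dplyr)

# Toy stand-in for climate_corpus (illustrative only).
# Assumption: a column named `country` with values such as "Canada" and "UK".
climate_corpus <- data.frame(
  doc_id  = c("a.txt", "b.txt", "c.txt", "d.txt"),
  month   = c(9L, 12L, 2L, 10L),
  country = c("Canada", "Australia", "UK", "Canada"),
  text    = c("first text", NA, "third text", NA),
  stringsAsFactors = FALSE
)

climate_corpus_reduced <- climate_corpus %>%
  # keep only texts from outlets in Canada or the UK
  filter(country %in% c("Canada", "UK")) %>%
  # drop the variable month
  select(-month)

climate_corpus_reduced
```

Using %in% rather than == keeps the condition short when more than one value should match.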
Task 6 (1 Point)
Use R code to further reduce the data set: Only include observations for which the text is available (i.e., text is not NA). Save the result in a new object called climate_corpus_reduced_complete.
Important: If you could not solve Task 5 and create the object climate_corpus_reduced, you can also do this with the full data set climate_corpus.
Solution:
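A minimal sketch of removing observations with missing texts, not the official solution. The toy data frame below only stands in for climate_corpus_reduced (or climate_corpus); its contents are illustrative.

```r
library(dplyr)

# Toy stand-in for climate_corpus_reduced (illustrative only)
climate_corpus_reduced <- data.frame(
  doc_id = c("a.txt", "b.txt", "c.txt"),
  text   = c("first text", NA, "third text"),
  stringsAsFactors = FALSE
)

climate_corpus_reduced_complete <- climate_corpus_reduced %>%
  # keep only observations where text is not NA
  filter(!is.na(text))

climate_corpus_reduced_complete
```

Filtering on text alone keeps rows that have missing values in other variables, which is what the task asks for; tidyr::drop_na(text) would be an equivalent alternative.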
Task 7 (1.5 Points)
In the full data set, climate_corpus, create a new variable called word_count. It should include the number of words each text contains (i.e., the number of words in the first text for the first text, the number of words in the second text for the second text, etc.). Use R code to identify the doc_id of the longest text in the corpus!
Solution:
library("stringi")
climate_corpus <- climate_corpus %>%
  # create new variable
  mutate(word_count = stri_count_words(text))
# identify the longest text
climate_corpus %>%
  filter(word_count == max(climate_corpus$word_count, na.rm = TRUE)) %>%
  pull(doc_id)
## [1] "FAZ_2007-4-16_2028.txt_english.txt"
Task 8 (1.5 Points)
Write your own custom R function with the name stats_helper.
The function should only need one argument, called x, for which it should be able to execute the following task:
Given a vector x with numeric data, the function should paste the following sentence: "This variable has a mean of M = XY with a standard deviation of SD = XY. In total, XY out of N = XY observations are missing."
Important: The values XY should be replaced with the mean, standard deviation, number of missing values, and total number of observations of x (with mean and standard deviation rounded to two decimals). That is, the function should calculate these values on its own for any x it is given.
Below, you see what the function should do when tested for year and word_count in climate_corpus:
Solution:
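stats_helper <- function(x) {
  # mean and standard deviation, ignoring missing values, rounded to two decimals
  m <- round(mean(x, na.rm = TRUE), 2)
  s <- round(sd(x, na.rm = TRUE), 2)
  # number of missing values and total number of observations
  n_missing <- sum(is.na(x))
  n_total <- length(x)
  # paste everything into one sentence
  result <- paste0("This variable has a mean of M = ",
                   m,
                   " with a standard deviation of SD = ",
                   s, ". In total, ",
                   n_missing,
                   " out of N = ",
                   n_total,
                   " observations are missing.")
  return(result)
}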
## [1] "This variable has a mean of M = 2011.91 with a standard deviation of SD = 3.99. In total, 0 out of N = 150 observations are missing."
## [1] "This variable has a mean of M = 1209 with a standard deviation of SD = 4747.58. In total, 5 out of N = 150 observations are missing."
Please note: This is a pretty tricky task and requires you to google (a lot of) things you do not know yet. Just try to get as close as you can. If you have an idea of how to solve this in theory but cannot write the exact code, write down your thinking.