Part A

In this part you should use the entire data set of 30 countries.

After each task, Run the chunk you have inserted (by clicking the green “Play” button in the top right corner of the chunk) to check that you get the correct output. It is also advisable to Knit the document after each step, to make sure that your PDF document is produced properly.

For more background on the different statistics and graphics below, refer to the appendix.

A1 Descriptive Statistics

Calculate the number of (non-missing) countries, minimum, maximum, median, quartiles, mean and standard deviation for the total number of COVID-19 cases for the entire data set, from the beginning of the data up until 2020-03-31, until 2021-03-31, and until 2022-03-31, respectively. Report one table with a column for each date. You can simply copy the example code below and modify the dates.


# --- PART A ---

## Descriptive Statistics

```{r}

# Enter column to calculate
column<-5

# Select dates
date1<-"2020-02-15"
date2<-"2020-03-15"
date3<-"2020-04-15"

# Filter data
data1<-data[data$date==date1,column]
data2<-data[data$date==date2,column]
data3<-data[data$date==date3,column]

# Create functions
functions<-function(x){
    list("N"            = sum (!is.na(x)),
         "Min"          = min (x, na.rm = TRUE),
         "Max"          = max (x, na.rm = TRUE),
         "1st Quartile" = format(round(as.vector(quantile(x, probs = 0.25, na.rm = TRUE)),2), nsmall=2),
         "Median"       = format(round(median (x, na.rm = TRUE),2),nsmall=2),         
         "3rd Quartile" = format(round(as.vector(quantile(x, probs = 0.75, na.rm = TRUE)),2), nsmall=2),
         "Mean"         = format(round(mean(x, na.rm = TRUE),2), nsmall=2),
         "Stdev"        = format(round(sd  (x, na.rm = TRUE),2),nsmall=2))}


# Apply functions to data
column1<-sapply(data1, functions)
column2<-sapply(data2, functions)
column3<-sapply(data3, functions)

# Create table
table<-cbind(column1,column2,column3)

# Format table
colnames(table)<-c(date1,date2,date3)

# Print table
kable(table,
      format="latex",
      caption="Total COVID-19 Cases",
      align=rep('r',5),
      booktabs=TRUE)%>%
kable_styling(latex_options = 
                c("striped", "hold_position"))
```

Make a brief comment on the figures. What can you say about the spread of COVID-19? (Write the comment under the code chunk.)
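
As a minimal sketch of what the template computes, the example below uses a small made-up data set (the countries A, B, C and all numbers are hypothetical). It assumes, as the date filter in the template implies, that total_cases is cumulative, so the rows for a given date already hold the totals from the beginning of the data up to that date.

```r
# Hypothetical toy data: cumulative case counts for three made-up countries
toy <- data.frame(
  location    = rep(c("A", "B", "C"), each = 3),
  date        = rep(c("2020-03-31", "2021-03-31", "2022-03-31"), times = 3),
  total_cases = c(10, 200, 900,
                  NA, 150, 400,
                   5,  50, 1000)
)

# Rows for one date hold the totals up to that date (assumption: cumulative column)
cases_2020 <- toy$total_cases[toy$date == "2020-03-31"]
cases_2021 <- toy$total_cases[toy$date == "2021-03-31"]

# Number of non-missing countries: country B is missing on the first date
n_2020 <- sum(!is.na(cases_2020))
n_2021 <- sum(!is.na(cases_2021))

# Median and quartiles, ignoring missing values
median_2021 <- median(cases_2021, na.rm = TRUE)
q_2021      <- quantile(cases_2021, probs = c(0.25, 0.75), na.rm = TRUE)

print(c(N = n_2021, Median = median_2021, q_2021))
```

Note that N can differ between dates when a country has no reported value yet, which is why the template counts non-missing values separately for each column.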

A2 Bar Plots

Draw two bar plots. Both plots should show the total number of COVID-19 cases for the different countries, from the beginning of the data until 2022-03-31. However, they should be sorted in different ways. The first plot should be sorted in decreasing order of the total number of COVID-19 cases. The second plot should be sorted in decreasing order of GDP per capita. You can copy and paste the code below and adjust it by choosing the right end date and sorting column in the data.


\newpage

## Bar Plots

```{r, out.width=c('50%', '50%'), fig.show='hold'}

# Pick column to plot
data_column<-5

# Pick column to sort after
sort_column1<-5
sort_column2<-5

# Pick column for x-axis names
name_column<-3

# Select end date
date<-"2020-02-15"

# Filter data for selected date
filtered_data<-data[data$date==date,]

# Sort data by the chosen sort column in decreasing order
sorted_data1<-filtered_data[order(-filtered_data[,sort_column1]),]
sorted_data2<-filtered_data[order(-filtered_data[,sort_column2]),]

# Transpose data (for chosen column)
transposed_data1<-t(sorted_data1[,data_column])
transposed_data2<-t(sorted_data2[,data_column])

# Create names
names1<-t(sorted_data1[,name_column])
names2<-t(sorted_data2[,name_column])

# Specify plot margins (bottom, left, top, right)
margins<-par(mar=c(10,5,2,2))

# Create plots
barplot(transposed_data1, 
        main="Total COVID-19 cases per country",
        names.arg = names1,
        las=2)

barplot(transposed_data2, 
        main="Total COVID-19 cases per country (sorting: GDP/Cap)",
        names.arg = names2,
        las=2)

# Remove margins from workspace
rm(margins)
```

Compare the plots. What can you say about the spread of COVID-19?
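
The sorting step can be sketched on a small made-up data set (the countries and numbers below are hypothetical). The sketch refers to columns by name rather than by index, which is a choice made here for readability and not something the template requires. It shows that the same values are plotted in both cases and only the order of the bars changes.

```r
# Hypothetical toy data for four made-up countries
toy <- data.frame(
  location       = c("A", "B", "C", "D"),
  total_cases    = c(500, 1200, 300, 800),
  gdp_per_capita = c(40000, 15000, 60000, 25000)
)

# Sort rows in decreasing order of each sorting variable
by_cases <- toy[order(toy$total_cases,    decreasing = TRUE), ]
by_gdp   <- toy[order(toy$gdp_per_capita, decreasing = TRUE), ]

# Same bar heights (total_cases), different bar order
barplot(by_cases$total_cases, names.arg = by_cases$location, las = 2,
        main = "Toy data: sorted by cases")
barplot(by_gdp$total_cases, names.arg = by_gdp$location, las = 2,
        main = "Toy data: sorted by GDP per capita")
```

If the bars in the second plot also appear roughly decreasing, that suggests an association between the sorting variable and the number of cases.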

A3 Box Plots

Draw box plots for the total number of COVID-19 cases per million people for the countries, from the beginning of the data up until 2021-03-31 and until 2022-03-31, respectively. Do this with both regular and logarithmic scales. You can copy and paste the code below.


\newpage

## Box Plots

```{r, out.width=c('50%', '50%'), fig.show='hold'}

# Pick column
column<-5

# Select dates
date1<-"2021-03-31"
date2<-"2022-03-31"

# Filter data
box_data<-data[data$date==date1|
               data$date==date2,]

# Create two new variables:
# Cases per GDP/Capita
box_data$cases_per_GDP <- (box_data$total_cases/box_data$gdp_per_capita)
# Cases per population density
box_data$cases_per_pop_den <- (box_data$total_cases/box_data$population_density)

# Apply margins to plots
margins<-par(mar=c(5,10,2,2))

boxplot(box_data$total_cases_per_million~box_data$date,
        horizontal=TRUE,
        las=1,
        main="Total COVID-19 Cases per Million",
        xlab="",
        ylab="")

boxplot(box_data$total_cases_per_million~box_data$date,
        horizontal=TRUE,
        log = "x",
        las=1,
        main="Total COVID-19 Cases per Million (log scale)",
        xlab="",
        ylab="")

# Remove margins from workspace
rm(margins)
```

As you can see when you study the code, two new variables have been created. Use these new variables to create two more boxplots: one for the total number of COVID-19 cases per GDP per capita and one for the total number of COVID-19 cases per population density, both using only the logarithmic scale. You can create the boxplots by copying and pasting the last boxplot(…) call in the chunk and then changing the total_cases_per_million variable in the code to the appropriate variable.
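
A minimal sketch of this pattern on made-up numbers is shown below (the data frame toy and its values are hypothetical; only the column names follow the template). It creates a derived ratio variable and draws a horizontal boxplot per date on a logarithmic scale. A logarithmic axis requires strictly positive values, which holds for the toy data.

```r
# Hypothetical toy data: four made-up countries observed on two dates
toy <- data.frame(
  date           = rep(c("2021-03-31", "2022-03-31"), each = 4),
  total_cases    = c(1e3, 5e4, 2e5, 8e5,
                     4e3, 2e5, 9e5, 3e6),
  gdp_per_capita = rep(c(10000, 20000, 40000, 50000), times = 2)
)

# Derived variable: cases divided by GDP per capita
toy$cases_per_GDP <- toy$total_cases / toy$gdp_per_capita

# Widen the left margin for the date labels and remember the old settings
old_par <- par(mar = c(5, 10, 2, 2))

# With horizontal = TRUE the values are on the x-axis, so log = "x" is used
bp <- boxplot(cases_per_GDP ~ date, data = toy,
              horizontal = TRUE,
              log = "x",
              las = 1,
              main = "Toy data: cases per GDP per capita (log scale)",
              xlab = "",
              ylab = "")

# Restore the previous margins
par(old_par)
```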

Make a brief comment about the plots. What can you say about the spread of COVID-19?

A4 Scatter Plots

Draw two scatter plots of total COVID-19 cases. Both plots should show the total for 2020-04-01 to 2021-03-31 on the x-axis and the total for 2021-04-01 to 2022-03-31 on the y-axis. One plot should use a normal scale and the other a logarithmic scale. Copy the code below and modify the dates. For the second plot, take the logarithm of both the x variable and the y variable by using the log() function.


\newpage

## Scatter Plots

```{r, out.width=c('50%', '50%'), fig.show='hold'}

# Select dates
date1<-"2020-06-30"
date2<-"2020-12-31"
date3<-"2020-12-31"
date4<-"2021-06-30"

# Filter data
data1<-data[data$date>=date1&data$date<=date2,]
data2<-data[data$date>=date3&data$date<=date4,]

a1<-aggregate(data1$new_cases, by=list(data1$location), FUN=sum)
a2<-aggregate(data2$new_cases, by=list(data2$location), FUN=sum)
a <-na.omit(cbind(a1,y=a2$x))

# Apply margins to plots
margins<-par(mar=c(5,10,2,2))

# Draw plot
plot(a$x,a$y,
        las=0,
        main="Total COVID-19 Cases",
        xlab=paste(date1," to ",date2,sep=""),
        ylab=paste(date3," to ",date4,sep=""))

# Draw log-plot
plot(a$x,a$y,
        las=0,
        main="Total COVID-19 Cases (log scale)",
        xlab=paste(date1," to ",date2,sep=""),
        ylab=paste(date3," to ",date4,sep=""))

# Remove margins from workspace
rm(margins)
```

Make a brief comment about the plots. What can you say about the spread of COVID-19?
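
The idea behind the scatter plots can be sketched on made-up numbers (the data frame toy, its period column, and all values are hypothetical). The sketch sums new cases per country within each period, pairs the two sums, and takes logarithms with log(). Unlike the template, it pairs the sums with merge() by country name instead of cbind(); this is a variation chosen here so that the pairing does not rely on both results having the same row order.

```r
# Hypothetical daily data for three made-up countries over two periods
toy <- data.frame(
  location  = rep(c("A", "B", "C"), each = 4),
  period    = rep(c("first", "first", "second", "second"), times = 3),
  new_cases = c( 10,  20,  100,  200,
                  1,   2,   30,   40,
                500, 500, 4000, 5000)
)

# Sum new cases per country within each period
first  <- aggregate(new_cases ~ location,
                    data = toy[toy$period == "first", ], FUN = sum)
second <- aggregate(new_cases ~ location,
                    data = toy[toy$period == "second", ], FUN = sum)

# Pair the two sums by country name
both <- merge(first, second, by = "location",
              suffixes = c("_first", "_second"))

# Natural logarithm of both variables; note that log(0) is -Inf in R
both$log_first  <- log(both$new_cases_first)
both$log_second <- log(both$new_cases_second)

plot(both$log_first, both$log_second,
     main = "Toy data: total cases (log scale)",
     xlab = "log(first period)",
     ylab = "log(second period)")
```

On the logarithmic scale, countries with very different case counts are spread more evenly across the plot, which makes it easier to judge whether the two periods are related.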