4 Session 3: Data analysis and visualization
Aim: To provide tools for extracting answers from your data.
Intended Learning Outcomes:
At the end of the session a successful student will be able to:
Build an analysis dataset
Filter, group, and summarise data to answer questions
Build presentation-ready tables using R
Understand the key concepts behind making graphs with ggplot
4.1 What will be covered today?
4.2 Looking at your data
Section from Epidemiologist R handbook- 17 Descriptive Tables
There are many functions available to look at descriptive statistics from your dataset. For this example, we will use a function that is included in the basic installation of R.
library(here)
::p_load(here)
pacman<- read.csv(here('data', 'africa_covid_cases_long.csv'))
africa_covid_cases_long summary(africa_covid_cases_long)
## X ISO COUNTRY_NAME AFRICAN_REGION
## Min. : 1 Length:25917 Length:25917 Length:25917
## 1st Qu.: 6480 Class :character Class :character Class :character
## Median :12959 Mode :character Mode :character Mode :character
## Mean :12959
## 3rd Qu.:19438
## Max. :25917
##
## excel_date cases date_format
## Min. :43831 Min. : -209.0 Length:25917
## 1st Qu.:43953 1st Qu.: 0.0 Class :character
## Median :44075 Median : 7.0 Mode :character
## Mean :44075 Mean : 177.6
## 3rd Qu.:44197 3rd Qu.: 78.0
## Max. :44319 Max. :21980.0
## NA's :231
This function provides useful information which we can use for building our approach to designing an analysis workflow. For example, we can see from the summary of the date variable that the first record (Min
) is from 2020-01-01 and the last record (Max
) is from 2021-05-03, For the cases variable, the maximum number of cases recorded on one day was 6,195.
4.3 Building an analysis dataset
Before analysing the data, it is good to generate a new dataset that only contains the variables you need to analyse.
So what variables do we have in africa_covid_cases_long
names(africa_covid_cases_long)
## [1] "X" "ISO" "COUNTRY_NAME" "AFRICAN_REGION"
## [5] "excel_date" "cases" "date_format"
We can select the variables we want to keep using the select function from the dplyr package. dplyr
is a core package of the tidyverse
so it is loaded when you write library(tidyverse).
library(tidyverse) # this will load the dplyr package along with other core tidyverse packages
library(dplyr) # this will only load the dplyr package
# you don't need to run both lines! Loading a package twice will not have any negative consequences. But it makes your code less efficient to read!
<- africa_covid_cases_long %>% #use this dataset
analysis_dataset select(date_format,AFRICAN_REGION, COUNTRY_NAME, cases) #select the named columns
The select
function from the dplyr
package is very useful. It is used to tell R which columns you want to keep in the new data frame. We can look at the first few rows of the dataset we have created to check that we have selected the correct variables.
head(analysis_dataset)
## date_format AFRICAN_REGION COUNTRY_NAME cases
## 1 2020-01-01 Northern Africa Algeria 0
## 2 2020-01-02 Northern Africa Algeria 0
## 3 2020-01-03 Northern Africa Algeria 0
## 4 2020-01-04 Northern Africa Algeria 0
## 5 2020-01-05 Northern Africa Algeria 0
## 6 2020-01-06 Northern Africa Algeria 0
The select function can also be used to rename selected variables.
<- africa_covid_cases_long %>%
analysis_dataset select(date_format,region=AFRICAN_REGION, country=COUNTRY_NAME, cases) %>%
mutate(date=lubridate::ymd(date_format))
We have renamed AFRICAN_REGION
and COUNTRY_NAME
as region
and country
.
head(analysis_dataset)
## date_format region country cases date
## 1 2020-01-01 Northern Africa Algeria 0 2020-01-01
## 2 2020-01-02 Northern Africa Algeria 0 2020-01-02
## 3 2020-01-03 Northern Africa Algeria 0 2020-01-03
## 4 2020-01-04 Northern Africa Algeria 0 2020-01-04
## 5 2020-01-05 Northern Africa Algeria 0 2020-01-05
## 6 2020-01-06 Northern Africa Algeria 0 2020-01-06
4.4 Answering questions with data
The Epidemiologist R handbook has several comprehensive sections focusing on data analysis. We will continue to work with the dataset we have built while applying some of the examples from the handbook.
So far, we have:
Imported the data from an Excel worksheet
Reshaped the data into a “tidy” format
Changed the format of a variable to a date
Selected only the variables we want to use for the analysis
Now we can start to use the dataset to answer questions. The dplyr
package contains many useful functions for analysing data. Some of these functions are covered in the Epidemiologist R Handbook - Section 17.4. We will use some of these functions to answer questions using our dataset.
How many confirmed cases of COVID-19 have been recorded in Africa?
%>% # Tell R what dataset we want to use
analysis_dataset summarise(total_covid_cases=sum(cases)) #Tell R you want to sum all cases
## total_covid_cases
## 1 NA
The answer is “NA”, which stands for “Not Available”. This is a good example of how R “sees” missing data
There may be dates in our dataset where there were no confirmed cases of COVID-19 recorded.
When data are missing, R will display “NA” for the variable.
If you try to run a calculation on data where there is one or more “NA” values, the results will be “NA”.
One option for dealing with missing values in R is to exclude “NA” values from calculations. To do this we, can add an additional argument to the function to remove any record with “NA” in the variable “cases”.
<- analysis_dataset %>%
total_covid_cases summarise(total_covid_cases=sum(cases, na.rm=TRUE)) #Tell R you want to sum all cases but remove NA values - values with "missing" data
total_covid_cases
## total_covid_cases
## 1 4561465
This command has now excluded NA values and has provided us with an answer for the number of confirmed COVID-19 cases in Africa - 4561465.
4.4.1 Grouping and pivoting data
group_by
is a very powerful function for summarising data. You can instruct R to recognise specific groups in your data. Functions can then be applied to these groups providing you with additional information.
How many confirmed cases of COVID-19 have been recorded in Africa, by region?
<- analysis_dataset %>%
total_covid_cases_region group_by(region) %>% #use the values in the region column to group the data
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) #for each region, sum the number of cases
total_covid_cases_region
## # A tibble: 5 x 2
## region total_covid_cases
## <chr> <int>
## 1 Central Africa 161353
## 2 Eastern Africa 622537
## 3 Northern Africa 1371469
## 4 Southern Africa 1970137
## 5 Western Africa 435969
The arrange
function can be used to sort the results. In this case, we have instructed R to sort the results by the total_covid_cases
variable, from highest to lowest value.
<- analysis_dataset %>%
total_covid_cases_region_sort group_by(region) %>% #use the values in the region column to group the data
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
arrange(-total_covid_cases)
total_covid_cases_region_sort
## # A tibble: 5 x 2
## region total_covid_cases
## <chr> <int>
## 1 Southern Africa 1970137
## 2 Northern Africa 1371469
## 3 Eastern Africa 622537
## 4 Western Africa 435969
## 5 Central Africa 161353
We can add multiple variables to group_by
.
If we add region and country to the group_by command, sort from highest to lowest, we can see which countries reported the most confirmed COVID-19 cases.
<- analysis_dataset %>%
total_covid_cases_region_country_sort group_by(region, country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>% #for each region, sum the number of cases
arrange(-total_covid_cases) #sort the result from highest to lowest number of cases
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
total_covid_cases_region_country_sort
## # A tibble: 53 x 3
## # Groups: region [5]
## region country total_covid_cases
## <chr> <chr> <int>
## 1 Southern Africa South Africa 1584064
## 2 Northern Africa Morocco 511856
## 3 Northern Africa Tunisia 311743
## 4 Eastern Africa Ethiopia 258353
## 5 Northern Africa Egypt 228584
## 6 Northern Africa Libya 178335
## 7 Western Africa Nigeria 165199
## 8 Eastern Africa Kenya 160559
## 9 Northern Africa Algeria 122522
## 10 Western Africa Ghana 92683
## # … with 43 more rows
4.4.2 Filtering data
Another useful function is filter
which can be used to apply filters to calculations.
We can repeat the previous calculation, but then add a filter to only include results from countries in Northern Africa.
%>%
analysis_dataset group_by(region, country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
arrange(-total_covid_cases) %>% #sort the result from highest to lowest number of cases
filter(region=="Northern Africa") #filter the result to only include rows where the region is "Northern Africa"
## `summarise()` has grouped output by 'region'. You can override using the `.groups` argument.
## # A tibble: 6 x 3
## # Groups: region [1]
## region country total_covid_cases
## <chr> <chr> <int>
## 1 Northern Africa Morocco 511856
## 2 Northern Africa Tunisia 311743
## 3 Northern Africa Egypt 228584
## 4 Northern Africa Libya 178335
## 5 Northern Africa Algeria 122522
## 6 Northern Africa Mauritania 18429
The filter can be applied at any point within the calculation. For very complex calculations, it is helpful to apply the filter as early as possible. This reduces the number of records before the complex portion of the calculation occurs.
filter
can also be used to make data frames
<- analysis_dataset %>%
northern_africa filter(region=="Northern Africa")
head(northern_africa)
## date_format region country cases date
## 1 2020-01-01 Northern Africa Algeria 0 2020-01-01
## 2 2020-01-02 Northern Africa Algeria 0 2020-01-02
## 3 2020-01-03 Northern Africa Algeria 0 2020-01-03
## 4 2020-01-04 Northern Africa Algeria 0 2020-01-04
## 5 2020-01-05 Northern Africa Algeria 0 2020-01-05
## 6 2020-01-06 Northern Africa Algeria 0 2020-01-06
Using filters, we can answer additional questions.
What percentage of North Africa’s confirmed COVID-19 cases were recorded in each country in North Africa?
#to convert the calculation to percentage we will need to install and load an additional package called "scales"
::p_load(scales)
pacman
<- northern_africa %>%
northern_africa_cases_country group_by(country) %>%
summarise(total_covid_cases=sum(cases, na.rm=TRUE)) %>%
mutate(percentage=scales::percent(total_covid_cases/sum(total_covid_cases))) %>%
#with this mutate command we are telling r to divide the total number of covid cases for each country by the total number of covid cases for all countries in northern africa
arrange(-total_covid_cases)
northern_africa_cases_country
## # A tibble: 6 x 3
## country total_covid_cases percentage
## <chr> <int> <chr>
## 1 Morocco 511856 37.3%
## 2 Tunisia 311743 22.7%
## 3 Egypt 228584 16.7%
## 4 Libya 178335 13.0%
## 5 Algeria 122522 8.9%
## 6 Mauritania 18429 1.3%
4.5 Visualising data using ggplot
One of the major strengths of R is visualising data. There are many packages which have functions you can use to make graphs, tables, maps…the list is endless! The first package of functions we will use for visualising data is another core tidyverse
package called ggplot2.
This is commonly referred to as ggplot
.
We have already loaded the package when we ran library(tidyverse)
. You can also choose to only load the ggplot2 package by typing library(ggplot2)
.
library(ggplot2)
The Epidemiologist R handbook has 2 sections focused on ggplot
:
These sections contain very helpful explanations of many of the functions available with ggplot. There are also a number of excellent references for every type of graph you want to make. We will walk through some common examples to teach some of the most common approaches.
4.5.1 Epicurves
Firstly, we will produce epicurves to describe the distribution of COVID-19 cases (y axis) over time (x axis). Make a graph of confirmed COVID-19 cases in Northern Africa. The Epidemiologist R handbook has a very useful section on this topic - 32.3 Epicurves with ggplot
ggplot(northern_africa, aes(x=date,y=cases)) +
geom_line()
## Warning: Removed 7 row(s) containing missing values (geom_path).
This command has generated a line graph of confirmed COVID-19 cases for countries in Northern Africa. From earlier steps, we know that the dataset northern_africa
contains data from multiple countries: `r unique(northern_africa$country’. We can add more information to the ggplot command to draw separate lines for each country.
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_line()
## Warning: Removed 9 row(s) containing missing values (geom_path).
To make the graph more presentable, we can add more options to the ggplot command.
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_line() +
labs(x='Date', y='Total cases', color='Country') + #label axes
theme(legend.position='top') + #place legend at top of graph
scale_x_date(date_breaks = '2 months', #set x axis to have 2 month breaks
date_minor_breaks = '1 month', #set x axis to have 1 month breaks
date_labels = '%b-%y') #change label for x axis
## Warning: Removed 9 row(s) containing missing values (geom_path).
More information on plotting time-series data using ggplot can be found here.
It is still difficult to see the data for each country. There is a helpful command called facet_wrap
to fix this and show multiple epicurves by country.
ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
geom_line() +
labs(x='Date', y='Total cases') + #label axes
theme(legend.position='none') + #remove legend by setting position to 'none'
scale_x_date(date_breaks = '4 months', #set x axis to have 2 month breaks
date_minor_breaks = '2 months', #set x axis to have 1 month breaks
date_labels = '%b-%y') + #change label for x axis
facet_wrap(~country) # this will create a separate graph for each country
## Warning: Removed 9 row(s) containing missing values (geom_path).
4.5.1.1 Highlighting
When data are presented for multiple countries, it can be helpful to highlight specific countries to show their path. This can be easily done in ggplot using the package gghighlight and is covered in the Epidemiologist R handbook Section 31.8 Highlighting.
::p_load(gghighlight)
pacman<- ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
highlight_country_morocco_gph geom_line() +
::gghighlight(country == "Morocco") + #highlight data reported by Morocco
gghighlightlabs(x='Date', y='Total cases') + #label axes
theme(legend.position='none') + #remove legend by setting position to 'none'
scale_x_date(date_breaks = '4 months', #set x axis to have 2 month breaks
date_minor_breaks = '2 months', #set x axis to have 1 month breaks
date_labels = '%b-%y') #change label for x axis
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: country
highlight_country_morocco_gph
## Warning: Removed 9 row(s) containing missing values (geom_path).
## Warning: Removed 1 row(s) containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_label_repel).
Highlighting can also be applied to graphs with the facet_wrap function applied.
<- ggplot(northern_africa, aes(x=date,y=cases, color=country)) +
highlight_country_facet_gph geom_line() +
labs(x='Date', y='Total cases') + #label axes
::gghighlight() + #highlight each country independently
gghighlighttheme(legend.position='none') + #remove legend by setting position to 'none'
scale_x_date(date_breaks = '4 months', #set x axis to have 2 month breaks
date_minor_breaks = '2 months', #set x axis to have 1 month breaks
date_labels = '%b-%y') + #change label for x axis
facet_wrap(~country) # this will create a separate graph for each country
## label_key: country
highlight_country_facet_gph
## Warning: Removed 9 row(s) containing missing values (geom_path).
## Warning: Removed 9 row(s) containing missing values (geom_path).
## Warning: Removed 6 rows containing missing values (geom_label_repel).
4.5.2 Building your confidence
This code can seem overwhelming at first. The method to build a ggplot is very different to the ‘point and click’ method used in Excel. It will be very helpful to your learning to work through each step and see what changes when you delete/add code. A very helpful package called eqsuisse can help you understand more about how ggplot works.
::p_load(esquisse)
pacman::esquisser() esquisse
Using this package, you can drag and drop variables, change the type of graph and make a lot customisations. You can also click the button “code” and it will show you what the ggplot code is for each graph you have selected. This code can then be copied into your R script.
4.6 Useful resources
Epidemiologist R handbook
Data Management
- 15: De-duplication
Analysis
- 17: Descriptive tables
- 22: Moving averages
Data visualization
- 30: ggplot basics
- 31: ggplot tips
- 32: Epidemic curves
Reports and dashboards
4.7 Additional questions and workflows
4.7.1 When was the first confirmed case of COVID-19 in Northern Africa?
<- northern_africa %>% # use the northern_africa data
first_cases_northern_africa filter(cases>0) %>% #filter to only include rows with a case value of greater than 0
filter(date == min(date, na.rm=TRUE)) #keep the row with the minimum (first) date
first_cases_northern_africa
## date_format region country cases date
## 1 2020-02-14 Northern Africa Egypt 1 2020-02-14
Here we have added 2 filters:
Only keep records where the value for
cases
is higher than 0.Only keep records where the value for
date
is equal to the minimum value fordate.
We have also added thena.rm=TRUE
command from a previous step. If you don’t know the data very well, it is good practice to add this command.
When was the first confirmed case of COVID-19 in Northern Africa, by country?
<- northern_africa %>%
first_case_northern_africa_country group_by(country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE))
first_case_northern_africa_country
## # A tibble: 6 x 5
## # Groups: country [6]
## date_format region country cases date
## <chr> <chr> <chr> <int> <date>
## 1 2020-02-25 Northern Africa Algeria 1 2020-02-25
## 2 2020-02-14 Northern Africa Egypt 1 2020-02-14
## 3 2020-03-24 Northern Africa Libya 1 2020-03-24
## 4 2020-03-13 Northern Africa Mauritania 1 2020-03-13
## 5 2020-03-02 Northern Africa Morocco 1 2020-03-02
## 6 2020-03-02 Northern Africa Tunisia 1 2020-03-02
The filter function can also be used to exclude certain records from the analysis
<- northern_africa %>%
first_case_northern_africa_country_exclude_tunisia group_by(country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE)) %>%
#filter(!country=="Tunisia") %>%
filter(country!="Tunisia") #both methods for excluding results (in this case excluding results where the value for country is Tunisia) can be used
first_case_northern_africa_country_exclude_tunisia
## # A tibble: 5 x 5
## # Groups: country [5]
## date_format region country cases date
## <chr> <chr> <chr> <int> <date>
## 1 2020-02-25 Northern Africa Algeria 1 2020-02-25
## 2 2020-02-14 Northern Africa Egypt 1 2020-02-14
## 3 2020-03-24 Northern Africa Libya 1 2020-03-24
## 4 2020-03-13 Northern Africa Mauritania 1 2020-03-13
## 5 2020-03-02 Northern Africa Morocco 1 2020-03-02
On what date was the 100th case of COVID-19 reported from each country in Northern Africa?
<- northern_africa %>%
first_cases_northern_africa_data group_by(country) %>%
mutate(cumulative_cases=cumsum(cases)) %>%
filter(cumulative_cases>=100) %>%
slice(1) %>%
pull(date, country)
Here we have introduced two new functions, slice
and pull
. slice
can be used to select specific rows from a dataset. In this case, we have added a column which is the cumulative number of cases, selected the first row after filtering the dataset to only include results where the value is greater than or equal to 100, and then selected the first row using the slice command. An additional function is the pull
command. This is useful when you want to extract specific values from the result.
<- northern_africa %>%
first_100cases group_by(country) %>%
mutate(cumulative_cases=cumsum(cases)) %>%
filter(cumulative_cases>=100) %>%
slice(1) %>%
pull(date, country)
The Epidemiologist R Handbook has a thorough section on slicing data: 15.3- Slicing
4.7.2 Using the results in RMarkdown
RStudio has a powerful tool for writing documents called RMarkdown. These slides were written in RMarkdown and there are many resources online built using the same approach.
One major benefit of RMarkdown is that it is an excellent tool for reproducible research.
RMarkdown will not be covered during this training, but there are excellent resources online providing walk-through guides and some are included in the useful resources section of this session.
Here is one example:
If we write this text
-19 case recorded in Northern Africa was in `r first_cases_northern_africa %>% pull(country)` on `r first_cases_northern_africa %>% pull(date)`. The first COVID
it will work through the functions and this is the result
The first COVID-19 case recorded in Northern Africa was in Egypt on 2020-02-14.
If you were to re-run the report with data from eastern Africa, the figures would update using the same functions on the new dataset for eastern Africa. This is very helpful when multiple people work on a report that needs to be run at regular intervals.
4.7.3 Moving averages
The dataset is currently set up so that each row contains information on the number of recorded COVID-19 cases for a specific date for a specified country. One calculation that we may be interested in is the overall trend of case numbers over time. For this, we can calculate cumulative values and averages to identify any trends in the data.
To demonstrate this, we will filter the dataset only to include one country - in this case, Morocco.
<- northern_africa %>%
morocco_covid_cases filter(country=="Morocco")
When data are collected daily, it can be helpful to apply functions to improve the interpretation of trends that may be present in the data. For example, with this COVID dataset, data are available for 489 days between January 1, 2020, & May 3, 2021. There will be some days when 0 cases are reported, and there will be some days when many more cases are reported. Some of these differences may be due to delays in reporting cases if, for example, reporting does not take place at the weekend.
There are many functions in the slider
package that can help us partially account for reporting delays. These examples use code from the Epidemiologist R handbook: 22.2 Calculate with slider.
4.7.3.1 Rolling seven-day average (mean) of cases
::p_load(slider)
pacman
<- morocco_covid_cases %>%
morocco_covid_cases_mean mutate( # create new columns
# Using slide_dbl()
###################
reg_7day = slide_dbl(
# calculate on new_cases
cases, .f = ~mean(.x, na.rm = T), # function is sum() with missing values removed
.before = 7)) # window is the ROW and 6 prior ROWS
We have now created a new variable which calculates the 7-day moving average of cases. In the visualisation session of this training, we will compare the graphs of cases to the seven-day moving average to show the difference between the two indicators.
If you wanted to calculate a moving average over a more extended period, you could adjust the number after k=
<- morocco_covid_cases %>%
morocco_covid_cases_mean mutate( # create new columns
# Using slide_dbl()
###################
cases_7day_mean = slide_dbl(
# calculate on new_cases
cases, .f = ~mean(.x, na.rm = T), # function is sum() with missing values removed
.before = 7), # window is the ROW and 6 prior ROWS
cases_14day_mean = slide_dbl(
# calculate on new_cases
cases, .f = ~mean(.x, na.rm = T), # function is sum() with missing values removed
.before = 14)) # window is the ROW and 6 prior ROWS
4.7.4 Picking the right graph
With so many options to choose from you will find yourself spending a lot of time trying to work out the most effective visualisation for your analysis. Graphs are a very powerful method for visualising complex information but they can be misleading if they are not designed correctly.
4.7.4.1 Tufte’s 6 fundamental principles of design
Edward Tufte is a world-famous graphic designer who has published several books focusing on data visualisation. Tufte has suggested 6 fundamental principles of design which have been discussed here and here. We will work through the 6 principles and consider how we can apply them when building graphs in ggplot.
“The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the numerical quantities measured.”
Use an accurate scale
“Clear, detailed, and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graphic itself. Label important events in the data.”
Label the graph so that the reader understands the story you are telling.
“Show data variation, not design variation.”
Pick colours that can help to tell the story. Don’t have more than 5 colours as it’s difficult to identify individual groups. Use the most commonly used types of graph.
“In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.”
Use appropriate units to ensure data are comparable. For diseases, consider presenting incidence as cases/100,000 people.
“The number of information-carrying (variable) dimensions depicted should not exceed the number of dimensions in the data.”
Don’t use pie charts!
“Graphics must not quote data out of context.”
Ensure your graph is telling the truth!
Due to the structure of code using ggplot these principles can be followed to ensure clean, clear graphics.
4.7.5 Visualising the moving average
In a previous section, we added indicators for the rolling average and rolling sum of cases. These indicators can be helpful for identifying trends over time.
<- morocco_covid_cases_mean %>%
moroocco_covid_cases_graph ggplot() +
geom_col(aes(x=date, y=cases, color=country)) +
geom_line(aes(x=date, y=cases_7day_mean)) +
labs(x='Month-Year',
y='Total cases',
title='Cases and 7-day average (black line)') +
theme(legend.position='none') +
scale_x_date(date_breaks = '4 months', #set x axis to have 2 month breaks
date_minor_breaks = '2 months', #set x axis to have 1 month breaks
date_labels = '%b-%y') #change label for x axis
moroocco_covid_cases_graph
## Warning: Removed 1 rows containing missing values (position_stack).
This chart shows the total number of COVID-19 cases for each day in Morocco between January 1, 2020 and May 3, 2021. The red bars show the reported case numbers for each day while the black line show the 7-day average of cases. We can see that there are several dates with substantially higher numbers of cases compared to the neighbouring dates. This could be due to increased testing on specific days but it is more likely due to delays in reporting leading to a backlog of cases reported on specific days. The black line “smooths” out these differences, allowing us to see the overall trend.
4.7.6 Presenting your results in a table
The gt
package provides a very flexible interface for building tables from your data.
::p_load(gt,here,dplyr) pacman
The documentation describing the functions can be found here. Below is an example using the dataset of COVID-19 cases in Northern Africa.
<- northern_africa_cases_country %>%
northern_africa_cases_country_table gt() %>%
tab_header(
title = md("COVID-19 in Northern Africa")
%>%
) cols_label(
country = "Country",
total_covid_cases = "Count",
percentage = "% of total cases in Northern Africa"
%>%
) tab_spanner(
label = "Confirmed cases",
columns = c(total_covid_cases,percentage)
%>%
) fmt_number(
columns = total_covid_cases,
decimals=0,
use_seps = TRUE
%>%
) cols_align(
align = "center",
columns = c(total_covid_cases, percentage)
) northern_africa_cases_country_table
COVID-19 in Northern Africa | ||
---|---|---|
Country | Confirmed cases | |
Count | % of total cases in Northern Africa | |
Morocco | 511,856 | 37.3% |
Tunisia | 311,743 | 22.7% |
Egypt | 228,584 | 16.7% |
Libya | 178,335 | 13.0% |
Algeria | 122,522 | 8.9% |
Mauritania | 18,429 | 1.3% |
gt
has many options for customising tables. To demonstrate this, we will build a table to show when each country in Africa recorded its first COVID-19 case. This example uses some of the techniques demonstrated in this article.
<- africa_covid_cases_long %>%
first_cases_africa select(date=date_format,region=AFRICAN_REGION, country=COUNTRY_NAME, cases) %>%
group_by(region,country) %>%
filter(cases>0) %>%
filter(date == min(date, na.rm=TRUE)) %>%
ungroup()
<- first_cases_africa %>%
first_cases_africa_table select(region,country,date) %>%
group_by(region) %>%
arrange(date) %>%
gt() %>%
tab_header(
title = md("When did countries in Africa record their first case of COVID-19?")
%>%
) fmt_date(
columns = date,
date_style = 4
%>%
) opt_all_caps() %>%
#Use the Chivo font
#Note the great 'google_font' function in 'gt' that removes the need to pre-load fonts
opt_table_font(
font = list(
google_font("Chivo"),
default_fonts()
)%>%
) cols_label(
country = "Country",
date = "Date"
%>%
) cols_align(
align = "center",
columns = c(country, date)
%>%
) tab_options(
column_labels.border.top.width = px(3),
column_labels.border.top.color = "transparent",
table.border.top.color = "transparent",
table.border.bottom.color = "transparent",
data_row.padding = px(3),
source_notes.font.size = 12,
heading.align = "left",
#Adjust grouped rows to make them stand out
row_group.background.color = "grey") %>%
tab_source_note(source_note = "Data: Compiled from national governments and WHO by Humanitarian Emergency Response Africa (HERA)")
first_cases_africa_table
When did countries in Africa record their first case of COVID-19? | |
---|---|
Country | Date |
Northern Africa | |
Egypt | Friday 14 February 2020 |
Algeria | Tuesday 25 February 2020 |
Morocco | Monday 2 March 2020 |
Tunisia | Monday 2 March 2020 |
Mauritania | Friday 13 March 2020 |
Libya | Tuesday 24 March 2020 |
Western Africa | |
Nigeria | Thursday 27 February 2020 |
Senegal | Friday 28 February 2020 |
Togo | Friday 6 March 2020 |
Burkina Faso | Monday 9 March 2020 |
Cote d'Ivoire | Wednesday 11 March 2020 |
Ghana | Thursday 12 March 2020 |
Guinea | Thursday 12 March 2020 |
Benin | Monday 16 March 2020 |
Gambia | Monday 16 March 2020 |
Liberia | Monday 16 March 2020 |
Niger | Thursday 19 March 2020 |
Guinea-Bissau | Wednesday 25 March 2020 |
Mali | Wednesday 25 March 2020 |
Sierra Leone | Tuesday 31 March 2020 |
Southern Africa | |
South Africa | Thursday 5 March 2020 |
Eswatini | Saturday 14 March 2020 |
Namibia | Saturday 14 March 2020 |
Zimbabwe | Friday 20 March 2020 |
Angola | Saturday 21 March 2020 |
Zambia | Sunday 22 March 2020 |
Mozambique | Monday 23 March 2020 |
Botswana | Monday 30 March 2020 |
Malawi | Thursday 2 April 2020 |
Lesotho | Tuesday 12 May 2020 |
Central Africa | |
Cameroon | Friday 6 March 2020 |
Democratic Republic of the Congo | Tuesday 10 March 2020 |
Gabon | Thursday 12 March 2020 |
Central African Republic | Saturday 14 March 2020 |
Equatorial Guinea | Saturday 14 March 2020 |
Chad | Thursday 19 March 2020 |
Congo | Sunday 22 March 2020 |
Burundi | Tuesday 31 March 2020 |
Sao Tome and Principe | Monday 6 April 2020 |
Eastern Africa | |
Ethiopia | Friday 13 March 2020 |
Kenya | Friday 13 March 2020 |
Sudan | Friday 13 March 2020 |
Rwanda | Saturday 14 March 2020 |
Somalia | Monday 16 March 2020 |
Mayotte | Tuesday 17 March 2020 |
Tanzania | Tuesday 17 March 2020 |
Djibouti | Wednesday 18 March 2020 |
Mauritius | Wednesday 18 March 2020 |
Madagascar | Friday 20 March 2020 |
Eritrea | Saturday 21 March 2020 |
Uganda | Saturday 21 March 2020 |
South Sudan | Monday 6 April 2020 |
Comoros | Thursday 30 April 2020 |
Data: Compiled from national governments and WHO by Humanitarian Emergency Response Africa (HERA) |