Chapter 11 More Granular Data on Vaccines and Deaths

Malaysia has made publicly available larger, more detailed, and granular data sets on Covid. We “copy and paste” below the description of the csv datasets from the GitHub depository.49

1. Cases

All aggregated data on cases available via Github is derived from this line list.

  • date : yyyy-mm-dd format; date of case
  • days_dose1 : number of days between the positive sample date and the individual’s first dose (if any); values 0 or less are nulled
  • days_dose2 : number of days between the positive sample date and the individual’s second dose (if any); values 0 or less are nulled
  • vaxtype : p = Pfizer, s = Sinovac, a = AstraZeneca, c = Cansino, m = Moderna, h = Sinopharm, j = Janssen, u = unverified (pending sync with VMS)
  • import : binary variable with 1 denoting an imported case and 0 denoting local transmission cluster : binary variable with 1 denoting cluster-based transmission and 0 denoting and unlinked case symptomatic : binary variable with 1 denoting an individual presenting with symptoms at the point of testing state : state of residence, coded as an integer (refer to param_geo.csv) district : district of residence, coded as an integer (refer to param_geo.csv) age : age as an integer male : binary variable with 1 denoting male and 0 denoting female malaysian : binary variable with 1 denoting Malaysian and 0 denoting non-Malaysian

2. Deaths

All aggregated data on deaths available via Github is derived from this line list. The deaths line list was released before the cases line list. As such, it is formatted differently, because several optimizations had to be made to reduce the size of the case line list (in particular, coding as many things as possible as integers).

  • date : yyyy-mm-dd format; date of death
  • date_announced : date on which the death was announced to the public (i.e. registered in the public line list)
  • date_positive : date of positive sample
  • date_dose1 : date of the individual’s first dose (if any)
  • date_dose2 : date of the individual’s second dose (if any)
  • vaxtype : p = Pfizer, s = Sinovac, a = AstraZeneca, c = Cansino, m = Moderna, h = Sinopharm, j = Janssen, u = unverified (pending sync with VMS)
  • state : state of residence
  • age : age as an integer; note that it is possible for age to be 0, denoting infants less than 6 months old
  • male : binary variable with 1 denoting male and 0 denoting female
  • bid : binary variable with 1 denoting brought-in-dead and 0 denoting an inpatient death
  • malaysian : binary variable with 1 denoting Malaysian and 0 denoting non-Malaysian
  • comorb : binary variable with 1 denoting that the individual has comorbidities and 0 denoting no comorbidities declared. (Comorbidity is the state of having multiple medical conditions at the same time, especially when they interact with each other in some way.)

We chose to download the linelist_deaths.csv and param_geo.csv data sets for now. We may add others later. We activate some packages.

library(ggplot2)
library(dplyr)
library(tidyverse)
library(tidyr)
library(readr)

library(patchwork)
# deaths <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/linelist_deaths.csv")
# param_geo <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/linelist/param_geo.csv")

# saveRDS(deaths, "data/deaths.rds")
# saveRDS(param_geo, "data/param_geo.rds")


deaths <- readRDS("data/deaths.rds")
param_geo <- readRDS("data/param_geo.rds")

We downloaded the data on “2021-20-23”. We will visually explore this data frame. Initial data exploration should reveal some questions to further explore. We will focus on the deaths and vaccinations. All the data is in the deaths data frame.

11.1 Data frame structure

deaths

sapply(deaths, typeof)
##           date date_announced  date_positive     date_dose1     date_dose2 
##    "character"    "character"    "character"    "character"    "character" 
##        vaxtype          state            age           male            bid 
##    "character"    "character"      "integer"      "integer"      "integer" 
##      malaysian         comorb 
##      "integer"      "integer"

Note that the date, date_announced, date_positive, date_dose1, and date_dose2 columns are of type character. We will have to make the necessary adjustments when we use them.

param_geo

sapply(param_geo, typeof)
##       state    district        idxs        idxd 
## "character" "character"   "integer"   "integer"

We change the following columns (male, bid, malaysian, comorb) from type integer to character.

deaths %>% 
  mutate(jantina = ifelse(male == 1, "M", "F"),
         BID = ifelse(bid == 1, "BID", "InPatient"),
         warga = ifelse(malaysian == 1, "Tempatan", "Asing"),
         sebab = ifelse(comorb == 1, "Comorbiditi", "Covid")) %>% 
  select(-male, -bid, -malaysian, -comorb) -> deaths
deaths %>% head(10)
##          date date_announced date_positive date_dose1 date_dose2 vaxtype
## 1  2020-03-17     2020-03-17    2020-03-12                              
## 2  2020-03-17     2020-03-17    2020-03-15                              
## 3  2020-03-20     2020-03-20    2020-03-11                              
## 4  2020-03-21     2020-03-21    2020-03-17                              
## 5  2020-03-21     2020-03-21    2020-03-13                              
## 6  2020-03-21     2020-03-21    2020-03-21                              
## 7  2020-03-21     2020-03-21    2020-03-14                              
## 8  2020-03-22     2020-03-22    2020-03-18                              
## 9  2020-03-22     2020-03-22    2020-03-14                              
## 10 2020-03-22     2020-03-22    2020-03-20                              
##                state age jantina       BID    warga       sebab
## 1              Johor  34       M InPatient Tempatan Comorbiditi
## 2            Sarawak  60       M InPatient Tempatan Comorbiditi
## 3              Sabah  58       M InPatient Tempatan Comorbiditi
## 4           Kelantan  69       M InPatient Tempatan Comorbiditi
## 5             Melaka  50       M InPatient Tempatan Comorbiditi
## 6            Sarawak  39       F InPatient Tempatan Comorbiditi
## 7  W.P. Kuala Lumpur  57       M InPatient Tempatan Comorbiditi
## 8             Perlis  48       M InPatient Tempatan Comorbiditi
## 9       Pulau Pinang  73       M InPatient Tempatan Comorbiditi
## 10           Sarawak  80       F       BID Tempatan Comorbiditi

11.1.1 Shape of the data

We run summary to find the ranges for the numeric data.

summary(deaths)
##      date           date_announced     date_positive       date_dose1       
##  Length:28312       Length:28312       Length:28312       Length:28312      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##   date_dose2          vaxtype             state                age        
##  Length:28312       Length:28312       Length:28312       Min.   : -1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 49.00  
##  Mode  :character   Mode  :character   Mode  :character   Median : 62.00  
##                                                           Mean   : 60.75  
##                                                           3rd Qu.: 72.00  
##                                                           Max.   :130.00  
##    jantina              BID               warga              sebab          
##  Length:28312       Length:28312       Length:28312       Length:28312      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
## 

The summary shows the range, [-1, 130] for the age. It is difficult to explain the unborn Covid death (age < 1), we will drop it. The median age is 62, to the right of, but very close to the mean of 60.75. We do the first plot to see the shape of the distribution of the deaths by age.

bins = max(deaths$age)
max_date <- as.Date(max(deaths$date, na.rm = T))
deaths <- deaths %>% filter(age >= 0)
deaths %>% 
  ggplot(aes(x = age)) + 
  geom_histogram(aes(y=..density..), binwidth=1, 
                 fill="bisque", color="white", alpha=0.7) + 
  geom_density() +
  geom_rug() +
  labs(title = "Total Death Distribution by Age", 
       caption = paste0("As of ", max_date),
       x = "Age")
Summary distribution of Covid deaths by age

Figure 11.1: Summary distribution of Covid deaths by age

The frequency chart with a density curve in Figure 11.1 shows a bell-shaped distribution. We repeat the frequency plot and facet it by states to see the distribution.

deaths %>% 
  filter(age >= 0) %>% 
  ggplot(aes(x = age)) + 
  geom_histogram(aes(y=..density..), binwidth=1, 
                 fill="bisque", color="white", alpha=0.7) + 
  geom_density() + 
  facet_wrap(~state)
Summary distribution of Covid deaths by age faceted by state

Figure 11.2: Summary distribution of Covid deaths by age faceted by state

  labs(title = "Total Death Distribution by Age", 
       caption = paste0("As of ", max_date),
       x = "Age")
## $x
## [1] "Age"
## 
## $title
## [1] "Total Death Distribution by Age"
## 
## $caption
## [1] "As of 2021-10-21"
## 
## attr(,"class")
## [1] "labels"

The bell shape seems consistent. While the density curve is informative, it can be too technical. A better way is binning the values into discrete categories and plotting the count of each bin in bars.

11.1.2 Summarizing data

We count the new Covid deaths and use group_by to group the data based on the granular categories we need. We permanently remove all data with age < 0 from the deaths data frame.

deaths <- deaths %>% filter(age >= 0)

We look at a simple visual distribution of deaths by age. Follow the flow of the pipe operator %>% (read %>% as “and then”) to understand the sequential steps in drawing the plots. We set (binwidth = 1).

deaths %>% 
  group_by(age, state) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = state)) + 
  geom_histogram(binwidth = 1) +
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Color by state",
       caption = paste0("As of ", max_date),
       x = "Age",
       y = "Number of deaths",
       fill = "State")
Histogram of Covid deaths by age

Figure 11.3: Histogram of Covid deaths by age

Figure 11.3 may have an outlier data for the maximum death case of 130 years old. In Figure 11.4, we repeat the histogram but with a more coarse (binwidth = 10), ([1-10 years], [11-20 years], and so on).

deaths %>% 
  group_by(age, state) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = state)) + 
  geom_histogram(binwidth = 10) +
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Color by state and age group of 10 years",
       caption = paste0("As of ", max_date),
       x = "Age",
       y = "Number of deaths",
       fill = "State")
Histogram of Covid deaths by age group

Figure 11.4: Histogram of Covid deaths by age group

Next we look at morbidity cases. We also add a kernel density plot.

deaths %>% 
  group_by(age, sebab) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = sebab)) + 
  geom_histogram(binwidth = 10) + 
  labs(title = "Total Death Distribution by Age", 
       x = "Age",
       y = "Number of deaths",
       fill = "Cause") -> p1

deaths %>% 
  group_by(age, sebab) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = sebab)) + 
  geom_density(alpha = 0.5) + 
  labs(title = "Total Death Distribution by Age", 
       x = "Age",
       y = "Number of deaths",
       fill = "Cause") -> p2

cowplot::plot_grid(
     p1,
     p2, ncol = 1)
Covid deaths by age group and morbidity

Figure 11.5: Covid deaths by age group and morbidity

Figure 11.5 shows an almost equal distribution of morbidity and non-morbidity death cases.

Facets

Figure 11.3 and Figure 11.4 look busy with too many colors (16 states). In Figure 11.6, we define facets by the 16 levels of the state column. We also look at the death cases by jantina.

deaths %>% 
  group_by(age, state, jantina) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = jantina)) + 
  geom_histogram(binwidth = 10) + 
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Facet by states", 
       caption = "Fixed y-axis scales to compare the actual values", 
       x = "Age",
       y = "Number of deaths",
       fill = "Gender") +
  scale_y_continuous(breaks = seq(0, 300, 50)) + 
  facet_wrap(~state) 
Deaths by age and gender faceted by state

Figure 11.6: Deaths by age and gender faceted by state

We repeat Figure 11.6 for scales = "free".

deaths %>% 
  group_by(age, state, jantina) %>% 
  summarize(count = n()) %>% 
  ggplot(aes(x = age, fill = jantina)) + 
  geom_histogram(binwidth = 10) + 
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Facet by states", 
       caption = "Free y-axis scales to compare the actual trends", 
       x = "Age",
       y = "Number of deaths",
       fill = "Gender") + 
  facet_wrap(~state, scales = "free") 
Deaths by age and gender faceted by state and free scale

Figure 11.7: Deaths by age and gender faceted by state and free scale

We have visually explored the Covid death count distribution by age and state while checking on some categories like jantina and sebab. Figures 11.1 to 11.4 show a normal bell curve distribution of deaths by age group. Figure 11.5 shows the bell shape varies only slightly when we split the comorbidity cases. Figure 11.7 shows that the bell shape is approximately reflected for most states.

Are the shapes expected, in the sense that larger data samples will generate a “more bell-shaped” distribution?50

11.1.3 Compute Means by Group

We want to compare the mean, median, and weighted mean for a subgroup of our data. We calculate several weighted means of the deaths by age.

deaths %>% 
  group_by(age, state) %>% 
  summarize(count = n()) %>% 
  group_by(state) %>% 
  summarise(meandeaths = mean(count, na.rm = TRUE), 
            mediandeaths = median(count, na.rm = TRUE), 
            meanage = mean(age, na.rm = TRUE),
            medianage = mean(age, na.rm = TRUE),
            weighted_mean = weighted.mean(count, age, na.rm = TRUE)) %>% 
  arrange(-meandeaths)
## # A tibble: 16 x 6
##    state             meandeaths mediandeaths meanage medianage weighted_mean
##    <chr>                  <dbl>        <dbl>   <dbl>     <dbl>         <dbl>
##  1 Selangor              104.           98      54.1      54.1        111.  
##  2 Johor                  37.3          38      56.2      56.2         38.7 
##  3 W.P. Kuala Lumpur      30.2          32      58.9      58.9         31.3 
##  4 Sabah                  23.5          14      53.5      53.5         28.5 
##  5 Kedah                  22.7          20      54.9      54.9         25.0 
##  6 Pulau Pinang           17.9          16      57.3      57.3         20.2 
##  7 Negeri Sembilan        15.1          13      58.9      58.9         16.2 
##  8 Sarawak                14.2          10      58.4      58.4         16.5 
##  9 Perak                  13.2          10.5    56.5      56.5         14.9 
## 10 Kelantan               11.4           8      57.9      57.9         13.0 
## 11 Melaka                 11.0          11      56.9      56.9         11.5 
## 12 Pahang                  8.56          7      57.1      57.1          9.01
## 13 Terengganu              5.41          4      54.1      54.1          6.19
## 14 W.P. Labuan             2.76          3      58.8      58.8          2.73
## 15 Perlis                  2.35          2      61.6      61.6          2.49
## 16 W.P. Putrajaya          1.21          1      64.8      64.8          1.22

We plot the above.

deaths %>% 
  group_by(age, state) %>% 
  summarize(count = n()) %>% 
  group_by(state) %>% 
  summarise(meandeaths = mean(count, na.rm = TRUE), 
            mediandeaths = median(count, na.rm = TRUE), 
            meanage = mean(age, na.rm = TRUE),
            medianage = mean(age, na.rm = TRUE),
            weighted_mean = weighted.mean(count, age, na.rm = TRUE)) %>% 
  gather("metric", "value", 2:6) %>% 
  ggplot(aes(x = state, 
             y = value, 
             color = metric,
             shape = metric)) + 
  geom_point() +
  coord_flip() +
  labs(title = "Daily Death Averages by State", 
       caption = paste0("As of ", max_date),
       x = "State",
       y = "",
       color = "Metric",
       shape = "Metric")
Mean, median and weighted means of daily Covid deaths by state

Figure 11.8: Mean, median and weighted means of daily Covid deaths by state

Figure 11.8 shows that the three metrics are quite close to each other for all the states. Normal distributions are bell-shaped and symmetric. The mean and median are equal. Thus we see that the plots are bell-shaped but not perfectly symmetric.

The Central Limit Theorem says that, if we take the mean of a large number of independent samples, then the distribution of that mean will be approximately normal. In practice, if the population we are sampling from is symmetric with no outliers, then we can start to see a good approximation to normality after as few as 15-20 samples. If the population is moderately skew, then it can take between 30-50 samples before getting a good approximation.51 We see that in Figures 11.1 to 11.3 with 28306 data points.

11.2 Vaccination and deaths

Of direct interest is the relationship between vaccination and deaths which we covered in Chapter 10. Now we have more granular data on both the deaths and vaccinations. We have to at least filter the deaths data frame to only reflect the data since the start of the vaccination program on “2021-02-25”. We can group_by(state, age, vaxtype, jantina, BID, warga, sebab) and count the number of death cases.

min_date <- as.Date("2021-02-25")
max_date <- as.Date(max(deaths$date, na.rm = T))

deaths %>% 
  mutate(vaxtype = replace(vaxtype, vaxtype == "", "0-None")) -> deaths

deaths %>% 
  filter(as.Date(date) >= min_date) %>%
  group_by(state, age, vaxtype, jantina, BID, warga, sebab) %>% 
  summarize(count = n()) %>% 
  filter(age >= 0) %>% 
  head(15)
## # A tibble: 15 x 8
## # Groups:   state, age, vaxtype, jantina, BID, warga [14]
##    state   age vaxtype jantina BID       warga    sebab       count
##    <chr> <int> <chr>   <chr>   <chr>     <chr>    <chr>       <int>
##  1 Johor     0 0-None  F       BID       Asing    Comorbiditi     1
##  2 Johor     1 0-None  F       InPatient Tempatan Comorbiditi     1
##  3 Johor     7 0-None  M       InPatient Tempatan Comorbiditi     1
##  4 Johor     8 0-None  M       InPatient Tempatan Comorbiditi     1
##  5 Johor    10 0-None  M       BID       Tempatan Comorbiditi     1
##  6 Johor    11 0-None  M       InPatient Tempatan Comorbiditi     1
##  7 Johor    13 0-None  M       BID       Tempatan Comorbiditi     1
##  8 Johor    14 0-None  F       InPatient Tempatan Comorbiditi     1
##  9 Johor    14 0-None  M       InPatient Tempatan Comorbiditi     1
## 10 Johor    16 0-None  F       InPatient Tempatan Comorbiditi     1
## 11 Johor    18 0-None  F       InPatient Tempatan Comorbiditi     2
## 12 Johor    18 0-None  F       InPatient Tempatan Covid           1
## 13 Johor    18 Pfizer  M       InPatient Tempatan Comorbiditi     1
## 14 Johor    20 0-None  F       InPatient Tempatan Comorbiditi     2
## 15 Johor    20 0-None  M       BID       Asing    Covid           1

We first plot the distribution of the deaths by age and identify the category by the vaccination type. We only need two categories as reflected by group_by(age, vaxtype). With select(age, vaxtype) we have created a uni-variable temporary tibble or data frame (see Chapter 3 on Single Variable Graphs).

deaths %>% 
  filter(as.Date(date) >= min_date) %>%
  group_by(age, vaxtype) %>% 
  select(age, vaxtype) %>% 
  arrange(age) %>%
  ggplot(aes(x = age, fill = vaxtype)) + 
  geom_histogram(binwidth = 5) +
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Color by Vaccine Type and age group of 5 years",
       caption = paste0("Between ", min_date,  " and ", max_date),
       x = "Age",
       y = "Number of deaths",
       fill = "Vaccine type")
Summary distribution of Covid deaths by age group and vaccine type

Figure 11.9: Summary distribution of Covid deaths by age group and vaccine type

Figure 11.9 shows that from the summary data the death cases between those vaccinated and not vaccinated are about the same (visually). AstraZeneca has the lowest cases within the data set. But we must rule out other causes of death since they are all Covid vaccines. We separate Figure 11.9 neatly with a simple facet.

deaths %>% 
  filter(as.Date(date) >= min_date) %>%
  group_by(age, vaxtype, sebab) %>% 
  select(age, vaxtype, sebab) %>% 
  arrange(age) %>% 
  ggplot(aes(x = age, fill = vaxtype)) + 
  geom_histogram(binwidth = 5) +
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Color by Vaccine Type and age group of 5 years",
       caption = paste0("Between ", min_date,  " and ", max_date),
       x = "Age",
       y = "Number of deaths",
       fill = "Vaccine type") + 
  facet_wrap(~sebab)
Summary distribution of Covid deaths by age group,  vaccine type and cause

Figure 11.10: Summary distribution of Covid deaths by age group, vaccine type and cause

Figure 11.10 does not paint a good picture about vaccinations to prevent deaths due to Covid.

We look at the breakdown by gender.

deaths %>% 
  filter(as.Date(date) >= min_date) %>%
  group_by(age, vaxtype, sebab) %>% 
  select(age, vaxtype, sebab, jantina) %>% 
  arrange(age) %>% 
  ggplot(aes(x = age, fill = vaxtype)) + 
  geom_histogram(binwidth = 5) +
  labs(title = "Total Death Distribution by Age", 
       subtitle = "Color by Vaccine Type and age group of 5 years",
       caption = paste0("Between ", min_date,  " and ", max_date),
       x = "Age",
       y = "Number of deaths",
       fill = "Vaccine type") + 
  facet_grid(jantina ~ sebab)
Summary distribution of Covid deaths by age group and vaccine type with dual facet

Figure 11.11: Summary distribution of Covid deaths by age group and vaccine type with dual facet

Figure 11.11 confirms the visual trend shown in Figure 11.9. Based on the Malaysian case, we cannot conclude that Covid vaccinations prevent Covid related deaths.

11.4 Discussion

This is an initial visual exploration of the deaths data frame. It is not prudent to test any hypothesis or develop models until we have more specific questions. Many of the visuals we painted from the data raise questions rather than provide answers. The deaths data frame has 28306 observations. Visualization helps us see the relationships between the 12 variables.

In general, the visuals show that the jury is still out there on the effectiveness of Covid vaccinations in preventing Covid deaths.

Correlation studies between vaccination, COVID-19 incidence, and mortality using the data available up to April 23, 2021, show that 60 countries presented positive correlations and 27 countries with negative correlation for the number of new cases. This finding reinforces the need to keep social distance and the use of face masks recommendations to reduce the virus transmission. The same approach was employed with the number of new deaths and 37 countries show positive correlations and 33 countries have negative correlations. Malaysia showed negative correlations for both cases. These results show that the implementation of vaccines is not the final solution and the maintenance of the non-pharmacological interventions should not be abandoned.56

To further analyze the data more accurately requires some knowledge of statistics, modeling, and hypothesis testing, as evidenced from the reference above. This is beyond the scope of this book.

References