Chapter 1 Simple Data Science Process

The data science process has 6 parts as shown in Figure 1.1 .2

Data science process

Figure 1.1: Data science process

Data import, data tidying, data transformation and processing, visualization, modeling, and the formation of repeatable reports should all be done in one integrated environment and program. The entire analysis and exploration process should be completed in one program code. This method is very helpful for training our data thinking. Thus three tools become indispensable when doing data science in R.

  1. RStudio3
  2. R Markdown4
  3. tidyverse package5

Please refer to the guide here to get started on RStudio.

1.1 The tidyverse family

The tidyverse is a collection of R packages designed for data science as shown in Figure 1.26. All packages share an underlying design philosophy, grammar, and data structures.

Components of Tidyverse package

Figure 1.2: Components of Tidyverse package

1.2 Set up and load packages

We set up our R working environment by activating the following packages that we will use in Part I, Part II, and Part III. Readers will have to download and install these packages on their respective computers first before running the following chunk of R code. We will repeat the library function to activate other packages as we need them.

library(dplyr)
library(tidyverse)
library(readr) # Loading the data
library(tidyr)
library(sf) # For the maps 
library(sp) # Transform coordinates 
library(ggplot2)
library(viridis)
library(igraph) # build network
library(spdep) # builds network
library(tidygraph)
library(ggraph) # for plotting networks
library(cowplot)
library(zoo)

1.3 Data Preparation

Based on the process shown in Figure 1.1 we must first import our Malaysian Covid data from an external source into R. The data will be in an R data frame and/or tibble. There are differences between the two formats but the later versions of R will default to a tibble. We will use the terms dataset, data frame, and tibble interchangeably. The last two terms have a specific meaning in R.

1.4 Importing data

R can import data from almost any source; text files, Excel spreadsheets, and databases. The Malaysia Covid data are all in csv text files. So we will only show how to download such files into R data frames.

1.4.1 Open data on COVID-19 in Malaysia7

The scope and granularity of data in this depository will evolve over time. We “copy and paste” below the description of the csv datasets from the github depository.

  1. Cases and Testing
    • cases_malaysia.csv: Daily recorded COVID-19 cases at country level, as of 1200 of date.
    • cases_state.csv: Daily recorded COVID-19 cases at the state level, as of 1200 of date.
    • clusters.csv: Exhaustive list of announced clusters with relevant epidemiological data points, as of 2359 of date of update.
    • tests_malaysia.csv: Daily tests (note: not necessarily unique individuals) by type at country level, as of 1200 of date.
  2. Healthcare
    • pkrc.csv: Flow of patients to/out of Covid-19 Quarantine and Treatment Centres (PKRC), with capacity and utilization as of 2359 of date.
    • hospital.csv: Flow of patients to/out of hospitals, with capacity and utilization as of 2359 of date.
    • icu.csv: Capacity and utilization of intensive care unit (ICU) beds as of 2359 of date.
  3. Deaths
    • deaths_malaysia.csv: Daily deaths due to COVID-19 at country level, as of 1200 of date.
    • deaths_state.csv: Daily deaths due to COVID-19 at the state level, as of 1200 of date.
  4. Vaccination MoH collaborates with MoSTI and the COVID-19 Immunisation Task Force (CITF) to publish open data on Malaysia’s vaccination rollout.8
    • vax_state.csv: Daily and cumulative vaccination at the state level, as at 2359 of date.
  5. Mobility and Contact Tracing
    • checkin_malaysia.csv: Daily check-ins on MySejahtera at country level, as of 2359 of date.
    • checkin_malaysia_time.csv: Time distribution of daily check-ins on MySejahtera at country level, as of 2359 of date.
    • trace_malaysia.csv: Daily casual contacts traced and hotspots identified by HIDE, at country level, as of 2359 of date.
  6. Static data
    • population.csv: Total, adult (18+), and elderly (60+) population at the state level.

Static data will (probably) remain unchanged for the duration of the program, barring an update from the source.

The readr package which is part of tidyverse as shown in Figure 1.2, provides functions for importing delimited text files into R data frames. These functions assume that the first line of the data contains the variable or column names, the values are separated by commas or tabs respectively and that missing data are represented by blanks. The good news is that all the csv files listed above follow this format, so we can simply download the raw data on an ’as-is" basis.

We run this R chunk for the first time to get the latest data. Then we safe the files in the R rds format and use it throughout the examples in the book. Readers should edit the R chunk below by removing the # comment at the beginning of the line for the commands to be executed. There are 3 groups of commands to

  1. Download the required files from the source into an R data frame
  2. Save the R data frame into the current working directory
  3. Load the R data frame

We chose to download only 8 of the csv datasets for now. We may add others later. Note that the vax_state.csv file is downloaded from another depository.

library(readr)

# mys1 <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/cases_malaysia.csv")
# mysstates <- read_csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/cases_state.csv")
# mysclusters <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/clusters.csv")
# mystests <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/tests_malaysia.csv")
# mysdeaths <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/deaths_state.csv")
# popn <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/static/population.csv")
# vacn <- read.csv("https://raw.githubusercontent.com/CITF-Malaysia/citf-public/main/vaccination/vax_state.csv")
# hospital <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/hospital.csv")
# icu <- read.csv("https://raw.githubusercontent.com/MoH-Malaysia/covid19-public/main/epidemic/icu.csv")

# saveRDS(mys1, "data/mys1.rds")
# saveRDS(mysstates, "data/mysstates.rds")
# saveRDS(mysclusters, "data/mysclusters.rds")
# saveRDS(mysdeaths, "data/mysdeaths.rds")
# saveRDS(popn, "data/popn.rds")
# saveRDS(vacn, "data/vacn.rds")
# saveRDS(hospital, "data/hospital.rds")
# saveRDS(icu, "data/icu.rds")

mys1 <- readRDS("data/mys1.rds")
mysstates <- readRDS("data/mysstates.rds")
mysclusters <- readRDS("data/mysclusters.rds")
mysdeaths <- readRDS("data/mysdeaths.rds")
popn <- readRDS("data/popn.rds")
vacn <- readRDS("data/vacn.rds")
hospital <- readRDS("data/hospital.rds")
icu <- readRDS("data/icu.rds")

1.5 Basic data manipulation (wrangling) commands

We will show some examples of using the dplyr and tidyr packages with some selected Malaysia Covid datasets that we have downloaded. Earlier we have enabled or activated these packages using the library function. Otherwise, they can be explicitly activated with the package::function notation.

  • dplyr::select - select variables/columns (we will mainly use columns in the book)
  • dplyr::filter - select observations/rows (we will mainly use rows in the book)
  • dplyr::mutate - transform existing columns or create new columns
  • dplyr::summarize - summarize data
  • dplyr::group_by - identify subgroups for further processing
  • tidyr::gather - convert wide format data frame to long format
  • tidyr::spread - convert long format data frame to wide format

Many other functions to manipulate data can be found here.9

1.5.1 Data frame structure

The following code displays the column or variable name and type for each data frame we created from the downloaded datasets.

mys1

sapply(mys1, typeof)
##                    date               cases_new            cases_import 
##             "character"               "integer"               "integer" 
##         cases_recovered          cluster_import       cluster_religious 
##               "integer"               "integer"               "integer" 
##       cluster_community        cluster_highRisk       cluster_education 
##               "integer"               "integer"               "integer" 
## cluster_detentionCentre       cluster_workplace 
##               "integer"               "integer"

mysclusters

sapply(mysclusters, typeof)
##         cluster           state        district  date_announced date_last_onset 
##     "character"     "character"     "character"     "character"     "character" 
##        category          status       cases_new     cases_total    cases_active 
##     "character"     "character"       "integer"       "integer"       "integer" 
##           tests             icu          deaths       recovered 
##       "integer"       "integer"       "integer"       "integer"

mysstates

sapply(mysstates, typeof)
##            date           state    cases_import       cases_new cases_recovered 
##        "double"     "character"        "double"        "double"        "double"

mysdeaths

sapply(mysdeaths, typeof)
##           date          state     deaths_new     deaths_bid deaths_new_dod 
##    "character"    "character"      "integer"      "integer"      "integer" 
## deaths_bid_dod    deaths_pvax    deaths_fvax     deaths_tat 
##      "integer"      "integer"      "integer"      "integer"

popn

sapply(popn, typeof)
##       state        idxs         pop      pop_18      pop_60 
## "character"   "integer"   "integer"   "integer"   "integer"

vacn

sapply(vacn, typeof)
##                date               state       daily_partial          daily_full 
##         "character"         "character"           "integer"           "integer" 
##               daily daily_partial_child    daily_full_child       cumul_partial 
##           "integer"           "integer"           "integer"           "integer" 
##          cumul_full               cumul cumul_partial_child    cumul_full_child 
##           "integer"           "integer"           "integer"           "integer" 
##             pfizer1             pfizer2            sinovac1            sinovac2 
##           "integer"           "integer"           "integer"           "integer" 
##              astra1              astra2             cansino             pending 
##           "integer"           "integer"           "integer"           "integer"

hospital

sapply(hospital, typeof)
##             date            state             beds       beds_covid 
##      "character"      "character"        "integer"        "integer" 
##     beds_noncrit     admitted_pui   admitted_covid   admitted_total 
##        "integer"        "integer"        "integer"        "integer" 
##   discharged_pui discharged_covid discharged_total       hosp_covid 
##        "integer"        "integer"        "integer"        "integer" 
##         hosp_pui    hosp_noncovid 
##        "integer"        "integer"

icu

sapply(icu, typeof)
##           date          state       beds_icu   beds_icu_rep beds_icu_total 
##    "character"    "character"      "integer"      "integer"      "integer" 
## beds_icu_covid           vent      vent_port      icu_covid        icu_pui 
##      "integer"      "integer"      "integer"      "integer"      "integer" 
##   icu_noncovid     vent_covid       vent_pui  vent_noncovid      vent_used 
##      "integer"      "integer"      "integer"      "integer"      "integer" 
## vent_port_used 
##      "integer"

Notice that not all the date variables are of the same type, some are defined as character and some as double. These were defined by the data providers. We will have to make the necessary adjustments when we use them.

1.5.2 Selecting columns

The select function allows us to limit the data frame to specified variables (columns).

  • Select from mysclusters data frame the columns cluster, state, district.
  • Store the results in a new data frame tmp. We show two similar ways using the -> and <- assign operator. We can also use =.
  • Display the first 10 records or rows using head.
  • Display the last 10 records or rows using tail.
# library(dplyr) (only if it has not been loaded)

select(mysclusters, cluster, state, district) -> tmp
tmp <- select(mysclusters, cluster, state, district)
head(tmp, 10)
##                                                                                  cluster
## 1                                                                    Kluster Warga Jepun
## 2                                                                            Kluster UDA
## 3                                               Kluster Keagamaan (Tabligh Sri Petaling)
## 4                                                              Kluster Tabligh New Delhi
## 5                                                                          Kluster AIROD
## 6                                                                      Kluster PUI Itali
## 7         Subkluster Tabligh Sri Petaling - Kluster Majlis Perkahwinan Bandar Baru Bangi
## 8                                                         Kluster Persidangan Gereja GNF
## 9  Subkluster Tabligh Sri Petaling - Kluster Bandar Baru Ibrahim Majid & kawasan sekitar
## 10                       Subkluster Tabligh Sri Petaling - Kluster Menara Plaza City One
##                                                                                                           state
## 1                                                                                               WP Kuala Lumpur
## 2                                              WP Kuala Lumpur, WP Putrajaya, Selangor, Negeri Sembilan & Johor
## 3                                                                                                  Semua Negeri
## 4                                                     Selangor, Negeri Sembilan, WP Kuala Lumpur & WP Putrajaya
## 5                                                                                                        Pahang
## 6                                                                                                       Sarawak
## 7  WP Kuala Lumpur, WP Putrajaya, Kedah, Selangor, Perak, Kelantan, Terengganu, Johor, Melaka & Negeri Sembilan
## 8                                                                                                       Sarawak
## 9                                                                                                         Johor
## 10                                                                                              WP Kuala Lumpur
##                       district
## 1                Lembah Pantai
## 2                     Pelbagai
## 3                     Pelbagai
## 4                     Pelbagai
## 5                      Kuantan
## 6                      Kuching
## 7                     Pelbagai
## 8  Kuching, Samarahan & Serian
## 9                       Kluang
## 10               Lembah Pantai
tail(tmp, 10)
##                                            cluster    state
## 5468                  Kluster Industri Sri Tanjung    Johor
## 5469       Kluster Industri Dua Anggerik Mokara 48 Selangor
## 5470                     Kluster Dua Jalan Puchong Selangor
## 5471            Kluster Tapak Bina Jalan Kamunting    Perak
## 5472            Kluster Jalan Teknologi Dua Gopeng    Perak
## 5473                   Kluster Kampung Dalam Pupoh Kelantan
## 5474           Kluster Kampung Padang Tok La Hilir Kelantan
## 5475                            Kluster Long Pasia    Sabah
## 5476             Kluster Kampung Jejawi Alor Pasir Kelantan
## 5477 Kluster Kampung Simpang Tiga Banggol Petaling Kelantan
##                          district
## 5468                Tangkak, Muar
## 5469 Klang, Petaling, Hulu Langat
## 5470                     Petaling
## 5471     Larut, Matang dan Selama
## 5472                Kampar, Kinta
## 5473                   Kota Bharu
## 5474                    Pasir Mas
## 5475                     Sipitang
## 5476                    Pasir Mas
## 5477                    Pasir Mas

1.5.3 Using pipes

dplyr and tidyr allow us to write our code in a compact format using the pipe %>% operator. It also shows the flow of the operations.

Let us redo the above using %>%. We also show how to sort of unselect columns from a data frame.

mysclusters %>% select(cluster, state, district) %>% head(10)
##                                                                                  cluster
## 1                                                                    Kluster Warga Jepun
## 2                                                                            Kluster UDA
## 3                                               Kluster Keagamaan (Tabligh Sri Petaling)
## 4                                                              Kluster Tabligh New Delhi
## 5                                                                          Kluster AIROD
## 6                                                                      Kluster PUI Itali
## 7         Subkluster Tabligh Sri Petaling - Kluster Majlis Perkahwinan Bandar Baru Bangi
## 8                                                         Kluster Persidangan Gereja GNF
## 9  Subkluster Tabligh Sri Petaling - Kluster Bandar Baru Ibrahim Majid & kawasan sekitar
## 10                       Subkluster Tabligh Sri Petaling - Kluster Menara Plaza City One
##                                                                                                           state
## 1                                                                                               WP Kuala Lumpur
## 2                                              WP Kuala Lumpur, WP Putrajaya, Selangor, Negeri Sembilan & Johor
## 3                                                                                                  Semua Negeri
## 4                                                     Selangor, Negeri Sembilan, WP Kuala Lumpur & WP Putrajaya
## 5                                                                                                        Pahang
## 6                                                                                                       Sarawak
## 7  WP Kuala Lumpur, WP Putrajaya, Kedah, Selangor, Perak, Kelantan, Terengganu, Johor, Melaka & Negeri Sembilan
## 8                                                                                                       Sarawak
## 9                                                                                                         Johor
## 10                                                                                              WP Kuala Lumpur
##                       district
## 1                Lembah Pantai
## 2                     Pelbagai
## 3                     Pelbagai
## 4                     Pelbagai
## 5                      Kuantan
## 6                      Kuching
## 7                     Pelbagai
## 8  Kuching, Samarahan & Serian
## 9                       Kluang
## 10               Lembah Pantai
mysclusters %>% select(-cluster, -state, -district) %>% tail(10)
##      date_announced date_last_onset  category status cases_new cases_total
## 5468     2021-09-30      2021-09-29 workplace active         9          10
## 5469     2021-09-30      2021-09-29 workplace active        37          44
## 5470     2021-09-30      2021-09-29 workplace active         1          79
## 5471     2021-09-30      2021-09-29 workplace active        20          20
## 5472     2021-09-30      2021-09-29 workplace active        19          36
## 5473     2021-09-30      2021-09-29 community active         4          10
## 5474     2021-09-30      2021-09-29 community active         3          13
## 5475     2021-09-30      2021-09-29 community active        10          25
## 5476     2021-09-30      2021-09-29 community active        10          11
## 5477     2021-09-30      2021-09-29 community active         2           8
##      cases_active tests icu deaths recovered
## 5468           10    21   0      0         0
## 5469           44   113   0      0         0
## 5470           79   318   0      0         0
## 5471           20   116   0      0         0
## 5472           36   155   0      0         0
## 5473           10    15   0      0         0
## 5474           13    16   0      0         0
## 5475           25    25   0      0         0
## 5476           11    11   0      0         0
## 5477            8    10   0      0         0

The %>% operator passes the result on the left to the first parameter of the function on the right. We no longer need to create a temporary data frame like tmp to store the results. Note that mysclusters remain unchanged.

There are 14 columns in mysclusters. We selected 3 in the first line of the code chunk above and the remaining 11 in the second line.

1.5.4 Filtering rows

The filter function allows us to limit our data frame to rows(observations, records) that meet specific criteria. Multiple criteria can be combined with the & (AND) and | (OR) operators.

  • Filter records for Sarawak
mysclusters %>% 
  filter(state == 'Sarawak') %>% 
  head(10)
##                                                 cluster   state
## 1                                     Kluster PUI Itali Sarawak
## 2                        Kluster Persidangan Gereja GNF Sarawak
## 3                         Kluster Hospital Umum Sarawak Sarawak
## 4  Subkluster Tabligh Sri Petaling - Kluster di Kuching Sarawak
## 5                       Kluster Tapak Pembinaan Kuching Sarawak
## 6                                      Kluster Kidurong Sarawak
## 7                    Kluster Syarikat Pembinaan Kuching Sarawak
## 8                          Kluster Kejuruteraan Kuching Sarawak
## 9                                       Kluster Mambong Sarawak
## 10                                      Kluster Jupiter Sarawak
##                         district date_announced date_last_onset  category
## 1                        Kuching     2020-04-05      2020-03-20    import
## 2    Kuching, Samarahan & Serian     2020-04-06      2020-03-15 religious
## 3                        Kuching     2020-04-17      2020-03-31  highRisk
## 4  Kuching, Samarahan & Simunjan     2020-05-04      2020-03-18 religious
## 5                        Kuching     2020-06-18      2020-06-17 workplace
## 6                        Bintulu     2020-06-19      2020-06-13 community
## 7                        Kuching     2020-06-24      2020-07-12 workplace
## 8                        Kuching     2020-07-12      2020-07-12 workplace
## 9            Kuching & Samarahan     2020-07-16      2020-07-15 workplace
## 10                       Bintulu     2020-07-17      2020-07-13    import
##    status cases_new cases_total cases_active tests icu deaths recovered
## 1   ended         0          65            0   629   0      5        60
## 2   ended         0         191            0  6991   0      3       188
## 3   ended         0          68            0  4849   0      0        68
## 4   ended         0          41            0  1034   0      2        39
## 5   ended         0           3            0   315   0      0         3
## 6   ended         0           8            0   200   0      0         8
## 7   ended         0           2            0    91   0      0         2
## 8   ended         0           8            0   189   0      0         8
## 9   ended         0           7            0   293   0      0         7
## 10  ended         0           3            0    35   0      0         3
  • Filter records for Sarawak, Sabah, Selangor
mysclusters %>% 
  filter(state == 'Sarawak' | 
                       state == 'Sabah' | 
                       state == 'Selangor') %>% 
  head(10)
##                                                         cluster    state
## 1                                             Kluster PUI Itali  Sarawak
## 2                                Kluster Persidangan Gereja GNF  Sarawak
## 3  Subkluster Tabligh Sri Petaling - Kluster Kampung Sungai Lui Selangor
## 4                                            Kluster Warga Emas Selangor
## 5                                              Kluster Pakistan    Sabah
## 6                                 Kluster Hospital Sungai Buloh Selangor
## 7                                 Kluster Hospital Umum Sarawak  Sarawak
## 8                               Kluster Petugas Kesihatan Sabah    Sabah
## 9                                       Kluster Seksyen 6 Bangi Selangor
## 10                                        Kluster Selayang Baru Selangor
##                                       district date_announced date_last_onset
## 1                                      Kuching     2020-04-05      2020-03-20
## 2                  Kuching, Samarahan & Serian     2020-04-06      2020-03-15
## 3                                  Hulu Langat     2020-04-12      2020-03-16
## 4                             Klang & Petaling     2020-04-16      2020-03-22
## 5                                        Tawau     2020-04-16      2020-04-13
## 6                                       Gombak     2020-04-17      2020-03-23
## 7                                      Kuching     2020-04-17      2020-03-31
## 8  Kota Kinabalu, Penampang, Keningau & Tuaran     2020-04-18      2020-04-03
## 9                                  Hulu Langat     2020-04-21      2020-03-17
## 10                                      Gombak     2020-04-30      2020-04-28
##     category status cases_new cases_total cases_active tests icu deaths
## 1     import  ended         0          65            0   629   0      5
## 2  religious  ended         0         191            0  6991   0      3
## 3  religious  ended         0         211            0  1088   0      0
## 4   highRisk  ended         0          36            0   242   0      5
## 5  community  ended         0           6            0    12   0      0
## 6   highRisk  ended         0          18            0   354   0      0
## 7   highRisk  ended         0          68            0  4849   0      0
## 8   highRisk  ended         0          35            0  1463   0      0
## 9  community  ended         0          31            0   301   0      2
## 10 community  ended         0          52            0  3802   0      0
##    recovered
## 1         60
## 2        188
## 3        211
## 4         31
## 5          6
## 6         18
## 7         68
## 8         35
## 9         29
## 10        52

We can achieve the same outcome by using the %in% operator where c('Sarawak', 'Sabah', 'Selangor') is the vector of matching observations.

mysclusters %>% 
  filter(state %in% c('Sarawak', 'Sabah', 'Selangor')) %>% 
  head(10)
##                                                         cluster    state
## 1                                             Kluster PUI Itali  Sarawak
## 2                                Kluster Persidangan Gereja GNF  Sarawak
## 3  Subkluster Tabligh Sri Petaling - Kluster Kampung Sungai Lui Selangor
## 4                                            Kluster Warga Emas Selangor
## 5                                              Kluster Pakistan    Sabah
## 6                                 Kluster Hospital Sungai Buloh Selangor
## 7                                 Kluster Hospital Umum Sarawak  Sarawak
## 8                               Kluster Petugas Kesihatan Sabah    Sabah
## 9                                       Kluster Seksyen 6 Bangi Selangor
## 10                                        Kluster Selayang Baru Selangor
##                                       district date_announced date_last_onset
## 1                                      Kuching     2020-04-05      2020-03-20
## 2                  Kuching, Samarahan & Serian     2020-04-06      2020-03-15
## 3                                  Hulu Langat     2020-04-12      2020-03-16
## 4                             Klang & Petaling     2020-04-16      2020-03-22
## 5                                        Tawau     2020-04-16      2020-04-13
## 6                                       Gombak     2020-04-17      2020-03-23
## 7                                      Kuching     2020-04-17      2020-03-31
## 8  Kota Kinabalu, Penampang, Keningau & Tuaran     2020-04-18      2020-04-03
## 9                                  Hulu Langat     2020-04-21      2020-03-17
## 10                                      Gombak     2020-04-30      2020-04-28
##     category status cases_new cases_total cases_active tests icu deaths
## 1     import  ended         0          65            0   629   0      5
## 2  religious  ended         0         191            0  6991   0      3
## 3  religious  ended         0         211            0  1088   0      0
## 4   highRisk  ended         0          36            0   242   0      5
## 5  community  ended         0           6            0    12   0      0
## 6   highRisk  ended         0          18            0   354   0      0
## 7   highRisk  ended         0          68            0  4849   0      0
## 8   highRisk  ended         0          35            0  1463   0      0
## 9  community  ended         0          31            0   301   0      2
## 10 community  ended         0          52            0  3802   0      0
##    recovered
## 1         60
## 2        188
## 3        211
## 4         31
## 5          6
## 6         18
## 7         68
## 8         35
## 9         29
## 10        52

1.5.5 Creating/Changing columns

The mutate function allows us to create new columns or change existing ones.

In the following code, we combine mutate, select, and filter to know the percentage of icu beds occupied in Kelantan. Note carefully that the icu data frame is not changed in any way.

 icu %>% 
  mutate(icucovidbedoccupied = (icu_covid/beds_icu_covid)*100, 
         icubedoccupied = ((icu_covid+icu_noncovid)/beds_icu_total)*100) %>% 
  select(date, state, icucovidbedoccupied, icubedoccupied) %>% 
  filter(state == "Kelantan") %>% 
  head(10)
##          date    state icucovidbedoccupied icubedoccupied
## 1  2020-03-24 Kelantan                   5       53.84615
## 2  2020-03-25 Kelantan                  15       79.54545
## 3  2020-03-26 Kelantan                  20       61.36364
## 4  2020-03-27 Kelantan                  25       75.00000
## 5  2020-03-28 Kelantan                  20       72.72727
## 6  2020-03-29 Kelantan                  24       68.18182
## 7  2020-03-30 Kelantan                  20       70.45455
## 8  2020-03-31 Kelantan                  16       77.27273
## 9  2020-04-01 Kelantan                  20       68.18182
## 10 2020-04-02 Kelantan                  16       75.00000

1.5.6 Summarizing data

The summarize function can be used to reduce multiple values down to a single value (like the sum). It is often used together with the by_group function, to calculate statistics by group. In the code below, the na.rm=TRUE option is used to drop missing values before calculating.

We calculate the mean, median, minimum, and maximum daily new Covid cases for each state. We save the results in a new data frame new_cases_summary'. Again,mysstates` remain unchanged.

mysstates %>% 
  group_by(state) %>% 
  summarize(mean = mean(cases_new, na.rm=TRUE), 
            median = median(cases_new, na.rm=TRUE), 
            min = min(cases_new, na.rm=TRUE), 
            max = max(cases_new, na.rm=TRUE)) -> new_cases_summary
new_cases_summary
## # A tibble: 16 x 5
##    state                mean median   min   max
##    <chr>               <dbl>  <dbl> <dbl> <dbl>
##  1 Johor              320.     44.5     0  2785
##  2 Kedah              220.     16       0  2455
##  3 Kelantan           189.      6.5     0  1573
##  4 Melaka              95.3     5       0  1120
##  5 Negeri Sembilan    158.     33       0  1619
##  6 Pahang             106.      6       0   926
##  7 Perak              168.     24       0  1596
##  8 Perlis               6.45    0       0   113
##  9 Pulau Pinang       213.     35       0  2474
## 10 Sabah              323.     98.5     0  3487
## 11 Sarawak            344.      9.5     0  5291
## 12 Selangor          1103.    261       0  8792
## 13 Terengganu          91.5     3       0   993
## 14 W.P. Kuala Lumpur  303.     68       0  2880
## 15 W.P. Labuan         16.2     1       0   253
## 16 W.P. Putrajaya      10.0     2       0   122

The minimum number of new cases is 0 for all states because the mysstates data frame starts at 2020-01-25 when all states had 0 new cases. We all know that the Sabah state elections and the social distancing, as well as the quarantine SOP non-compliance, was the main trigger that led to the jump in the number of cases after we had almost “flattened the curve”. Let us summarize the results starting 01 October 2020 without saving the results.

We use the arrange function with desc to sort the output starting with the highest mean new cases.

mysstates %>% 
  filter(date >= as.Date('2020-10-01')) %>% 
  group_by(state) %>% 
  summarize(mean = mean(cases_new, na.rm=TRUE), 
            median = median(cases_new, na.rm=TRUE), 
            min = min(cases_new, na.rm=TRUE), 
            max = max(cases_new, na.rm=TRUE)) %>% 
  arrange(desc(mean))
## # A tibble: 16 x 5
##    state               mean median   min   max
##    <chr>              <dbl>  <dbl> <dbl> <dbl>
##  1 Selangor          1850.  1082.     13  8792
##  2 Sarawak            577.   268.      0  5291
##  3 Sabah              538.   275      29  3487
##  4 Johor              536.   329       0  2785
##  5 W.P. Kuala Lumpur  502.   317       1  2880
##  6 Kedah              369.    96.5     0  2455
##  7 Pulau Pinang       357.   144.      0  2474
##  8 Kelantan           317.    97       0  1573
##  9 Perak              281.    99.5     0  1596
## 10 Negeri Sembilan    263.   124.      0  1619
## 11 Pahang             178.    47       0   926
## 12 Melaka             160.    63       0  1120
## 13 Terengganu         154.    33.5     0   993
## 14 W.P. Labuan         27.1    6       0   253
## 15 W.P. Putrajaya      16.6    8       0   122
## 16 Perlis              10.8    1       0   113

1.5.7 Reshaping data

Some graphs require the data to be in wide format, while some graphs require the data to be in long format.

We can convert a wide data frame to a long data frame using gather. Welcome to our first picture of a selected Malaysia Covid data frame. We deliberately label the R code Chp1-3 and add other parameters to control and identify the plot.

mys1 %>% 
  gather("cluster", "value", -date) %>% 
  ggplot(aes(x=as.Date(date), 
                 y=value, 
                 color=cluster, 
                 shape=cluster)) + 
  geom_point() +
  scale_shape_manual(values=c(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)) +
  labs(title = "Daily New Cases by Cluster",
      subtitle = "Color by clusters")
Point plot of Malaysia case data

Figure 1.3: Point plot of Malaysia case data

We will explain the steps to create the plot using the ggplot functions later. The above code again does not change mys1 data frame but converts it to a long format to facilitate drawing Figure 1.3.

We can convert a long data frame to a wide data frame using

tidyr::wide_data <- spread(long_data, variable, value).

1.5.8 Missing data

Actual data are likely to contain missing values. There are three basic approaches to dealing with missing data: feature selection, listwise deletion, and imputation. Let’s see how each applies to the mys1 data frame.

1.5.8.1 Feature selection

In feature selection, we delete columns that contain too many missing values.

What is the proportion of missing data for each column?

pctmissing <- colSums(is.na(mys1))/nrow(mys1)
round(pctmissing, 2)
##                    date               cases_new            cases_import 
##                    0.00                    0.00                    0.00 
##         cases_recovered          cluster_import       cluster_religious 
##                    0.00                    0.56                    0.56 
##       cluster_community        cluster_highRisk       cluster_education 
##                    0.56                    0.56                    0.56 
## cluster_detentionCentre       cluster_workplace 
##                    0.56                    0.56

61 percent of the [cluster_import, cluster_religious, cluster_community, cluster_highRisk, cluster_education, cluster_detentionCentre, cluster_workplace] values are missing. This is probably due to these data being collected later on.

1.5.8.2 Listwise deletion

Listwise deletion involves deleting rows that contain missing values on any of the columns of interest.

Create a data frame without any rows containing missing data.

newdata <- na.omit(mys1)
pctmissing <- colSums(is.na(newdata))/nrow(newdata)
round(pctmissing, 2)
##                    date               cases_new            cases_import 
##                       0                       0                       0 
##         cases_recovered          cluster_import       cluster_religious 
##                       0                       0                       0 
##       cluster_community        cluster_highRisk       cluster_education 
##                       0                       0                       0 
## cluster_detentionCentre       cluster_workplace 
##                       0                       0

1.5.8.3 Imputation

Imputation involves replacing missing values with “reasonable” guesses about what the values would have been if they had not been missing. There are several approaches.10

1.6 Discussion

We have introduced the basic tools and functions to transform the data frames we have created to take advantage of the features and functions of our visualization package ggplot2. As we prepare to plot the various types of graphs, we will show how to manipulate or transform the relevant data frame(s).

References