8  Essential R Packages for Machine Learning

Learning Objectives

  • Get an overview of the most suitable R packages for machine learning
  • Apply different model frameworks
  • Learn how to find a new R package

This chapter provides an overview of the most suitable R-packages for machine learning. It also discusses the combination of packages and functions to achieve the desired results. Additionally, it explores how to find new R-packages and evaluate their suitability for a given task.

8.1 Inside and Outside of the Library Boxes

When it comes to R, the possibilities are endless. With over 15,000 packages available on CRAN (the Comprehensive R Archive Network) and countless others on GitHub and other repositories, the universe of available packages is vast. How do you untangle them all? Which one should you choose? Sometimes it is preferable to start by working out of the library boxes, that is, using the functions provided by the built-in packages of a fresh R installation.

However, the vast majority of the time, the best way to start is by using the packages that are already available in the R ecosystem. These packages are designed to make your life easier by providing pre-built functions for common tasks. They can help you save time and effort by automating repetitive tasks and streamlining complex analyses.

Life has become easier with the advent of wrapper functions, which reduce the lines of code needed by bundling common steps into packages. In the past, data analysis and modelling in R required extensive coding, often involving repetitive tasks and lengthy scripts. With these packages, users can achieve sophisticated results more efficiently, without the need to write lengthy and intricate scripts from scratch. This shift has not only simplified the process of data analysis and modelling but has also democratized access to advanced statistical techniques, making them accessible to a broader audience.

8.2 Essential R Packages for Machine Learning

Some of the most widely used modelling packages in R, particularly useful for machine learning modelling of the spread of infectious diseases and for evaluating the resulting variation in DALYs, include:

8.2.1 Meta-Packages

  1. {tidymodels}1: This framework provides a collection of packages for modelling and machine learning tasks, including rsample, parsnip, recipes, and yardstick, and it attaches core tidyverse packages such as dplyr and ggplot2. It offers a consistent and tidy workflow for data preprocessing, model building, evaluation, and visualization (see the sketch after the list).
  2. {caret}2: The caret framework (short for Classification And REgression Training) provides a unified interface for training and evaluating machine learning models. It supports a wide range of algorithms and offers convenient functions for hyperparameter tuning and model selection.
  3. {mlr3}3: This is a modern and extensible framework for machine learning in R. It provides a unified interface for modelling tasks, making it easy to train and evaluate a wide range of machine learning algorithms.
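
As a minimal illustration of the unified interface these frameworks provide, here is a sketch that specifies, fits, and predicts a linear model through parsnip (part of tidymodels); the built-in mtcars data and the formula are chosen for demonstration only.

library(tidymodels)
# specify a linear regression model with the "lm" engine
lm_spec <- linear_reg() %>%
  set_engine("lm")
# fit the specification and predict on a few rows
lm_fit <- fit(lm_spec, mpg ~ wt + hp, data = mtcars)
predict(lm_fit, new_data = head(mtcars, 3))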

8.2.2 Engines

Engines are the actual implementations of machine learning algorithms. They perform the computations necessary to train and evaluate models. Engines can be integrated into various meta-packages/frameworks to leverage their specific algorithms and computational efficiencies. Each engine might have its own set of parameters and functionalities, which can be accessed through the high-level interfaces provided by the meta-packages (see the sketch after the list).

  1. {randomForest}: For random forest models.
  2. {xgboost}: An optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework.
  3. {glmnet}: For lasso and ridge regression. Provides efficient procedures for fitting generalized linear models via penalized maximum likelihood.
  4. {nnet}: For neural networks.
  5. {kernlab}: For kernel-based machine learning methods.
  6. {e1071}: For support vector machines and other functions.
  7. {lme4}: For linear and generalized linear mixed-effects models.
  8. {mgcv}: For generalized additive models.
  9. {rpart}: For recursive partitioning and regression trees.
  10. {h2o}: A powerful machine learning framework that can be efficiently integrated with tidymodels. h2o offers a range of machine learning algorithms, including linear models, tree-based models, and ensemble methods, with efficient handling of large datasets and support for parallel computing.
  11. {keras3}: The R interface to Keras, a high-level neural networks API written in Python, designed to enable fast experimentation with deep learning models. It runs on top of {tensorflow} (an open-source machine learning framework developed by Google).
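
To make the distinction between an engine and a meta-package concrete, the following sketch fits the same random forest twice on the built-in mtcars data: once by calling the {randomForest} engine directly, and once through the parsnip high-level interface; the data and settings are illustrative only.

library(randomForest)
library(parsnip)
# call the engine directly
rf_direct <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
# access the same engine through a high-level interface
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("randomForest") %>%
  set_mode("regression")
rf_wrapped <- fit(rf_spec, mpg ~ ., data = mtcars)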

8.2.3 Time Series Analysis

  1. {forecast}: For forecasting functions.
  2. {prophet}: Specifically designed for time series forecasting with an emphasis on flexibility and ease of use. Developed by Facebook, prophet can handle various time series patterns, including seasonality, holiday effects, and trend changes, making it suitable for modelling infectious disease trends over time.
  3. {fpp3}4: This package accompanies the book Forecasting: Principles and Practice (3rd edition), offering data and tools for forecasting time series; for example, it loads the {fable} package, whose ARIMA() model is relevant for modelling disease spread over time (see the sketch below).
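
As a minimal sketch of this workflow, assuming the fpp3 meta-package is installed and using the built-in USAccDeaths series purely as a stand-in dataset:

library(fpp3)
# convert a ts object to a tsibble, fit an ARIMA model, and forecast one year
fit <- as_tsibble(USAccDeaths) %>%
  model(arima = ARIMA(value))
forecast(fit, h = "1 year")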

8.2.4 Bayesian Analysis

  1. {brms}: For Bayesian generalized multivariate non-linear multilevel models using Stan.
  2. {rstan}: For Bayesian inference using Stan.
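
A hedged sketch of the brms interface, assuming a data frame dat with an outcome y and a predictor x (both hypothetical placeholders, not from this chapter's data):

library(brms)
# fit a simple Gaussian regression with Stan via brms;
# dat, y, and x are placeholders for your own data
fit <- brm(y ~ x, data = dat,
           family = gaussian(),
           chains = 2, iter = 1000, seed = 123)
summary(fit)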

8.2.5 Specialized Tools

  1. {spdep}: A powerful tool for spatial statistics, it provides a set of functions for analyzing spatial dependencies and autocorrelation in spatial data, which is particularly relevant in infectious disease modelling. It allows spatial effects to be incorporated into models, giving a better understanding of how infectious diseases spread across space (see the sketch after the list).
  2. {INLA}5: Integrated Nested Laplace Approximations (INLA) is a package for Bayesian inference using the INLA method. It is particularly useful for spatial and spatio-temporal modelling, which can be relevant for studying the spread of infectious diseases.
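
A minimal spdep sketch, assuming coords is a matrix of point coordinates and x a numeric variable observed at those points (both hypothetical):

library(spdep)
# build a neighbour structure from the 4 nearest points
nb <- knn2nb(knearneigh(coords, k = 4))
# convert the neighbours to spatial weights
lw <- nb2listw(nb)
# Moran's I test for spatial autocorrelation in x
moran.test(x, lw)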

These packages offer a diverse set of tools and techniques for modelling infectious disease spread and assessing its impact on health metrics like DALYs. They provide valuable insights into the dynamics of infectious diseases and can be used to inform public health interventions and policies. However, whether they are the most suitable depends on various factors, including:

  • the specific requirements of the analysis
  • the expertise of the researcher
  • the nature of the data being used

Each package has its strengths and weaknesses, and the choice of package often depends on factors such as:

  • the complexity of the modelling task
  • the type of data available
  • the preferred modelling approach (e.g., frequentist vs. Bayesian).

For example, tidymodels provides a tidy and consistent workflow for modelling tasks, while mlr3 offers extensive support for machine learning algorithms and hyperparameter tuning.

It’s also worth considering other packages that may be relevant for specific aspects of the analysis, such as spatial modelling (e.g., spdep) or time series forecasting (e.g., prophet).

The features and capabilities of each package should be carefully evaluated in relation to the specific research objectives before choosing the one that best meets the research needs. It is also worth experimenting with different packages and techniques to identify the most effective approach for a given analysis.

8.3 Model Framework Application Examples

In this section we provide some examples of the application of different model frameworks. The data introduced in Chapter 7 is used to demonstrate model frameworks such as mlr3, h2o, and keras for machine learning tasks.

We have already shown an application of the tidymodels meta-package in Chapter 6 and Chapter 7, particularly useful for testing out different types of models with the help of key functions such as the workflow(), workflow_set(), and workflow_map().

8.3.1 Example: DALYs due to Dengue with mlr3

The dataset for this example is from the hmsidwR package, which contains data on Deaths, DALYs, YLDs, YLLs, Prevalence, and Incidence due to selected infectious diseases from 1980 to 2021.

library(tidyverse)
hmsidwR::infectious_diseases %>% names
#>  [1] "year"          "location_name" "location_id"   "cause_name"   
#>  [5] "Deaths"        "DALYs"         "YLDs"          "YLLs"         
#>  [9] "Prevalence"    "Incidence"

Let’s have a look at the locations where Dengue has been reported:

hmsidwR::infectious_diseases %>%
  filter(cause_name == "Dengue") %>%
  distinct(location_name)
#> # A tibble: 5 × 1
#>   location_name           
#>   <chr>                   
#> 1 Malawi                  
#> 2 Central African Republic
#> 3 Lesotho                 
#> 4 Eswatini                
#> 5 Zambia

And look at how the health metrics shape up over the years for Dengue:

Figure 8.1: Health Metrics due to Dengue: (a) YLLs, (b) YLDs, (c) DALYs

In some locations the trend of the metrics is quite flat, hovering around zero, while in others it is more pronounced. YLLs are the leading component of DALYs, but YLDs also contribute to the burden of disease.

We will focus on DALYs due to Dengue in three locations, Malawi, Zambia, and Central African Republic from 1990 to 2016 (26 years). The data from 2017 to 2021 is not included in the analysis and held out to be used as new data for testing the model’s performance, as shown in Chapter 9.

The data is preprocessed by removing missing values (DALYs values only appear from 1990 onwards) and grouping by location id; the location name and cause name columns are removed.

dalys_dengue <- hmsidwR::infectious_diseases %>%
  arrange(year) %>%
  filter(cause_name == "Dengue",
         year<=2016,
         !location_name %in% c("Eswatini", "Lesotho")) %>%
  drop_na() %>%
  group_by(location_id) %>%
  select(-location_name, -cause_name)

dalys_dengue %>%
  head()
#> # A tibble: 6 × 8
#> # Groups:   location_id [3]
#>    year location_id   Deaths DALYs  YLDs   YLLs Prevalence Incidence
#>   <dbl>       <dbl>    <dbl> <dbl> <dbl>  <dbl>      <dbl>     <dbl>
#> 1  1990         182 0.0330    7.59  5.35 2.24         33.7      560.
#> 2  1990         191 0.00135   6.88  6.86 0.0229       42.4      713.
#> 3  1990         169 0.000760  2.25  2.24 0.0186       13.1      222.
#> 4  1991         191 0.00135   8.57  8.54 0.0230       52.5      884.
#> 5  1991         182 0.0327    8.68  6.47 2.21         40.9      681.
#> 6  1991         169 0.000760  2.23  2.22 0.0185       13.0      218.

mlr3 is a meta-package particularly useful for its syntax, as it guides the user through the machine learning steps by providing a consistent naming convention. It is also useful for benchmarking different models. In addition to the mlr3 package, the mlr3learners, mlr3viz, mlr3verse, and data.table packages are also loaded, as sketched after the list below. Each of these packages provides additional functionality for machine learning tasks. More information on these packages can be found in the respective documentation, and further details can be found in the mlr3 book. In this example we use two types of models:

  • regr.cv_glmnet: A cross-validated glmnet model with an alpha of 0.5 and a lambda of 0.1.
  • regr.xgboost: An xgboost model with 1000 rounds, a maximum depth of 6, and an eta of 0.01. For this model the xgboost package is required.
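
The package loading step is not shown in the chapter output; a minimal sketch of the calls assumed by the code that follows:

library(mlr3)          # core framework: tasks, learners, resampling
library(mlr3learners)  # extra learners such as regr.cv_glmnet and regr.xgboost
library(mlr3viz)       # autoplot() methods for benchmark results
library(mlr3verse)     # umbrella package for the wider mlr3 ecosystem
library(data.table)    # data.table syntax used by mlr3 objects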

Define the task, specifying the id, backend, and target. The backend is the data that will be used for training the models. The task is the interface between the data and the learners; in this use-case the preprocessing steps are done beforehand, by removing missing values and categorical variables.

task <- TaskRegr$new(id = "DALYs",
                     backend = dalys_dengue,
                     target = "DALYs"
                     )

task$data() %>% head()
#>       DALYs       Deaths Incidence Prevalence     YLDs       YLLs location_id
#>       <num>        <num>     <num>      <num>    <num>      <num>       <num>
#> 1: 7.591334 0.0329974518  559.7017   33.65760 5.354059 2.23727545         182
#> 2: 6.882299 0.0013503820  713.4446   42.37633 6.859449 0.02285062         191
#> 3: 2.253677 0.0007599588  222.0559   13.05073 2.235120 0.01855750         169
#> 4: 8.565327 0.0013525307  884.4752   52.54218 8.542325 0.02300233         191
#> 5: 8.682844 0.0326559477  681.3534   40.88655 6.469088 2.21375567         182
#> 6: 2.234340 0.0007598756  217.7888   12.96389 2.215861 0.01847933         169
#>     year
#>    <num>
#> 1:  1990
#> 2:  1990
#> 3:  1990
#> 4:  1991
#> 5:  1991
#> 6:  1991

The learners are the model specifications. A full list of learners can be obtained with as.data.table(mlr_learners).
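
For example, a short sketch that filters the dictionary to regression learners only (column names as in the mlr3 learner dictionary):

# list available regression learners and the packages they require
as.data.table(mlr_learners)[task_type == "regr", .(key, packages)]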

learner_cv_glmnet <- lrn("regr.cv_glmnet", 
                         alpha = 0.5, 
                         s = 0.1)
learner_xgboost <- lrn("regr.xgboost",
                       nrounds = 1000,
                       max_depth = 6,
                       eta = 0.01)

Next, define a resampling strategy to estimate the generalization performance, and create a benchmark design to compare the different learners:

resampling <- rsmp("cv", folds = 5)

design <- benchmark_grid(tasks = task,
                         learners = list(learner_cv_glmnet,
                                         learner_xgboost),
                         resamplings = resampling
                         )
design
#>      task        learner resampling
#>    <char>         <char>     <char>
#> 1:  DALYs regr.cv_glmnet         cv
#> 2:  DALYs   regr.xgboost         cv

Run the benchmark:

set.seed(19052024)
bmr <- benchmark(design,
                 store_models = TRUE,
                 store_backends = TRUE
                 )

To access the content of an object such as bmr, use the $ operator, or for more options use the View() function.
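
For instance, a sketch of a few fields exposed by the benchmark result object (field names as in the mlr3 BenchmarkResult class):

bmr$tasks        # the tasks included in the benchmark
bmr$learners     # the learners that were compared
bmr$resamplings  # the resampling strategies used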

The score() function is used to extract the performance metrics of the models; in this case the mean squared error (MSE) is used.

bmr$score()[, .(learner_id,
                task_id,
                iteration,
                regr.mse)]
#>         learner_id task_id iteration    regr.mse
#>             <char>  <char>     <int>       <num>
#>  1: regr.cv_glmnet   DALYs         1 0.016850868
#>  2: regr.cv_glmnet   DALYs         2 0.011469850
#>  3: regr.cv_glmnet   DALYs         3 0.009037263
#>  4: regr.cv_glmnet   DALYs         4 0.009580525
#>  5: regr.cv_glmnet   DALYs         5 0.004772721
#>  6:   regr.xgboost   DALYs         1 0.057251614
#>  7:   regr.xgboost   DALYs         2 0.024933142
#>  8:   regr.xgboost   DALYs         3 0.392113572
#>  9:   regr.xgboost   DALYs         4 0.175673961
#> 10:   regr.xgboost   DALYs         5 0.155560986

Here we see the results of the benchmark; regr.mse is the mean squared error of the models. The regr.rmse and regr.mae measures are also available.

set.seed(349)
autoplot(bmr, measure = msr("regr.rmse"))
autoplot(bmr, measure = msr("regr.mae"))
Figure 8.2: BMR Regression (a) RMSE and (b) MAE

Aggregate the results:

measures <- msrs(c("regr.rmse", "regr.mae"))
aggr <- bmr$aggregate(measures)
rr1 <- aggr$resample_result[[1]]
rr2 <- aggr$resample_result[[2]]

Create a data frame with the results for plotting with ggplot2; note that the data are shuffled in the resampling process and need to be matched back to the original rows:

regr.cv_glmnet <- as.data.table(rr1$prediction()) %>%
  mutate(learner = "regr.cv_glmnet") %>%
  # order data to match the original order
  cbind(rr1$task$data()[rr1$prediction()$row_ids])

regr.xgboost <- as.data.table(rr2$prediction()) %>%
  mutate(learner = "regr.xgboost") %>%
  # order data to match the original order
  cbind(rr2$task$data()[rr2$prediction()$row_ids])


dat <- rbind(regr.cv_glmnet, regr.xgboost)

Plot the results:

dat %>%
  ggplot(aes(x = truth, y = response,
             group = learner)) +
  geom_abline() +
  geom_point(aes(color = learner)) +
  scale_color_manual(values = c("regr.cv_glmnet" = "navy",
                                "regr.xgboost" = "orange")) +
  facet_wrap(~learner) +
  labs(x = "Truth", y = "Response") +
  theme(legend.position = "none")
Figure 8.3: Truth vs Response for cv.glmnet and xgboost
dat %>%
  group_by(location_id, learner) %>%
  ggplot(aes(x = year)) +
  geom_point(aes(y = truth),
             size = 0.2) +
  geom_line(aes(y = response,
                linetype = learner),
            linewidth = 0.3) +
  # create facets for each location and
  # rename the location_id with the location_name
  # with the labeller function
  facet_wrap(vars(location_id),
    labeller = as_labeller(c(`182` = "Malawi",
                             `191` = "Zambia",
                             `169` = "Central African Republic")),
    scales = "free") +
  labs(title = "Dalys due to Dengue 1990-2016: Truth vs Responses",
       caption = "Scale Free - Dots represent the Truth Values",
       x = "Year", y = "DALYs", linetype = "Model") +
  theme(plot.title = element_text(hjust = 0.5))
Figure 8.4: Models results on observed DALYs due to Dengue 1990-2016; Test Resampled Data with cv.glmnet and xgboost.

In the next chapter, Chapter 9, we will see an application of these two models on the remaining years, 2017 to 2021, to test the models' performance as if on new data.

8.3.2 Example: DALYs due to Rabies with H2O

The following example shows how to use the {h2o} R package, which runs H2O via its REST API. It is particularly useful for handling large datasets. Although that is not the case here, given the size of the rabies dataset, we can use it to compare the results with the previous model frameworks.

options(timeout = 6000)
install.packages("h2o")
library(h2o)
# Initialize the H2O cluster
h2o.init()
# Upload data to H2O; `data` is assumed to hold the rabies dataset from Chapter 7
h2o_data <- as.h2o(data)
response <- "dalys_rabies"
predictors <- setdiff(names(data), response)
# Split data into training and testing sets
splits <- h2o.splitFrame(h2o_data,
                         ratios = 0.8,
                         seed = 1234)
train <- splits[[1]]
test <- splits[[2]]

Train three models: a linear regression model, a GBM model, and a random forest model.

model_lm <- h2o.glm(x = predictors,
                    y = response,
                    training_frame = train,
                    family = "gaussian")
model_gbm <- h2o.gbm(x = predictors,
                     y = response,
                     training_frame = train,
                     validation_frame = test,
                     ntrees = 1000,
                     max_depth = 6,
                     learn_rate = 0.01,
                     seed = 1234)
model_rf <- h2o.randomForest(x = predictors,
                             y = response,
                             training_frame = train,
                             ntrees = 100,
                             max_depth = 20,
                             seed = 1234)

Evaluate the models:

perf_lm <- h2o.performance(model_lm, newdata = test)
perf_gbm <- h2o.performance(model_gbm, newdata = test)
perf_rf <- h2o.performance(model_rf, newdata = test)

Then extract the RMSE and MAE metrics:

# extract RMSE and MAE from each performance object
data.frame(model = c("lm", "gbm", "rf"),
           rmse = c(h2o.rmse(perf_lm), h2o.rmse(perf_gbm), h2o.rmse(perf_rf)),
           mae  = c(h2o.mae(perf_lm), h2o.mae(perf_gbm), h2o.mae(perf_rf)))
#>   model     rmse       mae
#> 1    lm 3.030244 2.5879271
#> 2   gbm 1.431776 0.9648512
#> 3    rf 1.150945 0.9111883
# Inspect the full gbm model details
model_gbm@model
# Predict on test data
predictions <- h2o.predict(model_rf,
                           newdata = test)

Convert the H2O frames to data frames for ggplot2; this requires the h2o library to be loaded and initialized:

pred_df <- as.data.frame(predictions)
actual_df <- as.data.frame(test)
# Combine actual and predicted values
results_df <- data.frame(Actual = actual_df[, response],
                         Predicted = pred_df[, 1])
# Plot Actual vs Predicted
p1 <- ggplot(results_df,
             aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0,
              color = "red", linetype = "dashed") +
  labs(title = "Actual vs Predicted Values",
       x = "Actual", y = "Predicted")
# Calculate residuals
results_df$Residuals <- results_df$Actual - results_df$Predicted
# Plot Residuals vs Predicted
ggplot(results_df,
       aes(x = Predicted, y = Residuals)) +
  geom_point(color = "navy", alpha = 0.5) +
  geom_hline(yintercept = 0,
             color = "orange", linetype = "dashed") +
  labs(title = "Residuals vs Predicted Values",
       x = "Predicted", y = "Residuals")
# Normality Test of the Residuals
qqnorm(results_df$Residuals)
qqline(results_df$Residuals, col = 2)
Figure 8.5: H2O Models Results on DALYs due to Rabies: (a) Residuals vs Predicted Values, (b) Normality Test of the Residuals. Test Resampled Data with lm, gbm, and rf.
# Assuming there is a time column in your test data;
# the numeric year is converted to a Date via a first-of-year string
results_df$Time <- as.Date(paste0(actual_df$year, "-01-01"))
# Plot Actual vs Predicted over time
p2 <- ggplot(results_df, aes(x = Time)) +
  geom_line(aes(y = Actual,
                color = "Actual"),
            linetype = 1) +
  geom_line(aes(y = Predicted,
                color = "Predicted"),
            linewidth = 1,
            linetype = "dashed") +
  labs(title = "Actual vs Predicted Values Over Time",
       x = "Time", y = "Value") +
  scale_color_manual(name = "Legend",
                     values = c("Actual" = "navy",
                                "Predicted" = "orange"))
Figure 8.6: (a) Actual vs Predicted Values, (b) Actual vs Predicted Over Time
h2o.shutdown(prompt = FALSE)

8.3.3 Example: General Infection with Keras

The following example shows how to use keras, a package that facilitates the creation and training of neural networks and deep learning models in R. It provides a user-friendly interface to build models, whether they are simple neural networks or complex deep learning architectures, and is particularly useful for handling the connections between network layers. A more detailed explanation of how a neural network works can be found in

In this case the SEIR model (susceptible, exposed, infected, and recovered) is used, as seen in Chapter 6, to simulate the spread of a general infection in a population of 1000 individuals over 160 days. For some types of infection, the latency period is significant: it is the time during which infected individuals are not yet infectious themselves. During this period the individual is in compartment E (for exposed).

Once the simulation of the infected population is done, a deep learning model is trained on simulated social media data to predict the probability of infection based on social media activity. The model results are then used to adjust the SEIR parameters based on the predicted infection status, and the simulation is re-run with the adjusted parameters.

install.packages("keras3")
keras3::install_keras(backend = "tensorflow")

To set up the SEIR compartments we start by defining the differential equations, built by considering the rates of change of each compartment (S,E,I,R), the parameters (\beta,\sigma,\gamma), the initial state of the population (N = 1000), and the length of time in days (t=160). Finally the ode() function from the deSolve package is used to solve the differential equations and provide the output dataset.

library(deSolve)  # provides the ode() solver used below

SEIR <- function(time, state, parameters) {
  with(as.list(c(state, parameters)), {
    # Differential equations
    dS <- -beta * S * I / N
    dE <- beta * S * I / N - sigma * E
    dI <- sigma * E - gamma * I
    dR <- gamma * I
    # Return the rates of change
    list(c(dS, dE, dI, dR))
  })
}
# Parameters
parameters <- c(beta = 0.3, # Infection rate
                sigma = 0.2, # Incubation rate
                gamma = 0.1 # Recovery rate
                )
# Initial state values
N <- 1000
initial_state <- c(S = 999, E = 1, I = 0, R = 0)
# Time points
times <- seq(0, 160, by = 1)
# Solve the model
output <- ode(y = initial_state,
              times = times,
              func = SEIR,
              parms = parameters
              )
# Convert output to a data frame
output <- as.data.frame(output)

The social media data, composed of features and labels, is obtained by simulating a matrix of 1000 individuals (n) with 10 features (p), used for training the model:

set.seed(123)
n <- 1000 # number of samples
p <- 10 # number of features
# Generate random features and labels
social_features <- matrix(rnorm(n * p),
                          nrow = n, ncol = p)
infection_labels <- sample(0:1, n, replace = TRUE)
# Combine into a data frame
social_data <- data.frame(social_features)
social_data$infection <- infection_labels

head(social_data)
#>            X1          X2         X3          X4         X5         X6
#> 1 -0.56047565 -0.99579872 -0.5116037 -0.15030748  0.1965498 -0.4941739
#> 2 -0.23017749 -1.03995504  0.2369379 -0.32775713  0.6501132  1.1275935
#> 3  1.55870831 -0.01798024 -0.5415892 -1.44816529  0.6710042 -1.1469495
#> 4  0.07050839 -0.13217513  1.2192276 -0.69728458 -1.2841578  1.4810186
#> 5  0.12928774 -2.54934277  0.1741359  2.59849023 -2.0261096  0.9161912
#> 6  1.71506499  1.04057346 -0.6152683 -0.03741501  2.2053261  0.3351310
#>           X7         X8         X9        X10 infection
#> 1 -0.6992281 -1.6180367  0.5110004  1.9315759         0
#> 2  0.9964515  0.3791812  1.8079928 -0.6164747         0
#> 3 -0.6927454  1.9022505 -1.7026150 -0.5625675         0
#> 4 -0.1034830  0.6018743  0.2874488 -0.9899631         1
#> 5  0.6038661  1.7323497 -0.2691142  2.7312276         1
#> 6 -0.6080450 -0.1468696 -0.3795247 -0.7216662         1

Define the model by using the keras_model_sequential() function, and add layers with the layer_dense() and layer_activation() functions. On each layer the model will apply a transformation to the input data, and the activation function will introduce non-linearity to the model.

Here we use a simple model with just two layers, with a ReLU and a sigmoid as activation functions. These functions have specific shapes that allow the model to learn complex patterns in the data. Once a layer is done, its output is passed to the next layer.

Although layer_dropout() is not actually used in this application (it is only commented out for testing model adjustments), it could be added to the model as a layer that drops part of its input. The function is used to prevent overfitting by randomly setting a fraction of input units to zero during training.

The model concludes with one more layer_dense() whose activation function transforms the final output; in this case a sigmoid, though a softmax function could also be used.

The sigmoid function is used in binary classification problems, where the output is a probability between 0 and 1. The softmax function is used in multi-class classification problems, where the output is a probability distribution over multiple classes.

Given an input vector x=[x_1,x_2,...,x_p]:

  1. First Dense Layer Transformation:

z^{(1)}=\sum_{i=1}^p{W_i^{(1)} x_i}+b^{(1)} \tag{8.1}

where W^{(1)} \in \mathbb{R}^{1\times p} are the weights, b^{(1)} is the bias, and z^{(1)} is the output.

  2. First Activation (ReLU):

a^{(1)}=\mathrm{ReLU}(z^{(1)})=\max(0,z^{(1)}) \tag{8.2}

  3. Second Dense Layer Transformation (taking a^{(1)} as input, since the first layer has a single unit):

z^{(2)}=W^{(2)} a^{(1)}+b^{(2)} \tag{8.3}

  4. Output Activation (Sigmoid):

a^{(2)}=\mathrm{Sigmoid}(z^{(2)})=\frac{1}{1+e^{-z^{(2)}}} \tag{8.4}

library(keras3)  # load the Keras interface installed above

model <- keras_model_sequential(input_shape = c(p))
# simple model
model %>%
  layer_dense(units = 1) %>%
  layer_activation("relu") %>%
  # layer_dropout(rate = 0.3) %>%
  # layer_dense(units = 64) %>%
  # layer_activation('relu') %>%
  # layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1, activation = "sigmoid")

The model is then compiled using the compile() function, which specifies the loss function used to optimize the results and the metrics used to monitor the error.

y= \hat{y} + \epsilon

The loss function measures how well the model is performing; in this case a binary crossentropy loss is used, which matches the model output against the original data and drives adjustments when the difference is too high. The formula for binary crossentropy is:

L(y,\hat{y})=-\frac{1}{n}\sum_{i=1}^n{\left[y_i \log(\hat{y}_i)+(1-y_i) \log(1-\hat{y}_i)\right]}

The optimizer_adam() function specifies the use of the Adaptive Moment Estimation (Adam) optimization algorithm, which updates the model's weights (the parameters of the neural network) during training.

Finally, the model's performance is monitored with the accuracy metric.

# Compile the model
model %>% compile(loss = "binary_crossentropy",
                  optimizer = optimizer_adam(),
                  metrics = c("accuracy"))

Once defined and compiled, the model is trained using the fit() function, with the training data, number of epochs, batch size, and validation split specified. The number of epochs is the number of times the model will go through the training data, the batch size is the number of samples used in each training step, and the validation split is the fraction of the training data that will be used for validation.

A particular object that is created in this type of model is the history object, which stores the training history of the model. It contains information about the loss and accuracy of the model on the training and validation data during training.

history <- model %>% fit(x = as.matrix(social_data[, 1:p]),
                         y = social_data$infection,
                         epochs = 30,
                         batch_size = 128,
                         validation_split = 0.2
                         )

The second time the history object is created, the model is trained with half the number of epochs and half the batch size. These specifications can be further adjusted as needed.

history2 <- model %>% fit(x = as.matrix(social_data[, 1:p]),
                          y = social_data$infection,
                          epochs = 30 / 2,
                          batch_size = 128 / 2,
                          validation_split = 0.2
                          )

Here are two attempts at training the model with different parameters: the first with 30 epochs and a batch size of 128, and the second with 15 epochs and a batch size of 64. The training history of both runs is then plotted to compare their performance.

Figure 8.7: Model 1 and Model 2 Training History

Now that the model is trained and has given an idea of how it works on our data, infection status is predicted on new social media data.

The model is then used to predict the probability of infection based on the social media data. The predictions are then converted to binary values using a threshold of 0.5, where values greater than 0.5 are classified as 1 (infected) and values less than or equal to 0.5 are classified as 0 (not infected).

new_social_data <- matrix(rnorm(p * 160),
                          nrow = 160, ncol = p)
predicted_infections <- model %>%
  predict(new_social_data)
#> 5/5 - 0s - 6ms/step
predicted_infections <- ifelse(predicted_infections > 0.5, 1, 0)

The SEIR parameters are then adjusted based on the prediction results: the beta parameter is multiplied by one plus the mean of the predicted infections.

adjusted_parameters <- parameters
adjusted_parameters["beta"] <-
  adjusted_parameters["beta"] * (1 + mean(predicted_infections))

The SEIR model is then re-run with the adjusted parameters to simulate the spread of the infection in the population.

adjusted_output <- ode(y = initial_state,
                       times = times,
                       func = SEIR,
                       parms = adjusted_parameters
                       )
# Convert output to a data frame
adjusted_output <- as.data.frame(adjusted_output)

The output shows the impact of the social media adjustments on the spread of the infection.

ggplot() +
  geom_line(data = output,
            aes(x = time, y = I,
                color = "Original Infections")) +
  geom_line(data = adjusted_output,
            aes(x = time, y = I,
                color = "Adjusted Infections")) +
  labs(title = "SEIR Model - Social Media Adjustments",
       y = "Infectious Population",
       color = "Scenario") +
  scale_color_manual(values = c(
    "Original Infections" = "navy",
    "Adjusted Infections" = "orange"))
Figure 8.8: SEIR Model with and without Social Media Adjustments

This is just a simple example of how the keras package can be used to train a deep learning model on social media data to predict the probability of infection. The model can be tested with different layers; more specifications can be found in the keras documentation.

8.4 How to Find a New R-Package

Finding a new R-package can be a daunting task, given the vast number of packages available on CRAN and other repositories. One strategy to identify packages relevant to your specific needs is to use prebuilt packages that provide search functionality.

For example, the pkg_search() function from the pkgsearch package takes a text string as input and uses basic text mining techniques to search all of CRAN.

library(tidyverse) # for data manipulation
library(dlstats) # for package download stats
library(pkgsearch) # for searching packages

We search for packages related to “excess of mortality” and “infectious diseases”:

excessPkg <- pkg_search(query = "excess of mortality", size = 200)
head(excessPkg$maintainer_name)
#> [1] "Rafael A. Irizarry" "Juste Goungounga"   "Yohann Foucher"    
#> [4] "Mathieu Fauvernier" "Joonas Miettinen"   "Rob Hyndman"
excessPkgShort <- excessPkg %>%
  filter(maintainer_name != "ORPHANED", score > 100) %>%
  select(score, package, downloads_last_month) %>%
  arrange(desc(downloads_last_month))

head(excessPkgShort)
#> # A data frame: 5 × 3
#>   score package    downloads_last_month
#> * <dbl> <chr>                     <int>
#> 1  178. survPen                     396
#> 2 1013. excessmort                  253
#> 3  675. xhaz                          1
#> 4  273. RISCA                         1
#> 5  128. popEpi                        1
excess_shortList <- excessPkgShort$package
excess_downloads <- cran_stats(excess_shortList)
ggplot(excess_downloads, aes(end, downloads,
                             group = package, 
                             color = package)) +
  geom_line() +
  geom_point(aes(shape = package)) +
  scale_y_continuous(trans = "log2")
Figure 8.9: Excess of Mortality Packages Downloads

Another example is the cranly package, which offers a simple interface for exploring CRAN. It provides comprehensive methods for cleaning up and organizing the information in the CRAN package database, making it easier to find packages relevant to your specific needs.

library(cranly)
p_db <- tools::CRAN_package_db()
package_db <- clean_CRAN_db(p_db)
package_db %>%
  filter(grepl("infectious diseases",
               description,
               ignore.case = TRUE)) %>%
  select(package)
#>        package
#> 1  EpiDynamics
#> 2    epitweetr
#> 3        pempi
#> 4         rts2
#> 5    seqDesign
#> 6 SMITIDstruct
#> 7       spaero
#> 8         ssrn

  1. Max Kuhn and Julia Silge, Tidy Modeling with R, n.d., https://www.tmwr.org/.↩︎

  2. Max Kuhn, The Caret Package, n.d., https://topepo.github.io/caret/.↩︎

  3. “Mlr-Org,” n.d., https://mlr-org.com/.↩︎

  4. Rob J. Hyndman and George Athanasopoulos, Forecasting: Principles and Practice (3rd ed.), n.d., https://otexts.com/fpp3/.↩︎

  5. “R-INLA Project - What Is INLA?” n.d., https://www.r-inla.org/what-is-inla.↩︎