9  Predictive Modelling and Beyond

Learning Objectives

  • How to use models’ predictions on new data
  • Time series analysis
  • Mixed models


In the dynamic landscape of public health, the ability to forecast and predict future trends is essential for effective decision-making and the implementation of timely interventions. This chapter provides an overview of predictive modelling, focusing on the challenges of applying predictions to new data, the use of time series analysis, and mixed models to anticipate the trajectory of infectious diseases and health metrics. By exploring the underlying patterns and analysing historical data, we estimate the disease burden and evaluate the impact of interventions on population health. We demonstrate how time series analysis and mixed models can be tailored to address specific research questions, such as predicting disease outbreaks, estimating disease burden, and assessing the impact of interventions on population health.

9.1 Predictions About the Future

The previous chapters provided an attempt at estimating future outcomes based on the application of various type of statistical and machine learning techniques. Here we look at how to use these predictions on new data.

When making predictions on new data, it is important to consider the generalisability of the model. A model that performs well on the training data may not necessarily generalise well to new, unseen data. To evaluate the performance of the model we used the cross-validation technique, but there are more such as hold-out validation, k-fold cross-validation, leave-one-out cross-validation, and bootstrapping. These methods help assess the model’s ability to make accurate predictions on data not used during training, providing a more reliable estimate of the model’s performance in real-world scenarios.

Once the model has been trained and evaluated, it is used to make predictions on new data. This process involves applying the model to the new data to generate forecasts or estimates of the outcome of interest. For example, in the case of infectious diseases, predictive models can be used to forecast the trajectory of an outbreak, estimate the number of cases, or evaluate the impact of interventions on disease transmission. By leveraging historical data and the insights gained from the model, we can effectively evaluate the model performance and make informed decisions based on the predictions.

9.2 Example: Dengue Test Predictions for 2017-2021

Predictive modelling is a powerful tool for forecasting future trends. Here we test the Dengue’s model made with mlr3 meta-package in Chapter 8. The model was trained on data from 1990 to 2016 and now tested for years 2017 to 2021.

First we load the original data and name it old_data.

old_data <- hmsidwR::infectious_diseases %>%
  arrange(year)%>%
  filter(cause_name == "Dengue",
         year<=2017,
         !location_name %in% c("Eswatini", "Lesotho")) %>%
  drop_na() %>%
  group_by(location_id) %>%
  select(-location_name, -cause_name)

Then, we load Dengue data from 2017 to 2021 and name it new_data.

new_data <- hmsidwR::infectious_diseases %>%
  arrange(year)%>%
  filter(cause_name == "Dengue",
         year>=2017,
         !location_name %in% c("Eswatini", "Lesotho")) %>%
  drop_na() %>%
  group_by(location_id) %>%
  select(-location_name, -cause_name)

In the next step, we use the trained models from Chapter 8 to make predictions on the new data. The predictions are stored in new_pred_regr.cv_glmnet and new_pred_regr.xgboost. The resampling results: rr1$learners and rr2$learners, contain the trained models, in particular the first of 5 folds cross validation sample is used for the predictions.

new_pred_regr.cv_glmnet <- 
  rr1$learners[[1]]$predict_newdata(new_data, task = rr1$task)

We notice that the models are doing well, when old data are compared to the new data. The predictions are close to the actual values. Much more can be done to improve the models, such as hyperparameter tuning, feature engineering, and model ensembles, but this is a good point to start. We could also look at the confidence intervals of the predictions, to see how certain we are about the predictions, and so on.

This is the result of the cv_glmnet application on new data:

regr.cv_glmnet_new <- as.data.table(new_pred_regr.cv_glmnet) %>%
  mutate(learner = "regr.cv_glmnet") %>%
  cbind(new_data) 

regr.cv_glmnet_new %>%
  ggplot(aes(x = year)) +
  geom_line(data=old_data,aes(y = DALYs))+
  geom_line(aes(y = DALYs),color="brown") +
  geom_line(aes(y = response), linetype="dashed") +
  facet_wrap(vars(location_id),
             labeller = as_labeller(c(`182` = "Malawi",
                                      `191` = "Zambia",
                                      `169` = "Central African Republic")),
             scales = "free_y") +
  labs(title = "Dengue Predictions for 2017-2021: cv_glmnet")
Dengue Predictions for 2017-2021: cv_glmnet
Figure 9.1: Dengue Predictions for 2017-2021: cv_glmnet

The same steps are taken for the xgboost, and the result is shown below.

new_pred_regr.xgboost <- 
  rr2$learners[[1]]$predict_newdata(new_data, task = rr2$task)
regr.xgboost_new <- as.data.table(new_pred_regr.xgboost) %>%
  mutate(learner = "regr.xgboost") %>%
  cbind(new_data) 

regr.xgboost_new %>%
  ggplot(aes(x = year)) +
  geom_line(data=old_data,aes(y = DALYs))+
  geom_line(aes(y = DALYs),color="brown") +
  geom_line(aes(y = response), linetype="dashed") +
  facet_wrap(vars(location_id),
             labeller = as_labeller(c(`182` = "Malawi",
                                      `191` = "Zambia",
                                      `169` = "Central African Republic")),
             scales = "free_y") +
  labs(title = "Dengue Predictions for 2017-2021: xgboost")
Dengue Predictions for 2017-2021: xgboost
Figure 9.2: Dengue Predictions for 2017-2021: xgboost

In conclusion, predictive modelling is a valuable tool for forecasting future trends and making informed decisions based on historical data. By leveraging the insights gained from the models, we can anticipate the trajectory of infectious diseases, estimate disease burden, and evaluate the impact of interventions on population health. The models can be further refined and optimised to improve their predictive performance and provide more accurate forecasts. By combining predictive modelling with time series analysis and mixed models, we can gain a comprehensive understanding of the complex dynamics of public health and make data-driven decisions to improve population health outcomes.

9.3 Time Series Analysis

One important side of the analysis is to consider the evolution of the phenomenon in time. We can do this by using time as a factor in the model. Time series analysis serves as a tool for understanding and forecasting temporal patterns. Techniques such as ARIMA (AutoRegressive Integrated Moving Average) models offer a systematic approach to modelling time-dependent data, capturing seasonal variations, trends, and irregularities to generate accurate forecasts.

Mixed models, on the other side, provide a versatile framework for incorporating various sources of variability and correlation within the data. Also known as hierarchical models, multilevel models, or random effects models, are a type of statistical model that include both fixed effects and random effects. By combining fixed effects with random effects, mixed models accommodate complex data structures, such as hierarchical or longitudinal data, while accounting for individual-level and group-level factors that may influence the outcome of interest.

To analyse temporal data, where observations are collected at regular intervals over time, time series analysis allows for studying the patterns, trends, and dependencies present in the data to make forecasts or infer relationships. Time series data often exhibit inherent characteristics such as trend, seasonality, cyclic patterns, and irregular fluctuations, which can be explored using various methods such as decomposition, smoothing, and modelling. This analysis is widely used in fields like economics, finance, epidemiology, and environmental science to understand and predict future trends, identify anomalies, and make informed decisions based on historical patterns.

Decomposition methods play a crucial role by pulling out the underlying structure of temporal data. The components of a time series are trend, seasonality, and random fluctuations. Understanding the trend allows for the identification of long-term changes or shifts in the data, while detecting seasonal patterns helps uncover recurring fluctuations that may follow seasonal or cyclical patterns. By separating these components, through decomposition methods, it is possible to discover the temporal dynamics within the data, facilitating more accurate forecasting and predictive modelling. Moreover, decomposing time series data can aid in anomaly detection, trend analysis, and seasonal adjustment, making it a fundamental tool for interpreting complex temporal phenomena.

Modelling time series data often requires a combination of techniques to capture its complex nature. Mixed models, splines, and ARIMA are powerful tools commonly used for this purpose. Splines provide a flexible way to capture nonlinear relationships and smooth out the data, which is particularly useful for capturing trends or seasonal patterns. ARIMA models, on the other hand, are adept at capturing the autocorrelation structure present in the data, including both short-term and long-term dependencies.

Time series analysis can also be performed on estimated values produced by a machine learning model. This involves using vectors of estimates, actual values (ground truth), and time (such as years) to apply a time series model. By incorporating these elements, the time series analysis can help identify trends, seasonal patterns, and other temporal structures within the data. This approach enhances the robustness and accuracy of the predictions by combining the strengths of machine learning models with traditional time series techniques.

9.4 Example: SDI Time Series Analysis with Mixed Effect Models

From forecasting the trajectory of emerging pathogens to evaluating the long-term implications of public health policies, predictive modelling offers valuable insights into the future of global health. As seen in Chapter 5, the consideration of other factors such as the Social Demographic Index (SDI) can provide a comprehensive understanding of the impact social and economic determinants can have on health outcomes.

As an index, SDI is composed of the geometric mean of the total fertility rate (TFR) for those younger than 25 years old (TFU25), the mean education for those 15 years old and older (EDU15+), and lag-distributed income (LDI) per capita. It ranges from 0 to 1 but it is usually multiplied by 100 for a scale of 0 to 100.

Standard Calculation of TFR

\text{TFR} = \sum (\text{ASFR}_i \times 5) \tag{9.1}

where \text{ASFR}_i represents the age-specific fertility rates for age group i , and the sum is typically calculated over all childbearing age groups (often 5-year intervals from ages 15-49).

Standard Calculation of SDI

The SDI is calculated as the geometric mean of these three components:

SDI = \sqrt[3]{TFU25 \times EDU15+ \times LDI} \tag{9.2}

Each component (TFU25, EDU15+, and LDI) is normalised before taking the geometric mean to ensure they are on a comparable scale.

In summary, the TFR formula is specific to calculating fertility rates, while the SDI formula incorporates a specific subset of the TFR (TFU25) along with education and income metrics to create a broader socio-demographic index.

In general, the level of this index helps identify key drivers in the development of health outcomes. SDI can be used as a predictor in predictive models to estimate the burden of diseases, mortality rates, and other health metrics. For example, the relationship between SDI and the incidence of a disease can be modelled using a logistic regression model:

\log \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 \times \text{SDI} + \beta_2 \times X_2 + \cdots + \beta_k \times X_k \tag{9.3}

Where, p is the probability of the incidence of the disease, \log \left( \frac{p}{1 - p} \right) is the log-odds (logit) of the disease incidence, \beta_0 is the intercept term, \beta_1, \beta_2, \ldots, \beta_k are the coefficients for the predictor variables, and X_2, \ldots, X_k are other potential predictor variables that might influence the disease incidence.

To convert the log-odds back to the probability of disease incidence given the value of SDI:

p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{SDI})}} \tag{9.4}

In this example we use the SDI values from 1990 to 2019, for 3 locations plus the Global region and evaluate the difference in SDI average across the time.

hmsidwR::sdi90_19 |> head()
#> # A tibble: 6 × 3
#>   location  year value
#>   <chr>    <dbl> <dbl>
#> 1 Global    1990 0.511
#> 2 Global    1991 0.516
#> 3 Global    1992 0.521
#> 4 Global    1993 0.525
#> 5 Global    1994 0.529
#> 6 Global    1995 0.534
sdi90_19 |>
  filter(location %in% c("Global", "Italy", 
                         "France", "Germany")) |>
  ggplot(aes(x = year, y = value, 
             linetype = location)) +
  geomtextpath::geom_textline(aes(label = location), 
                              hjust = 0, vjust = 0) +
  labs(title = "Social Demographic Index (SDI) from 1990 to 2019")+
  theme(legend.position = "none")
  
Social Demographic Index (SDI) from 1990 to 2019
Figure 9.3: Social Demographic Index (SDI) from 1990 to 2019
sdi90_19 |>
  group_by(location) |>
  reframe(avg = round(mean(value), 3)) |>
  filter(location %in% c("Global", "Italy", "France", "Germany"))
#> # A tibble: 4 × 2
#>   location   avg
#>   <chr>    <dbl>
#> 1 France   0.79 
#> 2 Germany  0.863
#> 3 Global   0.58 
#> 4 Italy    0.763

Then selecting only France as location to analyse the time series using spline interpolation and ARIMA(1,0,0), we can decompose the time series into its components and analyse the autocorrelation function.

sdi_fr <- sdi90_19 |>
  filter(location == "France")

sdi_fr |>
  head() |>
  str()
#> tibble [6 × 3] (S3: tbl_df/tbl/data.frame)
#>  $ location: chr [1:6] "France" "France" "France" "France" ...
#>  $ year    : num [1:6] 1990 1991 1992 1993 1994 ...
#>  $ value   : num [1:6] 0.738 0.743 0.75 0.755 0.759 0.763
library(tsibble)

sdi_fr_ts <- tsibble::as_tsibble(sdi_fr, index = year)

sdi_fr_ts |> autoplot()
Social Demographic Index (SDI) in France
Figure 9.4: Social Demographic Index (SDI) in France
dcmp <- sdi_fr_ts |>
  model(stl = STL(value))

components(dcmp) |> head()
#> # A dable: 6 x 6 [1Y]
#> # Key:     .model [1]
#> # :        value = trend + remainder
#>   .model  year value trend remainder season_adjust
#>   <chr>  <dbl> <dbl> <dbl>     <dbl>         <dbl>
#> 1 stl     1990 0.738 0.738 -0.000400         0.738
#> 2 stl     1991 0.743 0.744 -0.000560         0.743
#> 3 stl     1992 0.75  0.749  0.00128          0.75 
#> 4 stl     1993 0.755 0.754  0.00136          0.755
#> 5 stl     1994 0.759 0.758  0.000800         0.759
#> 6 stl     1995 0.763 0.762  0.000680         0.763

“…three types of time series patterns: trend, seasonality and cycles. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (often just called the trend for simplicity). Thus we can think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series). Rob J Hyndman

y_t=S_t+T_t+R_t \tag{9.5} y_t=S_t*T_t*R_t \tag{9.6} log(y_t)=log(S_t)+log(T_t)+log(R_t) \tag{9.7}

The remainder component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.

components(dcmp) |> autoplot()

components(dcmp) |>
  as_tsibble() |>
  autoplot(value) +
  geom_line(aes(y = trend), 
            linetype="dashed")+
labs(title = "Trend line of SDI in France")
Components of SDI in France
(a) Components of SDI in France
Components of SDI in France
(b) Trend line of SDI in France
Figure 9.5: Components of SDI in France

9.4.1 Autocorrelation function

Time series that show no autocorrelation are called white noise. In this case we have autocorrelation. So, no white noise.

sdi_fr_ts |>
  ACF(value) |>
  autoplot()
Autocorrelation Function of SDI in France
Figure 9.6: Autocorrelation Function of SDI in France
sdi_fr_ts |>
  features(value,
           features = unitroot_kpss)
#> # A tibble: 1 × 2
#>   kpss_stat kpss_pvalue
#>       <dbl>       <dbl>
#> 1      1.09        0.01

Seasonal difference is not required

sdi_fr_ts |>
  features(value,
           features = unitroot_nsdiffs)
#> # A tibble: 1 × 1
#>   nsdiffs
#>     <int>
#> 1       0

Auto-ARIMA model is used to establish the best fit:

if needed install.packages("urca")

fit <- sdi_fr_ts |>
  model(ARIMA(value))

fit |> report()
#> Series: value 
#> Model: ARIMA(1,1,0) w/ drift 
#> 
#> Coefficients:
#>          ar1  constant
#>       0.5971    0.0013
#> s.e.  0.1562    0.0002
#> 
#> sigma^2 estimated as 8.634e-07:  log likelihood=162.46
#> AIC=-318.92   AICc=-317.96   BIC=-314.82
fit |>
  forecast(h = 10) |>
  ggplot(aes(x = year, y = .mean)) +
  geom_line(data = sdi_fr_ts,
            aes(y = value)) +
  geom_line(color = "purple", linetype = "dashed")+
  labs(title = "Forecast of SDI in France")
Forecast of SDI in France
Figure 9.7: Forecast of SDI in France

9.4.2 Partial Autocorrelations

sdi_fr_ts |>
  PACF(value) |>
  autoplot()

sdi_fr_ts |>
  gg_tsdisplay(y = value, plot_type = "partial")
Partial Autocorrelations of SDI in France
(a) PACF
Partial Autocorrelations of SDI in France
(b) gg_tsdisplay
Figure 9.8: Partial Autocorrelations of SDI in France

9.4.3 Model Ensembles

To improve the predictive performance, predictions of multiple single learners can be aggregated to a model ensemble by combining the forecasts of individual models.

Ensemble learning is a machine learning technique that leverages the diversity of individual models to make more accurate predictions and reduce overfitting. Ensemble methods can reduce the variance and bias of the predictions, leading to more accurate and reliable results. Ensemble techniques such as bagging, boosting, and stacking leverage the diversity of individual models to create a stronger collective model that outperforms its components. These methods are widely used in machine learning and predictive modelling to enhance the predictive power of the model and achieve better generalisation performance on unseen data.

In case is required install the {urca} package:

caf_fit <- sdi_fr_ts |>
  model(
    arima210 = ARIMA(value ~ pdq(2, 1, 0)),
    arima013 = ARIMA(value ~ pdq(0, 1, 3)),
    stepwise = ARIMA(value),
    search = ARIMA(value, stepwise = FALSE)
    )
caf_fit |>
  pivot_longer(cols = everything(),
               names_to = "Model name",
               values_to = "Orders")
#> # A mable: 4 x 2
#> # Key:     Model name [4]
#>   `Model name`                  Orders
#>   <chr>                        <model>
#> 1 arima210     <ARIMA(2,1,0) w/ drift>
#> 2 arima013     <ARIMA(0,1,3) w/ drift>
#> 3 stepwise     <ARIMA(1,1,0) w/ drift>
#> 4 search       <ARIMA(1,1,0) w/ drift>
glance(caf_fit) |>
  arrange(AICc) |>
  select(.model:BIC)
#> # A tibble: 4 × 6
#>   .model        sigma2 log_lik   AIC  AICc   BIC
#>   <chr>          <dbl>   <dbl> <dbl> <dbl> <dbl>
#> 1 stepwise 0.000000863    162. -319. -318. -315.
#> 2 search   0.000000863    162. -319. -318. -315.
#> 3 arima210 0.000000896    162. -317. -315. -311.
#> 4 arima013 0.000000950    162. -314. -312. -308.

Best ARIMA model selected is ARIMA(1,1,0):

caf_fit |>
  select(search) |>
  gg_tsresiduals()


caf_fit |>
  forecast(h = 5) |>
  filter(.model == "search") |>
  ggplot(aes(x = year, y = .mean)) +
  geom_line(data = sdi_fr_ts,
            aes(y = value)) +
  geom_line(color = "purple", linetype = "dashed")
Residuals of ARIMA Models
(a) Residuals of ARIMA Models
Residuals of ARIMA Models
(b) Forecast of SDI in France
Figure 9.9: Residuals of ARIMA Models

9.5 Mixed Models

Mixed models are a powerful statistical tool for analysing data with a hierarchical or nested structure, where observations are grouped within higher-level units. By incorporating both fixed and random effects, mixed models can account for the variability within and between groups, providing a flexible framework for modelling complex data structures. Mixed models are widely used in various fields, and particularly useful in fields like healthcare and infectious diseases where data might be collected across different subjects or time points.

9.5.1 Mixed-Effects Models in Estimating YLDs

Mixed-effects models are particularly useful in estimating YLDs because they can handle data that is hierarchically structured (e.g., patients within clinics, repeated measures over time) and account for both fixed effects (e.g., age, gender) and random effects (e.g., variability between patients or clinics).

Fixed Effects: These are the primary effects of interest and include variables such as age, gender, year, and specific interventions or treatments.

Random Effects: These capture the variability that is not explained by the fixed effects, such as differences between patients, clinics, or regions.

By using mixed-effects models, we can better account for the within-subject and between-subject variability, providing more accurate estimates of YLDs.

9.5.2 Different Ratios Used in Forecasting Non-Fatal Disease Burden

When forecasting the non-fatal disease burden, different mortality ratios are used to assess the dynamics and progression of diseases. Here are three important ratios:

MI Ratio (Mortality to Incidence Ratio): MI=M/I

This ratio compares the number of deaths due to a disease to the number of new cases of the disease in a specific time period. It is considered a severity indicator as it indicates the lethality of the disease after diagnosis. A high MI suggests that a higher proportion of diagnosed cases result in death, indicating a severe condition.

MP Ratio (Mortality to Prevalence Ratio): MP=M/P

This ratio compares the number of deaths due to a disease to the total number of existing cases (both new and old) at a given time. It is used in the Current Risk Assessment (CRA) as it reflects the current fatality risk among those living with the disease. A high MP suggests a significant risk of death for those currently affected. It helps in understanding the long-term burden and survivability of a disease.

PI Ratio (Prevalence to Incidence Ratio): PI=P/I

This ratio compares the total number of existing cases to the number of new cases in a specific time period. As a chronicity indicator, it indicates how long individuals live with the disease. A high PI suggests that individuals live longer with the disease, indicating it is more chronic. It provides insight into the chronic nature of a disease and the duration that individuals typically live with the disease.

In R, mixed-effects models can be integrated with Eigen, a C++ template library for linear algebra which significantly speeds up the computations involved in fitting mixed-effects models.

9.6 Example: Tuberculosis YLDs Estimation Using Mixed-Effects Models

To estimate Years Lived with Disability (YLDs) using mixed-effects models, we can use the lme4 package which provides functions to fit linear and generalised linear mixed-effects models. We will provide an example of how to estimate YLDs with mixed-effects models. We use the g7_hmetrics dataset found in hmsidwR package containing YLDs, deaths, incidence and prevalence for Respiratory infections and tuberculosis in 2010 and 2019 to demonstrate the estimation of YLDs using mixed-effects models.

library(lme4)
library(tidyverse)

data <- hmsidwR::g7_hmetrics %>%
  filter(measure %in% c("Incidence",
                        "Prevalence",
                        "YLDs", 
                        "Deaths"),
         str_detect(cause, "Tuberculosis"),
         !year == 2021,
         !location == "Global",
         metric == "Rate",
         sex == "both") %>%
  select(-metric, -sex, -cause, -upper, -lower) %>%
  pivot_wider(names_from = measure,
              values_from = val)

data %>% head()
#> # A tibble: 6 × 6
#>   location  year Deaths  YLDs Prevalence Incidence
#>   <chr>    <dbl>  <dbl> <dbl>      <dbl>     <dbl>
#> 1 US        2010  0.164  1.05      9991.      2.26
#> 2 Italy     2010  0.303  1.71     10306.      6.71
#> 3 UK        2010  0.359  2.67      8670.     12.0 
#> 4 Canada    2010  0.177  1.06      8840.      4.06
#> 5 Germany   2010  0.286  1.79      7758.      4.86
#> 6 UK        2019  0.238  1.84      7534.      9.11
library(ggpattern)
ggplot(data,
       aes(x = fct_reorder(location, YLDs),
           y = YLDs, group = year)) +
  ggpattern::geom_col_pattern(aes(pattern = factor(year)),
                              position = "dodge",
                              fill= 'white',
                              colour= 'black') +
  labs(title="YLDs for Tuberculosis by Location and Year",
       x = "Location", pattern = "Year")
    
YLDs for Tuberculosis by Location and Year
Figure 9.10: YLDs for Tuberculosis by Location and Year
hmsidwR::disweights %>%
  filter(str_detect(sequela, "tuberculosis")) %>%
  select(cause1, severity, dw)
#> # A tibble: 6 × 3
#>   cause1       severity    dw
#>   <chr>        <chr>    <dbl>
#> 1 Tuberculosis mean     0.333
#> 2 Tuberculosis mean     0.333
#> 3 Tuberculosis mean     0.333
#> 4 Tuberculosis mean     0.333
#> 5 Tuberculosis mean     0.333
#> 6 Tuberculosis mean     0.333

The disability weight is 0.333 for tuberculosis, and the duration derived from the GBD 2021 was estimated based on a systematic review and adjusted for treated and untreated cases. The duration of tuberculosis was estimated to be 3 years1 for untreated cases and 6 months for treated cases.

# Define the formula
formula <- YLDs ~ Prevalence + year + (1 | location)

# Fit the mixed-effects model
model <- lmer(formula, data = data)

model
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: YLDs ~ Prevalence + year + (1 | location)
#>    Data: data
#> REML criterion at convergence: 34.6921
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  location (Intercept) 0.6202  
#>  Residual             0.1773  
#> Number of obs: 12, groups:  location, 6
#> Fixed Effects:
#> (Intercept)   Prevalence         year  
#>  67.7988999    0.0001619   -0.0336388  
#> fit warnings:
#> Some predictor variables are on very different scales: consider rescaling
# Extract fixed effects
fixed_effects <- fixef(model)

fixed_effects
#>   (Intercept)    Prevalence          year 
#> 67.7988999255  0.0001619458 -0.0336387717

The model is:

YLDs = 67.7988999255 + 0.0001619458*Prevalence - 0.0336387717*year \tag{9.8}

In particular, a prevalence coefficient of 0.0001619458 means that for each unit increase in prevalence, YLDs increase by 0.0001619458, while a year coefficient of -0.0336387717 means that for each unit increase in year, YLDs decrease by 0.0336387717.

# Extract random effects
random_effects <- ranef(model)

random_effects
#> $location
#>         (Intercept)
#> Canada  -0.40736806
#> Germany  0.44667836
#> Italy   -0.09317570
#> Japan   -0.06401591
#> UK       0.87190491
#> US      -0.75402361
#> 
#> with conditional variances for "location"

The output of random_effects is a list of data frames with the random effects for each level of the random effect variable (location in this case), which means that the model is

\begin{split} YLDs = 67.7988999255 \\ + 0.0001619458 \cdot \text{Prevalence} \\ - 0.0336387717 \cdot \text{year} \\ + \text{random effect for each location} \end{split} \tag{9.9}

Then, predictions are made for the YLDs using the mixed-effects model on observed data and new data. The predictions are compared with the actual YLDs to evaluate the model’s performance.

# Predict YLDs for observed data
predictions <- predict(model, 
                       data,
                       allow.new.levels = TRUE)

data$predicted_YLD <- predictions

ggplot(data %>% filter(year == 2019,
                       !location=="Global"),
       aes(x = YLDs, y = predicted_YLD,
       group = location)) +
  geom_point(aes(shape = factor(location))) +
  geom_abline() +
  labs(title = "YLDs Predicted vs Estimated",
       x = "YLDs",
       y = "Predicted YLD",
       shape = "Location")
Predicted YLDs for Tuberculosis
Figure 9.11: Predicted YLDs for Tuberculosis
# Create new data for prediction
new_data <- data.frame(Prevalence = c(13000, 7000, 10200),
                       year = c(2021, 2021, 2021),
                       location = c("Japan", "UK", "US"))

# Predict YLDs for new data
predictions <- predict(model, 
                       new_data, 
                       allow.new.levels = TRUE)

And the predictions are shown in the following plot:

data %>%
  mutate(location = as.factor(location)) %>%
  filter(location%in%c("Japan", "UK", "US")) %>%
ggplot(aes(x = factor(year), y = predicted_YLD,
             group = location, fill=location)) +
  ggpattern::geom_col_pattern(aes(pattern = location),
                              position = "stack",
                              fill= 'white',
                              colour= 'black') +
  ggpattern::geom_col_pattern(data = bind_cols(new_data, 
                                               predicted_YLD = predictions),
                              aes(pattern = location),
                              position = "stack",
                              fill= 'white',
                              colour= 'black') +
  labs(title = "Predicted YLDs for Tuberculosis",
       x = "Year",
       y = "Predicted YLDs",
       fill = "Location")
New Data-Predicted YLDs for Tuberculosis
Figure 9.12: New Data-Predicted YLDs for Tuberculosis

9.7 Summary

Predictive modelling is a valuable tool for forecasting future trends and making informed decisions based on historical data. By leveraging the insights gained from the models, we can anticipate the trajectory of infectious diseases, estimate disease burden, and evaluate the impact of interventions on population health. The models can be further refined and optimised to improve their predictive performance and provide more accurate forecasts. By combining predictive modelling with time series analysis and mixed models, we can gain a comprehensive understanding of the complex dynamics of public health and make data-driven decisions to improve population health outcomes.


  1. Edine W. Tiemersma et al., “Natural History of Tuberculosis: Duration and Fatality of Untreated Pulmonary Tuberculosis in HIV Negative Patients: A Systematic Review,” PLOS ONE 6, no. 4 (April 4, 2011): e17601, doi:10.1371/journal.pone.0017601.↩︎