9 Predictive Modelling and Beyond
Learning Objectives
- How to use models’ predictions on new data
- Time series analysis
- Mixed models
In the dynamic landscape of public health, the ability to forecast and predict future trends is essential for effective decision-making and the implementation of timely interventions. This chapter provides an overview of predictive modelling, focusing on the challenges of applying predictions to new data, the use of time series analysis, and mixed models to anticipate the trajectory of infectious diseases and health metrics. By exploring the underlying patterns and analysing historical data, we estimate the disease burden and evaluate the impact of interventions on population health. We demonstrate how time series analysis and mixed models can be tailored to address specific research questions, such as predicting disease outbreaks, estimating disease burden, and assessing the impact of interventions on population health.
9.1 Predictions About the Future
The previous chapters provided an attempt at estimating future outcomes based on the application of various type of statistical and machine learning techniques. Here we look at how to use these predictions on new data.
When making predictions on new data, it is important to consider the generalisability of the model. A model that performs well on the training data may not necessarily generalise well to new, unseen data. To evaluate the performance of the model we used the cross-validation technique, but there are more such as hold-out validation, k-fold cross-validation, leave-one-out cross-validation, and bootstrapping. These methods help assess the model’s ability to make accurate predictions on data not used during training, providing a more reliable estimate of the model’s performance in real-world scenarios.
Once the model has been trained and evaluated, it is used to make predictions on new data. This process involves applying the model to the new data to generate forecasts or estimates of the outcome of interest. For example, in the case of infectious diseases, predictive models can be used to forecast the trajectory of an outbreak, estimate the number of cases, or evaluate the impact of interventions on disease transmission. By leveraging historical data and the insights gained from the model, we can effectively evaluate the model performance and make informed decisions based on the predictions.
9.2 Example: Dengue Test Predictions for 2017-2021
Predictive modelling is a powerful tool for forecasting future trends. Here we test the Dengue’s model made with mlr3 meta-package in Chapter 8. The model was trained on data from 1990 to 2016 and now tested for years 2017 to 2021.
First we load the original data and name it old_data
.
Then, we load Dengue data from 2017 to 2021 and name it new_data
.
In the next step, we use the trained models from Chapter 8 to make predictions on the new data. The predictions are stored in new_pred_regr.cv_glmnet
and new_pred_regr.xgboost
. The resampling results: rr1$learners
and rr2$learners
, contain the trained models, in particular the first of 5 folds cross validation sample is used for the predictions.
new_pred_regr.cv_glmnet <-
rr1$learners[[1]]$predict_newdata(new_data, task = rr1$task)
We notice that the models are doing well, when old data are compared to the new data. The predictions are close to the actual values. Much more can be done to improve the models, such as hyperparameter tuning, feature engineering, and model ensembles, but this is a good point to start. We could also look at the confidence intervals of the predictions, to see how certain we are about the predictions, and so on.
This is the result of the cv_glmnet
application on new data:
regr.cv_glmnet_new <- as.data.table(new_pred_regr.cv_glmnet) %>%
mutate(learner = "regr.cv_glmnet") %>%
cbind(new_data)
regr.cv_glmnet_new %>%
ggplot(aes(x = year)) +
geom_line(data=old_data,aes(y = DALYs))+
geom_line(aes(y = DALYs),color="brown") +
geom_line(aes(y = response), linetype="dashed") +
facet_wrap(vars(location_id),
labeller = as_labeller(c(`182` = "Malawi",
`191` = "Zambia",
`169` = "Central African Republic")),
scales = "free_y") +
labs(title = "Dengue Predictions for 2017-2021: cv_glmnet")
The same steps are taken for the xgboost
, and the result is shown below.
new_pred_regr.xgboost <-
rr2$learners[[1]]$predict_newdata(new_data, task = rr2$task)
regr.xgboost_new <- as.data.table(new_pred_regr.xgboost) %>%
mutate(learner = "regr.xgboost") %>%
cbind(new_data)
regr.xgboost_new %>%
ggplot(aes(x = year)) +
geom_line(data=old_data,aes(y = DALYs))+
geom_line(aes(y = DALYs),color="brown") +
geom_line(aes(y = response), linetype="dashed") +
facet_wrap(vars(location_id),
labeller = as_labeller(c(`182` = "Malawi",
`191` = "Zambia",
`169` = "Central African Republic")),
scales = "free_y") +
labs(title = "Dengue Predictions for 2017-2021: xgboost")
In conclusion, predictive modelling is a valuable tool for forecasting future trends and making informed decisions based on historical data. By leveraging the insights gained from the models, we can anticipate the trajectory of infectious diseases, estimate disease burden, and evaluate the impact of interventions on population health. The models can be further refined and optimised to improve their predictive performance and provide more accurate forecasts. By combining predictive modelling with time series analysis and mixed models, we can gain a comprehensive understanding of the complex dynamics of public health and make data-driven decisions to improve population health outcomes.
9.3 Time Series Analysis
One important side of the analysis is to consider the evolution of the phenomenon in time. We can do this by using time as a factor in the model. Time series analysis serves as a tool for understanding and forecasting temporal patterns. Techniques such as ARIMA (AutoRegressive Integrated Moving Average) models offer a systematic approach to modelling time-dependent data, capturing seasonal variations, trends, and irregularities to generate accurate forecasts.
Mixed models, on the other side, provide a versatile framework for incorporating various sources of variability and correlation within the data. Also known as hierarchical models, multilevel models, or random effects models, are a type of statistical model that include both fixed effects and random effects. By combining fixed effects with random effects, mixed models accommodate complex data structures, such as hierarchical or longitudinal data, while accounting for individual-level and group-level factors that may influence the outcome of interest.
To analyse temporal data, where observations are collected at regular intervals over time, time series analysis allows for studying the patterns, trends, and dependencies present in the data to make forecasts or infer relationships. Time series data often exhibit inherent characteristics such as trend, seasonality, cyclic patterns, and irregular fluctuations, which can be explored using various methods such as decomposition, smoothing, and modelling. This analysis is widely used in fields like economics, finance, epidemiology, and environmental science to understand and predict future trends, identify anomalies, and make informed decisions based on historical patterns.
Decomposition methods play a crucial role by pulling out the underlying structure of temporal data. The components of a time series are trend
, seasonality
, and random fluctuations
. Understanding the trend allows for the identification of long-term changes or shifts in the data, while detecting seasonal patterns helps uncover recurring fluctuations that may follow seasonal or cyclical patterns. By separating these components, through decomposition methods, it is possible to discover the temporal dynamics within the data, facilitating more accurate forecasting and predictive modelling. Moreover, decomposing time series data can aid in anomaly detection, trend analysis, and seasonal adjustment, making it a fundamental tool for interpreting complex temporal phenomena.
Modelling time series data often requires a combination of techniques to capture its complex nature. Mixed models, splines, and ARIMA are powerful tools commonly used for this purpose. Splines provide a flexible way to capture nonlinear relationships and smooth out the data, which is particularly useful for capturing trends or seasonal patterns. ARIMA models, on the other hand, are adept at capturing the autocorrelation structure present in the data, including both short-term and long-term dependencies.
Time series analysis can also be performed on estimated values produced by a machine learning model. This involves using vectors of estimates, actual values (ground truth), and time (such as years) to apply a time series model. By incorporating these elements, the time series analysis can help identify trends, seasonal patterns, and other temporal structures within the data. This approach enhances the robustness and accuracy of the predictions by combining the strengths of machine learning models with traditional time series techniques.
9.4 Example: SDI Time Series Analysis with Mixed Effect Models
From forecasting the trajectory of emerging pathogens to evaluating the long-term implications of public health policies, predictive modelling offers valuable insights into the future of global health. As seen in Chapter 5, the consideration of other factors such as the Social Demographic Index (SDI) can provide a comprehensive understanding of the impact social and economic determinants can have on health outcomes.
As an index, SDI is composed of the geometric mean of the total fertility rate (TFR) for those younger than 25 years old (TFU25), the mean education for those 15 years old and older (EDU15+), and lag-distributed income (LDI) per capita. It ranges from 0 to 1 but it is usually multiplied by 100 for a scale of 0 to 100.
Standard Calculation of TFR
\text{TFR} = \sum (\text{ASFR}_i \times 5) \tag{9.1}
where \text{ASFR}_i represents the age-specific fertility rates for age group i , and the sum is typically calculated over all childbearing age groups (often 5-year intervals from ages 15-49).
Standard Calculation of SDI
The SDI is calculated as the geometric mean of these three components:
SDI = \sqrt[3]{TFU25 \times EDU15+ \times LDI} \tag{9.2}
Each component (TFU25, EDU15+, and LDI) is normalised before taking the geometric mean to ensure they are on a comparable scale.
In summary, the TFR formula is specific to calculating fertility rates, while the SDI formula incorporates a specific subset of the TFR (TFU25) along with education and income metrics to create a broader socio-demographic index.
In general, the level of this index helps identify key drivers in the development of health outcomes. SDI can be used as a predictor in predictive models to estimate the burden of diseases, mortality rates, and other health metrics. For example, the relationship between SDI and the incidence of a disease can be modelled using a logistic regression model:
\log \left( \frac{p}{1 - p} \right) = \beta_0 + \beta_1 \times \text{SDI} + \beta_2 \times X_2 + \cdots + \beta_k \times X_k \tag{9.3}
Where, p is the probability of the incidence of the disease, \log \left( \frac{p}{1 - p} \right) is the log-odds (logit) of the disease incidence, \beta_0 is the intercept term, \beta_1, \beta_2, \ldots, \beta_k are the coefficients for the predictor variables, and X_2, \ldots, X_k are other potential predictor variables that might influence the disease incidence.
To convert the log-odds
back to the probability of disease incidence given the value of SDI:
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \times \text{SDI})}} \tag{9.4}
In this example we use the SDI values from 1990 to 2019, for 3 locations plus the Global region and evaluate the difference in SDI average across the time.
sdi90_19 |>
filter(location %in% c("Global", "Italy",
"France", "Germany")) |>
ggplot(aes(x = year, y = value,
linetype = location)) +
geomtextpath::geom_textline(aes(label = location),
hjust = 0, vjust = 0) +
labs(title = "Social Demographic Index (SDI) from 1990 to 2019")+
theme(legend.position = "none")
Then selecting only France as location to analyse the time series using spline interpolation and ARIMA(1,0,0), we can decompose the time series into its components and analyse the autocorrelation function.
library(tsibble)
sdi_fr_ts <- tsibble::as_tsibble(sdi_fr, index = year)
sdi_fr_ts |> autoplot()
dcmp <- sdi_fr_ts |>
model(stl = STL(value))
components(dcmp) |> head()
#> # A dable: 6 x 6 [1Y]
#> # Key: .model [1]
#> # : value = trend + remainder
#> .model year value trend remainder season_adjust
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 stl 1990 0.738 0.738 -0.000400 0.738
#> 2 stl 1991 0.743 0.744 -0.000560 0.743
#> 3 stl 1992 0.75 0.749 0.00128 0.75
#> 4 stl 1993 0.755 0.754 0.00136 0.755
#> 5 stl 1994 0.759 0.758 0.000800 0.759
#> 6 stl 1995 0.763 0.762 0.000680 0.763
“…three types of time series patterns: trend, seasonality and cycles. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (often just called the trend for simplicity). Thus we can think of a time series as comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series). Rob J Hyndman”
y_t=S_t+T_t+R_t \tag{9.5} y_t=S_t*T_t*R_t \tag{9.6} log(y_t)=log(S_t)+log(T_t)+log(R_t) \tag{9.7}
The remainder component shown in the bottom panel is what is left over when the seasonal and trend-cycle components have been subtracted from the data.
components(dcmp) |> autoplot()
components(dcmp) |>
as_tsibble() |>
autoplot(value) +
geom_line(aes(y = trend),
linetype="dashed")+
labs(title = "Trend line of SDI in France")
9.4.1 Autocorrelation function
Time series that show no autocorrelation are called white noise. In this case we have autocorrelation. So, no white noise.
sdi_fr_ts |>
ACF(value) |>
autoplot()
sdi_fr_ts |>
features(value,
features = unitroot_kpss)
#> # A tibble: 1 × 2
#> kpss_stat kpss_pvalue
#> <dbl> <dbl>
#> 1 1.09 0.01
Seasonal difference is not required
sdi_fr_ts |>
features(value,
features = unitroot_nsdiffs)
#> # A tibble: 1 × 1
#> nsdiffs
#> <int>
#> 1 0
Auto-ARIMA model is used to establish the best fit:
if needed install.packages("urca")
fit <- sdi_fr_ts |>
model(ARIMA(value))
fit |> report()
#> Series: value
#> Model: ARIMA(1,1,0) w/ drift
#>
#> Coefficients:
#> ar1 constant
#> 0.5971 0.0013
#> s.e. 0.1562 0.0002
#>
#> sigma^2 estimated as 8.634e-07: log likelihood=162.46
#> AIC=-318.92 AICc=-317.96 BIC=-314.82
9.4.2 Partial Autocorrelations
sdi_fr_ts |>
PACF(value) |>
autoplot()
sdi_fr_ts |>
gg_tsdisplay(y = value, plot_type = "partial")
9.4.3 Model Ensembles
To improve the predictive performance, predictions of multiple single learners can be aggregated to a model ensemble by combining the forecasts of individual models.
Ensemble learning is a machine learning technique that leverages the diversity of individual models to make more accurate predictions and reduce overfitting. Ensemble methods can reduce the variance and bias of the predictions, leading to more accurate and reliable results. Ensemble techniques such as bagging, boosting, and stacking leverage the diversity of individual models to create a stronger collective model that outperforms its components. These methods are widely used in machine learning and predictive modelling to enhance the predictive power of the model and achieve better generalisation performance on unseen data.
In case is required install the {urca}
package:
install.packages("urca")
caf_fit <- sdi_fr_ts |>
model(
arima210 = ARIMA(value ~ pdq(2, 1, 0)),
arima013 = ARIMA(value ~ pdq(0, 1, 3)),
stepwise = ARIMA(value),
search = ARIMA(value, stepwise = FALSE)
)
caf_fit |>
pivot_longer(cols = everything(),
names_to = "Model name",
values_to = "Orders")
#> # A mable: 4 x 2
#> # Key: Model name [4]
#> `Model name` Orders
#> <chr> <model>
#> 1 arima210 <ARIMA(2,1,0) w/ drift>
#> 2 arima013 <ARIMA(0,1,3) w/ drift>
#> 3 stepwise <ARIMA(1,1,0) w/ drift>
#> 4 search <ARIMA(1,1,0) w/ drift>
glance(caf_fit) |>
arrange(AICc) |>
select(.model:BIC)
#> # A tibble: 4 × 6
#> .model sigma2 log_lik AIC AICc BIC
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 stepwise 0.000000863 162. -319. -318. -315.
#> 2 search 0.000000863 162. -319. -318. -315.
#> 3 arima210 0.000000896 162. -317. -315. -311.
#> 4 arima013 0.000000950 162. -314. -312. -308.
Best ARIMA model selected is ARIMA(1,1,0):
caf_fit |>
select(search) |>
gg_tsresiduals()
caf_fit |>
forecast(h = 5) |>
filter(.model == "search") |>
ggplot(aes(x = year, y = .mean)) +
geom_line(data = sdi_fr_ts,
aes(y = value)) +
geom_line(color = "purple", linetype = "dashed")
9.5 Mixed Models
Mixed models are a powerful statistical tool for analysing data with a hierarchical or nested structure, where observations are grouped within higher-level units. By incorporating both fixed and random effects, mixed models can account for the variability within and between groups, providing a flexible framework for modelling complex data structures. Mixed models are widely used in various fields, and particularly useful in fields like healthcare and infectious diseases where data might be collected across different subjects or time points.
9.5.1 Mixed-Effects Models in Estimating YLDs
Mixed-effects models are particularly useful in estimating YLDs because they can handle data that is hierarchically structured (e.g., patients within clinics, repeated measures over time) and account for both fixed effects (e.g., age, gender) and random effects (e.g., variability between patients or clinics).
Fixed Effects: These are the primary effects of interest and include variables such as age, gender, year, and specific interventions or treatments.
Random Effects: These capture the variability that is not explained by the fixed effects, such as differences between patients, clinics, or regions.
By using mixed-effects models, we can better account for the within-subject and between-subject variability, providing more accurate estimates of YLDs.
9.5.2 Different Ratios Used in Forecasting Non-Fatal Disease Burden
When forecasting the non-fatal disease burden, different mortality ratios are used to assess the dynamics and progression of diseases. Here are three important ratios:
MI Ratio (Mortality to Incidence Ratio): MI=M/I
This ratio compares the number of deaths due to a disease to the number of new cases of the disease in a specific time period. It is considered a severity indicator as it indicates the lethality of the disease after diagnosis. A high MI suggests that a higher proportion of diagnosed cases result in death, indicating a severe condition.
MP Ratio (Mortality to Prevalence Ratio): MP=M/P
This ratio compares the number of deaths due to a disease to the total number of existing cases (both new and old) at a given time. It is used in the Current Risk Assessment (CRA) as it reflects the current fatality risk among those living with the disease. A high MP suggests a significant risk of death for those currently affected. It helps in understanding the long-term burden and survivability of a disease.
PI Ratio (Prevalence to Incidence Ratio): PI=P/I
This ratio compares the total number of existing cases to the number of new cases in a specific time period. As a chronicity indicator, it indicates how long individuals live with the disease. A high PI suggests that individuals live longer with the disease, indicating it is more chronic. It provides insight into the chronic nature of a disease and the duration that individuals typically live with the disease.
In R, mixed-effects models can be integrated with Eigen, a C++ template library for linear algebra which significantly speeds up the computations involved in fitting mixed-effects models.
9.6 Example: Tuberculosis YLDs Estimation Using Mixed-Effects Models
To estimate Years Lived with Disability (YLDs) using mixed-effects models, we can use the lme4
package which provides functions to fit linear and generalised linear mixed-effects models. We will provide an example of how to estimate YLDs with mixed-effects models. We use the g7_hmetrics
dataset found in hmsidwR package containing YLDs, deaths, incidence and prevalence for Respiratory infections and tuberculosis in 2010 and 2019 to demonstrate the estimation of YLDs using mixed-effects models.
library(lme4)
library(tidyverse)
data <- hmsidwR::g7_hmetrics %>%
filter(measure %in% c("Incidence",
"Prevalence",
"YLDs",
"Deaths"),
str_detect(cause, "Tuberculosis"),
!year == 2021,
!location == "Global",
metric == "Rate",
sex == "both") %>%
select(-metric, -sex, -cause, -upper, -lower) %>%
pivot_wider(names_from = measure,
values_from = val)
data %>% head()
#> # A tibble: 6 × 6
#> location year Deaths YLDs Prevalence Incidence
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 US 2010 0.164 1.05 9991. 2.26
#> 2 Italy 2010 0.303 1.71 10306. 6.71
#> 3 UK 2010 0.359 2.67 8670. 12.0
#> 4 Canada 2010 0.177 1.06 8840. 4.06
#> 5 Germany 2010 0.286 1.79 7758. 4.86
#> 6 UK 2019 0.238 1.84 7534. 9.11
library(ggpattern)
ggplot(data,
aes(x = fct_reorder(location, YLDs),
y = YLDs, group = year)) +
ggpattern::geom_col_pattern(aes(pattern = factor(year)),
position = "dodge",
fill= 'white',
colour= 'black') +
labs(title="YLDs for Tuberculosis by Location and Year",
x = "Location", pattern = "Year")
hmsidwR::disweights %>%
filter(str_detect(sequela, "tuberculosis")) %>%
select(cause1, severity, dw)
#> # A tibble: 6 × 3
#> cause1 severity dw
#> <chr> <chr> <dbl>
#> 1 Tuberculosis mean 0.333
#> 2 Tuberculosis mean 0.333
#> 3 Tuberculosis mean 0.333
#> 4 Tuberculosis mean 0.333
#> 5 Tuberculosis mean 0.333
#> 6 Tuberculosis mean 0.333
The disability weight is 0.333 for tuberculosis, and the duration derived from the GBD 2021 was estimated based on a systematic review and adjusted for treated and untreated cases. The duration of tuberculosis was estimated to be 3 years1 for untreated cases and 6 months for treated cases.
# Define the formula
formula <- YLDs ~ Prevalence + year + (1 | location)
# Fit the mixed-effects model
model <- lmer(formula, data = data)
model
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: YLDs ~ Prevalence + year + (1 | location)
#> Data: data
#> REML criterion at convergence: 34.6921
#> Random effects:
#> Groups Name Std.Dev.
#> location (Intercept) 0.6202
#> Residual 0.1773
#> Number of obs: 12, groups: location, 6
#> Fixed Effects:
#> (Intercept) Prevalence year
#> 67.7988999 0.0001619 -0.0336388
#> fit warnings:
#> Some predictor variables are on very different scales: consider rescaling
# Extract fixed effects
fixed_effects <- fixef(model)
fixed_effects
#> (Intercept) Prevalence year
#> 67.7988999255 0.0001619458 -0.0336387717
The model is:
YLDs = 67.7988999255 + 0.0001619458*Prevalence - 0.0336387717*year \tag{9.8}
In particular, a prevalence coefficient of 0.0001619458 means that for each unit increase in prevalence, YLDs increase by 0.0001619458, while a year coefficient of -0.0336387717 means that for each unit increase in year, YLDs decrease by 0.0336387717.
# Extract random effects
random_effects <- ranef(model)
random_effects
#> $location
#> (Intercept)
#> Canada -0.40736806
#> Germany 0.44667836
#> Italy -0.09317570
#> Japan -0.06401591
#> UK 0.87190491
#> US -0.75402361
#>
#> with conditional variances for "location"
The output of random_effects is a list of data frames with the random effects for each level of the random effect variable (location in this case), which means that the model is
\begin{split} YLDs = 67.7988999255 \\ + 0.0001619458 \cdot \text{Prevalence} \\ - 0.0336387717 \cdot \text{year} \\ + \text{random effect for each location} \end{split} \tag{9.9}
Then, predictions are made for the YLDs using the mixed-effects model on observed data and new data. The predictions are compared with the actual YLDs to evaluate the model’s performance.
# Predict YLDs for observed data
predictions <- predict(model,
data,
allow.new.levels = TRUE)
data$predicted_YLD <- predictions
ggplot(data %>% filter(year == 2019,
!location=="Global"),
aes(x = YLDs, y = predicted_YLD,
group = location)) +
geom_point(aes(shape = factor(location))) +
geom_abline() +
labs(title = "YLDs Predicted vs Estimated",
x = "YLDs",
y = "Predicted YLD",
shape = "Location")
# Create new data for prediction
new_data <- data.frame(Prevalence = c(13000, 7000, 10200),
year = c(2021, 2021, 2021),
location = c("Japan", "UK", "US"))
# Predict YLDs for new data
predictions <- predict(model,
new_data,
allow.new.levels = TRUE)
And the predictions are shown in the following plot:
data %>%
mutate(location = as.factor(location)) %>%
filter(location%in%c("Japan", "UK", "US")) %>%
ggplot(aes(x = factor(year), y = predicted_YLD,
group = location, fill=location)) +
ggpattern::geom_col_pattern(aes(pattern = location),
position = "stack",
fill= 'white',
colour= 'black') +
ggpattern::geom_col_pattern(data = bind_cols(new_data,
predicted_YLD = predictions),
aes(pattern = location),
position = "stack",
fill= 'white',
colour= 'black') +
labs(title = "Predicted YLDs for Tuberculosis",
x = "Year",
y = "Predicted YLDs",
fill = "Location")
9.7 Summary
Predictive modelling is a valuable tool for forecasting future trends and making informed decisions based on historical data. By leveraging the insights gained from the models, we can anticipate the trajectory of infectious diseases, estimate disease burden, and evaluate the impact of interventions on population health. The models can be further refined and optimised to improve their predictive performance and provide more accurate forecasts. By combining predictive modelling with time series analysis and mixed models, we can gain a comprehensive understanding of the complex dynamics of public health and make data-driven decisions to improve population health outcomes.