# The tree disease model: X = susceptible, Y = infectious trees;
# subscripts C and A denote saplings (children) and adults.
# Parameters: nu = reproduction rate of adults, beta = disease
# transmission rate, mu = natural mortality rate of saplings,
# g = maturation rate, rho = disease-induced mortality rate.
dX_Cdt <- function(X_C, Y_C, X_A, Y_A, nu = 0.15, beta = 0.1, mu = 0.15, g = 0.04)
  nu * (X_A + Y_A) - beta * X_C * (Y_C + Y_A) - (mu + g) * X_C
dY_Cdt <- function(X_C, Y_C, Y_A, beta = 0.1, mu = 0.15, g = 0.04, rho = 0.2)
  beta * X_C * (Y_C + Y_A) - (mu + g + rho) * Y_C
dX_Adt <- function(X_C, Y_C, X_A, Y_A, beta = 0.1, g = 0.04)
  g * X_C - beta * X_A * (Y_C + Y_A)
dY_Adt <- function(X_A, Y_C, Y_A, beta = 0.1, g = 0.04, rho = 0.2)
  beta * X_A * (Y_C + Y_A) + g * Y_C - rho * Y_A

# Build the ODE system, constraining all state variables to be non-negative.
x <- eode(dX_Cdt = dX_Cdt, dY_Cdt = dY_Cdt, dX_Adt = dX_Adt, dY_Adt = dY_Adt,
          constraint = c("X_C>=0", "Y_C>=0", "X_A>=0", "Y_A>=0"))
5 Model Validation and Calibration
5.1 Data Input
Validation and calibration are two important processes for ensuring the accuracy of a model. Model validation confirms that the model correctly predicts real observations. Calibration, on the other hand, estimates model parameters from available data. Calibration is also known as data assimilation, the process of updating a model according to real observations.
To implement these processes, the first step is to input observation data. Consider a tree population that starts to be affected by a pathogen at time t = 0 (years). According to historical evidence, the pathogen is highly virulent and has killed many trees in other places. The local council is therefore worried about how the disease will affect the tree population, and how much time is left for society to adapt to the tree disease, minimising its impact on public safety and biodiversity. Scientists start monitoring the disease at t = 0, with four plots established in the woodland, containing 15, 25, 35, and 45 tree individuals, respectively. Each plot has 5 tree saplings, with the remaining individuals being adults that are able to reproduce. In the third year of the epidemic, the scientists recorded the disease incidence rate for each plot, namely the number of infectious trees (i.e. trees showing disease symptoms) divided by the total population size. They found incidence rates of 0.4, 0.8, 0.9, and 0.95 for the four plots, respectively. Now they want to model the disease impact on the tree population with a four-dimensional ODE system,

dX_C/dt = nu (X_A + Y_A) - beta X_C (Y_C + Y_A) - (mu + g) X_C
dY_C/dt = beta X_C (Y_C + Y_A) - (mu + g + rho) Y_C
dX_A/dt = g X_C - beta X_A (Y_C + Y_A)
dY_A/dt = beta X_A (Y_C + Y_A) + g Y_C - rho Y_A

where X_C and Y_C are the numbers of susceptible and infectious saplings, X_A and Y_A are the numbers of susceptible and infectious adults, nu is the reproduction rate of adults, beta is the disease transmission rate, mu is the natural mortality rate of saplings, g is the maturation rate, and rho is the disease-induced mortality rate.
To input the observation data, we create a ‘pdata’ object using its constructor function,
training_data <- pdata(
x,
init = data.frame(X_A = c(9, 19, 29, 39),
Y_A = c(1, 1, 1, 1),
X_C = c(5, 5, 5, 5),
Y_C = c(0, 0, 0, 0)),
t = c(3, 3, 3, 3),
lambda = data.frame(incidence = c(0.4, 0.8, 0.9, 0.95)),
formula = c("incidence = (Y_A + Y_C)/(X_A + X_C + Y_A + Y_C)")
)
training_data
## Population Dynamics Data.
## X_A_0 Y_A_0 X_C_0 Y_C_0 observation_time incidence
## 1 9 1 5 0 3 0.40
## 2 19 1 5 0 3 0.80
## 3 29 1 5 0 3 0.90
## 4 39 1 5 0 3 0.95
## Formula:
## incidence = (Y_A + Y_C)/(X_A + X_C + Y_A + Y_C)
The first argument of the pdata() function specifies the ODE system under consideration. The second argument, init, should be an object of data.frame type, which specifies the initial condition for each data point, and the argument t gives the observation time of each data point. For example, the first plot has 15 tree individuals (of which 5 are saplings) at t = 0 and a disease incidence rate of 0.4 at t = 3; the data point 0.4 is therefore embedded in the context given by the corresponding row of init.
The argument lambda holds the observed variables. Here we are only concerned with the disease incidence rate, so the data frame has a single column. Sometimes we may wish to add more variables, such as the tree mortality rate; in that case, the data frame should have two columns, the first being the disease incidence rate and the second the tree mortality rate. The last argument, formula, should be a character vector that specifies how each observed variable is calculated from model outputs. For example, the disease incidence rate is the total number of infectious individuals divided by the total population size. If both disease incidence and tree mortality data are used, this argument should be a vector of two elements, with the second element specifying how the tree mortality rate is calculated. Once a variable is calculated from model predictions, it can be compared with real observations to assess model performance.
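As a sketch of the two-variable case, a pdata object might look as follows. Since the model above does not track dead trees, we use a different second variable, the proportion of adults, purely for illustration; its observed values here are hypothetical, not real data:

```r
# Hypothetical two-variable example: the adult_prop values are
# made up for illustration; each element of 'formula' defines one
# observed variable in terms of the model's state variables.
training_data2 <- pdata(
  x,
  init = data.frame(X_A = c(9, 19, 29, 39),
                    Y_A = c(1, 1, 1, 1),
                    X_C = c(5, 5, 5, 5),
                    Y_C = c(0, 0, 0, 0)),
  t = c(3, 3, 3, 3),
  lambda = data.frame(incidence  = c(0.4, 0.8, 0.9, 0.95),
                      adult_prop = c(0.6, 0.7, 0.75, 0.8)),
  formula = c("incidence = (Y_A + Y_C)/(X_A + X_C + Y_A + Y_C)",
              "adult_prop = (X_A + Y_A)/(X_A + X_C + Y_A + Y_C)")
)
```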
5.2 The Loss Function
Discrepancy between model predictions and real data can be measured by a quadratic loss function, which is commonly used for model validation:

L = Σ_i (lambda_hat_i - lambda_i)^2,

where lambda_hat_i is the model-predicted value and lambda_i the observed value of the i-th data point. The function eode_lossf() implements this computation,
eode_lossf(x, pdat = training_data)
## [1] 0.2040902
The result is the sum of squared differences between the predicted and observed disease incidence rates (each ranging from 0 to 1).
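The computation behind this value can be sketched in base R. The predicted incidence rates below are hypothetical placeholders, not the model's actual predictions:

```r
# Quadratic loss: sum of squared differences between predictions
# and observations. The 'predicted' values are made-up numbers
# used only to illustrate the formula.
observed  <- c(0.4, 0.8, 0.9, 0.95)
predicted <- c(0.5, 0.7, 0.85, 0.9)   # hypothetical model output
quadratic_loss <- sum((predicted - observed)^2)
quadratic_loss
## [1] 0.025
```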
5.3 Grid Search
In Section 5.1, we created an ODE model with a disease transmission rate beta = 0.1. Now, we run the model with different values of beta, from 0.05 to 0.15 in steps of 0.01, and compare the resulting loss function values,
res <- eode_gridSearch(x, pdat = training_data, space = list(beta = seq(0.05,0.15,0.01)))
## Start Grid Search...
## Parameter: beta = 0.05, Loss Function: 1.288
## Parameter: beta = 0.06, Loss Function: 0.958
## Parameter: beta = 0.07, Loss Function: 0.656
## Parameter: beta = 0.08, Loss Function: 0.422
## Parameter: beta = 0.09, Loss Function: 0.275
## Parameter: beta = 0.1, Loss Function: 0.204
## Parameter: beta = 0.11, Loss Function: 0.187
## Parameter: beta = 0.12, Loss Function: 0.199
## Parameter: beta = 0.13, Loss Function: 0.223
## Parameter: beta = 0.14, Loss Function: 0.248
## Parameter: beta = 0.15, Loss Function: 0.271
## Finished.
## Optimal Parameters:beta = 0.11
## Loss Function:0.186891109680728
It is apparent that the optimal value of beta is 0.11, which gives the smallest loss function value (0.187) within the search grid.
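The idea of grid search can be sketched in base R with a stand-in loss function. The toy_loss() function below is a hypothetical assumption for illustration; the real computation re-runs the ODE model for each candidate beta and calls eode_lossf():

```r
# Grid search sketch: evaluate a loss at each candidate value and
# keep the minimiser. toy_loss() is a hypothetical stand-in for
# re-running the ODE model and calling eode_lossf().
toy_loss <- function(beta) (beta - 0.11)^2 + 0.187

grid   <- seq(0.05, 0.15, by = 0.01)
losses <- sapply(grid, toy_loss)
best   <- grid[which.min(losses)]
best
## [1] 0.11
```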
5.4 Simulated Annealing
Another approach is simulated annealing, a probabilistic technique for finding the global optimum. The process starts with a high ‘potential energy’, which means that parameters are disturbed at a strong magnitude. The model is run with the updated parameter values and a new loss function is calculated. If the new loss is smaller than the old one, the updated model is retained; otherwise, the previous model is kept. This process is repeated with a progressively weaker disturbance, until the ‘potential energy’ falls to zero. Now we disturb beta, with a maximum disturbance of 5% over 20 annealing steps,
res <- eode_simuAnnealing(x, pdat = training_data, paras = "beta", max_disturb = 0.05, AnnN = 20)
## Start Simulated Annealing...
## Step 0 - loss function =0.204090195414852
## Disturbance:5%...
## Step1- parameters remain unchanged.
## Disturbance:4.75%...
## Step2- parameter resetted:beta=0.1- loss function =0.201593315697878
## Disturbance:4.5%...
## Step3- parameters remain unchanged.
## Disturbance:4.25%...
## Step4- parameter resetted:beta=0.101- loss function =0.193480845163448
## Disturbance:4%...
## Step5- parameter resetted:beta=0.103- loss function =0.188052744345989
## Disturbance:3.75%...
## Step6- parameter resetted:beta=0.107- loss function =0.187196921397421
## Disturbance:3.5%...
## Step7- parameter resetted:beta=0.108- loss function =0.186980752562364
## Disturbance:3.25%...
## Step8- parameters remain unchanged.
## Disturbance:3%...
## Step9- parameter resetted:beta=0.111- loss function =0.186896786595896
## Disturbance:2.75%...
## Step10- parameters remain unchanged.
## Disturbance:2.5%...
## Step11- parameters remain unchanged.
## Disturbance:2.25%...
## Step12- parameter resetted:beta=0.11- loss function =0.186875555775785
## Disturbance:2%...
## Step13- parameters remain unchanged.
## Disturbance:1.75%...
## Step14- parameters remain unchanged.
## Disturbance:1.5%...
## Step15- parameters remain unchanged.
## Disturbance:1.25%...
## Step16- parameters remain unchanged.
## Disturbance:1%...
## Step17- parameters remain unchanged.
## Disturbance:0.75%...
## Step18- parameters remain unchanged.
## Disturbance:0.5%...
## Step19- parameters remain unchanged.
## Disturbance:0.25%...
## Step20- parameters remain unchanged.
## Finished.
## Parameters:nu=0.15, beta=0.11, mu=0.15, g=0.04, rho=0.2- loss function =0.186875555775785
After 20 steps of simulated annealing, the algorithm also arrives at an optimal value of beta = 0.11, consistent with the grid search result.
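The annealing scheme described above can be sketched in base R. As before, toy_loss() is a hypothetical stand-in for re-running the model and calling eode_lossf(), and the decay schedule is an illustrative assumption rather than the package's exact implementation:

```r
# Simulated annealing sketch: perturb the parameter, keep the change
# only if the loss improves, and shrink the disturbance linearly to
# zero. toy_loss() stands in for the real model-based loss.
toy_loss <- function(beta) (beta - 0.11)^2 + 0.187

anneal <- function(beta, max_disturb = 0.05, AnnN = 20) {
  current_loss <- toy_loss(beta)
  for (step in seq_len(AnnN)) {
    disturb   <- max_disturb * (1 - (step - 1) / AnnN)  # decaying magnitude
    candidate <- beta * (1 + runif(1, -disturb, disturb))
    new_loss  <- toy_loss(candidate)
    if (new_loss < current_loss) {  # retain only improvements
      beta         <- candidate
      current_loss <- new_loss
    }
  }
  beta
}

set.seed(1)
anneal(0.1)
```

Because candidates are only ever accepted when they reduce the loss, the final loss can never exceed the starting loss.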