## 17.4 Evaluating dynamic models

Models of agents and environments are often created together and can be hard to distinguish from each other. Nevertheless, both need to be evaluated. This initially includes ensuring that the agent and environment function as intended, but also involves evaluating their performance and the consequences of their interaction. This section illustrates some methods and possible criteria.

### 17.4.1 Heuristics vs. optimal strategies

Our simulation so far paired a learning agent with a simple MAB problem. However, we can imagine and implement many alternative agent strategies. Typical questions raised in such contexts include:

- What is the performance of Strategy X?
- Is Strategy X better or worse than Strategy Y?

While all MAB settings invite strategies that balance exploration with exploitation, we can distinguish between two general approaches towards creating and evaluating strategies:

*Heuristic* approaches create and evaluate strategies that are simple enough to be empirically plausible. *Heuristics* can be defined as adaptive strategies that ignore information to make efficient, accurate, and robust decisions under conditions of uncertainty (see Gigerenzer & Gaissmaier, 2011; Neth & Gigerenzer, 2015, for details). As many researchers reflexively associate heuristics with inferior performance, we emphasize that simple strategies are not necessarily worse than computationally more expensive strategies. The hallmark of heuristics is that they do not aim for optimality, but rather for simplicity by ignoring some information. Whether they turn out to be worse or better than alternative strategies is an empirical question and mostly depends on the criteria employed.

An example of a heuristic in a MAB setting with stochastic options is *sample-then-greedy*.
This heuristic explores each option for some number \(s\) of trials, before exploiting the seemingly better one for the remaining trials.
Clearly, the success of this heuristic varies as a function of \(s\):
If \(s\) is too small, an agent may not be able to discriminate between options and risks exploiting an inferior option. By contrast, larger values of \(s\) reduce the uncertainty about the options’ estimated rewards, but risk wasting too many trials on exploration.
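
The trade-off in choosing \(s\) can be explored by simulation. The following Python sketch is one possible implementation, not a definitive one. It assumes options with binary rewards and stable reward probabilities; the function names, the tie-breaking rule, and the probabilities used in the example are illustrative assumptions:

```python
import random


def sample_then_greedy(probs, n_trials, s, rng):
    """Total reward of sample-then-greedy on a Bernoulli bandit.

    probs:    true reward probabilities (unknown to the agent; assumed stable)
    n_trials: total number of trials
    s:        number of exploration trials per option
    """
    n_options = len(probs)
    n_explore = min(s * n_options, n_trials)
    wins = [0] * n_options  # rewards observed during exploration
    total = 0

    # Exploration: sample the options in turn.
    for t in range(n_explore):
        choice = t % n_options
        reward = 1 if rng.random() < probs[choice] else 0
        wins[choice] += reward
        total += reward

    # Exploitation: commit to the option with the most observed rewards
    # (assumption: ties are resolved in favor of the first option).
    best = max(range(n_options), key=lambda i: wins[i])
    for _ in range(n_trials - n_explore):
        total += 1 if rng.random() < probs[best] else 0

    return total


def mean_reward(probs, n_trials, s, n_sims=2000, seed=42):
    """Average total reward over n_sims simulated scenarios."""
    rng = random.Random(seed)
    totals = [sample_then_greedy(probs, n_trials, s, rng) for _ in range(n_sims)]
    return sum(totals) / n_sims


if __name__ == "__main__":
    # Hypothetical environment: two options with reward probabilities 0.4 and 0.6.
    for s in (1, 5, 10, 25, 50):
        print(s, mean_reward([0.4, 0.6], n_trials=100, s=s))
```

Running the loop for different values of \(s\) shows how the average reward depends on the length of the exploration phase. With \(s = 50\) and 100 trials, the agent never exploits at all.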

The same considerations show that the performance of any strategy depends on additional assumptions regarding the nature of the task environment. For instance, estimating the characteristics of an option by sampling it first presupposes that options remain stable for the duration of a scenario.

*Optimality* approaches create and evaluate strategies that maximize some performance criterion. Typically, total reward is maximized at the expense of computational effort, but when the reward is fixed, it is also common to minimize the time needed to reach some goal. An extensive scientific literature is concerned with the discovery and verification of optimal MAB strategies. Most of these approaches strive for optimization under constraints, which renders the optimization problem even harder.

An example of an optimality approach towards MABs is the computation of the so-called *Gittins index* for each option. This index essentially incorporates all that can be known so far and computes the value of each option at this moment, given that we only choose optimal actions for all remaining trials (see Page, 2018, p. 322ff., for an example).
On every trial, the option with the highest Gittins index is the optimal choice. As this method is conceptually simple but computationally expensive, it is unlikely that organisms solve MAB problems in precisely this way, though they may approximate its performance by other mechanisms.
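
The idea of valuing an option under the assumption that all later choices are optimal can be illustrated for a small problem. The following Python sketch does not compute the Gittins index itself. Instead, it uses a related dynamic-programming computation (backward induction over a finite number of remaining trials). It assumes two options with binary rewards and beliefs expressed as Beta distributions; all function names are hypothetical:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def optimal_value(a1, b1, a2, b2, remaining):
    """Expected total reward when choosing optimally for `remaining` trials.

    (a1, b1) and (a2, b2) are the Beta parameters describing the current
    beliefs about the reward probabilities of options 0 and 1.
    """
    if remaining == 0:
        return 0.0
    return max(action_value(a1, b1, a2, b2, remaining, option)
               for option in (0, 1))


def action_value(a1, b1, a2, b2, remaining, option):
    """Value of choosing `option` now and acting optimally afterwards."""
    if option == 0:
        p = a1 / (a1 + b1)  # believed probability of a reward
        after_win = optimal_value(a1 + 1, b1, a2, b2, remaining - 1)
        after_loss = optimal_value(a1, b1 + 1, a2, b2, remaining - 1)
    else:
        p = a2 / (a2 + b2)
        after_win = optimal_value(a1, b1, a2 + 1, b2, remaining - 1)
        after_loss = optimal_value(a1, b1, a2, b2 + 1, remaining - 1)
    return p * (1 + after_win) + (1 - p) * after_loss


if __name__ == "__main__":
    # Both options have a believed reward probability of 0.5, but option 1
    # (Beta(1, 1)) is less well known than option 0 (Beta(2, 2)).
    for option in (0, 1):
        print(option, action_value(2, 2, 1, 1, remaining=2, option=option))
```

In this example, both options have the same expected immediate reward, but the less explored option has the higher value when two trials remain, since its outcome is more informative for the final choice. The number of belief states grows quickly with the number of remaining trials, which illustrates why such computations become expensive.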

In a *Bayesian* agent framework, an agent has prior beliefs about the distribution of rewards of each option and adjusts these beliefs based on its experience in an optimal fashion. Interestingly, however, incorporating experienced rewards in an optimal (or rational) fashion does not yet yield an optimal strategy for choosing actions.
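
The distinction between updating beliefs and choosing actions can be made concrete in code. The following Python sketch assumes binary rewards and a uniform Beta(1, 1) prior for each option. The class and function names are hypothetical, and the two choice rules (greedy choice by expected reward and choice by sampling from the beliefs, also known as Thompson sampling) are merely two of many possible ways of acting on the same beliefs:

```python
import random


class BetaBelief:
    """Belief about one option's reward probability, as a Beta distribution."""

    def __init__(self, alpha=1.0, beta=1.0):
        self.alpha = alpha  # prior parameter plus number of observed rewards
        self.beta = beta    # prior parameter plus number of observed non-rewards

    def update(self, reward):
        """Bayesian updating after observing a binary reward (1 or 0)."""
        if reward:
            self.alpha += 1
        else:
            self.beta += 1

    def mean(self):
        """Expected reward probability under the current belief."""
        return self.alpha / (self.alpha + self.beta)


def choose_greedy(beliefs):
    """Choose the option with the highest expected reward."""
    return max(range(len(beliefs)), key=lambda i: beliefs[i].mean())


def choose_by_sampling(beliefs, rng):
    """Draw one value from each belief and choose the highest draw."""
    draws = [rng.betavariate(b.alpha, b.beta) for b in beliefs]
    return max(range(len(draws)), key=lambda i: draws[i])


if __name__ == "__main__":
    rng = random.Random(1)
    beliefs = [BetaBelief(), BetaBelief()]
    beliefs[0].update(1)  # option 0 was tried once and yielded a reward
    print("greedy choice:", choose_greedy(beliefs))
    picks = [choose_by_sampling(beliefs, rng) for _ in range(1000)]
    print("share of option 1 when sampling:", picks.count(1) / 1000)
```

Both rules rely on identical, correctly updated beliefs, yet they select different actions: the greedy rule always chooses option 0 here, whereas the sampling rule still explores option 1 on some trials.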

Comparisons between a range of strategies may yield surprising results. For instance, algorithms that perform well on some problems may perform poorly on others. Moreover, simple heuristics often outperform theoretically superior algorithms by a substantial margin (Kuleshov & Precup, 2014).

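In addition to heuristic and optimal strategies, we typically need a third category that bounds the range of possible performance: *baselines*. For instance, the performance of choosing options at random provides a lower benchmark against which other strategies can be compared.

Deciding between strategies thus requires running competitions between them (see the methodology of evaluating models by benchmarking and rational task analysis, RTA, below).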

### 17.4.2 Benchmarking strategy performance

Evaluating a single strategy in isolation tells us little. To judge its performance, we need to compare multiple strategies on the same task.

Beware of erroneous results: Benchmarking a range of strategies helps to avoid drawing premature conclusions regarding the (ir-)rationality of agents.

See the recommendations of RTA (Neth et al., 2016).
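
A benchmark of this kind can be sketched in a few lines of Python. The following example is an illustration under stated assumptions, not an implementation of the RTA methodology. It assumes a stable environment with binary rewards and compares two simple strategies (sample-then-greedy and an epsilon-greedy rule, both with arbitrarily chosen parameters) against two baselines: random choice and an "oracle" that knows the best option in advance. All names and parameter values are hypothetical:

```python
import random
import statistics


def pull(probs, i, rng):
    """Binary reward from option i."""
    return 1 if rng.random() < probs[i] else 0


def random_baseline(probs, n_trials, rng):
    """Lower baseline: choose an option at random on every trial."""
    return sum(pull(probs, rng.randrange(len(probs)), rng)
               for _ in range(n_trials))


def oracle_baseline(probs, n_trials, rng):
    """Upper baseline: always choose the truly best option (uses probs)."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return sum(pull(probs, best, rng) for _ in range(n_trials))


def sample_then_greedy(probs, n_trials, rng, s=10):
    """Sample each option s times, then exploit the seemingly best one."""
    n = len(probs)
    n_explore = min(s * n, n_trials)
    wins, total = [0] * n, 0
    for t in range(n_explore):
        reward = pull(probs, t % n, rng)
        wins[t % n] += reward
        total += reward
    best = max(range(n), key=lambda i: wins[i])
    return total + sum(pull(probs, best, rng)
                       for _ in range(n_trials - n_explore))


def epsilon_greedy(probs, n_trials, rng, epsilon=0.1):
    """Explore at random with probability epsilon, otherwise exploit."""
    n = len(probs)
    wins, pulls, total = [0] * n, [0] * n, 0
    for _ in range(n_trials):
        if rng.random() < epsilon:
            choice = rng.randrange(n)
        else:
            # Estimated reward rate, starting at 0.5 for untried options.
            choice = max(range(n),
                         key=lambda i: (wins[i] + 1) / (pulls[i] + 2))
        reward = pull(probs, choice, rng)
        wins[choice] += reward
        pulls[choice] += 1
        total += reward
    return total


STRATEGIES = {
    "random": random_baseline,
    "sample_then_greedy": sample_then_greedy,
    "epsilon_greedy": epsilon_greedy,
    "oracle": oracle_baseline,
}


def benchmark(strategies, probs, n_trials, n_sims, seed=1):
    """Mean total reward of each strategy over n_sims simulated scenarios."""
    results = {}
    for name, strategy in strategies.items():
        rng = random.Random(seed)
        totals = [strategy(probs, n_trials, rng) for _ in range(n_sims)]
        results[name] = statistics.mean(totals)
    return results


if __name__ == "__main__":
    for name, mean in benchmark(STRATEGIES, [0.2, 0.8], 100, 1000).items():
        print(f"{name:20s} {mean:6.2f}")
```

The two baselines bracket the range of achievable performance, so that the results of the other strategies can be interpreted relative to this range rather than in absolute terms. Any conclusions drawn from such a benchmark only hold for the environments and parameter values that were simulated.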

### References

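Gigerenzer, G., & Gaissmaier, W. (2011). Heuristic decision making. *Annual Review of Psychology*, *62*(1), 451–482. https://doi.org/10.1146/annurev-psych-120709-145346

Kuleshov, V., & Precup, D. (2014). Algorithms for multi-armed bandit problems. *arXiv Preprint arXiv:1402.6028*. http://arxiv.org/abs/1402.6028

Neth, H., & Gigerenzer, G. (2015). Heuristics: Tools for an uncertain world. In R. Scott & S. Kosslyn (Eds.), *Emerging trends in the social and behavioral sciences*. Wiley Online Library. https://doi.org/10.1002/9781118900772.etrds0394

Neth, H., Sims, C. R., & Gray, W. D. (2016). Rational task analysis: A methodology to benchmark bounded rationality. *Minds and Machines*, *26*(1-2), 125–148. https://doi.org/10.1007/s11023-015-9368-8

Page, S. E. (2018). *The model thinker: What you need to know to make data work for you*. Basic Books.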