## 16.4 Evaluating dynamic models

Models of agents and environments are often created together and can be hard to distinguish from each other. Nevertheless, both need to be evaluated. This initially includes ensuring that the agent and environment function as intended, but also evaluating their performance and the consequences of their interaction. This section illustrates some methods and possible criteria for doing so.

### 16.4.1 Heuristics vs. optimal strategies

Our simulation so far paired a learning agent with a simple MAB problem. However, we can imagine and implement many alternative agent strategies. Typical questions raised in such contexts include:

What is the performance of Strategy X?

Is Strategy X better or worse than Strategy Y?

While all MAB settings invite strategies that balance exploration with exploitation, we can distinguish between two general approaches towards creating and evaluating strategies:

*Heuristic* approaches create and evaluate strategies that are simple enough to be empirically plausible. *Heuristics* can be defined as adaptive strategies that ignore information to make efficient, accurate, and robust decisions under conditions of uncertainty (see Gigerenzer & Gaissmaier, 2011; Neth & Gigerenzer, 2015, for details). As many researchers reflexively associate heuristics with inferior performance, we emphasize that simple strategies are not necessarily worse than computationally more expensive ones. The hallmark of heuristics is that they do not aim for optimality, but for simplicity by ignoring some information. Whether they turn out to be worse or better than alternative strategies is an empirical question and mostly depends on the criteria employed.

An example of a heuristic in a MAB setting with stochastic options is *sample-then-greedy*.
This heuristic explores each option for some number of trials \(s\) before exploiting the seemingly better one for the remaining trials.
Clearly, the success of this heuristic varies as a function of \(s\):
If \(s\) is too small, an agent may not be able to discriminate successfully between options and risks exploiting an inferior option. By contrast, larger values of \(s\) reduce the uncertainty about the options’ estimated rewards, but risk wasting too many trials on exploration.
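
A minimal sketch of this heuristic for Bernoulli options might look as follows. All function and variable names are our own assumptions; the agent only observes sampled rewards, never the true probabilities:

```python
import random

def sample_then_greedy(probs, s, T, rng=random):
    """Play a Bernoulli MAB: sample each option s times, then exploit.

    probs: true reward probabilities (unknown to the agent).
    s:     number of exploration trials per option.
    T:     total number of trials.
    Returns the total reward collected.
    """
    k = len(probs)
    assert s * k <= T, "exploration must fit into the horizon"
    counts = [0] * k
    sums = [0] * k
    total = 0
    # Exploration phase: sample each option s times in turn.
    for i in range(k):
        for _ in range(s):
            r = 1 if rng.random() < probs[i] else 0
            counts[i] += 1
            sums[i] += r
            total += r
    # Exploitation phase: commit to the option with the best sample mean.
    best = max(range(k), key=lambda i: sums[i] / counts[i])
    for _ in range(T - s * k):
        total += 1 if rng.random() < probs[best] else 0
    return total
```

Running this sketch for a range of \(s\) values makes the trade-off described above directly visible in the resulting reward totals.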

The same considerations show that the performance of any strategy depends on additional assumptions regarding the nature of the task environment. For instance, estimating the characteristics of an option by initially sampling it presupposes that options remain stable for the duration of a scenario.

*Optimality* approaches create and evaluate strategies that maximize some performance criterion. Typically, total reward is maximized at the expense of computational effort, but when there is a fixed reward it is also common to minimize the time needed to reach some goal. An extensive scientific literature is concerned with the discovery and verification of optimal MAB strategies. Most of these approaches strive for optimization under constraints, which renders the optimization problem even harder.

An example of an optimality approach towards MABs is the computation of the so-called *Gittins index* for each option. This index essentially incorporates everything that is known so far and computes the current value of each option, given that only optimal actions are chosen on all remaining trials (see Page, 2018, p. 322ff., for an example).
On every trial, the option with the highest Gittins index is the optimal choice. As this method is conceptually simple but computationally expensive, it is unlikely that organisms solve MAB problems in precisely this way, though they may approximate its performance by other mechanisms.
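
The exact Gittins index requires solving a dynamic program. As a rough illustration of why this is computationally expensive, the following sketch approximates the index of a Bernoulli option with a Beta posterior by a finite-horizon "retirement" recursion and a binary search for the indifference point. The function name, horizon cut-off, and tolerance are our own assumptions, not a standard implementation:

```python
from functools import lru_cache

def gittins_approx(a, b, gamma=0.9, horizon=50, tol=1e-4):
    """Approximate the Gittins index of a Bernoulli option with a
    Beta(a, b) posterior via a finite-horizon retirement recursion.

    The index is the known payoff rate lam at which an agent is
    indifferent between retiring on lam forever and playing the
    uncertain option (acting optimally thereafter).
    """
    @lru_cache(maxsize=None)
    def value(a, b, h, lam):
        if h == 0:
            return 0.0
        # Option 1: retire and collect lam on all h remaining trials.
        retire = lam * (1 - gamma**h) / (1 - gamma)
        # Option 2: play once, update the Beta posterior, continue optimally.
        mu = a / (a + b)  # posterior mean of the success probability
        play = (mu * (1 + gamma * value(a + 1, b, h - 1, lam))
                + (1 - mu) * gamma * value(a, b + 1, h - 1, lam))
        return max(retire, play)

    lo, hi = 0.0, 1.0
    while hi - lo > tol:  # binary search for the indifference point
        lam = (lo + hi) / 2
        retire = lam * (1 - gamma**horizon) / (1 - gamma)
        if value(a, b, horizon, lam) > retire + 1e-12:
            lo = lam  # playing still beats retiring: index is higher
        else:
            hi = lam
    return (lo + hi) / 2
```

Note that an uncertain option (e.g., a Beta(1, 1) prior) receives a higher index than an option with the same mean but less uncertainty (e.g., Beta(10, 10)): the index quantifies the value of exploration.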

In a *Bayesian* agent framework, an agent has prior beliefs about the distribution of rewards of each option and adjusts these beliefs based on its experience in an optimal fashion. Interestingly, however, incorporating experienced rewards in an optimal (or rational) fashion does not yet yield an optimal strategy for choosing actions.
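
One well-known way of turning such Bayesian beliefs into choices is *Thompson sampling*, shown here as a minimal sketch (function names are our own) for Bernoulli options with Beta(1, 1) priors:

```python
import random

def thompson_sampling(probs, T, rng=random):
    """Bayesian Bernoulli bandit with a Beta(1, 1) prior per option.

    On each trial, draw one sample from each posterior and play the
    option whose sample is largest; then update that option's Beta
    posterior with the observed reward.
    """
    k = len(probs)
    alpha = [1] * k  # 1 + observed successes per option
    beta = [1] * k   # 1 + observed failures per option
    total = 0
    for _ in range(T):
        draws = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        choice = max(range(k), key=lambda i: draws[i])
        reward = 1 if rng.random() < probs[choice] else 0
        alpha[choice] += reward      # exact Bayesian posterior update
        beta[choice] += 1 - reward
        total += reward
    return total
```

The belief update here is exact Bayesian inference, but the probability-matching choice rule is itself a heuristic, illustrating the point above: optimal updating does not automatically settle how to act.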

Comparisons between a range of strategies may yield surprising results. For instance, algorithms that perform well on some problems may perform poorly on others. And simple heuristics often outperform theoretically superior algorithms by a substantial margin (Kuleshov & Precup, 2014).

In practice, we typically need a third category of strategies to bracket the range of possible performance: *baselines*.
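
Such baselines can be sketched as follows (a minimal illustration with hypothetical names): random choice brackets performance from below, and an omniscient agent that always knows the best option brackets it from above:

```python
import random

def random_baseline(probs, T, rng=random):
    """Lower benchmark: pick an option uniformly at random on every trial."""
    total = 0
    for _ in range(T):
        p = rng.choice(probs)  # every option is equally likely
        total += rng.random() < p
    return total

def omniscient_baseline(probs, T, rng=random):
    """Upper benchmark: always play the objectively best option
    (information no learning agent could have on trial 1)."""
    best = max(probs)
    return sum(rng.random() < best for _ in range(T))
```

Any learning strategy worth considering should land between these two totals; where it lands quantifies how much of the achievable improvement over chance it realizes.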

This creates a need for running competitions between strategies (see the methodology of evaluating models by benchmarking and rational task analysis below).

### 16.4.2 Benchmarking strategy performance

Beware of a common fallacy: When evaluating some strategy, researchers often note a deviation between observed performance and some normative model and then jump to premature conclusions (typically by diagnosing the “irrationality” of agents with respect to some task). A typical example of a premature conclusion in the context of a learning task would be to quantify an agent’s total reward \(R_T\) in an environment and contrast it with the maximum possible reward \(R_{max}\) that the environment could have provided. When \(R_T < R_{max}\), a researcher could diagnose “insufficient learning” or “suboptimal performance.”

However, this conclusion falls prey to the experimenter’s fallacy of assuming that the agent’s environment corresponds to the experimenter’s environment. In reality, the agent views the environment from a different and more limited perspective: Rather than knowing all options and the range of their possible rewards, a learning agent needs to explore and estimate options in a trial-and-error fashion. Thus, an environment that presents decisions under risk from the experimenter’s perspective is typically an uncertain environment from the agent’s perspective. This is more than a terminological point: the difference can have substantial consequences for the evaluation of performance and for implications regarding the agent’s (ir-)rationality (see Sims et al., 2013, for a prominent example).

To vaccinate researchers against drawing premature conclusions regarding the (ir-)rationality of agents, Neth et al. (2016) propose a methodology and perspective called *rational task analysis* (RTA).
RTA is anchored in the notion of *bounded rationality* (Simon, 1956) and aims for an unbiased interpretation of results and the design of more conclusive experimental paradigms.
In contrast to the much more ambitious endeavor of *rational analysis* (Anderson, 1990), RTA focuses on concrete tasks as the primary interface between agents and environments.
By providing guidelines for evaluating performance under conditions of bounded rationality, RTA requires explicating essential task elements, specifying rational norms, and bracketing the range of possible performance, before contrasting various benchmarks with actual performance.
The recommendations of RTA are summarized in Figure 16.9:

Although the six steps of RTA look like a generic recipe, our evaluations always need to be adapted to the conditions and constraints of specific tasks. However, rather than just comparing some strategy against some normative model, we generally should compare a range of different strategies.
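
Such a comparison of strategies can be sketched as a small tournament in which every candidate faces the same set of randomly drawn environments. This is a minimal illustration; all names and parameter values are our own assumptions:

```python
import random
import statistics

def pick_randomly(probs, T, rng):
    """Baseline competitor: choose an option at random on every trial."""
    total = 0
    for _ in range(T):
        p = rng.choice(probs)
        total += rng.random() < p
    return total

def explore_then_exploit(probs, T, rng, s=5):
    """Minimal sample-then-greedy competitor: sample each option s times,
    then commit to the option with the best sample mean."""
    k = len(probs)
    means = []
    total = 0
    for p in probs:
        wins = sum(rng.random() < p for _ in range(s))
        total += wins
        means.append(wins / s)
    best = probs[max(range(k), key=lambda i: means[i])]
    total += sum(rng.random() < best for _ in range(T - s * k))
    return total

def run_tournament(strategies, n_envs=200, k=2, T=100, seed=1):
    """Let every strategy face the same randomly drawn bandit problems
    and return each strategy's mean total reward."""
    master = random.Random(seed)
    envs = [[master.random() for _ in range(k)] for _ in range(n_envs)]
    return {name: statistics.mean(
                strategy(probs, T, random.Random(1000 * seed + i))
                for i, probs in enumerate(envs))
            for name, strategy in strategies.items()}

scores = run_tournament({"random": pick_randomly,
                         "sample-then-greedy": explore_then_exploit})
```

Because every strategy sees identical environments and seeds, differences in mean reward reflect the strategies themselves rather than lucky draws, which is exactly the kind of controlled comparison that benchmarking requires.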

### References

Anderson, J. R. (1990). *The adaptive character of thought*. Lawrence Erlbaum.

Gigerenzer, G., & Gaissmaier, W. (2011). Heuristic decision making. *Annual Review of Psychology*, *62*(1), 451–482. https://doi.org/10.1146/annurev-psych-120709-145346

Kuleshov, V., & Precup, D. (2014). Algorithms for multi-armed bandit problems. *arXiv Preprint arXiv:1402.6028*. http://arxiv.org/abs/1402.6028

Neth, H., & Gigerenzer, G. (2015). Heuristics: Tools for an uncertain world. In *Emerging trends in the social and behavioral sciences*. Wiley Online Library. https://doi.org/10.1002/9781118900772.etrds0394

Neth, H., Sims, C. R., & Gray, W. D. (2016). Rational task analysis: A methodology to benchmark bounded rationality. *Minds and Machines*, *26*(1-2), 125–148. https://doi.org/10.1007/s11023-015-9368-8

Page, S. E. (2018). *The model thinker: What you need to know to make data work for you*. Basic Books.

Simon, H. A. (1956). Rational choice and the structure of the environment. *Psychological Review*, *63*(2), 129–138. https://doi.org/10.1037/h0042769

Sims, C. R., Neth, H., Jacobs, R. A., & Gray, W. D. (2013). Melioration as rational choice: Sequential decision making in uncertain environments. *Psychological Review*, *120*(1), 139–154. https://doi.org/10.1037/a0030850