16.5 Conclusion

This chapter made our simulations more dynamic by allowing for incremental changes in agents and environments. This introduced two key topics commonly addressed by computational models: adaptive agents (i.e., learning) and choice behavior in risky environments (in the form of multi-armed bandits). Additionally, the need to evaluate a variety of strategies raised issues regarding the performance of heuristics relative to baseline and optimal performance.

16.5.1 Summary

Most dynamic simulations distinguish between some agent and some environment and involve step-wise changes of internal states, environmental parameters, or their interactions (typically described as behavior). We distinguished between various locations and types of changes:

  1. The phenomenon of learning involved changing agents that adapt their internal states (e.g., beliefs, preferences, and actions) to an environment.

  2. The MAB paradigm allowed for changes in environments whose properties are initially unknown (with risky or uncertain options).

  3. The interaction between agents and environments raised questions regarding the performance of specific strategies (e.g., models of different learning mechanisms and heuristics). Performance should always be evaluated relative to sound benchmarks (ideally of both baseline and optimal performance).
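The benchmarking idea in the last point can be sketched as follows. This is a minimal illustration in Python rather than the chapter's own code: the function names (`run_strategy`, `random_choice`, `epsilon_greedy`), the reward probabilities, and the value of `epsilon` are all assumptions chosen for the example. A random strategy serves as the baseline, and the expected reward of always choosing the best option serves as the optimal benchmark.

```python
import random

def run_strategy(choose, probs, n_steps, rng):
    """Play a Bernoulli bandit for n_steps and return the total reward."""
    counts = [0] * len(probs)  # how often each option was chosen
    wins = [0] * len(probs)    # how many rewards each option yielded
    total = 0
    for _ in range(n_steps):
        arm = choose(counts, wins, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        wins[arm] += reward
        total += reward
    return total

def random_choice(counts, wins, rng):
    """Baseline: ignore all feedback and pick any option at random."""
    return rng.randrange(len(counts))

def epsilon_greedy(counts, wins, rng, epsilon=0.1):
    """Heuristic: mostly exploit the best observed option, sometimes explore."""
    if sum(counts) == 0 or rng.random() < epsilon:
        return rng.randrange(len(counts))
    # Options that were never tried get an observed mean of 0.
    means = [w / c if c > 0 else 0.0 for w, c in zip(wins, counts)]
    return means.index(max(means))

# Hypothetical environment and simulation settings (assumptions):
probs = [0.2, 0.5, 0.8]
n_steps = 1000
n_sims = 100
rng = random.Random(42)

def average_reward(choose):
    """Mean total reward of a strategy over n_sims simulated agents."""
    return sum(run_strategy(choose, probs, n_steps, rng) for _ in range(n_sims)) / n_sims

baseline = average_reward(random_choice)
heuristic = average_reward(epsilon_greedy)
optimal = max(probs) * n_steps  # expected reward of always choosing the best option

print(f"baseline: {baseline:.1f}, heuristic: {heuristic:.1f}, optimal: {optimal:.1f}")
```

Reporting a heuristic's reward between these two anchors shows how much of the achievable improvement over chance it actually realizes.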

Learning agents that interact with dynamic, risky and uncertain environments are a key topic in the emerging fields of artificial intelligence (AI) and machine learning (ML). Although we only took a peek at the surface of these models, we can note some regularities in the representations used for modeling the interactions between agents and environments:

  • time as discrete steps, during which the states of agents and environments can change

  • agent beliefs and representations of their environments as probability distributions

  • environments as a range of options (yielding rewards or utility values) that can be stable, risky, or uncertain, and deterministic or probabilistic
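These three representational choices can be combined in a short sketch. The following Python code is only one possible illustration and not the chapter's implementation: the class names are hypothetical, beliefs are assumed to be Beta distributions over each option's reward probability, and choices are made by sampling from these beliefs (a technique known as Thompson sampling).

```python
import random

class BernoulliEnvironment:
    """Environment as a range of options with fixed reward probabilities."""
    def __init__(self, probs, rng):
        self.probs = probs
        self.rng = rng

    def reward(self, option):
        return 1 if self.rng.random() < self.probs[option] else 0

class BetaAgent:
    """Agent whose belief about each option is a Beta(alpha, beta) distribution."""
    def __init__(self, n_options, rng):
        self.alpha = [1] * n_options  # 1 + number of observed rewards
        self.beta = [1] * n_options   # 1 + number of observed non-rewards
        self.rng = rng

    def choose(self):
        # Draw one plausible reward probability per option and pick the largest.
        samples = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        return samples.index(max(samples))

    def update(self, option, reward):
        self.alpha[option] += reward
        self.beta[option] += 1 - reward

    def belief_means(self):
        return [a / (a + b) for a, b in zip(self.alpha, self.beta)]

rng = random.Random(1)
env = BernoulliEnvironment([0.3, 0.7], rng)  # hypothetical reward probabilities
agent = BetaAgent(2, rng)

n_steps = 500
for step in range(n_steps):  # time as discrete steps
    option = agent.choose()
    reward = env.reward(option)
    agent.update(option, reward)  # the agent's internal state changes

means = agent.belief_means()
n_chosen = [a + b - 2 for a, b in zip(agent.alpha, agent.beta)]
print(means, n_chosen)
```

Keeping the agent and the environment in separate objects makes it easy to swap in a different learning rule or a different reward structure without touching the simulation loop.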

This is progress, but still subject to many limitations. Two main constraints so far were that the nature or number of options did not change and that we only considered an individual agent.

Additional sources of variability

Adding uncertainty to payoff distributions is only a small step towards more dynamic and realistic environments. We can identify several sources of additional variability (i.e., both ignorance and uncertainty):

  1. Reactive and restless bandits:
    How do environments respond to and interact with agents? Many environments deplete as they are exploited (e.g., patches of food or fish in the sea), but some also grow or improve (e.g., acquiring expertise, playing an instrument, practicing some sport).
    Beyond providing rewards based on a stable probability distribution, environments may alter the number of options (i.e., additional or disappearing options), the reward types or magnitudes associated with options, or the distributions with which options provide rewards. All these changes can occur continuously (e.g., based on some internal mechanism) or suddenly (e.g., by some disruption).

  2. Multiple agents:
    Allowing for multiple agents in an environment changes an individual’s game against nature by adding a social game (see the chapter on Social situations). Although this adds many new opportunities for interaction (e.g., social influence and social learning) and even blurs the conceptual distinction between agents and environments, it is not immediately clear whether adding potential layers of complexity necessarily requires more complex agents or simulation models (see Section 17.1.1).

Some directions for possible extensions include:

  • More dynamic and responsive environments (restless bandits):

    • MAB with a fixed set of options that are changing (e.g., depleting or improving)

    • Environments with additional or disappearing options

  • Social settings: Multiple agents in the same environment (e.g., games, social learning).
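As a rough illustration of the first extension, the following Python sketch shows a reactive environment with a fixed set of options that deplete when chosen and slowly recover otherwise. Everything here is an assumption made for the example: the class name `DepletingBandit`, the depletion factor, the recovery rate, and the agent's constant step size, which weights recent rewards more heavily so that its estimates can track a changing environment.

```python
import random

class DepletingBandit:
    """Hypothetical reactive environment: chosen options deplete, others recover."""
    def __init__(self, probs, depletion=0.95, recovery=0.01, rng=None):
        self.initial = list(probs)  # upper limit for each option
        self.probs = list(probs)    # current reward probabilities
        self.depletion = depletion
        self.recovery = recovery
        self.rng = rng or random.Random()

    def step(self, option):
        reward = 1 if self.rng.random() < self.probs[option] else 0
        for i in range(len(self.probs)):
            if i == option:
                self.probs[i] *= self.depletion  # exploited option gets worse
            else:
                # Unused options regenerate, but never beyond their initial value.
                self.probs[i] = min(self.initial[i], self.probs[i] + self.recovery)
        return reward

def run_tracking_agent(env, n_steps, step_size=0.2, epsilon=0.1, rng=None):
    """Epsilon-greedy agent with recency-weighted reward estimates."""
    rng = rng or random.Random()
    n = len(env.probs)
    estimates = [0.5] * n
    choices = [0] * n
    total = 0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            option = rng.randrange(n)
        else:
            option = estimates.index(max(estimates))
        reward = env.step(option)
        # A constant step size lets old observations fade out over time.
        estimates[option] += step_size * (reward - estimates[option])
        choices[option] += 1
        total += reward
    return total, choices

rng = random.Random(7)
env = DepletingBandit([0.8, 0.6, 0.4], rng=rng)
total, choices = run_tracking_agent(env, 500, rng=rng)
print(total, choices)
```

In such an environment, no single option remains best for long, so a strategy that settles permanently on one option would be expected to do worse than one that keeps switching.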

Note that even more complex versions of dynamic simulations typically assume well-defined agents, environments, and interactions. This changes when AI systems are moved from small and closed worlds (e.g., with a limited number of risky options) into larger and more open worlds (e.g., with a variable number of options and uncertain payoffs). Overall, modeling a video game or self-driving car is quite a bit more challenging than a multi-armed bandit.

Fortunately, we may not always require models of optimal behavior for solving problems in real-world environments. Simon (1956) argues that an agent’s ability to satisfice (i.e., meeting some aspiration level) allows for simpler strategies that side-step most of the complexity faced when striving for optimal solutions.
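One way to read satisficing in a bandit setting is sketched below in Python. This is an illustrative interpretation and not Simon's own model: the function name `satisfice`, the `min_trials` parameter, and the rule of moving on to the next option in a fixed order are assumptions. The agent stays with its current option as long as the observed reward rate meets its aspiration level, and otherwise tries the next option.

```python
import random

def satisfice(probs, aspiration, n_steps, min_trials=10, rng=None):
    """Stay with an option while its observed reward rate meets the aspiration level."""
    rng = rng or random.Random()
    current = 0          # index of the option currently used
    trials = wins = 0    # experience with the current option
    total = 0
    history = []
    for _ in range(n_steps):
        reward = 1 if rng.random() < probs[current] else 0
        trials += 1
        wins += reward
        total += reward
        history.append(current)
        # Only judge an option after sampling it at least min_trials times.
        if trials >= min_trials and wins / trials < aspiration:
            current = (current + 1) % len(probs)  # move on to the next option
            trials = wins = 0
    return total, history

# Hypothetical options and aspiration level (assumptions):
total, history = satisfice([0.3, 0.5, 0.7], aspiration=0.6, n_steps=1000,
                           rng=random.Random(3))
print(total, history[-1])
```

Note that this agent never compares options or estimates which one is best. It only needs to remember its recent experience with a single option, which is far less demanding than computing an optimal policy.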

Another important insight by Herbert Simon serves as a caveat that becomes even more important when moving from small worlds with fixed and stable options into larger and uncertain worlds:

An ant, viewed as a behaving system, is quite simple.
The apparent complexity of its behavior over time is largely a reflection
of the complexity of the environment in which it finds itself.

(Simon, 1996, p. 52)

Essentially, the complex challenges posed by real-world problems do not necessarily call for complex explanations, but they require that we are modeling the right entity. Whereas researchers in the behavioral and social sciences primarily aim to describe, predict, and understand the development and behavior of organisms, they should also study the structure of the environments in which behavior unfolds. Thus, successful models of organisms will always require valid models of environments.

16.5.2 Resources

This chapter only introduced the general principles of learning and multi-armed bandit (MAB) simulations. Here are some pointers to sources of much more comprehensive and detailed treatments:

Reinforcement learning

Multi-armed bandits (MABs)

  • The article by Kuleshov & Precup (2014) provides a comparison of essential MAB algorithms and shows that simple heuristics can outperform more elaborate strategies. It also applies the MAB framework to the allocation of clinical trials.

  • The book by Lattimore & Szepesvári (2020) provides abundant information about bandits and is available at https://tor-lattimore.com/downloads/book/book.pdf. The blog Bandit Algorithms provides related information in smaller units.


Kuleshov, V., & Precup, D. (2014). Algorithms for multi-armed bandit problems. arXiv Preprint arXiv:1402.6028. http://arxiv.org/abs/1402.6028
Lattimore, T., & Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press. https://tor-lattimore.com/downloads/book/book.pdf
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 103535. https://doi.org/10.1016/j.artint.2021.103535
Simon, H. A. (1956). Rational choice and the structure of the environment. Psychological Review, 63(2), 129–138. https://doi.org/10.1037/h0042769
Simon, H. A. (1996). The sciences of the artificial (3rd ed.). The MIT Press.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). The MIT Press. http://incompleteideas.net/book/the-book.html
Szepesvári, C. (2010). Algorithms for reinforcement learning (Vol. 4). Morgan & Claypool. https://sites.ualberta.ca/~szepesva/rlbook.html