16.1 Introduction

Imagine a simple-minded organism that faces an unknown environment. The organism has a single need (e.g., find food) and a limited repertoire for action (e.g., move, eat, rest). To fulfill its solitary need, it navigates a barren and rugged surface that occasionally yields a meal, and may face a series of challenges (e.g., obstacles, weather conditions, and various potential dangers). Which cognitive and perceptual-motor capacities does this organism require to reach its goal and sustain itself? And which other aspects of the environment does its success depend on?

Figure 16.1: An ant foraging for food in an environment.

This example is modeled on a basic foraging scenario discussed by Simon (1956), and is reflected in many similar models that study how some organism (e.g., a simple robot) explores and exploits an initially unknown environment (e.g., a grid-world offering both rewards and risks).

This may be a toy scenario, but it allows us to distill some key elements that are required for addressing basic questions of rationality. For instance, the scenario is based on a conceptual distinction between an organism and its environment. As the organism can be simple or complex (think of an ant, a human being, or an entire organization), we will refer to it as an agent. Given some goal, the agent faces the problem of “behaving approximately rationally, or adaptively, in a particular environment” (Simon, 1956, p. 130).

Note that the conceptual distinction between an agent and its environment is helpful, but not as simple and straightforward as it seems. For instance, are memories of past events and cues or traces left by earlier explorations part of the agent or of its environment? Researchers from artificial intelligence and the fields of embodied and distributed cognition would argue that the boundaries are blurred. From the perspective of any individual agent, other agents are part of the environment. But from their perspectives, the category labels are reversed.

Agents face two key challenges along elementary dimensions: Things can be unknown or uncertain. When things are unknown, we can discover them by exploring or familiarizing ourselves with them. Doing so assumes some flexibility on the part of the agent (typically described as acquiring knowledge or learning). However, when things are uncertain (e.g., offer rewards in a probabilistic fashion), even intimate familiarity does not provide certainty for any particular moment. Thus, both ignorance and uncertainty are problematic, but uncertainty is the more obstinate problem.

Importantly, both problems are inevitable when things are dynamic (i.e., changing). The fact that changes tend to create both ignorance and uncertainty makes it essential that our models can accommodate changes. However, changes are not just a source of problems, but also part of the solution: When aspects of our environment change, processes of adaptation and learning allow us to adjust again. (Note that this may sound clever and profound, but is really quite simple: When things change, we need to change as well — and the reverse is also true.)

Simon (1956) addresses the distinction between optimizing and satisficing: Organisms typically adapt well enough to meet their goals, but often fail to optimize.

To clarify key terminology, we can ask: What changes when things are dynamic?

  • agents (goals, knowledge and beliefs, policies/strategies, capacities for action)

  • environments (tasks, rewards, temporal and spatial structures)

  • interactions between agents and environments

16.1.1 What is dynamic?

The term dynamics typically implies movement or changes.

By contrast, our simulations so far were static in two ways:

  • the environment was fixed (even when being probabilistic): One-shot games, without dependencies between subsequent states.

  • the range of actions was fixed a priori and the behavioral policy of agents did not change

Three distinct complicating aspects:

  1. Environmental dependencies: Later states in the environment depend on earlier ones.

  2. Learning: Behavior of agents (i.e., their repertoire or selection of actions) changes as a function of experience or time.

  3. Interaction with agents: Environment or outcomes change as a function of (inter-)actions with agents.

Different types of dependencies (non-exhaustive):

  • temporal dependencies: Later states depend on previous ones

  • interactions: Interdependence between environment and an agent’s actions.

  • types of learning: agents can change their behavioral repertoire (actions available) or policies (selection of actions) over time

If the environment changes as a function of its prior states and/or the actions of some agent(s), it cannot be simulated a priori and independently of them (i.e., without considering actions).
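To illustrate why such an environment cannot be generated in advance, here is a minimal sketch in Python. The language, the crop rules, and all names (`step`, `simulate`, `soil`) are illustrative assumptions and not part of the chapter's models. The field "remembers" earlier actions through its soil fertility, so the reward of an action can only be computed once the preceding actions are known:

```python
def step(soil, action):
    """Return (reward, new_soil) for planting `action` on a field with fertility `soil`.

    Hypothetical rules: corn pays more but depletes the soil,
    wheat pays less and lets the soil recover.
    """
    if action == "corn":
        reward = 10 * soil
        new_soil = max(0.0, soil - 0.5)
    elif action == "wheat":
        reward = 4 * soil
        new_soil = min(1.0, soil + 0.25)
    else:
        raise ValueError(f"unknown action: {action}")
    return reward, new_soil


def simulate(actions, soil=1.0):
    """Play a sequence of actions and return the list of rewards."""
    rewards = []
    for action in actions:
        # The new state depends on the old state and on the chosen action.
        reward, soil = step(soil, action)
        rewards.append(reward)
    return rewards
```

For example, `simulate(["corn", "corn", "corn"])` returns `[10.0, 5.0, 0.0]`, whereas `simulate(["corn", "wheat", "corn"])` returns `[10.0, 2.0, 7.5]`: the reward for planting corn on the third trial depends on what was planted before.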

Typical programming elements:

  • Trial-by-trial structure of simulations (i.e., using loops or tables that are filled incrementally). As agents or environments change over time, we need to explicitly represent time and typically proceed in a step-wise (i.e., iterative) fashion.
    For instance, an agent may interact with an environment over many decision cycles: In each cycle, it acts, observes an outcome, and adjusts its expectations or actions accordingly. Time is typically represented as individual steps \(t\) (a.k.a. periods, rounds, or trials) that range from some starting value (e.g., 0 or 1) to some large number (e.g., \(T = 100\)). Often, we will even need inner loops within outer loops (e.g., for running several repetitions of a simulation, to assess the robustness of results).

  • More abstract representations: For instance, rather than explicitly representing each option in an environment, we could simply use a vector of integers to represent a series of chosen options (i.e., leave the state of the environment implicit). Similarly, an agent’s beliefs regarding the reward quality of options may be represented as a vector of their choice probabilities.


  • Environments with some form of “memory” (e.g., planting corn or wheat, changes in crop cycle)

  • Interactions: Environments that change as a result of an agent’s actions:

    • TRACS (environmental options change as a function of the agent’s actions) vs.
    • Tardast (dynamic changes over time and as a function of the agent’s actions)
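The first two programming elements can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the chapter's implementation: the function names, the parameters `alpha` and `epsilon`, and the specific update and choice rules are hypothetical. The agent's beliefs are stored as a vector of expected rewards (rather than choice probabilities), its history as a vector of integers, and an outer loop repeats the inner trial-by-trial loop:

```python
import random


def run_simulation(p_reward, n_trials=100, alpha=0.1, epsilon=0.1, rng=None):
    """Run one simulation of a simple learning agent on a fixed set of options.

    p_reward: reward probability of each option (the environment).
    Returns the vector of chosen options and the agent's final expectations.
    """
    if rng is None:
        rng = random.Random()
    n_options = len(p_reward)
    expectations = [0.5] * n_options   # agent's initial beliefs
    choices = []                       # integer vector of chosen options

    for t in range(n_trials):          # explicit time steps t = 0, ..., T - 1
        # Choose: explore at random with probability epsilon, else exploit.
        if rng.random() < epsilon:
            choice = rng.randrange(n_options)
        else:
            choice = max(range(n_options), key=lambda i: expectations[i])
        # Observe a probabilistic reward (1 or 0).
        reward = 1 if rng.random() < p_reward[choice] else 0
        # Adjust the expectation of the chosen option towards the outcome.
        expectations[choice] += alpha * (reward - expectations[choice])
        choices.append(choice)

    return choices, expectations


def run_repetitions(p_reward, n_reps=50, n_trials=100, seed=1):
    """Outer loop: repeat the simulation and return the mean share of best choices."""
    rng = random.Random(seed)
    best = max(range(len(p_reward)), key=lambda i: p_reward[i])
    shares = []
    for rep in range(n_reps):
        choices, _ = run_simulation(p_reward, n_trials=n_trials, rng=rng)
        shares.append(choices.count(best) / n_trials)
    return sum(shares) / n_reps
```

Calling `run_repetitions([0.2, 0.8])` returns the average proportion of trials on which the better option was chosen. Passing a seeded random number generator keeps repeated runs reproducible.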

16.1.2 Topics and models addressed

In this chapter, we provide a glimpse of some important families of models, known as learning agents and multi-armed bandits (MAB).

Typical topics addressed by these paradigms include learning by adjusting expectations to observed rewards and the trade-off between exploration and exploitation.

Typical questions asked in these contexts include: How to learn to distinguish better from worse options? How to optimally allocate scarce resources (e.g., attention, experience, or time)?

We can distinguish several levels of (potential) complexity:

  1. Dynamic agents: Constant environment, but an agent’s beliefs or repertoire for action change based on experience (i.e., a learning agent).

  2. Dynamic environments: Given some agent, the environment itself may change. Examples are MAB tasks with responsive or restless bandits (e.g., adding or removing options, changing rewards for options).

  3. Benchmarking the performance of different strategies: Evaluating the interactions between agents (with different strategies) and environments, where agent and/or environment depend on strategies and previous actions.
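As a rough illustration of levels 2 and 3, the following Python sketch benchmarks two strategies on a restless bandit whose reward probabilities drift on every trial. Everything here is an assumption made for illustration: the Gaussian drift, the strategy interface, and the two strategies (a random baseline and win-stay lose-shift, which is a swapped-in textbook heuristic and not a strategy named in this chapter).

```python
import random


def drift(p_reward, rng, sd=0.05):
    """Restless environment: nudge each reward probability, clipped to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.gauss(0, sd))) for p in p_reward]


def random_choice(last_choice, last_reward, n_options, rng):
    """Baseline strategy: ignore all feedback."""
    return rng.randrange(n_options)


def win_stay_lose_shift(last_choice, last_reward, n_options, rng):
    """Repeat a rewarded choice, otherwise switch to another option at random."""
    if last_choice is None:
        return rng.randrange(n_options)
    if last_reward == 1:
        return last_choice
    others = [i for i in range(n_options) if i != last_choice]
    return rng.choice(others) if others else last_choice


def benchmark(strategy, p_start, n_trials=200, n_reps=100, seed=42):
    """Mean reward per trial of `strategy`, averaged across repetitions."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_reps):                 # outer loop: repetitions
        p_reward = list(p_start)            # reset the environment
        choice, reward = None, None
        for _ in range(n_trials):           # inner loop: trials
            choice = strategy(choice, reward, len(p_reward), rng)
            reward = 1 if rng.random() < p_reward[choice] else 0
            total += reward
            p_reward = drift(p_reward, rng)  # environment changes over time
    return total / (n_reps * n_trials)
```

Comparing `benchmark(random_choice, [0.2, 0.8])` with `benchmark(win_stay_lose_shift, [0.2, 0.8])` then gives one mean reward per strategy. Because the environment changes while the agent acts, each strategy has to be evaluated by running it, not by consulting a fixed table of outcomes.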

A further level of complexity will be addressed in the next chapter on Social situations (see Chapter 17):

  4. Tasks involving multiple agents: Additional interactions between agents (e.g., opportunities for observing and influencing others, using social information, and social learning, including topics of communication, competition, and cooperation).

Goals of this chapter:

  • Implementing a basic learning model (RL)
  • A general framework for environments with uncertain rewards (MAB)
  • Benchmarking the performance of various strategies (RTA)