## 16.2 Models of agents, environments, and interactions

The key motivation for dynamic simulations is that things constantly change. Thus, our models need to allow for and capture change on a variety of levels. The term “things” is used in its widest possible sense (i.e., entities, including both objects and subjects) and covers environments, agents, representations of them, and their interactions.

Two general issues are addressed here. The first is separating different types of representations:

• Modeling an agent (e.g., internal states and overt actions)

• Modeling the environment (e.g., options, rewards, and their probabilities)

The second is a mundane, but essential aspect: adding data structures for keeping track of changes (in both agents and environments).

Note: If the term “dynamic” makes you think of video games and self-driving cars, the environments and problems considered here may seem slow and disappointing. However, we will quickly see that even relatively simple tasks present plenty of conceptual and computational challenges.

### 16.2.1 Learning

We first assume a stable environment and a dynamic agent. An agent’s adaptation to its environment is typically described as learning.

An elementary situation — known as binary forced choice — is the following: An agent faces a choice between several alternative options. For now, let’s assume that there are only two options and both options are deterministic and stable: Each option yields a constant reward by being chosen.

• How can the agent learn to choose the better option?

A general solution to the task of learning the best alternative is provided by the paradigm of reinforcement learning. Here, we can only illustrate its basic principles (following Page, 2018, p. 306f.).

#### Basic idea

Assume that an agent is facing a choice between $$N$$ options with rewards $$\pi(1) ... \pi(N)$$. The learner’s internal model or perspective on the world is represented as a set of weights $$w$$ that denote the expected value of or preference for each of the available options (i.e., $$w(1) ... w(N)$$). Given this representation, successful learning consists in adjusting the values of these weights to the (stable) rewards of the environment and choosing options accordingly.

Learning proceeds in a stepwise fashion: In each step, the learner acts (i.e., chooses an option), observes a reward, and then adjusts the weight of the chosen option (i.e., her internal expectations or preferences). Formally, each choice cycle is characterized by two components:

1. Choosing: The probability of choosing the $$k$$-th alternative is given by:

$P(k) = \frac{w(k)}{\sum_{i=1}^{N} w(i)}$

2. Learning: The learner adjusts the weight of the $$k$$-th alternative after choosing it and receiving its reward $$\pi(k)$$ by adding the following increment:

$\Delta w(k) = \alpha \cdot P(k) \cdot (\pi(k) - A)$

with two parameters describing characteristics of the learner: $$\alpha > 0$$ denoting the learning rate (aka. rate of adjustment or step size parameter) and $$A < \max_{k}\pi(k)$$ denoting the aspiration level.14

Note that the difference $$\pi(k) - A$$ compares the observed reward $$\pi(k)$$ to an aspiration level $$A$$ and can be characterized as a measure of surprise. Bigger rewards are more surprising and lead to larger adjustments. If a reward value exceeds $$A$$, $$w(k)$$ increases; if it is below $$A$$, $$w(k)$$ decreases.

For details and historical references, see the Wikipedia pages on the Rescorla-Wagner model, reinforcement learning, and Q-learning.
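To make the two equations concrete, a single choice-learning cycle can be computed by hand. The following sketch uses the same values as the setup in the next subsection (equal initial weights of 50, constant rewards of 10 and 20, $$\alpha = 1$$, and $$A = 5$$):

```r
# One choice-learning cycle, computed step by step:
w     <- c(50, 50)  # initial weights w(1), w(2)
rew   <- c(10, 20)  # constant rewards pi(1), pi(2)
alpha <- 1          # learning rate
A     <- 5          # aspiration level

P <- w / sum(w)     # 1. Choosing: P(1) = P(2) = 0.5
k <- 1              # suppose Option 1 is chosen

# 2. Learning: increment for the chosen option:
delta <- alpha * P[k] * (rew[k] - A)  # 1 * 0.5 * (10 - 5) = 2.5
w[k]  <- w[k] + delta                 # new weight: 52.5
```

Note that choosing the worse Option 1 still increases its weight here, as its reward (10) exceeds the aspiration level (5); the better option’s weight simply grows faster.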

#### Coding a model

To create a model that implements a basic learning agent that explores and adapts to a stable environment, we first define some objects and parameters:

# Initial setup (see Page, 2018, p. 308):

# Environment:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Agent:
alpha <- 1          # learning rate
A <- 5              # aspiration level
wgt <- c(50, 50)    # initial weights 

In the current model, the environmental options alt and their rewards rew are fixed and stable. Also, the learning rate alpha and the aspiration level A are assumed to be constants, but the weights wgt, which represent the agent’s beliefs or expectations about the value of environmental options, are parameters that may change after every cycle of choosing an action and observing a reward.

As the heart of our model, we translate components 1 (choosing) and 2 (learning) into R code:

# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. Learning:
delta_w <- function(k){  # Adjusting the weight of k:
  (alpha * p(k) * (r(k) - A))
}

Note that the additional function r(k) provides the reward obtained from choosing alternative k. Thus, r(k) is more a part of the environment than of the agent. Actually, it defines the interface between agent and environment. As the environmental rewards stored as rew are currently assumed to be constant, we could replace this function by writing rew[k] in the weight increment given by delta_w(k). However, as the r() function conceptually links the agent’s action (i.e., choosing option k) to the rewards provided by the environment rew and we will soon generalize this to stochastic environments (in Section 16.2.2), we already include this function here.15

With these basic functions in place, we define an iterative loop for 1:T time steps t (aka. choice-reward cycles, periods, or rounds) to simulate the iterative process. In each step, we use p() to choose an action and delta_w() to interpret and adjust the observed reward:

# Environment:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Agent:
alpha <- 1          # learning rate
A <- 5              # aspiration level
wgt <- c(50, 50)    # initial weights

# Simulation:
T <- 20  # time steps/cycles/rounds

for (t in 1:T){  # each step/trial: ----

  # (1) Use wgt to determine current action:
  cur_prob <- c(p(1), p(2))
  cur_act  <- sample(alt, size = 1, prob = cur_prob)
  ix_act   <- which(cur_act == alt)

  # (2) Update wgt (based on reward):
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
  wgt[ix_act] <- new_w  # update wgt

  # (+) User feedback:
  print(paste0(t, ": Choosing ", cur_act, " (", ix_act, "), and ",
               "learning wgt = ", paste(round(wgt, 0), collapse = ":"),
               " (A = ", A, ")"))
}

# Report result:
print(paste0("Final weights (after ", T, " steps): ",
             paste(round(wgt, 0), collapse = ":")))

Due to the use of sample() in selecting the current action cur_act (i.e., choosing either the 1st or 2nd option out of alt), running this code repeatedly will yield different results.

The repeated references to the weights wgt imply that this vector is a global variable that is initialized once and then accessed and changed at various points in our code (e.g., in p() and when updating the weights in the loop). If this becomes problematic, we could pass wgt as an argument to every function.
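A minimal sketch of this alternative (the function names p_fun() and update_wgt() are illustrative, not part of the model above) passes the weights in explicitly and returns the updated vector, since R functions cannot modify their arguments in place:

```r
# Sketch: explicit weight passing instead of a global wgt
# (p_fun and update_wgt are hypothetical names):
p_fun <- function(k, w){
  w[k] / sum(w)  # probability of choosing option k
}

update_wgt <- function(k, w, rew, alpha = 1, A = 5){
  w[k] <- w[k] + alpha * p_fun(k, w) * (rew[k] - A)
  w  # return the updated weight vector
}

wgt2 <- update_wgt(k = 2, w = c(50, 50), rew = c(10, 20))
# wgt2 is c(50.0, 57.5): only the chosen weight changed
```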

Running this model yields the following output:

#> [1] "1: Choosing A (1), and learning wgt = 52:50 (A = 5)"
#> [1] "2: Choosing A (1), and learning wgt = 55:50 (A = 5)"
#> [1] "3: Choosing B (2), and learning wgt = 55:57 (A = 5)"
#> [1] "4: Choosing B (2), and learning wgt = 55:65 (A = 5)"
#> [1] "5: Choosing B (2), and learning wgt = 55:73 (A = 5)"
#> [1] "6: Choosing A (1), and learning wgt = 57:73 (A = 5)"
#> [1] "7: Choosing B (2), and learning wgt = 57:81 (A = 5)"
#> [1] "8: Choosing B (2), and learning wgt = 57:90 (A = 5)"
#> [1] "9: Choosing B (2), and learning wgt = 57:99 (A = 5)"
#> [1] "10: Choosing B (2), and learning wgt = 57:109 (A = 5)"
#> [1] "11: Choosing A (1), and learning wgt = 59:109 (A = 5)"
#> [1] "12: Choosing B (2), and learning wgt = 59:119 (A = 5)"
#> [1] "13: Choosing A (1), and learning wgt = 61:119 (A = 5)"
#> [1] "14: Choosing B (2), and learning wgt = 61:128 (A = 5)"
#> [1] "15: Choosing B (2), and learning wgt = 61:139 (A = 5)"
#> [1] "16: Choosing B (2), and learning wgt = 61:149 (A = 5)"
#> [1] "17: Choosing A (1), and learning wgt = 62:149 (A = 5)"
#> [1] "18: Choosing B (2), and learning wgt = 62:160 (A = 5)"
#> [1] "19: Choosing B (2), and learning wgt = 62:170 (A = 5)"
#> [1] "20: Choosing B (2), and learning wgt = 62:181 (A = 5)"
#> [1] "Final weights (after 20 steps): 62:181"

The output is printed by the user feedback (i.e., the print() statements) that is provided at the end of each loop and after finishing its T cycles. Due to using sample() to determine the action (or option chosen) on each time step t, the results will differ every time the simulation is run.
Although we typically want to collect better records of the process, reading this output already hints at a successful reinforcement learning process: Due to the changing weights wgt, the better alternative (here: Option B) is increasingly preferred and chosen.

#### Practice

1. Which variable(s) allow us to see and evaluate that learning is taking place? What exactly do these represent (i.e., the agent’s internal state or overt behavior)? Why do we not monitor the values of cur_prob in each round?

2. Predict what happens when the reward value $$\pi(k)$$ equals the aspiration level $$A$$ (i.e., $$\pi(k) = A$$) or falls below it (i.e., $$\pi(k) < A$$). Test your predictions by running a corresponding model.

3. Playing with parameters:

• Set T and alpha to various values and observe how this affects the rate and results of learning.

• Set rew to alternative values and observe how this affects the rate and results of learning.

#### Keeping track

To evaluate simulations more systematically, we need to extend our model in two ways:

1. First, the code above only ran a single simulation of T steps. However, due to random fluctuations (e.g., in the results of the sample() function), we should not trust the results of any single simulation. Instead, we want to run and evaluate some larger number of simulations to get an idea about the variability and robustness of their results. This can easily be achieved by enclosing our entire simulation within another loop that runs several independent simulations.

2. Additionally, our initial simulation (above) used a print() statement to provide user feedback in each iteration of the loop. This allowed us to monitor the T time steps of our simulation and evaluate the learning process. However, displaying results as the model code is being executed becomes impractical when simulations get larger and run for longer periods of time. Thus, we need to collect required performance measures by setting up and using corresponding data structures as the simulation runs.

The following code takes care of both concerns: First, we embed our previous simulation within an outer loop for running a sequence of S simulations. Additionally, we prepare an auxiliary data frame data that allows recording the agent’s action and weights on each step (or choice-reward cycle) as we go along.

Note that working with two loops complicates setting up data (as it needs a total of S * T rows, as well as columns for the current values of s and t) and the indices for updating the current row of data. Also, using an outer loop that defines distinct simulations creates two possible levels for initializing information. In the present case, some information (e.g., the number of simulations, the number of steps per simulation, the available options, and their constant rewards) is initialized only once (globally). By contrast, we chose to initialize the parameters of the learner in every simulation (although we currently only update the weights wgt in each step).
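The row index used for updating data, ((s - 1) * T) + t, maps every pair of loop counters (s, t) to a unique row. A quick sanity check of this arithmetic:

```r
S <- 12; T <- 20  # as in the simulation
row_ix <- function(s, t){ ((s - 1) * T) + t }

# Compute the index of every (s, t) combination:
all_ix <- as.vector(outer(1:S, 1:T, row_ix))

# Sorted, the indices cover the rows 1:(S * T) exactly once:
all(sort(all_ix) == 1:(S * T))
```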

The resulting simulation code is as follows:

# Simulation:
S <- 12  # number of simulations
T <- 20  # time steps/cycles (per simulation)

# Environmental constants:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Prepare data structure for storing results:
data <- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = S * T))
names(data) <- c("s", "t", "act", paste0("w_", alt))

for (s in 1:S){     # each simulation: ----

  # Initialize agent:
  alpha <- 1        # learning rate
  A <- 5            # aspiration level
  wgt <- c(50, 50)  # initial weights

  for (t in 1:T){   # each step/trial: ----

    # (1) Use wgt to determine current action:
    cur_prob <- c(p(1), p(2))
    cur_act  <- sample(alt, size = 1, prob = cur_prob)
    ix_act   <- which(cur_act == alt)

    # (2) Update wgt (based on reward):
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w  # update wgt

    # (+) Record results:
    data[((s-1) * T) + t, ] <- c(s, t, ix_act, wgt)

  } # for t:T end.

  print(paste0("s = ", s, ": Ran ", T, " steps, wgt = ",
               paste(round(wgt, 0), collapse = ":")))

} # for s:S end.
#> [1] "s = 1: Ran 20 steps, wgt = 66:167"
#> [1] "s = 2: Ran 20 steps, wgt = 68:156"
#> [1] "s = 3: Ran 20 steps, wgt = 61:184"
#> [1] "s = 4: Ran 20 steps, wgt = 83:96"
#> [1] "s = 5: Ran 20 steps, wgt = 63:180"
#> [1] "s = 6: Ran 20 steps, wgt = 63:171"
#> [1] "s = 7: Ran 20 steps, wgt = 67:156"
#> [1] "s = 8: Ran 20 steps, wgt = 57:209"
#> [1] "s = 9: Ran 20 steps, wgt = 61:192"
#> [1] "s = 10: Ran 20 steps, wgt = 62:173"
#> [1] "s = 11: Ran 20 steps, wgt = 61:183"
#> [1] "s = 12: Ran 20 steps, wgt = 73:126"

# Report result:
print(paste0("Finished running ", S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"

The user feedback within the inner loop has now been replaced by storing the current values of the parameters of interest in a row of data. Thus, data collects information on all intermediate states:

Table 16.1: A record of simulation states and results.

| s | t | act | w_A | w_B |
|--:|--:|----:|---------:|---------:|
| 1 | 1 | 1 | 52.50000 | 50.00000 |
| 1 | 2 | 1 | 55.06098 | 50.00000 |
| 1 | 3 | 1 | 57.68141 | 50.00000 |
| 1 | 4 | 2 | 57.68141 | 56.96499 |
| 1 | 5 | 2 | 57.68141 | 64.41812 |
| 1 | 6 | 1 | 60.04347 | 64.41812 |

Note that both the construction of our model and the selection of variables stored in data determine the scope of results that we can examine later. For instance, the current model did not explicitly represent the reward values received on every trial. As they were constant for each option, we did not need to know them here, but this may change if the environment became more dynamic (see Section 16.2.2). Similarly, we chose not to store the current value of the agent’s aspiration level $$A$$ for every trial. However, if $$A$$ ever changed within a simulation, we may want to store a record of its values in data.

#### Visualizing results

This allows us to visualize the learning process and progress. Figure 16.2 shows which option was chosen in each time step and simulation:

# Visualize results:
ggplot(data, aes(x = t)) +
  facet_wrap(~s) +
  geom_path(aes(y = act), col = Grau) +
  geom_point(aes(y = act, col = factor(act)), size = 2) +
  scale_color_manual(values = usecol(c(Bordeaux, Seegruen))) +
  scale_y_continuous(breaks = 1:2, labels = alt) +
  labs(title = paste0("Agent actions (choices) in ", S, " simulations"),
       x = "Time steps", y = "Action", color = "Option:") +
  theme_ds4psy()

As Option B yields higher rewards than Option A, learning would be reflected in Figure 16.2 by an increasing preference of B over A. Figure 16.3 shows the trends in option weights per time step for all 12 simulations:

ggplot(data, aes(x = t, group = s)) +
  geom_path(aes(y = w_A), size = .5, col = usecol(Bordeaux, alpha = .5)) +
  geom_path(aes(y = w_B), size = .5, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Agent weights (expectations/preferences) in ", S, " simulations"),
       x = "Time steps", y = "Weights") +
  theme_ds4psy()

The systematic difference in trends can be emphasized by showing their averages (Figure 16.4):

ggplot(data, aes(x = t)) +
  geom_smooth(aes(y = w_A), size = 1, col = usecol(Bordeaux, alpha = .5)) +
  geom_smooth(aes(y = w_B), size = 1, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Trends in (average) agent weights in ", S, " simulations"),
       x = "Time steps", y = "Weights", col = "Option:") +
  theme_ds4psy()

#### Interpretation

• The visualization of agent actions (in Figure 16.2) shows that most agents gradually chose the superior Option B more frequently. Thus, learning took place and manifested itself in a systematic trend in the choice behavior of the agent (particularly in later trials t).

• The visualization of agent weights (in Figure 16.3) illustrates that most learners preferred the superior Option B (shown in green) within about 5–10 trials of the simulation.

• The systematic preference of Option B over Option A (from Trial 5 onward) is further emphasized by comparing average trends (in Figure 16.4).

When interpreting the results of models, we must not forget that they show different facets of the same process. As the values of the weights wgt determine the probabilities of choosing each option (via p()), there is a logical link between the concepts illustrated by these visualizations: The behavioral choices shown in Figure 16.2 are a consequence of the weights shown in Figures 16.3 and 16.4, with some added noise due to random sampling. Thus, the three visualizations show how learning manifests itself on different conceptual and behavioral levels.
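This link can be made concrete by converting a weight vector into choice probabilities. As an illustration, using the final weights of the single run shown earlier (62 and 181):

```r
wgt_final <- c(62, 181)             # final weights of the earlier run
prob <- wgt_final / sum(wgt_final)  # choice probabilities p(k)
round(prob, 2)  # about 0.26 for Option A and 0.74 for Option B
```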

#### Practice

As both options always perform above the constant aspiration level (of $$A = 5$$), the weight values wgt can only increase when rewards are experienced (as $$\Delta w > 0$$).

1. What would happen if the aspiration level was set to a value between both options (e.g., $$A = 15$$)?

2. What would happen if the aspiration level was set to the value of the lower option (e.g., $$A = 10$$)?

#### Boundary conditions

Reflecting on the consequences of different aspiration values highlights some boundary conditions that need to hold for learning to take place. We can easily see the impact of an aspiration level $$A$$ matching the value of a current reward $$r(k)$$.
Due to a lack of surprise (i.e., a difference of $$r(k) - A = 0$$), the value of $$\Delta w(k)$$ becomes 0 as well. And if the value of $$A$$ exceeds $$r(k)$$, both the difference $$r(k) - A$$ and $$\Delta w(k)$$ are negative, which leads to a reduction of the corresponding weight. If option weights were not bounded to be $$\geq 0$$, this could create a problem for their conversion into a probability by $$p(k)$$. However, as we chose a value of $$A$$ that was smaller than the smallest reward ($$A < \min_{k}\pi(k)$$), this problem could not occur here.

To prevent all weights from converging to zero, the value of the aspiration level $$A$$ must generally be lower than the reward of at least one option. Under this condition, the two functions specified above (i.e., p() and delta_w()) will eventually place almost all weight on the best alternative (as its weight always increases the most). Using these weights for choosing alternatives implies that, in the long run, the best alternative will be selected almost exclusively.
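This can be checked directly: with an aspiration level above both rewards (e.g., $$A = 25$$, an assumed value for illustration), the increment $$\Delta w(k)$$ is negative for both options, so all weights shrink:

```r
rew   <- c(10, 20)     # constant rewards (as above)
alpha <- 1             # learning rate
A     <- 25            # aspiration level above both rewards
wgt   <- c(50, 50)     # initial weights
P     <- wgt/sum(wgt)  # choice probabilities

delta <- alpha * P * (rew - A)  # increments for both options
delta  # -7.5 and -2.5: both weights would shrink
```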

The basic framework shown here can be extended in many directions. For instance, the following example is provided by Page (2018, p. 308). We consider two changes:

1. Endogenous aspirations: Rather than using a constant aspiration level $$A$$, a learner could adjust the value of $$A$$ based on experience by setting it equal to the average reward value.

2. Fixed choice sequence: Rather than randomly sampling the chosen option (according to Equation 1, i.e., the probabilities of p()), we may want to evaluate the effects of a fixed sequence of actions (e.g., choosing B, B, A, B).

Given this setup, the question “What is being learned?” can only refer to the adjusted weights wgt (as the sequence of actions is predetermined). However, when moving away from a constant aspiration level, we also want to monitor the values of our adjusted aspiration level $$A_{t}$$.
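Setting $$A$$ to the average observed reward can also be computed incrementally, without storing the full reward history: $$A_t = A_{t-1} + (r_t - A_{t-1}) / t$$. A sketch verifying this equivalence (using the reward sequence 10, 10, 20, 10, which the fixed action sequence B, B, A, B generates in the code below):

```r
r_obs <- c(10, 10, 20, 10)  # rewards from choosing B, B, A, B
A_run <- 0                  # running mean of observed rewards

for (t in seq_along(r_obs)){
  A_run <- A_run + (r_obs[t] - A_run) / t  # incremental mean update
}

A_run  # 12.5, identical to mean(r_obs)
```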

These changes can easily be implemented as follows:16

# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. REDUCED learning rule:
delta_w <- function(k){  # Adjusting the weight of k:
  # (alpha * p(k) * (r(k) - A))
  (alpha * 1 * (r(k) - A))  # HACK: seems to be assumed in Page (2018), p. 308
}

# Environment:
alt <- c("A", "B")  # constant options
rew <- c(20, 10)    # constant rewards (A > B)
wgt <- c(50, 50)    # initial weights

actions <- c("B", "B", "A", "B")
# actions <- rep("A", 10)
# actions <- rep(c("A", "B"), 50)

# Agent:
alpha <- 1  # learning rate
A <- 5      # initial aspiration level

# Simulation:
T <- length(actions)  # time steps/rounds
r_hist <- rep(NA, T)  # history of rewards

for (t in 1:T){

  # (1) Use wgt to determine current action:
  # cur_prob <- c(p(1), p(2))
  # cur_prob <- c(1, 1)  # HACK: seems to be assumed in Page (2018), p. 308
  # cur_act <- sample(alt, size = 1, prob = cur_prob)
  cur_act <- actions[t]
  ix_act  <- which(cur_act == alt)

  # (2) Update wgt (based on reward):
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
  wgt[ix_act] <- new_w  # update wgt

  # (3) Determine reward and adjust aspiration level:
  r_hist[t] <- r(ix_act)           # store reward value in history
  A <- mean(r_hist, na.rm = TRUE)  # adapt A

  # (+) User feedback:
  print(paste0(t, ": ", cur_act, " (", ix_act, "), ",
               "wgt = ", paste(round(wgt, 0), collapse = ":"),
               ", A(", t, ") = ", round(A, 1)))
}
#> [1] "1: B (2), wgt = 50:55, A(1) = 10"
#> [1] "2: B (2), wgt = 50:55, A(2) = 10"
#> [1] "3: A (1), wgt = 60:55, A(3) = 13.3"
#> [1] "4: B (2), wgt = 60:52, A(4) = 12.5"

#### Practice

Let’s further reflect on the consequences of our endogenous aspiration level $$A_{t}$$:

1. How would $$A_{t}$$ change if an agent chose the same option 10 times in a row?

2. How would $$A_{t}$$ change if an agent alternated between Option A and Option B 50 times (i.e., for a total of 100 trials)?

#### Limitations

Given the simplicity of our basic learning paradigm, we should note some limitations:

• Learning only from experience: As our model only learns from experienced options, it lacks the ability to learn from counterfactual information.

• Individual learning: The model here only considers an individual agent. Thus, it does not cover social situations and the corresponding phenomena of social information, social influence, and social learning.

• The model’s level of analysis is flexible or unclear: Some models may aim to represent an actual mechanism (e.g., its implementation in neuronal structures), while others may be content with providing an abstract description of required components (e.g., a trade-off between hypothetical parts or processes).

Most of these limitations can be addressed by adding new elements to our basic learning paradigm. However, some limitations can also be interpreted as strengths: The simple, abstract, and explicit nature of the model makes it easy to test and verify the necessary and sufficient conditions under which learning can take place. This higher degree of identifiability is the hallmark of formal models and distinguishes them from verbal theories.

### 16.2.2 Dynamic environments

Our discussion so far allowed for some flexibility in an agent. Adaptive adjustments in the agent’s internal state or behavior provided evidence for learning, but the environment was assumed to be stable. In reality, however, environments are rarely stable. Instead, a variety of possible changes keep presenting new challenges to a learning agent.

#### Multi-armed bandits (MABs)

A seemingly simple, but subtle step away from a stable environment consists in adding uncertainty to the rewards received from choosing options, which can be achieved by rendering these rewards probabilistic (aka. stochastic). This creates a large and important family of models that are collectively known as multi-armed bandit (MAB) problems. The term “bandit” refers to a slot machine in a casino that allows players to continuously spend and occasionally win large amounts of money. As these bandits typically have only one lever (or arm), the term “multi-armed” refers to the fact that we can choose between several options (see Figure 16.5).

As all options are initially unfamiliar and may have different properties, an agent must first explore an option to estimate its potential rewards. With increasing experience, an attentive learner may notice differences between options and develop corresponding preferences. As soon as one option is perceived to be better than another, the agent must decide whether to keep exploring alternatives or to exploit the option that currently appears to be the best. Thus, MAB problems require a characteristic trade-off between exploration (i.e., searching for the best option) and exploitation (i.e., choosing the option that appears to be the best). The trade-off occurs because the total number of games (steps or trials) is finite (e.g., due to limitations of money or time). Thus, any trial spent on exploring an inferior option is costly, as it incurs a foregone payoff and reduces the total sum of rewards. Avoiding foregone payoffs creates a pressure towards exploiting seemingly superior options. However, we can easily imagine situations in which a better option is overlooked — either because the agent has not yet experienced or recognized its true potential or because an option has improved. Thus, as long as there remains some uncertainty about the current setup, the conflict between exploration and exploitation remains and must be negotiated.

MAB problems are widely used in biology, economics, engineering, psychology, and machine learning (see Hills et al., 2015, for a review). The main reason for this is that a large number of real-world problems can be mapped to the MAB template. For instance, many situations that involve a repeated choice between options (e.g., products, leisure activities, or even partners) or strategies (e.g., for advertising, developing, or selling stuff) can be modeled as MABs. Whereas the application contents and the mechanisms that govern the payoff distributions differ between tasks, the basic structure of a dynamic agent repeatedly interacting with a dynamic environment is similar in a large variety of situations. Thus, MAB problems provide an abstract modeling framework that accommodates many different tasks and domains.

Despite many parallels, different scientific disciplines still address different questions within the MAB framework. A basic difference consists in the distinction between theoretical vs. empirical approaches. As MABs offer a high level of flexibility, while still being analytically tractable, researchers in applied statistics, economics, operations research, and decision sciences typically evaluate the performance of specific algorithms and aim for formal proofs of their optimality or boundary conditions. By contrast, researchers from biology, behavioral economics, and psychology are typically less concerned with optimality, and primarily interested in empirical data that informs about the strategies and rates of success when humans and other animals face environments that can be modeled as MABs. As both approaches are informative and not mutually exclusive, researchers typically need to balance the formal rigor and empirical relevance of their analysis — which can be described as a scientist facing a 2-armed bandit problem.

In the following, we extend our basic learning model by adding a MAB environment with probabilistic rewards.

#### Basic idea

Assume that an agent is facing a choice between $$N$$ options that yield rewards $$\pi(1) ... \pi(N)$$ with some probability. In its simplest form, each option either yields a fixed reward (1) or no reward (0), but the probability of receiving a reward from each option is based on an unknown probability distribution $$p(1) ... p(N)$$. More generally, an option $$k$$ can yield a range of possible reward values $$\pi(k)$$ that are given by a corresponding probability distribution $$p(k)$$. As the rewards from such options can be analytically modeled by Bernoulli distributions, MAB problems with these properties are also called Bernoulli bandits (e.g., Page, 2018, p. 320). However, many other reward distributions and mechanisms for MABs are possible.

An agent’s goal in a MAB problem is typically to earn as much reward as possible (i.e., to maximize rewards). But as mentioned above, maximizing rewards requires balancing the two conflicting goals of exploring options to learn more about them vs. exploiting known options as much as possible. As exploring options can yield benefits (e.g., discovering superior options) but also incurs costs (e.g., sampling inferior options), the agent constantly faces a trade-off between exploration and exploitation.

Note that — from the perspective of a modeler — Bernoulli bandits provide a situation under risk (i.e., known options, outcomes, and probabilities). However, from the perspective of the agent, the same environment presents a problem of uncertainty (i.e., unknown options, outcomes, or probabilities).
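From the modeler’s perspective, the expected reward of each Bernoulli option is easy to compute (using the values and probabilities defined in the next code block): Option A is worth 10 · .5 = 5 and Option B is worth 20 · .5 = 10 per trial, on average:

```r
rew_val <- list(c(10,  0), c(20,  0))  # reward values (by option)
rew_prb <- list(c(.5, .5), c(.5, .5))  # reward probabilities (by option)

# Expected reward of each option (known to the modeler, not the agent):
ev <- sapply(1:2, function(k){ sum(rew_val[[k]] * rew_prb[[k]]) })
ev  # 5 (Option A) and 10 (Option B)
```

Incidentally, Option A’s expected reward equals the aspiration level $$A = 5$$ used above, so the weight increments of Option A will average out to zero over many trials.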

#### Coding a model

To create an MAB model, we extend the stable environment from above to a stochastic one, in which each option yields certain reward values with given probabilities:

# Initial setup (see Page, 2018, p. 308):

# Environment:
alt <- c("A", "B")  # constant options
rew_val <- list(c(10,  0), c(20,  0))  # reward values (by option)
rew_prb <- list(c(.5, .5), c(.5, .5))  # reward probabilities (by option)

# Agent:
alpha <- 1          # learning rate
A <- 5              # aspiration level
wgt <- c(50, 50)    # initial weights 

In the current model, we still face a binary forced choice between the two environmental options given by alt, but their rewards now vary between some fixed value and zero (10 or 0 vs. 20 or 0, respectively) and occur at a given rate or probability (here: 50:50 for both options). Note that both rew_val and rew_prb are defined as lists, as each of their elements is a numeric vector (and remember that the i-th element of a list l is obtained by l[[i]]).
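As a quick reminder of list indexing in R (using the rew_val object just defined): single brackets return a sub-list, whereas double brackets extract the element itself:

```r
rew_val <- list(c(10,  0), c(20,  0))  # reward values (by option)

rew_val[2]    # [ ] returns a list (of length 1)
rew_val[[2]]  # [[ ]] returns the numeric vector c(20, 0)
```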

Recycling our learning agent from above, we only need to change the r() function that governs how the environment dispenses rewards:

# 1. Choosing:
p <- function(k){  # Probability of choosing k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k:
r <- function(k){
  reward <- sample(x = rew_val[[k]], size = 1, prob = rew_prb[[k]])
  # print(reward)  # 4debugging
  return(reward)
}

# # Check: Choose each option N times
# N <- 1000
# v_A <- rep(NA, N)
# v_B <- rep(NA, N)
# for (i in 1:N){
#   v_A[i] <- r(1)
#   v_B[i] <- r(2)
# }
# table(v_A)
# table(v_B)

# 2. Learning:
delta_w <- function(k){  # Adjusting the weight of k:
  (alpha * p(k) * (r(k) - A))
}

The functions for choosing options with probability p() and for adjusting the weight increment delta_w(k) of the chosen option k were copied from above. By contrast, the function r(k) was adjusted: It now determines the reward of alternative k by sampling from its possible values rew_val[[k]] with the probabilities given by rew_prb[[k]].

Before running the simulation, let’s ask ourselves some simple questions:

• What should be learned in this setting?

• What aspects of the learning process change due to the introduction of stochastic options?

• Given the current changes to the environment, has the learning task become easier or more difficult than before?

We can explore these questions by copying the simulation code from above (changing only the environmental definitions rew_val and rew_prb and their use in the r() function). Running this simulation yields the following results:

# Simulation:
S <- 12  # number of simulations
T <- 20  # time steps/cycles (per simulation)

# Environment:
alt <- c("A", "B")  # constant options
rew_val <- list(c(10,  0), c(20,  0))  # reward values (by option)
rew_prb <- list(c(.5, .5), c(.5, .5))  # reward probabilities (by option)

# Prepare data structure for storing results:
data <- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = S * T))
names(data) <- c("s", "t", "act", paste0("w_", alt))

for (s in 1:S){     # each simulation: ----
  
  # Initialize agent:
  alpha <- 1        # learning rate
  A <- 5            # aspiration level
  wgt <- c(50, 50)  # initial weights
  
  for (t in 1:T){   # each step/trial: ----
    
    # (1) Use wgt to determine current action:
    cur_prob <- c(p(1), p(2))
    cur_act <- sample(alt, size = 1, prob = cur_prob)
    ix_act <- which(cur_act == alt)
    
    # (2) Update wgt (based on reward):
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w  # update wgt
    
    # (+) Record results:
    data[((s-1) * T) + t, ] <- c(s, t, ix_act, wgt)
    
  } # for t:T end.
  
  print(paste0("s = ", s, ": Ran ", T, " steps, wgt = ",
               paste(round(wgt, 0), collapse = ":")))
  
} # for s:S end.
#> [1] "s = 1: Ran 20 steps, wgt = 53:71"
#> [1] "s = 2: Ran 20 steps, wgt = 50:82"
#> [1] "s = 3: Ran 20 steps, wgt = 45:65"
#> [1] "s = 4: Ran 20 steps, wgt = 64:80"
#> [1] "s = 5: Ran 20 steps, wgt = 51:127"
#> [1] "s = 6: Ran 20 steps, wgt = 49:60"
#> [1] "s = 7: Ran 20 steps, wgt = 46:96"
#> [1] "s = 8: Ran 20 steps, wgt = 41:86"
#> [1] "s = 9: Ran 20 steps, wgt = 60:63"
#> [1] "s = 10: Ran 20 steps, wgt = 57:83"
#> [1] "s = 11: Ran 20 steps, wgt = 45:71"
#> [1] "s = 12: Ran 20 steps, wgt = 45:65"

# Report result:
print(paste0("Finished running ", S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"

The feedback from running the model suggests that the model ran successfully and led to some changes in the agents’ wgt values. More detailed information on the process of learning can be obtained by examining the collected data (see below).

Before we examine the results further, note a constraint in all our implementations so far: As we modeled the reward mechanism as a function r() that is only called when updating the agent's weights (in delta_w()), we cannot easily collect the reward value obtained on every round when filling data. If we needed the reward values (e.g., for adjusting the aspiration level in Exercise 16.4.1), we could either collect them within the r() function or change the inner loop (and possibly re-write the function delta_w()) so that the current reward value is explicitly represented before being used to update the agent's expectations (i.e., wgt).
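The second option could look as follows (a sketch, assuming a hypothetical variant delta_w_2() that accepts the sampled reward as an argument instead of calling r() internally):

```r
# Setup (as before):
alt <- c("A", "B")
rew_val <- list(c(10,  0), c(20,  0))
rew_prb <- list(c(.5, .5), c(.5, .5))
alpha <- 1; A <- 5; wgt <- c(50, 50)
p <- function(k){ wgt[k]/sum(wgt) }
r <- function(k){ sample(x = rew_val[[k]], size = 1, prob = rew_prb[[k]]) }

# Re-written weight update (hypothetical variant, taking the reward as input):
delta_w_2 <- function(k, reward){
  (alpha * p(k) * (reward - A))
}

# One step of the inner loop, with the reward made explicit:
# (1) Use wgt to determine current action:
cur_prob <- c(p(1), p(2))
cur_act  <- sample(alt, size = 1, prob = cur_prob)
ix_act   <- which(cur_act == alt)

# (2) Obtain reward explicitly, then update wgt:
cur_rew <- r(ix_act)  # the current reward is now explicitly represented
wgt[ix_act] <- wgt[ix_act] + delta_w_2(ix_act, cur_rew)

# (+) cur_rew could now be recorded as an additional column of data.
```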

Leaving some information implicit in a model is not necessarily a bug, as it may enable short and elegant models. However, a model’s level of abstraction crucially depends on how its functions are written — and we often need to compromise between formal elegance and practical concerns.

#### Visualizing results

As before, we can visualize the learning process and progress recorded in data. Figure 16.6 shows which option was chosen in each time step and simulation:

# Visualize results:
ggplot(data, aes(x = t)) +
  facet_wrap(~s) +
  geom_path(aes(y = act), col = Grau) +
  geom_point(aes(y = act, col = factor(act)), size = 2) +
  scale_color_manual(values = usecol(c(Bordeaux, Seegruen))) +
  scale_y_continuous(breaks = 1:2, labels = alt) +
  labs(title = paste0("Agent actions (choices) in ", S, " simulations"),
       x = "Time steps", y = "Action", color = "Option:") +
  theme_ds4psy()

As before, Option B still yields higher rewards on average than Option A. However, due to the 50% chance of not receiving a reward for either option, the learning process is made more difficult. Although Figure 16.6 suggests that some agents develop a slight preference for the superior Option B, there are no clear trends within 20 trials. Figure 16.7 shows the option weights per time step for all 12 simulations:

ggplot(data, aes(x = t, group = s)) +
  geom_path(aes(y = w_A), size = .5, col = usecol(Bordeaux, alpha = .5)) +
  geom_path(aes(y = w_B), size = .5, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Agent weights (expectations/preferences) in ", S, " simulations"),
       x = "Time steps", y = "Weights") +
  theme_ds4psy()

Although there appears to be a general trend towards preferring the superior Option B (shown in green), the situation is messier than before. Interestingly, we occasionally see that option weights can also decline (due to negative $$\Delta w$$ values, when an option performed below the aspiration level $$A$$).

To document that some systematic learning has occurred even in the stochastic multi-armed bandit (MAB) setting, Figure 16.8 shows the average trends in the option weights per time step for all 12 simulations:

ggplot(data, aes(x = t)) +
  geom_smooth(aes(y = w_A), size = 1, col = usecol(Bordeaux, alpha = .5)) +
  geom_smooth(aes(y = w_B), size = 1, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Trends in (average) agent weights in ", S, " simulations"),
       x = "Time steps", y = "Weights", col = "Option:") +
  theme_ds4psy()

#### Interpretation

• The visualization of agent actions (in Figure 16.6) in the MAB setting shows that only a few agents learn to prefer the superior Option B in the T = 20 trials.

• The visualization of agent weights (in Figure 16.7) illustrates that agents generally preferred the superior Option B (shown in green) in the second half (i.e., trials 10–20) of the simulation. However, weight values can both increase and decline and the entire situation is much noisier than before, i.e., the preferences are not clearly separated in individual simulations yet.

• However, averaging over the weights for both options (in Figure 16.8) shows that the preference for the better Option B is being learned, even if it does not manifest itself as clearly in the choice behavior yet.

Overall, switching from a stable (deterministic) environment to an uncertain (stochastic) environment rendered the learning task more difficult. But although individual agents may still exhibit some exploratory behavior after T = 20 trials, we see some evidence in the agents’ average belief (represented by the average weight values wgt over S = 12 agents) that they still learn to prefer the superior Option B over the inferior Option A.

#### Practice

Answering the following questions improves our understanding of our basic MAB simulation:

1. Describe what the learning agent “expects” and “observes” when it initially selects an option, and how it reacts, depending on whether this option yields its reward value or no reward.

2. Play with the simulation parameters (S and T) or agent parameters (alpha) to show more robust evidence for successful learning.

3. How would the simulation results change when the agent’s initial weights (or expectations) were lowered from wgt <- c(50, 50) to wgt <- c(10, 10)? Why?

4. Change the simulation code so that the reward value obtained on every trial is stored as an additional variable in data.

5. Imagine a situation in which a first option yields a lower maximum reward value than a second option (i.e., $$\max \pi(A) < \max \pi(B)$$), but the lower option yields its maximum reward with a higher probability (i.e., $$p(\max \pi(A)) > p(\max \pi(B))$$). This conflict should allow for scenarios in which an agent learns to prefer Option $$A$$ over Option $$B$$, despite $$B$$’s higher maximum reward. Play with the environmental parameters to construct such scenarios.

Hint: Setting the reward probabilities to rew_prb <- list(c(.8, .2), c(.2, .8)) creates one of many such scenarios. Can we define a general condition that states which option should be preferred?
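One candidate condition (a sketch, not the only possibility) compares the options' expected values, i.e., prefers Option A whenever the sum of its reward values weighted by their probabilities exceeds that of Option B. For the probabilities from the hint, this can be checked directly:

```r
# Environment from the hint:
rew_val <- list(c(10,  0), c(20,  0))  # reward values (by option)
rew_prb <- list(c(.8, .2), c(.2, .8))  # reward probabilities (by option)

# Expected value of each option k (values weighted by their probabilities):
ev <- sapply(1:2, function(k){ sum(rew_val[[k]] * rew_prb[[k]]) })
ev  # Option A: .8 * 10 = 8 exceeds Option B: .2 * 20 = 4
```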

#### Heuristic vs. optimal strategies

Our simulation so far paired a learning agent with a simple MAB problem. However, we can also imagine alternative agent strategies. Typical questions raised in this context include:

• What would be the performance of some Strategy X?

• What strategy would provide an optimal solution?

While all MAB settings invite the creation of strategies that balance exploration with exploitation, we can distinguish between two broad categories of strategies:

Heuristic := a simple strategy that aims to be successful. Note that we do not say “suboptimal,” as this is an empirical question: Heuristics often perform surprisingly well.
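As a concrete illustration (a sketch, not a strategy discussed by Page, 2018), a popular MAB heuristic is epsilon-greedy: Mostly choose the option with the highest observed mean reward, but explore a randomly drawn option with some small probability epsilon:

```r
set.seed(468)  # reproducible sampling

# Environment (as above):
rew_val <- list(c(10,  0), c(20,  0))
rew_prb <- list(c(.5, .5), c(.5, .5))
r <- function(k){ sample(x = rew_val[[k]], size = 1, prob = rew_prb[[k]]) }

# Epsilon-greedy heuristic on the current 2-option bandit:
epsilon <- .10     # exploration rate
n     <- c(0, 0)   # number of times each option was chosen
sum_r <- c(0, 0)   # total reward obtained per option

for (t in 1:1000){
  means <- ifelse(n > 0, sum_r/n, 0)   # observed mean reward per option
  if (runif(1) < epsilon || all(n == 0)){
    k <- sample(1:2, size = 1)         # explore: choose a random option
  } else {
    k <- which.max(means)              # exploit: choose the best option so far
  }
  rew <- r(k)
  n[k] <- n[k] + 1
  sum_r[k] <- sum_r[k] + rew
}
n  # the superior Option B should be chosen far more often than Option A
```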

Much of the scientific literature is concerned with the discovery and verification of optimal strategies. Typically, we strive for optimization under constraints, which renders the optimization problem even harder.

Notion of Bayesian bandits: The agent has prior beliefs about the distribution of rewards of each option. Exploration and experience adjust these beliefs in an optimal fashion. Note that the optimal (or rational) incorporation of experience does not yet yield an optimal strategy for choosing actions.
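To make the notion of prior beliefs concrete, here is a minimal sketch (assuming Bernoulli rewards and Beta priors, which are not used elsewhere in this chapter): Observing a reward or non-reward updates the two Beta parameters, and the posterior mean serves as the agent's current belief about an option's reward probability:

```r
# Beta-Bernoulli belief updating for a single option (illustrative sketch):
a <- 1; b <- 1  # Beta(1, 1) prior: uniform belief about reward probability

update_belief <- function(a, b, success){
  if (success) c(a + 1, b) else c(a, b + 1)  # conjugate Beta update
}

# Observing 3 rewards and 1 non-reward:
for (obs in c(TRUE, TRUE, FALSE, TRUE)){
  ab <- update_belief(a, b, obs)
  a <- ab[1]; b <- ab[2]
}

c(a, b)    # Beta(4, 2) posterior
a/(a + b)  # posterior mean 4/6: current belief in the option's reward probability
```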

A method for computing the optimal action consists in computing the so-called Gittins index for each option. This index essentially incorporates all that can be known so far and computes the value of each option at this moment, given that we only choose optimal actions on all remaining trials (see Page, 2018, p. 322ff., for an example). For each trial, the option with the highest Gittins index is the optimal choice. As this method is conceptually simple but computationally expensive, it is unlikely that organisms solve MAB problems in precisely this way (though they may approximate its performance by other mechanisms).

Actually, we typically need a third category to limit the range of possible performances: Baselines.

Need for running competitions between strategies. (See methodology of benchmarking and RTA, below).

### 16.2.3 Evaluating and extending agent models

#### Comparing strategies by benchmarking

Need for comparing multiple strategies.

Beware of erroneous results: Benchmarking a range of strategies helps to avoid drawing premature conclusions regarding the (ir-)rationality of agents.

See the recommendations of RTA (from Neth, Sims, & Gray, 2016).

Adding uncertainty to payoff distributions is only a small step towards more dynamic and realistic environments. We can identify several sources of additional variability (i.e., both ignorance and uncertainty):

1. Restless bandits: Beyond providing rewards based on a stable probability distribution, environments may alter the number of options (i.e., additional or disappearing options), the reward types or magnitudes associated with options, or the distributions with which options provide rewards. All these changes can occur continuously (e.g., based on some internal mechanism) or suddenly (e.g., by some disruption).

2. Reactive environments: How do environments respond to and interact with agents? Many environments deplete as they are exploited (e.g., patches of food, fish in the sea, etc.), but some also grow or improve (e.g., acquiring expertise, playing an instrument, practicing some sport).

3. Multiple agents: Allowing for multiple agents in an environment changes an individual’s game against nature by adding a social game (see the chapter on Social situations). Although this adds many new opportunities for interaction (e.g., social influence and social learning) and even blurs the conceptual distinction between agents and environments, it is not immediately clear whether adding potential layers of complexity necessarily requires more complex agents or simulation models (see Section 17.1.1).
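The first of these extensions (restless bandits) can be sketched by letting the options' reward probabilities drift on every trial (a hypothetical mechanism, not implemented elsewhere in this chapter):

```r
set.seed(135)  # reproducible sampling

# A restless 2-armed bandit: reward probabilities drift by a random walk:
p_rew <- c(.5, .5)  # initial probability of a (non-zero) reward per option

drift <- function(p, sd = .05){
  p_new <- p + rnorm(length(p), mean = 0, sd = sd)  # random perturbation
  pmin(pmax(p_new, 0), 1)                           # keep probabilities in [0, 1]
}

for (t in 1:100){
  p_rew <- drift(p_rew)
  # ... an agent choosing option k on this trial would receive its reward
  #     with the current probability p_rew[k].
}
p_rew  # the options' reward probabilities after 100 trials
```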

1. Page (2018) uses the Greek letter $$\gamma$$ as the learning rate. But as some learning models also use $$\gamma$$ as a discount factor that determines the importance of future rewards, we denote the learning rate by the more common letter $$\alpha$$ here.↩︎

2. As r() is called by delta_w(), this creates a functional dependency in our model that renders information implicit (here: the value of the reward received on every trial). As we will see, models often need to balance concerns for elegance and practicality.↩︎

3. We use a reduced learning rule (which replaces $$p(k)$$ by a constant of 1) to obtain the same values as in Page (2018) (p. 308).↩︎