17.6 Exercises
Here are some exercises for simple learning agents that interact with stable or dynamic environments.
17.6.1 Learning with short-term aspirations
Our basic learning model (developed in Section 17.2) assumed that a learning agent has a constant aspiration level of \(A = 5\). As the reward from selecting even the worst option still exceeds \(A\), every choice leads to a positive “surprise” upon receiving a reward, corresponding to increments of \(\Delta w > 0\) and monotonic increases in the weight values wgt. Although having low expectations may guarantee a happy agent, it seems implausible that an agent never adjusts its aspiration level \(A\). This exercise tries the opposite approach, by assuming an extremely myopic agent who always updates \(A\) to the most recent reward value:
1. Adjust \(A\) to the current reward value on each trial \(t\).
   Hint: This also requires that the current reward value is explicitly represented in each loop, rather than only being computed implicitly when calling the delta_w() function.
2. Does the agent still successfully learn to choose the better option? Describe how the learning process changes (in terms of the values of \(A\) and \(wgt\), and the options being chosen).
3. Given the new updating mechanism for \(A\) (and assuming no changes in the agent’s learning rate alpha, the initial wgt values, and delta_w()), how could a change in the probability of choosing options (i.e., the p() function) increase the speed of learning?
4. Explain why it is essential that \(A\) is adjusted after updating the agent’s wgt values. What would happen if an agent prematurely updated its aspiration level to the current reward (prior to updating wgt)?
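As a reminder, the weight-updating function delta_w() from Section 17.2 implements the rule \(\Delta w(k) = \alpha \cdot p(k) \cdot (r(k) - A)\). With a constant aspiration level of \(A = 5\) and rewards of 10 or 20 units, the difference \(r(k) - A\) is always \(+5\) or \(+15\), so every update in the basic model increases the chosen option’s weight.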
Solution
ad 1.: We copy the functions p(), r(), and delta_w(), as defined in Section 17.2:
```r
# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. Learning:
delta_w <- function(k){  # Adjusting the weight of k:
  (alpha * p(k) * (r(k) - A))
}
```
The following code extends the code from above in two ways:
- After updating wgt, the current reward value cur_rew is explicitly determined (for a second time) and the aspiration level A is set to this value on each individual trial (see Step 3 within the inner loop);
- data is extended to include the current value of A (or cur_rew) on each individual trial.
```r
# Simulation:
n_S <- 12  # number of simulations
n_T <- 20  # time steps/cycles (per simulation)

# Environmental constants:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Prepare data structure for storing results:
data <- as.data.frame(matrix(ncol = (4 + length(alt)), nrow = n_S * n_T))
names(data) <- c("s", "t", "act", "A", paste0("w_", alt))

for (s in 1:n_S){  # each simulation: ----

  # Initialize agent:
  alpha <- 1          # learning rate
  A     <- 5          # aspiration level
  wgt   <- c(50, 50)  # initial weights

  for (t in 1:n_T){  # each step/trial: ----

    # (1) Use wgt to determine current action:
    cur_prob <- c(p(1), p(2))
    cur_act  <- sample(alt, size = 1, prob = cur_prob)
    ix_act   <- which(cur_act == alt)

    # (2) Update wgt (based on reward):
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w                    # update wgt

    # (3) Determine reward and adjust A:
    cur_rew <- rew[ix_act]  # value of current reward
    A <- cur_rew            # update A

    # (+) Record results (in the order of names(data)):
    data[((s - 1) * n_T) + t, ] <- c(s, t, ix_act, A, wgt)

  } # for t:n_T end.

  print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ",
               paste(round(wgt, 0), collapse = ":"),
               ", A = ", A))

} # for s:n_S end.
#> [1] "s = 1: Ran 20 steps, wgt = 34:76, A = 10"
#> [1] "s = 2: Ran 20 steps, wgt = 33:79, A = 10"
#> [1] "s = 3: Ran 20 steps, wgt = 43:66, A = 20"
#> [1] "s = 4: Ran 20 steps, wgt = 33:86, A = 20"
#> [1] "s = 5: Ran 20 steps, wgt = 37:76, A = 20"
#> [1] "s = 6: Ran 20 steps, wgt = 31:83, A = 10"
#> [1] "s = 7: Ran 20 steps, wgt = 39:72, A = 20"
#> [1] "s = 8: Ran 20 steps, wgt = 43:66, A = 20"
#> [1] "s = 9: Ran 20 steps, wgt = 30:94, A = 20"
#> [1] "s = 10: Ran 20 steps, wgt = 33:79, A = 10"
#> [1] "s = 11: Ran 20 steps, wgt = 30:94, A = 20"
#> [1] "s = 12: Ran 20 steps, wgt = 29:97, A = 20"

# Report result:
print(paste0("Finished running ", n_S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"
```
The printed summary information after each simulation shows that all agents seem to successfully discriminate between options (by learning a higher weight for the superior Option B).
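A quick way to check this claim in the recorded data (a minimal sketch, using the data frame data and the column names defined above):

```r
final <- data[data$t == n_T, ]  # final trial of each simulation
mean(final$w_B > final$w_A)     # proportion of runs favoring Option B
```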
ad 2.: From the 2nd trial onwards, the agent always aspires to its most recent reward. Thus, the agent is never surprised when choosing the same option (and receiving the same reward) again and only learns upon switching options:
- When switching from Option A to Option B, the reward received exceeds \(A\), leading to an increase in the weight for Option B.
- When switching from Option B to Option A, the reward received is lower than \(A\), leading to a decrease in the weight for Option A. This explains why the weights of Option A can decrease and are generally lower than their initial values in this simulation.
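For instance, after receiving a reward of 20 from Option B (so that \(A = 20\)), switching to Option A (i.e., \(k = 1\)) yields \(\Delta w(1) = \alpha \cdot p(1) \cdot (10 - 20) < 0\), whereas switching to Option B (\(k = 2\)) after a reward of 10 (so that \(A = 10\)) yields \(\Delta w(2) = \alpha \cdot p(2) \cdot (20 - 10) > 0\).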
ad 3.: Given this setup, the pace of learning depends on the frequency of switching options. Increasing the agent’s propensity for switching would increase its learning speed.
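One way of achieving this is to make the choice rule less sensitive to the current weights, for instance by mixing the weight-proportional rule with a uniform random component (a sketch; the function name p_explore() and the value of epsilon are illustrative assumptions, not part of the original model):

```r
epsilon <- 0.20  # assumed probability mass reserved for uniform exploration

p_explore <- function(k){  # a more exploratory variant of p():
  (1 - epsilon) * wgt[k]/sum(wgt) + epsilon * (1/length(wgt))
}
```

Using p_explore() instead of p() when computing cur_prob keeps the choice probabilities valid (they still sum to 1), but makes switches, and hence learning events, more frequent.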
ad 4.: If Steps 2 and 3 were reversed (i.e., A was updated to the current reward value prior to updating wgt), the difference r(k) - A in delta_w() would always be zero. Thus, the product \(\Delta w(k)\) would always be zero, irrespective of \(k\) or \(t\). Overall, the premature agent would never learn anything. (Consider trying this to verify it.)
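To verify this, the two steps inside the inner loop could be reversed (a minimal sketch of the changed lines):

```r
# Hypothetical reversal of Steps 2 and 3 within the inner loop:

# (3) Determine reward and adjust A (prematurely):
cur_rew <- rew[ix_act]  # value of current reward
A <- cur_rew            # update A first

# (2) Update wgt (based on reward):
# As r(ix_act) - A is now always 0, delta_w(ix_act) returns 0 and wgt never changes:
wgt[ix_act] <- wgt[ix_act] + delta_w(ix_act)
```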
17.6.2 Learning with additional options
This exercise extends our basic learning model (from Section 17.2) to a stable environment with additional options:
1. Extend our learning model to an environment with three options (named A, B, and C) that yield stable rewards (of 10, 30, and 20 units, respectively).
2. Visualize both the agent choices (i.e., actions) and expectations (i.e., weights) and interpret what they show.
3. Assuming that the options and rewards remain unchanged, what could improve the learning success? (Make a prediction and then test it by running the simulation.)
Solution
ad 1.: Still using the functions p(), r(), and delta_w(), as defined in Section 17.2:
```r
# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. Learning:
delta_w <- function(k){  # Adjusting the weight of k:
  (alpha * p(k) * (r(k) - A))
}
```
The following code extends the simulation from above to three options:
```r
# Simulation:
n_S <- 12  # number of simulations
n_T <- 20  # time steps/cycles (per simulation)

# Environmental constants:
alt <- c("A", "B", "C")  # constant options
rew <- c(10, 30, 20)     # constant rewards

# Prepare data structure for storing results:
data <- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = n_S * n_T))
names(data) <- c("s", "t", "act", paste0("w_", alt))

for (s in 1:n_S){  # each simulation: ----

  # Initialize learner:
  wgt   <- c(50, 50, 50)  # initial weights
  alpha <- 1              # learning rate
  A     <- 5              # aspiration level

  for (t in 1:n_T){  # each step: ----

    # (1) Use wgt to determine current action:
    cur_prob <- c(p(1), p(2), p(3))
    cur_act  <- sample(alt, size = 1, prob = cur_prob)
    ix_act   <- which(cur_act == alt)

    # (2) Update wgt (based on reward):
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w                    # update wgt

    # (+) Record results:
    data[((s - 1) * n_T) + t, ] <- c(s, t, ix_act, wgt)

  } # for t:n_T end.

  print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ",
               paste(round(wgt, 0), collapse = ":")))

} # for s:n_S end.
#> [1] "s = 1: Ran 20 steps, wgt = 55:157:78"
#> [1] "s = 2: Ran 20 steps, wgt = 53:196:67"
#> [1] "s = 3: Ran 20 steps, wgt = 56:146:77"
#> [1] "s = 4: Ran 20 steps, wgt = 53:110:108"
#> [1] "s = 5: Ran 20 steps, wgt = 63:149:56"
#> [1] "s = 6: Ran 20 steps, wgt = 56:144:78"
#> [1] "s = 7: Ran 20 steps, wgt = 59:116:84"
#> [1] "s = 8: Ran 20 steps, wgt = 57:71:124"
#> [1] "s = 9: Ran 20 steps, wgt = 61:117:77"
#> [1] "s = 10: Ran 20 steps, wgt = 61:70:110"
#> [1] "s = 11: Ran 20 steps, wgt = 60:125:76"
#> [1] "s = 12: Ran 20 steps, wgt = 58:157:68"
```
ad 2.: As before, the data frame data collects information on all intermediate states. This allows us to visualize the learning process in terms of the actions chosen (choices):

Figure 17.9: Options chosen (over 12 simulations).
internal expectations (weight values):

Figure 17.10: Weight values (over 12 simulations).
and their average trends:

Figure 17.11: Average trends in weight values.
Interpretation
The plot of agent actions (Figure 17.9) shows that some, but not all, agents learn to choose the best Option B within 20 rounds. Due to the more complicated setup of three options with more similar rewards, learning is slower and not always successful. This is also reflected in the visualization of the weights (Figure 17.10), which are not separated as cleanly as in the binary case (see Section 17.2). However, the average trends (Figure 17.11) still provide systematic evidence of learning when the agents’ weight values are averaged over all 12 simulations.
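For instance, the average trends shown in Figure 17.11 can be computed directly from data (a minimal base R sketch; the actual figures may have been created with different tools):

```r
# Average weight values per time step (over all 12 simulations):
avg_w <- aggregate(cbind(w_A, w_B, w_C) ~ t, data = data, FUN = mean)

# Plot the average trends:
matplot(avg_w$t, avg_w[ , c("w_A", "w_B", "w_C")], type = "l", lty = 1,
        xlab = "time step t", ylab = "mean weight value")
legend("topleft", legend = c("w_A", "w_B", "w_C"), col = 1:3, lty = 1)
```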
ad 3.: Increasing the learning rate (e.g., to \(\alpha = 2\)) or increasing the number of trials \(n_T\) per simulation both lead to better (i.e., faster or more discriminating) learning.
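To test this prediction, adjust the corresponding values in the simulation code above before re-running it, e.g.:

```r
# Example parameter changes (to be made in the simulation code above):
n_T   <- 100  # at the top: more time steps per simulation (was 20)
alpha <- 2    # in the agent-initialization step: higher learning rate (was 1)
```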
17.6.3 A foraging MAB
Create a model of a multi-armed bandit as a dynamic simulation of a foraging situation:
1. Create an environment with three options and dynamic rewards (either depleting or improving with use); a possible representation of such rewards is sketched below.
2. Assess the baseline performance (e.g., of a random strategy) and the optimal performance (when knowing the definition of the environment).
3. Assess the performance of 2 learning agents (with different parameters) and 2 heuristics (without learning).
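For instance, the dynamic rewards of such an environment could be represented as follows (a minimal sketch; the initial reward values, the change vector d, and the function name r_dynamic() are illustrative assumptions, not a prescribed solution):

```r
rew <- c(20, 15, 10)  # initial rewards of the 3 options
d   <- c(-2, -1, +2)  # change per use (negative: depleting, positive: improving)

r_dynamic <- function(k){           # reward from choosing option k:
  out <- rew[k]                     # current reward value
  rew[k] <<- max(0, rew[k] + d[k])  # option k changes after being used
  out
}
```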
17.6.4 Maximization vs. melioration
A deep puzzle regarding the adaptive behavior of organisms concerns the mechanism that leads to adaptation and learning. In principle, two alternative mechanisms are possible:
- maximizing: Selecting the best option (global optimization)
- meliorating: Selecting the better option (local optimization)
In most environments, both mechanisms will be correlated and converge on the same option(s). However, consider the following environment:
- There are two buttons/options: max (for maximizing) and mel (for meliorating).
- The reward or value of each option depends on the current system state x. The value of x is defined as a function of the past \(w=10\) choices. Specifically, x is the proportion of max choices over the \(w=10\) most recent choices.
- The value of choosing max or mel for any value of x is shown in Figure 17.12 (on the y-axis, which may denote the magnitude of a reward or the probability of receiving a constant reward).

Figure 17.12: A dynamic 2-armed bandit that contrasts maximization with melioration.
- The rewards for meliorating (curve mel(x), shown in red) are always higher than the rewards for maximizing (curve max(x), shown in green). As both curves are parallel, the difference is always \(B\), regardless of the value of x.
Your tasks:
1. Implement this environment as a simulation in R (a possible starting point is sketched below).
2. Implement some simple strategies (e.g., a random strategy, a win-stay/lose-shift strategy, etc.) and evaluate their performance. Which strategies are most successful?
3. Try implementing a learning strategy. What does the strategy learn?
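As a starting point for Task 1, the system state and reward structure could be sketched as follows (all names, the linear shape of the reward functions, and the value of B are illustrative assumptions; the actual curves are those of Figure 17.12):

```r
w <- 10                   # window size: number of past choices that define x
B <- 2                    # assumed constant advantage of mel over max

max_fun <- function(x){ 10 * x }      # assumed reward curve for choosing max
mel_fun <- function(x){ 10 * x + B }  # parallel curve, always higher by B

history <- rep("mel", w)  # the w most recent choices (initialized arbitrarily)

get_x <- function(){ mean(history == "max") }  # proportion of max choices

choose <- function(option){  # option is "max" or "mel":
  x <- get_x()                                # current system state
  reward <- ifelse(option == "max", max_fun(x), mel_fun(x))
  history <<- c(history[-1], option)          # update the choice window
  reward                                      # return the obtained reward
}
```

Whether the current choice is counted toward x before or after its reward is determined is a design decision that the exercise leaves open.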