## 16.4 Exercises

Here are some exercises for simple learning agents that interact with stable or dynamic environments.

### 16.4.1 Learning with short-term aspirations

Our basic learning model (developed in Section 16.2.1) assumed that a learning agent has a constant aspiration level of \(A = 5\). As the reward from selecting the worst option still exceeds \(A\), this always leads to a positive “surprise” upon receiving a reward, corresponding increments of \(\Delta w > 0\), and monotonic increases in weight values `wgt`

. Although having low expectations may guarantee a happy agent, it seems implausible that an agent never adjusts its aspiration level \(A\). This exercise tries the opposite approach, by assuming an extremely myopic agent who always updates \(A\) to the most recent reward value:

- Adjust \(A\) to the current reward value on each trial \(t\).

**Hint:** This also requires that the current reward value is explicitly represented in each loop, rather than only being computed implicitly when calling the `delta_w()`

function.

Does the agent still successfully learn to choose the better option? Describe

*how*the learning process changes (in terms of the values of \(A\) and \(wgt\), and the options being chosen).Given the new updating mechanism for \(A\) and assuming no changes in the agent’s learning rate

`alpha`

, the initial`wgt`

values and`delta_w()`

, how could a change in the probability of choosing options (i.e., the`p()`

function) increase the speed of learning?Explain why it is essential that \(A\) is adjusted

*after*updating the agent’s`wgt`

values. What would happen, if an agent prematurely updated its aspiration level to the current reward (prior to updating`wgt`

)?

#### Solution

- ad 1.: We copy the functions
`p()`

,`r()`

and`delta_w()`

, as defined in Section 16.2.1:

```
# 1. Choosing:
<- function(k){ # Probability of k:
p
/sum(wgt) # wgt denotes current weights
wgt[k]
}
# Reward from choosing k:
<- function(k){
r
# rew denotes current rewards
rew[k]
}
# 2. Learning:
<- function(k){ # Adjusting the weight of k:
delta_w
* p(k) * (r(k) - A))
(alpha
}
```

the following code extends the code from above in two ways:

After updating

`wgt`

, the current reward value`cur_rew`

is explicitly determined (for a second time) and the aspiration level`A`

is set to this value on each individual trial (see Step 3 within the inner loop);`data`

is extended to include the current value of`A`

(or`cur_rew`

) on each individual trial.

```
# Simulation:
<- 12 # number of simulations
n_S <- 20 # time steps/cycles (per simulation)
n_T
# Environmental constants:
<- c("A", "B") # constant options
alt <- c(10, 20) # constant rewards (A < B)
rew
# Prepare data structure for storing results:
<- as.data.frame(matrix(ncol = (4 + length(alt)), nrow = n_S * n_T))
data names(data) <- c("s", "t", "act", "A", paste0("w_", alt))
for (s in 1:n_S){ # each simulation: ----
# Initialize agent:
<- 1 # learning rate
alpha <- 5 # aspiration level
A <- c(50, 50) # initial weights
wgt
for (t in 1:n_T){ # each step/trial: ----
# (1) Use wgt to determine current action:
<- c(p(1), p(2))
cur_prob <- sample(alt, size = 1, prob = cur_prob)
cur_act <- which(cur_act == alt)
ix_act
# (2) Update wgt (based on reward):
<- wgt[ix_act] + delta_w(ix_act) # increment weight
new_w <- new_w # update wgt
wgt[ix_act]
# (3) Determine reward and adjust A:
<- rew[ix_act] # value of current reward
cur_rew <- cur_rew # update A
A
# (+) Record results:
-1) * n_T) + t, ] <- c(s, t, A, ix_act, wgt)
data[((s
# for t:n_T end.
}
print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ",
paste(round(wgt, 0), collapse = ":"),
", A = ", A))
# for i:n_S end.
} #> [1] "s = 1: Ran 20 steps, wgt = 34:76, A = 10"
#> [1] "s = 2: Ran 20 steps, wgt = 33:79, A = 10"
#> [1] "s = 3: Ran 20 steps, wgt = 43:66, A = 20"
#> [1] "s = 4: Ran 20 steps, wgt = 33:86, A = 20"
#> [1] "s = 5: Ran 20 steps, wgt = 37:76, A = 20"
#> [1] "s = 6: Ran 20 steps, wgt = 31:83, A = 10"
#> [1] "s = 7: Ran 20 steps, wgt = 39:72, A = 20"
#> [1] "s = 8: Ran 20 steps, wgt = 43:66, A = 20"
#> [1] "s = 9: Ran 20 steps, wgt = 30:94, A = 20"
#> [1] "s = 10: Ran 20 steps, wgt = 33:79, A = 10"
#> [1] "s = 11: Ran 20 steps, wgt = 30:94, A = 20"
#> [1] "s = 12: Ran 20 steps, wgt = 29:97, A = 20"
# Report result:
print(paste0("Finished running ", n_S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"
```

The printed summary information after each simulation shows that all agents seem to successfully discriminate between options (by learning a higher weight for the superior Option B).

ad 2.: From the 2nd trial onwards, the agent always aspires to its most recent reward. Thus, the agent is never surprised when the same reward is chosen again and only learns upon switching options:

When switching from Option A to Option B, the reward received exceeds \(A\), leading to an

*increase*in the weight for Option B.When switching from Option B to Option A, the reward received is lower than \(A\), leading to an

*decrease*in the weight for Option A. This explains why the weights of Option A can decrease and generally are lower than their initial values in this simulation.

ad 3.: Given this setup, the pace of learning depends on the frequency of switching options. Increasing the agent’s propensity for switching would increase its learning speed.

ad 4.: If Steps 2 and 3 were reversed (i.e.,

`A`

was updated to the current reward value prior to updating`wgt`

), the difference`r(k) - A`

in`delta_w()`

would always be zero. Thus, the product \(\Delta w(k)\) will always be zero, irrespective of \(k\) or \(t\). Overall, the premature agent would never learn anything. (Consider trying this to verify it.)

### 16.4.2 Learning with additional options

This exercise extends our basic learning model (from Section 16.2.1) to a stable environment with additional options:

Extend our learning model to an environment with

*three*options (named`A`

,`B`

, and`C`

) that yield stable rewards (of 10, 30, and 20 units, respectively).Visualize both the agent choices (i.e., actions) and expectations (i.e., weights) and interpret what they show.

Assuming that the options and rewards remain unchanged, what could improve the learning success? (Make a prediction and then test it by running the simulation.)

#### Solution

- ad 1.: Still using the functions
`p()`

,`r()`

and`delta_w()`

, as defined in Section 16.2.1:

```
# 1. Choosing:
<- function(k){ # Probability of k:
p
/sum(wgt) # wgt denotes current weights
wgt[k]
}
# Reward from choosing k:
<- function(k){
r
# rew denotes current rewards
rew[k]
}
# 2. Learning:
<- function(k){ # Adjusting the weight of k:
delta_w
* p(k) * (r(k) - A))
(alpha
}
```

the following code extends the simulation from above to three options:

```
# Simulation:
<- 12 # number of simulations
n_S <- 20 # time steps/cycles (per simulation)
n_T
# Environmental constants:
<- c("A", "B", "C") # constant options
alt <- c(10, 30, 20) # constant rewards
rew
# Prepare data structure for storing results:
<- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = n_S * n_T))
data names(data) <- c("s", "t", "act", paste0("w_", alt))
for (s in 1:n_S){ # each simulation: ----
# Initialize learner:
<- c(50, 50, 50) # initial weights
wgt <- 1 # learning rate
alpha <- 5 # aspiration level
A
for (t in 1:n_T){ # each step: ----
# (1) Use wgt to determine current action:
<- c(p(1), p(2), p(3))
cur_prob <- sample(alt, size = 1, prob = cur_prob)
cur_act <- which(cur_act == alt)
ix_act
# (2) Update wgt (based on reward):
<- wgt[ix_act] + delta_w(ix_act) # increment weight
new_w <- new_w # update wgt
wgt[ix_act]
# (+) Record results:
-1) * n_T) + t, ] <- c(s, t, ix_act, wgt)
data[((s
# for t:n_T end.
}
print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ",
paste(round(wgt, 0), collapse = ":")))
# for i:n_S end.
} #> [1] "s = 1: Ran 20 steps, wgt = 55:157:78"
#> [1] "s = 2: Ran 20 steps, wgt = 53:196:67"
#> [1] "s = 3: Ran 20 steps, wgt = 56:146:77"
#> [1] "s = 4: Ran 20 steps, wgt = 53:110:108"
#> [1] "s = 5: Ran 20 steps, wgt = 63:149:56"
#> [1] "s = 6: Ran 20 steps, wgt = 56:144:78"
#> [1] "s = 7: Ran 20 steps, wgt = 59:116:84"
#> [1] "s = 8: Ran 20 steps, wgt = 57:71:124"
#> [1] "s = 9: Ran 20 steps, wgt = 61:117:77"
#> [1] "s = 10: Ran 20 steps, wgt = 61:70:110"
#> [1] "s = 11: Ran 20 steps, wgt = 60:125:76"
#> [1] "s = 12: Ran 20 steps, wgt = 58:157:68"
```

- ad 2.: As before, the data frame
`data`

collects information on all intermediate states. This allows us to visualize the learning process in terms of the actions chosen (choices):

internal expectations (weight values):

and their average trends:

#### Interpretation

The plot of agent actions (**A**) shows that some, but not all agents learn to choose the best Option B within 20 rounds.
Due to the more complicated setup of three options and their more similar rewards, the learning is slower and not always successful.
This is also reflected in the visualization of the weights (shown in **B**), which are not separated as cleanly as in the binary case (see Section 16.2.1). However, the average trends (shown in **C**) still provide systematic evidence for learning when averaging the agent weight values over all 12 simulations.

- ad 3.: Increasing the learning rate (e.g., to \(\alpha = 2\)) or increasing the number of trials \(n_T\) per simulation both lead to better (i.e., faster or more discriminating) learning.

### 16.4.3 A foraging MAB

Create a model of a multi-armed bandit as a dynamic simulation of a *foraging* situation:

Environment with 3 options, with dynamic rewards (either depleting or improving with use).

Assess the baseline (e.g., random) and the optimal performance (when knowing the definition of the environment).

Assess the performance of 2 learning agents (with different parameters) and 2 heuristics (without learning).

### 16.4.4 Maximization vs. melioration

A deep puzzle regarding the adaptive behavior of organisms concerns the mechanism that leads to adaptation and learning. In principle, two alternative mechanisms are possible:

*maximizing*: Selecting the best option (global optimization)*meliorating*: Selecting the better option (local optimization)

In most environments, both mechanisms will be correlated and converge on the same option(s). However, consider the following environment:

There are two buttons/options for

`max`

(for*maximizing*) or`mel`

(for*meliorating*).The reward or value of each option depends on the current system state

`x`

. The value of`x`

is defined as a function of the past \(w=10\) choices. Specifically,`x`

is defined as the proportion of`max`

choices over the \(w=10\) most recent choices.The value of choosing

`max`

or`mel`

for any value of`x`

is shown in Figure 16.12

(on the y-axis, which may denote the magnitude of a reward or the probability of receiving a constant reward).

- The rewards for meliorating (curve
`mel(x)`

, in red) are*always higher*than the rewards for maximizing (curve`max(x)`

, shown in green). As both lines are parallel, the difference is always given by \(B\), regardless of the value of`x`

.

Your tasks:

Implement this environment as a simulation in R.

Implement some simple strategies (e.g., a random strategy, a win-stay/lose-shift strategy, etc.) and evaluate their performance. Which strategies are most successful?

Try implementing a learning strategy. What does the strategy learn?