## 17.2 Models of learning

We first assume a stable environment and a dynamic agent.
The adaptation of such an agent to its environment is typically described as *learning*.

An elementary situation — known as *binary forced choice* — is the following:
An agent faces a choice between several alternative options.
For now, let’s assume that there are only two options and that both options are deterministic and stable: Each option yields a constant reward whenever it is chosen.

- How can the agent learn to choose the better option?

A general solution to the task of learning the best alternative is provided by *reinforcement learning* (RL, Sutton & Barto, 2018).
The framework of RL assumes that intelligent behavior aims to reach goals in some environment.
Organisms learn by exploring their environments in a trial-and-error fashion.
As successful behavior is reinforced, their behavioral patterns are shaped by receiving and monitoring rewards.
Some authors even assume that any form of natural or artificial intelligence (including perception, knowledge acquisition, language, logical reasoning, and social intelligence) can be understood as subserving the maximization of reward (Silver et al., 2021).

Although RL has developed into a core branch of machine learning and artificial intelligence, it is based on a fairly simple theory of classical conditioning (Rescorla & Wagner, 1972). In the following, we will illustrate its basic principles (following Page, 2018, p. 306f.).

#### Basic idea

Assume that an agent is facing a choice between \(N\) options with rewards \(\pi(1), \ldots, \pi(N)\).
The learner’s internal model or perspective on the world is represented as a set of *weights* \(w\) that denote the expected value of or preference for each of the available options (i.e., \(w(1), \ldots, w(N)\)).
Given this representation, successful *learning* consists in adjusting the values of these weights to the (stable) rewards of the environment and choosing options accordingly.

Learning proceeds in a stepwise fashion: In each step, the learner acts (i.e., chooses an option), observes a reward, and then adjusts the weight of the chosen option (i.e., her internal expectations or preferences). Formally, each choice cycle is characterized by two components:

1. *Choosing*: The probability of choosing the \(k\)-th alternative is given by:

\[P(k) = \frac{w(k)}{\sum_{i=1}^{N} w(i)}\]

2. *Learning*: The learner adjusts the weight of the \(k\)-th alternative after choosing it and receiving its reward \(\pi(k)\) by adding the following increment:

\[\Delta w(k) = \alpha \cdot P(k) \cdot (\pi(k) - A)\]

with two parameters describing characteristics of the learner:
\(\alpha > 0\) denoting the *learning rate* (aka. rate of adjustment or step size parameter) and
\(A < \max_{k} \pi(k)\) denoting the *aspiration level*.^{17}

Note that the difference \(\pi(k) - A\) compares the observed reward \(\pi(k)\) to an aspiration level \(A\) and can be characterized as a measure of *surprise*.
The further a reward lies above \(A\), the more surprising it is and the larger the resulting adjustment.
If a reward value exceeds \(A\), \(w(k)\) increases; if it is below \(A\), \(w(k)\) decreases.
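
As a worked illustration of the two rules, consider the following values, which are assumptions chosen for this example: two options with rewards \(\pi(1) = 10\) and \(\pi(2) = 20\), initial weights \(w(1) = w(2) = 50\), a learning rate of \(\alpha = 1\), and an aspiration level of \(A = 5\). If the learner chooses the first option, we obtain:

\[P(1) = \frac{50}{50 + 50} = 0.5, \qquad \Delta w(1) = 1 \cdot 0.5 \cdot (10 - 5) = 2.5, \qquad w(1) = 50 + 2.5 = 52.5\]

If the first option is chosen again, its probability has already risen slightly, so the next increment is somewhat larger:

\[P(1) = \frac{52.5}{52.5 + 50} \approx 0.512, \qquad \Delta w(1) \approx 1 \cdot 0.512 \cdot (10 - 5) \approx 2.56, \qquad w(1) \approx 55.06\]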

For details and historical references, see the Wikipedia pages on the Rescorla-Wagner model, Reinforcement learning, and Q-learning.

#### Coding a model

To create a model that implements a basic learning agent that explores and adapts to a stable environment, we first define some objects and parameters:

```
# Initial setup (see Page, 2018, p. 308):
# Environment:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)
# Agent:
alpha <- 1          # learning rate
A <- 5              # aspiration level
wgt <- c(50, 50)    # initial weights
```

In the current model, the environmental options `alt` and their rewards `rew` are fixed and stable.
Also, the learning rate `alpha` and the aspiration level `A` are assumed to be constants, but the weights `wgt` that represent the agent’s beliefs or expectations about the value of environmental options are parameters that may change after every cycle of choosing an action and observing a reward.

As the heart of our model, we translate the two rules above (1. choosing and 2. learning) into R code:

```
# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}
# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}
# 2. Learning:
delta_w <- function(k){  # Adjusting the weight of k:
  (alpha * p(k) * (r(k) - A))
}
```

Note that the additional function `r(k)` provides the reward obtained from choosing alternative `k`.
Thus, `r(k)` is more a part of the environment than of the agent. Actually, it defines the interface between agent and environment.
As the environmental rewards stored as `rew` are currently assumed to be constant, we could replace this function by writing `rew[k]` in the weight increment given by `delta_w(k)`.
However, as the `r()` function conceptually links the agent’s action (i.e., choosing option `k`) to the rewards provided by the environment `rew` and we will soon generalize this to stochastic environments (in Section 17.3), we already include this function here.^{18}
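
As a quick sanity check, the three functions can be evaluated for the initial state of the model. The following sketch repeats the setup values and function definitions from above, so that it runs on its own; the values in the comments follow directly from the choosing and learning rules:

```r
# Setup and functions repeated from above (so that this sketch is self-contained):
rew   <- c(10, 20)  # constant rewards
alpha <- 1          # learning rate
A     <- 5          # aspiration level
wgt   <- c(50, 50)  # initial weights

p <- function(k){ wgt[k] / sum(wgt) }
r <- function(k){ rew[k] }
delta_w <- function(k){ alpha * p(k) * (r(k) - A) }

p(1)        # 0.5: equal weights imply equal choice probabilities
delta_w(1)  # 2.5 = 1 * 0.5 * (10 - 5)
delta_w(2)  # 7.5 = 1 * 0.5 * (20 - 5)
```

Even with equal weights, choosing the better option yields the larger increment, as its reward exceeds the aspiration level by more.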

With these basic functions in place, we define an iterative loop for `1:n_T` time steps `t` (aka. choice-reward cycles, periods, or rounds) to simulate the iterative process.
In each step, we use `p()` to choose an action and `delta_w()` to evaluate the observed reward and adjust the weight of the chosen option:

```
# Environment:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)
# Agent:
alpha <- 1          # learning rate
A <- 5              # aspiration level
wgt <- c(50, 50)    # initial weights
# Simulation:
n_T <- 20           # time steps/cycles/rounds
for (t in 1:n_T){   # each step/trial: ----
  # (1) Use wgt to determine current action:
  cur_prob <- c(p(1), p(2))
  cur_act <- sample(alt, size = 1, prob = cur_prob)
  ix_act <- which(cur_act == alt)
  # (2) Update wgt (based on reward):
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
  wgt[ix_act] <- new_w  # update wgt
  # (+) User feedback:
  print(paste0(t, ": Choosing ", cur_act, " (", ix_act, "), and ",
               "learning wgt = ", paste(round(wgt, 0), collapse = ":"),
               " (A = ", A, ")"))
}
# Report result:
print(paste0("Final weights (after ", n_T, " steps): ",
             paste(round(wgt, 0), collapse = ":")))
```

Due to the use of `sample()` in selecting the current action `cur_act` (i.e., choosing either the 1st or 2nd option out of `alt`), running this code repeatedly will yield different results.
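
When a particular run needs to be reproduced (e.g., for debugging or for reporting results), the random number generator can be initialized by `set.seed()` before sampling. The following minimal sketch illustrates this for a sequence of random choices; the helper `draw_choices()` and the seed value are hypothetical and not part of the model above:

```r
# Hypothetical helper: draw n choices from two options with fixed probabilities.
draw_choices <- function(n, prob = c(.5, .5)){
  sample(c("A", "B"), size = n, replace = TRUE, prob = prob)
}

set.seed(42)             # arbitrary seed value
run_1 <- draw_choices(10)

set.seed(42)             # same seed: same sequence of random draws
run_2 <- draw_choices(10)

identical(run_1, run_2)  # TRUE
```

Calling `set.seed()` once before the simulation loop would make the entire sequence of choices (and thus the resulting weights) reproducible in the same way.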

The repeated references to the weights `wgt` imply that this vector is a global variable that is initialized once and then accessed and changed at various points in our code (e.g., in `p()` and when updating the weights in the loop).
If this becomes problematic, we could pass `wgt` as an argument to every function.
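
One way of avoiding the global variable is sketched below. The function names `p_arg()`, `delta_w_arg()`, and `update_wgt()` are hypothetical; they mirror `p()` and `delta_w()`, but receive all required values as arguments and return the new weights rather than changing `wgt` in place:

```r
# Variants of p() and delta_w() that receive all inputs as arguments:
p_arg <- function(k, wgt){
  wgt[k] / sum(wgt)
}

delta_w_arg <- function(k, wgt, rew, alpha, A){
  alpha * p_arg(k, wgt) * (rew[k] - A)
}

# One update step that returns new weights (instead of changing a global):
update_wgt <- function(k, wgt, rew, alpha = 1, A = 5){
  wgt[k] <- wgt[k] + delta_w_arg(k, wgt, rew, alpha, A)
  wgt
}

wgt_0 <- c(50, 50)
wgt_1 <- update_wgt(k = 2, wgt = wgt_0, rew = c(10, 20))
wgt_0  # unchanged: 50 50
wgt_1  # 50.0 57.5
```

The price for this transparency is a longer argument list; in a loop, the result would have to be re-assigned on each step (e.g., `wgt <- update_wgt(ix_act, wgt, rew)`).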

Running this model yields the following output:

```
#> [1] "1: Choosing A (1), and learning wgt = 52:50 (A = 5)"
#> [1] "2: Choosing A (1), and learning wgt = 55:50 (A = 5)"
#> [1] "3: Choosing B (2), and learning wgt = 55:57 (A = 5)"
#> [1] "4: Choosing B (2), and learning wgt = 55:65 (A = 5)"
#> [1] "5: Choosing B (2), and learning wgt = 55:73 (A = 5)"
#> [1] "6: Choosing A (1), and learning wgt = 57:73 (A = 5)"
#> [1] "7: Choosing B (2), and learning wgt = 57:81 (A = 5)"
#> [1] "8: Choosing B (2), and learning wgt = 57:90 (A = 5)"
#> [1] "9: Choosing B (2), and learning wgt = 57:99 (A = 5)"
#> [1] "10: Choosing B (2), and learning wgt = 57:109 (A = 5)"
#> [1] "11: Choosing A (1), and learning wgt = 59:109 (A = 5)"
#> [1] "12: Choosing B (2), and learning wgt = 59:119 (A = 5)"
#> [1] "13: Choosing A (1), and learning wgt = 61:119 (A = 5)"
#> [1] "14: Choosing B (2), and learning wgt = 61:128 (A = 5)"
#> [1] "15: Choosing B (2), and learning wgt = 61:139 (A = 5)"
#> [1] "16: Choosing B (2), and learning wgt = 61:149 (A = 5)"
#> [1] "17: Choosing A (1), and learning wgt = 62:149 (A = 5)"
#> [1] "18: Choosing B (2), and learning wgt = 62:160 (A = 5)"
#> [1] "19: Choosing B (2), and learning wgt = 62:170 (A = 5)"
#> [1] "20: Choosing B (2), and learning wgt = 62:181 (A = 5)"
#> [1] "Final weights (after 20 steps): 62:181"
```

The output is printed by the user feedback (i.e., the `print()` statements) that is provided at the end of each loop iteration and after the loop has finished its `n_T` cycles.
Due to using `sample()` to determine the action (or option chosen) on each time step `t`, the results will differ every time the simulation is run.

Although we typically want to collect better records on the process, reading this output already hints at a successful reinforcement learning process:
Due to the changing weights `wgt`, the better alternative (here: Option B) is increasingly preferred and chosen.

#### Practice

- Which variable(s) allow us to see and evaluate that *learning* is taking place? What exactly do these represent (i.e., the agent’s internal state or overt behavior)? Why do we not monitor the values of `cur_prob` in each round?

- Predict what happens when the reward value \(\pi(k)\) equals the aspiration level \(A\) (i.e., \(\pi(k) = A\)) or falls below it (i.e., \(\pi(k) < A\)). Test your predictions by running a corresponding model.

- Playing with parameters:

  - Set `n_T` and `alpha` to various values and observe how this affects the rate and results of learning.

  - Set `rew` to alternative values and observe how this affects the rate and results of learning.

#### Keeping track

To evaluate simulations more systematically, we need to extend our model in two ways:

- First, the code above only ran a single simulation of `n_T` steps. However, due to random fluctuations (e.g., in the results of the `sample()` function), we should not trust the results of any single simulation. Instead, we want to run and evaluate some larger number of simulations to get an idea about the variability and robustness of their results. This can easily be achieved by enclosing our entire simulation within another loop that runs several independent simulations.

- Additionally, our initial simulation (above) used a `print()` statement to provide user feedback in each iteration of the loop. This allowed us to monitor the `n_T` time steps of our simulation and evaluate the learning process. However, displaying results as the model code is being executed becomes impractical when simulations get larger and run for longer periods of time. Thus, we need to collect required performance measures by setting up and using corresponding data structures as the simulation runs.

The following code takes care of both concerns:
First, we embed our previous simulation within an outer loop for running a sequence of `n_S` simulations.
Additionally, we prepare an auxiliary data frame `data` that allows recording the agent’s action and weights on each step (or choice-reward cycle) as we go along.

Note that working with two loops complicates setting up `data` (as it needs a total of `n_S * n_T` rows, as well as columns for the current values of `s` and `t`) and the indices for updating the current row of `data`.
Also, using an outer loop that defines distinct simulations creates *two* possible levels for initializing information.
In the present case, some information (e.g., the number of simulations, the number of steps per simulation, the available options, and their constant rewards) is initialized only once (globally).
By contrast, we chose to initialize the parameters of the learner in every simulation (although we currently only update the weights `wgt` in each step).
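
The row index used for recording results can be checked in isolation. The following sketch uses small illustrative values for `n_S` and `n_T` (not those of the simulation) and a hypothetical helper `row_ix()` to verify that the expression `((s - 1) * n_T) + t` addresses every row of `data` exactly once and in order:

```r
n_S <- 3  # small illustrative values
n_T <- 4

# Row index for simulation s and time step t (hypothetical helper):
row_ix <- function(s, t, n_T){ ((s - 1) * n_T) + t }

ix <- c()
for (s in 1:n_S){
  for (t in 1:n_T){
    ix <- c(ix, row_ix(s, t, n_T))
  }
}

ix                        # 1 2 3 4 5 6 7 8 9 10 11 12
all(ix == 1:(n_S * n_T))  # TRUE: each row is addressed exactly once
```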

The resulting simulation code is as follows:

```
# Simulation:
n_S <- 12  # Number of simulations
n_T <- 20  # Number of time steps/cycles/rounds/trials (per simulation)
# Environmental constants:
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)
# Prepare data structure for storing results:
data <- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = n_S * n_T))
names(data) <- c("s", "t", "act", paste0("w_", alt))
for (s in 1:n_S){  # each simulation: ----
  # Initialize agent:
  alpha <- 1        # learning rate
  A <- 5            # aspiration level
  wgt <- c(50, 50)  # initial weights
  for (t in 1:n_T){  # each step/trial: ----
    # (1) Use wgt to determine current action:
    cur_prob <- c(p(1), p(2))
    cur_act <- sample(alt, size = 1, prob = cur_prob)
    ix_act <- which(cur_act == alt)
    # (2) Update wgt (based on reward):
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w  # update wgt
    # (+) Record results:
    data[((s-1) * n_T) + t, ] <- c(s, t, ix_act, wgt)
  }  # for t:n_T end.
  print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ",
               paste(round(wgt, 0), collapse = ":")))
}  # for s:n_S end.
#> [1] "s = 1: Ran 20 steps, wgt = 66:167"
#> [1] "s = 2: Ran 20 steps, wgt = 68:156"
#> [1] "s = 3: Ran 20 steps, wgt = 61:184"
#> [1] "s = 4: Ran 20 steps, wgt = 83:96"
#> [1] "s = 5: Ran 20 steps, wgt = 63:180"
#> [1] "s = 6: Ran 20 steps, wgt = 63:171"
#> [1] "s = 7: Ran 20 steps, wgt = 67:156"
#> [1] "s = 8: Ran 20 steps, wgt = 57:209"
#> [1] "s = 9: Ran 20 steps, wgt = 61:192"
#> [1] "s = 10: Ran 20 steps, wgt = 62:173"
#> [1] "s = 11: Ran 20 steps, wgt = 61:183"
#> [1] "s = 12: Ran 20 steps, wgt = 73:126"
# Report result:
print(paste0("Finished running ", n_S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"
```

The user feedback within the inner loop was now replaced by storing the current values of parameters of interest into a row of `data`. Thus, `data` collected information on all intermediate states:

| s | t | act | w_A | w_B |
|---|---|---|---|---|
| 1 | 1 | 1 | 52.50000 | 50.00000 |
| 1 | 2 | 1 | 55.06098 | 50.00000 |
| 1 | 3 | 1 | 57.68141 | 50.00000 |
| 1 | 4 | 2 | 57.68141 | 56.96499 |
| 1 | 5 | 2 | 57.68141 | 64.41812 |
| 1 | 6 | 1 | 60.04347 | 64.41812 |

Note that both the construction of our model and the selection of variables stored in `data` determine the scope of results that we can examine later. For instance, the current model did not explicitly represent the reward values received on every trial. As they were constant for each option, we did not need to know them here, but this may change if the environment became more dynamic (see Section 17.3). Similarly, we chose not to store the current value of the agent’s aspiration level \(A\) for every trial. However, if \(A\) ever changed within a simulation, we may want to store a record of its values in `data`.
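
Once results are stored in a data frame, they can also be summarized numerically. The following sketch is one possible way of doing so: It re-runs a compact version of the simulation with the same parameter values (with the choosing and learning rules written inline and an arbitrary seed for reproducibility) and then uses base R functions to compute the proportion of simulations choosing Option B on each time step, as well as the mean weights at the final step. The names `p_B` and `prop_B` are assumptions of this example:

```r
set.seed(1)  # arbitrary seed (only to make this sketch reproducible)

# Compact re-run of the simulation (same parameter values as above):
alt <- c("A", "B"); rew <- c(10, 20)
n_S <- 12; n_T <- 20
alpha <- 1; A <- 5

data <- as.data.frame(matrix(NA, nrow = n_S * n_T, ncol = 5))
names(data) <- c("s", "t", "act", "w_A", "w_B")

for (s in 1:n_S){
  wgt <- c(50, 50)  # initial weights
  for (t in 1:n_T){
    cur_prob <- wgt / sum(wgt)                         # 1. choosing
    ix_act <- sample(1:2, size = 1, prob = cur_prob)
    wgt[ix_act] <- wgt[ix_act] +
      alpha * cur_prob[ix_act] * (rew[ix_act] - A)     # 2. learning
    data[((s - 1) * n_T) + t, ] <- c(s, t, ix_act, wgt)
  }
}

# Proportion of simulations choosing Option B (act == 2) per time step:
p_B <- aggregate(act ~ t, data = data, FUN = function(x) mean(x == 2))
names(p_B)[2] <- "prop_B"
head(p_B)

# Mean weights at the final time step:
colMeans(data[data$t == n_T, c("w_A", "w_B")])
```

If learning takes place, the values of `prop_B` should tend to increase over time steps, and the mean final weight of Option B should exceed that of Option A.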

#### Visualizing results

The information stored in `data` allows us to visualize the learning process and progress. Figure 17.2 shows which option was chosen in each time step and simulation:

```
# Visualize results:
library(ggplot2)  # ggplot() and geoms
library(unikn)    # usecol() and the colors Grau, Bordeaux, Seegruen
library(ds4psy)   # theme_ds4psy()
ggplot(data, aes(x = t)) +
  facet_wrap(~s) +
  geom_path(aes(y = act), col = Grau) +
  geom_point(aes(y = act, col = factor(act)), size = 2) +
  scale_color_manual(values = usecol(c(Bordeaux, Seegruen))) +
  scale_y_continuous(breaks = 1:2, labels = alt) +
  labs(title = paste0("Agent actions (choices) in ", n_S, " simulations"),
       x = "Time steps", y = "Action", color = "Option:") +
  theme_ds4psy()
```

As Option B yields higher rewards than Option A, learning would be reflected in Figure 17.2 by an increasing preference for B over A. Figure 17.3 shows the trends in option weights per time step for all 12 simulations:

```
ggplot(data, aes(x = t, group = s)) +
  geom_path(aes(y = w_A), size = .5, col = usecol(Bordeaux, alpha = .5)) +
  geom_path(aes(y = w_B), size = .5, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Agent weights (expectations/preferences) in ", n_S, " simulations"),
       x = "Time steps", y = "Weights") +
  theme_ds4psy()
```

The systematic difference in trends can be emphasized by showing their averages (Figure 17.4):

```
ggplot(data, aes(x = t)) +
  geom_smooth(aes(y = w_A), size = 1, col = usecol(Bordeaux, alpha = .5)) +
  geom_smooth(aes(y = w_B), size = 1, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Trends in (average) agent weights in ", n_S, " simulations"),
       x = "Time steps", y = "Weights", col = "Option:") +
  theme_ds4psy()
```

#### Interpretation

- The visualization of agent actions (in Figure 17.2) shows that most agents gradually chose the superior Option B more frequently. Thus, learning took place and manifested itself in a systematic trend in the choice behavior of the agent (particularly in later trials `t`).

- The visualization of agent weights (in Figure 17.3) illustrates that most learners preferred the superior Option B (shown in green) within about 5–10 trials of the simulation.

- The systematic preference for Option B over Option A (from Trial 5 onward) is further emphasized by comparing average trends (in Figure 17.4).

When interpreting the results of models, we must not forget that these visualizations show different facets of the same process.
As the values of the weights `wgt` determine the probabilities of choosing each option (via `p()`), there is a logical link between the concepts illustrated by these visualizations:
The behavioral choices shown in Figure 17.2 are a consequence of the weights shown in Figures 17.3 and 17.4, with some added noise due to random sampling.
Thus, the three visualizations show how learning manifests itself on different conceptual and behavioral levels.
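
The link between weights and behavior can be made explicit by converting a set of weights into choice probabilities. As a small illustration, the following sketch uses the (rounded) final weights `62:181` reported for the single simulation above; the vector name `wgt_final` is an assumption of this example:

```r
# Final weights of the single simulation reported above (rounded):
wgt_final <- c(A = 62, B = 181)

# Choice probabilities implied by these weights (as computed by p()):
prob_final <- wgt_final / sum(wgt_final)
round(prob_final, 3)  # A: 0.255, B: 0.745
```

Thus, even after 20 steps of learning, this agent would still choose the inferior Option A on about one out of four occasions.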

#### Practice

As both options always perform above the constant aspiration level (of \(A = 5\)), the weight values `wgt` could only increase when experiencing rewards (as \(\Delta w > 0\)).

- What would happen if the aspiration level was set to a value between both options (e.g., \(A = 15\))?

- What would happen if the aspiration level was set to the value of the lower option (e.g., \(A = 10\))?

#### Boundary conditions

Reflecting on the consequences of different aspiration values highlights some boundary conditions that need to hold for learning to take place.
We can easily see the impact of an aspiration level \(A\) matching the value of a current reward \(r(k)\).

Due to a lack of surprise (i.e., no difference \(r(k) - A = 0\)), the value of \(\Delta w(k)\) would become 0 as well.
And if the value of \(A\) exceeds \(r(k)\), the difference \(r(k) - A\) and \(\Delta w(k)\) are negative, which leads to a reduction of the corresponding weight.
If option weights were not bounded to be \(\geq 0\), this could create a problem for their conversion into a probability by \(p(k)\). However, as we chose a value \(A\) that was smaller than the smallest reward (\(A < \min_{k} \pi(k)\)), this problem could not occur here.

To prevent all weights from converging to zero, the value of the aspiration level \(A\) must generally be lower than the reward of at least one option (Page, 2018, p. 307). Under this condition, the two functions specified above (i.e., `p()` and `delta_w()`) will eventually place almost all weight on the best alternative (as its weight always increases by the most). And using these weights for choosing alternatives implies that, in the long run, the best alternative will be selected almost exclusively.
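
The effect of different aspiration levels on the sign of the weight increments can be illustrated by a small computation. The following sketch assumes the initial state of the model (rewards of 10 and 20, equal weights of 50, and a learning rate of 1) and uses a hypothetical helper `delta_for_A()` that returns the increments of both options for a given value of \(A\):

```r
rew   <- c(10, 20)  # constant rewards
wgt   <- c(50, 50)  # initial weights
alpha <- 1          # learning rate

# Weight increments of both options for an aspiration level A
# (hypothetical helper, vectorized over options):
delta_for_A <- function(A){
  alpha * (wgt / sum(wgt)) * (rew - A)
}

A_values <- c(5, 10, 15, 25)
deltas <- sapply(A_values, delta_for_A)
rownames(deltas) <- c("opt_A", "opt_B")
colnames(deltas) <- paste0("A=", A_values)
deltas
# A =  5:  2.5 and  7.5 (both weights grow)
# A = 10:  0.0 and  5.0 (no surprise for Option A)
# A = 15: -2.5 and  2.5 (Option A shrinks, Option B grows)
# A = 25: -7.5 and -2.5 (both weights shrink)
```

Only the last case violates the condition that \(A\) must lie below the reward of at least one option: As both increments are negative, repeated updates would drive both weights towards zero (and possibly below it).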

#### Adapting aspirations

The basic framework shown here can be extended in many directions. For instance, the following example is provided by Page (2018, p. 308). We consider two changes:

- *Endogenous aspirations*: Rather than using a constant aspiration level \(A\), a learner could adjust the value of \(A\) based on experience by setting it equal to the average of the rewards received so far.

- *Fixed choice sequence*: Rather than randomly sampling the chosen option (according to Equation 1, i.e., the probabilities of `p()`), we may want to evaluate the effects of a fixed sequence of actions (e.g., choosing `B, B, A, B`).

Given this setup, the question “What is being learned?” can only refer to the adjusted weights `wgt` (as the sequence of actions is predetermined). However, when moving away from a constant aspiration level, we also want to monitor the values of our adjusted aspiration level \(A_{t}\).

These changes can easily be implemented as follows:^{19}

```
# 1. Choosing:
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}
# Reward from choosing k:
r <- function(k){
  rew[k]  # rew denotes current rewards
}
# 2. REDUCED learning rule:
delta_w <- function(k){  # Adjusting the weight of k:
  # (alpha * p(k) * (r(k) - A))
  (alpha * 1 * (r(k) - A))  # HACK: seems to be assumed in Page (2018), p. 308
}
```

```
# Environment:
alt <- c("A", "B")  # constant options
rew <- c(20, 10)    # constant rewards (A > B)
wgt <- c(50, 50)    # initial weights
actions <- c("B", "B", "A", "B")
# actions <- rep("A", 10)
# actions <- rep(c("A", "B"), 50)
# Agent:
alpha <- 1  # learning rate
A <- 5      # initial aspiration level
# Simulation:
n_T <- length(actions)  # time steps/rounds
r_hist <- rep(NA, n_T)  # history of rewards
for (t in 1:n_T){
  # (1) Use wgt to determine current action:
  # cur_prob <- c(p(1), p(2))
  # cur_prob <- c(1, 1)  # HACK: seems to be assumed in Page (2018), p. 308
  # cur_act <- sample(alt, size = 1, prob = cur_prob)
  cur_act <- actions[t]
  ix_act <- which(cur_act == alt)
  # (2) Update wgt (based on reward):
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
  wgt[ix_act] <- new_w  # update wgt
  # (3) Determine reward and adjust aspiration level:
  r_hist[t] <- r(ix_act)  # store reward value in history
  A <- mean(r_hist, na.rm = TRUE)  # adapt A
  # (+) User feedback:
  print(paste0(t, ": ", cur_act, " (", ix_act, "), ",
               "wgt = ", paste(round(wgt, 0), collapse = ":"),
               ", A(", t, ") = ", round(A, 1)))
}
#> [1] "1: B (2), wgt = 50:55, A(1) = 10"
#> [1] "2: B (2), wgt = 50:55, A(2) = 10"
#> [1] "3: A (1), wgt = 60:55, A(3) = 13.3"
#> [1] "4: B (2), wgt = 60:52, A(4) = 12.5"
```
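
The aspiration levels printed above can be cross-checked without running the loop. As the adjusted aspiration level is the mean of all rewards received so far, it corresponds to a running mean over the sequence of rewards. The following sketch computes this in a vectorized fashion (the names `r_seq` and `A_t` are assumptions of this example):

```r
rew     <- c(A = 20, B = 10)      # rewards (as in the example above)
actions <- c("B", "B", "A", "B")  # fixed choice sequence

r_seq <- unname(rew[actions])     # rewards received: 10 10 20 10
A_t   <- cumsum(r_seq) / seq_along(r_seq)  # running mean after each step
round(A_t, 1)                     # 10.0 10.0 13.3 12.5
```

Note that the weight update on step `t` uses the aspiration level that resulted from the previous steps (or the initial value of 5 on the first step), as `A` is only adjusted after the weights have been updated.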

#### Practice

Let’s further reflect on the consequences of our endogenous aspiration level \(A_{t}\):

- How would \(A_{t}\) change when an agent chose one option 10 times in a row?

- How would \(A_{t}\) change when an agent alternated between Option A and Option B 50 times in a row (i.e., for a total of 100 trials)?

#### Limitations

Given the simplicity of our basic learning paradigm, we should note some limitations:

Learning only from experience: As our model only learns from experienced options, it lacks the ability to learn from counterfactual information.

Individual learning: The model here only considers an individual agent. Thus, it does not cover social situations and the corresponding phenomena of social information, social influence, and social learning.

The model’s level of analysis is flexible or unclear: Some models may aim to represent the actual mechanism (implementation, e.g., in neuronal structures), while others may be content to provide an abstract description of required components (e.g., a trade-off between hypothetical parts or processes).

Most of these limitations can be addressed by adding new elements to our basic learning paradigm. However, some limitations could also be interpreted as strengths: The simple, abstract and explicit nature of the model makes it easy to test and verify the necessary and sufficient conditions under which learning can take place. This higher degree of identifiability is the hallmark of formal models and distinguishes them from a verbal theory.

### References

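Page, S. E. (2018). *The model thinker: What you need to know to make data work for you*. Basic Books.

Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), *Classical conditioning II: Current research and theory* (pp. 64–99). Appleton-Century-Crofts.

Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. *Artificial Intelligence*, 103535. https://doi.org/10.1016/j.artint.2021.103535

Sutton, R. S., & Barto, A. G. (2018). *Reinforcement learning: An introduction* (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book-2nd.html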

^{17} Page (2018) uses the Greek letter \(\gamma\) as the learning rate. But as some learning models also use \(\gamma\) as a discount factor that determines the importance of future rewards, we denote the learning rate by the more common letter \(\alpha\) here.

^{18} As `r()` is called by `delta_w()`, this creates a functional dependency in our model that renders information implicit (here: the value of the reward received on every trial). As we will see, models often need to balance concerns for elegance and practicality.

^{19} We use a reduced learning rule (which replaces \(p(k)\) by a constant of 1) to obtain the same values as in Page (2018, p. 308).