16.2 Models of learning

We first assume a stable environment and a dynamic agent. Agents that adapt to their environment are typically described as learning.

An elementary situation, known as binary forced choice, is the following: An agent faces a choice between alternative options. For now, let’s assume that there are only two options and that both are deterministic and stable: Each option yields a constant reward whenever it is chosen.

  • How can the agent learn to choose the better option?

A general solution to the task of learning the best alternative is provided by reinforcement learning (RL, Sutton & Barto, 2018). The framework of RL assumes that intelligent behavior aims to reach goals in some environment. Organisms learn by exploring their environments in a trial-and-error fashion. As successful behavior is reinforced, their behavioral patterns are shaped by receiving and monitoring rewards. Some authors even assume that any form of natural or artificial intelligence (including perception, knowledge acquisition, language, logical reasoning, and social intelligence) can be understood as subserving the maximization of reward (Silver et al., 2021).

Although RL has developed into a core branch of machine learning and artificial intelligence, it is based on a fairly simple theory of classical conditioning (Rescorla & Wagner, 1972). In the following, we will illustrate its basic principles (following Page, 2018, p. 306f.).

Basic idea

Assume that an agent is facing a choice between \(N\) options with rewards \(\pi(1) ... \pi(N)\). The learner’s internal model or perspective on the world is represented as a set of weights \(w\) that denote the expected value of or preference for each of the available options (i.e., \(w(1) ... w(N)\)). Given this representation, successful learning consists in adjusting the values of these weights to the (stable) rewards of the environment and choosing options accordingly.

Learning proceeds in a stepwise fashion: In each step, the learner acts (i.e., chooses an option), observes a reward, and then adjusts the weight of the chosen option (i.e., her internal expectations or preferences). Formally, each choice cycle is characterized by two components:

  1. Choosing: The probability of choosing the \(k\)-th alternative is given by:

\[P(k) = \frac{w(k)}{\sum_{i=1}^{N} w(i)}\]

  2. Learning: The learner adjusts the weight of the \(k\)-th alternative after choosing it and receiving its reward \(\pi(k)\) by adding the following increment:

\[\Delta w(k) = \alpha \cdot P(k) \cdot [\pi(k) - A]\]

with two parameters describing characteristics of the learner: \(\alpha > 0\) denoting the learning rate (a.k.a. rate of adjustment or step-size parameter) and \(A < \max_{k} \pi(k)\) denoting the aspiration level.[18]
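As a worked example of one choice cycle, assume the values used in the code of this section (\(w(1) = w(2) = 50\), \(\alpha = 1\), \(A = 5\), \(\pi(1) = 10\)) and suppose that the agent happens to choose Option 1 first:

\[P(1) = \frac{50}{50 + 50} = 0.5, \qquad \Delta w(1) = 1 \cdot 0.5 \cdot [10 - 5] = 2.5, \qquad w(1) \leftarrow 50 + 2.5 = 52.5\]

The weight of the option that was not chosen stays at 50, so the probability of choosing Option 1 again rises slightly (to \(52.5/102.5 \approx 0.51\)).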

Note the following details of the learning rule:

  • The difference \([\pi(k) - A]\) is the reason for the \(\Delta\) symbol, which commonly denotes differences or deviations. Here, we compare the observed reward \(\pi(k)\) to an aspiration level \(A\). Their difference can be characterized as a measure of surprise: When a reward \(\pi(k)\) matches our aspiration level \(A\), we are not surprised (i.e., \(\Delta w(k) = 0\)). By contrast, rewards much smaller or much larger than our aspirations are more surprising and lead to larger adjustments. If a reward exceeds \(A\), \(w(k)\) increases; if it is below \(A\), \(w(k)\) decreases.

  • The \(\Delta w(k)\)-increment specified by Page (2018) includes the current probability \(P(k)\) of choosing the \(k\)-th option as a weighting factor, but alternative formulations of \(\Delta\)-rules omit this factor or introduce additional parameters.
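The two points above can be checked numerically. The following is a minimal sketch, not part of the model code: the helper `weight_increment()` and its argument `use_prob` are hypothetical names, and setting `use_prob = FALSE` stands in for formulations that omit the factor \(P(k)\).

```r
# Hypothetical helper (illustration only): the weight increment as a
# pure function of its inputs.
weight_increment <- function(w, k, reward, alpha = 1, A = 5, use_prob = TRUE) {
  p_k <- if (use_prob) w[k] / sum(w) else 1  # P(k), or 1 if the factor is omitted
  alpha * p_k * (reward - A)
}

w <- c(50, 50)
weight_increment(w, k = 2, reward = 20)                    # reward above A:  7.5
weight_increment(w, k = 2, reward = 5)                     # reward equals A: 0
weight_increment(w, k = 2, reward = 2)                     # reward below A: -1.5
weight_increment(w, k = 2, reward = 20, use_prob = FALSE)  # without P(k):   15
```

The sign of the increment depends only on the comparison of reward and aspiration level, whereas \(\alpha\) and \(P(k)\) merely scale its size.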

For details and historical references, see the Wikipedia pages on the Rescorla-Wagner model, reinforcement learning, and Q-learning. See Sutton & Barto (2018) for a general introduction to reinforcement learning.

Coding a model

To create a model that implements a basic learning agent that explores and adapts to a stable environment, we first define some objects and parameters:

# Initial setup (see Page, 2018, p. 308):

# Environment: 
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Agent: 
alpha <- 1          # learning rate
A     <- 5          # aspiration level
wgt   <- c(50, 50)  # initial weights 

In the current model, the environmental options alt and their rewards rew are fixed and stable. Also, the learning rate alpha and the aspiration level A are assumed to be constants. By contrast, the weights wgt, which represent the agent’s beliefs or expectations about the value of environmental options, may change after every cycle of choosing an action and observing a reward.

As the heart of our model, we translate functions 1. and 2. into R code:

# 1. Choosing an option: ------ 

p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing Option k: ------  

r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. Learning: ------ 

delta_w <- function(k){  # adjusting the weight of k: 
  (alpha * p(k) * (r(k) - A))
}

Note that the additional function r(k) provides the reward obtained from choosing alternative k. Thus, r(k) is more a part of the environment than of the agent. Actually, it defines the interface between agent and environment. As the environmental rewards stored as rew are currently assumed to be constant, we could replace this function by writing rew[k] in the weight increment given by delta_w(k). However, as the r() function conceptually links the agent’s action (i.e., choosing option k) to the rewards provided by the environment rew and we will soon generalize this to stochastic environments (in Section 16.3), we already include this function here.[19]

With these basic functions in place, we define a loop over 1:n_T time steps t (a.k.a. choice-reward cycles, periods, or rounds) to simulate the iterative process. In each step, we use p() to choose an action and delta_w() to adjust the weight of the chosen option based on the observed reward:

# Environment: 
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Agent: 
alpha <- 1          # learning rate
A     <- 5          # aspiration level
wgt   <- c(50, 50)  # initial weights 

# Simulation:
n_T <- 20  # time steps/cycles/rounds

for (t in 1:n_T){  # each step/trial: ---- 

  # (1) Use wgt to determine the current action: 
  cur_prob <- c(p(1), p(2))
  cur_act  <- sample(alt, size = 1, prob = cur_prob)
  ix_act   <- which(cur_act == alt)
  # (2) Update wgt (based on action & reward): 
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight by delta w
  wgt[ix_act] <- new_w  # update wgt
  # (+) User feedback:
  print(paste0(t, ": Choose ", cur_act, " (", ix_act, "), and ",
              "learn wgt = ", paste(round(wgt, 0), collapse = ":"), 
              " (A = ", A, ")"))
}

# Report result:
print(paste0("Final weight values (after ", n_T, " steps): ", 
             paste(round(wgt, 0), collapse = ":")))

Due to the use of sample() in selecting the current action cur_act (i.e., choosing either the 1st or 2nd option out of alt), running this code repeatedly will yield different results.

The repeated references to the weights wgt imply that this vector is a global variable that is initialized once and then accessed and changed at various points in our code (e.g., in p() and when updating the weights in the loop). If this becomes problematic, we could pass wgt as an argument to every function.
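One way of avoiding global state could look as follows. This is a minimal sketch under stated assumptions: the names `p_fn()`, `delta_w_fn()`, and `run_sim()` are hypothetical and not used elsewhere in this chapter, and the seed value is arbitrary. Passing all inputs as arguments and fixing the random seed via `set.seed()` also makes the otherwise varying results of `sample()` reproducible.

```r
# Sketch: the same model without global variables (all names hypothetical).
p_fn <- function(k, wgt) {
  wgt[k] / sum(wgt)  # probability of choosing option k
}

delta_w_fn <- function(k, wgt, rew, alpha, A) {
  alpha * p_fn(k, wgt) * (rew[k] - A)  # weight increment for option k
}

run_sim <- function(n_T, wgt = c(50, 50), rew = c(10, 20), alpha = 1, A = 5) {
  for (t in seq_len(n_T)) {
    k <- sample(seq_along(wgt), size = 1, prob = wgt / sum(wgt))  # choose
    wgt[k] <- wgt[k] + delta_w_fn(k, wgt, rew, alpha, A)           # learn
  }
  wgt  # return the final weights
}

set.seed(123)  # arbitrary seed, fixes the sequence of random choices
run_1 <- run_sim(n_T = 20)
set.seed(123)
run_2 <- run_sim(n_T = 20)
identical(run_1, run_2)  # TRUE: same seed, same choices, same weights
```

As `wgt` is now a local copy inside `run_sim()`, repeated runs cannot interfere with each other, at the cost of slightly longer function signatures.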

Running this model yields the following output:

#> [1] "1: Choose A (1), and learn wgt = 52:50 (A = 5)"
#> [1] "2: Choose A (1), and learn wgt = 55:50 (A = 5)"
#> [1] "3: Choose B (2), and learn wgt = 55:57 (A = 5)"
#> [1] "4: Choose B (2), and learn wgt = 55:65 (A = 5)"
#> [1] "5: Choose B (2), and learn wgt = 55:73 (A = 5)"
#> [1] "6: Choose A (1), and learn wgt = 57:73 (A = 5)"
#> [1] "7: Choose B (2), and learn wgt = 57:81 (A = 5)"
#> [1] "8: Choose B (2), and learn wgt = 57:90 (A = 5)"
#> [1] "9: Choose B (2), and learn wgt = 57:99 (A = 5)"
#> [1] "10: Choose B (2), and learn wgt = 57:109 (A = 5)"
#> [1] "11: Choose A (1), and learn wgt = 59:109 (A = 5)"
#> [1] "12: Choose B (2), and learn wgt = 59:119 (A = 5)"
#> [1] "13: Choose A (1), and learn wgt = 61:119 (A = 5)"
#> [1] "14: Choose B (2), and learn wgt = 61:128 (A = 5)"
#> [1] "15: Choose B (2), and learn wgt = 61:139 (A = 5)"
#> [1] "16: Choose B (2), and learn wgt = 61:149 (A = 5)"
#> [1] "17: Choose A (1), and learn wgt = 62:149 (A = 5)"
#> [1] "18: Choose B (2), and learn wgt = 62:160 (A = 5)"
#> [1] "19: Choose B (2), and learn wgt = 62:170 (A = 5)"
#> [1] "20: Choose B (2), and learn wgt = 62:181 (A = 5)"
#> [1] "Final weight values (after 20 steps): 62:181"

The output is produced by the print() statements at the end of each loop iteration and after the loop has finished its n_T cycles. Although we typically want to collect better records of the process, reading this output already hints at a successful reinforcement learning process: As the weights wgt change, the better alternative (here: Option B) is increasingly preferred and chosen.
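To see what the final weights of this particular run imply for behavior, we can insert the rounded values printed above (62:181) into the choice rule. The resulting probabilities are therefore approximate:

\[P(\text{Option A}) \approx \frac{62}{62 + 181} \approx 0.26, \qquad P(\text{Option B}) \approx \frac{181}{62 + 181} \approx 0.74\]

After 20 steps, the agent of this run would thus choose the better option about three times out of four, compared to one time out of two at the start.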


  1. Which variable(s) allow us to see and evaluate that learning is taking place? What exactly do these represent (i.e., the agent’s internal state or overt behavior)? Why do we not monitor the values of cur_prob in each round?

  2. Predict what happens when the reward value \(\pi(k)\) equals the aspiration level \(A\) (i.e., \(\pi(k) = A\)) or falls below it (i.e., \(\pi(k) < A\)). Test your predictions by running a corresponding model.

  3. Playing with parameters:

    • Set n_T and alpha to various values and observe how this affects the rate and results of learning.

    • Set rew to alternative values and observe how this affects the rate and results of learning.

Keeping track

To evaluate simulations more systematically, we need to extend our model in two ways:

  1. First, the code above only ran a single simulation of n_T steps. However, due to random fluctuations (e.g., in the results of the sample() function), we should not trust the results of any single simulation. Instead, we want to run and evaluate some larger number of simulations to get an idea about the variability and robustness of their results. This can easily be achieved by enclosing our entire simulation within another loop that runs several independent simulations.

  2. Additionally, our initial simulation (above) used a print() statement to provide user feedback in each iteration of the loop. This allowed us to monitor the n_T time steps of our simulation and evaluate the learning process. However, displaying results as the model code is being executed becomes impractical when simulations get larger and run for longer periods of time. Thus, we need to collect required performance measures by setting up and using corresponding data structures as the simulation runs.

The following code takes care of both concerns: First, we embed our previous simulation within an outer loop for running a sequence of n_S simulations. Additionally, we prepare an auxiliary data frame data that allows recording the agent’s action and weights on each step (or choice-reward cycle) as we go along.

Note that working with two loops complicates setting up data (as it needs a total of n_S * n_T rows, as well as columns for the current values of s and t) and the indices for updating the current row of data. Also, using an outer loop that defines distinct simulations creates two possible levels for initializing information. In the present case, some information (e.g., the number of simulations, the number of steps per simulation, the available options, and their constant rewards) is initialized only once (globally). By contrast, we chose to initialize the parameters of the learner in every simulation (although we currently only update the weights wgt in each step).
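The row index that maps a simulation s and a time step t to a single row of data is ((s - 1) * n_T) + t. A quick check with small, purely illustrative values of n_S and n_T (the object `grid` is a hypothetical helper) suggests that this formula visits every row exactly once and in order:

```r
n_S <- 3  # small values, for illustration only
n_T <- 4

# All (s, t) pairs in the order in which two nested loops visit them
# (t varies fastest, s slowest):
grid <- expand.grid(t = 1:n_T, s = 1:n_S)[, c("s", "t")]
grid$row <- ((grid$s - 1) * n_T) + grid$t

all(grid$row == 1:(n_S * n_T))  # TRUE: each step maps to exactly one row
```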

The resulting simulation code is as follows:

# Simulation:
n_S <- 12  # Number of simulations     
n_T <- 20  # Number of time steps/cycles/rounds/trials (per simulation)

# Environmental constants: 
alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards (A < B)

# Prepare data structure for storing results: 
data <- as.data.frame(matrix(ncol = (3 + length(alt)), nrow = n_S * n_T))
names(data) <- c("s", "t", "act", paste0("w_", alt))

for (s in 1:n_S){     # each simulation: ---- 
  # Initialize agent: 
  alpha <- 1        # learning rate
  A <- 5            # aspiration level
  wgt <- c(50, 50)  # initial weights 
  for (t in 1:n_T){   # each step/trial: ---- 
    # (1) Use wgt to determine current action: 
    cur_prob <- c(p(1), p(2))
    cur_act <- sample(alt, size = 1, prob = cur_prob)
    ix_act <- which(cur_act == alt)
    # (2) Update wgt (based on reward): 
    new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
    wgt[ix_act] <- new_w  # update wgt
    # (+) Record results:
    data[((s-1) * n_T) + t, ] <- c(s, t, ix_act, wgt)
  } # for t:n_T end.
  print(paste0("s = ", s, ": Ran ", n_T, " steps, wgt = ", 
               paste(round(wgt, 0), collapse = ":")))
} # for s:n_S end.
#> [1] "s = 1: Ran 20 steps, wgt = 66:167"
#> [1] "s = 2: Ran 20 steps, wgt = 68:156"
#> [1] "s = 3: Ran 20 steps, wgt = 61:184"
#> [1] "s = 4: Ran 20 steps, wgt = 83:96"
#> [1] "s = 5: Ran 20 steps, wgt = 63:180"
#> [1] "s = 6: Ran 20 steps, wgt = 63:171"
#> [1] "s = 7: Ran 20 steps, wgt = 67:156"
#> [1] "s = 8: Ran 20 steps, wgt = 57:209"
#> [1] "s = 9: Ran 20 steps, wgt = 61:192"
#> [1] "s = 10: Ran 20 steps, wgt = 62:173"
#> [1] "s = 11: Ran 20 steps, wgt = 61:183"
#> [1] "s = 12: Ran 20 steps, wgt = 73:126"

# Report result:
print(paste0("Finished running ", n_S, " simulations (see 'data' for results)"))
#> [1] "Finished running 12 simulations (see 'data' for results)"

The user feedback within the inner loop was now replaced by storing the current values of parameters of interest into a row of data. Thus, data collected information on all intermediate states:

Table 16.1: A record of simulation states and results.
s  t  act  w_A       w_B
1  1  1    52.50000  50.00000
1  2  1    55.06098  50.00000
1  3  1    57.68141  50.00000
1  4  2    57.68141  56.96499
1  5  2    57.68141  64.41812
1  6  1    60.04347  64.41812
Note that both the construction of our model and the selection of variables stored in data determine the scope of results that we can examine later. For instance, the current model did not explicitly represent the reward values received on every trial. As they were constant for each option, we did not need to know them here, but this may change if the environment became more dynamic (see Section 16.3). Similarly, we chose not to store the current value of the agent’s aspiration level \(A\) for every trial. However, if \(A\) ever changed within a simulation, we may want to store a record of its values in data.
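If we wanted to keep such information, the record could be extended by additional columns. The following is a minimal sketch for a single short simulation: the data frame `rec`, its columns `rew` and `A`, and the variable `cur_rew` are hypothetical additions, and the seed value is arbitrary.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

alt <- c("A", "B")  # constant options
rew <- c(10, 20)    # constant rewards
alpha <- 1; A <- 5; wgt <- c(50, 50)
n_T <- 5

# One row per step; the columns "rew" and "A" are the hypothetical additions:
rec <- data.frame(t = 1:n_T, act = NA, rew = NA, A = NA, w_A = NA, w_B = NA)

for (t in 1:n_T) {
  ix_act  <- sample(seq_along(alt), size = 1, prob = wgt / sum(wgt))
  cur_rew <- rew[ix_act]  # reward actually received on this step
  wgt[ix_act] <- wgt[ix_act] + alpha * (wgt[ix_act] / sum(wgt)) * (cur_rew - A)
  rec[t, ] <- c(t, ix_act, cur_rew, A, wgt)  # record the state after learning
}

rec
```

In a stable environment with a constant aspiration level, the two extra columns are redundant, but they cost little and make the record self-explanatory.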

Visualizing results

Recording the state of every step in data allows us to visualize the learning process and its progress. Figure 16.2 shows which option was chosen in each time step and simulation:

# Visualize results:
library(ggplot2)  # for ggplot()
library(unikn)    # assumed source of usecol() and the colors Grau, Bordeaux, Seegruen

ggplot(data, aes(x = t)) + 
  facet_wrap(~s) + 
  geom_path(aes(y = act), col = Grau) + 
  geom_point(aes(y = act, col = factor(act)), size = 2) + 
  scale_color_manual(values = usecol(c(Bordeaux, Seegruen))) + 
  scale_y_continuous(breaks = 1:2, labels = alt) +  
  labs(title = paste0("Agent actions (choices) in ", n_S, " simulations"), 
       x = "Time steps", y = "Action", color = "Option:")

Figure 16.2: The agent’s action (i.e., option chosen) in a stable environment per time step and simulation.

As Option B yields higher rewards than Option A, learning would be reflected in Figure 16.2 by an increasing preference for B over A. Figure 16.3 shows the trends in option weights per time step for all 12 simulations:

ggplot(data, aes(x = t, group = s)) + 
  geom_path(aes(y = w_A), size = .5, col = usecol(Bordeaux, alpha = .5)) + 
  geom_path(aes(y = w_B), size = .5, col = usecol(Seegruen, alpha = .5)) + 
  labs(title = paste0("Agent weights (expectations/preferences) in ", n_S, " simulations"),   
       x = "Time steps", y = "Weights")

Figure 16.3: Trends in option weights in a stable environment per time step for all simulations.

The systematic difference in trends can be emphasized by showing their averages (Figure 16.4):

ggplot(data, aes(x = t)) + 
  geom_smooth(aes(y = w_A), size = 1, col = usecol(Bordeaux, alpha = .5)) + 
  geom_smooth(aes(y = w_B), size = 1, col = usecol(Seegruen, alpha = .5)) +
  labs(title = paste0("Trends in (average) agent weights in ", n_S, " simulations"),   
       x = "Time steps", y = "Weights", col = "Option:")

Figure 16.4: Average trends in option weights in a stable environment per time step for all simulations.


  • The visualization of agent actions (in Figure 16.2) shows that most agents gradually chose the superior Option B more frequently. Thus, learning took place and manifested itself in a systematic trend in the choice behavior of the agent (particularly in later trials t).

  • The visualization of agent weights (in Figure 16.3) illustrates that most learners preferred the superior Option B (shown in green) within about 5–10 trials of the simulation.

  • The systematic preference for Option B over Option A (from Trial 5 onward) is further emphasized by comparing average trends (in Figure 16.4).

When interpreting the results of models, we must not forget that they show different facets of the same process. As the values of the weights wgt determine the probabilities of choosing each option (via p()), there is a logical link between the concepts illustrated by these visualizations: The behavioral choices shown in Figure 16.2 are a consequence of the weights shown in Figures 16.3 and 16.4, with some added noise due to random sampling. Thus, the three visualizations show how learning manifests itself on different conceptual and behavioral levels.
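The link between weights and choices can be made explicit by converting recorded weights back into choice probabilities. As a minimal sketch, the following code enters the six rows of Table 16.1 by hand (as a hypothetical data frame `tab`) and adds the probability of choosing Option B that is implied by the weights after each step:

```r
# First six rows of Table 16.1 (Simulation 1), entered by hand:
tab <- data.frame(
  t   = 1:6,
  act = c(1, 1, 1, 2, 2, 1),
  w_A = c(52.50000, 55.06098, 57.68141, 57.68141, 57.68141, 60.04347),
  w_B = c(50.00000, 50.00000, 50.00000, 56.96499, 64.41812, 64.41812)
)

# Probability of choosing Option B on the NEXT step, implied by the weights:
tab$p_B <- tab$w_B / (tab$w_A + tab$w_B)
round(tab$p_B, 3)
```

In this run, three initial choices of Option A first push the probability of choosing Option B below one half. It only rises above one half after Option B has been chosen twice, which illustrates how early random choices add noise to the learning process.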


As both options always perform above the constant aspiration level (of \(A = 5\)), the weight values wgt could only increase when experiencing rewards (as \(\Delta w > 0\)).

  1. What would happen if the aspiration level was set to a value between both options (e.g., \(A = 15\))?

  2. What would happen if the aspiration level was set to the value of the lower option (e.g., \(A = 10\))?

Boundary conditions

Reflecting on the consequences of different aspiration values highlights some boundary conditions that need to hold for learning to take place. We can easily see the impact of an aspiration level \(A\) matching the value of a current reward \(r(k)\): Due to a lack of surprise (i.e., \(r(k) - A = 0\)), the value of \(\Delta w(k)\) becomes 0 as well. And if the value of \(A\) exceeds \(r(k)\), both the difference \(r(k) - A\) and \(\Delta w(k)\) are negative, which reduces the corresponding weight. If option weights were not bounded to be \(\geq 0\), this could create a problem for their conversion into a probability by \(p(k)\). However, as we chose a value of \(A\) that was smaller than the smallest reward (\(A < \min_{k} \pi(k)\)), this problem could not occur here.
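The code of this section does not enforce such a lower bound. One possible safeguard is sketched here: the function `update_bounded()` is a hypothetical helper, and the extreme values are chosen only to provoke the problem.

```r
# Hypothetical bounded update (an assumption, not part of the model above):
update_bounded <- function(wgt, k, reward, alpha = 1, A = 5) {
  dw <- alpha * (wgt[k] / sum(wgt)) * (reward - A)  # weight increment
  wgt[k] <- max(0, wgt[k] + dw)  # never let a weight drop below zero
  wgt
}

wgt <- c(2, 50)
# Aspiration far above the reward of Option 1 (A = 100 vs. reward = 10):
update_bounded(wgt, k = 1, reward = 10, A = 100)  # 0 50
```

Without the call to `max()`, the first weight would become negative (about -1.46), and dividing it by the sum of weights would no longer yield a valid probability.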

To prevent all weights from converging to zero, the value of the aspiration level \(A\) must generally be lower than the reward of at least one option (Page, 2018, p. 307). Under this condition, the two functions specified above (i.e., p() and delta_w()) will eventually place almost all weight on the best alternative (as its weight always increases by the most). And using these weights for choosing alternatives implies that, in the long run, the best alternative will be selected almost exclusively.
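The long-run claim can be probed by letting a single agent learn for many more steps than before. This is a minimal sketch that uses the parameter values of this section but an arbitrary seed and an arbitrary number of 2000 steps. It illustrates the claim for one run and does not prove it.

```r
set.seed(2018)  # arbitrary seed, for reproducibility only

rew <- c(10, 20); alpha <- 1; A <- 5  # values used in this section
wgt <- c(50, 50)

for (t in 1:2000) {  # many more steps than in the simulations above
  k <- sample(1:2, size = 1, prob = wgt / sum(wgt))                # choose
  wgt[k] <- wgt[k] + alpha * (wgt[k] / sum(wgt)) * (rew[k] - A)    # learn
}

round(wgt / sum(wgt), 3)  # choice probabilities after 2000 steps
```

As both weights can only grow when \(A = 5\), the inferior option is never eliminated. Its share of the total weight merely shrinks as the weight of the better option grows faster.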

Adapting aspirations

The basic framework shown here can be extended in many directions. For instance, the following example is provided by Page (2018, p. 308). We consider two changes:

  1. Endogenous aspirations: Rather than using a constant aspiration level \(A\), a learner could adjust the value of \(A\) based on experience by setting it equal to the average reward value.

  2. Fixed choice sequence: Rather than randomly sampling the chosen option (according to Equation 1, i.e., the probabilities of p()), we may want to evaluate the effects of a fixed sequence of actions (e.g., choosing B, B, A, B).

Given this setup, the question “What is being learned?” can only refer to the adjusted weights wgt (as the sequence of actions is predetermined). However, when moving away from a constant aspiration level, we also want to monitor the values of our adjusted aspiration level \(A_{t}\).
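In symbols, and writing \(r_{t}\) for the reward received at step \(t\) (a notation assumed here for illustration), the adapted aspiration level is the mean of all rewards experienced so far. With the reduced learning rule and the order of operations used in the code of this section, the weight increment at step \(t\) still relies on the previous aspiration level:

\[A_{t} = \frac{1}{t} \sum_{\tau = 1}^{t} r_{\tau}, \qquad \Delta w_{t}(k) = \alpha \cdot [r_{t} - A_{t-1}]\]

where \(A_{0}\) denotes the initial aspiration level. Note that \(A_{0}\) only affects the first weight increment and does not enter the mean.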

These changes can easily be implemented as follows:[20]

# 1. Choosing: 
p <- function(k){  # Probability of k:
  wgt[k]/sum(wgt)  # wgt denotes current weights
}

# Reward from choosing k: 
r <- function(k){
  rew[k]  # rew denotes current rewards
}

# 2. REDUCED learning rule: 
delta_w <- function(k){ # Adjusting the weight of k: 
  # (alpha * p(k) * (r(k) - A))
  (alpha * 1 * (r(k) - A))  # HACK: seems to be assumed in Page (2018), p. 308
}

# Environment: 
alt <- c("A", "B")  # constant options
rew <- c(20, 10)    # constant rewards (A > B)
wgt <- c(50, 50)    # initial weights 

actions <- c("B", "B", "A", "B")
# actions <- rep("A", 10)
# actions <- rep(c("A", "B"), 50)

# Agent: 
alpha <- 1  # learning rate
A <- 5      # initial aspiration level

# Simulation:
n_T <- length(actions)  # time steps/rounds
r_hist <- rep(NA, n_T)  # history of rewards 

for (t in 1:n_T){

  # (1) Use wgt to determine current action: 
  # cur_prob <- c(p(1), p(2))
  # cur_prob <- c(1, 1)  # HACK: seems to be assumed in Page (2018), p. 308
  # cur_act <- sample(alt, size = 1, prob = cur_prob)
  cur_act <- actions[t]
  ix_act  <- which(cur_act == alt)
  # (2) Update wgt (based on reward): 
  new_w <- wgt[ix_act] + delta_w(ix_act)  # increment weight
  wgt[ix_act] <- new_w  # update wgt
  # (3) Determine reward and adjust aspiration level:
  r_hist[t] <- r(ix_act)           # store reward value in history 
  A <- mean(r_hist, na.rm = TRUE)  # adapt A  
  # (+) User feedback:
  print(paste0(t, ": ", cur_act, " (", ix_act, "), ", 
               "wgt = ", paste(round(wgt, 0), collapse = ":"), 
               ", A(", t, ") = ", round(A, 1)))
}
#> [1] "1: B (2), wgt = 50:55, A(1) = 10"
#> [1] "2: B (2), wgt = 50:55, A(2) = 10"
#> [1] "3: A (1), wgt = 60:55, A(3) = 13.3"
#> [1] "4: B (2), wgt = 60:52, A(4) = 12.5"
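The aspiration levels in this output can be reproduced without running the full model. As a minimal sketch (the names `A_mean`, `A_inc`, and `A_cur` are hypothetical), the following code computes the running mean of the four rewards in two equivalent ways. The incremental version does not require storing the reward history:

```r
r_hist <- c(10, 10, 20, 10)  # rewards for the actions B, B, A, B

A_mean <- cumsum(r_hist) / seq_along(r_hist)  # mean of all rewards so far

# Equivalent incremental form, which needs no stored history:
A_inc <- numeric(length(r_hist))
A_cur <- 0  # starting value is irrelevant, as the first step overwrites it
for (t in seq_along(r_hist)) {
  A_cur <- A_cur + (r_hist[t] - A_cur) / t  # move A towards the new reward
  A_inc[t] <- A_cur
}

round(A_mean, 1)          # 10.0 10.0 13.3 12.5
all.equal(A_mean, A_inc)  # TRUE
```

The step size 1/t shrinks over time, so each new reward moves the aspiration level less than the previous one did.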


Let’s further reflect on the consequences of our endogenous aspiration level \(A_{t}\):

  1. How would \(A_{t}\) change when an agent chose one option 10 times in a row?

  2. How would \(A_{t}\) change when an agent alternated between Option A and Option B 50 times in a row (i.e., for a total of 100 trials)?


Given the simplicity of our basic learning paradigm, we should note some limitations:

  • Learning only from experience: As our model only learns from experienced options, it lacks the ability to learn from counterfactual information.

  • Individual learning: The model here only considers an individual agent. Thus, it does not cover social situations and the corresponding phenomena of social information, social influence, and social learning.

  • The model’s level of analysis is flexible or unclear: Some models may aim to represent the actual mechanism (implementation, e.g., in neuronal structures), while others may be content to provide an abstract description of required components (e.g., a trade-off between hypothetical parts or processes).

Most of these limitations can be addressed by adding new elements to our basic learning paradigm. However, some limitations could also be interpreted as strengths: The simple, abstract and explicit nature of the model makes it easy to test and verify the necessary and sufficient conditions under which learning can take place. This higher degree of identifiability is the hallmark of formal models and distinguishes them from a verbal theory.


Page, S. E. (2018). The model thinker: What you need to know to make data work for you. Basic Books.
Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical conditioning II: Current research and theory (pp. 64–99). Appleton-Century-Crofts.
Silver, D., Singh, S., Precup, D., & Sutton, R. S. (2021). Reward is enough. Artificial Intelligence, 103535. https://doi.org/10.1016/j.artint.2021.103535
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (2nd ed.). MIT Press. http://incompleteideas.net/book/the-book.html

  18. Page (2018) uses the Greek letter \(\gamma\) as the learning rate. But as some learning models also use \(\gamma\) as a discount factor that determines the importance of future rewards, we denote the learning rate by the more common letter \(\alpha\) here.

  19. As r() is called by delta_w(), this creates a functional dependency in our model that renders some information implicit (here: the value of the reward received on every trial). As we will see, models often need to balance concerns for elegance and practicality.

  20. We use a reduced learning rule (which replaces \(p(k)\) by a constant of 1) to obtain the same values as in Page (2018, p. 308).