## 18.6 Exercises

Ideas for corresponding exercises and projects:

### 18.6.1 Reflecting on Pac-Man

The video game of Pac-Man is far too fast, rich, and complicated to be modeled as one of our exercises. However, it is instructive to reflect on what would be required to implement a corresponding model.

• How could the environment (i.e., the maze, pellets, time steps and rounds) be represented in a computer simulation?

• What would a model of the Pac-Man agent require (in terms of goals, internal states, and actions)?

• What would change if we were to model the ghosts instead?

• Bonus: What’s the name of the 4th ghost in Pac-Man?

### 18.6.2 Learning in coordination and conflict games

The prisoner’s dilemma (PD) is a notorious example of a two-player coordination game that shows why two rational players may not cooperate, even if it is in their best interests to do so (see Wikipedia for details). Table 18.4 shows a payoff matrix of a PD game.

Table 18.4: A payoff matrix of a prisoners’ dilemma (PD).
Options cooperate defect
cooperate $$(-1, -1)$$ $$(-3, +0)$$
defect $$(+0, -3)$$ $$(-2, -2)$$

1. Classify the PD’s type of game by describing its payoff matrix.

2. Implement a population of 10 x 10 agents that play the game for 100 rounds (using the lattice/torus setup from Section 18.2). What do they learn? Why?

1. Repeat 1. and 2. for the battle of the sexes (BoS) game — a two-player coordination game that pits diverging preferences against each other and thus may involve some conflict (see Wikipedia for details). As the standard version assigns somewhat sexist roles to males and females, Table 18.5 shows a payoff matrix of a BoS game as a dilemma to attend a performance playing either music composed by “Bach” or by “Stravinsky”:
Table 18.5: A payoff matrix of a battle of the sexes (BoS) game.
Options Bach Stravinsky
Bach $$(3, 2)$$ $$(0, 0)$$
Stravinsky $$(0, 0)$$ $$(2, 3)$$

### 18.6.3 Playing rock-paper-scissors (RPS)

Table 18.6 illustrates the payoff matrix of the rock-paper-scissors (RPS) game.

Table 18.6: The payoff matrix of the rock-paper-scissors (RPS) game.
Options rock paper scissors
rock $$(0, 0)$$ $$(-1, +1)$$ $$(+1, -1)$$
paper $$(+1, -1)$$ $$(0, 0)$$ $$(-1, +1)$$
scissors $$(-1, +1)$$ $$(+1, -1)$$ $$(0, 0)$$
1. Classify this type of game and justify your choices by describing the payoff matrix.

2. What is the most rational strategy when playing this game against an opponent? Why?

3. Implement the game for a population of 100 agents arranged in a 10 x 10 lattice/torus design, in which each agent plays its 8 surrounding agents on each of $$N_t = 100$$ rounds (as in Section 18.2). What do the agents learn?

Hint: This exercise is based on a simulation by Wataru Toyokawa, which can be found here. The original example uses a population of $$42^2 = 1764$$ agents, which takes longer, but yields more stable results.

1. Bonus: Let’s try to reproduce an interesting result based on a Japanese children’s game that uses a biased payoff matrix (see Wataru’s remarks on “Gu-ri-co”).
Assume a RPS game in which each winning player receives the number of points that corresponds to the number of characters in the name of her or his chosen option. This results in a biased payoff matrix that yields 8 points for winning with scissors (as nchar(scissors) is 8), 5 points for winning with paper, and 4 points for winning with rock (and ties and losses earning 1 and 0 points, respectively).

2. Bonus: Assume that your opponent(s) selects her options randomly, but with biased preferences. Specifically, she plays rock on 50% of all trials, paper on 30% of all trials, and scissors on 20% of all trials. Show that and how your learning agent would beat this agent.

#### Solution

• ad 1.: The RPS game is a purely competitive zero-sum game. Its three options are symmetrical in the sense as each option is winning against one and losing against one other option.

• ad 2.: The most rational strategy in playing the RPS game is to be maximally unpredictable. This is achieved by selecting each of the three options with a random probability of $$p=1/3$$.

• ad 3.: We need to re-define the parameters of the game (i.e., options and payoffs), as well as change the auxiliary simulation code to accommodate three options (e.g., their densities).
Here is the resulting code:

Game definition: Change payoff matrices to accommodate three options (and avoid negative payoffs):

# Game:
N_opt <- 3  # NEW: Number of options

# NEW payoff matrices (per player) of the RPS game:
payoff_p1 <- matrix(c( 0, -1, +1,
+1,  0, -1,
-1, +1,  0),
nrow = N_opt, byrow = TRUE)  # Player 1 payoffs
payoff_p2 <- -1 * payoff_p1  # Player 2 payoffs

# HACK to avoid negative payoffs (NEW):
payoff_p1 <- payoff_p1 + 1
payoff_p2 <- payoff_p2 + 1

get_payoffs <- function(opt_p1, opt_p2){

# Look up values in matrices:
pay_p1 <- payoff_p1[opt_p1, opt_p2]
pay_p2 <- payoff_p2[opt_p1, opt_p2]

pay_offs <- c(pay_p1, pay_p2)

return(pay_offs)
}

## Check:
# get_payoffs(3, 1)

Simulation parameters (unchanged):

LAT <- 10     # length/width of lattice
N_p <- LAT^2  # population size/number of players
N_t <- 100    # number of games/rounds/time steps

# The boundary condition of the lattice/torus:
boundary_cond <- function(i) { # i is x or y coordinate:

if (i < 1) {

return (i + LAT)

} else if (i > LAT) {

return (i - LAT)

} else {

return (i)

}
}

Utility functions (unchanged):

# exp_ceiling() limits the exponential of x
# to a maximum value of 2^1023:
exp_ceiling <- function(x) {

exp_x <- exp(x)  # compute exp once

ix_inf <- which(is.infinite(exp_x))  # ix of infinite values

if (length(ix_inf) == 0) {

return(exp_x)  # return as is

} else { # limit infinite values:

exp_x[ix_inf] <- 8.988466e+307  # 2^1023

return(exp_x)

}
}

## Check:
# x <- c(10, 100, 1000)  # Note: exp(710) => Inf
# exp(x)
# exp_ceiling(x)

# Simple vector calculation functions (to use in apply()):
divide_vec = function (num, den) {

return(num/den)

}

multiply_vec = function (Q, beta) {

return(Q * beta)

}

Learner parameters (unchanged):

# Learner:
epsilon   <- 0.10  # exploration rate (epsilon-greedy model)
alpha     <- 0.50  # learning rate/step size parameter
Q_initial <- 0     # initial payoff expectation
Q         <- array(dim = c(N_opt, N_t, N_p))
Q[ , 1, ] <- Q_initial

Data structures for storing intermediate results: Adding a density_3 vector to store the probability density of a 3rd option.

# Storing history:
choices       <- matrix(nrow = N_p, ncol = N_t)  # in wide format
payoffs       <- matrix(nrow = N_p, ncol = N_t)
performance   <- matrix(nrow = N_p, ncol = N_t)

choiceCounter   <- array(0, dim = c(N_opt, N_p))
riskyChoiceProb <- matrix(nrow = N_p, ncol = N_t)
netChoiceProb   <- array(dim = c(N_opt,  N_t, N_p))
netChoiceProb[ , 1, ] <- 1/N_opt

density_1 <- rep(NA, N_t)
density_2 <- rep(NA, N_t)
density_3 <- rep(NA, N_t)  # NEW

# this is a list where the tactics distribution of each time step will be recorded
all_choices <- list() 

Running the simulation as a for loop (for each time step $$t$$): Adding a line to collect density_3 values on each time step.

for (t in 1:N_t) {

# Each individual chooses one option based on his/her choice probability netChoiceProb[ , t, ]:
choices[ , t] = apply(netChoiceProb[ , t, ], MARGIN = 2,
FUN = function(netChoiceProbInstance){
sample(1:N_opt, 1, prob = netChoiceProbInstance, replace = FALSE)})

all_choices$step[[t]] <- matrix(choices[ , t], nrow = LAT, ncol = LAT, byrow = TRUE) # Calculate the density of each option: density_1[t] <- which(choices[ , t] == 1) %>% length() %>% divide_vec(., LAT^2) density_2[t] <- which(choices[ , t] == 2) %>% length() %>% divide_vec(., LAT^2) density_3[t] <- which(choices[ , t] == 3) %>% length() %>% divide_vec(., LAT^2) # NEW # OR: # density_3[t] <- 1 - (density_1[t] + density_2[t]) # Calculate payoffs of each player: # Assumption: Each player plays the game once against each of the 8 neighbors: for (p in 1:N_p) { # the position (address?) of the focal player in the lattice: i <- (p - 1) %/% LAT + 1 # quotient j <- (p - 1) %% LAT + 1 # remainder payoffs[p, t] <- 0 # initialize value for (g in c(-1, 0, 1)) { # neighbor x for (h in c(-1, 0, 1)) { # neighbor y if (g^2 + h^2 > 0) { # is FALSE iff g == h == 0, which eliminates the focal player: # Determine current payoffs (given player choices): cur_payoffs <- get_payoffs(opt_p1 = all_choices$step[[t]][i, j],
opt_p2 = all_choices\$step[[t]][boundary_cond(i+g), boundary_cond(j+h)])

cur_payoff_p1 <- cur_payoffs[1]  # payoff of focal player

payoffs[p, t] <- payoffs[p, t] + cur_payoff_p1  # accumulate payoffs of focal player

} # if.

} # for h.
} # for g.

} # for p.

# cur_choices tracks which option was chosen by each individual on this trial:
cur_choices <- choices[ , t] + N_opt * (1:N_p - 1)

# Updating Q values (only if another trial is following):
if (t < N_t) {

# Copying the previous Q value to the current Q value:
Q[ , t + 1, ] <- Q[ , t, ]

QQ <- aperm(Q, c(1, 3, 2))  # transpose dimensions 2 and 3 of Q
dim(QQ) <- c(N_opt * N_p, N_t)

# Q value updating (note: alpha is a step-size parameter which is 1/n; the sample averaging rule):
QQ[cur_choices, t + 1] <- QQ[cur_choices, t] + alpha * (payoffs[ , t] - QQ[cur_choices, t])

dim(QQ) <- c(N_opt, N_p, N_t)
Q <- aperm(QQ, c(1, 3, 2))

###############
## Softmax choice (based only on Q values):
###############

Q_exp <- (Q[ , t + 1, ] * rep(20, each = N_opt)) %>%
apply(., MARGIN = 2, FUN = exp_ceiling) %>% round(., 5)

softmaxMatrix <- Q_exp %>%
apply(1, divide_vec, den = apply(Q_exp, MARGIN = 2, FUN = sum)) %>%
t() %>% round(., 5)

###############
## Softmax end.
###############

## Update choice probability matrix:
netMatrix <- apply(X = softmaxMatrix, MARGIN = 1, FUN = multiply_vec, beta = (1 - epsilon)) %>%
t() + apply(matrix(rep(1/N_opt, N_opt * N_p), N_opt:N_p), MARGIN = 1, FUN = multiply_vec, beta = epsilon) %>%
t() %>% round(., 4)

netChoiceProbAperm      <- aperm(netChoiceProb, c(1, 3, 2))
dim(netChoiceProbAperm) <- c(N_opt * N_p, N_t)
dim(netMatrix)          <- c(N_opt * N_p, 1)
netChoiceProbAperm[ , t + 1] <- netMatrix
dim(netChoiceProbAperm)      <- c(N_opt, N_p, N_t)
netChoiceProb                <- aperm(netChoiceProbAperm, c(1, 3, 2))

} # if (t < N_t).

} # for t.

Results: Density dynamics of option choices over 100 trials: Adding some details for 3rd option:

# Prepare data frame:
density_data <- data.frame(t = 1:N_t,
density_1 = density_1,
density_2 = density_2,
density_3 = density_3  # NEW
)

density_long <- density_data %>%
pivot_longer(cols = c(-t), names_to = 'option', values_to = 'density')

ggplot(density_long) +
geom_hline(yintercept = 1/N_opt, linetype = 'dashed', color = "grey50")+
geom_line(aes(t, density, color = option), size = 1, alpha = 3/4)+
scale_color_manual(values = c('density_1' = usecol(Pinky),
'density_2' = usecol(Seeblau),
'density_3' = usecol(Seegruen)),  # NEW
labels = c("Rock", "Paper", "Scissors")) +   # NEW
ylim(c(0, 1)) +
labs(title = "Density dynamics of options over time",
x = "Time step t", y = "Option density", color = "Options:") +
theme_ds4psy()

Interpretation: There is no evidence for any dominant strategy. As soon as one strategy is chosen more frequently than by chance, its opposite strategy becomes more successful. Thus, the population oscillates around the random baseline at $$p=1/3$$, which is the optimal strategy in this competitive game with three symmetrical options.

• ad 4.: Here is the biased payoff matrix and some modified parameter settings:
# BIASED payoff matrices (per player) of the RPS game:
payoff_p1 <- matrix(c(1, 0, 4,
5, 1, 0,
0, 8, 1),
nrow = N_opt, byrow = TRUE)  # Player 1 payoffs
payoff_p2 <- matrix(c( 1, 0, 8,
1, 1, 0,
0, 2, 1),
nrow = N_opt, byrow = TRUE)  # Player 2 payoffs

# NEW simulation/agent parameters:
N_t <- 200       # number of games/rounds/time steps
epsilon <- 0.01  # exploration rate (epsilon-greedy model)

set.seed(468)    # reproducible randomness

To obtain the following result (shown in Figure 18.6), we reduced the model’s exploration rate epsilon to 0.01 (to obtain smoother curves) and increased the number of trials n_T to 200:

Interpretation: It seems that the option with the highest payoff (i.e., “scissors”) is favored over the other options, but not by a wide margin. Interestingly, the second highest option (at the end of 200 trials) is “rock.” Winning with this option yields the least number of points, but it beats the popular “scissors” strategy. Overall, even the biased payoffs do not allow the population to gravitate toward a single strategy, as any strategy can be beaten by another.

### 18.6.4 Sequential mate search

A fundamental and important social game in the biological realm is provided by the problem of finding a mate to procreate. Mate search simulations are well-suited to illustrate the dynamics of social situations. However, they also contain many simplifying assumptions that may not hold in nature and society. In this exercise, we first create a basic simulation to explore some dynamics, before critically evaluating its results and deficits.

Assume a simulation for the following situation:

• Assumption 1: Population:
Imagine a population of simple creatures: $$N_p = 200$$ individuals, equally distributed into 2 types (i.e., $$N_p/2 = 100$$ individuals per type), with procreation depending on a long-term hetero-sexual partnership.

• Assumption 2: Attractiveness:
All relevant features of an individual $$i$$ for the this task can be summarized in its overall attractiveness score $$A_{i}$$. As all individuals of the opposite type share the same criteria for evaluating attractiveness, an individual’s attractiveness for individuals of the other type are fully described by the rank of its attractiveness score (i.e., a value of 1–100, assuming no ties). However, the individuals of the population do initially not know their own attractiveness score and do not know the distribution of attractiveness values in the population (i.e., they initially do not know how many individuals with values above 50 exist).

• Assumption 3: Dating and marrying:
Two individuals of opposite types meet in a random fashion in a series of $$N_t = 100$$ rounds of “dating.” In each round, the two individuals evaluate each other (based on some strategy that is only known to each individual) and decide whether they want to partner up with their match and make an offer if they do. Given a mutual choice for each other, the couple is matched (“married”) and removed from any further rounds of dating (i.e., married couples stay together).

• Assumption 4: Offspring and relationship value:
In each round after their marriage, a married couple produces an individual offspring with a probability $$P_{off}=.10$$. As the attractiveness value of this offspring is the average of its parents, the value of a relationship can be summarized by $$N_{off} \cdot \frac{A_{i} + A_{j}}{2}$$, where $$A_{i}$$ and $$A_{j}$$ are the attractiveness values of the married partners and $$N_{off}$$ is the total number of a couple’s offspring.

1. Your task is to devise a successful strategy for 1/3 (34%) of each type of the population in an environment in which two other strategies exist (and are randomly distributed over all individuals):

• 1/3 (33%) of the population (of each type) follows an experience-based aspiration level strategy that wants to partner up with any date that exceeds the average attractiveness of the first 10 dates.

• 1/3 (33%) of the population (of each type) first aims to estimate their own attractiveness by monitoring the offers received on the first 20 dating rounds. They then estimate their own attractiveness (as the average of the attractiveness values of the partners who made positive offers up to this point) and then set their aspiration level to a partner that is at least as attractive as themselves.

Justify why you believe that your strategy is more successful than the other two.

1. Create the simulation in R and evaluate the performance of all three strategies (numerically and visually). Which measures (beyond each strategy’s overall value) is informative regarding the success of a strategy?

#### Solution

General simulation parameters:

N_p <- 200      # population size
A_1 <- 1:N_p/2  # attractiveness ranks of type 1
A_2 <- 1:N_p/2  # attractiveness ranks of type 2
N_t <- 100      # number of time steps/trials/rounds

Note that the assumed value of a relationship can be directly computed when a match is made (at $$t_{match}$$) and is a linear function of $$P_{off}$$, the remaining time periods ($$N_{t} - t_{match}$$), and the attractiveness values of both partners ($$A_i + A_j$$):

match_value <- function(t_match, p_off, A_1, A_2){

t_left <- N_t - (t_match + 1)  # time left

N_off <- p_off * t_left  # expected number of offspring in time left

A_mean <- (A_1 + A_2)/2  # mean attractiveness of partners

N_off * A_mean

}
1. Discuss in which ways these assumptions are not valid for human societies (which ones?), how more realistic assumptions would change the corresponding simulation, and what this means for the results from corresponding models.

#### Solution

There are many simplifying assumptions that clearly — and fortunately — do not apply to a realistic population of human beings. For instance:

• Each individual is assumed to belong to one of two types and assumed to seek an exclusive and stable relationship with an individual of the opposite type. Thus, both the population setup and the dating regime disregard the possibility of other genders, non-permanent relationships, and any non hetero-sexual preferences. Similarly, individuals who do not seek a relationship or remain childless are not accounted for and even discounted as worthless by the assumed value function.

Thus, the presence of superficial analogies in structure, process, and terminology (e.g., “attractiveness,” “dating,” “marriage,” etc.) must not trick us into assuming that this model reflects something fundamental about human societies. Essentially, the results of any simulation are only as valid as its assumptions — and we should at least be skeptical when a model makes many unrealistic ones.

Note, however, that we also emphasized that all models are wrong, but may still be useful (in Section 15.1). Thus, reasons for skepticism do not automatically render all model results useless. An advocate of this particular model could claim that it may fail for human societies, but may still mimick important aspects of some animal population. But beyond the flaws that may be fatal for depicting a human society, there are more subtle deficits that may also be violated by most animal organisms. For instance:

• All individuals (in the population and within both types) are assumed to share the same preferences.

• The assumption that individuals’ strategies are independent of their attractiveness score (due to being distributed randomly) is unrealistic.

• An individual’s attractiveness score is an abstraction that serves multiple functions. First, it matters crucially for finding a match, then plays no role in the probability of procreation, but determines the value of a match again.

• The round value $$t$$ paces the dating regime (i.e., meeting one condidate per round), but also the rate of procreation (and thus the overall value of a partnership). In most animal populations, both time scales would differ from each other and not all individuals may follow the same rate.

• Allowing for non-permanent relationships and extra-marital affairs could completely change the dynamics of the situation.

Thus, the simulation may be suited to illustrate some dynamics of social situations, but provides no realistic model of mate search in most human or non-human societies. Interestingly, if we re-frame the two types in terms of supply and demand (i.e., as repeated interactions between sellers and buyers) many people may be less critical and willing to consider the same simulation as a decent economic model of market interactions…