15.3 Exercises

15.3.1 Model revolutions

Sketch an instance of a scientific revolution that resulted in a fundamental new model of some discipline or domain. (Candidates include cosmological, biological, chemical, medical or physical phenomena.)

  1. What was the established model that was replaced by a new one?

  2. Was the transition from old to new model smooth or bumpy? How long did it take?

  3. What were the benefits and costs of the new model? Which new predictions were enabled by it?

Hints: This exercise uses the term model in a loose way, not distinguishing it from scientific theories or paradigms. A good starting point for corresponding searches is the classic text by T. S. Kuhn (1962).

Solution

  • ad 1.: There are many candidates for fundamental shifts in scientific perspectives. Examples include the shift from the cardiocentric hypothesis (see Wikipedia) to the cephalocentric perspective (see Wikipedia).

  • ad 2.: Most shifts are quite bumpy and messy, as several conflicting perspectives can co-exist for some time, before one gains a decisive advantage over the other(s). Depending on the domain and models, this may take anything from years to centuries.

  • ad 3.: This depends on the nature and purpose of the models considered. For instance, conducting heart or brain surgery requires a different neurophysiological understanding than speculating about the seat of the soul.

15.3.2 Miniature model

In May 2021, the German news site Spiegel Online reported on the efforts of Reinhold Dukat, who re-constructed a miniature version of the Würzburg cathedral (St. Kilian’s dome, see Wikipedia) out of approx. 2.5 million Lego pieces (see Figure 15.5):

Figure 15.5: A proud architectural modeler. (Image by Nicolas Armer/dpa at SPON on 2021-05-05.)

This is undoubtedly an impressive achievement. We will seize the occasion to reflect further on the nature of scientific models.

  1. Does creating this model make Mr. Dukat a scientist?

  2. Which characteristics of a scientific model does his model have? Where does it fall short?

  3. In a similar vein: Discuss how drawing a map (e.g., of a city or region) may constitute a model.
    Would a satellite image of the area be a better model than a hand-drawn sketch?

Solution

  • ad 1.: Scientists use many models, but creating a model does not necessarily make someone a scientist. For instance, children, architects, or engineers create various models, but these are typically built for non-scientific purposes. By contrast, scientists create models for answering scientific questions (i.e., their models are means towards achieving scientific ends).
    As it is possible to use a miniature replica of a building for answering scientific questions (e.g., regarding its aesthetic properties, or safety issues in emergency situations), we cannot exclude that Mr. Dukat is a scientist. However, any such judgment should be based on the questions addressed by the model, rather than on the model itself.

  • ad 2.: Being smaller and simpler than the complex phenomenon it represents can be a useful feature of a scientific model. Additionally, re-creating an original out of different pieces and materials implies some level of abstraction.
    However, looking like an original or recreating a building in as much detail as possible are not scientific goals per se. Accuracy has many different facets and is only one of many criteria for evaluating scientific models.

  • ad 3.: A map is a miniature representation that may or may not be a good model of the city or landscape it represents. Its utility depends primarily on the questions addressed by it. The main purpose of a map consists in facilitating orientation and navigation tasks. For many tasks, an abstract and simple model can be as useful as a more detailed and naturalistic model.

15.3.3 An almost perfect model

Measure or create some data by rolling a fair die 100 times and recording its outcomes in data:

data <- ds4psy::dice(100)
table(data)
#> data
#>  1  2  3  4  5  6 
#> 18 19 16 12 21 14

Create a simple model with perfect fit and perfect explanatory power:

model <- data

Note that model could be implemented as an artificial neural network or as a universal Turing machine made out of toilet paper rolls. Here, we chose to implement model in the circuits of a computer that stores information in binary form on some silicon-based device and allows explaining and predicting data as the elements of a linear vector.

Let’s evaluate the performance of our model:

# Perfect explanation: 
data[13]  # individual data point
#> [1] 6
model[13]
#> [1] 6

sum(model == data)/length(data)  # overall
#> [1] 1

This shows that our model successfully captures the structure in our data. In fact, any point in data is perfectly predicted and explained by the corresponding entry in model. Thus, the scatterplot of Figure 15.6 shows that model provides a perfect model of data:

Figure 15.6: A scatterplot showing a perfect fit between data and model (showing 100 points jittered around true value).

What’s wrong with the model fit shown in Figure 15.6?

  1. Argue why model, despite its impressive elegance and perfect performance, may not be such a good model.

  2. Someone proposes an alternative model that goes by the name of probability theory. It assumes that any event in data is independent of all other events and occurs with a probability of \(\frac{1}{6}\). However, this model is more complicated and its fit to the observed data is only about 16.7%. Could this still be a better model than model? (Argue why or why not.)

Solution

  • ad 1.: As model is a copy of data, it is not surprising that it perfectly describes every point. Unfortunately, it also provides no benefit over data. For instance, it falls short in terms of simplicity: Rather than providing an abstract description of the data, it is exactly as complex as the data. Additionally, it perfectly explains past data, but is unlikely to provide a benefit (beyond a random baseline model) in predicting new data in the future.

  • ad 2.: The alternative model of probability theory is both more abstract and more flexible than model. Despite providing a less accurate description of existing data, this model is just as successful in describing new data as in describing existing data. Its key benefit is that it provides a general account that, provided that its assumptions hold, can be applied to many similar problems (e.g., coin flips, lotteries, etc.).
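
The difference between describing existing data and predicting new data can be made concrete in a short simulation. The following is a minimal sketch, not part of the original exercise: it uses base R's sample() as a stand-in for ds4psy::dice(), and the seed as well as the names new_data, acc_old, acc_new, p_model, and freq_new are assumptions chosen for illustration.

```r
set.seed(123)  # assumption: arbitrary seed, only for reproducibility

# Simulate 100 rolls of a fair die (base R stand-in for ds4psy::dice(100)):
data  <- sample(1:6, size = 100, replace = TRUE)
model <- data  # the "copy" model from the exercise

# Draw 100 new rolls from the same process:
new_data <- sample(1:6, size = 100, replace = TRUE)

# Point-wise accuracy of the copy model on old vs. new data:
acc_old <- mean(model == data)      # 1 by construction
acc_new <- mean(model == new_data)  # expected value is 1/6

# The probability model assigns p = 1/6 to each outcome.
# Compare its prediction with the relative frequencies in the new data:
p_model  <- rep(1/6, 6)
freq_new <- table(factor(new_data, levels = 1:6)) / length(new_data)
round(rbind(p_model, freq_new), 2)
```

On new rolls, the copy model should match only about one in six outcomes, which is no better than guessing. The probability model's prediction of roughly equal relative frequencies, by contrast, should hold about as well for new_data as for data.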

15.3.4 A vague verbal model

As a warning against the potential pitfalls of merely verbal theories, Hintzman (1991, p. 41) cited the following claim from a sociobiology textbook:

While adultery rates for men and women may be equalizing,
men still have more partners than women do, and they are
more likely to have one-night stands

(Leahey & Harris, 1985, see Google books for context and full quote).

According to Hintzman (1991), this statement may sound plausible, but is actually “mathematically impossible” (p. 41). Although we agree with the general point (that verbal theories are often vague and overly permissive), his challenge requires a narrow interpretation of “more partners.”

Create a model (or a sketch) that depicts equal numbers of individuals from two (or more) arbitrary genders who have heterosexual relationships with members of the other gender in some imbalanced fashion.

Note: The quoted statement only mentioned two genders (M and F) and heterosexual relations. This does not deny or exclude more diverse situations, but they are not required for solving this exercise.

  1. In which sense is it “mathematically impossible” that men have more partners than women?

  2. In which other sense is it possible?

Solution

Figure 15.7 shows a hypothetical constellation with possible relations between two genders (M and F):

Figure 15.7: An abstract illustration of possible relations between two genders (M and F). (Dashed and solid lines depict different types of heterosexual relationships.)

Assuming that the dashed and solid lines depict different types of relationships, we can see that

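  • ad 1.: With equal numbers of Ms and Fs, it is impossible for the average number of relationships per M to be higher than the average number per F. Every heterosexual relationship involves exactly one M and one F, so both genders always have the same total number of relationships. This is true for any number or type of heterosexual relationship.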

  • ad 2.: However, it is possible that more Ms than Fs (or a higher proportion of Ms than of Fs) have more than one heterosexual relationship. In the situation shown here, 2 out of 3 Ms have more than one relationship, whereas only 1 out of 3 Fs has multiple relationships.

Note that this example lacks a more diverse perspective. However, it makes no claims about the timing or nature of the relationships (shown as solid or dashed lines) and does not deny or exclude the existence of other genders or relationship types.

Thus, the original statement must have implied a difference in the proportions among individuals of the two genders, rather than a difference in the absolute or average number of relationships. (Note that showing a possible model does not imply its truth, of course. Otherwise, the mere fact that the moon could be made of cheese would make this true as well…)
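
Both readings of "more partners" can be checked on a small relationship matrix. The following is a minimal sketch under stated assumptions: the matrix rel and its entries are a hypothetical constellation chosen to match the proportions described above (2 of 3 Ms and 1 of 3 Fs with multiple relationships), not a transcription of Figure 15.7.

```r
# Hypothetical relationships between 3 Ms (rows) and 3 Fs (columns);
# a 1 marks a relationship between the corresponding M and F:
rel <- matrix(c(1, 1, 0,
                1, 0, 1,
                1, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("M1", "M2", "M3"),
                              c("F1", "F2", "F3")))

partners_M <- rowSums(rel)  # relationships per M
partners_F <- colSums(rel)  # relationships per F

# Reading 1: totals and averages are necessarily equal,
# as every relationship is counted once per gender:
sum(partners_M) == sum(partners_F)
mean(partners_M) == mean(partners_F)

# Reading 2: the proportion of individuals with multiple partners can differ:
mean(partners_M > 1)  # proportion of Ms with more than one partner
mean(partners_F > 1)  # proportion of Fs with more than one partner
```

In this constellation, both genders have five relationships in total, but these are spread more evenly among the Ms than among the Fs, where F1 accounts for three of them.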

15.3.5 Modeling samples of famous people

In this exercise, we will conduct and compare some summary statistics on population vs. sample data. As we rarely have the data of an entire population, we will use a very large dataset and pretend that it represents a population. A reasonable approximation is the following:

  1. Browse the Pantheon site to get an impression of its data contents and variables.

  2. Go to https://pantheon.world/data/datasets to download a recent version of the Pantheon dataset (Yu et al., 2016).

  3. Load the data into R and establish its dimensions.

Solution

library(tidyverse)
library(ds4psy)
library(unikn)

# Load data:
fm <- readr::read_csv("data-raw/person_2020_update.csv") 
dim(fm)
#> [1] 88937    34

  4. Conduct an EDA on the data (as the “population” of famous people).

Solution

  • Missing vs. complete cases:
## (1) Quick summaries:
# summary(fm)
# tibble::glimpse(fm)
# skimr::skim(fm)

# (1b) Missing vs. complete cases: 
sum(is.na(fm))           # missing values?
#> [1] 574735
sum(complete.cases(fm))  # complete cases?
#> [1] 14

# (1c) Fix capitalization:
fm$occupation <- capitalize(tolower(fm$occupation))

  • Alive vs. dead:
# (2) Alive vs. dead?
fm_t2 <- fm %>% 
  group_by(alive) %>% 
  summarise(n = n(),
            pc = n/nrow(fm)*100)
fm_t2
#> # A tibble: 2 x 3
#>   alive     n    pc
#> * <lgl> <int> <dbl>
#> 1 FALSE 41366  46.5
#> 2 TRUE  47571  53.5

ggplot(fm_t2, aes(x = alive, fill = alive, y = pc)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(pc, 1)), vjust = 2) + 
  labs(title = "Famous people dead vs. alive", 
       x = "Alive?", y = "Percentage", fill = "Alive:") +
  scale_fill_manual(values = c("grey60", "palegreen3")) +
  theme_ds4psy()

  • Gender distribution:
# (3) Gender distribution:
fm_t3 <- fm %>% 
  group_by(gender) %>%
  summarise(n = n(),
            pc = n/nrow(fm)*100)
fm_t3
#> # A tibble: 3 x 3
#>   gender     n      pc
#> * <chr>  <int>   <dbl>
#> 1 F      19993 22.5   
#> 2 M      68928 77.5   
#> 3 <NA>      16  0.0180

ggplot(fm_t3, aes(x = gender, fill = gender, y = pc)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(pc, 1)), vjust = 2) +
  labs(title = "Gender share", x = "Gender", y = "Percentage", fill = "Gender:") +
  scale_fill_manual(values = c("rosybrown2", "deepskyblue1", "firebrick")) +
  theme_ds4psy()

  • Occupations:
# (4) Occupations:
fm_t4 <- fm %>% 
  group_by(occupation) %>%
  summarise(n = n(),
            pc = n/nrow(fm)*100) %>%
  arrange(desc(n))
fm_t4
#> # A tibble: 101 x 3
#>    occupation             n    pc
#>    <chr>              <int> <dbl>
#>  1 Soccer player      16923 19.0 
#>  2 Politician         15640 17.6 
#>  3 Actor              10017 11.3 
#>  4 Writer              5777  6.50
#>  5 Singer              3544  3.98
#>  6 Athlete             3061  3.44
#>  7 Musician            2674  3.01
#>  8 Religious figure    2246  2.53
#>  9 Film director       1583  1.78
#> 10 Military personnel  1470  1.65
#> # … with 91 more rows

ggplot(fm) +
  geom_bar(aes(x = reorder(occupation, -table(occupation)[occupation]), 
               y = after_stat(count / sum(count))), fill = "deepskyblue3") +
  geom_hline(yintercept = c(.05, .10, .15, .20), linetype = 3) + 
  labs(title = "Occupations of famous people", 
       x = "Occupation", y = "Percentage") +
  coord_flip() +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = rel(0.75)))

  • Occupation by alive vs. dead:
# (4b) Occupation x Alive: 
ggplot(fm) +
  geom_bar(aes(x = reorder(occupation, -alive, mean), 
               y = after_stat(count / sum(count)), fill = alive), position = "fill") +
  geom_hline(yintercept = c(.25, .50, .75), linetype = c(3, 2, 3)) + 
  labs(title = "Occupations of famous people alive vs. dead", 
       x = "Occupation", y = "Percentage", fill = "Alive:") +
  scale_fill_manual(values = c("grey60", "palegreen3")) +
  coord_flip() +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = rel(0.75)))

  • Occupation by gender:
# (4c) Occupation x Gender: 
fm_f <- fm %>% 
  filter(!is.na(gender)) %>% 
  mutate(female = ifelse(gender == "F", 1, 0)) 
# table(fm_f$female)

ggplot(fm_f) +
  geom_bar(aes(x = reorder(occupation, female, mean), 
               y = after_stat(count / sum(count)), fill = gender), position = "fill") +
  geom_hline(yintercept = c(.25, .50, .75), linetype = c(3, 2, 3)) + 
  labs(title = "Occupations of famous people by gender", 
       x = "Occupation", y = "Percentage", fill = "Gender:") +
  scale_fill_manual(values = c("rosybrown2", "deepskyblue1")) + 
  coord_flip() +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = rel(0.75)))

From this overview, it is pretty obvious that the data contains substantial biases. This is not surprising: Being famous is largely a matter of definition — and the inclusion criteria in any such collection will inevitably vary as a function of time and the prevailing societal norms.

  5. Draw a sub-sample (e.g., only people from some country, only people still alive, or a random subset) and repeat your analyses from 4. How does your sample compare to the population data? Which one is more representative?

Solution

It is clear from our general EDA that selecting cases based on gender and occupations will yield highly biased samples.

  • We illustrate this by selecting the subset of females in professions in which females are a majority:
fm_fem_occu <- fm %>% 
  filter(!is.na(gender)) %>% 
  group_by(occupation) %>%
  mutate(n_occupation = n()) %>%
  ungroup() %>% 
  group_by(gender, occupation) %>%
  mutate(n_gender_occupation = n(),
         p_gender_occupation = n_gender_occupation/n_occupation) %>%
  select(name, occupation, n_occupation:p_gender_occupation) %>%
  filter(p_gender_occupation > .5, gender == "F") %>% 
  arrange(p_gender_occupation)
fm_fem_occu
#> # A tibble: 3,599 x 6
#> # Groups:   gender, occupation [10]
#>    gender name       occupation n_occupation n_gender_occupat… p_gender_occupat…
#>    <chr>  <chr>      <chr>             <int>             <int>             <dbl>
#>  1 F      Nadia Com… Gymnast             183                92             0.503
#>  2 F      Larisa La… Gymnast             183                92             0.503
#>  3 F      Olga Korb… Gymnast             183                92             0.503
#>  4 F      Olga Tass  Gymnast             183                92             0.503
#>  5 F      Věra Čásl… Gymnast             183                92             0.503
#>  6 F      Ágnes Kel… Gymnast             183                92             0.503
#>  7 F      Nellie Kim Gymnast             183                92             0.503
#>  8 F      Estella A… Gymnast             183                92             0.503
#>  9 F      Maria Gor… Gymnast             183                92             0.503
#> 10 F      Helena Ra… Gymnast             183                92             0.503
#> # … with 3,589 more rows
nrow(fm_fem_occu)
#> [1] 3599

# Numbers:
fm_fem_occu %>%
  group_by(occupation) %>%
  summarise(n = n(),
            pc = n/nrow(fm_fem_occu) * 100, 
            pc_female = mean(p_gender_occupation) * 100) %>%
  arrange(desc(pc))
#> # A tibble: 10 x 4
#>    occupation             n    pc pc_female
#>    <chr>              <int> <dbl>     <dbl>
#>  1 Singer              1839 51.1       51.9
#>  2 Companion            636 17.7       93.1
#>  3 Model                240  6.67      96.4
#>  4 Swimmer              215  5.97      54.8
#>  5 Pornographic actor   194  5.39      87.4
#>  6 Skater               168  4.67      53.3
#>  7 Celebrity            130  3.61      71.4
#>  8 Gymnast               92  2.56      50.3
#>  9 Badminton player      45  1.25      55.6
#> 10 Dancer                40  1.11      53.3

# Relative frequency:
ggplot(fm_fem_occu) +
  geom_bar(aes(x = reorder(occupation, -table(occupation)[occupation]), 
               y = after_stat(count / sum(count))), fill = "deepskyblue3") +
  geom_hline(yintercept = c(.05, .10, .15, .20), linetype = 3) + 
  labs(title = "Famous females in occupations with mostly females", 
       x = "Occupation", y = "Percentage") +
  coord_flip() +
  theme_minimal() + 
  theme(axis.text.y = element_text(size = rel(0.75)))

In this particular sample (still containing 3599 individuals), more than 75% are singers, companions, or models. This narrow range of professions with predominantly female individuals is due to a combination of two factors:

  1. The inclusion criteria of the Pantheon data;

  2. Our sample criteria (here: females in professions with a majority of females).

Interestingly, each step may seem quite innocuous in itself, but their combination paints a very biased and bleak impression of female celebrity. We can hope that this one-sided image will become more balanced and diverse in future iterations of similar data.

Overall, neither our sample nor the original data can be considered to be “representative” of humanity, of course. Instead, both samples provide insights into the mechanisms and potential consequences of collecting such data. Thus, our highly biased sample reveals that the Pantheon data also reflects deep-rooted biases in our cultural history. Such biases are common in any large dataset and illustrate why it is indispensable to always explicate our criteria for considering and including cases.
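
For contrast with the deliberately biased sample, a simple random sample can be compared to the “population” as follows. This is a minimal sketch under stated assumptions: it rebuilds only the gender variable from the counts reported in the EDA (19993 F and 68928 M, ignoring the 16 missing values), so that it runs without the Pantheon file. The sample size of 1000, the seed, and the names pop_gender, smp_gender, prop_pop, and prop_smp are arbitrary choices for illustration.

```r
set.seed(42)  # assumption: arbitrary seed, only for reproducibility

# Rebuild the gender variable of the "population" from the EDA counts
# (19993 F, 68928 M; the 16 missing values are ignored):
pop_gender <- rep(c("F", "M"), times = c(19993, 68928))

# Draw a simple random sample without replacement:
smp_gender <- sample(pop_gender, size = 1000)

# Compare the gender shares in population and sample:
prop_pop <- prop.table(table(pop_gender))
prop_smp <- prop.table(table(smp_gender))
round(rbind(population = prop_pop, sample = prop_smp), 3)
```

A random sample of this size should reproduce the gender shares of the full data quite closely, including their imbalance. Being representative of the Pantheon data is therefore not the same as being representative of humanity.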

References

Hintzman, D. L. (1991). Why are formal models useful in psychology? In W. E. Hockley & S. Lewandowsky (Eds.), Relating theory and data: Essays on human memory in honor of Bennet B. Murdock (pp. 39–56). Lawrence Erlbaum.
Kuhn, T. S. (1962). The structure of scientific revolutions. University of Chicago Press.
Yu, A. Z., Ronen, S., Hu, K., Lu, T., & Hidalgo, C. A. (2016). Pantheon 1.0, a manually verified dataset of globally famous biographies. Scientific Data, 3(1), 1–16. https://doi.org/10.1038/sdata.2015.75