19.3 Qualitative prediction tasks
- Screening test: Healthy tissue or signs of cancer?
- Will some market or individual stock go up or down today/this month/year?
- A classic task: Who survived the Titanic disaster?
Strategy: Use a 2x2 matrix as an analytic device. Beyond predicting category membership, we also want to evaluate the quality of the resulting predictions.
- qualitatively: Predict membership in a binary category.
- quantitatively: Describe result and measure success (as accuracy or some other metric).
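As a minimal sketch of both steps, the following uses the counts from base R's `Titanic` data and a deliberately simple rule (predict survival for females, death for males). The rule and the object names (`m`, `n_correct`, `accuracy`) are illustrative assumptions, not a method prescribed by this chapter.

```r
# 2x2 matrix of actual outcome (rows) by predictor (columns):
m <- xtabs(Freq ~ Survived + Sex, data = as.data.frame(Titanic))

# Hypothetical rule: predict "Yes" (survived) for females, "No" for males.
# Correct predictions are females who survived and males who died:
n_correct <- m["Yes", "Female"] + m["No", "Male"]

# Accuracy = proportion of correct predictions among all cases:
accuracy <- n_correct / sum(m)
round(accuracy, 3)
#> [1] 0.776
```

The qualitative step is the rule itself; the quantitative step condenses its success into a single number (here, 1708 correct predictions out of 2201 cases).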
Start with a dataset:
- Binary variables
library(tidyverse)  # provides %>%, group_by(), summarise(), pivot_wider()

# As contingency df:
t_df <- as.data.frame(Titanic)
t_df
# with(tt, table(Survived, Sex))  # only number of cases

t2 <- t_df %>%
  group_by(Sex, Survived) %>%
  summarise(n = n(), freq = sum(Freq))
t2
#> # A tibble: 4 x 4
#> # Groups: Sex
#>   Sex    Survived     n  freq
#>   <fct>  <fct>    <int> <dbl>
#> 1 Male   No           8  1364
#> 2 Male   Yes          8   367
#> 3 Female No           8   126
#> 4 Female Yes          8   344

# Frame a 2x2 matrix: ------

# (a) Pivot summary into 2x2 matrix:
t2 %>%
  pivot_wider(names_from = Sex, values_from = freq) %>%
  select(-n)
#> # A tibble: 2 x 3
#>   Survived  Male Female
#>   <fct>    <dbl>  <dbl>
#> 1 No        1364    126
#> 2 Yes        367    344

# (b) From contingency df:
xtabs(Freq ~ Survived + Sex, data = t_df)
#>         Sex
#> Survived Male Female
#>      No  1364    126
#>      Yes  367    344

# (c) From raw data cases:
t_raw <- i2ds::expand_freq_table(t_df)
table(t_raw$Survived, t_raw$Sex)
#>
#>       Male Female
#>   No  1364    126
#>   Yes  367    344
Note that interpreting such a table is complex, because we can use different measures and adopt different perspectives on them. In terms of measures, we need to distinguish between frequency counts, proportions, and different kinds of probabilities:
- Frequencies: The absolute numbers show many more males (1731) than females (470).
- Proportions of survivors by gender: A majority of females survived (344 of 470, about 73%), whereas a majority of males died (1364 of 1731, about 79%).
- Probabilities can be joint, marginal, or conditional (depending on the computation of their numerator and denominator).
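The three kinds of probabilities differ only in what is divided by what. The following sketch computes them from the Titanic counts with base R functions; the object names (`p_joint`, `p_survived`, `p_sex`, `p_surv_given_sex`) are chosen here for illustration.

```r
# 2x2 frequency table (rows: Survived, columns: Sex):
m <- xtabs(Freq ~ Survived + Sex, data = as.data.frame(Titanic))

# Joint probabilities: each cell divided by the grand total
# (all four cells sum to 1):
p_joint <- prop.table(m)

# Marginal probabilities: row or column totals divided by the grand total:
p_survived <- margin.table(m, margin = 1) / sum(m)  # P(Survived)
p_sex      <- margin.table(m, margin = 2) / sum(m)  # P(Sex)

# Conditional probabilities: each cell divided by its column total,
# i.e., P(Survived | Sex), so that each column sums to 1:
p_surv_given_sex <- prop.table(m, margin = 2)

round(p_surv_given_sex, 2)
```

Using `margin = 1` instead would condition on the rows and answer a different question, P(Sex | Survived): Among all survivors, which proportion was female?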
19.3.1 ToDo 1
Steps of the matrix lens model (see the MLM package):
Determine a pair of variables: a predictor and a criterion.
Frame a 2x2 matrix:
# Note: m_1, m_2, and m_3 are not defined in this section.
# They are assumed to be 2x2 tables of the kind framed above.
m_1
rowSums(m_1)
colSums(m_1)
sum(m_1)

# Get four basic values:
(abcd <- c(m_1[1, 1], m_1[1, 2], m_1[2, 1], m_1[2, 2]))

# Joint and conditional probabilities (as percentages):
prop.table(m_2) * 100
prop.table(m_2, margin = 1) * 100  # by rows
prop.table(m_2, margin = 2) * 100  # by cols

# ToDo: Diagonal (margin = 3)

# Test:
chisq.test(m_1)
chisq.test(m_2)
chisq.test(m_3)

# Visualization:
mosaicplot(t(m_2), color = c("skyblue1", "grey75"))
mosaicplot(t(m_3), color = c("skyblue1", "grey75"))
Focusing: Compute various metrics
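Which metrics to compute depends on the perspective taken. As a rough sketch, the helper below derives five common ones from the four cells of a 2x2 matrix. The function `compute_metrics()` and its argument names (`hi` = hits, `fa` = false alarms, `mi` = misses, `cr` = correct rejections) are hypothetical and not part of any package mentioned here. The example counts again assume the illustrative rule of predicting survival for females.

```r
# Hypothetical helper: common metrics from the four cells of a 2x2 matrix.
# Assumed cell labels: hi = hits, fa = false alarms,
#                      mi = misses, cr = correct rejections.
compute_metrics <- function(hi, fa, mi, cr) {
  c(sensitivity = hi / (hi + mi),  # P(predict yes | actually yes)
    specificity = cr / (cr + fa),  # P(predict no  | actually no)
    ppv         = hi / (hi + fa),  # P(actually yes | predict yes)
    npv         = cr / (cr + mi),  # P(actually no  | predict no)
    accuracy    = (hi + cr) / (hi + fa + mi + cr))
}

# Titanic counts under the rule "predict survival for females":
# hi = female survivors, fa = females who died,
# mi = male survivors,   cr = males who died.
metrics <- compute_metrics(hi = 344, fa = 126, mi = 367, cr = 1364)
round(metrics, 3)
```

Sensitivity and specificity condition on the actual outcome, whereas the predictive values (PPV, NPV) condition on the prediction, so the same four counts can paint quite different pictures.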
When the predictor variable is continuous (and the criterion is binary): Determine an optimal cut-off point that maximizes some performance metric.
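One simple way to find such a cut-off is a brute-force search over all observed predictor values. The sketch below uses synthetic data (simulated here purely for illustration, not taken from any dataset in this chapter) and accuracy as the metric to maximize; both choices are assumptions, and other metrics could be substituted.

```r
set.seed(1)  # synthetic data, for illustration only

# Simulated continuous predictor x and binary criterion y
# (cases with y = 1 tend to have higher x values):
n <- 200
y <- rbinom(n, size = 1, prob = 0.5)
x <- rnorm(n, mean = ifelse(y == 1, 1, 0), sd = 1)

# Candidate cut-offs: all observed predictor values
cutoffs <- sort(unique(x))

# Accuracy of the rule "predict 1 if x >= cut-off" for each candidate:
acc <- sapply(cutoffs, function(k) mean((x >= k) == (y == 1)))

# Cut-off with the highest accuracy (the first one in case of ties):
best_cut <- cutoffs[which.max(acc)]
best_acc <- max(acc)
c(cutoff = best_cut, accuracy = best_acc)
```

Each candidate cut-off turns the continuous predictor into a binary one and thus defines its own 2x2 matrix. Note that a cut-off optimized on the same data it is evaluated on will tend to look better than it performs on new cases.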
- Goal: Illustrate cases of binary prediction with the FFTrees package (Phillips et al., 2017).
library(FFTrees)
library(tidyverse)

t_df <- FFTrees::titanic
t_tb <- as_tibble(t_df)
t_tb

## variables as factors:
# t_tb$survived = factor(t_tb$survived, levels = c(1, 0))
# t_tb$sex = factor(t_tb$sex, levels = c("female", "male"))
t_tb

t4 <- t_tb %>%
  group_by(class, age, sex, survived) %>%
  count()
t4

t3 <- t_tb %>%
  group_by(sex, survived) %>%
  count()
t3

xtabs(cbind(survived, sex) ~ ., data = t_tb)

# Pivot into 2x2 matrix:
t3 %>%
  pivot_wider(names_from = sex, values_from = n)