18.4 Qualitative prediction tasks
Qualitative prediction models address classification tasks, that is, assigning elements to categories.
Examples:
- Screening test: Healthy tissue or signs of cancer?
- Will some market or individual stock go up or down today/this month/year?
- A classic task: Who survived the Titanic disaster?
Strategy: Use a 2x2 matrix as an analytic device. Beyond predicting category membership, we also want to evaluate the quality of the resulting prediction.
- qualitatively: Predict membership in a binary category.
- quantitatively: Describe the result and measure success (as accuracy or some other metric).
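As a minimal sketch of both aspects, the following code applies a simple rule to the built-in Titanic data and measures its accuracy. The rule itself (predict survival for females, death for males) and the names titanic_df and m_sex are assumptions made for this illustration only.

```r
# Build a 2x2 matrix of Survived by Sex from the built-in Titanic table:
titanic_df <- as.data.frame(Titanic)
m_sex <- xtabs(Freq ~ Survived + Sex, data = titanic_df)

# Hypothetical rule (an assumption for illustration):
# predict "Yes" (survived) for females and "No" (died) for males.
n_correct <- m_sex["Yes", "Female"] + m_sex["No", "Male"]
n_total   <- sum(m_sex)

# Accuracy of the rule: proportion of correctly classified cases.
accuracy <- n_correct / n_total
round(accuracy, 3)   # 0.776

# Baseline for comparison: always predict the more frequent category ("No").
baseline <- sum(m_sex["No", ]) / n_total
round(baseline, 3)   # 0.677
```

Comparing the rule against the baseline shows why a metric needs a reference point: an accuracy of about 78% is only informative relative to the roughly 68% achieved by always predicting the majority category.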
Start with a dataset:
- Binary variables
library(tidyverse)  # provides %>%, group_by(), summarise(), pivot_wider()
# As contingency df:
t_df <- as.data.frame(Titanic)
# with(t_df, table(Survived, Sex)) # only number of cases
t2 <- t_df %>%
  group_by(Sex, Survived) %>%
  summarise(n = n(),
            freq = sum(Freq))
t2
#> # A tibble: 4 × 4
#> # Groups: Sex [2]
#> Sex Survived n freq
#> <fct> <fct> <int> <dbl>
#> 1 Male No 8 1364
#> 2 Male Yes 8 367
#> 3 Female No 8 126
#> 4 Female Yes 8 344
# Frame a 2x2 matrix: ------
# (a) Pivot summary into 2x2 matrix:
t2 %>%
  pivot_wider(names_from = Sex, values_from = freq) %>%
  select(-n)
#> # A tibble: 2 × 3
#> Survived Male Female
#> <fct> <dbl> <dbl>
#> 1 No 1364 126
#> 2 Yes 367 344
# (b) From contingency df:
xtabs(Freq ~ Survived + Sex, data = t_df)
#> Sex
#> Survived Male Female
#> No 1364 126
#> Yes 367 344
# (c) From raw data cases:
t_raw <- i2ds::expand_freq_table(t_df)
table(t_raw$Survived, t_raw$Sex)
#>
#> Male Female
#> No 1364 126
#> Yes 367 344
Interpreting even such a simple table is complex, because we can compute different measures and adopt different perspectives on them. In terms of measures, we need to distinguish between frequency counts, proportions, and different kinds of probabilities:
- Frequencies: The absolute numbers show many more males than females (1731 vs. 470).
- Proportions of survivors by gender: A majority of females survived (344 of 470, about 73%), whereas a majority of males died (1364 of 1731, about 79%).
- Probabilities can be joint, marginal, or conditional (depending on which counts enter their numerator and denominator).
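The three kinds of probabilities differ only in their denominator. The following sketch computes each of them with base R functions; the names titanic_df and m_sex are chosen for this illustration and are not part of the original code.

```r
# 2x2 matrix of Survived (rows) by Sex (columns):
titanic_df <- as.data.frame(Titanic)
m_sex <- xtabs(Freq ~ Survived + Sex, data = titanic_df)

# Joint probabilities: each cell divided by the grand total.
p_joint <- prop.table(m_sex)

# Marginal probabilities: row or column sums divided by the grand total.
p_survived <- rowSums(m_sex) / sum(m_sex)   # P(Survived)
p_sex      <- colSums(m_sex) / sum(m_sex)   # P(Sex)

# Conditional probabilities: each cell divided by its column (or row) sum.
p_surv_given_sex <- prop.table(m_sex, margin = 2)  # P(Survived | Sex)
p_sex_given_surv <- prop.table(m_sex, margin = 1)  # P(Sex | Survived)

# Probability of survival, given gender:
round(p_surv_given_sex["Yes", ], 3)   # Male: 0.212, Female: 0.732
```

The same cell (e.g., female survivors) thus yields different values depending on the perspective: its joint probability is 344/2201, whereas its conditional probabilities are 344/470 (given gender) and 344/711 (given survival).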
18.4.1 ToDo 1
Steps of the matrix lens model (see the MLM package):
Determine a pair of a predictor and a criterion variable.
Frame a 2x2 matrix:
Matrix transformations:
m_1
rowSums(m_1)
colSums(m_1)
sum(m_1)
# Get four basic values:
(abcd <- c(m_1[1, 1], m_1[1, 2], m_1[2, 1], m_1[2, 2]))
# Probabilities and marginal probabilities:
prop.table(m_2) * 100
prop.table(m_2, margin = 1) * 100 # by rows
prop.table(m_2, margin = 2) * 100 # by cols
# ToDo: Diagonal (margin = 3)
# Test:
chisq.test(m_1)
chisq.test(m_2)
chisq.test(m_3)
# Visualization:
mosaicplot(t(m_2), color = c("skyblue1", "grey75"))
mosaicplot(t(m_3), color = c("skyblue1", "grey75"))
Focusing: Compute various metrics
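A minimal sketch of this step, assuming that Sex serves as predictor (Female = predicted survival) and Survived as criterion. The labels hi, fa, mi, and cr (hits, false alarms, misses, correct rejections) and the matrix name m_pc are naming assumptions for this illustration.

```r
# Frame a 2x2 matrix with rows = prediction (by Sex) and columns = criterion:
titanic_df <- as.data.frame(Titanic)
m_pc <- xtabs(Freq ~ Sex + Survived, data = titanic_df)

# Reorder so that the positive categories come first:
m_pc <- m_pc[c("Female", "Male"), c("Yes", "No")]

# Get four basic values (by row):
abcd <- c(m_pc[1, 1], m_pc[1, 2], m_pc[2, 1], m_pc[2, 2])
names(abcd) <- c("hi", "fa", "mi", "cr")

hi <- abcd[["hi"]]  # predicted survival, survived
fa <- abcd[["fa"]]  # predicted survival, died
mi <- abcd[["mi"]]  # predicted death, survived
cr <- abcd[["cr"]]  # predicted death, died

# Each metric focuses on a different part of the matrix:
metrics <- c(sens = hi / (hi + mi),   # sensitivity
             spec = cr / (cr + fa),   # specificity
             ppv  = hi / (hi + fa),   # positive predictive value
             npv  = cr / (cr + mi),   # negative predictive value
             acc  = (hi + cr) / sum(abcd))  # accuracy
round(metrics, 3)
# sens: 0.484, spec: 0.915, ppv: 0.732, npv: 0.788, acc: 0.776
```

Note that the metrics disagree considerably: the rule is highly specific but not very sensitive, which a single accuracy value would hide.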
When the predictor variable is continuous (and the criterion is binary): Determine an optimal cut-off point that maximizes some performance metric.
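The search for a cut-off point can be sketched as follows. The data are synthetic (simulated with rnorm() purely for illustration), accuracy is used as the metric to maximize, and all names are assumptions; any other metric could be substituted in accuracy_at().

```r
# Synthetic data (for illustration only, not from a real dataset):
set.seed(42)
n <- 200
criterion <- rep(c(TRUE, FALSE), each = n / 2)     # binary criterion
predictor <- c(rnorm(n / 2, mean = 1),             # higher values for TRUE cases
               rnorm(n / 2, mean = 0))             # lower values for FALSE cases

# Candidate cut-offs: all observed predictor values.
cutoffs <- sort(unique(predictor))

# Accuracy of the rule "predict TRUE if predictor >= cut":
accuracy_at <- function(cut) {
  mean((predictor >= cut) == criterion)
}

# Evaluate every candidate and select the best one:
acc <- vapply(cutoffs, accuracy_at, numeric(1))
best <- which.max(acc)
best_cutoff <- cutoffs[best]
best_acc    <- acc[best]

c(cutoff = best_cutoff, accuracy = best_acc)
```

Each candidate cut-off turns the continuous predictor into a binary one and thus defines its own 2x2 matrix. A cut-off found in this way is optimized for the data at hand and should ideally be evaluated on new data.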
18.4.2 Trees
- Goal: Illustrate cases of binary prediction with the FFTrees package (Phillips et al., 2017).
library(FFTrees)
library(tidyverse)
t_df <- FFTrees::titanic
t_tb <- as_tibble(t_df)
t_tb
## variables as factors:
# t_tb$survived = factor(t_tb$survived, levels = c(1, 0))
# t_tb$sex = factor(t_tb$sex, levels = c("female", "male"))
t_tb
t4 <- t_tb %>%
  group_by(class, age, sex, survived) %>%
  count()
t4
t3 <- t_tb %>%
  group_by(sex, survived) %>%
  count()
t3
xtabs(~ survived + sex, data = t_tb)  # cross-tabulate cases by survived and sex
# Pivot into 2x2 matrix:
t3 %>%
  pivot_wider(names_from = sex, values_from = n)
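The code above prepares the data but stops short of fitting a tree. A minimal sketch of that step could look as follows. It assumes that the FFTrees package is installed and that the survived variable can be converted into a logical criterion; the names t_data and titanic_fft are chosen for this illustration.

```r
# Fit fast-and-frugal trees to the titanic data (only if FFTrees is available):
if (requireNamespace("FFTrees", quietly = TRUE)) {

  library(FFTrees)

  t_data <- FFTrees::titanic

  # Assumption: survived is stored as logical or as 0/1,
  # so that as.logical() yields TRUE (survived) vs. FALSE (died):
  t_data$survived <- as.logical(t_data$survived)

  # Predict the binary criterion from all other variables:
  titanic_fft <- FFTrees(formula = survived ~ .,
                         data = t_data)

  # Print a summary of the best tree:
  print(titanic_fft)
}
```

Calling plot(titanic_fft) should then visualize the selected tree together with its 2x2 matrix of classification outcomes.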
Resources
Statistical methods for tackling classification tasks:
- See Chapter 4: Classification of James et al. (2021).
Note existing resources for cross-tabulations:
- Consider using the tab_xtab() and plot_xtab() functions from the sjPlot package.
- See examples at Chapter 8: Cross-Tabulation of R you Ready for R? (by Wade Roberts).