## 17.3 Qualitative prediction tasks

Examples:

• Screening test: Healthy tissue or signs of cancer?
• Will some market or individual stock go up or down today/this month/year?
• A classic task: Who survived the Titanic disaster?

Strategy: Use a 2x2 matrix as an analytic device. Beyond predicting category membership, we also want to evaluate the quality of the resulting predictions.

• qualitatively: Predict membership in a binary category.
• quantitatively: Describe the result and measure success (as accuracy or some other metric).

Start with a dataset:

1. Binary variables
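library(tidyverse)  # for dplyr verbs and pivot_wider()

t_df <- as.data.frame(Titanic)  # contingency table: 32 rows with a Freq column

# with(t_df, table(Survived, Sex))  # only counts rows, not passengers

t2 <- t_df %>%
group_by(Sex, Survived) %>%
summarise(n = n(),           # number of rows (combinations of Class and Age)
freq = sum(Freq))  # number of passengers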
t2
#> # A tibble: 4 x 4
#> # Groups:   Sex 
#>   Sex    Survived     n  freq
#>   <fct>  <fct>    <int> <dbl>
#> 1 Male   No           8  1364
#> 2 Male   Yes          8   367
#> 3 Female No           8   126
#> 4 Female Yes          8   344

# Pivot into 2x2 matrix (dropping n first, so that it cannot act as an id column):
t2 %>%
ungroup() %>%
select(-n) %>%
pivot_wider(names_from = Sex, values_from = freq)
#> # A tibble: 2 x 3
#>   Survived  Male Female
#>   <fct>    <dbl>  <dbl>
#> 1 No        1364    126
#> 2 Yes        367    344

Note that interpreting this table is not trivial, as it allows for different measures and for different perspectives on them. In terms of measures, we distinguish between frequency counts, proportions, and different kinds of probabilities:

• Frequencies: Absolute numbers show many more males (1731) than females (470).
• Proportions of survivors by gender: A majority of females survived, whereas a majority of males died.
• Probabilities can be joint, marginal, or conditional (depending on how their numerator and denominator are computed).
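
The three kinds of probabilities differ only in their numerator and denominator. A minimal sketch in base R, assuming the built-in `Titanic` table (the object names `m`, `p_joint`, etc. are illustrative and not part of the original code):

```r
# 2x2 table of passenger counts: Survived (rows) by Sex (columns)
m <- margin.table(Titanic, margin = c(4, 2))  # dimension 4 = Survived, 2 = Sex
m

# Joint probabilities: each cell divided by the grand total
p_joint <- prop.table(m)

# Marginal probabilities: row or column sums divided by the grand total
p_survived <- margin.table(m, margin = 1) / sum(m)
p_sex      <- margin.table(m, margin = 2) / sum(m)

# Conditional probabilities: each cell divided by its column (or row) sum
p_survived_given_sex <- prop.table(m, margin = 2)  # columns sum to 1
p_sex_given_survived <- prop.table(m, margin = 1)  # rows sum to 1

round(p_survived_given_sex, 3)
```

The same cell can thus be read in different ways: 344 of 2201 passengers were female survivors (joint), 344 of 470 females survived (conditional on sex), and 344 of 711 survivors were female (conditional on survival).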


### 17.3.1 ToDo 1

Steps of the matrix lens model:

• Determine a pair of a predictor and a criterion variable.

• Frame a 2x2 matrix:

# This version of frame() assumes that data contains individual cases
# (i.e., not a contingency table with a column of frequency counts)

# Inputs:
# - data = data with binary variables (after filter step)

# Output: Returns a 2x2 matrix

frame <- function(data, x, y,
z = NA, x_levels = NA, y_levels = NA, z_val = NA){

# 0. Verify that x, y (and z) are variables in data:
# (non-binary variables are only reported below, not rejected)
vars <- c(x, y, if (!is.na(z)) z)
if (!all(vars %in% names(data))){
stop("frame: x, y (and z) must be names of variables in data")
}

# Conditionalize data on z:
if (!is.na(z) && !is.na(z_val)){

vec_z <- data[[z]]
tof_z <- !is.na(vec_z) & vec_z == z_val

data <- data[tof_z, ]  # filter cases for which condition z == z_val is TRUE

}

nam_x <- x
nam_y <- y

vec_x <- data[[x]]  # [[ ]] returns a vector for data frames and tibbles
vec_y <- data[[y]]

# Note non-binary variables:
nval_x <- length(unique(vec_x))
if (nval_x != 2){
message(paste0("frame: x is non-binary (", nval_x, " unique values)"))
}

nval_y <- length(unique(vec_y))
if (nval_y != 2){
message(paste0("frame: y is non-binary (", nval_y, " unique values)"))
}

# as factors:
if (!all(is.na(x_levels))){
vec_x <- factor(vec_x, levels = x_levels, ordered = FALSE)
}

if (!all(is.na(y_levels))){
vec_y <- factor(vec_y, levels = y_levels, ordered = FALSE)
}

table(vec_y, vec_x, dnn = c(nam_y, nam_x))

} # frame().

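# Check (using individual cases with lowercase variable names):
t_df <- FFTrees::titanic  # one row per passenger (not the contingency table used above)

# (a) Basics:
(m_1 <- frame(data = t_df, x = "sex", y = "survived"))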
is.matrix(m_1)
typeof(m_1)

# (b) with factors:
m_2 <- frame(data = t_df, x = "sex", y = "survived",
x_levels = c("male", "female"),
y_levels = c(1, 0))
m_2
sum(m_2)

# (c) factors and conditionalized:
(m_3 <- frame(t_df, x = "sex", y = "survived", z = "age", z_val = "child",
x_levels = c("male", "female"),
y_levels = c(1, 0)))
sum(m_3)

# (d) Note: Non-binary variables:
frame(t_df, x = "class", y = "survived")
frame(t_df, y = "class", x = "survived")

Matrix transformations:

m_1
rowSums(m_1)
colSums(m_1)
sum(m_1)

# Get four basic values:
(abcd <- c(m_1[1, 1], m_1[1, 2], m_1[2, 1], m_1[2, 2]))

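# Joint and conditional probabilities (in percent):
prop.table(m_2) * 100              # joint (all cells sum to 100)
prop.table(m_2, margin = 1) * 100  # conditional on rows (each row sums to 100)
prop.table(m_2, margin = 2) * 100  # conditional on cols (each column sums to 100)
# ToDo: Diagonal perspective (a 2x2 table has no margin = 3)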

# Test:
chisq.test(m_1)
chisq.test(m_2)
chisq.test(m_3)

# Visualization:
mosaicplot(t(m_2), color = c("skyblue1", "grey75"))
mosaicplot(t(m_3), color = c("skyblue1", "grey75"))
• Focusing: Compute various metrics
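
As an illustration of this step, the following sketch computes some common metrics from the four cells of a 2x2 matrix. The helper `metrics_2x2()` and its layout (rows = criterion, columns = prediction, positive category first) are assumptions made for this example, not an existing function. The counts are the passenger frequencies of the built-in `Titanic` table:

```r
# Hypothetical helper: common metrics for a 2x2 matrix with
# rows = criterion (actual) and columns = prediction, positive category first.
metrics_2x2 <- function(m){
  stopifnot(all(dim(m) == c(2, 2)))
  hi <- m[1, 1]  # hits: positive cases predicted as positive
  mi <- m[1, 2]  # misses: positive cases predicted as negative
  fa <- m[2, 1]  # false alarms: negative cases predicted as positive
  cr <- m[2, 2]  # correct rejections: negative cases predicted as negative
  c(accuracy    = (hi + cr) / sum(m),
    sensitivity = hi / (hi + mi),
    specificity = cr / (cr + fa),
    ppv         = hi / (hi + fa),
    npv         = cr / (cr + mi))
}

# Example: predict survival from sex (rule: "female -> survived"):
m <- matrix(c(344, 367,    # survived:     female (hit), male (miss)
              126, 1364),  # not survived: female (false alarm), male (correct rejection)
            nrow = 2, byrow = TRUE,
            dimnames = list(survived  = c("yes", "no"),
                            predicted = c("yes", "no")))

round(metrics_2x2(m), 3)
```

Under this simple rule, accuracy is about 0.78, but sensitivity is below 0.5, as more survivors were male (367) than female (344). Which metric matters most depends on the costs of the two types of errors.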

• When predictor variable is continuous (and criterion is binary): Determine an optimal cut-off point to maximize some criterion.
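
A minimal sketch of such a cut-off search, assuming simulated data and accuracy as the criterion to maximize (the data, the helper `accuracy_at()`, and all object names are hypothetical and chosen for illustration only):

```r
set.seed(42)  # reproducible simulated data (not from the Titanic data)

# Simulated data: binary criterion y and continuous predictor x,
# with higher x values for positive cases on average:
n <- 200
y <- rep(c(TRUE, FALSE), each = n / 2)
x <- c(rnorm(n / 2, mean = 1), rnorm(n / 2, mean = 0))

# Accuracy of the rule "predict TRUE if x >= cut":
accuracy_at <- function(cut, x, y){
  mean((x >= cut) == y)
}

# Candidate cut-offs: all observed predictor values
cuts <- sort(unique(x))
acc  <- vapply(cuts, function(k) accuracy_at(k, x, y), numeric(1))

best_cut <- cuts[which.max(acc)]  # first cut-off with maximal accuracy
best_acc <- max(acc)

c(cut = best_cut, accuracy = best_acc)
```

Each candidate cut-off turns the continuous predictor into a binary one and thus defines its own 2x2 matrix. Maximizing another criterion (e.g., sensitivity at a fixed specificity) may yield a different cut-off, and an accuracy value optimized on the same data tends to be optimistic for new data.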

### 17.3.2 Trees

• Goal: Illustrate cases of binary prediction using the FFTrees package.
library(FFTrees)
library(tidyverse)

t_df <- FFTrees::titanic
t_tb <- as_tibble(t_df)
t_tb

## variables as factors:
# t_tb$survived = factor(t_tb$survived, levels = c(1, 0))
# t_tb$sex = factor(t_tb$sex, levels = c("female", "male"))

t_tb

t4 <- t_tb %>%
group_by(class, age, sex, survived) %>%
count()
t4

t3 <- t_tb %>%
group_by(sex, survived) %>%
count()
t3

xtabs(~ survived + sex, data = t_tb)  # cross-tabulate cases: survived (rows) by sex (cols)

# Pivot into 2x2 matrix:
t3 %>%
ungroup() %>%  # count() keeps the grouping by sex and survived
pivot_wider(names_from = sex, values_from = n)

#### Resources

Note existing resources for cross-tabulations: