Classification: Logistic model

Learning outcomes/objective: Learn…

1 Predicting Recidivism: The data

Table 1 depicts the first rows/columns of the dataset (COMPAS) we use.

Table 1: Dataset/sample
id name compas_screening_date is_recid is_recid_factor age priors_count
1 miguel hernandez 2013-08-14 0 no 69 0
3 kevon dixon 2013-01-27 1 yes 34 0
4 ed philo 2013-04-14 1 yes 24 4
5 marcu brown 2013-01-13 0 no 23 1
6 bouthy pierrelouis 2013-03-26 0 no 43 2
7 marsha miles 2013-11-30 0 no 44 0
8 edward riddle 2014-02-19 1 yes 41 14
9 steven stewart 2013-08-30 NA NA 43 3
10 elizabeth thieme 2014-03-16 0 no 39 0
13 bo bradac 2013-11-04 1 yes 21 1

2 Predicting Recidivism: Background story

  • Background story by ProPublica: Machine Bias
  • Replication and extension by Dressel and Farid (2018): The Accuracy, Fairness, and Limits of Predicting Recidivism
    • Abstract: “Algorithms for predicting recidivism are commonly used to assess a criminal defendant’s likelihood of committing a crime. […] used in pretrial, parole, and sentencing decisions. […] We show, however, that the widely used commercial risk assessment software COMPAS is no more accurate or fair than predictions made by people with little or no criminal justice expertise. In addition, despite COMPAS’s collection of 137 features, the same accuracy can be achieved with a simple linear classifier with only two features.”
  • Very nice lab by Lee, Du, and Guerzhoy (2020): Auditing the COMPAS Score: Predictive Modeling and Algorithmic Fairness
  • We will work with the corresponding data and use it to grasp various concepts underlying statistical/machine learning

3 Logistic model

3.1 Logistic model: Equation (1)

  • Predicting recidivism (0/1): How should we model the relationship between \(Pr(Y=1|X)=p(X)\) and \(X\)?
    • See Figure 1 below
    • Use either linear probability model or logistic regression
  • Linear probability model: \(p(X)=\beta_{0}+\beta_{1}X\)
    • Linear predictions of our outcome (probabilities) can fall outside the [0,1] range
  • Logistic regression: \(p(X)=\frac{e^{\beta_{0}+\beta_{1}X}}{1+e^{\beta_{0}+\beta_{1}X}}\)
    • …the logistic function forces the predictions into the [0,1] range (see the short numeric sketch after this list)
    • odds: \(\frac{p(X)}{1-p(X)}=e^{\beta_{0}+\beta_{1}X}\) (range: \([0,\infty)\); the higher the odds, the higher the probability of recidivism/default)
    • log-odds/logit: \(\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_{0}+\beta_{1}X\) (James et al. 2013, 132)
      • …take the logarithm on both sides.
      • Increasing X by one unit increases the log odds by \(\beta_{1}\)
      • Usually we get that in R output
  • Estimation of \(\beta_{0}\) and \(\beta_{1}\) usually relies on maximum likelihood
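
To see the difference concretely, below is a minimal numeric sketch (with made-up coefficients, not estimates from the COMPAS data) of how the logistic function keeps predictions inside [0,1] while the linear probability model can leave that range:

# Made-up coefficients, for illustration only (not estimated from any data)
b0 <- 1.2
b1 <- -0.03
x  <- c(0, 20, 40, 60, 80)

# Linear probability model: p(X) = b0 + b1*X, can fall outside [0, 1]
p_lpm <- b0 + b1 * x

# Logistic model: p(X) = exp(b0 + b1*X)/(1 + exp(b0 + b1*X)), i.e., plogis()
p_logit <- plogis(b0 + b1 * x)

round(cbind(x, p_lpm, p_logit), 3)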

3.2 Logistic model: Visually

  • Figure 1 displays a scatterplot of predictor Balance and outcome Default with linear regression on the left and logistic regression on the right.

Figure 1: Figure 4.2 in James et al. (2013, 131)

3.3 Logistic model: Equation (2)

  • Logistic regression (LR) models the probability that \(Y\) belongs to a particular category (0 or 1)
    • Rather than modeling response \(Y\) directly
  • COMPAS data: Model probability to recidivate (reoffend)
    • Outcome \(y\): Recidivism is_recid (0,1,0,0,1,1,...)
    • Various predictors \(x's\)
      • age = age
      • prior offenses = priors_count
  • Use LR to obtain predicted values \(\hat{y}\)
    • Predicted values will range between 0 and 1 = probabilities
    • Depend on input/features (e.g., age, prior offences)
  • Convert predicted values (probabilities) to a binary variable
    • e.g., predict individuals will recidivate (is_recid = Yes) if Pr(is_recid=Yes|age) > 0.5
    • Here we call this variable y_hat_01 (a short sketch of this conversion step follows after this list)
  • Source: James et al. (2013, chap. 4.3)
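
A minimal sketch of this conversion step, using a made-up vector of predicted probabilities (in the lab below these come from augment()/predict()):

# Made-up predicted probabilities (illustration only)
y_hat <- c(0.12, 0.47, 0.51, 0.83, 0.65)

# Convert probabilities into a 0/1 prediction using a 0.5 threshold
y_hat_01 <- ifelse(y_hat > 0.5, 1, 0)
y_hat_01
# [1] 0 0 1 1 1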

3.4 Logistic model: Estimation

  • Estimate model in R: glm(y ~ x1 + x2, family = binomial, data = data_train)
fit <- glm(as.factor(is_recid) ~ age + priors_count, 
           family = binomial, 
           data = data)
cat(paste(capture.output(summary(fit))[6:10], collapse="\n"))
Coefficients:
              Estimate Std. Error z value            Pr(>|z|)    
(Intercept)   1.029632   0.084679   12.16 <0.0000000000000002 ***
age          -0.048161   0.002478  -19.44 <0.0000000000000002 ***
priors_count  0.169292   0.007331   23.09 <0.0000000000000002 ***
  • R output shows log odds: e.g., a one-unit increase in age is associated with a change in the log odds of is_recid of -0.05 units (i.e., a decrease)
  • Log odds are difficult to interpret directly… it is much easier to work with odds ratios or predicted probabilities (see the small sketch below)
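
As an aside (a small sketch, not part of the original output above), exponentiating the log-odds coefficients turns them into odds ratios, which are a bit easier to communicate:

# Coefficients copied from the summary output above (log-odds scale)
coefs <- c("(Intercept)" = 1.029632, age = -0.048161, priors_count = 0.169292)

# Odds ratios: multiplicative change in the odds for a one-unit increase
round(exp(coefs), 3)
# e.g., exp(-0.048) ~ 0.95: each additional year of age multiplies the odds
# of recidivism by roughly 0.95 (holding priors_count constant)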

3.5 Logistic model: Prediction

  • augment(): Predict values in R (or predict())

    • Once the coefficients have been estimated, it is a simple matter to compute the probability of the outcome for given values of our predictors (James et al. 2013, 134)
    • augment(fit, newdata = ..., type.predict = "response") (see footnote 1): Predict probability for each unit
    • Use argument type.predict="response" to output probabilities of form \(P(Y=1|X)\) (as opposed to other information such as the logit)
    • Use argument newdata = ... to obtain predictions for new data
  • predict(fit, newdata = data_predict, type = "response"): Predict probability setting values for particular Xs (contained in data_predict)

data_new = data.frame(age = c(20, 20, 40, 40),
                          priors_count = c(0, 2, 0, 2))
data_new <- augment(fit, newdata = data_new, type.predict = "response")
data_new
# A tibble: 4 × 3
    age priors_count .fitted
  <dbl>        <dbl>   <dbl>
1    20            0   0.517
2    20            2   0.600
3    40            0   0.290
4    40            2   0.364
  • Q: How would you interpret these values?
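
As a cross-check (a hand calculation using the coefficients from the summary output above), the first .fitted value can be reproduced with plogis():

# Predicted probability for age = 20, priors_count = 0
plogis(1.029632 + (-0.048161) * 20 + 0.169292 * 0)
# ~ 0.517, matching the first .fitted value above (up to rounding)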

4 Classification (logistic model): Accuracy

  • See James et al. (2013, Ch. 4.4.3)

4.1 True/false negatives/positives

  • Q: The tables in Figure 2 and Figure 3 illustrate how to calculate different accuracy metrics. Assuming the outcome is cancer (1 = yes, 0 = no), how could we interpret the different cells?

Figure 2: Source: James et al. (2013), Ch. 4.4.3

Figure 3: Source: James et al. (2013), Ch. 4.4.3

4.2 Accuracy

  • Accuracy = Correct Classification Rate (CCR)
    • \(\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} = \frac{TP + TN}{TP + TN + FP + FN}\)
  • Error rate
    • \(1 - \text{Accuracy}\)

4.3 Precision and recall (binary classification)

  • Precision \((TP/P^{*})\)
    • Number of true positive (TP) results divided by the number of all results predicted as positive (P*), including those incorrectly identified as positive
    • e.g., \(\text{Precision} = \frac{\text{N of patients that have cancer AND are predicted to have cancer (true positives TP)}}{\text{N of all patients predicted to have cancer (positives P*)}}\)
  • Recall \((TP/P)\)
    • Number of true positive results divided by the number of all samples that should have been identified as positive (P)
    • e.g., \(\text{Recall} = \frac{\text{N of patients that have cancer AND are predicted to have cancer (true positives TP)}}{\text{N of all patients that have cancer and SHOULD have been predicted to have cancer (the "trues" P)}}\)



  • F1-Score: \(F1 = 2 \times \frac{precision \times recall}{precision + recall}\) (a small worked example follows after this list)
    • The highest possible value of an F-score is 1.0, indicating perfect precision and recall; the lowest possible value is 0, which occurs if either precision or recall is zero
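
A small worked example of these formulas (the counts below are made up, not from the COMPAS data):

# Made-up counts, for illustration only
TP <- 80; FP <- 20; FN <- 40

precision <- TP / (TP + FP)   # TP / P*
recall    <- TP / (TP + FN)   # TP / P
f1        <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, f1 = f1), 2)
# precision = 0.80, recall = 0.67, f1 = 0.73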

4.4 Exercise: recidivism

  • Take our recidivism example where reoffend = recidivate = 1 = positive and not reoffend = not recidivate = 0 and discuss the table in Figure 2 using this example.
  • Take our recidivism example and discuss the different definitions (Column 2) in the Table in Figure 3. Explain what precision and recall would be in that case.
Answer
  • Precision \((TP/P^{*})\): \(\frac{\text{N of prisoners that recidivate AND are predicted to recidivate (true positives TP)}}{\text{N of all prisoners predicted to recidivate}}\)

  • Recall \((TP/P)\): \(\frac{\text{N of prisoners that recidivate AND are predicted to recidivate (true positives TP)}}{\text{N of all prisoners that recidivate and SHOULD have been predicted to recidivate (the "trues" P)}}\)

4.5 Exercise: Calculate different accuracy measures

  • Below you find a table cross-tabulating the true/actual values (rows) with the predicted values (columns) (important: conf_mat() swaps rows/columns)
    • Please use the tables in Figure 2 and Figure 3 to calculate the Accuracy, Type-I-error, Precision, Recall and F1-score for the data in the confusion matrix shown in Figure 4.

Figure 4: Confusion matrix (Source: Own illustration)
Table with labels (helper!)

Figure 5: Confusion matrix (Source: Own illustration)
Solutions
  • Accuracy: \((540 + 424)/1443 = 0.67\)
  • Type-I-error: \(224/764 = 0.29\)
  • Precision: \(424/648 = 0.65\)
  • Recall: \(424/679 = 0.62\)
  • F1-score: \(2*(0.65*0.62)/(0.65+0.62) = 0.63\)
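
A quick base-R check of these numbers, using the cell counts from the confusion matrix in Figure 4 (TN = 540, FP = 224, FN = 255, TP = 424):

TN <- 540; FP <- 224; FN <- 255; TP <- 424

accuracy  <- (TP + TN) / (TP + TN + FP + FN)
type1     <- FP / (FP + TN)                 # Type-I error rate (FPR)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, type1 = type1,
        precision = precision, recall = recall, f1 = f1), 2)
# accuracy ~ 0.67, type1 ~ 0.29, precision ~ 0.65, recall ~ 0.62;
# F1 comes out as ~0.64 here because it uses the unrounded precision/recall
# (vs. 0.63 above, which is computed from the rounded values)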

4.6 Error rates and ROC curve (binary classification)

  • See James et al. (2013, Ch. 4.4.3)
  • The error rates are a function of the classification threshold
    • If we lower the classification threshold (e.g., use a predicted probability of 0.3 to classify someone as reoffender) we classify more items as positive (e.g. predict them to recidivate), thus increasing both False Positives and True Positives
    • See James et al. (2013), Fig 4.7, p. 147
  • Goal: we want to know how error rates develop as a function of the classification threshold
  • The ROC curve is a popular graphic for simultaneously displaying the two relevant rates (the false positive rate and the true positive rate) for all possible thresholds
    • See James et al. (2013), Fig 4.8, p. 148 where \(1 - specificity = FP/N = \text{false positive rate (FPR)}\) and \(sensitivity = TP/P = \text{true positive rate (TPR)}\)
  • The Area Under the (ROC) Curve (AUC) summarizes the overall performance of a classifier across all possible thresholds (James et al. 2013, 147)
    • The maximum is 1, hence values closer to it are preferable, as shown in Figure 6 (maximize TPR, minimize FPR!); a small numeric sketch of how FPR/TPR move with the threshold follows below Figure 6

Figure 6: Source: The ROC space for a “better” and “worse” classifier (Wikipedia).
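
Below is a small made-up example (not from the COMPAS data) showing how FPR and TPR both rise as the classification threshold is lowered:

# Made-up true classes and predicted probabilities (illustration only)
truth <- c(0, 0, 0, 1, 1, 1)
p_hat <- c(0.2, 0.4, 0.6, 0.5, 0.7, 0.9)

roc_point <- function(threshold) {
  pred <- as.numeric(p_hat >= threshold)
  c(threshold = threshold,
    FPR = sum(pred == 1 & truth == 0) / sum(truth == 0),  # 1 - specificity
    TPR = sum(pred == 1 & truth == 1) / sum(truth == 1))  # sensitivity
}

# Lowering the threshold increases both FPR and TPR
rbind(roc_point(0.8), roc_point(0.5), roc_point(0.3))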

4.7 Weighting FPR and TPR differently (recidivism)

  • Outcome: Recidivism where individual recidivates (1) or not (0)
  • False Positive (FP): Model predicts an individual will recidivate when they actually do not. This could lead to unnecessary interventions or harsher sentences for individuals.
  • False Negative (FN): Model predicts an individual will not recidivate when they actually do. This could result in individuals who are at risk being released without proper intervention, potentially leading to repeat offenses.
  • Assign costs
    • Cost of False Positive (FP): unnecessary prison time (costs for individual and society)
    • Cost of False Negative (FN): repeat offenses (costs for society probably higher, hence probably more important) → weight FN more strongly
      • Pick a threshold that focuses on minimizing FN (while ideally keeping both FPR and FNR low!); a minimal re-thresholding sketch follows after this list
  • Important: Probability threshold of classifying someone as 0/1 not shown in Figure 6
    • We need to get it from the underlying calculations in Table 2
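
A minimal sketch of how one could lower the classification threshold to, say, 0.3 instead of the default 0.5 used for .pred_class (this assumes the objects fit1 and data_test as created in the lab section below):

# Re-classify using a 0.3 threshold on the predicted probability of "yes"
data_test %>%
  augment(x = fit1, type.predict = "response") %>%
  mutate(.pred_class_03 = factor(ifelse(.pred_yes >= 0.3, "yes", "no"),
                                 levels = levels(is_recid_factor))) %>%
  conf_mat(truth = is_recid_factor, estimate = .pred_class_03)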

5 Lab: Classification using a logistic model

Learning outcomes/objective: Learn…

  • …how to use training and test datasets for ML in R.
  • …how to predict binary outcomes in R (using a simple logistic regression).
  • …how to assess accuracy in R (logistic regression).

5.1 The data

Our lab is based on Lee, Du, and Guerzhoy (2020) and on James et al. (2013, chap. 4.6.2) with various modifications. We will be using the dataset at LINK that is described by Angwin et al. (2016). The data are based on the COMPAS risk assessment tool (RAT). RATs are increasingly being used to assess a criminal defendant’s probability of re-offending. While COMPAS seemingly uses a larger number of features/variables for the prediction, Dressel and Farid (2018) showed that a model that includes only a defendant’s sex, age, and number of priors (prior offences) can be used to arrive at predictions of equivalent quality.

Overview of Compas dataset variables
  • id: ID of prisoner, numeric
  • name: Name of prisoner, factor
  • compas_screening_date: Date of compass screening, date
  • decile_score: the decile of the COMPAS score, numeric
  • is_recid: whether someone reoffended/recidivated (=1) or not (=0), numeric
  • is_recid_factor: same but factor variable
  • age: a continuous variable containing the age (in years) of the person, numeric
  • age_cat: age categorized
  • priors_count: number of prior crimes committed, numeric
  • sex: gender with levels “Female” and “Male”, factor
  • race: race of the person, factor
  • juv_fel_count: number of juvenile felonies, numeric
  • juv_misd_count: number of juvenile misdemeanors, numeric
  • juv_other_count: number of prior juvenile convictions that are not considered either felonies or misdemeanors, numeric

We first import the data into R:

# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))

5.2 Inspecting the dataset

The variables were named quite well, so that they are often self-explanatory:

  • decile_score is the COMPAS score
  • is_recid indicates whether someone reoffended (1 = recidivated = reoffended, 0 = not)
  • race contains the race
  • age contains age.
  • priors_count contains the number of prior offenses
  • etc.

First we should make sure to really explore/understand our data. How many observations are there? How many different variables (features) are there? What is the scale of the outcome? What are the averages etc.? What kind of units are in your dataset?

nrow(data)
[1] 7214
ncol(data)
[1] 14
dim(data)
[1] 7214   14
str(data) # Better use glimpse()
tibble [7,214 × 14] (S3: tbl_df/tbl/data.frame)
 $ id                   : num [1:7214] 1 3 4 5 6 7 8 9 10 13 ...
 $ name                 : Factor w/ 7158 levels "aajah herrington",..: 4922 4016 1989 4474 690 4597 2035 6345 2091 680 ...
 $ compas_screening_date: Date[1:7214], format: "2013-08-14" "2013-01-27" ...
 $ decile_score         : num [1:7214] 1 3 4 8 1 1 6 4 1 3 ...
 $ is_recid             : num [1:7214] 0 1 1 0 0 0 1 NA 0 1 ...
 $ is_recid_factor      : Factor w/ 2 levels "no","yes": 1 2 2 1 1 1 2 NA 1 2 ...
 $ age                  : num [1:7214] 69 34 24 23 43 44 41 43 39 21 ...
 $ age_cat              : Factor w/ 3 levels "25 - 45","Greater than 45",..: 2 1 3 3 1 1 1 1 1 3 ...
 $ priors_count         : num [1:7214] 0 0 4 1 2 0 14 3 0 1 ...
 $ sex                  : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 1 2 ...
 $ race                 : Factor w/ 6 levels "African-American",..: 6 1 1 1 6 6 3 6 3 3 ...
 $ juv_fel_count        : num [1:7214] 0 0 0 0 0 0 0 0 0 0 ...
 $ juv_misd_count       : num [1:7214] 0 0 0 1 0 0 0 0 0 0 ...
 $ juv_other_count      : num [1:7214] 0 0 1 0 0 0 0 0 0 0 ...
# glimpse(data)
# skimr::skim(data)

Also always inspect summary statistics for both numeric and categorical variables to get a better understanding of the data. Often such summary statistics will also reveal errors in the data.

Q: Does anything strike you as interesting in the two tables below?

library(modelsummary)
datasummary_skim(data, type = "numeric", output = "html")
Unique (#) Missing (%) Mean SD Min Median Max
id 7214 0 5501.3 3175.7 1.0 5509.5 11001.0
decile_score 10 0 4.5 2.9 1.0 4.0 10.0
is_recid 3 9 0.5 0.5 0.0 0.0 1.0
age 65 0 34.8 11.9 18.0 31.0 96.0
priors_count 37 0 3.5 4.9 0.0 2.0 38.0
juv_fel_count 11 0 0.1 0.5 0.0 0.0 20.0
juv_misd_count 10 0 0.1 0.5 0.0 0.0 13.0
juv_other_count 10 0 0.1 0.5 0.0 0.0 17.0
datasummary_skim(data, type = "categorical", output = "html")
N %
is_recid_factor no 3422 47.4
yes 3178 44.1
age_cat 25 - 45 4109 57.0
Greater than 45 1576 21.8
Less than 25 1529 21.2
sex Female 1395 19.3
Male 5819 80.7
race African-American 3696 51.2
Asian 32 0.4
Caucasian 2454 34.0
Hispanic 637 8.8
Native American 18 0.2
Other 377 5.2

The table() function is also useful to get an overview of variables. Use the argument useNA = "always" to display potential missings.

table(data$race, useNA = "always")

African-American            Asian        Caucasian         Hispanic 
            3696               32             2454              637 
 Native American            Other             <NA> 
              18              377                0 
table(data$is_recid, data$is_recid_factor)
   
      no  yes
  0 3422    0
  1    0 3178
table(data$decile_score)

   1    2    3    4    5    6    7    8    9   10 
1440  941  747  769  681  641  592  512  508  383 

Finally, there are some helpful functions to explore missing data included in the naniar package. Can you decode those graphs? What do they show? (for publications the design would need to be improved)

vis_miss(data)

gg_miss_upset(data, nsets = 2, nintersects = 10) 

# Ideally, use higher number of nsets/nintersects 
# with more screen space

5.3 Exploring potential predictors

A correlation matrix can give us first hints regarding important predictors.

  • Q: Can we identify anything interesting?
plot_correlation(data %>% dplyr::select(is_recid, age, 
                                        priors_count,sex, 
                                        race,
                                        juv_fel_count), 
                 cor_args = list("use" = "pairwise.complete.obs"))

5.4 Building a first logistic ML model

Below we estimate a simple logistic regression machine learning model using only one split into training and test data. To start, we check whether there are any missings on our outcome variable is_recid_factor (we use the factor version of our outcome variable). We extract the subset of individuals for whom our outcome is_recid_factor is missing, store them in data_missing_outcome and delete those individuals from the actual dataset data.

# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid_factor))
  dim(data_missing_outcome)
[1] 614  14
# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid_factor) # ?drop_na
  dim(data)
[1] 6600   14



Then we split the data into training and test data.

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect
<Training/Testing/Total>
<5280/1320/6600>
# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  dim(data_train)
[1] 5280   14
  dim(data_test)
[1] 1320   14



Subsequently, we estimate our logistic model based on the training data. Below we just use 1 predictor:

# Fit the model
  fit1 <- logistic_reg() %>% # logistic model
        set_engine("glm") %>% # define lm package/function
        set_mode("classification") %>%# define mode
        fit(is_recid_factor ~ age, # fit the model
        data = data_train) # based on training data
  fit1 # Class model output with summary(fit1$fit)
parsnip model object


Call:  stats::glm(formula = is_recid_factor ~ age, family = stats::binomial, 
    data = data)

Coefficients:
(Intercept)          age  
    1.17594     -0.03601  

Degrees of Freedom: 5279 Total (i.e. Null);  5278 Residual
Null Deviance:      7313 
Residual Deviance: 7090     AIC: 7094



Then, we predict our outcome in the training data and evaluate the accuracy in the training data.

  • Q: How can we interpret the accuracy metrics? Are we happy?
# Training data: Add predictions 
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()
# A tibble: 6 × 5
  is_recid_factor   age .pred_class .pred_no .pred_yes
  <fct>           <dbl> <fct>          <dbl>     <dbl>
1 no                 26 yes            0.440     0.560
2 no                 27 yes            0.449     0.551
3 yes                41 no             0.575     0.425
4 no                 50 no             0.651     0.349
5 yes                28 yes            0.458     0.542
6 no                 33 no             0.503     0.497
# Cross-classification table (Columns = Truth, Rows = Predicted)
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
          Truth
Prediction   no  yes
       no  1505  948
       yes 1225 1602
# Training data: Metrics
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.588
2 kap      binary         0.179



Finally, we can also predict data for the test data and evaluate the accuracy in the test data.

# Test data: Add predictions 
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()
# A tibble: 6 × 5
  is_recid_factor   age .pred_class .pred_no .pred_yes
  <fct>           <dbl> <fct>          <dbl>     <dbl>
1 no                 43 no             0.592     0.408
2 no                 39 no             0.557     0.443
3 no                 27 yes            0.449     0.551
4 no                 37 no             0.539     0.461
5 no                 21 yes            0.397     0.603
6 no                 43 no             0.592     0.408
# Cross-classification table (Columns = Truth, Rows = Predicted)
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
          Truth
Prediction  no yes
       no  379 255
       yes 313 373
# Test data: Metrics
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)
# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.570
2 kap      binary         0.141
Possible reasons why accuracy may be higher on the test data than on the training data:

  • Bad training accuracy: Already bad accuracy in training data is easy to beat.
  • Small Dataset: Test set may contain easier examples due to small dataset size.
  • Overfitting to Test Data: Repeated tweaking against the same test set can lead to overfitting.
  • Data Leakage: Information from the test set influencing the model during training.
  • Strong Regularization: Techniques like dropout can make the model generalize better but underperform on training data.
  • Evaluation Methodology: The splitting method can affect results, e.g., stratified splits.
  • Random Variation: Small test sets can lead to non-representative results.
  • Improper Training: Inadequate training epochs or improper learning rates.



Below is the code to visualize the ROC curve. The function roc_curve() calculates the data for the ROC curve.

# Calculate data for ROC curve - threshold, specificity, sensitivity
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no) %>%
  head() %>% kable()
Table 2: Data: ROC curve - threshold, specificity, sensitivity
.threshold specificity sensitivity
-Inf 0.0000000 1.0000000
0.3794654 0.0000000 1.0000000
0.3879800 0.0079618 1.0000000
0.3965635 0.0429936 0.9855491
0.4052112 0.1114650 0.9566474
0.4139181 0.1528662 0.9219653



We can then visualize it using autoplot(). Since it’s a ggplot we can change labels etc. with +. Subsequently, we can use roc_auc() to calculate the area under the curve.

# Calculate data for ROC curve - threshold, specificity, sensitivity
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no) %>%
  head()
# A tibble: 6 × 3
  .threshold specificity sensitivity
       <dbl>       <dbl>       <dbl>
1   -Inf         0             1    
2      0.379     0             1    
3      0.388     0.00796       1    
4      0.397     0.0430        0.986
5      0.405     0.111         0.957
6      0.414     0.153         0.922
# Calculate data for ROC curve and visualize
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no)  %>% # Default: Uses first class (=0=no)
  autoplot() +
  xlab("False Positive Rate (FPR, 1 - specificity)") +
  ylab("True Positive Rate (TPR, sensitivity, recall)")

# Calculate area under the curve
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_auc(truth = is_recid_factor, .pred_no)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.597



If we are happy with the accuracy in the test data we could then use our model to predict the outcome for those individuals for whom we did not observe it, i.e., the rows we stored in data_missing_outcome.

# Missing outcome data predictions
  data_missing_outcome <- data_missing_outcome %>%
                                augment(x = fit1, type.predict = "response") 

  data_missing_outcome %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
        head()
# A tibble: 6 × 5
  is_recid_factor   age .pred_class .pred_no .pred_yes
  <fct>           <dbl> <fct>          <dbl>     <dbl>
1 <NA>               43 no             0.592     0.408
2 <NA>               31 yes            0.485     0.515
3 <NA>               21 yes            0.397     0.603
4 <NA>               32 yes            0.494     0.506
5 <NA>               30 yes            0.476     0.524
6 <NA>               21 yes            0.397     0.603

5.4.1 Visualizing predictions

It is often insightful to visualize an ML model’s predictions, e.g., exploring whether our predictions are better or worse for certain population subsets (e.g., the young). In other words, whether the model works better/worse across groups (a small group-wise accuracy sketch follows after the output below). Below we take data_test from above, add the predictions and inspect them together with the predictors.

data_test %>%
    augment(x = fit1, type.predict = "response") %>%
        select(is_recid_factor, .pred_class, .pred_no, .pred_yes, age, sex, race, priors_count)
# A tibble: 1,320 × 8
   is_recid_factor .pred_class .pred_no .pred_yes   age sex   race  priors_count
   <fct>           <fct>          <dbl>     <dbl> <dbl> <fct> <fct>        <dbl>
 1 no              no             0.592     0.408    43 Male  Other            2
 2 no              no             0.557     0.443    39 Fema… Cauc…            0
 3 no              yes            0.449     0.551    27 Male  Cauc…            0
 4 no              no             0.539     0.461    37 Fema… Cauc…            0
 5 no              yes            0.397     0.603    21 Fema… Cauc…            0
 6 no              no             0.592     0.408    43 Male  Cauc…            1
 7 no              yes            0.494     0.506    32 Male  Other            0
 8 no              no             0.691     0.309    55 Male  Cauc…            0
 9 yes             yes            0.467     0.533    29 Male  Afri…            0
10 no              yes            0.458     0.542    28 Fema… Other            0
# ℹ 1,310 more rows
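
A minimal sketch of such a group-wise check (e.g., accuracy by age category); yardstick metrics should respect dplyr groups, so we can group before calling accuracy() (this again assumes fit1 and data_test from above):

# Accuracy by age category
data_test %>%
  augment(x = fit1, type.predict = "response") %>%
  group_by(age_cat) %>%
  accuracy(truth = is_recid_factor, estimate = .pred_class)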

Figure 7 visualizes the variation of the predicted probabilities. What can we see?

# Visualize errors and predictors
data_test %>%
    augment(x = fit1, type.predict = "response") %>%
ggplot(aes(x = .pred_yes)) +
  geom_histogram() +
  xlim(0,1)

Figure 7: Histogram of predicted probabilities

In Figure 8 we visualize the predicted probability of recidivating as a function of covariates/predictors after discretizing and factorizing some variables. Importantly, the ML model is only based on one of those variables, namely age, which is why the predictions do not vary that strongly with the other variables.

Q: What can we observe? What problem does that point to?

# Visualize errors and predictors
library(patchwork)
library(ggplot2)
data_plot <- data_test %>%
    augment(x = fit1, type.predict = "response") %>%
  select(.pred_yes, age, sex, race, priors_count) %>%
  mutate(age = cut_interval(age, 8),
         priors_count = as.factor(priors_count))


p1 <- ggplot(data = data_plot, aes(y = .pred_yes, x = sex)) + 
  geom_boxplot()
p2 <- ggplot(data = data_plot, aes(y = .pred_yes, x =  age)) + 
  geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(data = data_plot, aes(y = .pred_yes, x = race)) + 
      geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(data = data_plot, aes(y = .pred_yes, x = priors_count)) + 
      geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p1 + p2 + p3 + p4

Figure 8: Visualizing predicted probability (for recidvism = yes) as a function of predictors/covariates

5.5 Exercise: Enhance simple logistic model

  1. Use the code below to load the data.
  2. In the next chunk you find the code we used above to build our first predictive model for our outcome is_recid_factor. Please use the code and add further predictors to the model. Can you find a model with better accuracy by picking further predictors?
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))
# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid_factor))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid_factor) # ?drop_na
  dim(data)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

  
# Fit the model
  fit1 <- logistic_reg() %>% # logistic model
        set_engine("glm") %>% # define lm package/function
        set_mode("classification") %>%# define mode
        fit(is_recid_factor ~ age, # fit the model
        data = data_train) # based on training data
  fit1

# Training data: Add predictions 
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  
# Training data: Metrics
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)



# Test data: Add predictions 
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  
# Test data: Metrics
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)

6 Homework/Exercise:

Above we used a logistic regression model to predict recidivism. In principle, we could also use a linear probability model, i.e., estimate a linear regression and convert the predicted probabilities to a predicted binary outcome variable later on.

  1. What might be a problem when we use a linear probability model to obtain predictions (see James et al. (2013), Figure 4.2, p. 131)?
  2. Please use the code above (see next section below) but now change the model to a linear probability model using the same variables. How does the accuracy of the LP model compare to the logistic model? Did you expect that?
  • Tips
    • The linear probability model is defined through linear_reg() %>% set_engine('lm') %>% set_mode('regression')
    • The linear probability model provides a predicted probability that needs to be converted to a binary class variable at the end.
    • The linear probability model requires a numeric outcome, i.e., use is_recid as outcome and only convert is_recid to a factor at the end (as well as the predicted class).
Solution
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))
# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid) # ?drop_na
  dim(data)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

  
# Fit the model
  fit1 <- linear_reg() %>% # linear probability model
        set_engine("lm") %>% # define lm package/function
        set_mode("regression") %>%# define mode
        fit(is_recid ~ age, # fit the model
        data = data_train) # based on training data
  fit1

# Training data: Add predictions 
  data_train <- augment(x = fit1, data_train) %>%
  mutate(.pred_class = as.factor(ifelse(.pred>=0.5, 1, 0)),
         is_recid = factor(is_recid))

  head(data_train %>%
      select(is_recid, is_recid_factor, age, .pred, .resid, .pred_class))

# Training data: Metrics
  data_train %>%
      metrics(truth = is_recid, estimate = .pred_class)
  
  
  
# Test data: Add predictions 
  data_test <- augment(x = fit1, data_test) %>%
  mutate(.pred_class = as.factor(ifelse(.pred>=0.5, 1, 0)),
         is_recid = factor(is_recid))

  head(data_test %>%
      select(is_recid, is_recid_factor, age, .pred, .resid, .pred_class))

# Test data: Metrics
  data_test %>%
      metrics(truth = is_recid, estimate = .pred_class)  

6.1 All the code

# namer::unname_chunks("03_01_predictive_models.qmd")
# namer::name_chunks("03_01_predictive_models.qmd")
options(scipen=99999)

#| echo: false
library(tidyverse)
library(lemon)
library(knitr)
library(kableExtra)
library(dplyr)
library(plotly)
library(randomNames)
library(stargazer)
library(gt)
library(tidymodels)
library(tidyverse)
set.seed(0)
load("www/data/data_compas.RData")


data_table <- data %>% select(id, name, compas_screening_date, is_recid,
                        is_recid_factor, age, priors_count) %>%
  slice(1:10)


 data_table %>%
  kable(format = "html",
      escape=FALSE, 
      align=c(rep("l", 1), "c", "c")) %>%
  kable_styling(font_size = 16)
library(tidyverse)
set.seed(0)

load(file = "www/data/data_compas.Rdata")

# First 5000 observations as training data
data_train <- data[1:5000,]
fit <- glm(as.factor(is_recid) ~ age + priors_count, 
           family = binomial, 
           data = data)
cat(paste(capture.output(summary(fit))[6:10], collapse="\n"))
ggplot(data_train) + geom_smooth(aes(x = priors_count, y = y_hat)) + geom_point(aes(x = priors_count, y = is_recid)) + ylim(0,1)
data_new = data.frame(age = c(20, 20, 40, 40),
                          priors_count = c(0, 2, 0, 2))
data_new <- augment(fit, newdata = data_new, type.predict = "response")
data_new
# Visualize predicted values in plot
fit_linear <- lm(is_recid ~ age, 
           data = data_train)
data_train$y_hat_lm <- predict(fit_linear, type = "response")
ggplot(data = data_train, aes(x = age, y = is_recid)) + 
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.1) +
  geom_point(aes(x = age, y = y_hat), color = "red") +
  geom_point(aes(x = age, y = y_hat_lm), color = "blue")
  
    fit <- glm(as.factor(is_recid) ~ age, 
           family = binomial, 
           data = data_train)
    fit2 <- lm(is_recid ~ age,
           data = data_train)
    data_train$y_hat <- predict(fit, type = "response")
    data_train$y_hat2 <- predict(fit2, type = "response")

    ggplot(data_train) + 
      geom_smooth(aes(x = priors_count, y = y_hat)) + 
      #geom_smooth(aes(x = priors_count, y = y_hat2)) +
      geom_point(aes(x = priors_count, y = is_recid)) + 
      ylim(0,1)
    
x <- structure(c(540L, 255L, 224L, 424L), 
               dim = c(2L, 2L), 
               dimnames = structure(list(
    c("0", "1"), 
    c("0", "1")), 
    names = c("", "")), 
    class = "table")

x <- cbind(x, rowSums(x))
x <- rbind(x, colSums(x))

x <- as.data.frame.matrix(x) 

rownames(x)[3] <- "Total"
names(x)[3] <- "Total"

x <- x %>%
  mutate(row_spanner = "True/actual values") %>%
  mutate(category = rownames(.)) %>%
  remove_rownames() %>%
  select(category, row_spanner, everything())


confusion_matrix_reduced <- gt::gt(x, 
       
       #rownames_to_stub = TRUE,
       rowname_col = "category",
       groupname_col = "row_spanner") %>%
  tab_spanner(
    label = "Predicted values",
    columns = c("0","1")
  ) |>
  cols_width(everything() ~ px(120)) |>
  tab_header(
    title = md("Outcome recidivism:<br> 1 = recidivate, 0 = not recidivate"))
gtsave(data = confusion_matrix_reduced, filename = "confusion_matrix_reduced.png")

# Add names

x[1,3] <- paste0(x[1,3], " (TN)")
x[2,3] <- paste0(x[2,3], " (FN)")
x[1,4] <- paste0(x[1,4], " (FP)")
x[2,4] <- paste0(x[2,4], " (TP)")

x[3,3] <- paste0(x[3,3], " (N*)")
x[3,4] <- paste0(x[3,4], " (P*)")

x[1,5] <- paste0(x[1,5], " (N)")
x[2,5] <- paste0(x[2,5], " (P)")

confusion_matrix <- gt::gt(x, 
       
       #rownames_to_stub = TRUE,
       rowname_col = "category",
       groupname_col = "row_spanner") %>%
  tab_spanner(
    label = "Predicted values",
    columns = c("0","1")
  ) |>
  cols_width(everything() ~ px(120) ) |>
  tab_header(
    title = md("Outcome recidivism:<br> 1 = recidivate, 0 = not recidivate"))
confusion_matrix
gtsave(data = confusion_matrix, filename = "confusion_matrix.png")

# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)
load(file = "www/data/data_compas.Rdata")
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))
nrow(data)
ncol(data)
dim(data)
str(data) # Better use glimpse()
# glimpse(data)
# skimr::skim(data)
library(modelsummary)
datasummary_skim(data, type = "numeric", output = "html")
datasummary_skim(data, type = "categorical", output = "html")
table(data$race, useNA = "always")
table(data$is_recid, data$is_recid_factor)
table(data$decile_score)
vis_miss(data)
gg_miss_upset(data, nsets = 2, nintersects = 10) 
# Ideally, use higher number of nsets/nintersects 
# with more screen space
plot_correlation(data %>% dplyr::select(is_recid, age, 
                                        priors_count,sex, 
                                        race,
                                        juv_fel_count), 
                 cor_args = list("use" = "pairwise.complete.obs"))

# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid_factor))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid_factor) # ?drop_na
  dim(data)
  
# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!
  dim(data_train)
  dim(data_test)
# Fit the model
  fit1 <- logistic_reg() %>% # logistic model
        set_engine("glm") %>% # define lm package/function
        set_mode("classification") %>%# define mode
        fit(is_recid_factor ~ age, # fit the model
        data = data_train) # based on training data
  fit1 # Class model output with summary(fit1$fit)
# Training data: Add predictions 
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  
# Training data: Metrics
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)
  
  
# Test data: Add predictions 
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  

  
# Test data: Metrics
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)
# Calculate data for ROC curve - threshold, specificity, sensitivity
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no) %>%
  head() %>% kable()
# Calculate data for ROC curve - threshold, specificity, sensitivity
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no) %>%
  head()

# Calculate data for ROC curve and visualize
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_curve(truth = is_recid_factor, .pred_no)  %>% # Default: Uses first class (=0=no)
  autoplot() +
  xlab("False Positive Rate (FPR, 1 - specificity)") +
  ylab("True Positive Rate (TPR, sensitivity, recall)")

# Calculate area under the curve
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      roc_auc(truth = is_recid_factor, .pred_no)
# Missing outcome data predictions
  data_missing_outcome <- data_missing_outcome %>%
                                augment(x = fit1, type.predict = "response") 

  data_missing_outcome %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
        head()

data_test %>%
    augment(x = fit1, type.predict = "response") %>%
        select(is_recid_factor, .pred_class, .pred_no, .pred_yes, age, sex, race, priors_count)
# Visualize errors and predictors
data_test %>%
    augment(x = fit1, type.predict = "response") %>%
ggplot(aes(x = .pred_yes)) +
  geom_histogram() +
  xlim(0,1)
# Visualize errors and predictors
library(patchwork)
library(ggplot2)
data_plot <- data_test %>%
    augment(x = fit1, type.predict = "response") %>%
  select(.pred_yes, age, sex, race, priors_count) %>%
  mutate(age = cut_interval(age, 8),
         priors_count = as.factor(priors_count))


p1 <- ggplot(data = data_plot, aes(y = .pred_yes, x = sex)) + 
  geom_boxplot()
p2 <- ggplot(data = data_plot, aes(y = .pred_yes, x =  age)) + 
  geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p3 <- ggplot(data = data_plot, aes(y = .pred_yes, x = race)) + 
      geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p4 <- ggplot(data = data_plot, aes(y = .pred_yes, x = priors_count)) + 
      geom_boxplot() + 
      theme(axis.text.x = element_text(angle = 30, hjust = 1))
p1 + p2 + p3 + p4
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)
load(file = "www/data/data_compas.Rdata")
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))

# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid_factor))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid_factor) # ?drop_na
  dim(data)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

  
# Fit the model
  fit1 <- logistic_reg() %>% # logistic model
        set_engine("glm") %>% # define lm package/function
        set_mode("classification") %>%# define mode
        fit(is_recid_factor ~ age, # fit the model
        data = data_train) # based on training data
  fit1

# Training data: Add predictions 
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  
# Training data: Metrics
data_train %>%
  augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)



# Test data: Add predictions 
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
    select(is_recid_factor, age, .pred_class, .pred_no, .pred_yes) %>%
      head()

# Cross-classification table (Columns = Truth, Rows = Predicted)
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      conf_mat(truth = is_recid_factor, estimate = .pred_class)
  
# Test data: Metrics
  data_test %>%
    augment(x = fit1, type.predict = "response") %>%
      metrics(truth = is_recid_factor, estimate = .pred_class)
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)
load(file = "www/data/data_compas.Rdata")
# install.packages(pacman)
pacman::p_load(tidyverse,
               tidymodels,
               knitr,
               kableExtra,
               DataExplorer,
               visdat,
               naniar)

rm(list=ls())
load(url(sprintf("https://docs.google.com/uc?id=%s&export=download",
                         "1gryEUVDd2qp9Gbgq8G0PDutK_YKKWWIk")))
# Extract data with missing outcome
  data_missing_outcome <- data %>% filter(is.na(is_recid))
  dim(data_missing_outcome)

# Omit individuals with missing outcome from data
  data <- data %>% drop_na(is_recid) # ?drop_na
  dim(data)

# Split the data into training and test data
  data_split <- initial_split(data, prop = 0.80)
  data_split # Inspect

# Extract the two datasets
  data_train <- training(data_split)
  data_test <- testing(data_split) # Do not touch until the end!

  
# Fit the model
  fit1 <- linear_reg() %>% # linear probability model
        set_engine("lm") %>% # define lm package/function
        set_mode("regression") %>%# define mode
        fit(is_recid ~ age, # fit the model
        data = data_train) # based on training data
  fit1

# Training data: Add predictions 
  data_train <- augment(x = fit1, data_train) %>%
  mutate(.pred_class = as.factor(ifelse(.pred>=0.5, 1, 0)),
         is_recid = factor(is_recid))

  head(data_train %>%
      select(is_recid, is_recid_factor, age, .pred, .resid, .pred_class))

# Training data: Metrics
  data_train %>%
      metrics(truth = is_recid, estimate = .pred_class)
  
  
  
# Test data: Add predictions 
  data_test <- augment(x = fit1, data_test) %>%
  mutate(.pred_class = as.factor(ifelse(.pred>=0.5, 1, 0)),
         is_recid = factor(is_recid))

  head(data_test %>%
      select(is_recid, is_recid_factor, age, .pred, .resid, .pred_class))

# Test data: Metrics
  data_test %>%
      metrics(truth = is_recid, estimate = .pred_class)  
labs = knitr::all_labels()
labs = setdiff(labs, c("setup", "setup2", "homework-solution-hide", "get-labels", "load-data-1"))

References

Angwin, Julia, Jeff Larson, Lauren Kirchner, and Surya Mattu. 2016. “Machine Bias.” https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
Dressel, Julia, and Hany Farid. 2018. “The Accuracy, Fairness, and Limits of Predicting Recidivism.” Sci Adv 4 (1): eaao5580.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer.
Lee, Claire S, Jeremy Du, and Michael Guerzhoy. 2020. “Auditing the COMPAS Recidivism Risk Assessment Tool: Predictive Modelling and Algorithmic Fairness in CS1.” In Proceedings of the 2020 ACM Conference on Innovation and Technology in Computer Science Education, 535–36. ITiCSE ’20. New York, NY, USA: Association for Computing Machinery.

Footnotes

  1. Alternatively use predict(fit, newdata = NULL, type = "response").↩︎