# Chapter 7 Logistic regression

Many analytics problems can be expressed as the prediction of which of two outcomes is more likely. The target variable in such applications is binary and expressed as “1” or “0.” There are many questions with a binary outcome, such as:

• Success/Failure.
• Heart attack/No heart attack.
• Continue/Not continue.
• Fraud/No fraud.
• Cancer/No cancer.
• Vote yes/Vote no.

This, of course, is just a partial list since binary outcomes are of concern in many different disciplines including human resources, marketing, medicine, education, political science, criminology, and others.

Logistic regression is one technique that is frequently used to create predictive models where the target is binary. (There are other techniques which will be discussed in subsequent chapters.) To build such a model, a data set is obtained with the structure shown in Figure 7.1.

Figure 7.1: Typical data structure.

The blue squares in Figure 7.1 represent a zero and the maroon squares represent a one. The cases may be individuals, companies, or other types of entities. The predictors can be continuous variables (e.g., age) or categorical (e.g., sex). At first it might seem that a data set with this structure could be analyzed with ordinary least squares regression. There are several reasons why linear regression is not appropriate, including the fact that its estimates of the target are not constrained to be between 0 and 1. Furthermore, logistic regression does not require the assumption of a normally distributed error term nor the assumption that the error variance is constant. (In fact, the error variance is a function of the predictors.)

One way of understanding the logistic regression model is to think of an intermediate variable (which is not explicitly observed) that links the predictors to the target. While the target variable yi (where the subscript represents the ith case in the data set) is binary, the intermediate variable pi is continuous and can be thought of as a propensity for the target to be 1 versus 0. Therefore, pi is continuous in the range of zero to one. Once pi is modeled, predictions of the target can be obtained by using a decision rule based on a threshold value: if pi is greater than the threshold, then yi is predicted to be 1; otherwise yi is predicted to be 0.

An equation that is a weighted linear combination of the predictors is created. The linear combination can produce a continuous outcome with estimates ranging from -infinity to +infinity. A logit transformation of pi (the log odds) is therefore modeled as a linear function of the predictors. Inverting the transformation recovers pi, and the decision rule converts pi to the prediction $\hat{y}_{i}$:
$\begin{equation} \mathit{If}\;p_{i}\; >\; threshold,\; \hat{y}_{i} = 1;\; else\; \hat{y}_{i} = 0, \\ \mathit{where}\;p_{i}\; = \frac{1}{1 + e^{-log(odds)}} \tag{7.1} \end{equation}$
The log odds is a transformation of the probabilities of 1 and 0. In logistic regression the model is estimated as a linear function of log [p(1)/p(0)] where p( ) indicates probability.
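The transformation in Equation (7.1) and its inverse can be sketched in a few lines of Python. The function names below are illustrative choices, not part of any KNIME node, and the threshold of 0.5 is an assumed default.

```python
import math

def logistic(log_odds: float) -> float:
    """Map a log-odds value on (-inf, +inf) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def logit(p: float) -> float:
    """Inverse of logistic(): probability -> log odds, log[p / (1 - p)]."""
    return math.log(p / (1.0 - p))

def predict(p: float, threshold: float = 0.5) -> int:
    """Decision rule from Equation (7.1): 1 if p exceeds the threshold, else 0."""
    return 1 if p > threshold else 0

# Large negative log odds give probabilities near 0, large positive near 1.
for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    p = logistic(z)
    print(f"log odds {z:+.1f} -> p = {p:.3f} -> prediction {predict(p)}")
```

A log odds of zero corresponds to a probability of exactly 0.5, which is why 0.5 is a natural (though not mandatory) threshold.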

The p(1) and p(0) are unobserved; the coefficients of the logistic regression model are estimated using a technique that works to create values of p(1) and p(0) that maximize conformance with the observed 1’s and 0’s for the observations in the data set. Unlike ordinary least-squares regression, there is no closed-form algebraic solution to this estimation problem, so an iterative maximum likelihood algorithm is used.
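To make the idea of iterative maximum likelihood concrete, the sketch below applies Newton-Raphson updates to a model with one predictor. It is an illustration only, not the algorithm KNIME uses internally; the data are the 20 synthetic age/purchase observations listed later in Table 7.2, and the function name `fit_logistic` is hypothetical.

```python
import math

# Ages and purchase outcomes (1 = Yes, 0 = No) as listed in Table 7.2.
ages = [53, 51, 43, 36, 33, 30, 41, 54, 39, 23,
        40, 48, 58, 52, 46, 34, 47, 44, 42, 25]
purchased = [0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
             0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for log[p/(1-p)] = b0 + b1*x with a single predictor."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0            # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0    # entries of X'WX (the negative Hessian)
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system (X'WX) * step = gradient.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:   # stop once the update is negligible
            break
    return b0, b1

b0, b1 = fit_logistic(ages, purchased)
print(f"intercept = {b0:.3f}, slope on Age = {b1:.3f}")
```

Each pass recomputes the fitted probabilities, measures how far they are from the observed outcomes, and adjusts the coefficients; the loop stops when the adjustment becomes negligible.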

The effect of the logistic transformation is illustrated in Figure 7.2. While the variable on the x-axis is continuous, the y-axis variable is constrained to be between zero and one due to the logistic transformation.

Figure 7.2: The logistic function.

Note that logistic regression is not the only technique that can be used in these cases. There is a very closely related technique called probit analysis which works in a very similar manner. Instead of using a logistic function for the S-shaped curve, probit analysis uses a cumulative normal function. Investigations of the two techniques have shown very similar results. Also, the hyperbolic tangent function can be used, which has properties similar to the logistic function except that it ranges between -1 and +1 instead of 0 and +1. Of course, techniques such as neural networks, decision trees, or discriminant analysis can also be used with binary targets.
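The three S-shaped curves mentioned above can be compared directly. The sketch below uses only the Python standard library; the function names are illustrative. It relies on the identity that the hyperbolic tangent, rescaled from (-1, +1) to (0, 1), coincides with the logistic function when evaluated at half the input.

```python
import math

def logistic(x: float) -> float:
    """Logistic curve, ranging from 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))

def probit_curve(x: float) -> float:
    """Cumulative standard normal distribution, the curve used by probit analysis."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def tanh_rescaled(x: float) -> float:
    """tanh(x / 2) shifted and scaled from (-1, +1) to (0, 1)."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x = {x:+.1f}  logistic = {logistic(x):.4f}  "
          f"probit = {probit_curve(x):.4f}  rescaled tanh = {tanh_rescaled(x):.4f}")
```

The logistic and rescaled tanh columns agree, while the cumulative normal curve has the same shape but rises more steeply for the same input scale.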

## 7.1 Example with a single predictor

Consider a very simple classification problem using synthetic data. The example involves predicting purchase (“Yes” or “No”) by a customer as a function of the customer’s age. Twenty observations with nine “Yes” values and 11 “No” values were created for demonstration (see footnote 1). A listing of the first 10 rows of the data set is shown in Table 7.1.

Table 7.1: First 10 rows of simulated data

| Purchase | Age |
|---|---|
| No | 53 |
| No | 51 |
| Yes | 43 |
| Yes | 36 |
| Yes | 33 |
| Yes | 30 |
| Yes | 41 |
| No | 54 |
| Yes | 39 |
| Yes | 23 |

A logistic regression was run with Purchase as the target and Age as the predictor variable. The equation estimated using logistic regression on the data from Table 7.1 is:

$\begin{equation} log \left[ \frac{\;p_{i}\;}{1-\;p_{i}\;} \right] = 11.75 - 0.286 \cdot \;Age_{i}\; \tag{7.2} \end{equation}$

Interpreting the coefficients produced in the logistic regression model is not straightforward due to the non-linear relationship between predictors and probabilities. For this example, it can be noted that since the coefficient on Age is negative, increases in Age result in decreased likelihood of purchasing the product. The value of this coefficient (-0.286) means that a one unit increase in Age is associated with a 0.286 decrease in the log odds of purchase. The intercept (11.75) is usually not of much interest in such problems. The intercept is there to adjust the model to fit the overall proportion of “yeses” and “noes.”
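One way to make the coefficient on Age more tangible is to exponentiate it, which converts a change in log odds into a multiplicative change in the odds. Using the rounded coefficients from Equation (7.2), a worked version is:

$\begin{equation} \frac{odds(Age_{i} + 1)}{odds(Age_{i})} = \frac{e^{11.75 - 0.286 \cdot (Age_{i} + 1)}}{e^{11.75 - 0.286 \cdot Age_{i}}} = e^{-0.286} \approx 0.75 \end{equation}$

Under this model, each additional year of age multiplies the odds of purchase by about 0.75, a reduction of roughly 25 percent, regardless of the starting age. The corresponding change in probability, by contrast, depends on the starting age.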

Using equation (7.3), a graph of probability versus age (Figure 7.3) was created, showing a decrease in probability as age increases.

$\begin{equation} \;p_{i}\; = \frac{1}{1 + e^{-(11.75 - 0.286 \cdot \;Age_{i}\;)}} \tag{7.3} \end{equation}$

Figure 7.3: Probability of purchase versus age.

Equation (7.3) can also be used to create predictions by comparing each probability to the threshold value using the following decision rule:

If probability of purchase > .5,
then Most likely purchase = “Yes,”
else Most likely purchase = “No.”

Table 7.2 shows the results from applying the decision rule.

Table 7.2: Predictions using the logistic model.

| Purchase | Age | Logit | Probability of purchase | Most likely purchase |
|---|---|---|---|---|
| No | 53 | -3.415 | 0.032 | No |
| No | 51 | -2.843 | 0.055 | No |
| Yes | 43 | -0.555 | 0.366 | No* |
| Yes | 36 | 1.448 | 0.811 | Yes |
| Yes | 33 | 2.306 | 0.910 | Yes |
| Yes | 30 | 3.164 | 0.960 | Yes |
| Yes | 41 | 0.018 | 0.506 | Yes |
| No | 54 | -3.702 | 0.024 | No |
| Yes | 39 | 0.590 | 0.645 | Yes |
| Yes | 23 | 5.167 | 0.994 | Yes |
| No | 40 | 0.304 | 0.577 | Yes* |
| No | 48 | -1.985 | 0.122 | No |
| No | 58 | -4.846 | 0.008 | No |
| No | 52 | -3.129 | 0.042 | No |
| No | 46 | -1.413 | 0.197 | No |
| No | 34 | 2.020 | 0.883 | Yes* |
| No | 47 | -1.699 | 0.156 | No |
| No | 44 | -0.841 | 0.303 | No |
| Yes | 42 | -0.269 | 0.435 | No* |
| Yes | 25 | 4.595 | 0.990 | Yes |

Note: * Indicates prediction error.

The results showing the performance of predictive classification models are typically displayed as a “classification matrix” (also known as a “confusion matrix”) which is shown in Table 7.3.

Table 7.3: Confusion matrix

| Actual | Predicted: No | Predicted: Yes | Totals |
|---|---|---|---|
| No | 9 | 2 | 11 |
| Yes | 2 | 7 | 9 |
| Totals | 11 | 9 | 20 |
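The steps from Equation (7.3) through the decision rule to the confusion matrix can be retraced with a short script. This is a sketch that uses the rounded coefficients from Equation (7.2) and the 20 observations from Table 7.2; the variable and function names are illustrative.

```python
import math

B0, B1 = 11.75, -0.286   # rounded coefficients from Equation (7.2)

ages = [53, 51, 43, 36, 33, 30, 41, 54, 39, 23,
        40, 48, 58, 52, 46, 34, 47, 44, 42, 25]
actual = ["No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes",
          "No", "No", "No", "No", "No", "No", "No", "No", "Yes", "Yes"]

def probability_of_purchase(age: float) -> float:
    """Equation (7.3): logistic transform of the fitted log odds."""
    log_odds = B0 + B1 * age
    return 1.0 / (1.0 + math.exp(-log_odds))

# Decision rule with a threshold of 0.5.
predicted = ["Yes" if probability_of_purchase(a) > 0.5 else "No" for a in ages]

# Tally (actual, predicted) pairs into a confusion matrix.
confusion = {(a, p): 0 for a in ("No", "Yes") for p in ("No", "Yes")}
for a, p in zip(actual, predicted):
    confusion[(a, p)] += 1

for a in ("No", "Yes"):
    print(f"Actual {a:>3}: predicted No = {confusion[(a, 'No')]}, "
          f"predicted Yes = {confusion[(a, 'Yes')]}")
```

With these coefficients the probability crosses 0.5 at an age of about 41, so every customer aged 41 or younger is predicted to purchase.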

## 7.2 Example: Predictive analytics in HR

Employee retention is an important objective for HR departments in many organizations. According to a report from the Work Institute (“2020 Retention Report” 2020), more than 27 percent of U.S. employees voluntarily left their jobs in 2020, with a total estimated cost of \$630 billion due to factors such as cost of replacement and loss of productivity. This same study reported that more than three-fourths of the turnover was preventable.

Using predictive analytics can enable employers to identify employees at risk of leaving and to take preemptive corrective action. A data set on employee turnover was obtained from Kaggle. This data set has 14,999 rows and 9 columns:

1. Satisfaction: Level scored 0 to 1.
2. Evaluation: Last evaluation rating scored 0 to 1.
3. Projects: Number of projects completed while at work.
4. Hours: Average monthly hours at workplace.
5. Years: Number of years spent in the company.
6. Promotion: Whether the employee was promoted in the last five years.
7. Department: Department in which the employee worked.
8. Salary: Relative level of salary: low, medium, high.
9. Left: Whether the employee left the workplace or not.

A KNIME workflow, shown in Figure 7.4, was created to analyze the HR data.

Figure 7.4: KNIME workflow for logistic regression on employee turnover data.

A description of each node is shown in Table 7.4.

Table 7.4: Description of workflow nodes for employee turnover logistic model.

| Node | Label | Description |
|---|---|---|
| 2 | Data Explorer | Create summary statistics and histograms for each variable. All variables with the exception of YearsAtCompany are okay for analysis. YearsAtCompany is highly skewed, so logarithmic transformation is indicated. |
| 3 | Math Formula | Compute ln(YearsAtCompany). |
| 4 | Normalizer | Rescale continuous variables to range of 0 to 1. |
| 5 | Partitioning | A 70/30 (Training/Test) random split of the data set stratified on Left is formed. Set random seed = 123. The Test portion is set aside for validation. |
| 6 | SMOTE | SMOTE (Synthetic Minority Over-sampling Technique) oversamples the class where employees Left in order to create an equal number of those who Left and those who Did not leave. By balancing the target variable a better logistic model is formed. |
| 7 | X-Partitioner | The beginning loop of a 10-fold cross validation is created. |
| 8 | Logistic Regression Learner | Run logistic regression with Left as the target. |
| 9 | Logistic Regression Predictor | Predict the response from the logistic regression model using the Training data. |
| 10 | X-Aggregator | The end of the cross validation loop. |
| 11 | ROC Curve | Create ROC curve for the Training data. |
| 12 | Scorer | Compute performance statistics and classification (confusion) matrix for the Training data. |
| 13 | Logistic Regression Predictor | Predict the response from the logistic regression model using the Test data. |
| 14 | ROC Curve | Create ROC curve for the Test data. |
| 15 | Scorer | Compute performance statistics and classification (confusion) matrix for the Test data. |

In Node 1 the data set with 14,999 observations and nine variables was read using the File Reader node. Promotion, Salary and Left are string variables; the others are numeric. The data set was explored using Data Explorer. All of the variables appeared to be suitable for analysis with the exception of YearsAtCompany, which was highly skewed as shown in the histogram in Figure 7.5.

Figure 7.5: Histogram of YearsAtCompany.

Since a skewed variable can lead to poorer performance with logistic regression, a transformation of YearsAtCompany was computed using a Math Formula (Node 3) and the original variable replaced in the subsequent nodes. This reduced the skewness of YearsAtCompany from 1.85 to 0.59.

Predictive models typically do better with normalized predictors, so the following variables were normalized (Node 4) to a range of 0 to 1: Satisfaction, Evaluation, Projects, Hours, and Years.
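The two preprocessing steps described above, a logarithmic transformation to reduce skewness followed by rescaling to the range 0 to 1, can be sketched as follows. The `years` values are hypothetical and serve only to show the mechanics; the skewness measure is the simple moment-based coefficient, which may differ slightly from the statistic reported by KNIME.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def log_transform(values):
    """Natural log of each (strictly positive) value."""
    return [math.log(v) for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: nothing to rescale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

years = [2, 3, 3, 4, 5, 6, 10]         # hypothetical right-skewed tenure values
log_years = log_transform(years)
scaled = min_max(log_years)

print(f"skewness before: {skewness(years):.2f}, after log: {skewness(log_years):.2f}")
print([round(v, 3) for v in scaled])
```

Because the log compresses large values more than small ones, the long right tail is pulled in and the skewness coefficient drops.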

In Node 5 the data was split randomly into two subsets: Training and Test. The split ratio was 70/30 and the sample was stratified on Left so that the proportions of Left / Did not leave would be about the same in both the Training and Test partitions.
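A stratified split can be sketched by shuffling and cutting each class separately, so that both partitions keep roughly the same class proportions. The function below is an illustrative stand-in for the Partitioning node, not its actual implementation; the label counts are hypothetical, and a given seed will not reproduce KNIME's random draw.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.7, seed=123):
    """Return (train_rows, test_rows), splitting each class separately."""
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for row, label in enumerate(labels):
        rows_by_class[label].append(row)
    train, test = [], []
    for rows in rows_by_class.values():
        rng.shuffle(rows)
        cut = round(train_fraction * len(rows))   # per-class training count
        train.extend(rows[:cut])
        test.extend(rows[cut:])
    return sorted(train), sorted(test)

# Hypothetical labels with roughly the class balance described in this example.
labels = ["Left"] * 238 + ["Did not leave"] * 762
train_rows, test_rows = stratified_split(labels)

left_share_train = sum(labels[r] == "Left" for r in train_rows) / len(train_rows)
left_share_test = sum(labels[r] == "Left" for r in test_rows) / len(test_rows)
print(len(train_rows), len(test_rows), round(left_share_train, 3), round(left_share_test, 3))
```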

SMOTE (Node 6) was used to oversample the Left category. Prior to oversampling, the split in the Training partition was 2,499 Left and 7,999 Did not leave. Such an imbalance in a binary target variable will usually cause the logistic regression to assign most predictions to the level with the larger number of cases. After applying SMOTE, both levels (Left and Did not leave) had 7,999 cases. This balanced sample was used to facilitate the estimation of the logistic regression model (Salas-Eljatib et al. 2018), but the evaluation of model performance was done using the Test set with the original level proportions.
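The core idea of SMOTE is to create each synthetic minority case by interpolating between an existing minority case and one of its nearest minority-class neighbors. The sketch below is a simplified illustration of that idea with hypothetical data and names; it is not the KNIME node's implementation and it handles numeric predictors only.

```python
import random

def smote(minority, n_new, k=5, seed=123):
    """Create n_new synthetic points between minority points and their neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()               # position along the line from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

# Hypothetical minority-class rows with two normalized predictors.
minority = [(0.10, 0.20), (0.20, 0.10), (0.30, 0.40), (0.40, 0.30)]
new_points = smote(minority, n_new=6, k=2)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

In practice `n_new` would be the gap between the two class counts, so that the minority class is grown to the size of the majority class.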

A k-fold cross validation (Node 7) was used to assess the stability of the logistic model using the Training data by employing the X-Partitioner and X-Aggregator nodes in KNIME with k = 10. The Training data is divided randomly into 10 approximately equal segments. Then, the logistic regression was run 10 times, each time withholding a 10% subset. Logistic regression requires two steps in KNIME; in Node 8 the Logistic Regression Learner runs the model and outputs the model coefficients; in Node 9 the Logistic Regression Predictor is applied to compute predictions for each row of the Training data. The 10 held out samples were used to assess the predictive accuracy. The results of the 10 runs were obtained from the output of Node 10 and are shown in Table 7.5.

Table 7.5: Results of k-fold validation.

| Fold | Error in % | Accuracy in % | Size of test set | Error count |
|---|---|---|---|---|
| fold 0 | 18.56 | 81.44 | 1600 | 354 |
| fold 1 | 19.00 | 81.00 | 1600 | 323 |
| fold 2 | 19.69 | 80.31 | 1600 | 358 |
| fold 3 | 18.38 | 81.63 | 1600 | 379 |
| fold 4 | 19.88 | 80.13 | 1600 | 341 |
| fold 5 | 20.69 | 79.31 | 1600 | 323 |
| fold 6 | 19.00 | 81.00 | 1600 | 354 |
| fold 7 | 18.69 | 81.31 | 1600 | 349 |
| fold 8 | 20.33 | 79.67 | 1599 | 331 |
| fold 9 | 21.14 | 78.86 | 1599 | 335 |

The average accuracy was 80.57%, with a maximum of 81.63% and a minimum of 78.86%. This shows that the model had reasonable accuracy and was quite stable in performance across the 10 folds. Recall that these analyses were conducted using the oversampled data.
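The partitioning behind a k-fold cross validation can be sketched as a generator that shuffles the row indices once and then holds out one slice at a time. This is an illustration of the general technique rather than of the X-Partitioner node itself; the function name and seed are assumptions, and the 15,998 rows correspond to the balanced Training data (two classes of 7,999).

```python
import random

def k_fold_indices(n_rows, k=10, seed=123):
    """Yield (training_rows, held_out_rows) for each of k folds."""
    rows = list(range(n_rows))
    random.Random(seed).shuffle(rows)
    base, extra = divmod(n_rows, k)      # the first `extra` folds get one more row
    start = 0
    for fold in range(k):
        size = base + 1 if fold < extra else base
        held_out = rows[start:start + size]
        training = rows[:start] + rows[start + size:]
        start += size
        yield training, held_out

sizes = [len(held_out) for _, held_out in k_fold_indices(15998)]
print(sizes)   # eight folds of 1600 rows and two of 1599
```

Every row is held out exactly once, so each of the ten models is scored on data it did not see during fitting.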

The final model from Node 10 (using the last of the 10-fold partitions) was passed to the evaluation nodes for the Training data. The predictions for the Training data were computed in Node 10. An ROC curve was created in Node 11 and the Scorer Node 12 created a classification (or confusion) matrix along with several evaluation metrics. A corresponding set of analyses was run for the Test data, shown in Nodes 13, 14, and 15.

The classification matrices for the Training and Test data are shown in Tables 7.6 and 7.7.

Table 7.6: Confusion matrix for training data.

| Actual | Predicted: Left company | Predicted: Did not leave | Total |
|---|---|---|---|
| Left company | 6,909 | 1,090 | 7,999 |
| Did not leave | 2,035 | 5,964 | 7,999 |
| Total | 8,944 | 7,054 | 15,998 |

Table 7.7: Confusion matrix for test data.

| Actual | Predicted: Left company | Predicted: Did not leave | Total |
|---|---|---|---|
| Left company | 919 | 152 | 1,071 |
| Did not leave | 870 | 2,559 | 3,429 |
| Total | 1,789 | 2,711 | 4,500 |

Note that the results for the Training data are based on the oversampled data while the Test data results used the original ratio of Left versus Did not leave. The imbalanced data created less than ideal results. The Scorer Nodes (12 and 15) also provide summary descriptive statistics derived from the classification matrices. The results for both partitions are shown in Table 7.8.

Table 7.8: Performance metrics for training and test data.

| Metric | Training data | Test data |
|---|---|---|
| Accuracy | 0.805 | 0.773 |
| Cohen’s kappa | 0.609 | 0.491 |
| Precision | 0.772 | 0.514 |
| Sensitivity | 0.864 | 0.858 |
| Specificity | 0.746 | 0.749 |
| F-measure | 0.834 | 0.643 |

The overall accuracies for the Training and Test data sets were close, with a slight edge for the Training data. This is to be expected since the logistic regression was optimized for the Training data. However, accuracy is not the best measure in this situation due to the imbalance in the binary outcomes. In fact, if all observations in the Test data are predicted as Did not leave, the accuracy would be .762. This approach would not be useful, of course, since it never identifies any of those that left the company.
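The summary statistics produced by the Scorer node can be derived from the four cells of a confusion matrix. The sketch below uses the standard definitions, with Left company treated as the positive class. The cell counts are those of the Test data, where the count of employees who did not leave but were predicted to leave is taken as 870, the value implied by the row and column totals of Table 7.7. Results computed this way may differ from the reported values in the third decimal place.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from a 2x2 confusion matrix (positive class = Left company)."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    precision = tp / (tp + fp)        # share of predicted leavers who actually left
    sensitivity = tp / (tp + fn)      # share of actual leavers who were identified
    specificity = tn / (tn + fp)      # share of stayers correctly identified
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    # Cohen's kappa: agreement beyond what the marginal totals alone would give.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy, "kappa": kappa, "precision": precision,
        "sensitivity": sensitivity, "specificity": specificity, "f_measure": f_measure,
    }

test_metrics = classification_metrics(tp=919, fn=152, fp=870, tn=2559)
for name, value in test_metrics.items():
    print(f"{name:<12} {value:.3f}")

# Accuracy of the trivial rule "predict Did not leave for everyone".
baseline_accuracy = (870 + 2559) / (919 + 152 + 870 + 2559)
print(f"baseline     {baseline_accuracy:.3f}")
```

Comparing the model's accuracy with the baseline makes it clear why accuracy alone says little here, while sensitivity and precision describe how well the leavers themselves are identified.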

Cohen’s kappa was lower in the Test data than in the Training data. This is likely due to the imbalance in binary outcomes in the original data. It is known that kappa reaches its maximum of 1.0 only for balanced outcomes, so the lower kappa in the Test data is not surprising (Widmann n.d.).

An area of concern with the results for the Test data is precision, which is much lower for the Test data than for the Training data. This too is due to the imbalance in the number of cases which Left versus the number that Did not leave. Because of this imbalance, it is “easier” to predict Did not leave. In the context of this HR application, it could be argued that mistakenly predicting an employee will leave is not as serious as failing to predict that an employee would leave. The sensitivity for predicting who is likely to leave was more than 0.8 with both the Training and Test data. Thus, the vast majority of those likely to turn over can be identified. In such a situation the results of the model can only be used to signal to HR that further investigation of those identified as likely to leave is warranted.

ROC curves were obtained for the Training data in Node 11 and the Test data in Node 14. The curves are shown in Figures 7.6 and 7.7.

Figure 7.6: ROC curve for the training data. (AUC = 0.829)

Figure 7.7: ROC curve for the test data. (AUC = 0.833)

The ROC curves for both data sets are quite similar, indicating comparable predictive performance with the Training and Test data. The areas under the ROC curves (AUC) for both are about 0.83, which is fairly high, as the AUC varies from 0.0 to 1.0. The overall assessment of the logistic model with this data is that it would be useful in identifying employees likely to turn over when the predictor variables are input to the model. As noted earlier, however, when an employee is predicted to turn over, additional human review is needed since the precision of the model was not high.
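The AUC has a useful interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case. The sketch below computes it directly from that definition. The scores and labels are hypothetical, and the pairwise loop is intended for small illustrations rather than large data sets.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count one half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical predicted probabilities of leaving and actual outcomes (1 = Left).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
print(f"AUC = {auc(scores, labels):.3f}")
```

A model that ranks every positive case above every negative case has an AUC of 1.0, while random scores give a value near 0.5.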

## 7.3 Predictor interpretation and importance

Some questions about this predictive model are: “How can the changes in the predictor variables be interpreted? What is the relative importance of each of the predictors of employee turnover?” Answering these questions might provide insight into what steps might be taken to reduce turnover.

The question of predictor (or feature) importance has been studied extensively with multiple regression models. However, comparatively little has been published about measuring predictive importance in logistic regression (Azen and Traxel 2009). An approach based on “dominance analysis” has been developed and is available in Python and R. Dominance analysis is not directly available in KNIME. A simpler approach will be discussed in this section (but one that has limitations like any of the other methods).

The coefficients on the predictors cannot be used to infer importance in logistic regression because these coefficients are for the log of the odds. Rather, what is usually desired is the impact on the probability of changes in each predictor. This is further complicated because the effect on probability of the outcome is not linear, and the probabilities associated with a predictor depend upon the values for the other predictors. Thus, the “importance” of a particular predictor varies based on the range of the predictor being considered and the settings of every other predictor. So, the relative sizes of the coefficients in a logistic model cannot indicate predictor importance, even if the predictor variables are normalized.
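The dependence of the probability effect on the starting point can be seen from the derivative of the logistic function. For a predictor $x_{j}$ with coefficient $\beta_{j}$ (notation introduced here for illustration):

$\begin{equation} \frac{\partial p_{i}}{\partial x_{j}} = \beta_{j} \cdot p_{i} \cdot (1 - p_{i}) \end{equation}$

The effect is largest when $p_{i} = 0.5$, where it equals $\beta_{j}/4$, and shrinks toward zero as $p_{i}$ approaches 0 or 1. In the single-predictor example of Section 7.1, the coefficient of -0.286 on Age corresponds to a change of about $-0.286 \cdot 0.5 \cdot 0.5 \approx -0.072$ in probability per year near a probability of 0.5, but only about $-0.286 \cdot 0.032 \cdot 0.968 \approx -0.009$ per year at age 53, where the probability of purchase is 0.032.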

Likewise, the p-values for significance tests on each predictor cannot be interpreted as a measure of importance in a practical sense. A small p-value indicates that the coefficient estimate has a small standard error relative to its magnitude, but the variable could still have a very minor effect on the target variable.

In general, for binary logistic models no approach to interpretation can fully describe the relationship between changes in a predictor value and the probability of the target variable (Long and Freese 2006).

One approach is to examine the effect on the probability of the outcome as each predictor is varied from its minimum to its maximum. Since the results of changing one input variable depend upon the values of the other predictors, a “baseline” set of values for the predictors was established. Each of the continuous predictors was set to its mean and each nominal predictor was set to its modal value. The results (Table 7.9) indicate that “Number of years spent in the company” was most important, followed by Satisfaction. Note that this still does not fully address the question of interpretation, but it may provide some insight.

Table 7.9: Range of probabilities of leaving by predictors.

| Predictor | Range |
|---|---|
| Last evaluation rating scored | 0.12 |
| Department in which the employee worked | 0.23 |
| Average monthly hours at workplace | 0.33 |
| Whether the employee was promoted in last five years | 0.39 |
| Relative level of salary: low | 0.47 |
| Number of projects completed while at work | 0.51 |
| Satisfaction | 0.75 |
| Number of years spent in the company | 0.77 |
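The range calculation can be sketched as follows: hold every predictor at its baseline value, sweep one predictor from its minimum to its maximum, and record the spread of the resulting probabilities. The coefficients, intercept, and baseline values below are hypothetical placeholders, not the fitted values from the employee turnover model.

```python
import math

def probability(coefs, intercept, values):
    """Logistic model probability for one set of predictor values."""
    z = intercept + sum(coefs[name] * values[name] for name in coefs)
    return 1.0 / (1.0 + math.exp(-z))

def probability_range(coefs, intercept, baseline, name, lo, hi, steps=50):
    """Spread of predicted probability as one predictor moves from lo to hi."""
    probs = []
    for i in range(steps + 1):
        values = dict(baseline)                     # other predictors stay at baseline
        values[name] = lo + (hi - lo) * i / steps
        probs.append(probability(coefs, intercept, values))
    return max(probs) - min(probs)

# Hypothetical coefficients for three normalized (0 to 1) predictors.
coefs = {"satisfaction": -4.0, "evaluation": 0.5, "log_years": 2.5}
intercept = 0.2
baseline = {"satisfaction": 0.6, "evaluation": 0.7, "log_years": 0.4}

ranges = {name: probability_range(coefs, intercept, baseline, name, 0.0, 1.0)
          for name in coefs}
for name, value in sorted(ranges.items(), key=lambda item: item[1]):
    print(f"{name:<14} {value:.2f}")
```

Changing the baseline changes the ranges, which is the limitation noted above: the ranking describes importance around one reference employee, not in general.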

Another approach to interpreting predictors is to chart changes in probability by a predictor at different levels of another predictor. Figure 7.8 shows how the probability of leaving the company varies with level of satisfaction at three salary levels. As might be expected, employees with Low salary were generally more likely to leave at any level of satisfaction.

This type of analysis is useful when examining the impact of two variables but still does not fully address the question of variable importance.

Figure 7.8: Probability of leaving by satisfaction for different salary levels.

## 7.4 Regularized logistic regression

Next, we will look at a subset of the data used in a Kaggle competition developed by the University of Melbourne which asked participants to predict the success of research grant applications. The subset used is based on the involved data preparation process available in the R package “Applied Predictive Modeling”. A subset of the data in that package was created by removing observations with missing values, yielding a total of 5,503 observations on 258 columns, including the binary target column “Class” (successful versus unsuccessful).

As discussed in the chapter on regression, KNIME includes an integration of algorithms from the H2O suite of analytics programs. The H2O algorithm “Generalized Linear Model (GLM)” was used to analyze the grant application data. The GLM program was run with the following settings:

• Target Column => Class
• Predictors => NumCI through Day (257 columns)
• The “Ignore constant columns” option was checked.
• The random seed was set to 123.
• The algorithm family was set to Binomial, the Link to logit.
• The alpha parameter controls the penalty functions for LASSO and ridge regression. An alpha of 1.0 produces LASSO and 0.0 produces ridge regression. This was set to 1.0 for LASSO.

The lambda parameter controls the amount of regularization, from 0 (no regularization) to larger values for more regularization. The option to perform an automatic search for the optimal value of the lambda parameter was selected. GLM first fits a model with maximum regularization, using a lambda value high enough that all coefficients are zero, and then sequentially reduces lambda until the minimum lambda (set to .0001) is reached or until overfitting occurs. The best value for lambda is determined by using the validation subset of the data, which was set to 15%. The best lambda is that which maximizes the log-likelihood in the GLM model.
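The effect of the lambda parameter under a LASSO penalty can be illustrated with a small proximal gradient routine. This is a teaching sketch under stated assumptions, not H2O's solver: the data are synthetic, the function names are hypothetical, the intercept is left unpenalized, and the objective is the mean negative log-likelihood plus lambda times the sum of absolute coefficients.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, amount):
    """Shrink value toward zero by amount; values within +/- amount become 0."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def lambda_max(X, y):
    """Smallest LASSO penalty for which every coefficient is zero."""
    n = len(y)
    y_bar = sum(y) / n
    return max(abs(sum(row[j] * (yi - y_bar) for row, yi in zip(X, y))) / n
               for j in range(len(X[0])))

def lasso_logistic(X, y, lam, iterations=500):
    """Proximal gradient descent on mean negative log-likelihood + lam * |beta|_1."""
    n, m = len(y), len(X[0])
    y_bar = sum(y) / n
    intercept = math.log(y_bar / (1.0 - y_bar))    # best intercept when beta = 0
    beta = [0.0] * m
    # Conservative step size: curvature is at most (1 + max squared row norm) / 4.
    step = 4.0 / (1.0 + max(sum(v * v for v in row) for row in X))
    for _ in range(iterations):
        resid = [sigmoid(intercept + sum(b * v for b, v in zip(beta, row))) - yi
                 for row, yi in zip(X, y)]
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(m)]
        intercept -= step * sum(resid) / n
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return intercept, beta

# Synthetic data: only the first of three predictors is related to the outcome.
rng = random.Random(123)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1 if row[0] + rng.gauss(0, 1) > 0 else 0 for row in X]

top = lambda_max(X, y)
_, beta_heavy = lasso_logistic(X, y, lam=1.5 * top)
_, beta_light = lasso_logistic(X, y, lam=0.05 * top)
print("heavy penalty:", beta_heavy)
print("light penalty:", [round(b, 3) for b in beta_light])
```

With a penalty above the maximum lambda every coefficient is held at exactly zero; as lambda is reduced, coefficients enter the model, which is the path that the automatic lambda search follows.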

A workflow for the regularized logistic regression was created using KNIME and H2O (Figure 7.9) with corresponding node descriptions in Table 7.10. Note that the optimum value of lambda is found independently of the test data.

Figure 7.9: KNIME workflow for regularized logistic regression.

Table 7.10: Node descriptions for regularized logistic regression.

| Node | Label | Description |
|---|---|---|
| 2 | H2O Local Connect | Allows running H2O models in KNIME workflow. |
| 3 | Table to H2O | Convert KNIME data to H2O frame. |
| 4 | H2O Partitioning | An 80/20 (Training/Test) random split of the data set stratified on Class, random seed = 123. The Test portion is set aside for validation. |
| 5 | H2O Generalized Linear Model Learner | GLM set up to run logistic regression with regularization. |
| 6 | H2O Predictor (Classification) | Create predictions and probabilities. |
| 7 | H2O to Table | Convert H2O frame to KNIME data. |

Regularization was able to simplify the model considerably (with 73 fewer predictors) and actually improved the area under the ROC curve, accuracy, and sensitivity (Table 7.11). Specificity was slightly lower with the regularized model.

Table 7.11: Comparison of full and regularized logistic regression models.

| Metric | Full (257 predictors) | Regularized (184 predictors) |
|---|---|---|
| AUC | 0.897 | 0.903 |
| Accuracy | 0.823 | 0.828 |
| Sensitivity | 0.814 | 0.833 |
| Specificity | 0.831 | 0.824 |

## 7.5 Probability calibration

In some applications, it is important to predict the probability that an observation belongs to a specific classification outcome. Performance metrics such as accuracy, sensitivity, and specificity focus on the overall classification to one of two outcomes. Typical models are also used to provide a rank ordering of cases from highest to lowest probability. The predicted probabilities behind such classifications and rankings may not correspond to the actual frequencies of outcomes. In other words, the probabilities are not calibrated to the “true” probabilities. Some algorithms, such as support vector machines, boosted trees, and naïve Bayes, may be accurate in terms of classification, but not in terms of matching probabilities. Logistic regression has been shown to produce well-calibrated probabilities. In this section we will assess the calibration of logistic regression and present two calibration methods, Platt Scaling and Isotonic Regression.

Assessing calibration accuracy with binary outcomes can be done via a calibration plot of the observed fraction of positive outcomes versus the mean probability obtained from a model. A calibration plot for the success of grant applications example from the previous section was computed. Nodes were added to the workflow shown in Figure 7.9. The output of the H2O to Table node from that workflow was submitted to the following series of nodes (Figure 7.10 and Table 7.12).

Figure 7.10: Workflow for creating calibration plot.

Table 7.12: Node descriptions for calibration workflow.

| Node | Label | Description |
|---|---|---|
| 10 | Column Filter | Select columns needed for calibration: ROWNUM, actual class, predicted probability of success. |
| 11 | Numeric Binner | Create bins of probabilities from 0 to 1 in increments of .1. |
| 12 | Pivoting | Create pivot table of bins, count of successful and unsuccessful. |
| 13 | Column Expressions | Calculate fraction of success. |
| 14 | GroupBy | Calculate mean probability by bin. |
| 15 | Joiner | Join fraction of success and mean probability. |
| 16 | Excel Writer | Output results to Excel for charting. |
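The binning, counting, and averaging steps carried out by these nodes can be sketched in a single function. The predicted probabilities and outcomes below are hypothetical, and the function name is illustrative; each output row holds the bin limits, the number of cases, the mean predicted probability, and the observed fraction of successes.

```python
def calibration_table(probabilities, outcomes, n_bins=10):
    """Group cases into equal-width probability bins and summarize each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probabilities, outcomes):
        index = min(int(p * n_bins), n_bins - 1)   # p = 1.0 falls in the top bin
        bins[index].append((p, y))
    rows = []
    for index, members in enumerate(bins):
        if not members:                            # skip empty bins
            continue
        mean_probability = sum(p for p, _ in members) / len(members)
        fraction_success = sum(y for _, y in members) / len(members)
        rows.append((index / n_bins, (index + 1) / n_bins, len(members),
                     mean_probability, fraction_success))
    return rows

# Hypothetical predicted probabilities of success and actual outcomes (1 = success).
probabilities = [0.05, 0.12, 0.18, 0.55, 0.58, 0.93, 0.97, 1.0]
outcomes = [0, 0, 1, 1, 0, 1, 1, 1]

for low, high, count, mean_p, fraction in calibration_table(probabilities, outcomes):
    print(f"{low:.1f}-{high:.1f}  n = {count}  mean p = {mean_p:.3f}  "
          f"fraction of success = {fraction:.3f}")
```

Plotting the fraction of success against the mean probability for each bin gives the calibration plot; points close to the 45-degree line indicate well-calibrated probabilities.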

The logistic regression was fairly well calibrated, with some deviation in the mid-range of probabilities (Figure 7.11).

Figure 7.11: Calibration of logistic regression probabilities.

## 7.6 Evaluation of logistic regression

Logistic regression is one of the most used algorithms for predicting binary classes. It is easy to run and can handle large numbers of observations efficiently. As with ordinary regression, the number of observations in an analysis should be much greater than the number of predictors. Also, care should be taken to avoid overfitting (as is true of most supervised models). While logistic regression can be used to predict categorical targets with three or more levels, other models such as decision trees, neural nets, and k-nearest neighbors are better choices for such problems. In some cases, multi-class problems can be converted to just two levels so logistic regression can be meaningfully applied.

The H2O GLM regularization model available in KNIME was demonstrated with a fairly large data set. Regularization reduced the number of predictors significantly while maintaining overall model performance. The estimated probabilities from the model were generally aligned with the observed proportions.

Logistic regression assumes linearity between the log odds and the predictor variables. Interactions can be created, but this can cause computational problems. As discussed earlier, interpreting the coefficients on predictors is not straightforward since each coefficient reflects the change in log odds, which is difficult to intuit.

Finally, with some data samples logistic regression can result in failure to converge to a solution. One other issue is the potential for complete separation of the data by a single feature. In this situation, no weight can be estimated for the feature in question; it is essentially infinite.
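Complete separation can be illustrated with a tiny hypothetical data set in which every case with a negative predictor value is a 0 and every case with a positive value is a 1. The sketch below evaluates the log-likelihood of a model with a single slope and no intercept (a simplification made for brevity) at increasingly large slope values.

```python
import math

def log_likelihood(slope, x, y):
    """Log-likelihood of a no-intercept logistic model with the given slope."""
    total = 0.0
    for xi, yi in zip(x, y):
        z = slope * xi
        signed = z if yi == 1 else -z           # log p(y) = log sigmoid(signed)
        total += -math.log1p(math.exp(-signed))
    return total

# Perfectly separated data: the sign of x determines the outcome.
x = [-3, -2, -1, 1, 2, 3]
y = [0, 0, 0, 1, 1, 1]

slopes = (1, 2, 5, 10, 20)
values = [log_likelihood(b, x, y) for b in slopes]
for b, value in zip(slopes, values):
    print(f"slope = {b:>2}  log-likelihood = {value:.6f}")
```

The log-likelihood keeps rising toward zero as the slope grows, so no finite slope maximizes it. This is why an iterative fitting routine fails to converge, or reports an extremely large coefficient, when a feature separates the classes completely.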

### References

“2020 Retention Report.” 2020. Work Institute.
Azen, Razia, and Nicole Traxel. 2009. “Using Dominance Analysis to Determine Predictor Importance in Logistic Regression.” Journal of Educational and Behavioral Statistics 34: 319–47.
Salas-Eljatib, Christian, Timothy G. Gregoire, Andres Fuentes-Ramirez, and Valeska Yaitul. 2018. “A Study on the Effects of Unbalanced Data When Fitting Logistic Regression Models in Ecology.” Ecological Indicators, 502–8.
“Hr—Analytics-Employee-Turnover.” n.d. Work Institute.
Long, J. Scott, and Jeremy Freese. 2006. Regression Models for Categorical Dependent Variables Using Stata. 2nd ed. College Station, Texas: Stata Press.
Widmann, Maarit. n.d. “Cohen’s Kappa: What It Is, When to Use It, and How to Avoid Its Pitfalls.” https://thenewstack.io/cohens-kappa-what-it-is-when-to-use-it-and-how-to-avoid-its-pitfalls/

1. This example is for illustration only. Predictive analysis in a data mining context should not be used with such a small data set.