6 Logistic regression

In this chapter we introduce logistic regression as a tool for building models when there is a categorical response variable with two levels, e.g. yes and no. Logistic regression is a type of generalized linear model (GLM) for response variables where regular multiple regression does not work very well. Ultimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.

To illustrate the examples in this chapter, we will use the resume dataset, which contains data from a study (Bertrand and Mullainathan 2004) that sought to understand the effect of race and gender on job application callback rates.

The variables in the resume dataset and their descriptions are in Table 6.1. The first five rows of the dataset are in Table 6.2.

Table 6.1: Codebook for the resume dataset.
Variable Description
job_ad_id Unique ID associated with the advertisement.
job_city City where the job was located.
job_industry Industry of the job.
job_type Type of role.
job_fed_contractor Indicator for if the employer is a federal contractor.
job_equal_opp_employer Indicator for if the employer is an Equal Opportunity Employer.
job_ownership The type of company, e.g. a nonprofit or a private company.
job_req_any Indicator for if any job requirements are listed. If so, the other job_req_* fields give more detail.
job_req_communication Indicator for if communication skills are required.
job_req_education Indicator for if some level of education is required.
job_req_min_experience Amount of experience required.
job_req_computer Indicator for if computer skills are required.
job_req_organization Indicator for if organization skills are required.
job_req_school Level of education required.
received_callback Indicator for if there was a callback from the job posting for the person listed on this resume.
firstname The first name used on the resume.
race Inferred race associated with the first name on the resume.
gender Inferred gender associated with the first name on the resume.
years_college Years of college education listed on the resume.
college_degree Indicator for if the resume listed a college degree.
honors Indicator for if the resume listed that the candidate has been awarded some honors.
worked_during_school Indicator for if the resume listed working while in school.
years_experience Years of experience listed on the resume.
computer_skills Indicator for if computer skills were listed on the resume. These skills were adapted for listings, though the skills were assigned independently of other details on the resume.
special_skills Indicator for if any special skills were listed on the resume.
volunteer Indicator for if volunteering was listed on the resume.
military Indicator for if military experience was listed on the resume.
employment_holes Indicator for if there were holes in the person’s employment history.
has_email_address Indicator for if the resume lists an email address.
resume_quality Each resume was generally classified as either lower or higher quality.
Table 6.2: First five rows of the resume dataset.
job_ad_id job_city job_industry job_type job_fed_contractor job_equal_opp_employer job_ownership job_req_any job_req_communication job_req_education job_req_min_experience job_req_computer job_req_organization job_req_school received_callback firstname race gender years_college college_degree honors worked_during_school years_experience computer_skills special_skills volunteer military employment_holes has_email_address resume_quality
384 Chicago manufacturing supervisor NA 1 unknown 1 0 0 5 1 0 none_listed 0 Allison white f 4 1 0 0 6 1 0 0 0 1 0 low
384 Chicago manufacturing supervisor NA 1 unknown 1 0 0 5 1 0 none_listed 0 Kristen white f 3 0 0 1 6 1 0 1 1 0 1 high
384 Chicago manufacturing supervisor NA 1 unknown 1 0 0 5 1 0 none_listed 0 Lakisha black f 4 1 0 1 6 1 0 0 0 0 0 low
384 Chicago manufacturing supervisor NA 1 unknown 1 0 0 5 1 0 none_listed 0 Latonya black f 3 0 0 0 6 1 1 1 0 1 1 high
385 Chicago other_service secretary 0 1 nonprofit 1 0 0 some 1 1 none_listed 0 Carrie white f 3 0 0 1 22 1 0 0 0 0 1 high


To evaluate which factors were important, job postings were identified in Boston and Chicago for the study, and researchers created many fake resumes to send off to these jobs to see which would elicit a callback. The study monitored job postings for several months during 2001 and 2002 and used this to build up a set of test cases. The researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate the resumes. Finally, they randomly assigned a name to each resume, where the name would imply the applicant’s gender and race.

The first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to male or female individuals, as well as Black or White individuals; other races and gender identities were not considered in this study. While no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment. For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted as a White male.

The response variable of interest is whether or not there was a callback from the employer for the applicant. We will consider 8 of the randomly assigned attributes, with special interest in the race and gender variables. Race and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions.

All of the attributes listed on each resume were randomly assigned. This means that no attributes that might be favorable or detrimental to employment would favor one demographic over another on these resumes. Importantly, due to the experimental nature of this study, we can infer causation between these variables and the callback rate, if the variable is statistically significant. Our analysis will allow us to compare the practical importance of each of the variables relative to each other.

6.1 The model

Logistic regression is a generalized linear model where the outcome is a two-level categorical variable. The outcome, \(Y_i\), takes the value 1 (in our application, this represents a callback for the resume) with probability \(p_i\) and the value 0 with probability \(1-p_i\). Because each observation has a slightly different context, e.g. different education level or a different number of years of experience, the probability \(p_i\) will differ for each observation. Ultimately, it is this probability that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.

Consider a sample of size \(n\) with a two-level categorical response variable, \(Y\), and \(k\) explanatory variables, \(X_1, X_2, \dots, X_k.\) Denote the values of the response variable by \((Y_1, Y_2, \dots, Y_n)\) and the values of the \(k\) explanatory variables by \[\begin{eqnarray} (X_{11}, X_{21}, X_{31}, \dots, X_{n1})\\ (X_{12}, X_{22}, X_{32}, \dots, X_{n2})\\ \vdots\\ (X_{1k}, X_{2k}, X_{3k}, \dots, X_{nk}) \end{eqnarray}\]

The logistic regression model that relates the probability of \(Y=1\) with the predictors \(X_1, X_2, \dots, X_k\) has the form \[f(\hat{p}) = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2X_2 + \cdots + \hat{b}_kX_k,\]

where \(f\) is a suitable transformation. We can write, for each case in the dataset: \[f(\hat{p}_i) = \hat{a} + \hat{b}_1 X_{i1} + \hat{b}_2X_{i2} + \cdots + \hat{b}_kX_{ik}, \quad i=1,2,\cdots,n.\]

We want to choose a transformation \(f\) that makes practical and mathematical sense. In particular, we want a transformation that makes the range of possibilities on the left-hand side of the equation match the range of possibilities on the right-hand side; without a transformation, the left-hand side could only take values between 0 and 1 (because it is a probability), while the right-hand side could take values outside of this range. A common transformation for \(p_i\) is the logit transformation, which is given by

\[logit(p_i) = \ln\left(\frac{p_i}{1-p_i}\right).\]

In R, the logit function is qlogis, and we can plot it with:

x <- seq(0,1,0.005)
plot(qlogis(x) ~ x, type = "l", xlab = "p", ylab = "logit(p)")

The first line in the code above creates a sequence of numbers that go from 0 to 1, in intervals of 0.005. Notice how the logit function maps the interval \((0,1)\) to \((-\infty,\infty)\).
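To see this mapping numerically, we can evaluate qlogis at a few values of \(p\) (a quick check, not part of the original analysis):

# Values of p near 0 map to large negative numbers; values near 1 map to large positive numbers
qlogis(c(0.001, 0.5, 0.999))   # approximately -6.91, 0, and 6.91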

In least squares (LS) regression, the coefficients were estimated based on the least squares criterion, and the optimization problem could be solved exactly (that is, we had formulas for the slopes and the intercept). In logistic regression, the coefficients are estimated via maximum likelihood, whose theory is beyond the scope of this course. There are no closed-form formulas for the coefficients \(\hat{a}\), \(\hat{b}_j\), \(j=1,\dots,k\); instead, they are found through an algorithm called Iteratively Reweighted Least Squares (IRLS).

In R, we use the glm function instead of the lm function to calculate the coefficients of a logistic regression. The glm function can also fit other types of generalized linear models that are not covered in this course.

Example 6.1 We start by fitting a model with a single predictor: honors. This variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month. In R, we run:

m <- glm(received_callback ~ honors, family = "binomial", data = resume)
m$coefficients
## (Intercept)      honors 
##  -2.4997953   0.8668269

Notice that the main difference between the syntax for lm and the syntax for glm is that we must provide a family for the generalized linear model. In the case of logistic regression, the family is “binomial”.

This gives the model \[\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.5 + 0.87\times honors.\]

The variable honors can take the value 1 (if the applicant had any type of honors listed on their resume) or 0 (if no honors are listed). So for candidates who had honors listed, \(\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.5 + 0.87 = -1.63.\) To estimate the probability that a candidate who lists honors receives a callback, we must solve for \(\hat{p}\):

\[\hat{p} = \frac{e^{-1.63}}{1+e^{-1.63}} = 0.16.\]

That is, the probability that a resume with honors resulted in a callback is 0.16 (16%). For a resume without honors, the callback probability is \[\hat{p} = \frac{e^{-2.5}}{1+e^{-2.5}} = 0.076.\]
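We can check these hand calculations in R with the predict function, using type = "response" so that predictions are returned on the probability scale. A minimal sketch, using the model m fitted above:

# Predicted callback probabilities for a resume with honors (1) and without honors (0)
predict(m, newdata = data.frame(honors = c(1, 0)), type = "response")
# approximately 0.16 and 0.076, matching the calculations above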

In general, to find \(\hat{p}\) in a logistic regression model, we need to apply the inverse of the logit function to the right-hand side of the model equation. The inverse of the logit function also has a special name: the logistic function (and now you know why we call this logistic regression). Specifically, \[logit^{-1}(x) = logistic(x) = \frac{e^x}{1+e^x}.\]

So for the logistic model, \[\hat{p} = logit^{-1}(\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k) = \frac{e^{\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k}}{1+ e^{\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k}}.\]
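In R, the logistic function is plogis, the inverse of qlogis. For example:

# plogis() applies the logistic function, undoing the logit transformation
plogis(-1.63)           # approximately 0.16, the callback probability with honors
plogis(qlogis(0.25))    # returns 0.25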

6.2 Multivariable logistic regression

In the previous example, we calculated the probability of a callback for the different levels of the variable honors. What if there are more predictors in the model? The main goal of this section is to walk through multivariable modeling with logistic regression.

Let’s use the following variables as predictors: job_city, college_degree, years_experience, honors, military, has_email_address, race, and gender:

m1 <- glm(received_callback ~ job_city + college_degree + years_experience +
            honors + military + has_email_address + race + gender,
          family = "binomial", data = resume)

We can look at just the coefficients of the model by running

m1$coefficients
##       (Intercept)   job_cityChicago    college_degree  years_experience 
##       -2.66318055       -0.44026675       -0.06664685        0.01998154 
##            honors          military has_email_address         racewhite 
##        0.76941762       -0.34216575        0.21826057        0.44241098 
##           genderm 
##       -0.18183541

We can also look at more information by running the summary function:

summary(m1)
## 
## Call:
## glm(formula = received_callback ~ job_city + college_degree + 
##     years_experience + honors + military + has_email_address + 
##     race + gender, family = "binomial", data = resume)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.66318    0.18196 -14.636  < 2e-16 ***
## job_cityChicago   -0.44027    0.11421  -3.855 0.000116 ***
## college_degree    -0.06665    0.12110  -0.550 0.582076    
## years_experience   0.01998    0.01021   1.957 0.050298 .  
## honors             0.76942    0.18581   4.141 3.46e-05 ***
## military          -0.34217    0.21569  -1.586 0.112657    
## has_email_address  0.21826    0.11330   1.926 0.054057 .  
## racewhite          0.44241    0.10803   4.095 4.22e-05 ***
## genderm           -0.18184    0.13757  -1.322 0.186260    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2726.9  on 4869  degrees of freedom
## Residual deviance: 2659.2  on 4861  degrees of freedom
## AIC: 2677.2
## 
## Number of Fisher Scoring iterations: 5

The structure of the summary table is similar to that of multiple LS regression; the only notable difference is that the p-values are calculated using a “z value” instead of a “t value”.

Just like in multiple LS regression, we can trim some variables from the model. For example, we can use the AIC (Akaike information criterion), looking for models with a lower AIC through a backward elimination strategy. The step function used in section 5.3 can also be used for logistic regression. Recall that we must omit missing values before running the step function. However, since we are only using 8 of the 30 predictors available, we should be careful not to use na.omit(resume), because this would drop every row that has a missing value in ANY of the 30 variables. To reduce the number of rows excluded, we can create a dataset with only the 8 predictors and the response variable:

resume2 <- resume[, c("job_city", "college_degree", "years_experience", "honors",
                      "military", "has_email_address", "race", "gender",
                      "received_callback")]

We then run the step function, omitting cases that have missing values in any of the 9 variables in the resume2 dataset:

full <- glm(received_callback ~ ., family = "binomial", data = na.omit(resume2))
step(full)
## Start:  AIC=2677.25
## received_callback ~ job_city + college_degree + years_experience + 
##     honors + military + has_email_address + race + gender
## 
##                     Df Deviance    AIC
## - college_degree     1   2659.6 2675.6
## - gender             1   2661.0 2677.0
## <none>                   2659.2 2677.2
## - military           1   2661.9 2677.9
## - has_email_address  1   2662.9 2678.9
## - years_experience   1   2662.9 2678.9
## - job_city           1   2674.1 2690.1
## - honors             1   2674.3 2690.3
## - race               1   2676.3 2692.3
## 
## Step:  AIC=2675.55
## received_callback ~ job_city + years_experience + honors + military + 
##     has_email_address + race + gender
## 
##                     Df Deviance    AIC
## <none>                   2659.6 2675.6
## - gender             1   2661.7 2675.7
## - military           1   2662.2 2676.2
## - has_email_address  1   2663.4 2677.4
## - years_experience   1   2663.5 2677.5
## - job_city           1   2674.2 2688.2
## - honors             1   2674.5 2688.5
## - race               1   2676.7 2690.7
## 
## Call:  glm(formula = received_callback ~ job_city + years_experience + 
##     honors + military + has_email_address + race + gender, family = "binomial", 
##     data = na.omit(resume2))
## 
## Coefficients:
##       (Intercept)    job_cityChicago   years_experience             honors  
##          -2.71616           -0.43642            0.02055            0.76341  
##          military  has_email_address          racewhite            genderm  
##          -0.34426            0.22208            0.44291           -0.19591  
## 
## Degrees of Freedom: 4869 Total (i.e. Null);  4862 Residual
## Null Deviance:       2727 
## Residual Deviance: 2660  AIC: 2676

Through backward elimination, the first (and only) variable eliminated was college_degree. Our final model is then:

final <- glm(received_callback ~ job_city + years_experience + honors + military +
               has_email_address + race + gender,
             family = "binomial", data = resume)
summary(final)
## 
## Call:
## glm(formula = received_callback ~ job_city + years_experience + 
##     honors + military + has_email_address + race + gender, family = "binomial", 
##     data = resume)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -2.71616    0.15510 -17.513  < 2e-16 ***
## job_cityChicago   -0.43642    0.11406  -3.826  0.00013 ***
## years_experience   0.02055    0.01015   2.024  0.04297 *  
## honors             0.76341    0.18525   4.121 3.77e-05 ***
## military          -0.34426    0.21571  -1.596  0.11050    
## has_email_address  0.22208    0.11301   1.965  0.04940 *  
## racewhite          0.44291    0.10802   4.100 4.13e-05 ***
## genderm           -0.19591    0.13520  -1.449  0.14733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2726.9  on 4869  degrees of freedom
## Residual deviance: 2659.5  on 4862  degrees of freedom
## AIC: 2675.5
## 
## Number of Fisher Scoring iterations: 5

This model can be written as: \[\begin{eqnarray} \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) &=& -2.716 - 0.436\times job\_cityChicago + 0.021\times years\_experience \\ &+& 0.763 \times honors - 0.344 \times military + 0.222 \times has\_email\_address \\ &+& 0.443 \times racewhite - 0.196\times genderm \end{eqnarray}\]

Example 6.2 We can use the model above to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years of experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male:

\[\begin{eqnarray} \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) &=& -2.716 - 0.436\times 1 + 0.021\times 14 + 0.763 \times 0\\ &-& 0.344 \times 0 + 0.222 \times 1 + 0.443 \times 1 - 0.196\times 1\\ &=& -2.389. \end{eqnarray}\] We can now back-solve for \(\hat{p}\): \[\hat{p} = \frac{e^{-2.389}}{1+e^{-2.389}} = 0.084.\]

That is, the chance such an individual will receive a callback is about 8.4%.

We can complete the same steps for an individual with the same characteristics whose name implies they are Black, where the only difference in the calculation is that the indicator variable racewhite takes the value 0. Doing so yields a probability of 0.055 (5.5%). The p-value for the slope of race is very small (nearly zero), which implies that this difference in callback probabilities for Black and White individuals is statistically significant.
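Both probabilities can also be obtained with predict. Here is a minimal sketch, assuming the categorical predictors are coded with the levels implied by the coefficient names above (e.g. race is "black" or "white", gender is "f" or "m"):

# Two hypothetical candidates who differ only in race
candidates <- data.frame(job_city = "Chicago", years_experience = 14,
                         honors = 0, military = 0, has_email_address = 1,
                         race = c("white", "black"), gender = "m")
predict(final, newdata = candidates, type = "response")
# approximately 0.084 and 0.055, matching the calculations above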

6.3 Interpretation

In the previous examples, we used logistic regression models to calculate the probability of a callback for resumes with certain characteristics. However, can we directly interpret the slopes of the model in some meaningful way? It turns out that there are three ways to use the values of the slopes in logistic regression directly for interpretation. Before we get to them, observe the following:

  • The quantity \(\left(\frac{\hat{p}}{1-\hat{p}}\right)\), called the odds, will be greater than 1 if the level \(Y=1\) (getting a callback) is more likely to happen, that is, if \(\hat{p} > 1-\hat{p}\); and it will be less than 1 if the level \(Y=0\) (not getting a callback) is more likely to happen, that is, if \(\hat{p} < 1-\hat{p}\). The odds will be equal to 1 if both levels of \(Y\) are equally likely to happen.

  • The quantity \(\ln\left(\frac{\hat{p}}{1-\hat{p}}\right)\), called the log odds, will be positive if the level \(Y=1\) is more likely to happen and it will be negative if the level \(Y=0\) is more likely to happen. The log odds will be 0 if both levels of \(Y\) are equally likely to happen.

With these observations in mind, we can provide the following interpretations for the slope of racewhite:

  1. Log odds interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have callback log odds that are higher by 0.443, on average, than resumes that are perceived to be from a Black person.

  2. Odds interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have callback odds that are higher by a factor of \(e^{0.443}=1.56\), on average, than resumes that are perceived to be from a Black person. (This multiplicative factor comparing two odds is called the odds ratio.)

  3. Sign interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have a higher callback probability, on average, than resumes that are perceived to be from a Black person.

Interpretations 1 and 2 use the specific value of the slope, while interpretation 3 is less specific and uses only the sign of the slope (if the slope were negative, we would have said “lower callback probability”).
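The factor \(e^{0.443}\) in interpretation 2 can be computed directly from the fitted model:

# Exponentiating a slope converts it from the log odds scale to an odds ratio
exp(coef(final)["racewhite"])   # approximately 1.56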

Finally, to address the main research question posed by the study, we conclude that race played a statistically significant role in whether a candidate received a callback, while gender did not, after taking the other resume characteristics in the model into account. In other words, this study found strong evidence of racial discrimination in employment opportunities.


  1. Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable Than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94(4): 991-1013.