6 Logistic regression
In this chapter we introduce logistic regression as a tool for building models when there is a categorical response variable with two levels, e.g. yes and no. Logistic regression is a type of generalized linear model (GLM) for response variables where regular multiple regression does not work very well. Ultimately, the application of a GLM will feel very similar to multiple regression, even if some of the details are different.
To illustrate the examples in this chapter, we will use the `resume` dataset, which has data from a study by Bertrand and Mullainathan (2004) that sought to understand the effect of race and gender on job application callback rates.
The variables in the `resume` dataset and their descriptions are in table 6.1. The first 5 rows of the dataset are in table 6.2.
Variable | Description |
---|---|
job_ad_id | Unique ID associated with the advertisement. |
job_city | City where the job was located. |
job_industry | Industry of the job. |
job_type | Type of role. |
job_fed_contractor | Indicator for if the employer is a federal contractor. |
job_equal_opp_employer | Indicator for if the employer is an Equal Opportunity Employer. |
job_ownership | The type of company, e.g. a nonprofit or a private company. |
job_req_any | Indicator for if any job requirements are listed. If so, the other job_req_* fields give more detail. |
job_req_communication | Indicator for if communication skills are required. |
job_req_education | Indicator for if some level of education is required. |
job_req_min_experience | Amount of experience required. |
job_req_computer | Indicator for if computer skills are required. |
job_req_organization | Indicator for if organization skills are required. |
job_req_school | Level of education required. |
received_callback | Indicator for if there was a callback from the job posting for the person listed on this resume. |
firstname | The first name used on the resume. |
race | Inferred race associated with the first name on the resume. |
gender | Inferred gender associated with the first name on the resume. |
years_college | Years of college education listed on the resume. |
college_degree | Indicator for if the resume listed a college degree. |
honors | Indicator for if the resume listed that the candidate has been awarded some honors. |
worked_during_school | Indicator for if the resume listed working while in school. |
years_experience | Years of experience listed on the resume. |
computer_skills | Indicator for if computer skills were listed on the resume. These skills were adapted for listings, though the skills were assigned independently of other details on the resume. |
special_skills | Indicator for if any special skills were listed on the resume. |
volunteer | Indicator for if volunteering was listed on the resume. |
military | Indicator for if military experience was listed on the resume. |
employment_holes | Indicator for if there were holes in the person’s employment history. |
has_email_address | Indicator for if the resume lists an email address. |
resume_quality | Each resume was generally classified as either lower or higher quality. |
job_ad_id | job_city | job_industry | job_type | job_fed_contractor | job_equal_opp_employer | job_ownership | job_req_any | job_req_communication | job_req_education | job_req_min_experience | job_req_computer | job_req_organization | job_req_school | received_callback | firstname | race | gender | years_college | college_degree | honors | worked_during_school | years_experience | computer_skills | special_skills | volunteer | military | employment_holes | has_email_address | resume_quality |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
384 | Chicago | manufacturing | supervisor | NA | 1 | unknown | 1 | 0 | 0 | 5 | 1 | 0 | none_listed | 0 | Allison | white | f | 4 | 1 | 0 | 0 | 6 | 1 | 0 | 0 | 0 | 1 | 0 | low |
384 | Chicago | manufacturing | supervisor | NA | 1 | unknown | 1 | 0 | 0 | 5 | 1 | 0 | none_listed | 0 | Kristen | white | f | 3 | 0 | 0 | 1 | 6 | 1 | 0 | 1 | 1 | 0 | 1 | high |
384 | Chicago | manufacturing | supervisor | NA | 1 | unknown | 1 | 0 | 0 | 5 | 1 | 0 | none_listed | 0 | Lakisha | black | f | 4 | 1 | 0 | 1 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | low |
384 | Chicago | manufacturing | supervisor | NA | 1 | unknown | 1 | 0 | 0 | 5 | 1 | 0 | none_listed | 0 | Latonya | black | f | 3 | 0 | 0 | 0 | 6 | 1 | 1 | 1 | 0 | 1 | 1 | high |
385 | Chicago | other_service | secretary | 0 | 1 | nonprofit | 1 | 0 | 0 | some | 1 | 1 | none_listed | 0 | Carrie | white | f | 3 | 0 | 0 | 1 | 22 | 1 | 0 | 0 | 0 | 0 | 1 | high |
To evaluate which factors were important, job postings were identified in Boston and Chicago for the study, and researchers created many fake resumes to send off to these jobs to see which would elicit a callback. The study monitored job postings for several months during 2001 and 2002 and used this to build up a set of test cases. The researchers enumerated important characteristics, such as years of experience and education details, and they used these characteristics to randomly generate the resumes. Finally, they randomly assigned a name to each resume, where the name would imply the applicant’s gender and race.
The first names that were used and randomly assigned in this experiment were selected so that they would predominantly be recognized as belonging to male or female individuals, as well as Black or White individuals; other races and gender identities were not considered in this study. While no name would definitively be inferred as pertaining to a Black individual or to a White individual, the researchers conducted a survey to check for racial association of the names; names that did not pass this survey check were excluded from usage in the experiment. For example, Lakisha was a name that their survey indicated would be interpreted as a Black woman, while Greg was a name that would generally be interpreted as a White male.
The response variable of interest is whether or not there was a callback from the employer for the applicant, and there were 8 attributes that were randomly assigned that we’ll consider, with special interest in the race and gender variables. Race and sex are protected classes in the United States, meaning they are not legally permitted factors for hiring or employment decisions.
All of the attributes listed on each resume were randomly assigned. This means that no attributes that might be favorable or detrimental to employment would favor one demographic over another on these resumes. Importantly, due to the experimental nature of this study, we can infer causation between these variables and the callback rate, if the variable is statistically significant. Our analysis will allow us to compare the practical importance of each of the variables relative to each other.
6.1 The model
Logistic regression is a generalized linear model where the outcome is a two-level categorical variable. The outcome, \(Y_i\), takes the value 1 (in our application, this represents a callback for the resume) with probability \(p_i\) and the value 0 with probability \(1-p_i\). Because each observation has a slightly different context, e.g. different education level or a different number of years of experience, the probability \(p_i\) will differ for each observation. Ultimately, it is this probability that we model in relation to the predictor variables: we will examine which resume characteristics correspond to higher or lower callback rates.
Consider a sample of size \(n\) with a two-level categorical response variable, \(Y\), and \(k\) explanatory variables, \(X_1, X_2, \dots, X_k.\) Denote the values of the response variable by \((Y_1, Y_2, \dots, Y_n)\) and the values of the \(k\) explanatory variables by \[\begin{eqnarray} (X_{11}, X_{21}, X_{31}, \dots, X_{n1})\\ (X_{12}, X_{22}, X_{32}, \dots, X_{n2})\\ \vdots\\ (X_{1k}, X_{2k}, X_{3k}, \dots, X_{nk}) \end{eqnarray}\]
The logistic regression model that relates the probability of \(Y=1\) with the predictors \(X_1, X_2, \dots, X_k\) has the form \[f(\hat{p}) = \hat{a} + \hat{b}_1 X_1 + \hat{b}_2X_2 + \cdots + \hat{b}_kX_k,\]
where \(f\) is a suitable transformation. We can write, for each case in the dataset: \[f(\hat{p}_i) = \hat{a} + \hat{b}_1 X_{i1} + \hat{b}_2X_{i2} + \cdots + \hat{b}_kX_{ik}, \quad i=1,2,\cdots,n.\]
We want to choose a transformation \(f\) that makes practical and mathematical sense. For example, we want a transformation that makes the range of possibilities on the left-hand side of the equation equal to the range of possibilities for the right-hand side; if there were no transformation for this equation, the left-hand side could only take values between 0 and 1 (because they are probabilities), but the right-hand side could take values outside of this range. A common transformation for \(p_i\) is the logit transformation, which is given by
\[logit(p_i) = \ln\left(\frac{p_i}{1-p_i}\right).\]
In R, the logit function is `qlogis`. To graph it, we first create a sequence of numbers that go from 0 to 1, in intervals of 0.005, and then plot the logit of each value. Notice how the logit function maps the interval \((0,1)\) to \((-\infty,\infty)\).
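The plotting code is not reproduced in this copy of the chapter. A minimal sketch that matches the description, assuming base R graphics (the object names `p` and `logit_p` are illustrative, not taken from the original), could look like this:

```r
# Sequence of probabilities from 0 to 1 in steps of 0.005
p <- seq(0, 1, by = 0.005)

# qlogis() is the logit: log(p / (1 - p)); it returns -Inf at 0 and Inf at 1
logit_p <- qlogis(p)

# Line plot of the logit; the non-finite end points are simply not drawn
plot(p, logit_p, type = "l", xlab = "p", ylab = "logit(p)")
```

The curve crosses zero at \(p = 0.5\) and grows without bound as \(p\) approaches 0 or 1.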
In LS regression, the coefficients were estimated based on the least squares criterion. In logistic regression, the coefficients are estimated by maximum likelihood, whose theory is beyond the scope of this course. The least squares optimization problem could be solved exactly (that is, we had formulas for the slopes and intercept). In logistic regression, there are no closed-form formulas for the coefficients \(\hat{a}\), \(\hat{b}_j\), \(j=1,\dots,k\), so they are computed numerically through an algorithm called Iteratively Reweighted Least Squares (IRLS).
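For readers curious about what IRLS does, here is a rough sketch on simulated data. It is an illustration only: the data, the fixed 25 iterations, and all object names are assumptions made for this example, and this is not how `glm` is implemented internally. Each pass solves a weighted least squares problem whose weights depend on the current fitted probabilities.

```r
set.seed(1)

# Simulated data (hypothetical, not the resume data)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 0.8 * x))

X <- cbind(1, x)   # design matrix with an intercept column
beta <- c(0, 0)    # starting values

for (iter in 1:25) {
  eta <- as.vector(X %*% beta)   # linear predictor
  mu  <- plogis(eta)             # fitted probabilities
  w   <- mu * (1 - mu)           # working weights
  z   <- eta + (y - mu) / w      # working response
  # Weighted least squares step: solve (X' W X) beta = X' W z
  beta <- as.vector(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
}

# Compare with the coefficients that glm() reports for the same data
fit <- glm(y ~ x, family = "binomial")
rbind(irls = beta, glm = unname(coef(fit)))
```

The two rows should agree to several decimal places, which is a useful sanity check that the loop carries out the same maximum likelihood fit.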
In R, we use the `glm` function instead of the `lm` function to calculate the coefficients of logistic regression. The `glm` function can also fit other types of generalized linear models that are not covered in this course.
Example 6.1 We start by fitting a model with a single predictor: `honors`. This variable indicates whether the applicant had any type of honors listed on their resume, such as employee of the month. In R, we run:
## (Intercept) honors
## -2.4997953 0.8668269
Notice that the main difference between the syntax used in `lm` and the one for `glm` is that we must provide a `family` for the generalized linear model. In the case of logistic regression, the family is “binomial”.
This gives the model \[\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.5 + 0.87\times honors.\]
The variable `honors` can take the values 1 (if the applicant had any type of honors listed on their resume) and 0 (if no honors are listed). So for candidates who had honors listed, \(\ln\left(\frac{\hat{p}}{1-\hat{p}}\right) = -2.5 + 0.87 = -1.63.\) To estimate the probability that a candidate who lists honors receives a callback, we must solve for \(\hat{p}\):
\[\hat{p} = \frac{e^{-1.63}}{1+e^{-1.63}} = 0.16.\]
That is, the probability that a resume with honors resulted in a callback is 0.16 (16%). For a resume without honors, the callback probability is \[\hat{p} = \frac{e^{-2.5}}{1+e^{-2.5}} = 0.076.\]
In general, to find \(\hat{p}\) in a logistic regression model, we need to apply the inverse of the logit function to the right-hand side of the model equation. The inverse of the logit function also has a special name: it is the logistic function (and now you know why we call this logistic regression). Specifically, \[logit^{-1}(x) = logistic(x) = \frac{e^x}{1+e^x}.\]
So for the logistic model, \[\hat{p} = logit^{-1}(\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k) = \frac{e^{\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k}}{1+ e^{\hat{a} + \hat{b}_1 X_1 + \cdots + \hat{b}_kX_k}}.\]
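The back-transformation in Example 6.1 can also be done in R with `plogis`, which is the logistic (inverse logit) function. The sketch below types in the two coefficients reported for the honors-only model rather than refitting it; the object names are illustrative.

```r
# Coefficients reported for the honors-only model in Example 6.1
a <- -2.4997953   # intercept
b <- 0.8668269    # slope for honors

# plogis() is the logistic (inverse logit) function: exp(x) / (1 + exp(x))
p_honors    <- plogis(a + b * 1)   # resume lists honors
p_no_honors <- plogis(a + b * 0)   # resume lists no honors

round(c(honors = p_honors, no_honors = p_no_honors), 3)
```

The two values agree with the hand calculations above, about 0.16 with honors and 0.076 without.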
6.2 Multivariable
In the previous example, we calculated the probability of a callback for the different levels of the variable `honors`. What if there are more predictors in the model? The main goal of this section is to walk through multivariable modeling with logistic regression.
Let’s use the following variables as predictors: `job_city`, `college_degree`, `years_experience`, `honors`, `military`, `has_email_address`, `race`, and `gender`:
m1 <- glm(received_callback ~ job_city + college_degree + years_experience + honors + military + has_email_address + race + gender, family = "binomial", data = resume)
We can look at just the coefficients of the model by running
## (Intercept) job_cityChicago college_degree years_experience
## -2.66318055 -0.44026675 -0.06664685 0.01998154
## honors military has_email_address racewhite
## 0.76941762 -0.34216575 0.21826057 0.44241098
## genderm
## -0.18183541
We can also look at more information by running the summary function:
##
## Call:
## glm(formula = received_callback ~ job_city + college_degree +
## years_experience + honors + military + has_email_address +
## race + gender, family = "binomial", data = resume)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.66318 0.18196 -14.636 < 2e-16 ***
## job_cityChicago -0.44027 0.11421 -3.855 0.000116 ***
## college_degree -0.06665 0.12110 -0.550 0.582076
## years_experience 0.01998 0.01021 1.957 0.050298 .
## honors 0.76942 0.18581 4.141 3.46e-05 ***
## military -0.34217 0.21569 -1.586 0.112657
## has_email_address 0.21826 0.11330 1.926 0.054057 .
## racewhite 0.44241 0.10803 4.095 4.22e-05 ***
## genderm -0.18184 0.13757 -1.322 0.186260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2726.9 on 4869 degrees of freedom
## Residual deviance: 2659.2 on 4861 degrees of freedom
## AIC: 2677.2
##
## Number of Fisher Scoring iterations: 5
The structure of the summary table is similar to that of multiple LS regression; the only notable difference is that the p-values are calculated using a “z value” instead of a “t value”.
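As a quick check on how the “z value” and p-value columns relate, the sketch below recomputes them for `racewhite` from the estimate and standard error printed in the table. It assumes the usual two-sided test against a standard normal distribution; the object names are illustrative.

```r
# Estimate and standard error for racewhite, copied from the summary table
estimate  <- 0.44241
std_error <- 0.10803

z_value <- estimate / std_error       # z statistic: estimate divided by its SE
p_value <- 2 * pnorm(-abs(z_value))   # two-sided p-value from the N(0, 1) curve

c(z = z_value, p = p_value)
```

Up to rounding of the inputs, this reproduces the 4.095 and 4.22e-05 shown in the `racewhite` row.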
Just like multiple LS regression, we could trim some variables from the model. For example, we can use the AIC (Akaike information criterion) and look for models with a lower AIC through a backward elimination strategy. The `step` function used in section 5.3 can also be used for logistic regression. Recall that we must omit missing values before running the `step` function. However, since we are using only 8 of the variables in the dataset as predictors, we should be careful not to use `na.omit(resume)`, because this will omit rows that have ANY empty cells in any of the 30 variables. To reduce the number of rows excluded, we can create a dataset with only the 8 predictors and the response variable:
resume2 <- resume[, c("job_city", "college_degree", "years_experience", "honors", "military", "has_email_address", "race", "gender", "received_callback")]
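To see why subsetting before calling `na.omit` matters, consider a small made-up data frame (`toy` and its columns are hypothetical and only serve this illustration). A column that is not part of the model, much like `job_fed_contractor` with its `NA` entries in table 6.2, can still cause rows to be dropped.

```r
# Hypothetical toy data: `unused` has missing values but is not in the model
toy <- data.frame(
  callback = c(0, 1, 0, 0, 1),
  honors   = c(1, 0, 0, 1, 0),
  unused   = c(NA, 3, NA, 5, 2)
)

nrow(na.omit(toy))                              # rows complete in every column
nrow(na.omit(toy[, c("callback", "honors")]))   # rows complete in the columns we use
```

The first call keeps only 3 of the 5 rows, while the second keeps all 5, because the missing values sit in a column the model never uses.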
We then run the `step` function, omitting cases that have missing values in any of the 9 variables in the `resume2` dataset:
## Start: AIC=2677.25
## received_callback ~ job_city + college_degree + years_experience +
## honors + military + has_email_address + race + gender
##
## Df Deviance AIC
## - college_degree 1 2659.6 2675.6
## - gender 1 2661.0 2677.0
## <none> 2659.2 2677.2
## - military 1 2661.9 2677.9
## - has_email_address 1 2662.9 2678.9
## - years_experience 1 2662.9 2678.9
## - job_city 1 2674.1 2690.1
## - honors 1 2674.3 2690.3
## - race 1 2676.3 2692.3
##
## Step: AIC=2675.55
## received_callback ~ job_city + years_experience + honors + military +
## has_email_address + race + gender
##
## Df Deviance AIC
## <none> 2659.6 2675.6
## - gender 1 2661.7 2675.7
## - military 1 2662.2 2676.2
## - has_email_address 1 2663.4 2677.4
## - years_experience 1 2663.5 2677.5
## - job_city 1 2674.2 2688.2
## - honors 1 2674.5 2688.5
## - race 1 2676.7 2690.7
##
## Call: glm(formula = received_callback ~ job_city + years_experience +
## honors + military + has_email_address + race + gender, family = "binomial",
## data = na.omit(resume2))
##
## Coefficients:
## (Intercept) job_cityChicago years_experience honors
## -2.71616 -0.43642 0.02055 0.76341
## military has_email_address racewhite genderm
## -0.34426 0.22208 0.44291 -0.19591
##
## Degrees of Freedom: 4869 Total (i.e. Null); 4862 Residual
## Null Deviance: 2727
## Residual Deviance: 2660 AIC: 2676
Through backwards elimination, the first (and only) variable eliminated was `college_degree`.
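The AIC values in the `step` output can be recovered from the deviances. For a logistic regression with a 0/1 response, the AIC equals the residual deviance plus twice the number of estimated coefficients. The short sketch below checks this with numbers copied from the output; the object names are illustrative.

```r
# Residual deviances copied from the step() output
dev_full    <- 2659.2   # all 8 predictors: 9 coefficients including the intercept
dev_reduced <- 2659.6   # after dropping college_degree: 8 coefficients

aic_full    <- dev_full + 2 * 9
aic_reduced <- dev_reduced + 2 * 8

c(full = aic_full, reduced = aic_reduced)
```

These match the 2677.2 and 2675.6 reported by `step`. The procedure stops after one removal because every further deletion, starting with `gender` at 2675.7, would raise the AIC above 2675.6.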
Our final model is then:
final <- glm(received_callback ~ job_city + years_experience + honors + military + has_email_address + race + gender, family = "binomial", data = resume)
summary(final)
##
## Call:
## glm(formula = received_callback ~ job_city + years_experience +
## honors + military + has_email_address + race + gender, family = "binomial",
## data = resume)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71616 0.15510 -17.513 < 2e-16 ***
## job_cityChicago -0.43642 0.11406 -3.826 0.00013 ***
## years_experience 0.02055 0.01015 2.024 0.04297 *
## honors 0.76341 0.18525 4.121 3.77e-05 ***
## military -0.34426 0.21571 -1.596 0.11050
## has_email_address 0.22208 0.11301 1.965 0.04940 *
## racewhite 0.44291 0.10802 4.100 4.13e-05 ***
## genderm -0.19591 0.13520 -1.449 0.14733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2726.9 on 4869 degrees of freedom
## Residual deviance: 2659.5 on 4862 degrees of freedom
## AIC: 2675.5
##
## Number of Fisher Scoring iterations: 5
This model can be written as: \[\begin{eqnarray} \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) &=& -2.716 - 0.436\times job\_cityChicago + 0.021\times years\_experience \\ &+& 0.763 \times honors - 0.344 \times military + 0.222 \times has\_email\_address \\ &+& 0.443 \times racewhite - 0.196\times genderm \end{eqnarray}\]
Example 6.2 We can use the model above to estimate the probability of receiving a callback for a job in Chicago where the candidate lists 14 years of experience, no honors, no military experience, includes an email address, and has a first name that implies they are a White male:
\[\begin{eqnarray} \ln\left(\frac{\hat{p}}{1-\hat{p}}\right) &=& -2.716 - 0.436\times 1 + 0.021\times 14 + 0.763 \times 0\\ &-& 0.344 \times 0 + 0.222 \times 1 + 0.443 \times 1 - 0.196\times 1\\ &=& -2.389. \end{eqnarray}\] We can now back-solve for \(\hat{p}\): \[\hat{p} = \frac{e^{-2.389}}{1+e^{-2.389}} = 0.084.\]
That is, the chance such an individual will receive a callback is about 8.4%.
We can complete the same steps for an individual with the same characteristics who is Black, where the only difference in the calculation is that the indicator variable `racewhite` will take a value of 0. Doing so yields a probability of 0.055 (5.5%). We see that the p-value for the slope of `racewhite` is very small (nearly zero), which implies that this difference in probabilities for Black and White individuals is statistically significant.
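The two probabilities in Example 6.2 can be reproduced in a few lines of R. The sketch below types in the rounded coefficients from the model equation rather than using the fitted object; the names `coefs`, `x_white`, and `x_black` are illustrative.

```r
# Rounded coefficients from the final model equation
coefs <- c(intercept = -2.716, chicago = -0.436, years_experience = 0.021,
           honors = 0.763, military = -0.344, email = 0.222,
           racewhite = 0.443, genderm = -0.196)

# Predictor values for the Example 6.2 applicant (the leading 1 multiplies the intercept)
x_white <- c(1, 1, 14, 0, 0, 1, 1, 1)
x_black <- x_white
x_black[7] <- 0   # same resume, racewhite indicator set to 0

p_white <- plogis(sum(coefs * x_white))
p_black <- plogis(sum(coefs * x_black))

round(c(white = p_white, black = p_black), 4)
```

With these rounded coefficients the first probability is about 0.084, and the second comes out near 0.056. Using the unrounded estimates from the `summary` output gives about 0.055, the figure quoted above.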
6.3 Interpretation
In the previous examples, we used logistic regression models to calculate the probability of a callback for resumes with certain characteristics. However, could we directly interpret the slope of the model in some meaningful way? It turns out that there are three ways to directly use the values of the slopes in logistic regression for interpretation. Before we get to them, observe the following:
The quantity \(\frac{\hat{p}}{1-\hat{p}}\), called the odds, will be greater than 1 if the level \(Y=1\) (getting a callback) is more likely to happen, that is, if \(\hat{p} > 1-\hat{p}\); and it will be less than 1 if the level \(Y=0\) (not getting a callback) is more likely to happen, that is, if \(\hat{p} < 1-\hat{p}\). The odds will be equal to 1 if both levels of \(Y\) are equally likely to happen.
The quantity \(\ln\left(\frac{\hat{p}}{1-\hat{p}}\right)\), called the log odds, will be positive if the level \(Y=1\) is more likely to happen and it will be negative if the level \(Y=0\) is more likely to happen. The log odds will be 0 if both levels of \(Y\) are equally likely to happen.
With these observations in mind, we can provide the following interpretations for the slope of `racewhite`:
1. Log odds interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have callback log odds that are higher by 0.443, on average, than resumes that are perceived to be from a Black person.
2. Odds ratio interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have callback odds that are higher by a factor of \(e^{0.443}=1.56\), on average, than resumes that are perceived to be from a Black person. This factor is the odds ratio between the two groups.
3. Sign interpretation. Holding all other variables constant, resumes that are perceived to be from a White person have a higher callback probability, on average, than resumes that are perceived to be from a Black person.
Interpretations 1 and 2 use the specific value of the slope, while interpretation 3 is less specific and uses only the sign of the slope (if the slope were negative we would have said “lower callback probability”).
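The factor \(e^{0.443}\) can be checked numerically with the two applicants from Example 6.2. The sketch below uses a small helper, `odds`, which is defined here for illustration and is not a built-in R function. It computes \(p/(1-p)\) for each applicant and compares the ratio of those two quantities with the exponentiated slope.

```r
# Helper (illustrative, not built in): converts a probability to p / (1 - p)
odds <- function(p) p / (1 - p)

eta_white <- -2.389               # log odds for the White applicant in Example 6.2
eta_black <- eta_white - 0.443    # same resume with racewhite = 0

p_white <- plogis(eta_white)
p_black <- plogis(eta_black)

odds(p_white) / odds(p_black)   # ratio of the two odds
exp(0.443)                      # the same number, about 1.56
p_white / p_black               # ratio of probabilities, a different quantity
```

The first two lines of output coincide, while the ratio of the probabilities themselves is smaller (about 1.51). The factor 1.56 therefore describes a multiplicative change in \(p/(1-p)\), not in the callback probability.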
Finally, to address the main research question posed by the study, we conclude that race played a statistically significant role in whether a candidate received a callback, while gender did not, after taking the other resume variables in the model into account. In other words, this study found strong evidence of racism in employment opportunities.
Bertrand M, Mullainathan S. 2004. Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination. The American Economic Review 94(4): 991-1013.