At 7:55 on a Monday morning, the conference room at FairBank’s corporate office was beginning to fill up with senior executives. The CEO of FairBank, Stephanie Treynor, had emailed her executive team the previous evening requesting to meet first thing in the morning. Her email included a link to a news article published over the weekend about the lending practices of MassBank, one of FairBank’s competitors in the industry. The article alleged that MassBank’s lending decisions were discriminatory and unfair since, in the previous calendar year, only 27% of Black loan applicants were granted loans by MassBank compared to 46% of White loan applicants.
FairBank had recently developed and implemented a data-driven algorithm that automates the process of approving small retail loans. Stephanie had greenlighted the deployment of the algorithm after a lengthy evaluation process demonstrated that it would boost FairBank’s small-loan profitability by enabling more accurate predictions of which applicants would repay their loan. The algorithm had other benefits as well: it allowed FairBank to re-assign experienced loan officers to work on larger loans and to grow the small-loan portfolio without a commensurate increase in staffing costs. These benefits notwithstanding, the news article made Stephanie wonder if the new algorithm, given ‘free rein’ of the small-loan approval process, may have inadvertently led to biases such as those uncovered at MassBank.
At 8:00 a.m. sharp, she entered the conference room and, without preamble, addressed the group: “That’s the kind of article I don’t want to see about us. Let’s get to work.”
She turned to the VP of Consumer Lending, Mark Chen, and asked him to give the group a quick overview of the new lending algorithm. According to Mark, the FairBank analytics team had developed a proprietary “loan-worthiness” metric based on relevant applicant data such as debt levels, monthly income, loan-to-discretionary-income ratio and past loan repayment history. Using this metric as the primary input variable, the team had built a model that could predict the probability of loan repayment with high accuracy.
The FairBank executive team had chosen a minimum repayment probability threshold for loan approval that balanced financial and other considerations. This translated to an equivalent threshold on the loan-worthiness metric, which made the automated system very easy to implement: applicants whose loan-worthiness exceeded the threshold were granted the loan; those who fell below the threshold were denied the loan.
Mark emphasized that a common threshold was used for all applicants. He also emphasized that protected attributes like race and gender were never used by the algorithm.
“So, Mark, you are telling us that if a White applicant and a Black applicant with the same data apply for a loan, we will treat them the same way — we will approve or deny both applicants, right?” asked Jen Chisholm, the VP of Marketing. When Mark nodded, she continued, “That seems pretty fair to me!”
“I’m not sure, Jen,” Mark responded. “For all we know, MassBank may have used an approach just like ours and they still got into trouble. I pulled numbers from last year for us to look at.”
| | All applicants | White applicants | Black applicants |
|---|---|---|---|
| Proportion of applicants who are approved | 0.43 | 0.54 | 0.32 |
The first executive to respond was Igor Krilov, the CFO. “Wow, our numbers aren’t any better than MassBank’s. We granted loans to only 32% of Black applicants compared to 54% of White applicants. We appear to have a significant disparity here.”
“Yes, the pattern looks similar,” Mark conceded, “but there’s a clear and innocuous explanation for that, Igor.”
As Igor listened intently, Mark continued, “If you don’t mind, let’s consider an artificially simple example to get at the crux of what’s going on here. Suppose applicants with a loan-worthiness metric of 800 almost always repay their loan while applicants with a loan-worthiness metric of 700, say, have a 50-50 chance of repaying. A good algorithm will figure this out from the data and would set our approval threshold at, say, 750.”
“Now suppose, unknown to the algorithm, all White applicants happen to have a loan-worthiness metric of 800 while all Black applicants happen to have a loan-worthiness metric of 700. Then every White applicant will be approved and every Black applicant will be denied. The loan approval rate will be 100% for White applicants and 0% for Black applicants.”
“Of course, in reality, the distribution of the loan-worthiness metric is more spread out than in my simplified example. But it is true that our Black loan applicants have lower loan-worthiness scores on average than White applicants, so I’m not surprised by these numbers.”
“Got it, makes sense, Mark,” Igor acknowledged.
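Mark's simplified example can be reproduced in a few lines. The sketch below is illustrative only: the eight made-up applicants and the `approve` and `approval_rate` helpers are hypothetical and do not describe FairBank's actual system. It applies one race-blind threshold of 750 to scores of 800 and 700, as in the example.

```python
# Toy reproduction of Mark's simplified example (hypothetical data).
THRESHOLD = 750

# Four made-up applicants per group; the group label is never used by approve().
applicants = (
    [{"group": "White", "score": 800}] * 4
    + [{"group": "Black", "score": 700}] * 4
)


def approve(score, threshold=THRESHOLD):
    """Single race-blind rule: approve when the score exceeds the threshold."""
    return score > threshold


def approval_rate(rows, group):
    """Share of applicants in `group` that the rule approves."""
    members = [r for r in rows if r["group"] == group]
    return sum(approve(r["score"]) for r in members) / len(members)


white_rate = approval_rate(applicants, "White")
black_rate = approval_rate(applicants, "Black")
print(white_rate, black_rate)
```

The rule never looks at the group label, yet the approval rates diverge completely because the scores themselves differ by group.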
“If our algorithm is blind to race, how were you able to generate these numbers, Mark?” demanded Stephanie. “We don’t ask applicants for their race, obviously, so how did you produce this report?”
“We don’t ask applicants for their race or for any protected attributes and we don’t use any such attributes in our algorithm,” Mark responded. “But we have lots of demographic data about the applicants from third-party sources. We only use it for analysis such as this one or for marketing purposes. And I repeat — we don’t use it in the algorithm.”
Jen cautioned, “Third-party data are often wrong, Mark, and there are lots of missing values. Are you sure this is not skewing your analysis?”
“In general, you’re correct, Jen. But not in this case. The race data are quite reliable and complete. I don’t think we have a data issue here,” replied Mark.
After a brief moment of silence, Igor said, “So it seems we’re well-prepared if we come under scrutiny by the press. We have a well-documented process that clearly doesn’t use race and we have a clear explanation for the disparity in approval rates.”
“We are not discussing this only for the sake of the press,” clarified Stephanie. “The important question is, are we doing the right thing by our customers? Are our processes fundamentally fair?”
She continued after a brief pause, “Is it enough that we don’t use protected attributes? Is there something else that we should be looking at?”
At this point the VP of Analytics, Salma Khan, cleared her throat. She was often quiet during meetings but when she spoke, everyone paid close attention since she almost always brought clarity to whatever was being debated.
“Look, if we have a perfect model, there’s no problem. We will know exactly who will repay and who will default and we can make perfect decisions and these decisions will be fair. The key issue here is that our model, or any model for that matter, isn’t perfect and will make errors.”
“We continually strive to improve our models but, let’s face it, we will never get to perfection,” Salma continued. “The best we can do, frankly, is to make sure that our model makes errors equitably.”
There was silence in the room for a moment.
“That is interesting, Salma. If I understand you correctly, our model shouldn’t make more mistakes for certain groups than for others,” said Stephanie. “But how do you define ‘error’?”
“Well, we can look at the model’s error rate!” Mark jumped in. “That’s just the percentage of applicants for whom the model predicts incorrectly. We did this analysis just prior to launching the algorithm and we can dive deeper into it to see if there are differences across applicant groups.”
“We need to be careful here, Mark,” Salma cautioned. “The overall error rate is a good starting point but we need to go deeper. There are two kinds of errors the model can make. It can deny applicants who should have been approved and it can approve applicants who should have been denied. The overall error rate lumps these two together. We need to examine them separately.”
She continued after a moment’s thought, “In fact, unless Black and White applicants are statistically identical in every variable we use as input to our algorithm, I cannot be confident that we make errors equitably.”
“I didn’t realize that there was so much complexity here!” Stephanie exclaimed. “Mark and Salma, can you analyze this and give me a full picture of the errors our model makes? We really need to unpack the details of what’s going on here. Otherwise, we won’t know if or how we need to mitigate.”
“One moment,” Igor interjected. “I understand the distinction between these two errors, Salma. But how can we measure the error of denying applicants who ‘should have been approved’? How do we know if they would have repaid or defaulted if we denied their loan?”
“That’s a great point, Igor,” Salma replied. “You’re asking about what we analytics types refer to as the counterfactual. I’m not sure if you recall, but when we started to develop the algorithm two years ago, we asked the team for permission to run an experiment where we approved loans to all applicants in a random sample of applicants.”
“Yes, I remember,” Igor replied. “You wanted to grant loans to some applicants who were clearly unlikely to repay. That was one hard sell!”
“Yes, it was a tough decision but we needed the data from that experiment to develop our algorithm without the constraints of the prior process,” said Salma.
She continued with a smile. “But now, guess what? The wisdom of this team in allowing us to run that experiment is about to pay off — we can use that very data to do the error analysis Stephanie wants!”
“I don’t know if it is foresight or luck, but that’s indeed good news, Salma!” said Stephanie. “Let’s schedule a meeting for Wednesday 4:00 p.m. for us to review your findings as a team.”
The next day, Salma emailed the following exhibits to the team ahead of their follow-up meeting:
At 4:00 p.m. on Wednesday, FairBank’s CEO Stephanie Treynor and her direct reports were assembled and ready.
Stephanie kicked off the meeting. “Salma, please walk us through the error analysis that you and Mark completed.”
“Well, this turned out to be quite an eye-opening exercise,” Salma said. “Let’s start with the algorithm’s error rate. That’s the fraction of applicants misclassified by the model. An applicant is misclassified if they default on a loan we approved or if they were denied a loan they would have repaid.” She then projected the following exhibit onto the screen.
| | All applicants | White applicants | Black applicants |
“It turns out our algorithm’s error rate is the same across Black and White applicants.”
“That’s a relief!” exclaimed Jen Chisholm, VP of Marketing.
“Not really, Jen, as you’ll see in a second,” cautioned Salma. “The overall error we just saw is a combination of the two types of errors I described in our previous meeting. Here’s the breakdown.”
| | All applicants | White applicants | Black applicants |
|---|---|---|---|
| Proportion of applicants who would have repaid their loan but are denied | 0.23 | 0.18 | 0.30 |
| Proportion of applicants who default on their loan but are approved | 0.10 | 0.13 | 0.08 |
“The first type of error is denying loans to applicants who would have repaid it. The numbers in the top row capture this. We commit this error more for Black applicants than for White applicants. The second type of error is approving loans to applicants who end up defaulting. That’s the bottom row. We commit this error less for Black applicants than for White applicants.”
“But here’s the thing. In both cases the bias is against the Black applicants.”
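A small sketch can show how two groups can share the same overall error rate while the two kinds of error differ. The counts below are invented for illustration and are not FairBank's data; the `Counts` class and the group names "A" and "B" are hypothetical. The sketch also assumes one plausible reading of the denominators: wrongful denials are measured among applicants who would have repaid, and wrongful approvals among applicants who default.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Outcome counts for one applicant group (hypothetical numbers)."""
    approved_repaid: int       # correct approvals
    denied_would_repay: int    # first error type: deserving applicants denied
    approved_defaulted: int    # second error type: approved applicants who default
    denied_would_default: int  # correct denials

    def error_rate(self) -> float:
        """Share of all applicants in the group who are misclassified."""
        total = (self.approved_repaid + self.denied_would_repay
                 + self.approved_defaulted + self.denied_would_default)
        return (self.denied_would_repay + self.approved_defaulted) / total

    def wrongly_denied_rate(self) -> float:
        """Share of would-be repayers who were denied (assumed denominator)."""
        return self.denied_would_repay / (
            self.approved_repaid + self.denied_would_repay)

    def wrongly_approved_rate(self) -> float:
        """Share of defaulters who were approved (assumed denominator)."""
        return self.approved_defaulted / (
            self.approved_defaulted + self.denied_would_default)


group_a = Counts(approved_repaid=40, denied_would_repay=10,
                 approved_defaulted=10, denied_would_default=40)
group_b = Counts(approved_repaid=20, denied_would_repay=15,
                 approved_defaulted=5, denied_would_default=60)

for name, g in [("A", group_a), ("B", group_b)]:
    print(name, round(g.error_rate(), 2),
          round(g.wrongly_denied_rate(), 2),
          round(g.wrongly_approved_rate(), 2))
```

In this invented data both groups are misclassified 20% of the time, yet group B is wrongly denied more often and wrongly approved less often than group A, which is the same pattern Salma describes.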
There was a long silence in the room. Everyone appeared to be waiting for Stephanie’s reaction.
Stephanie, who had been studying the numbers intently, finally spoke: “We denied 30% of all Black applicants who deserved a loan but only 18% of White applicants. These numbers are unsettling.”
She looked around the silent room and asked, “Why? Why did this happen?”
“Do we get fewer Black applicants than White applicants? Could this be biasing the algorithm?” Jen asked.
“That’s not the issue here,” Mark responded. “What matters is that the data mirror our applicant pool. I checked, and they do.”
Then Salma chimed in: “And we do have adequate data; our sample size is sufficiently large. What we’re seeing here is real disparity, not a statistical error or some other artifact of the data.”
After a brief pause, Salma spoke again. “Perhaps the simplified example Mark used in our last meeting can shed some light on this.”
Stephanie nodded for her to continue.
“In Mark’s example, the 750 threshold implies that all Black applicants, including all deserving applicants, are denied while none of the White applicants are denied. In that scenario, the error rate of denying loans to deserving applicants is 50% for Black applicants and 0% for White applicants! Of course, in reality, there is overlap in the distributions of loan-worthiness values. That’s why our numbers are a bit less extreme, but the bias exists nonetheless.”
She continued, “The bias is implicit in the sense that it is driven indirectly and unwittingly by the disparity in loan-worthiness values.”
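The 50% and 0% figures in the simplified example follow from a short calculation. The sketch below is a hedged illustration: it treats "almost always repay" as a repayment probability of 1.0 (an assumption), and it measures wrongful denials as a share of all applicants in each group. The `toy_groups` dictionary and the `share_wrongly_denied` helper are hypothetical names.

```python
THRESHOLD = 750

# (score, assumed probability of repayment) for each group in the toy example.
toy_groups = {
    "White": (800, 1.0),  # assumption: "almost always repay" treated as 1.0
    "Black": (700, 0.5),  # the 50-50 chance from the example
}


def share_wrongly_denied(score, p_repay, threshold=THRESHOLD):
    """Expected share of a group's applicants who would repay but are denied."""
    denied = not (score > threshold)
    # If the whole group is denied, the wrongly denied share equals p_repay.
    return p_repay if denied else 0.0


wrongly_denied = {
    group: share_wrongly_denied(score, p_repay)
    for group, (score, p_repay) in toy_groups.items()
}
print(wrongly_denied)
```

Measured instead as a share of only those applicants who would have repaid, every deserving Black applicant in the toy example is denied, so the choice of denominator matters when reading such figures.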
“I see,” Stephanie nodded. “The errors can vary across groups even though the algorithm is blind to race. It’s because the loan-worthiness metric correlates with race.”
At this point the CFO Igor Krilov spoke. “Interesting. A good loan officer would have also set the threshold at 750. We might have been facing this same problem whether or not we had been using an algorithm!” He then added, “Unless, of course, the loan officer can do better by incorporating information that is not easily accessible to the algorithm . . .”
“Okay, here’s what we are going to do next.” Stephanie’s decisiveness was legendary. “I want all of you to come up with concrete options about how to mitigate this problem. I understand there will be trade-offs. But we need to be proactive and get ahead of this issue. Let’s regroup Friday afternoon.”
Stephanie and her senior team convened at 3:30 p.m. on Friday as planned.
“Let’s get started. I am curious to hear your thoughts on how we can mitigate the bias problem,” Stephanie prompted.
Jen Chisholm, the VP of Marketing, spoke first: “A few months ago, we started to market to prospects from under-served communities who had good loan-worthiness values. Since disparities in loan-worthiness values in our applicant pools are a key factor here, this will help close the gap between our White and Black applicant pools. But, unfortunately, this will not be a quick fix. There are significant economic differences along demographic lines in the population of potential applicants which are unlikely to disappear in the short run.”
“I agree about it not being a quick fix, Jen, but we must absolutely give this higher priority now. Please get back to me with the budget you think you need to ramp up the marketing program,” Stephanie responded.
Then she turned to Mark Chen, VP of Consumer Lending. “Mark, what can we do with our current applicant pool?”
“There is no silver bullet, Stephanie,” Mark said. “If we want to ensure that the same percentage of deserving applicants are granted loans, regardless of their race, we need to use different loan-worthiness thresholds for Black and White applicants. In particular, we will need to use a lower threshold for Black applicants,” he concluded.
“Even if we do this, the other type of error we discussed in our last meeting — the percentage of undeserving applicants that we grant loans to — may still be unequal across Black and White applicants,” added Salma, the VP of Analytics.
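The trade-off Mark and Salma describe can be sketched with made-up numbers. Everything below is hypothetical: the scores, the 75% target, and the helper names are assumptions for illustration, not FairBank's data or method. For simplicity the sketch approves when a score is at or above the threshold. It picks, for each group, the highest threshold that still approves the target share of applicants who repay, then checks what happens to applicants who default.

```python
import math

# Hypothetical scores, split by eventual outcome (not FairBank data).
data = {
    "White": {"repaid": [820, 790, 760, 730], "defaulted": [770, 700, 680, 650]},
    "Black": {"repaid": [780, 740, 710, 690], "defaulted": [750, 720, 670, 640]},
}


def threshold_for_target(repayer_scores, target_rate):
    """Highest threshold that approves at least `target_rate` of repayers."""
    ranked = sorted(repayer_scores, reverse=True)
    needed = max(1, math.ceil(target_rate * len(ranked)))
    return ranked[needed - 1]


def approved_share(scores, threshold):
    """Share of the given scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


TARGET = 0.75  # assumed target share of deserving applicants to approve

thresholds = {g: threshold_for_target(d["repaid"], TARGET) for g, d in data.items()}
repayer_rates = {g: approved_share(d["repaid"], thresholds[g]) for g, d in data.items()}
defaulter_rates = {g: approved_share(d["defaulted"], thresholds[g]) for g, d in data.items()}

print(thresholds)
print(repayer_rates)
print(defaulter_rates)
```

In this invented data the Black threshold comes out lower than the White one, both groups see the same share of deserving applicants approved, and the share of defaulters approved still differs between the groups.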
There was silence in the room for a moment.
Then the CFO Igor Krilov spoke up. “Lowering our loan-worthiness threshold will hurt our profitability, Mark. Our default rates will go up.”
“It is not just profitability,” Jen pointed out. “By using different thresholds, we may be able to make it fair for each group as a whole, but we will no longer be fair at the individual level. We will be granting loans to Black applicants while denying loans to White applicants with the same (or better) loan-worthiness!”
“Wait a second, folks,” Stephanie interjected. “Since race is a protected attribute, if we use it to set different thresholds, aren’t we basically using it to make loan decisions? I thought that was illegal.”
“Good point, Stephanie,” Mark sighed. “I wish Jeff were here. He would know.”
Stephanie looked around the room. “Any other suggestions, team?”
“What about our modeling approach, Salma?” Igor asked. “I have been reading about these powerful AI models being used in the banking industry. Are we using the best models that are out there? This issue isn’t arising due to weak models, is it?”
“Our modeling approach isn’t the weak link here, Igor. I compared notes with my peers in other banks at the Analytics in Banking conference earlier this year, and it was pretty clear that we are state of the art,” Salma replied evenly.
She continued, “That said, continuously improving our modeling approach and the resulting predictions is the best long-term remedy. The fewer mistakes our model makes, the better our customers are served and the more profitable we are. And we should never forget that our model is only as good as the data it is built on. Perhaps we need to think more broadly about what additional data we are missing that could provide a fuller, more nuanced, picture of the loan-worthiness of our applicants.”
“Yes, we should give this higher priority as well, Salma,” Stephanie said. “Thanks for all your suggestions.” She got up, indicating that the meeting was over. “I need to process all this over the weekend, touch base with Jeff on Monday, and update the Board later in the week. I will keep you all posted.”
“Meanwhile, please continue making progress on the ideas we discussed. I am hopeful we will find a good solution. After all, it should be easier to fix an algorithm than to fix an idiosyncratic process subject to human biases.”