1 Introduction

In our project, we use the “Census Tract File”, which contains mortgage-level data on all single-family (1-4 unit) properties. The data set covers thousands of mortgages and includes mortgage-specific features such as the interest rate and the term, as well as borrower- and co-borrower-specific features such as rating, income, ethnicity, race and gender. We mainly follow the book Exploratory Data Analysis: A 10 Step Introduction, which serves as our guide throughout the EDA process.

Based on the available data, we decided on the following research question:

“Does the mortgage company discriminate against certain people?”

We would like to know whether mortgage companies are prejudiced or biased toward certain people, especially on the basis of non-economic factors such as gender or race. We think it is important that financial institutions stay objective and do not hinder people with different backgrounds from obtaining a mortgage.

For this purpose, we also developed the following sub-questions to narrow our focus and better answer our main research question:

  • Which financial and economic features in the data set have the highest importance regarding the mortgage purchase?
  • Did the borrower’s race have an impact on the loan amount eventually granted and on the respective interest rate? What other social factors influenced the Enterprises’ decisions?
  • Did people with higher annual income receive a higher interest rate for the mortgage?

To begin with, we clear any objects left over in the R workspace and then load the libraries needed for the data analysis. The first step, together with the default knitr chunk options for this report, is done with:

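# Remove all objects from the global environment
rm(list = ls())
# Hide warnings and messages in the rendered output and center all figures
knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.align = "center")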

The second step is done by loading the following libraries:

library(tidyverse)
library(DescTools)
library(gridExtra)
library(hexbin)
library(RColorBrewer)
library(naniar)
library(janitor)
library(scales)

We use the tidyverse to get access to a collection of packages such as dplyr and ggplot2, which help us clean and visualize the data. DescTools is a package built for descriptive statistics and EDA, with simple functions that give a quick overview of the data. gridExtra lets us arrange several ggplot2 plots in one window. hexbin provides the hexagonal binning that ggplot2 needs to draw hexagon plots. RColorBrewer offers ready-made color palettes and gradients and is a very helpful, underrated tool for EDA. naniar helps us handle missing values, for example by replacing values that stand for missing data with NA, while janitor cleans the column names of our table so that they are easy to use in the functions that plot relationships between variables. Finally, scales helps with formatting the units shown on plots.
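As a rough illustration of what naniar and janitor are used for, the following minimal sketch works on a toy data frame. The column names and the placeholder code 9 are hypothetical and chosen only for this example; they are not taken from the Census Tract File.

```r
library(naniar)
library(janitor)

# Toy data frame (hypothetical columns; 9 is an assumed "not available" code)
toy <- data.frame(
  "Borrower Income" = c(52000, 9, 87000),
  "Interest Rate"   = c(3.5, 4.1, 9),
  check.names = FALSE
)

# janitor::clean_names() converts the headers to snake_case
toy <- clean_names(toy)

# naniar::replace_with_na() replaces the listed values with NA, column by column
toy <- replace_with_na(
  toy,
  replace = list(borrower_income = 9, interest_rate = 9)
)

names(toy)
colSums(is.na(toy))
```

After these two steps the columns can be referred to without backticks, and the placeholder entries are treated as missing by functions such as mean(..., na.rm = TRUE).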