4 Decluttering
4.1 Decluttering rows
Next, we use the technique of decluttering to reduce the number of overall rows. We can only delete the rows which have no value for us. That means rows that either have no information for us because of only missing values or they are not relevant for our research question. In the first case, we can proceed by identifying the percentage of entries per row, which are actually NA. Afterwards, we plot these number to check how much data per row is missing and whether there are some extreme cases where a lot of data is missing.
<- apply(is.na(data),1,sum)/ncol(data)
na_rows options(scipen=999)
plot(na_rows, type="h", xlab = "Index", ylab = "NA Frequency")
From the chart above we can conclude that the severity of NA values is not huge, however, there seem to be a number of entries which miss more data than the majority of records. But overall, we have no records to delete. There are also no rows that are currently ‘not relevant’ so we can proceed to the decluttering columns step.
4.2 Decluttering columns
For decluttering the columns we will use again the same two criteria as previously mentioned. That means we are going to eliminate all columns with (almost) all entries missing and we are going to delete columns which only have one unique value.
Following the standard procedure, we create a table to help us with an overview of the missingness and uniqueness of values in our columns. First, we calculate the percentage of missing values per column.
<- as.matrix(data %>% is.na %>% colSums)
missing_vals <- round(100*missing_vals/nrow(data),2) missing_vals_rel
Then, we to do the same for the unique values in the columns:
<- t(as.matrix(data %>% summarise_all(n_distinct)))
unique_vals <- round(100*unique_vals/nrow(data),2) unique_vals_rel
Last but not least, we can summarize the findings in a small data summary table:
<- data.frame(missing_vals, missing_vals_rel, unique_vals, unique_vals_rel)
data_summary colnames(data_summary) <- c("NAs", "NAs_rel", "Unique", "Unique_rel")
data_summary
## NAs NAs_rel Unique Unique_rel
## Enterprise Flag 0 0.00 2 0.00
## Record Number 0 0.00 1000000 100.00
## US Postal State Code 0 0.00 54 0.01
## Metropolitan Statistical Area (MSA) Code 0 0.00 393 0.04
## County - 2010 Census 0 0.00 317 0.03
## Census Tract - 2010 Census 159 0.02 23161 2.32
## 2010 Census Tract - Percent Minority 423 0.04 9712 0.97
## 2010 Census Tract - Median Income 848 0.08 37191 3.72
## Local Area Median Income 848 0.08 1222 0.12
## Tract Income Ratio 848 0.08 346 0.03
## Borrower’s (or Borrowers’) Annual Income 0 0.00 1619 0.16
## Area Median Family Income (2019) 1 0.00 422 0.04
## Borrower Income Ratio 15 0.00 1984 0.20
## Acquisition Unpaid Principal Balance (UPB) 0 0.00 131 0.01
## Purpose of Loan 0 0.00 4 0.00
## Federal Guarantee 0 0.00 4 0.00
## Number of Borrowers 0 0.00 6 0.00
## First-Time Home Buyer 265 0.03 3 0.00
## Borrower Race1 141739 14.17 6 0.00
## Borrower Race2 986242 98.62 6 0.00
## Borrower Race3 999528 99.95 6 0.00
## Borrower Race4 999948 99.99 5 0.00
## Borrower Race5 999986 100.00 3 0.00
## Borrower Ethnicity 143473 14.35 3 0.00
## Co-Borrower Race1 602783 60.28 6 0.00
## Co-Borrower Race2 990779 99.08 6 0.00
## Co-Borrower Race3 999763 99.98 6 0.00
## Co-Borrower Race4 999977 100.00 5 0.00
## Co-Borrower Race5 999997 100.00 3 0.00
## Co-Borrower Ethnicity 609922 60.99 3 0.00
## Borrower Gender 79610 7.96 3 0.00
## Co-Borrower Gender 575553 57.56 3 0.00
## Age of Borrower 25 0.00 8 0.00
## Age of Co-Borrower 535309 53.53 8 0.00
## Occupancy Code 0 0.00 3 0.00
## Rate Spread 963661 96.37 298 0.03
## HOEPA Status 0 0.00 1 0.00
## Property Type 0 0.00 2 0.00
## Lien Status 0 0.00 1 0.00
## Borrower Age 62 or older 25 0.00 3 0.00
## Co-Borrower Age 62 or older 535309 53.53 3 0.00
## Loan-to-Value Ratio (LTV) 3 0.00 9168 0.92
## Date of Mortgage Note 0 0.00 2 0.00
## Term of Mortgage at Origination 0 0.00 156 0.02
## Number of Units 0 0.00 4 0.00
## Interest Rate at Origination 74 0.01 367 0.04
## Note Amount 74 0.01 131 0.01
## Preapproval 679221 67.92 3 0.00
## Application Channel 204345 20.43 4 0.00
## Automated Underwriting System (AUS) 11919 1.19 4 0.00
## Credit Score Model - Borrower 216866 21.69 4 0.00
## Credit Score Model - Co-Borrower 217164 21.72 5 0.00
## Debt-to-Income (DTI) Ratio 97 0.01 20 0.00
## Discount Points 294369 29.44 140862 14.09
## Introductory Rate Period 992288 99.23 11 0.00
## Manufactured Home – Land Property Interest 997977 99.80 4 0.00
## Property Value 74 0.01 369 0.04
## Rural Census Tract 0 0.00 2 0.00
## Lower Mississippi Delta County 0 0.00 2 0.00
## Middle Appalachia County 0 0.00 2 0.00
## Persistent Poverty County 0 0.00 2 0.00
## Area of Concentrated Poverty 0 0.00 2 0.00
## High Opportunity Area 0 0.00 2 0.00
## Qualified Opportunity Zone (QOZ) 0 0.00 2 0.00
Based on the table, we can see that there are some columns exhibiting 100 percent missing values or having only one unique value, such as Borrower Race5
, HOEPA Status
, or Introductory Rate Period
. We delete these columns from our data as there is no relevant information to be gained. Furthermore, we also delete columns that have an extremely high proportion of missing values of more than 96% as these columns would not be useful either.
<- rownames(data_summary)[which(data_summary$Unique==1)]
uniques <- rownames(data_summary)[which(data_summary$NAs_rel==100)]
full_NA <- rownames(data_summary)[which(data_summary$NAs_rel>96&data_summary$NAs_rel<100)]
high_NA
<- match(unique(c(uniques,full_NA,high_NA)),colnames(data)) # unique, because we can have some double counts
col_del_ind <- data[,-col_del_ind]
data
dim(data)
## [1] 1000000 51
Looking at the table above, we see that we have managed to drop 13 unnecessary columns, leaving us with remaining 51 columns that need to be further qualitatively assessed.