3 Structure of Data

In order to get familiar with the extent of the data and check, whether we have imported everything well, we examine the dimensions (extent) of the data set as well as the structure:

dim(data)

## [1] 1000000      64

str(data)

## tibble [1,000,000 × 64] (S3: tbl_df/tbl/data.frame)
##  $ Enterprise Flag                           : num [1:1000000] 1 2 1 1 2 2 2 2 1 1 ...
##  $ Record Number                             : num [1:1000000] 635011 3140543 485568 1577326 3333736 ...
##  $ US Postal State Code                      : chr [1:1000000] "36" "36" "19" "53" ...
##  $ Metropolitan Statistical Area (MSA) Code  : num [1:1000000] 40380 15380 11180 99999 14260 ...
##  $ County - 2010 Census                      : chr [1:1000000] "055" "029" "169" "045" ...
##  $ Census Tract - 2010 Census                : chr [1:1000000] "014203" "014208" "010300" "961200" ...
##  $ 2010 Census Tract - Percent Minority      : num [1:1000000] 21.66 4.12 6.16 11.27 18.22 ...
##  $ 2010 Census Tract - Median Income         : num [1:1000000] 64456 114696 82969 77500 63436 ...
##  $ Local Area Median Income                  : num [1:1000000] 71548 71482 84479 62046 67137 ...
##  $ Tract Income Ratio                        : num [1:1000000] 0.9 1.6 0.98 1.24 0.94 0.65 1.32 1.49 0.7 0.98 ...
##  $ Borrower’s (or Borrowers’) Annual Income  : num [1:1000000] 136000 94000 62000 141000 41000 42000 207000 61000 118000 91000 ...
##  $ Area Median Family Income (2019)          : num [1:1000000] 75800 76900 89400 64500 71700 67800 77800 80800 72900 87400 ...
##  $ Borrower Income Ratio                     : num [1:1000000] 1.79 1.22 0.69 2.18 0.57 0.61 2.66 0.75 1.61 1.04 ...
##  $ Acquisition Unpaid Principal Balance (UPB): num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
##  $ Purpose of Loan                           : num [1:1000000] 7 1 1 1 1 1 1 1 1 2 ...
##  $ Federal Guarantee                         : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Number of Borrowers                       : num [1:1000000] 2 2 1 2 2 1 2 1 2 2 ...
##  $ First-Time Home Buyer                     : num [1:1000000] 2 1 2 2 2 1 2 2 1 2 ...
##  $ Borrower Race1                            : num [1:1000000] 5 5 6 5 5 6 5 2 5 6 ...
##  $ Borrower Race2                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race3                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race4                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race5                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Ethnicity                        : num [1:1000000] 1 2 3 2 2 2 2 2 1 3 ...
##  $ Co-Borrower Race1                         : num [1:1000000] 5 5 9 5 5 9 5 9 5 6 ...
##  $ Co-Borrower Race2                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race3                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race4                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race5                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Ethnicity                     : num [1:1000000] 1 2 9 2 2 9 2 9 2 3 ...
##  $ Borrower Gender                           : num [1:1000000] 1 1 3 2 1 3 1 2 2 1 ...
##  $ Co-Borrower Gender                        : num [1:1000000] 2 2 4 1 2 9 2 9 1 2 ...
##  $ Age of Borrower                           : num [1:1000000] 4 2 2 3 5 1 4 3 4 3 ...
##  $ Age of Co-Borrower                        : num [1:1000000] 4 2 9 4 5 9 4 9 5 3 ...
##  $ Occupancy Code                            : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rate Spread                               : num [1:1000000] 0 0 0 2.7 0 0 0 0 0 0 ...
##  $ HOEPA Status                              : num [1:1000000] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Property Type                             : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lien Status                               : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Borrower Age 62 or older                  : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Co-Borrower Age 62 or older               : num [1:1000000] 0 0 9 0 0 9 0 9 0 0 ...
##  $ Loan-to-Value Ratio (LTV)                 : num [1:1000000] 80 95 79 90 27.1 ...
##  $ Date of Mortgage Note                     : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Term of Mortgage at Origination           : num [1:1000000] 120 360 360 360 360 360 360 360 360 360 ...
##  $ Number of Units                           : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Interest Rate at Origination              : num [1:1000000] 4.62 4.25 4.12 5.75 3.75 4.37 4.5 4.37 4.75 3.99 ...
##  $ Note Amount                               : num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
##  $ Preapproval                               : num [1:1000000] 9 9 9 2 9 9 9 9 1 2 ...
##  $ Application Channel                       : num [1:1000000] 1 3 9 1 2 1 1 1 1 1 ...
##  $ Automated Underwriting System (AUS)       : num [1:1000000] 1 1 2 1 2 2 1 2 1 1 ...
##  $ Credit Score Model - Borrower             : num [1:1000000] 3 99 1 2 1 3 99 2 2 2 ...
##  $ Credit Score Model - Co-Borrower          : num [1:1000000] 3 99 10 1 1 10 99 10 1 1 ...
##  $ Debt-to-Income (DTI) Ratio                : num [1:1000000] 46 20 20 40 20 38 20 49 30 30 ...
##  $ Discount Points                           : num [1:1000000] 0 1375 0 0 999999 ...
##  $ Introductory Rate Period                  : num [1:1000000] 999 999 999 999 999 999 999 999 999 999 ...
##  $ Manufactured Home – Land Property Interest: num [1:1000000] 9 5 5 5 5 5 5 5 5 9 ...
##  $ Property Value                            : num [1:1000000] 155000 275000 235000 505000 295000 135000 235000 195000 205000 235000 ...
##  $ Rural Census Tract                        : num [1:1000000] 0 0 0 1 0 1 0 0 0 0 ...
##  $ Lower Mississippi Delta County            : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Middle Appalachia County                  : num [1:1000000] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Persistent Poverty County                 : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Area of Concentrated Poverty              : num [1:1000000] 0 0 0 0 0 1 0 0 0 0 ...
##  $ High Opportunity Area                     : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Qualified Opportunity Zone (QOZ)          : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...

It becomes very obvious that all of the columns in the data set are purely of type numeric. That is because all the columns are encoded. We take a look at them thoroughly in later steps. For now, we can see there are 64 columns in total and 1 million rows. In the pdf-file with the description of each column we noticed that almost all our columns may have NA values, but they are recorded as different numbers or characters (e.g. 9, 9999, “00”). So, we now need to change these values to NA, because we will need it in further analysis.

We assume that not applicable means the same as not available. With the help of replace_with_na command from naniar package we can fill our columns with NA values really quick.

data <- data %>%
  replace_with_na(replace = list("US Postal State Code" = "00",
                                 "Metropolitan Statistical Area (MSA) Code" = 00000,
                                 "County - 2010 Census" = "000",
                                 "Census Tract - 2010 Census" = "000000",
                                 "2010 Census Tract - Percent Minority" = 9999.0,
                                 "2010 Census Tract - Median Income" = 999999,                                 
                                 "Local Area Median Income" = 999999,
                                 "Tract Income Ratio" = 9999.000,
                                 "Borrower’s (or Borrowers’) Annual Income" = 999999999,
                                 "Area Median Family Income (2019)" = 999999,
                                 "Borrower Income Ratio" = 9999.000,
                                  "Acquisition Unpaid Principal Balance (UPB)" = 999999999,
                                 "Purpose of Loan" = 9,
                                 "Number of Borrowers" = 99,
                                 "First-Time Home Buyer" = 9,
                                 "Borrower Race1" = c(6,7,9),
                                 "Borrower Race2" = c(6,7,9),
                                 "Borrower Race3" = c(6,7,9),
                                 "Borrower Race4" = c(6,7,9),
                                 "Borrower Race5" = c(6,7,9),
                                 "Borrower Ethnicity" = c(3,4,9),
                                 "Co-Borrower Race1" = c(6,7,9),
                                 "Co-Borrower Race2" = c(6,7,9),
                                 "Co-Borrower Race3" = c(6,7,9),
                                 "Co-Borrower Race4" = c(6,7,9),
                                 "Co-Borrower Race5" = c(6,7,9),
                                 "Co-Borrower Ethnicity" = c(3,4,9),
                                 "Borrower Gender" = c(3,4,9),
                                 "Co-Borrower Gender" = c(3,4,9),
                                 "Age of Borrower" = 9,
                                 "Age of Co-Borrower" = 9,
                                 "Occupancy Code" = 9,
                                 "Rate Spread" = 0,
                                 "HOEPA Status" = 9,
                                 "Property Type" = 9,
                                 "Lien Status" = 4,
                                 "Borrower Age 62 or older" = 9,
                                 "Co-Borrower Age 62 or older" = 9,
                                 "Loan-to-Value Ratio (LTV)" = 999.00,
                                 "Date of Mortgage Note" = 9,
                                 "Term of Mortgage at Origination" = 999,
                                 "Interest Rate at Origination" = 99.000,
                                 "Note Amount" = 999999999,
                                 "Preapproval" = 9,
                                 "Application Channel" = 9,
                                 "Automated Underwriting System (AUS)" = c(6,9),
                                 "Credit Score Model - Borrower" = c(9,99),
                                 "Credit Score Model - Co-Borrower" = c(9,99),
                                 "Debt-to-Income (DTI) Ratio" = 99,
                                 "Discount Points" = 999999,
                                 "Introductory Rate Period" = 999,
                                 "Manufactured Home – Land Property Interest" = c(5,9),
                                 "Property Value" = 999999999
                                 ))

We can further check whether the data is imported in the correct manner by checking some of the rows. So, let’s check the first, last and some random entries (25000 to 25005) of our data.

3.1 Row Check

3.1.1 Head

knitr::kable(
  head(data[, 1:8], 6), booktabs = TRUE,
  caption = "A table of the first 6 rows of the data set"
)

Table 3.1: A table of the first 6 rows of the data set
Enterprise Flag	Record Number	US Postal State Code	Metropolitan Statistical Area (MSA) Code	County - 2010 Census	Census Tract - 2010 Census	2010 Census Tract - Percent Minority	2010 Census Tract - Median Income
1	635011	36	40380	055	014203	21.66	64456
2	3140543	36	15380	029	014208	4.12	114696
1	485568	19	11180	169	010300	6.16	82969
1	1577326	53	99999	045	961200	11.27	77500
2	3333736	16	14260	027	020901	18.22	63436
2	2768127	45	17900	055	970602	31.51	43448

3.1.2 Tail

knitr::kable(
  tail(data[, 1:8], 6), booktabs = TRUE,
  caption = "A table of the last 6 rows of the data set"
)

Table 3.2: A table of the last 6 rows of the data set
Enterprise Flag	Record Number	US Postal State Code	Metropolitan Statistical Area (MSA) Code	County - 2010 Census	Census Tract - 2010 Census	2010 Census Tract - Percent Minority	2010 Census Tract - Median Income
1	1720431	19	36540	155	031601	4.16	103664
1	1431808	24	47900	021	752204	34.60	148553
1	310895	25	39300	005	610204	15.34	124142
2	3055430	25	14460	017	322300	14.03	94188
1	1944696	41	38900	051	008700	27.64	63850
1	793677	51	47900	153	901305	13.10	120909

3.1.3 Random Entries

knitr::kable(
  data[25000:25005, 1:8], booktabs = TRUE,
  caption = "A table of the random entries of the data set"
)

Table 3.3: A table of the random entries of the data set
Enterprise Flag	Record Number	US Postal State Code	Metropolitan Statistical Area (MSA) Code	County - 2010 Census	Census Tract - 2010 Census	2010 Census Tract - Percent Minority	2010 Census Tract - Median Income
2	2898977	37	16740	025	041502	18.26	77746
2	3868937	26	19820	093	743400	4.03	125859
1	1729803	35	10740	001	940600	67.61	90551
1	719544	06	12540	029	003204	51.54	129830
1	1908680	49	41620	035	113107	11.13	92339
2	3275519	39	17460	055	310700	2.75	103063