3 Structure of Data

In order to get familiar with the extent of the data and check, whether we have imported everything well, we examine the dimensions (extent) of the data set as well as the structure:

dim(data)
## [1] 1000000      64
str(data)
## tibble [1,000,000 × 64] (S3: tbl_df/tbl/data.frame)
##  $ Enterprise Flag                           : num [1:1000000] 1 2 1 1 2 2 2 2 1 1 ...
##  $ Record Number                             : num [1:1000000] 635011 3140543 485568 1577326 3333736 ...
##  $ US Postal State Code                      : chr [1:1000000] "36" "36" "19" "53" ...
##  $ Metropolitan Statistical Area (MSA) Code  : num [1:1000000] 40380 15380 11180 99999 14260 ...
##  $ County - 2010 Census                      : chr [1:1000000] "055" "029" "169" "045" ...
##  $ Census Tract - 2010 Census                : chr [1:1000000] "014203" "014208" "010300" "961200" ...
##  $ 2010 Census Tract - Percent Minority      : num [1:1000000] 21.66 4.12 6.16 11.27 18.22 ...
##  $ 2010 Census Tract - Median Income         : num [1:1000000] 64456 114696 82969 77500 63436 ...
##  $ Local Area Median Income                  : num [1:1000000] 71548 71482 84479 62046 67137 ...
##  $ Tract Income Ratio                        : num [1:1000000] 0.9 1.6 0.98 1.24 0.94 0.65 1.32 1.49 0.7 0.98 ...
##  $ Borrower’s (or Borrowers’) Annual Income  : num [1:1000000] 136000 94000 62000 141000 41000 42000 207000 61000 118000 91000 ...
##  $ Area Median Family Income (2019)          : num [1:1000000] 75800 76900 89400 64500 71700 67800 77800 80800 72900 87400 ...
##  $ Borrower Income Ratio                     : num [1:1000000] 1.79 1.22 0.69 2.18 0.57 0.61 2.66 0.75 1.61 1.04 ...
##  $ Acquisition Unpaid Principal Balance (UPB): num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
##  $ Purpose of Loan                           : num [1:1000000] 7 1 1 1 1 1 1 1 1 2 ...
##  $ Federal Guarantee                         : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Number of Borrowers                       : num [1:1000000] 2 2 1 2 2 1 2 1 2 2 ...
##  $ First-Time Home Buyer                     : num [1:1000000] 2 1 2 2 2 1 2 2 1 2 ...
##  $ Borrower Race1                            : num [1:1000000] 5 5 6 5 5 6 5 2 5 6 ...
##  $ Borrower Race2                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race3                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race4                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Race5                            : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Borrower Ethnicity                        : num [1:1000000] 1 2 3 2 2 2 2 2 1 3 ...
##  $ Co-Borrower Race1                         : num [1:1000000] 5 5 9 5 5 9 5 9 5 6 ...
##  $ Co-Borrower Race2                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race3                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race4                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Race5                         : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
##  $ Co-Borrower Ethnicity                     : num [1:1000000] 1 2 9 2 2 9 2 9 2 3 ...
##  $ Borrower Gender                           : num [1:1000000] 1 1 3 2 1 3 1 2 2 1 ...
##  $ Co-Borrower Gender                        : num [1:1000000] 2 2 4 1 2 9 2 9 1 2 ...
##  $ Age of Borrower                           : num [1:1000000] 4 2 2 3 5 1 4 3 4 3 ...
##  $ Age of Co-Borrower                        : num [1:1000000] 4 2 9 4 5 9 4 9 5 3 ...
##  $ Occupancy Code                            : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Rate Spread                               : num [1:1000000] 0 0 0 2.7 0 0 0 0 0 0 ...
##  $ HOEPA Status                              : num [1:1000000] 2 2 2 2 2 2 2 2 2 2 ...
##  $ Property Type                             : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lien Status                               : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Borrower Age 62 or older                  : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Co-Borrower Age 62 or older               : num [1:1000000] 0 0 9 0 0 9 0 9 0 0 ...
##  $ Loan-to-Value Ratio (LTV)                 : num [1:1000000] 80 95 79 90 27.1 ...
##  $ Date of Mortgage Note                     : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Term of Mortgage at Origination           : num [1:1000000] 120 360 360 360 360 360 360 360 360 360 ...
##  $ Number of Units                           : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Interest Rate at Origination              : num [1:1000000] 4.62 4.25 4.12 5.75 3.75 4.37 4.5 4.37 4.75 3.99 ...
##  $ Note Amount                               : num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
##  $ Preapproval                               : num [1:1000000] 9 9 9 2 9 9 9 9 1 2 ...
##  $ Application Channel                       : num [1:1000000] 1 3 9 1 2 1 1 1 1 1 ...
##  $ Automated Underwriting System (AUS)       : num [1:1000000] 1 1 2 1 2 2 1 2 1 1 ...
##  $ Credit Score Model - Borrower             : num [1:1000000] 3 99 1 2 1 3 99 2 2 2 ...
##  $ Credit Score Model - Co-Borrower          : num [1:1000000] 3 99 10 1 1 10 99 10 1 1 ...
##  $ Debt-to-Income (DTI) Ratio                : num [1:1000000] 46 20 20 40 20 38 20 49 30 30 ...
##  $ Discount Points                           : num [1:1000000] 0 1375 0 0 999999 ...
##  $ Introductory Rate Period                  : num [1:1000000] 999 999 999 999 999 999 999 999 999 999 ...
##  $ Manufactured Home – Land Property Interest: num [1:1000000] 9 5 5 5 5 5 5 5 5 9 ...
##  $ Property Value                            : num [1:1000000] 155000 275000 235000 505000 295000 135000 235000 195000 205000 235000 ...
##  $ Rural Census Tract                        : num [1:1000000] 0 0 0 1 0 1 0 0 0 0 ...
##  $ Lower Mississippi Delta County            : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Middle Appalachia County                  : num [1:1000000] 0 0 0 0 0 0 0 1 0 0 ...
##  $ Persistent Poverty County                 : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Area of Concentrated Poverty              : num [1:1000000] 0 0 0 0 0 1 0 0 0 0 ...
##  $ High Opportunity Area                     : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Qualified Opportunity Zone (QOZ)          : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...

It becomes very obvious that all of the columns in the data set are purely of type numeric. That is because all the columns are encoded. We take a look at them thoroughly in later steps. For now, we can see there are 64 columns in total and 1 million rows. In the pdf-file with the description of each column we noticed that almost all our columns may have NA values, but they are recorded as different numbers or characters (e.g. 9, 9999, “00”). So, we now need to change these values to NA, because we will need it in further analysis.

We assume that not applicable means the same as not available. With the help of replace_with_na command from naniar package we can fill our columns with NA values really quick.

data <- data %>%
  replace_with_na(replace = list("US Postal State Code" = "00",
                                 "Metropolitan Statistical Area (MSA) Code" = 00000,
                                 "County - 2010 Census" = "000",
                                 "Census Tract - 2010 Census" = "000000",
                                 "2010 Census Tract - Percent Minority" = 9999.0,
                                 "2010 Census Tract - Median Income" = 999999,                                 
                                 "Local Area Median Income" = 999999,
                                 "Tract Income Ratio" = 9999.000,
                                 "Borrower’s (or Borrowers’) Annual Income" = 999999999,
                                 "Area Median Family Income (2019)" = 999999,
                                 "Borrower Income Ratio" = 9999.000,
                                  "Acquisition Unpaid Principal Balance (UPB)" = 999999999,
                                 "Purpose of Loan" = 9,
                                 "Number of Borrowers" = 99,
                                 "First-Time Home Buyer" = 9,
                                 "Borrower Race1" = c(6,7,9),
                                 "Borrower Race2" = c(6,7,9),
                                 "Borrower Race3" = c(6,7,9),
                                 "Borrower Race4" = c(6,7,9),
                                 "Borrower Race5" = c(6,7,9),
                                 "Borrower Ethnicity" = c(3,4,9),
                                 "Co-Borrower Race1" = c(6,7,9),
                                 "Co-Borrower Race2" = c(6,7,9),
                                 "Co-Borrower Race3" = c(6,7,9),
                                 "Co-Borrower Race4" = c(6,7,9),
                                 "Co-Borrower Race5" = c(6,7,9),
                                 "Co-Borrower Ethnicity" = c(3,4,9),
                                 "Borrower Gender" = c(3,4,9),
                                 "Co-Borrower Gender" = c(3,4,9),
                                 "Age of Borrower" = 9,
                                 "Age of Co-Borrower" = 9,
                                 "Occupancy Code" = 9,
                                 "Rate Spread" = 0,
                                 "HOEPA Status" = 9,
                                 "Property Type" = 9,
                                 "Lien Status" = 4,
                                 "Borrower Age 62 or older" = 9,
                                 "Co-Borrower Age 62 or older" = 9,
                                 "Loan-to-Value Ratio (LTV)" = 999.00,
                                 "Date of Mortgage Note" = 9,
                                 "Term of Mortgage at Origination" = 999,
                                 "Interest Rate at Origination" = 99.000,
                                 "Note Amount" = 999999999,
                                 "Preapproval" = 9,
                                 "Application Channel" = 9,
                                 "Automated Underwriting System (AUS)" = c(6,9),
                                 "Credit Score Model - Borrower" = c(9,99),
                                 "Credit Score Model - Co-Borrower" = c(9,99),
                                 "Debt-to-Income (DTI) Ratio" = 99,
                                 "Discount Points" = 999999,
                                 "Introductory Rate Period" = 999,
                                 "Manufactured Home – Land Property Interest" = c(5,9),
                                 "Property Value" = 999999999
                                 ))

We can further check whether the data is imported in the correct manner by checking some of the rows. So, let’s check the first, last and some random entries (25000 to 25005) of our data.

3.1 Row Check

3.1.2 Tail

knitr::kable(
  tail(data[, 1:8], 6), booktabs = TRUE,
  caption = "A table of the last 6 rows of the data set"
)
Table 3.2: A table of the last 6 rows of the data set
Enterprise Flag Record Number US Postal State Code Metropolitan Statistical Area (MSA) Code County - 2010 Census Census Tract - 2010 Census 2010 Census Tract - Percent Minority 2010 Census Tract - Median Income
1 1720431 19 36540 155 031601 4.16 103664
1 1431808 24 47900 021 752204 34.60 148553
1 310895 25 39300 005 610204 15.34 124142
2 3055430 25 14460 017 322300 14.03 94188
1 1944696 41 38900 051 008700 27.64 63850
1 793677 51 47900 153 901305 13.10 120909

3.1.3 Random Entries

knitr::kable(
  data[25000:25005, 1:8], booktabs = TRUE,
  caption = "A table of the random entries of the data set"
)
Table 3.3: A table of the random entries of the data set
Enterprise Flag Record Number US Postal State Code Metropolitan Statistical Area (MSA) Code County - 2010 Census Census Tract - 2010 Census 2010 Census Tract - Percent Minority 2010 Census Tract - Median Income
2 2898977 37 16740 025 041502 18.26 77746
2 3868937 26 19820 093 743400 4.03 125859
1 1729803 35 10740 001 940600 67.61 90551
1 719544 06 12540 029 003204 51.54 129830
1 1908680 49 41620 035 113107 11.13 92339
2 3275519 39 17460 055 310700 2.75 103063