3 Structure of Data
In order to get familiar with the extent of the data and check, whether we have imported everything well, we examine the dimensions (extent) of the data set as well as the structure:
dim(data)
## [1] 1000000 64
str(data)
## tibble [1,000,000 × 64] (S3: tbl_df/tbl/data.frame)
## $ Enterprise Flag : num [1:1000000] 1 2 1 1 2 2 2 2 1 1 ...
## $ Record Number : num [1:1000000] 635011 3140543 485568 1577326 3333736 ...
## $ US Postal State Code : chr [1:1000000] "36" "36" "19" "53" ...
## $ Metropolitan Statistical Area (MSA) Code : num [1:1000000] 40380 15380 11180 99999 14260 ...
## $ County - 2010 Census : chr [1:1000000] "055" "029" "169" "045" ...
## $ Census Tract - 2010 Census : chr [1:1000000] "014203" "014208" "010300" "961200" ...
## $ 2010 Census Tract - Percent Minority : num [1:1000000] 21.66 4.12 6.16 11.27 18.22 ...
## $ 2010 Census Tract - Median Income : num [1:1000000] 64456 114696 82969 77500 63436 ...
## $ Local Area Median Income : num [1:1000000] 71548 71482 84479 62046 67137 ...
## $ Tract Income Ratio : num [1:1000000] 0.9 1.6 0.98 1.24 0.94 0.65 1.32 1.49 0.7 0.98 ...
## $ Borrower’s (or Borrowers’) Annual Income : num [1:1000000] 136000 94000 62000 141000 41000 42000 207000 61000 118000 91000 ...
## $ Area Median Family Income (2019) : num [1:1000000] 75800 76900 89400 64500 71700 67800 77800 80800 72900 87400 ...
## $ Borrower Income Ratio : num [1:1000000] 1.79 1.22 0.69 2.18 0.57 0.61 2.66 0.75 1.61 1.04 ...
## $ Acquisition Unpaid Principal Balance (UPB): num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
## $ Purpose of Loan : num [1:1000000] 7 1 1 1 1 1 1 1 1 2 ...
## $ Federal Guarantee : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Number of Borrowers : num [1:1000000] 2 2 1 2 2 1 2 1 2 2 ...
## $ First-Time Home Buyer : num [1:1000000] 2 1 2 2 2 1 2 2 1 2 ...
## $ Borrower Race1 : num [1:1000000] 5 5 6 5 5 6 5 2 5 6 ...
## $ Borrower Race2 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Borrower Race3 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Borrower Race4 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Borrower Race5 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Borrower Ethnicity : num [1:1000000] 1 2 3 2 2 2 2 2 1 3 ...
## $ Co-Borrower Race1 : num [1:1000000] 5 5 9 5 5 9 5 9 5 6 ...
## $ Co-Borrower Race2 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Co-Borrower Race3 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Co-Borrower Race4 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Co-Borrower Race5 : num [1:1000000] 9 9 9 9 9 9 9 9 9 9 ...
## $ Co-Borrower Ethnicity : num [1:1000000] 1 2 9 2 2 9 2 9 2 3 ...
## $ Borrower Gender : num [1:1000000] 1 1 3 2 1 3 1 2 2 1 ...
## $ Co-Borrower Gender : num [1:1000000] 2 2 4 1 2 9 2 9 1 2 ...
## $ Age of Borrower : num [1:1000000] 4 2 2 3 5 1 4 3 4 3 ...
## $ Age of Co-Borrower : num [1:1000000] 4 2 9 4 5 9 4 9 5 3 ...
## $ Occupancy Code : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Rate Spread : num [1:1000000] 0 0 0 2.7 0 0 0 0 0 0 ...
## $ HOEPA Status : num [1:1000000] 2 2 2 2 2 2 2 2 2 2 ...
## $ Property Type : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Lien Status : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Borrower Age 62 or older : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
## $ Co-Borrower Age 62 or older : num [1:1000000] 0 0 9 0 0 9 0 9 0 0 ...
## $ Loan-to-Value Ratio (LTV) : num [1:1000000] 80 95 79 90 27.1 ...
## $ Date of Mortgage Note : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Term of Mortgage at Origination : num [1:1000000] 120 360 360 360 360 360 360 360 360 360 ...
## $ Number of Units : num [1:1000000] 1 1 1 1 1 1 1 1 1 1 ...
## $ Interest Rate at Origination : num [1:1000000] 4.62 4.25 4.12 5.75 3.75 4.37 4.5 4.37 4.75 3.99 ...
## $ Note Amount : num [1:1000000] 125000 255000 175000 455000 85000 115000 185000 175000 155000 205000 ...
## $ Preapproval : num [1:1000000] 9 9 9 2 9 9 9 9 1 2 ...
## $ Application Channel : num [1:1000000] 1 3 9 1 2 1 1 1 1 1 ...
## $ Automated Underwriting System (AUS) : num [1:1000000] 1 1 2 1 2 2 1 2 1 1 ...
## $ Credit Score Model - Borrower : num [1:1000000] 3 99 1 2 1 3 99 2 2 2 ...
## $ Credit Score Model - Co-Borrower : num [1:1000000] 3 99 10 1 1 10 99 10 1 1 ...
## $ Debt-to-Income (DTI) Ratio : num [1:1000000] 46 20 20 40 20 38 20 49 30 30 ...
## $ Discount Points : num [1:1000000] 0 1375 0 0 999999 ...
## $ Introductory Rate Period : num [1:1000000] 999 999 999 999 999 999 999 999 999 999 ...
## $ Manufactured Home – Land Property Interest: num [1:1000000] 9 5 5 5 5 5 5 5 5 9 ...
## $ Property Value : num [1:1000000] 155000 275000 235000 505000 295000 135000 235000 195000 205000 235000 ...
## $ Rural Census Tract : num [1:1000000] 0 0 0 1 0 1 0 0 0 0 ...
## $ Lower Mississippi Delta County : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
## $ Middle Appalachia County : num [1:1000000] 0 0 0 0 0 0 0 1 0 0 ...
## $ Persistent Poverty County : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
## $ Area of Concentrated Poverty : num [1:1000000] 0 0 0 0 0 1 0 0 0 0 ...
## $ High Opportunity Area : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
## $ Qualified Opportunity Zone (QOZ) : num [1:1000000] 0 0 0 0 0 0 0 0 0 0 ...
It becomes very obvious that all of the columns in the data set are purely of type numeric. That is because all the columns are encoded. We take a look at them thoroughly in later steps. For now, we can see there are 64 columns in total and 1 million rows. In the pdf-file with the description of each column we noticed that almost all our columns may have NA values, but they are recorded as different numbers or characters (e.g. 9, 9999, “00”). So, we now need to change these values to NA, because we will need it in further analysis.
We assume that not applicable means the same as not available. With the help of replace_with_na
command from naniar
package we can fill our columns with NA values really quick.
<- data %>%
data replace_with_na(replace = list("US Postal State Code" = "00",
"Metropolitan Statistical Area (MSA) Code" = 00000,
"County - 2010 Census" = "000",
"Census Tract - 2010 Census" = "000000",
"2010 Census Tract - Percent Minority" = 9999.0,
"2010 Census Tract - Median Income" = 999999,
"Local Area Median Income" = 999999,
"Tract Income Ratio" = 9999.000,
"Borrower’s (or Borrowers’) Annual Income" = 999999999,
"Area Median Family Income (2019)" = 999999,
"Borrower Income Ratio" = 9999.000,
"Acquisition Unpaid Principal Balance (UPB)" = 999999999,
"Purpose of Loan" = 9,
"Number of Borrowers" = 99,
"First-Time Home Buyer" = 9,
"Borrower Race1" = c(6,7,9),
"Borrower Race2" = c(6,7,9),
"Borrower Race3" = c(6,7,9),
"Borrower Race4" = c(6,7,9),
"Borrower Race5" = c(6,7,9),
"Borrower Ethnicity" = c(3,4,9),
"Co-Borrower Race1" = c(6,7,9),
"Co-Borrower Race2" = c(6,7,9),
"Co-Borrower Race3" = c(6,7,9),
"Co-Borrower Race4" = c(6,7,9),
"Co-Borrower Race5" = c(6,7,9),
"Co-Borrower Ethnicity" = c(3,4,9),
"Borrower Gender" = c(3,4,9),
"Co-Borrower Gender" = c(3,4,9),
"Age of Borrower" = 9,
"Age of Co-Borrower" = 9,
"Occupancy Code" = 9,
"Rate Spread" = 0,
"HOEPA Status" = 9,
"Property Type" = 9,
"Lien Status" = 4,
"Borrower Age 62 or older" = 9,
"Co-Borrower Age 62 or older" = 9,
"Loan-to-Value Ratio (LTV)" = 999.00,
"Date of Mortgage Note" = 9,
"Term of Mortgage at Origination" = 999,
"Interest Rate at Origination" = 99.000,
"Note Amount" = 999999999,
"Preapproval" = 9,
"Application Channel" = 9,
"Automated Underwriting System (AUS)" = c(6,9),
"Credit Score Model - Borrower" = c(9,99),
"Credit Score Model - Co-Borrower" = c(9,99),
"Debt-to-Income (DTI) Ratio" = 99,
"Discount Points" = 999999,
"Introductory Rate Period" = 999,
"Manufactured Home – Land Property Interest" = c(5,9),
"Property Value" = 999999999
))
We can further check whether the data is imported in the correct manner by checking some of the rows. So, let’s check the first, last and some random entries (25000 to 25005) of our data.
3.1 Row Check
3.1.1 Head
::kable(
knitrhead(data[, 1:8], 6), booktabs = TRUE,
caption = "A table of the first 6 rows of the data set"
)
Enterprise Flag | Record Number | US Postal State Code | Metropolitan Statistical Area (MSA) Code | County - 2010 Census | Census Tract - 2010 Census | 2010 Census Tract - Percent Minority | 2010 Census Tract - Median Income |
---|---|---|---|---|---|---|---|
1 | 635011 | 36 | 40380 | 055 | 014203 | 21.66 | 64456 |
2 | 3140543 | 36 | 15380 | 029 | 014208 | 4.12 | 114696 |
1 | 485568 | 19 | 11180 | 169 | 010300 | 6.16 | 82969 |
1 | 1577326 | 53 | 99999 | 045 | 961200 | 11.27 | 77500 |
2 | 3333736 | 16 | 14260 | 027 | 020901 | 18.22 | 63436 |
2 | 2768127 | 45 | 17900 | 055 | 970602 | 31.51 | 43448 |
3.1.2 Tail
::kable(
knitrtail(data[, 1:8], 6), booktabs = TRUE,
caption = "A table of the last 6 rows of the data set"
)
Enterprise Flag | Record Number | US Postal State Code | Metropolitan Statistical Area (MSA) Code | County - 2010 Census | Census Tract - 2010 Census | 2010 Census Tract - Percent Minority | 2010 Census Tract - Median Income |
---|---|---|---|---|---|---|---|
1 | 1720431 | 19 | 36540 | 155 | 031601 | 4.16 | 103664 |
1 | 1431808 | 24 | 47900 | 021 | 752204 | 34.60 | 148553 |
1 | 310895 | 25 | 39300 | 005 | 610204 | 15.34 | 124142 |
2 | 3055430 | 25 | 14460 | 017 | 322300 | 14.03 | 94188 |
1 | 1944696 | 41 | 38900 | 051 | 008700 | 27.64 | 63850 |
1 | 793677 | 51 | 47900 | 153 | 901305 | 13.10 | 120909 |
3.1.3 Random Entries
::kable(
knitr25000:25005, 1:8], booktabs = TRUE,
data[caption = "A table of the random entries of the data set"
)
Enterprise Flag | Record Number | US Postal State Code | Metropolitan Statistical Area (MSA) Code | County - 2010 Census | Census Tract - 2010 Census | 2010 Census Tract - Percent Minority | 2010 Census Tract - Median Income |
---|---|---|---|---|---|---|---|
2 | 2898977 | 37 | 16740 | 025 | 041502 | 18.26 | 77746 |
2 | 3868937 | 26 | 19820 | 093 | 743400 | 4.03 | 125859 |
1 | 1729803 | 35 | 10740 | 001 | 940600 | 67.61 | 90551 |
1 | 719544 | 06 | 12540 | 029 | 003204 | 51.54 | 129830 |
1 | 1908680 | 49 | 41620 | 035 | 113107 | 11.13 | 92339 |
2 | 3275519 | 39 | 17460 | 055 | 310700 | 2.75 | 103063 |