Chapter 3 Homelessness
3.1 Introduction
The 2020 point-in-time count of people experiencing homelessness in Seattle/King County was 11,751. This represents a 5% increase over the 2019 count and reflects a similar trend across many counties in the western U.S. A step toward addressing homelessness is improving our understanding of the relationship between local housing market factors and homelessness.
The U.S. Department of Housing and Urban Development (HUD) published a report in 2019, Market Predictors of Homelessness, that describes a model-based approach to understanding the relationship between local housing market factors and homelessness. Our project is motivated by the goals of the HUD study:
“To continue progressing toward the goals of ending and preventing homelessness, we must further our knowledge of the basic community-level determinants of homelessness. The primary objectives of this study are to (1) identify market factors that have established effects on homelessness, (2) construct and evaluate empirical models of community-level homelessness…”
We will investigate whether there are alternative modeling approaches that outperform the models described in the HUD report.
3.2 Data Collection
The data for this project are described in HUD’s report Market Predictors of Homelessness in the section titled DATA.
I will refer you to this section of the HUD report for a detailed description of the sources of the data and how they were processed.
3.2.1 Load necessary packages
#skimr provides a nice summary of a data set
library(skimr)
#leaps will be used for model selection
library(leaps)
#tidyverse contains packages we will use for processing and plotting data
library(tidyverse)
#readxl lets us read Excel files
library(readxl)
#GGally has a nice pairs plotting function
library(GGally)
#corrplot has nice plots for correlation matrices
library(corrplot)
#gridExtra
library(gridExtra)
#glmnet is used to fit glm's. It will be used for lasso and ridge regression models.
library(glmnet)
#tidymodels has a nice workflow for many models. We will use it for XGBoost
library(tidymodels)
#xgboost lets us fit XGBoost models
library(xgboost)
#vip is used to visualize the importance of predictors in XGBoost models
library(vip)
#Set the plotting theme
theme_set(theme_bw())
3.2.2 Examine the data dictionary
The data dictionary HUD TO3 - 05b Analysis File - Data Dictionary.xlsx contains descriptions of all variables in the data set.
\(\rightarrow\) Load the data dictionary (call it dictionary) and view its contents using the function View.
Show Coding Hint
Use the function read_xlsx
Show Answer
dictionary <- read_xlsx("HUD TO3 - 05b Analysis File - Data Dictionary.xlsx")
View(dictionary)
3.2.2.1 What are the sources of data?
\(\rightarrow\) Use the dictionary to find the unique sources of data that are not derived from other variables. We will assume that a variable is not derived if it has Derived equal to No and its Source or Root Variable does not start with the string See.
Show Coding Hint
You can use the str_detect function to determine whether a string includes a particular string as a component.
Show Answer
dictionary %>%
filter(Derived == "No" & str_detect(`Source or Root Variable`, "See") == FALSE) %>%
select(`Source or Root Variable`) %>%
unique()
## # A tibble: 26 x 1
## `Source or Root Variable`
## <chr>
## 1 HUD
## 2 HUD PIT
## 3 HUD System Performance Measures
## 4 HUD CoC Funding Reports
## 5 HUD HIC
## 6 Census Intercensal Population Estimates
## 7 Census ACS 5-Year Estimates
## 8 Federal Housing Finance Agency
## 9 BLS Local Area Unemployment Statistics (LAUS)
## 10 Census (County Area in Sq. Miles)
## # … with 16 more rows
What are the sources of most of the data?
\(\rightarrow\) Make a bar graph of the counts of the different data sources described in Source or Root Variable. Your graph should have the following features:
- Order the bars in descending order based on the count.
- Only include the 10 most common data sources.
- Orient the plot so that it is easy to read the labels.
Show Coding Hint
Use fct_reorder to reorder the bars.
Use head to limit the number of bars shown.
Show Answer
dictionary %>%
count(`Source or Root Variable`) %>%
mutate(`Source or Root Variable` = fct_reorder(`Source or Root Variable`, n)) %>%
arrange(desc(n)) %>% #Sort by count so that head() keeps the 10 most common sources
head(10) %>%
ggplot(aes(x = `Source or Root Variable`, y = n)) +
labs(x = "Source or Root Variable", y = "Count") +
geom_col() +
coord_flip()
The most common source of data is the Census.
\(\rightarrow\) What are the different Associated Domains of the variables?
Show Answer
dictionary %>%
count(`Associated Domain`) %>%
mutate(`Associated Domain` = fct_reorder(`Associated Domain`,n)) %>%
ggplot(aes(x = `Associated Domain`, y = n/sum(n))) +
geom_col() +
scale_y_continuous(labels = scales::percent) +
labs(x = "Associated Domain", y = "Percentage") +
coord_flip()
We see that there are 10 types of variables, plus an identifier. We also see that there are many demographic, housing, safety net, and economic variables. Demographic and housing variables make up more than 50% of the variables.
A big part of this project will be figuring out what variables to include in the analysis.
3.2.3 Data ethics
3.2.3.1 The problem
Is it possible to answer our question with data? Is the problem well-posed?
Can we cause harm to individuals with this project?
3.2.3.2 The data
Are individuals identified in the data set?
Are there issues of privacy or consent that we should address?
Are there sources of bias in the data?
We will discuss these issues in class.
3.3 Data Preparation
3.3.1 Load the data
The HUD data set is contained in the file 05b_analysis_file_update.csv.
\(\rightarrow\) Load the data set contained in the file 05b_analysis_file_update.csv and name the data frame df.
Show Coding Hint
Use read_csv
Show Answer
df <- read_csv("05b_analysis_file_update.csv")
##
## ── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## cocnumber = col_character(),
## state_abr = col_character(),
## dem_health_ins_acs5yr_2012 = col_logical()
## )
## ℹ Use `spec()` for the full column specifications.
Note that we used read_csv, rather than read.csv, because read_csv loads the data as a tibble.
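As a quick aside (a hedged illustration, reusing the same file), the two readers return different classes:
#read.csv returns a plain data.frame
class(read.csv("05b_analysis_file_update.csv"))
#read_csv returns a tibble, whose class vector includes "tbl_df" and "data.frame"
class(read_csv("05b_analysis_file_update.csv"))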
3.3.2 Explore the contents of the data set
\(\rightarrow\) Look at the first few rows of the data frame.
Show Coding Hint
You can use the functions head or glimpse to see the head of the data frame, or the function skim to get a nice summary.
Show Answer
head(df)
## # A tibble: 6 x 332
## year cocnumber pit_tot_shelt_pit_… pit_tot_unshelt_pit… pit_tot_hless_pit_… pit_ind_shelt_pit… pit_ind_unshelt_pi…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 AK-500 1113 118 1231 633 107
## 2 2011 AK-500 1082 141 1223 677 117
## 3 2012 AK-500 1097 50 1147 756 35
## 4 2013 AK-500 1070 52 1122 792 52
## 5 2014 AK-500 970 53 1023 688 48
## 6 2015 AK-500 1029 179 1208 679 158
## # … with 325 more variables: pit_ind_hless_pit_hud <dbl>, pit_perfam_shelt_pit_hud <dbl>,
## # pit_perfam_unshelt_pit_hud <dbl>, pit_perfam_hless_pit_hud <dbl>, pit_ind_chronic_hless_pit_hud <dbl>,
## # pit_perfam_chronic_hless_pit_hud <dbl>, pit_vet_hless_pit_hud <dbl>, econ_urb_urbanicity <dbl>, coctag <dbl>,
## # panelvar <dbl>, hou_pol_totalind_hud <dbl>, hou_pol_totalday_hud <dbl>, hou_pol_totalexit_hud <dbl>,
## # hou_pol_numret6mos_hud <dbl>, hou_pol_numret12mos_hud <dbl>, hou_pol_fedfundcoc <dbl>,
## # hou_pol_fund_project <dbl>, hou_pol_bed_es_hic_hud <dbl>, hou_pol_bed_oph_hic_hud <dbl>,
## # hou_pol_bed_psh_hic_hud <dbl>, hou_pol_bed_rrh_hic_hud <dbl>, hou_pol_bed_sh_hic_hud <dbl>,
## # hou_pol_bed_th_hic_hud <dbl>, hou_pol_perm_bed_hic_hud <dbl>, hou_pol_temp_bed_hic_hud <dbl>,
## # dem_pop_pop_census <dbl>, dem_pop_male_census <dbl>, dem_pop_female_census <dbl>, dem_pop_child_census <dbl>,
## # dem_pop_adult_census <dbl>, dem_pop_senior_census <dbl>, dem_soc_white_census <dbl>, dem_soc_black_census <dbl>,
## # dem_soc_native_census <dbl>, dem_soc_asian_census <dbl>, dem_soc_pacific_census <dbl>,
## # dem_soc_racetwo_census <dbl>, dem_soc_hispanic_census <dbl>, dem_age_boom_census <dbl>,
## # total_rent_inventory_acs5yr <dbl>, dem_soc_ed_bach_acs5yr <dbl>, dem_soc_ed_hsgrad_acs5yr <dbl>,
## # dem_soc_ed_lesshs_acs5yr <dbl>, dem_soc_ed_somecoll_acs5yr <dbl>, dem_soc_vet_acs5yr <dbl>,
## # dem_soc_singparent_acs5yr <dbl>, dem_health_ins_acs5yr <dbl>, econ_labor_topskilled_acs5yr <dbl>,
## # econ_labor_midskilled_acs5yr <dbl>, econ_labor_unskilled_acs5yr <dbl>, econ_labor_incineq_acs5yr <dbl>,
## # econ_labor_medinc_acs5yr <dbl>, econ_labor_pov_acs5yr <dbl>, dem_soc_singadult_acs5yr <dbl>,
## # econ_sn_cashasst_acs5yr <dbl>, hou_mkt_homeval_acs5yr <dbl>, hou_mkt_homeage_acs5yr <dbl>,
## # hou_mkt_homeage1940_acs5yr <dbl>, hou_mkt_medrent_acs5yr <dbl>, hou_mkt_medrentgross_acs5yr <dbl>,
## # hou_mkt_ovrcrowd_acs5yr <dbl>, hou_mkt_rentshare_acs5yr <dbl>, hou_mkt_rentvacancy_acs5yr <dbl>,
## # hou_mkt_utility_acs5yr <dbl>, hou_mkt_burden_own_acs5yr <dbl>, hou_mkt_burden_sev_own_acs5yr <dbl>,
## # hou_mkt_burden_rent_acs5yr <dbl>, hou_mkt_burden_sev_rent_acs5yr <dbl>, fhfa_hpi_flag <dbl>,
## # fhfa_hpi_2009 <dbl>, econ_labor_force_pop_BLS <dbl>, econ_labor_emp_pop_BLS <dbl>,
## # econ_labor_unemp_pop_BLS <dbl>, econ_labor_unemp_rate_BLS <dbl>, hou_mkt_area_census <dbl>,
## # dem_pop_density_census <dbl>, hou_mkt_density_census <dbl>, dem_pop_mig_census <dbl>,
## # hou_pol_hudunit_psh_hud <dbl>, hou_pol_occhudunit_psh_hud <dbl>, hou_mkt_pplunit_psh_hud <dbl>,
## # hou_pol_eli_psh_hud <dbl>, hou_mkt_pctoverhouse_psh_hud <dbl>, hou_mkt_units_census <dbl>,
## # hou_mkt_pmt_totbldg_census <dbl>, hou_mkt_pmt_totunit_census <dbl>, hou_mkt_pmt_totvalue_census <dbl>,
## # econ_labor_pov_pop_census <dbl>, econ_labor_medinc_census <dbl>, econ_sn_ssdi_SSA <dbl>,
## # econ_sn_ssdi_part_rate_SSA <dbl>, econ_sn_ssi_part_SSA <dbl>, econ_sn_ssi_pay_SSA <dbl>,
## # econ_sn_ssi_part_rate_SSA <dbl>, hou_mkt_evict_count <dbl>, hou_mkt_evict_file_count <dbl>,
## # hou_mkt_renter_count_evlab <dbl>, hou_mkt_evict_flag <dbl>, hou_mkt_evict_file_rate <dbl>,
## # hou_mkt_evict_rate <dbl>, …
We see that there are many columns in the data frame (332!), so it will not be easy to make sense of the data set without the data dictionary.
Refer to the data dictionary and the HUD report to understand what variables are present.
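For example, here is a minimal sketch of looking up a single variable in the dictionary (using the Variable column, which we also filter on later in the chapter):
#Pull the dictionary entry for the total homeless count
dictionary %>%
filter(Variable == "pit_tot_hless_pit_hud")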
3.3.2.1 Explore the columns
What are the variables?
What variable(s) do we want to predict?
What variables seem useful as predictors?
What predictor variables are redundant?
It will take a significant amount of work to understand the contents of the data set. We will discuss this in class.
3.3.2.2 Select a subset of columns
Below is a minimal set of variables to keep in the analysis. You should include other variables that might be useful as predictors. Provide an explanation for why you kept the variables.
\(\rightarrow\) Add variables to the list variable_names that we will keep for the analysis.
#Search through data dictionary to find other variables to include
<- c("year", "cocnumber",
variable_names
"pit_tot_hless_pit_hud", "pit_tot_shelt_pit_hud", "pit_tot_unshelt_pit_hud","dem_pop_pop_census",
"fhfa_hpi_2009", "ln_hou_mkt_medrent_xt", "hou_mkt_utility_xt", "hou_mkt_burden_own_acs5yr_2017", "hou_mkt_burden_sev_rent_acs_2017", "hou_mkt_rentshare_acs5yr_2017", "hou_mkt_rentvacancy_xt", "hou_mkt_density_dummy", "hou_mkt_evict_count", "hou_mkt_ovrcrowd_acs5yr_2017", "major_city", "suburban",
"econ_labor_unemp_rate_BLS", "econ_labor_incineq_acs5yr_2017", "econ_labor_pov_pop_census_share",
"hou_pol_hudunit_psh_hud_share", "hou_pol_occhudunit_psh_hud", "hou_mkt_homeage1940_xt",
"dem_soc_black_census", "dem_soc_hispanic_census", "dem_soc_asian_census", "dem_soc_pacific_census", "dem_pop_child_census", "dem_pop_senior_census", "dem_pop_female_census", "dem_pop_mig_census", "d_dem_pop_mig_census_share", "dem_soc_singadult_xt", "dem_soc_singparent_xt", "dem_soc_vet_xt", "dem_soc_ed_lessbach_xt", "dem_health_cost_dart",
"dem_health_excesdrink_chr",
"env_wea_avgtemp_noaa", "env_wea_avgtemp_summer_noaa", "env_wea_precip_noaa", "env_wea_precip_annual_noaa")
\(\rightarrow\) Select this subset of variables from the full data set. Call the new data frame df_small.
Coding Hint
Use the select function and your variable_names list. You should use all_of(variable_names). See https://tidyselect.r-lib.org/reference/faq-external-vector.html for an explanation of why we use all_of().
Show Answer
df_small <- df %>%
select(all_of(variable_names))
#See https://tidyselect.r-lib.org/reference/faq-external-vector.html for an explanation of
#why we use all_of()
\(\rightarrow\) Examine the head of the new, smaller data frame.
Show Answer
head(df_small)
## # A tibble: 6 x 43
## year cocnumber pit_tot_hless_pi… pit_tot_shelt_p… pit_tot_unshelt… dem_pop_pop_cen… fhfa_hpi_2009 ln_hou_mkt_medr…
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 AK-500 1231 1113 118 285194 0 NA
## 2 2011 AK-500 1223 1082 141 293370 0.00936 2.27
## 3 2012 AK-500 1147 1097 50 296291 -0.0491 NA
## 4 2013 AK-500 1122 1070 52 298520 -0.255 2.27
## 5 2014 AK-500 1023 970 53 301081 3.17 NA
## 6 2015 AK-500 1208 1029 179 300296 7.33 2.44
## # … with 35 more variables: hou_mkt_utility_xt <dbl>, hou_mkt_burden_own_acs5yr_2017 <dbl>,
## # hou_mkt_burden_sev_rent_acs_2017 <dbl>, hou_mkt_rentshare_acs5yr_2017 <dbl>, hou_mkt_rentvacancy_xt <dbl>,
## # hou_mkt_density_dummy <dbl>, hou_mkt_evict_count <dbl>, hou_mkt_ovrcrowd_acs5yr_2017 <dbl>, major_city <dbl>,
## # suburban <dbl>, econ_labor_unemp_rate_BLS <dbl>, econ_labor_incineq_acs5yr_2017 <dbl>,
## # econ_labor_pov_pop_census_share <dbl>, hou_pol_hudunit_psh_hud_share <dbl>, hou_pol_occhudunit_psh_hud <dbl>,
## # hou_mkt_homeage1940_xt <dbl>, dem_soc_black_census <dbl>, dem_soc_hispanic_census <dbl>,
## # dem_soc_asian_census <dbl>, dem_soc_pacific_census <dbl>, dem_pop_child_census <dbl>,
## # dem_pop_senior_census <dbl>, dem_pop_female_census <dbl>, dem_pop_mig_census <dbl>,
## # d_dem_pop_mig_census_share <dbl>, dem_soc_singadult_xt <dbl>, dem_soc_singparent_xt <dbl>, dem_soc_vet_xt <dbl>,
## # dem_soc_ed_lessbach_xt <dbl>, dem_health_cost_dart <dbl>, dem_health_excesdrink_chr <dbl>,
## # env_wea_avgtemp_noaa <dbl>, env_wea_avgtemp_summer_noaa <dbl>, env_wea_precip_noaa <dbl>,
## # env_wea_precip_annual_noaa <dbl>
\(\rightarrow\) Create a new dictionary for this subset of variables and use View to examine the contents.
Coding Hint
Use the filter function.
Show Answer
#Create the smaller data dictionary
dictionary_small <- dictionary %>%
filter(Variable %in% variable_names)
#View it
dictionary_small %>%
View()
How many variables of each Associated Domain are in the smaller data set?
\(\rightarrow\) Make a bar graph of the counts of the different Associated Domain values. Your graph should have the following features:
- Order the bars in descending order based on the count.
- Orient the plot so that it is easy to read the labels.
Show Answer
dictionary_small %>%
count(`Associated Domain`) %>%
mutate(`Associated Domain` = fct_reorder(`Associated Domain`,n)) %>%
ggplot(aes(x = `Associated Domain`, y = n)) +
geom_col() +
labs(x = "Associated Domain", y = "Count") +
coord_flip()
As with the full data frame, we have mostly demographic and housing variables.
3.3.3 Further exploration of basic properties
3.3.3.1 Check for a tidy data frame
In a tidy data set, each column is a variable or id and each row is an observation. It will take some work to determine whether the data set is tidy.
It was difficult to assess whether the full data frame was tidy because of the large number of columns with confusing names. We will do the analysis on the smaller data frame.
Show Answer
Each column is a variable or id and each row is an observation, so the data frame is tidy. We are benefiting from some of the pre-processing that HUD did with the data.
\(\rightarrow\) How many observations are in the data set?
Show Answer
nrow(df_small)
## [1] 3008
There are 3008 observations.
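This count is consistent with a balanced panel, assuming one row per CoC per year: 376 CoCs times 8 years (2010–2017) gives 3008 rows. A quick check:
#One row per CoC per year: 376 CoCs x 8 years = 3008
n_distinct(df_small$cocnumber) * n_distinct(df_small$year)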
3.3.4 Data cleaning
3.3.4.1 Rename variables
I used the data dictionary to create more readable names for the minimal set of variables. You should add in new names for the additional variables you included in the data set.
The data frame with renamed columns is called df_hud.
#Add your new names to this list
df_hud <- df_small %>%
rename(coc_number = dictionary_small$Variable[2],
total_sheltered = dictionary_small$Variable[3],
total_unsheltered = dictionary_small$Variable[4],
total_homeless = dictionary_small$Variable[5],
total_population = dictionary_small$Variable[6],
total_female_population = dictionary_small$Variable[7],
total_population_0_19 = dictionary_small$Variable[8],
total_population_65_plus = dictionary_small$Variable[9],
total_black = dictionary_small$Variable[10],
total_asian = dictionary_small$Variable[11],
total_pacific_islander = dictionary_small$Variable[12],
total_latino_hispanic = dictionary_small$Variable[13],
house_price_index_2009 = dictionary_small$Variable[14],
rate_unemployment = dictionary_small$Variable[15],
net_migration = dictionary_small$Variable[16],
HUD_unit_occupancy_rate = dictionary_small$Variable[17],
number_eviction = dictionary_small$Variable[18],
percentage_excessive_drinking = dictionary_small$Variable[19],
medicare_reimbursements_per_enrollee = dictionary_small$Variable[20],
average_summer_temperature = dictionary_small$Variable[21],
total_annual_precipitation = dictionary_small$Variable[22],
average_Jan_temperature = dictionary_small$Variable[23],
total_Jan_precipitation = dictionary_small$Variable[24],
gini_coefficient_2016 = dictionary_small$Variable[25],
poverty_rate = dictionary_small$Variable[26],
share_renters_2016 = dictionary_small$Variable[27],
share_overcrowded_units_2016 = dictionary_small$Variable[28],
percentage_owners_cost_burden_2016 = dictionary_small$Variable[29],
percentage_renters_severe_cost_burden_2016 = dictionary_small$Variable[30],
share_HUD_units = dictionary_small$Variable[31],
high_housing_density = dictionary_small$Variable[32],
share_built_before_1940 = dictionary_small$Variable[33],
utility_costs = dictionary_small$Variable[34],
rental_vacancy_rate = dictionary_small$Variable[35],
proportion_one_person_households = dictionary_small$Variable[36],
share_under_18_with_single_parent = dictionary_small$Variable[37],
share_veteran_status = dictionary_small$Variable[38],
log_median_rent = dictionary_small$Variable[39],
migration_4_year_change = dictionary_small$Variable[40],
share_no_bachelors = dictionary_small$Variable[41],
city_or_urban = dictionary_small$Variable[42],
suburban = dictionary_small$Variable[43]
)
\(\rightarrow\) Examine the head of the new data frame with the updated names:
Show Answer
head(df_hud)
## # A tibble: 6 x 43
## year coc_number total_homeless total_sheltered total_unsheltered total_population house_price_ind… log_median_rent
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2010 AK-500 1231 1113 118 285194 0 NA
## 2 2011 AK-500 1223 1082 141 293370 0.00936 2.27
## 3 2012 AK-500 1147 1097 50 296291 -0.0491 NA
## 4 2013 AK-500 1122 1070 52 298520 -0.255 2.27
## 5 2014 AK-500 1023 970 53 301081 3.17 NA
## 6 2015 AK-500 1208 1029 179 300296 7.33 2.44
## # … with 35 more variables: utility_costs <dbl>, percentage_owners_cost_burden_2016 <dbl>,
## # percentage_renters_severe_cost_burden_2016 <dbl>, share_renters_2016 <dbl>, rental_vacancy_rate <dbl>,
## # high_housing_density <dbl>, number_eviction <dbl>, share_overcrowded_units_2016 <dbl>, city_or_urban <dbl>,
## # suburban <dbl>, rate_unemployment <dbl>, gini_coefficient_2016 <dbl>, poverty_rate <dbl>, share_HUD_units <dbl>,
## # HUD_unit_occupancy_rate <dbl>, share_built_before_1940 <dbl>, total_black <dbl>, total_latino_hispanic <dbl>,
## # total_asian <dbl>, total_pacific_islander <dbl>, total_population_0_19 <dbl>, total_population_65_plus <dbl>,
## # total_female_population <dbl>, net_migration <dbl>, migration_4_year_change <dbl>,
## # proportion_one_person_households <dbl>, share_under_18_with_single_parent <dbl>, share_veteran_status <dbl>,
## # share_no_bachelors <dbl>, medicare_reimbursements_per_enrollee <dbl>, percentage_excessive_drinking <dbl>,
## # average_Jan_temperature <dbl>, average_summer_temperature <dbl>, total_Jan_precipitation <dbl>,
## # total_annual_precipitation <dbl>
\(\rightarrow\) Display the names of the columns of the new data frame:
Show Answer
names(df_hud)
## [1] "year" "coc_number"
## [3] "total_homeless" "total_sheltered"
## [5] "total_unsheltered" "total_population"
## [7] "house_price_index_2009" "log_median_rent"
## [9] "utility_costs" "percentage_owners_cost_burden_2016"
## [11] "percentage_renters_severe_cost_burden_2016" "share_renters_2016"
## [13] "rental_vacancy_rate" "high_housing_density"
## [15] "number_eviction" "share_overcrowded_units_2016"
## [17] "city_or_urban" "suburban"
## [19] "rate_unemployment" "gini_coefficient_2016"
## [21] "poverty_rate" "share_HUD_units"
## [23] "HUD_unit_occupancy_rate" "share_built_before_1940"
## [25] "total_black" "total_latino_hispanic"
## [27] "total_asian" "total_pacific_islander"
## [29] "total_population_0_19" "total_population_65_plus"
## [31] "total_female_population" "net_migration"
## [33] "migration_4_year_change" "proportion_one_person_households"
## [35] "share_under_18_with_single_parent" "share_veteran_status"
## [37] "share_no_bachelors" "medicare_reimbursements_per_enrollee"
## [39] "percentage_excessive_drinking" "average_Jan_temperature"
## [41] "average_summer_temperature" "total_Jan_precipitation"
## [43] "total_annual_precipitation"
Renaming the variables is a huge pain, but I find that it helps me to understand what the variables represent. This is useful for projects where you will spend a significant amount of time with the data.
3.3.5 Identify and deal with missing values
\(\rightarrow\) How many missing values are there in each column? Give the number of missing values and the percent of values in each column that are missing.
Show Coding Hint
The skim function shows the number of missing values and the completion rate for each column. You can also use skim_without_charts to get the same information without the graphs.
Show Answer
Recall that missing values may be coded with NA, or they may be empty. We want to convert empty cells to NA so that we can use the function is.na to find all missing values. The function read_csv that we used to read in the data does this automatically, so we do not need to take further action to deal with empty cells. Note that the default argument in read_csv that controls this is na = c("", "NA"), which means that empty cells or NA values are treated as NA.
skim_without_charts(df_hud)
Name | df_hud |
Number of rows | 3008 |
Number of columns | 43 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 42 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
coc_number | 0 | 1 | 6 | 7 | 0 | 376 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2013.50 | 2.29 | 2010.00 | 2011.75 | 2013.50 | 2015.25 | 2017.00 |
total_homeless | 14 | 1.00 | 1560.43 | 4313.39 | 7.00 | 320.25 | 679.00 | 1468.00 | 76501.00 |
total_sheltered | 14 | 1.00 | 1033.85 | 3445.81 | 3.00 | 224.00 | 445.50 | 961.50 | 72565.00 |
total_unsheltered | 14 | 1.00 | 526.58 | 1742.31 | 0.00 | 36.25 | 114.00 | 418.25 | 42828.00 |
total_population | 0 | 1.00 | 837457.01 | 1217309.59 | 29344.00 | 266710.75 | 481798.50 | 896909.25 | 11058958.00 |
house_price_index_2009 | 0 | 1.00 | -2.74 | 10.14 | -26.50 | -8.61 | -3.28 | 0.00 | 63.86 |
log_median_rent | 1504 | 0.50 | 1.97 | 0.31 | 1.24 | 1.73 | 1.94 | 2.16 | 2.87 |
utility_costs | 1504 | 0.50 | 14.66 | 3.10 | 5.60 | 12.60 | 14.80 | 16.90 | 22.60 |
percentage_owners_cost_burden_2016 | 0 | 1.00 | 14.87 | 3.04 | 8.07 | 12.48 | 14.39 | 17.08 | 22.96 |
percentage_renters_severe_cost_burden_2016 | 0 | 1.00 | 25.54 | 3.86 | 16.38 | 22.86 | 25.00 | 27.76 | 37.65 |
share_renters_2016 | 0 | 1.00 | 34.55 | 8.23 | 15.10 | 28.64 | 33.52 | 39.66 | 69.22 |
rental_vacancy_rate | 1504 | 0.50 | 7.15 | 2.98 | 1.92 | 5.20 | 6.60 | 8.40 | 24.63 |
high_housing_density | 0 | 1.00 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.25 | 1.00 |
number_eviction | 0 | 1.00 | 2501.98 | 4314.93 | 0.00 | 96.75 | 935.50 | 2693.00 | 37105.00 |
share_overcrowded_units_2016 | 0 | 1.00 | 2.70 | 1.95 | 0.66 | 1.51 | 2.04 | 3.04 | 12.26 |
city_or_urban | 0 | 1.00 | 0.28 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
suburban | 0 | 1.00 | 0.43 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
rate_unemployment | 0 | 1.00 | 7.48 | 2.70 | 2.51 | 5.51 | 7.18 | 8.94 | 28.86 |
gini_coefficient_2016 | 0 | 1.00 | 45.52 | 2.85 | 37.56 | 43.70 | 45.40 | 46.99 | 54.78 |
poverty_rate | 0 | 1.00 | 14.42 | 4.46 | 3.35 | 11.45 | 14.32 | 17.31 | 35.79 |
share_HUD_units | 0 | 1.00 | 3.65 | 1.92 | 0.09 | 2.32 | 3.30 | 4.69 | 13.96 |
HUD_unit_occupancy_rate | 0 | 1.00 | 92.60 | 3.99 | 69.44 | 90.77 | 93.01 | 95.15 | 100.00 |
share_built_before_1940 | 1504 | 0.50 | 13.46 | 10.78 | 0.22 | 4.88 | 10.11 | 20.13 | 56.88 |
total_black | 0 | 1.00 | 103606.65 | 202325.49 | 152.00 | 10567.00 | 36742.00 | 109931.25 | 1956449.00 |
total_latino_hispanic | 0 | 1.00 | 142104.15 | 407635.29 | 224.00 | 14898.50 | 35744.50 | 92569.50 | 4924272.00 |
total_asian | 0 | 1.00 | 42375.38 | 117157.23 | 80.00 | 4291.75 | 11239.50 | 30542.50 | 1474987.00 |
total_pacific_islander | 0 | 1.00 | 1388.47 | 5558.70 | 1.00 | 117.00 | 259.00 | 642.25 | 89016.00 |
total_population_0_19 | 0 | 1.00 | 219436.13 | 329651.86 | 6194.00 | 67781.25 | 124462.00 | 234535.00 | 3204024.00 |
total_population_65_plus | 0 | 1.00 | 117131.31 | 161665.55 | 4781.00 | 38146.75 | 66976.00 | 124377.25 | 1525916.00 |
total_female_population | 0 | 1.00 | 425237.98 | 618118.92 | 14816.00 | 135979.25 | 244430.50 | 461586.00 | 5542088.00 |
net_migration | 0 | 1.00 | 2458.73 | 8426.78 | -41321.00 | -719.00 | 492.50 | 3207.75 | 101369.00 |
migration_4_year_change | 2632 | 0.12 | 0.07 | 0.68 | -3.45 | -0.26 | 0.02 | 0.36 | 2.60 |
proportion_one_person_households | 1504 | 0.50 | 27.73 | 4.15 | 12.48 | 25.40 | 27.82 | 30.10 | 46.71 |
share_under_18_with_single_parent | 1504 | 0.50 | 25.58 | 6.94 | 9.44 | 20.94 | 24.61 | 29.21 | 55.63 |
share_veteran_status | 1504 | 0.50 | 10.89 | 3.05 | 2.80 | 8.96 | 10.78 | 12.66 | 23.88 |
share_no_bachelors | 1504 | 0.50 | 69.69 | 10.30 | 23.69 | 64.34 | 71.09 | 77.36 | 88.64 |
medicare_reimbursements_per_enrollee | 0 | 1.00 | 9.36 | 1.25 | 6.03 | 8.44 | 9.22 | 10.08 | 17.95 |
percentage_excessive_drinking | 2632 | 0.12 | 0.19 | 0.03 | 0.11 | 0.17 | 0.19 | 0.21 | 0.29 |
average_Jan_temperature | 0 | 1.00 | 35.36 | 12.85 | -3.24 | 26.32 | 33.71 | 44.11 | 74.70 |
average_summer_temperature | 2 | 1.00 | 74.05 | 6.17 | 55.80 | 69.54 | 73.90 | 78.74 | 93.25 |
total_Jan_precipitation | 0 | 1.00 | 2.94 | 2.51 | 0.00 | 1.33 | 2.53 | 3.59 | 25.32 |
total_annual_precipitation | 0 | 1.00 | 41.03 | 15.36 | 1.25 | 33.52 | 42.03 | 49.96 | 105.19 |
Most variables are not missing any values, but there are many variables that are missing half or more of the values.
We are interested in predicting the number of people experiencing homelessness. So, we will remove the rows where those numbers are missing.
\(\rightarrow\) Remove the rows where total_homeless is NA.
Show Coding Hint
Use the filter function.
Show Answer
df_hud <- df_hud %>%
filter(!is.na(total_homeless))
skim_without_charts(df_hud)
Name | df_hud |
Number of rows | 2994 |
Number of columns | 43 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 42 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
coc_number | 0 | 1 | 6 | 7 | 0 | 376 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
year | 0 | 1.00 | 2013.50 | 2.29 | 2010.00 | 2012.00 | 2013.00 | 2015.00 | 2017.00 |
total_homeless | 0 | 1.00 | 1560.43 | 4313.39 | 7.00 | 320.25 | 679.00 | 1468.00 | 76501.00 |
total_sheltered | 0 | 1.00 | 1033.85 | 3445.81 | 3.00 | 224.00 | 445.50 | 961.50 | 72565.00 |
total_unsheltered | 0 | 1.00 | 526.58 | 1742.31 | 0.00 | 36.25 | 114.00 | 418.25 | 42828.00 |
total_population | 0 | 1.00 | 840371.32 | 1219259.79 | 29344.00 | 267388.75 | 484041.50 | 900588.25 | 11058958.00 |
house_price_index_2009 | 0 | 1.00 | -2.72 | 10.15 | -26.50 | -8.60 | -3.25 | 0.00 | 63.86 |
log_median_rent | 1495 | 0.50 | 1.97 | 0.31 | 1.24 | 1.73 | 1.94 | 2.16 | 2.87 |
utility_costs | 1495 | 0.50 | 14.65 | 3.11 | 5.60 | 12.60 | 14.80 | 16.90 | 22.60 |
percentage_owners_cost_burden_2016 | 0 | 1.00 | 14.87 | 3.04 | 8.07 | 12.48 | 14.39 | 17.07 | 22.96 |
percentage_renters_severe_cost_burden_2016 | 0 | 1.00 | 25.56 | 3.84 | 16.38 | 22.86 | 25.00 | 27.85 | 37.65 |
share_renters_2016 | 0 | 1.00 | 34.55 | 8.25 | 15.10 | 28.55 | 33.53 | 39.73 | 69.22 |
rental_vacancy_rate | 1495 | 0.50 | 7.14 | 2.98 | 1.92 | 5.20 | 6.59 | 8.40 | 24.63 |
high_housing_density | 0 | 1.00 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
number_eviction | 0 | 1.00 | 2513.26 | 4321.83 | 0.00 | 107.50 | 941.50 | 2722.00 | 37105.00 |
share_overcrowded_units_2016 | 0 | 1.00 | 2.71 | 1.95 | 0.66 | 1.51 | 2.05 | 3.06 | 12.26 |
city_or_urban | 0 | 1.00 | 0.28 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
suburban | 0 | 1.00 | 0.43 | 0.49 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
rate_unemployment | 0 | 1.00 | 7.48 | 2.71 | 2.51 | 5.50 | 7.18 | 8.94 | 28.86 |
gini_coefficient_2016 | 0 | 1.00 | 45.53 | 2.85 | 37.56 | 43.70 | 45.42 | 47.01 | 54.78 |
poverty_rate | 0 | 1.00 | 14.42 | 4.47 | 3.35 | 11.43 | 14.32 | 17.31 | 35.79 |
share_HUD_units | 0 | 1.00 | 3.66 | 1.91 | 0.09 | 2.32 | 3.29 | 4.70 | 13.96 |
HUD_unit_occupancy_rate | 0 | 1.00 | 92.60 | 3.99 | 69.44 | 90.77 | 93.01 | 95.15 | 100.00 |
share_built_before_1940 | 1495 | 0.50 | 13.44 | 10.76 | 0.22 | 4.88 | 10.13 | 20.12 | 56.88 |
total_black | 0 | 1.00 | 104005.74 | 202698.30 | 168.00 | 10750.00 | 36833.00 | 110397.25 | 1956449.00 |
total_latino_hispanic | 0 | 1.00 | 142706.42 | 408490.87 | 224.00 | 15101.25 | 36013.00 | 93348.75 | 4924272.00 |
total_asian | 0 | 1.00 | 42564.47 | 117398.02 | 80.00 | 4369.00 | 11316.50 | 30792.50 | 1474987.00 |
total_pacific_islander | 0 | 1.00 | 1394.53 | 5570.97 | 1.00 | 117.00 | 259.50 | 646.50 | 89016.00 |
total_population_0_19 | 0 | 1.00 | 220205.33 | 330188.04 | 6194.00 | 67984.75 | 124752.50 | 235074.50 | 3204024.00 |
total_population_65_plus | 0 | 1.00 | 117515.06 | 161921.65 | 5132.00 | 38351.75 | 67198.50 | 124875.25 | 1525916.00 |
total_female_population | 0 | 1.00 | 426719.90 | 619107.26 | 14816.00 | 136479.75 | 246606.00 | 464249.50 | 5542088.00 |
net_migration | 0 | 1.00 | 2470.68 | 8444.32 | -41321.00 | -720.50 | 501.50 | 3211.50 | 101369.00 |
migration_4_year_change | 2620 | 0.12 | 0.07 | 0.68 | -3.45 | -0.26 | 0.02 | 0.36 | 2.60 |
proportion_one_person_households | 1495 | 0.50 | 27.72 | 4.15 | 12.48 | 25.40 | 27.81 | 30.10 | 46.71 |
share_under_18_with_single_parent | 1495 | 0.50 | 25.58 | 6.95 | 9.44 | 20.93 | 24.61 | 29.21 | 55.63 |
share_veteran_status | 1495 | 0.50 | 10.89 | 3.06 | 2.80 | 8.96 | 10.78 | 12.68 | 23.88 |
share_no_bachelors | 1495 | 0.50 | 69.66 | 10.30 | 23.69 | 64.25 | 71.08 | 77.35 | 88.64 |
medicare_reimbursements_per_enrollee | 0 | 1.00 | 9.37 | 1.25 | 6.03 | 8.45 | 9.22 | 10.08 | 17.95 |
percentage_excessive_drinking | 2620 | 0.12 | 0.19 | 0.03 | 0.11 | 0.17 | 0.19 | 0.21 | 0.29 |
average_Jan_temperature | 0 | 1.00 | 35.37 | 12.87 | -3.24 | 26.32 | 33.71 | 44.17 | 74.70 |
average_summer_temperature | 2 | 1.00 | 74.06 | 6.18 | 55.80 | 69.54 | 73.92 | 78.74 | 93.25 |
total_Jan_precipitation | 0 | 1.00 | 2.94 | 2.51 | 0.00 | 1.33 | 2.54 | 3.59 | 25.32 |
total_annual_precipitation | 0 | 1.00 | 41.06 | 15.33 | 1.25 | 33.56 | 42.03 | 49.99 | 105.19 |
There are some variables that are missing many of their values.
\(\rightarrow\) Produce scatter plots of the variables that are missing many values vs. time to see if data are missing from particular years.
Show Answer
This is one example:
df_hud %>%
ggplot(aes(x = year, y = migration_4_year_change)) +
geom_point()
## Warning: Removed 2620 rows containing missing values (geom_point).
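The same check applies to the other variable that is missing most of its values:
df_hud %>%
ggplot(aes(x = year, y = percentage_excessive_drinking)) +
geom_point()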
percentage_excessive_drinking and migration_4_year_change only have data from 2017. Following the analysis in the HUD report, we will focus on data from 2017.
\(\rightarrow\) Produce a data frame with data only from 2017. Call it df_2017.
Show Answer
df_2017 <- df_hud %>%
filter(year == 2017)
\(\rightarrow\) Check for missing values in the 2017 data.
Show Answer
skim_without_charts(df_2017)
Name | df_2017 |
Number of rows | 374 |
Number of columns | 43 |
_______________________ | |
Column type frequency: | |
character | 1 |
numeric | 42 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
coc_number | 0 | 1 | 6 | 7 | 0 | 374 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
---|---|---|---|---|---|---|---|---|---|
year | 0 | 1 | 2017.00 | 0.00 | 2017.00 | 2017.00 | 2017.00 | 2017.00 | 2017.00 |
total_homeless | 0 | 1 | 1466.07 | 5114.12 | 10.00 | 282.75 | 567.50 | 1203.00 | 76501.00 |
total_sheltered | 0 | 1 | 961.68 | 3921.31 | 5.00 | 189.00 | 373.00 | 850.75 | 72565.00 |
total_unsheltered | 0 | 1 | 504.39 | 2334.30 | 0.00 | 37.00 | 109.50 | 384.00 | 42828.00 |
total_population | 0 | 1 | 862794.07 | 1257003.13 | 29344.00 | 271652.75 | 498140.50 | 924319.75 | 11058958.00 |
house_price_index_2009 | 0 | 1 | 6.90 | 14.68 | -17.53 | -2.61 | 3.12 | 13.51 | 63.86 |
log_median_rent | 0 | 1 | 2.02 | 0.31 | 1.36 | 1.80 | 1.98 | 2.20 | 2.87 |
utility_costs | 0 | 1 | 14.84 | 2.97 | 5.90 | 12.95 | 14.97 | 17.16 | 22.60 |
percentage_owners_cost_burden_2016 | 0 | 1 | 14.88 | 3.04 | 8.07 | 12.49 | 14.40 | 17.09 | 22.96 |
percentage_renters_severe_cost_burden_2016 | 0 | 1 | 25.55 | 3.87 | 16.38 | 22.89 | 25.00 | 27.82 | 37.65 |
share_renters_2016 | 0 | 1 | 34.57 | 8.26 | 15.10 | 28.58 | 33.53 | 39.71 | 69.22 |
rental_vacancy_rate | 0 | 1 | 6.47 | 2.73 | 1.92 | 4.78 | 6.24 | 7.61 | 21.05 |
high_housing_density | 0 | 1 | 0.25 | 0.44 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
number_eviction | 0 | 1 | 2402.04 | 4297.43 | 0.00 | 75.75 | 854.50 | 2441.00 | 36357.00 |
share_overcrowded_units_2016 | 0 | 1 | 2.71 | 1.96 | 0.66 | 1.51 | 2.04 | 3.05 | 12.26 |
city_or_urban | 0 | 1 | 0.28 | 0.45 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
suburban | 0 | 1 | 0.43 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 |
rate_unemployment | 0 | 1 | 5.00 | 1.57 | 2.51 | 4.14 | 4.90 | 5.57 | 23.64 |
gini_coefficient_2016 | 0 | 1 | 45.53 | 2.86 | 37.56 | 43.71 | 45.40 | 47.00 | 54.78 |
poverty_rate | 0 | 1 | 13.41 | 4.14 | 3.50 | 10.50 | 13.23 | 16.08 | 25.56 |
share_HUD_units | 0 | 1 | 3.53 | 1.84 | 0.09 | 2.24 | 3.22 | 4.57 | 11.86 |
HUD_unit_occupancy_rate | 0 | 1 | 92.42 | 4.08 | 73.00 | 90.13 | 93.00 | 95.00 | 100.00 |
share_built_before_1940 | 0 | 1 | 13.04 | 10.52 | 0.26 | 4.58 | 9.91 | 19.21 | 56.01 |
total_black | 0 | 1 | 107580.71 | 207495.53 | 244.00 | 11811.25 | 37467.50 | 113661.50 | 1909362.00 |
total_latino_hispanic | 0 | 1 | 154259.52 | 431612.58 | 344.00 | 17374.50 | 39273.00 | 100622.25 | 4924272.00 |
total_asian | 0 | 1 | 47716.18 | 128225.20 | 106.00 | 5367.00 | 12655.50 | 34762.75 | 1474987.00 |
total_pacific_islander | 0 | 1 | 1512.28 | 5793.14 | 1.00 | 130.00 | 290.50 | 697.25 | 88323.00 |
total_population_0_19 | 0 | 1 | 219133.87 | 329833.51 | 6194.00 | 67203.00 | 125673.00 | 236132.50 | 3204024.00 |
total_population_65_plus | 0 | 1 | 131381.43 | 180215.56 | 6218.00 | 42612.75 | 75493.50 | 138750.00 | 1525916.00 |
total_female_population | 0 | 1 | 437977.63 | 638095.05 | 14823.00 | 137737.75 | 252411.00 | 473421.25 | 5542088.00 |
net_migration | 0 | 1 | 3030.79 | 10140.42 | -41321.00 | -990.25 | 442.00 | 3645.50 | 91270.00 |
migration_4_year_change | 0 | 1 | 0.07 | 0.68 | -3.45 | -0.26 | 0.02 | 0.36 | 2.60 |
proportion_one_person_households | 0 | 1 | 28.00 | 4.17 | 12.48 | 25.54 | 28.18 | 30.34 | 44.69 |
share_under_18_with_single_parent | 0 | 1 | 25.76 | 6.89 | 9.77 | 21.02 | 24.81 | 29.44 | 54.47 |
share_veteran_status | 0 | 1 | 9.94 | 2.86 | 2.80 | 8.16 | 10.07 | 11.28 | 22.27 |
share_no_bachelors | 0 | 1 | 68.89 | 10.45 | 23.69 | 63.02 | 70.41 | 76.62 | 88.37 |
medicare_reimbursements_per_enrollee | 0 | 1 | 9.91 | 1.21 | 6.79 | 9.02 | 9.70 | 10.61 | 14.31 |
percentage_excessive_drinking | 0 | 1 | 0.19 | 0.03 | 0.11 | 0.17 | 0.19 | 0.21 | 0.29 |
average_Jan_temperature | 0 | 1 | 39.10 | 12.07 | 10.98 | 30.30 | 37.48 | 46.48 | 73.50 |
average_summer_temperature | 0 | 1 | 75.16 | 5.87 | 59.55 | 71.27 | 75.23 | 79.67 | 93.25 |
total_Jan_precipitation | 0 | 1 | 4.39 | 3.68 | 0.21 | 2.47 | 3.39 | 4.70 | 25.32 |
total_annual_precipitation | 0 | 1 | 39.43 | 13.69 | 3.06 | 33.84 | 38.58 | 45.04 | 96.34 |
There are no missing values! We will do our analysis on the 2017 data set.
3.4 Exploratory data analysis
We have two main goals when doing exploratory data analysis. The first is that we want to understand the data set more completely. The second goal is to explore relationships between the variables to help guide the modeling process to answer our specific question.
3.4.1 Graphical summaries
3.4.1.1 Look at the distributions of the homeless counts
\(\rightarrow\) Make a histogram of the total number of homeless in 2017.
Show Answer
df_2017 %>%
ggplot(aes(x = total_homeless)) +
geom_histogram(boundary = 0) +
labs(x = "Total homeless", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
It appears that there are some outliers, but it can be hard to see because of the limits of the y-axis. We can limit the y-axis values to get a better view of the small bars.
df_2017 %>%
ggplot(aes(x = total_homeless)) +
geom_histogram(boundary = 0) +
labs(x = "Total homeless", y = "Count") +
coord_cartesian(ylim = c(0, 15))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
There are some large outliers because there is a wide range of populations among the areas represented in the data set. We can see this by making a histogram of total_population.
df_2017 %>%
ggplot(aes(x = total_population)) +
geom_histogram(boundary = 0) +
labs(x = "Total population", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\(\rightarrow\) Make a histogram of the number of sheltered homeless in 2017.
Show Answer
df_2017 %>%
ggplot(aes(x = total_sheltered)) +
geom_histogram(boundary = 0) +
labs(x = "Sheltered homeless", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\(\rightarrow\) Make a histogram of the number of unsheltered homeless in 2017.
Show Answer
df_2017 %>%
ggplot(aes(x = total_unsheltered)) +
geom_histogram(boundary = 0) +
labs(x = "Unsheltered homeless", y = "Count") +
coord_cartesian(ylim = c(0, 15))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Converting the total counts to rates relative to population size should make the outliers much less extreme.
\(\rightarrow\) Use the mutate function to create a new variable rate_homeless that is the total number of homeless per 10,000 people in the population, and make a histogram of rate_homeless.
Show Answer
df_2017 %>%
mutate(rate_homeless = total_homeless/(total_population/10000)) %>%
ggplot(aes(x = rate_homeless)) +
geom_histogram(boundary = 0) +
labs(x = "Homeless rate per 10,000", y = "Count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
\(\rightarrow\) Compare boxplots of rate_homeless and total_homeless to visually assess whether outliers are less extreme in rate_homeless than in total_homeless.
Show Answer
#Boxplot of rate_homeless
p1 <- df_2017 %>%
mutate(rate_homeless = total_homeless/(total_population/10000)) %>%
ggplot() +
geom_boxplot(aes(y = rate_homeless))
#Boxplot of total_homeless
p2 <- df_2017 %>%
ggplot() +
geom_boxplot(aes(y = total_homeless))
#Display as two panels in one graph
grid.arrange(p1, p2, nrow = 1)
There are still points that are outliers in the rates, but they are not as extreme as in the totals.
3.4.1.2 Data processing
We will add rates to the data frame. Following the HUD report, we will produce rates per 10,000 people.
\(\rightarrow\) Use the mutate function to create new variables rate_homeless, rate_sheltered, and rate_unsheltered in the data frame df_2017 that are the counts per 10,000 people in the population.
Show Answer
df_2017 <- df_2017 %>%
mutate(rate_sheltered = total_sheltered/(total_population/10000),
rate_unsheltered = total_unsheltered/(total_population/10000),
rate_homeless = total_homeless/(total_population/10000))
We should note that the demographic variables (race, gender, age) are given as total counts. We will also convert these totals to shares of the total population.
\(\rightarrow\) Use the mutate function to create new demographic variables in the data frame df_2017 that are shares of the total population.
Show Answer
df_2017 <- df_2017 %>%
mutate(percent_black = total_black/total_population,
percent_latino_hispanic = total_latino_hispanic/total_population,
percent_asian = total_asian/total_population,
percent_pacific_islander = total_pacific_islander/total_population,
percent_population_0_19 = total_population_0_19/total_population,
percent_population_65_plus = total_population_65_plus/total_population,
percent_female_population = total_female_population/total_population)
3.4.1.3 Basic summary
\(\rightarrow\) How many people were experiencing homelessness in 2017? How many were sheltered and how many were unsheltered?
Show Answer
df_2017 %>%
select(total_homeless, total_unsheltered, total_sheltered) %>%
colSums()
## total_homeless total_unsheltered total_sheltered
## 548312 188643 359669
\(\rightarrow\) What are the minimum, maximum, mean, and median number of total, sheltered, and unsheltered homeless in 2017?
Show Answer
df_2017 %>%
select(total_homeless, total_unsheltered, total_sheltered) %>%
pivot_longer(cols = c(total_homeless, total_unsheltered, total_sheltered), names_to = "variable", values_to = "count") %>%
group_by(variable) %>%
summarize(min = min(count), max = max(count), mean = round(mean(count),1), median = median(count))
## # A tibble: 3 x 5
## variable min max mean median
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 total_homeless 10 76501 1466. 568.
## 2 total_sheltered 5 72565 962. 373
## 3 total_unsheltered 0 42828 504. 110.
3.4.2 Correlations between numerical variables
\(\rightarrow\) Plot the correlation coefficients between all pairs of numerical variables using the corrplot function.
Show Answer
Try this first with no optional arguments in corrplot and notice the problems.
df_2017 %>%
select_if(is.numeric) %>%
cor(use = "pairwise.complete.obs") %>%
corrplot()
## Warning in cor(., use = "pairwise.complete.obs"): the standard deviation is zero
There are two problems. First, year is only equal to 2017, so it doesn't make sense to compute a correlation coefficient with this variable; year should be removed. Second, the labels are huge, so we can't read the plot.
Try this:
df_2017 %>%
select_if(is.numeric) %>%
select(-"year") %>%
cor(use = "pairwise.complete.obs") %>%
corrplot(tl.cex = 0.5, type = "lower")
There are some variables that are highly correlated with the rate of homeless. Some are obvious, such as the rate of sheltered homeless. Others show that community-level conditions, such as the share of overcrowded units, are correlated with the rate of homeless.
Note the high correlation among subsets of the input variables. We will want to remove redundant variables or make a transformation of the input variables to deal with the correlation before constructing a model.
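As a hedged sketch of one way to flag redundant predictors, we can list pairs of input variables whose correlation exceeds a cutoff (the 0.8 threshold here is an arbitrary choice for illustration):
#Correlation matrix of the numeric variables, excluding year
M_input <- df_2017 %>%
select_if(is.numeric) %>%
select(-year) %>%
cor(use = "pairwise.complete.obs")

#List distinct pairs of variables with |correlation| > 0.8
high_cor <- which(abs(M_input) > 0.8 & upper.tri(M_input), arr.ind = TRUE)
data.frame(var1 = rownames(M_input)[high_cor[, 1]],
var2 = colnames(M_input)[high_cor[, 2]],
cor = round(M_input[high_cor], 2))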
Next, we will find the variables with the highest correlation with rate_homeless. First, create the correlation matrix. Then examine the row for rate_homeless to find its correlation with the other variables.
M <- df_2017 %>%
select_if(is.numeric) %>%
cor(use = "pairwise.complete.obs")
## Warning in cor(., use = "pairwise.complete.obs"): the standard deviation is zero
round(M["rate_homeless",],2)
## year total_homeless
## NA 0.39
## total_sheltered total_unsheltered
## 0.33 0.29
## total_population house_price_index_2009
## 0.05 0.34
## log_median_rent utility_costs
## 0.31 -0.31
## percentage_owners_cost_burden_2016 percentage_renters_severe_cost_burden_2016
## 0.32 0.30
## share_renters_2016 rental_vacancy_rate
## 0.48 -0.08
## high_housing_density number_eviction
## 0.08 0.03
## share_overcrowded_units_2016 city_or_urban
## 0.42 0.19
## suburban rate_unemployment
## -0.12 0.13
## gini_coefficient_2016 poverty_rate
## 0.31 0.13
## share_HUD_units HUD_unit_occupancy_rate
## 0.27 -0.08
## share_built_before_1940 total_black
## 0.09 0.07
## total_latino_hispanic total_asian
## 0.14 0.28
## total_pacific_islander total_population_0_19
## 0.23 0.02
## total_population_65_plus total_female_population
## 0.03 0.05
## net_migration migration_4_year_change
## -0.04 0.03
## proportion_one_person_households share_under_18_with_single_parent
## 0.21 0.11
## share_veteran_status share_no_bachelors
## -0.13 -0.07
## medicare_reimbursements_per_enrollee percentage_excessive_drinking
## 0.11 0.19
## average_Jan_temperature average_summer_temperature
## 0.15 -0.14
## total_Jan_precipitation total_annual_precipitation
## 0.37 -0.03
## rate_sheltered rate_unsheltered
## 0.72 0.83
## rate_homeless percent_black
## 1.00 -0.04
## percent_latino_hispanic percent_asian
## 0.28 0.25
## percent_pacific_islander percent_population_0_19
## 0.22 -0.20
## percent_population_65_plus percent_female_population
## 0.00 -0.08
Find the variables where the absolute value of the correlation with rate_homeless is greater than 0.3.
"rate_homeless",abs(M["rate_homeless",]) > 0.3 & is.na(M["rate_homeless",]) == FALSE] %>%
M[abs() %>%
round(2) %>%
sort(decreasing = T)
## rate_homeless rate_unsheltered rate_sheltered
## 1.00 0.83 0.72
## share_renters_2016 share_overcrowded_units_2016 total_homeless
## 0.48 0.42 0.39
## total_Jan_precipitation house_price_index_2009 total_sheltered
## 0.37 0.34 0.33
## percentage_owners_cost_burden_2016 log_median_rent utility_costs
## 0.32 0.31 0.31
## gini_coefficient_2016
## 0.31
\(\rightarrow\) Make a pairs plot with a subset of the variables with the highest magnitude correlations. Select a small enough group so that you can see the panels. Do not include variables that you will not use as predictors (e.g. rate_unsheltered, any demographic totals).
Show Answer
df_2017 %>%
filter(coc_number != "CA-613") %>% #Exclude CoC CA-613; this filter must come before select(), which drops coc_number
select("rate_homeless", "share_renters_2016", "share_overcrowded_units_2016", "total_Jan_precipitation", "house_price_index_2009", "percentage_owners_cost_burden_2016", "log_median_rent", "utility_costs", "gini_coefficient_2016") %>%
ggpairs(progress = FALSE, lower = list(continuous = "cor"), upper = list(continuous = "points"))
\(\rightarrow\) Create additional plots to further understand the data set. Describe your findings.
3.5 Model
We will use multiple approaches to construct models that predict rate_homeless:
- Use statistical significance to create a multiple linear regression model.
- Best subset selection for a multiple linear regression model.
- Lasso
- Ridge regression
- XGBoost
To compare the different approaches, we will use a training and testing split of the data set.
3.5.1 Set up the data set for training and testing
3.5.1.1 Remove some variables
There are several variables that we do not want to include as predictors. We want to remove demographic totals, the year, the CoC number, and the other homeless rates that we are not predicting. You may have additional variables to remove to create the data set that contains only the response variable and the predictors that you want to use.
= c("total_homeless", "total_sheltered", "total_unsheltered", "total_black", "total_latino_hispanic", "total_asian", "total_pacific_islander", "total_population_0_19", "total_population_65_plus", "total_female_population", "year", "coc_number", "total_population", "rate_unsheltered", "rate_sheltered")
variable_remove
<- df_2017 %>%
df_model select(-all_of(variable_remove))
names(df_model)
## [1] "house_price_index_2009" "log_median_rent"
## [3] "utility_costs" "percentage_owners_cost_burden_2016"
## [5] "percentage_renters_severe_cost_burden_2016" "share_renters_2016"
## [7] "rental_vacancy_rate" "high_housing_density"
## [9] "number_eviction" "share_overcrowded_units_2016"
## [11] "city_or_urban" "suburban"
## [13] "rate_unemployment" "gini_coefficient_2016"
## [15] "poverty_rate" "share_HUD_units"
## [17] "HUD_unit_occupancy_rate" "share_built_before_1940"
## [19] "net_migration" "migration_4_year_change"
## [21] "proportion_one_person_households" "share_under_18_with_single_parent"
## [23] "share_veteran_status" "share_no_bachelors"
## [25] "medicare_reimbursements_per_enrollee" "percentage_excessive_drinking"
## [27] "average_Jan_temperature" "average_summer_temperature"
## [29] "total_Jan_precipitation" "total_annual_precipitation"
## [31] "rate_homeless" "percent_black"
## [33] "percent_latino_hispanic" "percent_asian"
## [35] "percent_pacific_islander" "percent_population_0_19"
## [37] "percent_population_65_plus" "percent_female_population"
\(\rightarrow\) How many input variables remain in the data set that you can use as predictors?
Show Answer
ncol(df_model) - 1
## [1] 37
There are 37 input variables.
3.5.1.2 Get train and test splits
We will split the data into training and testing sets, with 80% of the data kept for training.
#Do the split. Keep 80% for training. Use stratified sampling based on rate_homeless
split <- initial_split(df_model, prop = 0.8, strata = rate_homeless)

#Extract the training and testing splits
df_train <- training(split)
df_test <- testing(split)
\(\rightarrow\) Check the sizes of the training and testing splits
Show Answer
dim(df_train)
## [1] 302 38
dim(df_test)
## [1] 72 38
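As a quick sanity check, the training fraction is close to the requested 80%: 302/374 is about 0.81.
#Fraction of the observations kept for training
nrow(df_train) / nrow(df_model)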
3.5.2 Full regression model
3.5.2.1 Fit the model on the training data
\(\rightarrow\) Use the training data to fit a multiple linear regression model to predict rate_homeless using all possible predictor variables.
Show Coding Hint
Use the lm function:
fit <- lm()
Show Answer
fit <- lm(rate_homeless ~ ., df_train)
\(\rightarrow\) Examine the summary of the fit. Which predictors have statistically significant coefficients? Do the signs of the coefficients make sense?
Show Answer
summary(fit)
##
## Call:
## lm(formula = rate_homeless ~ ., data = df_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.981 -5.956 -0.068 4.266 80.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.914e+01 7.899e+01 -0.749 0.454667
## house_price_index_2009 7.668e-02 7.332e-02 1.046 0.296546
## log_median_rent 2.706e+01 8.964e+00 3.019 0.002784 **
## utility_costs -2.091e-01 3.730e-01 -0.561 0.575550
## percentage_owners_cost_burden_2016 -5.521e-01 5.500e-01 -1.004 0.316367
## percentage_renters_severe_cost_burden_2016 3.088e-01 2.990e-01 1.033 0.302582
## share_renters_2016 1.536e-01 2.349e-01 0.654 0.513694
## rental_vacancy_rate 1.548e-01 3.553e-01 0.436 0.663460
## high_housing_density -2.618e+00 2.370e+00 -1.105 0.270261
## number_eviction 2.676e-04 1.899e-04 1.410 0.159830
## share_overcrowded_units_2016 2.621e+00 1.066e+00 2.460 0.014531 *
## city_or_urban -2.508e+00 2.639e+00 -0.950 0.342781
## suburban -5.140e+00 2.343e+00 -2.194 0.029137 *
## rate_unemployment 1.539e+00 6.298e-01 2.443 0.015205 *
## gini_coefficient_2016 2.660e-01 5.255e-01 0.506 0.613177
## poverty_rate -8.865e-01 5.705e-01 -1.554 0.121397
## share_HUD_units 2.658e+00 7.783e-01 3.415 0.000738 ***
## HUD_unit_occupancy_rate -7.112e-02 1.795e-01 -0.396 0.692342
## share_built_before_1940 -2.208e-02 1.339e-01 -0.165 0.869182
## net_migration -1.550e-04 1.036e-04 -1.496 0.135787
## migration_4_year_change 4.815e+00 1.322e+00 3.643 0.000324 ***
## proportion_one_person_households 1.300e+00 4.442e-01 2.926 0.003730 **
## share_under_18_with_single_parent 2.347e-01 3.454e-01 0.679 0.497490
## share_veteran_status -5.586e-02 4.467e-01 -0.125 0.900584
## share_no_bachelors 1.488e-01 2.365e-01 0.629 0.529745
## medicare_reimbursements_per_enrollee -6.791e-01 9.436e-01 -0.720 0.472362
## percentage_excessive_drinking 4.008e+01 3.784e+01 1.059 0.290423
## average_Jan_temperature 1.934e-01 1.874e-01 1.032 0.302961
## average_summer_temperature -4.087e-03 3.153e-01 -0.013 0.989670
## total_Jan_precipitation 8.756e-01 2.614e-01 3.349 0.000928 ***
## total_annual_precipitation 4.696e-02 7.222e-02 0.650 0.516097
## percent_black -3.397e+01 1.578e+01 -2.153 0.032261 *
## percent_latino_hispanic -2.100e+01 1.608e+01 -1.306 0.192722
## percent_asian -5.199e+01 2.506e+01 -2.075 0.038964 *
## percent_pacific_islander 1.929e+02 1.375e+02 1.403 0.161796
## percent_population_0_19 2.941e+01 6.861e+01 0.429 0.668517
## percent_population_65_plus -4.816e+00 5.136e+01 -0.094 0.925365
## percent_female_population -1.211e+02 1.175e+02 -1.031 0.303538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.18 on 264 degrees of freedom
## Multiple R-squared: 0.5918, Adjusted R-squared: 0.5345
## F-statistic: 10.34 on 37 and 264 DF, p-value: < 2.2e-16
Isolate the predictors that have statistically significant coefficients.
#Name the summary
s <- summary(fit)

#Use logical indexing on the matrix of coefficients to find coefficients where the p-value < 0.05
s$coefficients[s$coefficients[,4] < 0.05, 1] %>%
round(2) %>% #Round because we don't need to look at a million decimal places
data.frame() #Create a data frame because it produces a nicer display
## .
## log_median_rent 27.06
## share_overcrowded_units_2016 2.62
## suburban -5.14
## rate_unemployment 1.54
## share_HUD_units 2.66
## migration_4_year_change 4.81
## proportion_one_person_households 1.30
## total_Jan_precipitation 0.88
## percent_black -33.97
## percent_asian -51.99
We should also do a qualitative assessment of the fit to the training data.
\(\rightarrow\) Plot the residuals to look for systematic patterns of residuals that might suggest a new model.
Show Answer
plot(fit,1)
The residual plot has a systematic pattern. How might you use this to alter the model? This is something to explore in the Additional Step.
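One possible follow-up (a hedged sketch, not the model used in the rest of this section) is to refit with a log-transformed response, since rate_homeless is strictly positive and right-skewed, and then re-examine the residuals:
#Refit with the log of the response and check the residual plot again
fit_log <- lm(log(rate_homeless) ~ ., df_train)
plot(fit_log, 1)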
3.5.2.2 Assess the model on the testing data
\(\rightarrow\) Use the model to predict the homeless rate in the testing data.
Coding Hint
Use the function predict to make the predictions.
Show Answer
lm_pred <- predict(fit, df_test)
\(\rightarrow\) Make a scatter plot to compare the actual value and the predicted value of rate_homeless.
Show Answer
plot(df_test$rate_homeless, lm_pred, xlab = "Measured homeless rate", ylab = "Predicted homeless rate", pch = 20)
\(\rightarrow\) Compute the RMSE
Show Answer
lm_rmse <- sqrt(mean((df_test$rate_homeless - lm_pred)^2))

lm_rmse
## [1] 9.137479
Or, use pipes if you prefer:
lm_rmse <- (df_test$rate_homeless - lm_pred)^2 %>%
mean() %>%
sqrt()
3.5.3 Subset selection
The full model contains too many predictors, so we will use subset selection on the training data to find a smaller number of predictors.
\(\rightarrow\) Get the number of variables in the data set that will be used as predictors. This will be used to set the subset size.
Show Answer
num_var <- ncol(df_train) - 1
\(\rightarrow\) Do best subset selection on the training data. Set the maximum number of variables to equal the number of predictor variables in the data set.
Show Answer
regfit_full <- regsubsets(rate_homeless ~ ., data = df_train, nvmax = num_var)
\(\rightarrow\) Get the summary. We will use this after performing cross validation to determine the best model.
Show Answer
reg_summary <- summary(regfit_full)
3.5.3.1 Cross validation
\(\rightarrow\) Use 10-fold cross validation on the training data to determine the best subset of predictors in the model.
Show Answer
#Number of observations
n <- nrow(df_train)

#Number of folds
k <- 10

#Folds
set.seed(1)
#Each observation is randomly assigned to a fold 1, 2, ..., k
folds <- sample(1:k, n, replace = TRUE)

#Initialize the matrix of errors on the held-out folds
cv_errors <- matrix(NA, k, num_var, dimnames = list(NULL, paste(1:num_var)))

#For each testing fold
for (j in 1:k){
  #Best subsets on the training folds
  reg_fit_best <- regsubsets(rate_homeless ~ ., data = df_train[folds != j,], nvmax = num_var)

  #Model matrix built from the test fold
  test_mat <- model.matrix(rate_homeless ~ ., data = df_train[folds == j,])

  #Compare errors across models with different numbers of variables
  for (i in 1:num_var){
    #Get the coefficients of the best model with i variables
    coefi <- coef(reg_fit_best, id = i)
    #Make predictions using the test fold
    pred <- test_mat[, names(coefi)] %*% coefi
    #Compute the test MSE
    cv_errors[j, i] <- mean((df_train$rate_homeless[folds == j] - pred)^2)
  }
}

#Average error over folds
cv_errors_mean <- apply(cv_errors, 2, mean)
\(\rightarrow\) Plot the assessment measures vs. the number of predictors
Show Answer
par(mfrow = c(2,2))

ind_cp <- which.min(reg_summary$cp)
plot(reg_summary$cp, type = "b", xlab = "Number of variables", ylab = "Cp", main = toString(ind_cp))
points(ind_cp, reg_summary$cp[ind_cp], col = "red", pch = 20)

ind_bic <- which.min(reg_summary$bic)
plot(reg_summary$bic, type = "b", xlab = "Number of variables", ylab = "BIC", main = toString(ind_bic))
points(ind_bic, reg_summary$bic[ind_bic], col = "red", pch = 20)

ind_adjr2 <- which.max(reg_summary$adjr2)
plot(reg_summary$adjr2, type = "b", xlab = "Number of variables", ylab = "Adjusted R2", main = toString(ind_adjr2))
points(ind_adjr2, reg_summary$adjr2[ind_adjr2], col = "red", pch = 20)

ind_cv <- which.min(cv_errors_mean)
plot(cv_errors_mean, type = "b", xlab = "Number of variables", ylab = "Cross validation", main = toString(ind_cv))
points(ind_cv, cv_errors_mean[ind_cv], col = "red", pch = 20)
\(\rightarrow\) What are the coefficients of the best model according to different measures?
Show Answer
Try each of ind_cp, ind_bic, ind_adjr2, and ind_cv. We show ind_cp and ind_bic below.
coef(regfit_full,ind_cp) %>%
round(2) %>%
data.frame()
## .
## (Intercept) -105.88
## log_median_rent 22.92
## high_housing_density -3.29
## share_overcrowded_units_2016 3.49
## suburban -4.32
## rate_unemployment 1.19
## share_HUD_units 2.54
## migration_4_year_change 3.52
## proportion_one_person_households 1.34
## percentage_excessive_drinking 69.85
## average_Jan_temperature 0.19
## total_Jan_precipitation 0.96
## percent_black -30.42
## percent_latino_hispanic -23.65
## percent_asian -53.64
## percent_pacific_islander 214.50
coef(regfit_full,ind_bic) %>%
round(2) %>%
data.frame()
## .
## (Intercept) -89.27
## log_median_rent 13.31
## high_housing_density -6.89
## share_overcrowded_units_2016 3.37
## share_HUD_units 1.73
## migration_4_year_change 4.12
## proportion_one_person_households 1.56
## percentage_excessive_drinking 84.14
## total_Jan_precipitation 1.13
## percent_pacific_islander 216.87
3.5.3.2 Assess the performance of the model on the testing data
Use the model to predict the homeless rate in the testing data. We will use the best BIC model as an example, but you should compare the performance of all models.
\(\rightarrow\) Select the variables that are indicated by the best subset selection criterion that you choose (Cp, BIC, Adjusted R-squared, or CV) and produce a new, smaller data frame.
Show Answer
df_bic <- df_train %>%
  select(all_of(names(coef(regfit_full, ind_bic))[2:length(coef(regfit_full, ind_bic))]), "rate_homeless")

head(df_bic)
## # A tibble: 6 x 10
## log_median_rent high_housing_dens… share_overcrowded_un… share_HUD_units migration_4_year_c… proportion_one_person…
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.44 0 4.12 3.45 -0.849 25.0
## 2 1.82 0 1.54 3.74 0.419 29.8
## 3 1.44 0 1.87 3.65 0.191 27.3
## 4 1.81 0 2.07 5.30 -0.0421 27.6
## 5 1.81 0 1.94 4.18 -0.237 30.8
## 6 1.78 0 3.35 1.46 0.434 24.7
## # … with 4 more variables: percentage_excessive_drinking <dbl>, total_Jan_precipitation <dbl>,
## # percent_pacific_islander <dbl>, rate_homeless <dbl>
\(\rightarrow\) Fit the model using this subset. The lm object will help us to make predictions easily.
Show Answer
fit_bic <- lm(rate_homeless ~ ., data = df_bic)

summary(fit_bic)
##
## Call:
## lm(formula = rate_homeless ~ ., data = df_bic)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34.177 -5.609 -0.538 4.155 84.739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -89.2710 8.7324 -10.223 < 2e-16 ***
## log_median_rent 13.3127 2.9526 4.509 9.45e-06 ***
## high_housing_density -6.8941 1.9062 -3.617 0.000352 ***
## share_overcrowded_units_2016 3.3692 0.4260 7.908 5.43e-14 ***
## share_HUD_units 1.7343 0.4661 3.721 0.000238 ***
## migration_4_year_change 4.1238 1.0682 3.861 0.000139 ***
## proportion_one_person_households 1.5551 0.2212 7.031 1.46e-11 ***
## percentage_excessive_drinking 84.1390 27.3821 3.073 0.002321 **
## total_Jan_precipitation 1.1297 0.2026 5.575 5.63e-08 ***
## percent_pacific_islander 216.8700 84.6951 2.561 0.010952 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.33 on 292 degrees of freedom
## Multiple R-squared: 0.5363, Adjusted R-squared: 0.522
## F-statistic: 37.52 on 9 and 292 DF, p-value: < 2.2e-16
Are all of the coefficients statistically significant?
\(\rightarrow\) Generate predictions of the homeless rate in the testing data.
Show Answer
bic_pred <- predict(fit_bic, df_test)
\(\rightarrow\) Make a scatter plot to compare the actual value and the predicted value of rate_homeless.
Show Answer
plot(df_test$rate_homeless, bic_pred, xlab = "Measured homeless rate", ylab = "Predicted homeless rate", pch = 20)
\(\rightarrow\) Compute the RMSE. How does the RMSE compare to the error for the full model?
Show Answer
bic_rmse <- sqrt(mean((df_test$rate_homeless - bic_pred)^2))

bic_rmse
## [1] 8.359051
3.5.4 Lasso
The lasso is another approach to producing a linear regression model with a subset of the available predictors. The lasso works by finding coefficients \(\boldsymbol{\beta}\) that minimize the cost function:
\[C(\boldsymbol{\beta}) = \sum_{i=1}^n(y_i - \beta_{0} - \sum_{j=1}^p\beta_{j}x_{ij})^2 + \lambda \sum_{j=1}^p|\beta_{j}| = \text{RSS} + \lambda \sum_{j=1}^p|\beta_{j}|\]
where \(\lambda \geq 0\) is a tuning parameter (or hyperparameter).
\(\rightarrow\) Prepare the data by creating a model matrix x_train from the training data. Create the model matrix using model.matrix so that it includes the training data for all predictors, but does not include a column for the intercept. Also create the training response data y_train.
Show Answer
x_train <- model.matrix(rate_homeless ~ ., df_train)[,-1] #Remove the intercept
y_train <- df_train$rate_homeless
\(\rightarrow\) Use cross-validation to find the best hyperparameter \(\lambda\)
Show Answer
#alpha = 1 says that we will use the lasso
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
\(\rightarrow\) Show error as a function of the hyperparameter \(\lambda\) and its best value.
Show Answer
plot(cv_lasso)
best_lam <- cv_lasso$lambda.min
best_lam
## [1] 0.3594396
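Note that cv.glmnet also records lambda.1se, the largest \(\lambda\) whose cross-validation error is within one standard error of the minimum; choosing it yields a sparser model. As an aside:
#A more conservative choice of lambda that favors a smaller model
cv_lasso$lambda.1se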
\(\rightarrow\) Fit the lasso with the best \(\lambda\) using the function glmnet.
Show Answer
lasso_mod <- glmnet(x_train, y_train, alpha = 1, lambda = best_lam)
\(\rightarrow\) Examine the coefficients. Which variables have non-zero coefficients?
Show Answer
round(coef(lasso_mod),3)
## 38 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -33.482
## house_price_index_2009 0.066
## log_median_rent 12.173
## utility_costs -0.118
## percentage_owners_cost_burden_2016 .
## percentage_renters_severe_cost_burden_2016 0.050
## share_renters_2016 0.029
## rental_vacancy_rate .
## high_housing_density -2.930
## number_eviction 0.000
## share_overcrowded_units_2016 2.530
## city_or_urban .
## suburban -2.023
## rate_unemployment 0.482
## gini_coefficient_2016 0.013
## poverty_rate .
## share_HUD_units 2.015
## HUD_unit_occupancy_rate .
## share_built_before_1940 .
## net_migration .
## migration_4_year_change 2.874
## proportion_one_person_households 1.002
## share_under_18_with_single_parent .
## share_veteran_status 0.104
## share_no_bachelors .
## medicare_reimbursements_per_enrollee .
## percentage_excessive_drinking 56.857
## average_Jan_temperature 0.034
## average_summer_temperature .
## total_Jan_precipitation 0.974
## total_annual_precipitation 0.018
## percent_black -9.328
## percent_latino_hispanic .
## percent_asian -17.782
## percent_pacific_islander 213.469
## percent_population_0_19 -49.639
## percent_population_65_plus .
## percent_female_population -46.932
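To see how predictors enter the model as \(\lambda\) decreases, you can plot the full coefficient path; an optional sketch using glmnet's built-in plot method:
#Fit the lasso over the default sequence of lambda values and plot the coefficient paths
lasso_path <- glmnet(x_train, y_train, alpha = 1)
plot(lasso_path, xvar = "lambda", label = TRUE)

#Mark the cross-validated choice of lambda
abline(v = log(best_lam), lty = 2)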
3.5.4.1 Look at prediction error
\(\rightarrow\) Use the model to predict the homeless rate in the testing data.
Show Answer
First create a model matrix with the testing data
x_test <- model.matrix(rate_homeless ~ ., df_test)[,-1] #Remove the intercept
Use the predict function to make the predictions of the homeless rate in the testing data.
lasso_pred <- predict(lasso_mod, s = best_lam, newx = x_test)
\(\rightarrow\) Make a scatter plot to compare the actual value and the predicted value of rate_homeless.
Show Answer
plot(df_test$rate_homeless, lasso_pred, xlab = "Measured homeless rate", ylab = "Predicted homeless rate", pch = 20)
\(\rightarrow\) Compute the RMSE. How does it compare to the other models?
Show Answer
lasso_rmse <- sqrt(mean((df_test$rate_homeless - lasso_pred)^2))
lasso_rmse
## [1] 8.029321
3.5.5 Ridge regression
Ridge regression is another approach to model building. It works by finding coefficients \(\boldsymbol{\beta}\) that minimize the cost function:
\[C(\boldsymbol{\beta}) = \sum_{i=1}^n(y_i - \beta_{0} - \sum_{j=1}^p\beta_{j}x_{ij})^2 + \lambda \sum_{j=1}^p\beta_{j}^2 = \text{RSS} + \lambda \sum_{j=1}^p\beta_{j}^2\]
In contrast to the lasso, ridge regression will not reduce the number of non-zero coefficients in the model. Ridge regression will shrink the coefficients, which helps to prevent overfitting of the training data.
The fitting procedure for the ridge regression model mirrors the lasso approach, only changing the parameter \(\alpha\) to 0 in the cv.glmnet
and glmnet
functions.
\(\rightarrow\) Fit and assess a ridge regression model. How does it compare to the other models?
Show Answer
First, do cross-validation to find the best value of \(\lambda\).
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)

best_lam <- cv_ridge$lambda.min
Fit the model with the best value of \(\lambda\).
ridge_mod <- glmnet(x_train, y_train, alpha = 0, lambda = best_lam)
Predict the homeless rate in the test data.
ridge_pred <- predict(ridge_mod, s = best_lam, newx = x_test)
Compute the RMSE
ridge_rmse <- sqrt(mean((df_test$rate_homeless - ridge_pred)^2))
ridge_rmse
## [1] 8.221767
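As a quick check of the contrast with the lasso (a sketch, assuming the lasso_mod and ridge_mod objects fit above are in the workspace): ridge shrinks coefficients toward zero but typically does not set any of them exactly to zero.
#Count the non-zero coefficients, excluding the intercept
sum(coef(lasso_mod)[-1] != 0)
sum(coef(ridge_mod)[-1] != 0) #Ridge retains all of the predictors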
3.5.6 XGBoost
XGBoost is short for eXtreme Gradient Boosting.
We are going to use the tidymodels
package to fit the XGBoost model.
3.5.6.1 Set up the model
The model will be a boosted tree model, so we start by specifying the features of a boost_tree model. The boost_tree function creates a specification of a model, but does not fit the model.
xgb_spec <- boost_tree(
  mode = "regression", #We are solving a regression problem
  trees = 1000,
  tree_depth = tune(), #tune() says that we will specify this parameter later
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
) %>%
  set_engine("xgboost", objective = "reg:squarederror") #We will use xgboost to fit the model and try to minimize the squared error

xgb_spec
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## objective = reg:squarederror
##
## Computational engine: xgboost
Create a workflow that specifies the model formula and the model type. We are still setting up the model; this does not fit the model.
xgb_wf <- workflow() %>%
  add_formula(rate_homeless ~ .) %>%
  add_model(xgb_spec)

xgb_wf
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────
## rate_homeless ~ .
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = tune()
## trees = 1000
## min_n = tune()
## tree_depth = tune()
## learn_rate = tune()
## loss_reduction = tune()
## sample_size = tune()
##
## Engine-Specific Arguments:
## objective = reg:squarederror
##
## Computational engine: xgboost
3.5.6.2 Fit the model
We need to tune all of the parameters that we specified as tune(). We will specify the parameter grid using the function grid_latin_hypercube:
xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),
  finalize(mtry(), df_train),
  learn_rate(),
  size = 30 #Create 30 sets of the 6 parameters
)
Create folds for cross-validation.
#To stratify the folds by the response instead, use: folds <- vfold_cv(df_train, strata = rate_homeless)
folds <- vfold_cv(df_train)
Do the parameter fitting. This will take some time.
xgb_res <- tune_grid(
  xgb_wf, #The workflow
  resamples = folds, #The training data split into folds
  grid = xgb_grid, #The grid of parameters to fit
  control = control_grid(save_pred = TRUE)
)
## ! Fold01: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold02: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold03: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold04: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold05: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold06: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold07: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold08: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold09: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
## ! Fold10: internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, res...
xgb_res
## Warning: This tuning result has notes. Example notes on model fitting include:
## internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## internal: A correlation computation is required, but `estimate` is constant and has 0 standard deviation, resulting in a divide by 0 error. `NA` will be returned.
## # Tuning results
## # 10-fold cross-validation
## # A tibble: 10 x 5
## splits id .metrics .notes .predictions
## <list> <chr> <list> <list> <list>
## 1 <split [271/31]> Fold01 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [930 × 10]>
## 2 <split [271/31]> Fold02 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [930 × 10]>
## 3 <split [272/30]> Fold03 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 4 <split [272/30]> Fold04 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 5 <split [272/30]> Fold05 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 6 <split [272/30]> Fold06 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 7 <split [272/30]> Fold07 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 8 <split [272/30]> Fold08 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 9 <split [272/30]> Fold09 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
## 10 <split [272/30]> Fold10 <tibble [60 × 10]> <tibble [1 × 1]> <tibble [900 × 10]>
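Before finalizing the workflow, it can be useful to inspect the best parameter combinations found during tuning; an optional sketch using helpers from the tune package:
#Show the top parameter sets according to RMSE
show_best(xgb_res, "rmse")

#Visualize performance across the whole tuning grid
autoplot(xgb_res)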
Set up the final workflow with the best model parameters.
#Get the best model, according to RMSE
best_rmse <- select_best(xgb_res, "rmse")

#Update the workflow with the best parameters
final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)
final_xgb
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Formula
## Model: boost_tree()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────────────────────
## rate_homeless ~ .
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Boosted Tree Model Specification (regression)
##
## Main Arguments:
## mtry = 17
## trees = 1000
## min_n = 29
## tree_depth = 8
## learn_rate = 0.0156478729445193
## loss_reduction = 8.97021164557671e-05
## sample_size = 0.552171032566478
##
## Engine-Specific Arguments:
## objective = reg:squarederror
##
## Computational engine: xgboost
3.5.6.3 Prediction
Fit the final model to the training data.
xgb_fit <- final_xgb %>%
  fit(df_train)
Generate predictions and bind them together with the true values from the test data.
predictions_xgb <- xgb_fit %>%
  predict(new_data = df_test) %>%
  bind_cols(df_test %>% select(rate_homeless))
Show the RMSE. Compare the result to the test RMSE for the other models.
predictions_xgb %>%
  metrics(rate_homeless, .pred)
## # A tibble: 3 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 8.86
## 2 rsq standard 0.467
## 3 mae standard 6.79
Plot the measured homeless rates against the predictions.
plot(df_test$rate_homeless, predictions_xgb$.pred, pch = 20, xlab = "Homeless rate", ylab = "Predicted homeless rate")
3.5.6.4 Relative importance of predictors
Look at which predictors are most important in the model
xgb_fit %>%
  pull_workflow_fit() %>%
  vip(geom = "col")
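If you prefer the importance scores as a table rather than a plot, the vip package also provides vi(), which dispatches on the fitted model the same way vip() does; a quick aside:
#Variable importance scores as a tibble
xgb_fit %>%
  pull_workflow_fit() %>%
  vi()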
3.5.7 Compare models
You used several methods to construct a model:
- Multiple linear regression using statistically significant predictors
- Best subset selection for a multiple linear regression model
- Lasso
- Ridge regression
- XGBoost
Compare the performance of the models.
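One way to organize the comparison is to collect the test RMSEs into a single table; a sketch, assuming the RMSE objects computed above are still in the workspace (xgb_rmse is a hypothetical name extracted here from the metrics table):
#Extract the XGBoost test RMSE from the metrics table
xgb_rmse <- predictions_xgb %>%
  metrics(rate_homeless, .pred) %>%
  filter(.metric == "rmse") %>%
  pull(.estimate)

#Collect the test RMSEs into one table
data.frame(
  model = c("Full linear model", "Best subset (BIC)", "Lasso", "Ridge", "XGBoost"),
  test_RMSE = c(lm_rmse, bic_rmse, lasso_rmse, ridge_rmse, xgb_rmse)
)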
3.6 Additional step
In addition to completing the above analyses, you should perform any revisions to the models that you think might improve their performance.
3.7 Conclusion
After completing your analyses, you will make your conclusions and communicate your results. Consult Canvas for further directions.