6 Preprocessing
Preprocessing your data is far and away one of the most critical steps to any data mining pipeline. In an academic learning environment you’re often given datasets that are free of errors and have all the variables you need scaled and ready to go. In reality data are rarely this way. Most data scientists will tell you that 80% of their work is the various preprocessing steps.
What is preprocessing exactly? I’d say it can be lumped into the following general tasks:
identifying and correcting erroneous values
removing special characters and converting datatypes
filtering data to contain only the values of interest
transforming data to ensure your model runs correctly
encoding categorical variables
creating useful features
I want to stress that there isn’t a strict formula for this process… it requires you to explore your data, fix things, explore again, fix things, and so on. Every dataset is different and thus will require different combinations of techniques. Thus, the goal of this lesson is to expose you to some of these techniques and the ideas behind them, not necessarily provide a rigid structure for you to do this.
6.1 About our AirBnB data
This lesson is going to work with real data from AirBnB. They actually provide detailed listing information on their website. The raw data contained over 100 columns, but we’re only going to work with 19. The goal will be to prepare our features so that we can predict two targets within the data - rental price and rental rating. With that being said, let’s jump in!
6.2 Digging into preprocessing
6.2.1 Importing our data
Let’s load up our packages
library(tidyverse)
Bring in our data
<- read_csv("https://docs.google.com/spreadsheets/d/1q2RkNoLxyminnnUwBxqhII8bZM1OfeDXsNCGbHZW_kc/gviz/tq?tqx=out:csv") air
## Parsed with column specification:
## cols(
## .default = col_character(),
## minimum_nights = col_double(),
## maximum_nights = col_double(),
## square_feet = col_double(),
## bedrooms = col_double(),
## bathrooms = col_double(),
## accommodates = col_double(),
## host_identity_verified = col_logical(),
## host_is_superhost = col_logical(),
## review_scores_rating = col_double(),
## number_of_reviews = col_double()
## )
## See spec(...) for full column specifications.
6.3 Exploring the data set as a whole
A quick glimpse()
of our data shows a bunch of columns. Right off the bat I see the following:
I see our two features that we want to use later:
price
andnumber_of_reviews
.Notice how
price
,cleaning_fee
, andsecurity_deposit
are character strings because they have a $ sign in the data… we’ll have to remove those and convert to numeric before we can use the actual values within! Same with the % inhost_response_time
I see a bunch of character columns (e.g.
bed_type
) that we need to look at individually and see what values are in there. We will also need to remove spaces from withinNA
values that we need to explore
glimpse(air)
## Rows: 30,218
## Columns: 21
## $ price <chr> "$50.00", "$117.00", "$80.00", "$150.00", "$...
## $ cleaning_fee <chr> "$15.00", "$35.00", "$0.00", "$95.00", "$50....
## $ security_deposit <chr> "$0.00", "$0.00", "$0.00", NA, "$200.00", "$...
## $ cancellation_policy <chr> "strict_14_with_grace_period", "moderate", "...
## $ minimum_nights <dbl> 2, 2, 2, 4, 2, 4, 2, 89, 32, 1, 32, 2, 3, 3,...
## $ maximum_nights <dbl> 89, 60, 60, 60, 365, 120, 7, 1000, 395, 730,...
## $ square_feet <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, 1000, NA...
## $ bedrooms <dbl> 1, 3, 1, 1, 1, 2, 1, 2, 0, 2, 2, 3, 1, 1, 2,...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0,...
## $ accommodates <dbl> 1, 7, 2, 4, 2, 4, 2, 4, 3, 7, 6, 6, 2, 2, 6,...
## $ bed_type <chr> "Real Bed", "Real Bed", "Futon", "Real Bed",...
## $ room_type <chr> "Private room", "Entire home/apt", "Entire h...
## $ property_type <chr> "Condominium", "Apartment", "Apartment", "Ap...
## $ city <chr> "Chicago", "Chicago", "Chicago", "Chicago", ...
## $ neighbourhood_cleansed <chr> "Hyde Park", "South Lawndale", "West Town", ...
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ host_response_time <chr> "within an hour", "within an hour", "within ...
## $ host_response_rate <chr> "100%", "100%", "100%", "89%", "93%", "89%",...
## $ review_scores_rating <dbl> 100, 96, 92, 92, 85, 89, 100, 93, 85, 86, 86...
## $ number_of_reviews <dbl> 149, 368, 338, 35, 38, 9, 4, 9, 37, 178, 44,...
Let’s look at a summary()
to understand our numeric data better. Doing this the following jumps out at me:
We are missing over 4000
review_scores_ratings
and much less frombedrooms
andbathrooms
. We can probably impute these missing values.We are missing over 29000
square_feet
measurements, so this is probably useless and should be dropped.minimum_nights
andmaximum_nights
have extreme max values. Those need to be changed to something more reasonable.
summary(air)
## price cleaning_fee security_deposit cancellation_policy
## Length:30218 Length:30218 Length:30218 Length:30218
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## minimum_nights maximum_nights square_feet bedrooms
## Min. : 1 Min. : 1 Min. : 0.0 Min. : 0.000
## 1st Qu.: 1 1st Qu.: 29 1st Qu.: 513.5 1st Qu.: 1.000
## Median : 2 Median : 365 Median : 600.0 Median : 1.000
## Mean : 3317 Mean : 3900 Mean : 773.9 Mean : 1.425
## 3rd Qu.: 3 3rd Qu.: 1125 3rd Qu.:1000.0 3rd Qu.: 2.000
## Max. :100000000 Max. :100000000 Max. :5500.0 Max. :24.000
## NA's :29379 NA's :7
## bathrooms accommodates bed_type room_type
## Min. : 0.00 Min. : 1.000 Length:30218 Length:30218
## 1st Qu.: 1.00 1st Qu.: 2.000 Class :character Class :character
## Median : 1.00 Median : 3.000 Mode :character Mode :character
## Mean : 1.32 Mean : 3.667
## 3rd Qu.: 1.50 3rd Qu.: 4.000
## Max. :21.00 Max. :36.000
## NA's :25
## property_type city neighbourhood_cleansed
## Length:30218 Length:30218 Length:30218
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## host_identity_verified host_is_superhost host_response_time host_response_rate
## Mode :logical Mode :logical Length:30218 Length:30218
## FALSE:15206 FALSE:17466 Class :character Class :character
## TRUE :15009 TRUE :12749 Mode :character Mode :character
## NA's :3 NA's :3
##
##
##
## review_scores_rating number_of_reviews
## Min. : 20.00 Min. : 0.00
## 1st Qu.: 95.00 1st Qu.: 2.00
## Median : 98.00 Median : 17.00
## Mean : 95.78 Mean : 46.39
## 3rd Qu.:100.00 3rd Qu.: 60.00
## Max. :100.00 Max. :963.00
## NA's :4672
6.4 Exploring and processing columns within the dataset
6.4.1 Filtering out specific character levels
Let’s start by looking at columns that are character strings. Although we could just use unique()
, we’re going to do another little trick to look at how many values are present for each unique character string in a column. R’s factor datatype will tell us this if we call a summary() on it. After that, we’ll filter our data to contain only enough values to be useful.
Here I call a column, then covert it to a factor using factor()
, then get a summary all in one go.
summary(factor(air$city))
## <U+897F><U+96C5><U+56FE> Ballard Seattle
## 1 1
## Ballard, Seattle Bernal Heights, San Francisco
## 1 1
## Capitol Hill, Seattle chicago
## 1 1
## Chicago CHICAGO
## 8148 1
## Chicago Heights Chicago,
## 2 1
## Daly City Detroit Lakes
## 44 1
## Dr. Chicago Elmwood Park
## 1 1
## Evanston Evergreen Park
## 1 1
## Happy Valley Lake Forest Park
## 2 3
## Lake Oswego Lincoln Park
## 1 1
## Milwaukie Noe Valley - San Francisco
## 2 1
## Norridge Oak Park
## 1 4
## Pilsen, Chicago Portland
## 1 5543
## Redmond Rogers Park
## 1 1
## San Francisco San Francisco, Hayes Valley
## 7515 1
## Seattle Seattle, Washington, US
## 8911 1
## Shoreline South San Francisco
## 3 2
## Tukwila West Seattle
## 1 1
## NA's
## 15
What we see is that although these appear to be data from four major cities, there are still misspellings and neighborhoods being listed as the city. Without deep knowledge about these places, it would be difficult to correct each one. Thus, we’re going to filter our data frame so that it has just the for most represented cities: Chicago, San Francisco, Seattle, and Portland. We learned how to do this last week.
<- c('Seattle', 'San Francisco', 'Chicago', 'Portland') # make vector of strings we want to keep
cities
<- air %>% # we'll overwrite air
air filter(city %in% cities) # use filter to select just cities where the strings in the column are found in our vector
summary(factor(air$city)) # and check our values again!
## Chicago Portland San Francisco Seattle
## 8148 5543 7515 8911
Looks good!
6.4.2 Automatic filtering of top values
Let’s look at property_type
the same way:
summary(factor(air$property_type))
## Aparthotel Apartment Barn
## 71 11952 1
## Bed and breakfast Boat Boutique hotel
## 141 35 410
## Bungalow Bus Cabin
## 390 4 28
## Camper/RV Casa particular (Cuba) Castle
## 44 1 3
## Cave Chalet Condominium
## 1 1 2763
## Cottage Dome house Earth house
## 91 1 4
## Farm stay Guest suite Guesthouse
## 6 2144 701
## Hostel Hotel House
## 129 159 8530
## Houseboat Hut In-law
## 17 4 2
## Loft Nature lodge Other
## 480 1 76
## Resort Serviced apartment Tent
## 23 544 9
## Timeshare Tiny house Townhouse
## 1 73 1245
## Treehouse Villa Yurt
## 3 25 4
Clearly AirBnB allows for a range of property types to be listed. Although I’m curious what the Cave and Earth houses are like, but there are too few in the dataset to effectively model. Let’s grab the top five and filter like we did above. Instead of manually making the vector of what we want, let’s just automatically select the top five values. To do this we’ll first do a groupby and then select the top five values.
Group_by’s are very SQL like if you have ever worked with a flavor of SQL. You first tell it your data, then what feature you want to group by. You then tell it what you want it to do with each group. Watch:
<- air %>% # Data
top_props group_by(property_type) %>% # Group by unique levels in property_type
summarize(total_obs = n()) # In each level count the number of observations using n() and then make a column called total_obs
## `summarise()` ungrouping output (override with `.groups` argument)
# let's see top_props
## # A tibble: 39 x 2
## property_type total_obs
## <chr> <int>
## 1 Aparthotel 71
## 2 Apartment 11952
## 3 Barn 1
## 4 Bed and breakfast 141
## 5 Boat 35
## 6 Boutique hotel 410
## 7 Bungalow 390
## 8 Bus 4
## 9 Cabin 28
## 10 Camper/RV 44
## # ... with 29 more rows
Great, so this just did a simple count of how many times each property type appears. The perk of this over using the summary(factor())
method is that we now have a dataframe called top_props
that we can work with. Let’s string on the top_n() function to select just the top 5.
<- air %>%
top_props group_by(property_type) %>%
summarize(total_obs = n()) %>%
top_n(5)
# Now we just have the top 5! top_props
## # A tibble: 5 x 2
## property_type total_obs
## <chr> <int>
## 1 Apartment 11952
## 2 Condominium 2763
## 3 Guest suite 2144
## 4 House 8530
## 5 Townhouse 1245
Great! Let’s use the property_type column in our filter to cut down our airBnB data more.
<- air %>%
air filter(property_type %in% top_props$property_type)
summary(factor(air$property_type)) # check
## Apartment Condominium Guest suite House Townhouse
## 11952 2763 2144 8530 1245
Much better! So why bother with this? Well, this methods is automated for one, so if the name of one of the top levels changes you don’t have to recode it. Or, if one property type becomes more or less popular you don’t have to manually update your vector. The other advantage is that you might have way too many unique values to really consider by eyeballing it.
6.4.3 Removing columns
But neighbourhood_cleansed
contains way too many levels to be useful for now. Just looking at the top 20:
head(summary(factor(air$neighbourhood_cleansed)), 20)
## West Town Lake View Mission
## 1003 724 695
## Near North Side Broadway Western Addition
## 604 557 536
## Logan Square South of Market Belltown
## 520 516 470
## Lincoln Park Castro/Upper Market Bernal Heights
## 399 393 390
## Near West Side Haight Ashbury First Hill
## 366 330 316
## Downtown/Civic Center Noe Valley Pike-Market
## 306 306 300
## Loop Wallingford
## 296 296
We can use our select()
function to remove values to. All you do is include a minus (-
) sign before the column you want removed.
<- air %>%
air select(-neighbourhood_cleansed)
glimpse(air) #check that it's removed
## Rows: 26,634
## Columns: 20
## $ price <chr> "$50.00", "$117.00", "$80.00", "$150.00", "$...
## $ cleaning_fee <chr> "$15.00", "$35.00", "$0.00", "$95.00", "$50....
## $ security_deposit <chr> "$0.00", "$0.00", "$0.00", NA, "$200.00", "$...
## $ cancellation_policy <chr> "strict_14_with_grace_period", "moderate", "...
## $ minimum_nights <dbl> 2, 2, 2, 4, 2, 4, 89, 32, 1, 32, 2, 3, 3, 32...
## $ maximum_nights <dbl> 89, 60, 60, 60, 365, 120, 1000, 395, 730, 30...
## $ square_feet <dbl> NA, NA, NA, NA, NA, NA, NA, NA, 1000, NA, NA...
## $ bedrooms <dbl> 1, 3, 1, 1, 1, 2, 2, 0, 2, 2, 3, 1, 1, 2, 1,...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0,...
## $ accommodates <dbl> 1, 7, 2, 4, 2, 4, 4, 3, 7, 6, 6, 2, 2, 6, 4,...
## $ bed_type <chr> "Real Bed", "Real Bed", "Futon", "Real Bed",...
## $ room_type <chr> "Private room", "Entire home/apt", "Entire h...
## $ property_type <chr> "Condominium", "Apartment", "Apartment", "Ap...
## $ city <chr> "Chicago", "Chicago", "Chicago", "Chicago", ...
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ host_response_time <chr> "within an hour", "within an hour", "within ...
## $ host_response_rate <chr> "100%", "100%", "100%", "89%", "93%", "89%",...
## $ review_scores_rating <dbl> 100, 96, 92, 92, 85, 89, 93, 85, 86, 86, 89,...
## $ number_of_reviews <dbl> 149, 368, 338, 35, 38, 9, 9, 37, 178, 44, 47...
6.4.4 Removing special characters
In our initial exploration we saw columns that include either a $ or % symbol. These are forcing the datatypes to be character strings, but we need them as numeric if we want to use that as a continuous feature or target. First we need to remove the symbol, then we need to covert to numeric. We can use the package within tidyverse
called stringr
. There’s a great cheatsheet here
stringr
has lots of functions to add, remove, and extract bits from character strings. We’ll use the function str_remove()
to remove the $. In the syntax you need to specific the column you want to manipulate, then a pipe (%>%
) then the stringr
function. Inside str_remove
you need to specify the character you want to remove, which requires something called regular expression (aka REGEX). REGEX is a bit of a pain so don’t be afraid to do some googling to figure out how to make it work for you. For now, I’ll tell you that you need to include square brackets around the symbol you want to remove for REGEX to recognize it.
Here’s an example. Note how it removes the $ from anywhere in the string
<- c('$12', '13', '14$')
xyz %>% str_remove('[$]') xyz
## [1] "12" "13" "14"
Let’s do it on our price
column first. I’m going to be sure to do the following two things
1. Assign back to the column price
so that it overwrites the character string
2. Add an extra pipe (%>%
) then the function as.numeric()
to covert from a string to a numeric datatype.
$price <- air$price %>% # be sure to assign back to air$price
airstr_remove('[$]') %>% # remove $
as.numeric() # convert to numeric
## Warning in air$price %>% str_remove("[$]") %>% as.numeric(): NAs introduced by
## coercion
# note, you'll get an error message if some values were unable to convert
head(air$price, 10) # So check out the first 10 values to make sure it looks fine
## [1] 50 117 80 150 35 215 99 99 145 99
Your task: I’ll let you go and fix the other three columns: cleaning_fee
, security_deposit
, and host_response_rate
6.4.5 Replacing spaces and other symbols
Some of our columns contain several unique character strings, each of which is its own level. The only issue is that in many of these there are spaces between words. Spaces between words make it difficult to specific them as features, which you’ll see we’re going to do later. So, let’s remove spaces and -’s and replace them with underscores.
Let’s look at room_type
first
unique(air$room_type)
## [1] "Private room" "Entire home/apt" "Shared room"
So, we have spaces we want to remove as well as a /
. Let’s also just convert to lower case. We can use the stringr
function str_replace()
to replace our spaces or other symbols with what we want. You need to specify two arguments: the REGEX string you want to replace, and then what you want to replace it with.
For example:
<- c(' pancake', 'french toast', 'crepe ') # vector with spaces
xyz %>% str_replace('\\s', '_') #REGEX for one space is \\s xyz
## [1] "_pancake" "french_toast" "crepe_"
let’s do it on room_type
now. We’re going to string two str_replace
calls together as we need to get rid of the space and /
.
$room_type <- air$room_type %>%
airstr_replace('\\s', '_') %>%
str_replace('/', '_') %>%
tolower() # convert to lower case while we're at it!
unique(air$room_type) #Check!
## [1] "private_room" "entire_home_apt" "shared_room"
Looks great!
Your task: Can you go do the same for city
, property_type
and bed_type
? Be sure to check out the REGEX cheat sheet!
6.4.6 Rest of the character data
These columns all look good in that there are not strange levels that don’t make sense:
summary(factor(air$cancellation_policy))
## flexible moderate
## 5762 9751
## strict strict_14_with_grace_period
## 783 10092
## super_strict_30 super_strict_60
## 192 54
summary(factor(air$bed_type))
## airbed couch futon pull_out_sofa real_bed
## 66 25 152 91 26300
summary(factor(air$room_type))
## entire_home_apt private_room shared_room
## 18521 7537 576
6.5 Dealing with errors in numeric data
There are several types of errors you’ll find in numeric data
Improperly entered data - This can happen if someone enters in the wrong value or forgets a decimal.
Misleading data - This can happen if people are quickly doing something on a website and just enter in a random number. How many of you have entered in say 1900 as your age if that’s the first thing that pops up on an age verification list?
Badly defined defaults on a website - Some pages may automatically assign a value if the user doesn’t input one. This is really bad from an analysis point of view as it’ll alter your math.
Outright missing data - Actual
NA
values. These are useful as R can operate over them, remove them from analysis, or whatever else you ask.
There are several options for dealing with these issues
Replace with what you think the value should be
Replace with some average value - This is called imputing which we’ll do later
Replace with
NA
valuesRemove the entire observation or column (i.e. the whole row or column).
6.5.1 Replacing erroneous value with known value
If we look at a summary of our minimum_nights
we see that the max is 100000000! Obviously that’s not correct, but it’s so off it’s dramatically shifting our mean.
summary(air$minimum_nights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 1 2 3762 3 100000000
We can do a quick filter to get an idea of how many observations have this strange value. Luckily it’s only the one.
%>% filter(minimum_nights == 100000000) air
## # A tibble: 1 x 20
## price cleaning_fee security_deposit cancellation_po~ minimum_nights
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 68 29 NA strict_14_with_~ 100000000
## # ... with 15 more variables: maximum_nights <dbl>, square_feet <dbl>,
## # bedrooms <dbl>, bathrooms <dbl>, accommodates <dbl>, bed_type <chr>,
## # room_type <chr>, property_type <chr>, city <chr>,
## # host_identity_verified <lgl>, host_is_superhost <lgl>,
## # host_response_time <chr>, host_response_rate <dbl>,
## # review_scores_rating <dbl>, number_of_reviews <dbl>
Let’s replace this value with a simple 1. We can use an ifelse
statement to replace all values that meet a specific criteria. There are three parts to any ifelse
: 1, the condition you want to evaluate if it’s TRUE
or false
; 2, what to do if TRUE
; 3, what to do if FALSE
Below I’m saying ’if minimum_nights
equals 100000000, then replace that value with 1, if not just leave it as whatever value is in air$minimum_nights
. Note that I’m assigning back to the same column.
$minimum_nights <- ifelse(air$minimum_nights == 100000000, 1, air$minimum_nights)
air
summary(air$minimum_nights) #check!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 7.665 3.000 1125.000
hist(air$minimum_nights) #look at a histogram too
Huh, so some people are putting still unreasonably high values for minimum nights. There’s no way someone would actually expect the minimum rental to be 1100+ days! Let’s modify that ifelse
to make the max minimum_nights
30. It’s important to keep in mind where to mark the boundary is a subjective decision based on what you know of the problem and common sense! If you have absolutely no information it might be better to make them an NA
value.
This looks better!
$minimum_nights <- ifelse(air$minimum_nights >= 30, 1, air$minimum_nights)
air
summary(air$minimum_nights) #check!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 2.075 2.000 29.000
hist(air$minimum_nights) #look at a histogram too
Your task: Can you go and fix the column maximum_nights
I won’t tell you what value to choose is extreme and what to replace with, but it should be something that makes sense in the context of the data.
6.5.2 Replace missing value with assumed value
Our columns cleaning_fee
and security_deposit
both have a bunch of NA
values. I’m going to assume the reason why they’re missing is because the agent doesn’t have a cleaning fee or security deposit, and thus the value should be 0 but they didn’t fill it in when listing.
We could use an ifelse to replace our NA
values, but the perk to NA
values is that R can rapidly find them and do things with them - in this case replace them!
Here’s a base R way to use the function is.na()
. So we can use the same syntax that we used to filter rows in our R Refresher lesson. But, instead where we were filtering by asking something like air$security_deposit >= 10
, which would return T or F if the condition is met, we are going to use is.na()
to ask T or F if the value in that row is NA
. One little thing - we’re assigning zero to the every value where that condition is TRUE, so the slicing actually falls right of the arrow here.
Here’s a quick example of is.na()
in action
<- c(1,2,3, NA, 4, 5, NA)
xyz is.na(xyz)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE
Now to replace
$security_deposit[is.na(air$security_deposit)] <- 0
air# "in air$security_deposit, if there is an NA in security_deposit, replace with 0"
summary(air$security_deposit) # looks good!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 142.4 250.0 999.0
Your task Go and fix the cleaning_fee
column. Let’s assuming that NA
values should be 0.
6.5.3 Imputing missing values
Imputing is a process where you make some statistical estimate about the nature of the missing value. We’ll do the simplest form of imputation here by just replacing NA
values with the mean/median of that column.
Let’s look at a summary of bedrooms
.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 1.000 1.000 1.473 2.000 24.000 7
We see that there are only 6 NA
values and the median is 1 and the mean is just a bit above. Let’s replace our NA values with the median. In this case the condition we’re checking in our ifelse
is if the value in air$bedrooms
is NA
using is.na()
A quick note - when asking for the median()
or mean()
of a column you need to tell R what you want to do with the NA
values. If there are NA
values in there R will return NA
back as you can’t calculate with them. If you specify na.rm=TRUE
, R knows to ignore them in the calculate (na.rm
just mean ‘remove na?’). Let’s see an example
<- c(1,2,3, NA, 4, 5, NA)
xyz mean(xyz)
## [1] NA
mean(xyz, na.rm = TRUE)
## [1] 3
OK, do it with bedrooms now.
$bedrooms <- ifelse(is.na(air$bedrooms), median(air$bedrooms, na.rm=TRUE), air$bedrooms)
airsummary(air$bedrooms) # check - looks good!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.473 2.000 24.000
Again, this is the simplest version of imputing. There are more complex methods where it evaluates the data using a linear regression model and imputes the predicted value for the missing one. Say for example if you knew someone’s height but their weight was NA
in a dataframe. You could use the weight ~ height regression model from the previous lesson to provide a smarter estimate!
Your task: Go and fix the bathrooms
column. What should you replace the NA
values with?
6.5.4 Dropping columns
There’s no firm rule on when to just drop a whole column. Still, in this case we can can see that 97.0113% of our square_feet
column are NA
values.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 550.0 600.0 777.1 1000.0 5500.0 25838
## [1] 97.01134
Let’s just drop it as imputing would be pointless as we don’t have nearly enough data to make an informative estimate with.
<- air %>%
air select(-square_feet)
6.6 Rescaling and changing distribution
Sometimes in preprocessing you need to change the scaling or alter the distribution of your features. This is because some model types that we’ll be working with are very sensitive to differences in feature scale. For example, a feature that ranges from 0-70,000 will take priority over one that ranges from 0-70 even if the variance is the same and they’re equally important.
Other times the distribution of the feature is just off and using it as continuous doesn’t make sense.
6.6.1 Converting to binary
Let’s take a look at a histogram of host_response_rate
hist(air$host_response_rate)
These data are clearly not normally distributed, and it might not make sense to really use them as a continuous feature. Instead, it might be worth just making the data binary and considering them as ‘good responders’ and ‘bad responders’. We could define a good responder as someone who responds say 90% of the time, and everything else as a bad responder.
We can create a whole new column and use an ifelse
to define the new values
# Create a new column host_responder
# When host_response rate is greater than 90 then fill with 'good_responder'
# Otherwise they're a 'bad_responder'
$host_responder <- ifelse(air$host_response_rate >= 90, 'good_responder', 'bad_responder')
airglimpse(air)# check
## Rows: 26,634
## Columns: 20
## $ price <dbl> 50, 117, 80, 150, 35, 215, 99, 99, 145, 99, ...
## $ cleaning_fee <dbl> 15, 35, 0, 95, 50, 100, 75, 80, 55, 150, 100...
## $ security_deposit <dbl> 0, 0, 0, 0, 200, 0, 0, 200, 0, 300, 300, 0, ...
## $ cancellation_policy <chr> "strict_14_with_grace_period", "moderate", "...
## $ minimum_nights <dbl> 2, 2, 2, 4, 2, 4, 1, 1, 1, 1, 2, 3, 3, 1, 1,...
## $ maximum_nights <dbl> 89, 60, 60, 60, 365, 120, 365, 365, 365, 300...
## $ bedrooms <dbl> 1, 3, 1, 1, 1, 2, 2, 0, 2, 2, 3, 1, 1, 2, 1,...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0,...
## $ accommodates <dbl> 1, 7, 2, 4, 2, 4, 4, 3, 7, 6, 6, 2, 2, 6, 4,...
## $ bed_type <chr> "real_bed", "real_bed", "futon", "real_bed",...
## $ room_type <chr> "private_room", "entire_home_apt", "entire_h...
## $ property_type <chr> "Condominium", "Apartment", "Apartment", "Ap...
## $ city <chr> "Chicago", "Chicago", "Chicago", "Chicago", ...
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ host_response_time <chr> "within an hour", "within an hour", "within ...
## $ host_response_rate <dbl> 100, 100, 100, 89, 93, 89, 100, 100, 100, 10...
## $ review_scores_rating <dbl> 100, 96, 92, 92, 85, 89, 93, 85, 86, 86, 89,...
## $ number_of_reviews <dbl> 149, 368, 338, 35, 38, 9, 9, 37, 178, 44, 47...
## $ host_responder <chr> "good_responder", "good_responder", "good_re...
6.6.2 Rescaling features
As mentioned above, you sometimes must rescale features so a model fits correctly. Scaling features also helps with interpreting your model output. One common way is to do this is to rescale all your features so they have a mean of 0 and a standard deviation of one using the following formula: \[ x_{scaled} = \frac{x - \overline{x}}{sd_{x}}\]
6.6.2.1 Demonstrating how to scale
Let’s do a quick example of how to scale. Let’s make a normal distribution of prices using the following code and then make a histogram. We can see that the mean falls around 100 as we specified, and that 95% of the data will fall within +/- 20 as we said the SD should be 20.
<- rnorm(mean = 100, sd = 20, n = 100)
price hist(price)
But we can scale it to have a mean of zero and standard deviation to be one using the above formula and a use of the sd()
function. Note how our data are now centered around 0 and 95% of our data fall into +/- one unit.
<- (price - mean(price))/sd(price)
price_scaled hist(price_scaled)
We can also use a convient scale()
function to do this for us.
<- scale(price)
price_scaled_2 hist(price_scaled_2)
6.6.2.2 Why scale?
The perk to this scaling is it makes it so that your regression features all on the same scale, and thus your \(\beta\) coefficients tell you how much your target changes for one \(SD\) change in the feature. Let’s see an example.
First, let’s understand how cleaning_fee
and accommodates
influence price
. You might think that those who have a high cleaning fee charge less (they’re padding their price with cleaning fees). Similarly, you’d assume that larger places can charge more.
Looking at our summary we see really different betas for our two features, but both are highly significant. Is one more important than the other? It’s sorta hard to say. They’re on different scales (cleaning_fee
is related to dollars, where accommodates
considers number of people that’ll fit).
summary(lm(price ~ cleaning_fee + accommodates, data = air))
##
## Call:
## lm(formula = price ~ cleaning_fee + accommodates, data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -475.12 -52.42 -20.34 21.81 888.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.68330 1.51005 21.64 <2e-16 ***
## cleaning_fee 0.74480 0.01303 57.17 <2e-16 ***
## accommodates 18.16350 0.35514 51.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.5 on 23569 degrees of freedom
## (3062 observations deleted due to missingness)
## Multiple R-squared: 0.304, Adjusted R-squared: 0.3039
## F-statistic: 5147 on 2 and 23569 DF, p-value: < 2.2e-16
Let’s wrap the scale()
function around each feature and look again.
summary(lm(price ~ scale(cleaning_fee) + scale(accommodates), data = air))
##
## Call:
## lm(formula = price ~ scale(cleaning_fee) + scale(accommodates),
## data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -475.12 -52.42 -20.34 21.81 888.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.1936 0.7728 205.99 <2e-16 ***
## scale(cleaning_fee) 49.9089 0.8730 57.17 <2e-16 ***
## scale(accommodates) 43.6741 0.8539 51.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 118.5 on 23569 degrees of freedom
## (3062 observations deleted due to missingness)
## Multiple R-squared: 0.304, Adjusted R-squared: 0.3039
## F-statistic: 5147 on 2 and 23569 DF, p-value: < 2.2e-16
So this is in a lot of ways easier to understand as now one unit is one \(SD\). In other words, our beta coefficients say how much the target will change for one \(SD\) in each feature! So we see that each actually has a similar effect when you consider an equal change (1 \(SD\)) in the feature!
It’s worth noting that our \(RSE\), \(R^2\), and \(p\)-value didn’t change… in regression models scaling doesn’t influence the model fit itself, just how you interpret coefficients!
6.7 One hot encoding
What the heck is one hot encoding? Well, it sounds complicated but it’s really just a way to represent categorical data using a series of binary features. We’ve already fit binary features in regression models (did it rain - yes or no; is the person a smoker or not). Those just have two levels, and thus can be represented with a simple 0 and 1.
What happens if we have more than two levels and each level is it’s on category? We have a bunch of features like that here in our AirBnB data. For example city
has four clear categories:
## [1] "Chicago" "Portland" "Seattle" "San_Francisco"
But models can’t directly fit features with multiple categories like this. Instead you need to convert it to a series of binary variables (also known as dummy variables) where each one ‘asks’ if the observation is part of that category. For example, city
can be represented as four features in your data frame:
- city_is_chicago: yes or no
- city_is_Seattle: yes or no
- city_is_San_Francisco: yes or no
- city_is_Portland: yes or no
You could write code to do this, but of course someone has already made a handy function for us! Go install the package caret
and then load it.
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
We can use the wonderful dummyVars()
function to take a single feature with multiple categorical levels and expand it into several dummy features. There are a couple steps to this.
Let’s make dummies for city
. First, you make a dummies object using the dummyVars()
feature. You need to specify the feature you want to convert using the syntax ~ feature_to_convert
. You can use a + if you want to convert multiple at once. You also need to tell it your data using data =
and then specify fullRank = TRUE
<- dummyVars( ~ city, data = air, fullRank = TRUE) my_dummies
Next you need to use this object to predict the values of the newly created columns. You can use the predict()
function. Inside you specify what object you’re using to predict (in this case my_dummies
) and then the data you want to predict from (newdata = air
).
<- predict(my_dummies, newdata = air) my_dummies_pred
my_dummies_pred
is a data frame if dummy variables…
## cityPortland citySan_Francisco citySeattle
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
This isn’t much use as our target is in our original air
data frame. Luckily we can just join these new dummy variables onto air
using the function cbind()
. All cbind means is ‘column bind’ or ‘bind these columns together’.
<- cbind(air, my_dummies_pred) # bind our two data frames
air glimpse(air) # check
## Rows: 26,634
## Columns: 23
## $ price <dbl> 50, 117, 80, 150, 35, 215, 99, 99, 145, 99, ...
## $ cleaning_fee <dbl> 15, 35, 0, 95, 50, 100, 75, 80, 55, 150, 100...
## $ security_deposit <dbl> 0, 0, 0, 0, 200, 0, 0, 200, 0, 300, 300, 0, ...
## $ cancellation_policy <chr> "strict_14_with_grace_period", "moderate", "...
## $ minimum_nights <dbl> 2, 2, 2, 4, 2, 4, 1, 1, 1, 1, 2, 3, 3, 1, 1,...
## $ maximum_nights <dbl> 89, 60, 60, 60, 365, 120, 365, 365, 365, 300...
## $ bedrooms <dbl> 1, 3, 1, 1, 1, 2, 2, 0, 2, 2, 3, 1, 1, 2, 1,...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0,...
## $ accommodates <dbl> 1, 7, 2, 4, 2, 4, 4, 3, 7, 6, 6, 2, 2, 6, 4,...
## $ bed_type <chr> "real_bed", "real_bed", "futon", "real_bed",...
## $ room_type <chr> "private_room", "entire_home_apt", "entire_h...
## $ property_type <chr> "Condominium", "Apartment", "Apartment", "Ap...
## $ city <chr> "Chicago", "Chicago", "Chicago", "Chicago", ...
## $ host_identity_verified <lgl> TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, T...
## $ host_is_superhost <lgl> TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALS...
## $ host_response_time <chr> "within an hour", "within an hour", "within ...
## $ host_response_rate <dbl> 100, 100, 100, 89, 93, 89, 100, 100, 100, 10...
## $ review_scores_rating <dbl> 100, 96, 92, 92, 85, 89, 93, 85, 86, 86, 89,...
## $ number_of_reviews <dbl> 149, 368, 338, 35, 38, 9, 9, 37, 178, 44, 47...
## $ host_responder <chr> "good_responder", "good_responder", "good_re...
## $ cityPortland <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ citySan_Francisco <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ citySeattle <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
But wait, how come we only have three dummy variables when there were four unique levels to city
? Well, the same reason a simple binary (i.e. smoker vs. not) doesn’t have two columns (i.e. yes_smoker & no_smoker). If you include a column for every level and fit all those features you have no reference to compare against. Thus you drop one and that is your baseline. In the case of us encoding city
we can see that Chicago was dropped, and thus it is our reference to compare the other features against. Let’s fit a model to make this more clear.
<- lm(price ~ cityPortland + citySan_Francisco + citySeattle, data = air)
city_mod summary(city_mod)
##
## Call:
## lm(formula = price ~ cityPortland + citySan_Francisco + citySeattle,
## data = air)
##
## Residuals:
## Min 1Q Median 3Q Max
## -195.30 -77.48 -38.30 29.38 878.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.484 1.630 84.350 < 2e-16 ***
## cityPortland -16.865 2.628 -6.418 1.4e-10 ***
## citySan_Francisco 57.820 2.375 24.341 < 2e-16 ***
## citySeattle 25.347 2.261 11.211 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 139.6 on 26379 degrees of freedom
## (251 observations deleted due to missingness)
## Multiple R-squared: 0.03511, Adjusted R-squared: 0.035
## F-statistic: 319.9 on 3 and 26379 DF, p-value: < 2.2e-16
So, all of our \(\beta\) estimates are relative to Chicago. Portland is cheaper, while San Francisco and Seattle are higher. Clearly based on the \(R^2\) there is still a lot of price variation to explain, but city does matter!
A quick note - you’ll see that if you just fit a simpler not encode model of lm(price ~ city, data = air)
that you get the same output. Why is this? Well, R is doing the one hot encoding for you behind the scenes! That’s great and all, but many models won’t do it for you, so it’s good to incorporate this into your workflow.
Your task: Go and one hot encode bed_type
, run a model with the bed types, and then make some inference on how the type of bed in the listing influences price.