3.1 Data

3.1.1 Import

You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the .csv file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Let’s import the data:

library(tidyverse)
setwd("c:/Dropbox/work/teaching/R/") # Set your working directory 

airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>% 
  mutate(room_id = factor(room_id), host_id = factor(host_id)) %>% 
  select(-country, -survey_id) %>% # drop country & survey_id, see introduction for why we do this
  rename(country = city, city = borough) # rename city & borough, see introduction for why we do this

Don’t forget to save your script in the working directory.

3.1.2 Manipulate

If you open the airbnb data frame in a Viewer tab, you’ll see that bathrooms and minstay are empty columns and that location and last_modified are not very informative. Let’s remove these variables:

airbnb <- airbnb %>% 
  select (-bathrooms, -minstay, -location, -last_modified)

Now, have a look at the overall_satisfaction variable:

# use head() to print only the first few values of a vector, to avoid getting a really long list
# tail() prints only the last few values of a vector
head(airbnb$overall_satisfaction) 
## [1] 4.5 0.0 4.0 4.5 5.0 5.0

The second rating is zero. This probably means that the rating is missing rather than that the rating is actually zero. Let’s replace the zero values in overall_satisfaction with NA:

airbnb <- airbnb %>% 
  mutate(overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA)) 
# create a "new" variable overall_satisfaction that is equal to overall_satisfaction with NA values where overall_satisfaction is equal to zero. 

# Say we wanted to replace NA with 0, then the command becomes: replace(overall_satisfaction, is.na(overall_satisfaction), 0)
# overall_satisfaction == NA won't work

head(airbnb$overall_satisfaction)
## [1] 4.5  NA 4.0 4.5 5.0 5.0

3.1.3 Merging datasets

Later on, we’ll test whether price is related to certain characteristics of the room types. Potentially interesting characteristics are: room_type, city, reviews, overall_satisfaction, etc. To make it even more interesting, we can augment the data, for example with publicly available data on the cities. I’ve collected the population sizes of the most populous Belgian cities from this website. Download these data here and import them into R:

population <- read_excel("population.xlsx","data")
population
## # A tibble: 183 x 2
##    place     population
##    <chr>          <dbl>
##  1 Brussels     1019022
##  2 Antwerpen     459805
##  3 Gent          231493
##  4 Charleroi     200132
##  5 Liège         182597
##  6 Brugge        116709
##  7 Namur         106284
##  8 Leuven         92892
##  9 Mons           91277
## 10 Aalst          77534
## # ... with 173 more rows

Now, we want to link these data to our airbnb data frame. This is very easy in R (but a huge pain in, for example, Excel):

airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))
# the first argument is the dataset that we want to augment
# the second argument is where we find the data to augment the first dataset with
# the third argument is the variables that we use to link one dataset with the other (city is a variable in airbnb, place is a variable in population)

Check out the most relevant columns of the airbnb.merged data frame:

airbnb.merged %>% select(room_id, city, price, population)
## # A tibble: 17,651 x 4
##    room_id  city     price population
##    <fct>    <chr>    <dbl>      <dbl>
##  1 5141135  Gent        59     231493
##  2 13128333 Brussel     53         NA
##  3 8298885  Brussel     46         NA
##  4 13822088 Oostende    56         NA
##  5 18324301 Brussel     47         NA
##  6 12664969 Brussel     60         NA
##  7 15452889 Gent        41     231493
##  8 3911778  Brussel     36         NA
##  9 14929414 Verviers    18      52824
## 10 8497852  Brussel     38         NA
## # ... with 17,641 more rows

We see that there is a column population in our airbnb.merged dataset. You can also see this in the Environment pane: airbnb.merged has one variable more than airbnb (but the same number of observations).

Data for Brussels are missing, however. This is because Brussels is spelled in Dutch in the airbnb dataset but in English in the population dataset. Let’s replace Brussels with Brussel in the population dataset (and also change the spelling of two other cities) and link the data again:

population <- population %>% 
  mutate(place = replace(place, place == "Brussels", "Brussel"),
         place = replace(place, place == "Ostend", "Oostende"),
         place = replace(place, place == "Mouscron", "Moeskroen"))

airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))

airbnb.merged %>% select(room_id, city, price, population)
## # A tibble: 17,651 x 4
##    room_id  city     price population
##    <fct>    <chr>    <dbl>      <dbl>
##  1 5141135  Gent        59     231493
##  2 13128333 Brussel     53    1019022
##  3 8298885  Brussel     46    1019022
##  4 13822088 Oostende    56      69011
##  5 18324301 Brussel     47    1019022
##  6 12664969 Brussel     60    1019022
##  7 15452889 Gent        41     231493
##  8 3911778  Brussel     36    1019022
##  9 14929414 Verviers    18      52824
## 10 8497852  Brussel     38    1019022
## # ... with 17,641 more rows

3.1.4 Recap: importing & manipulating

Here’s what we’ve done so far, in one orderly sequence of piped operations (download the data here and here):

library(tidyverse)
setwd("c:/Dropbox/work/teaching/R") # Set your working directory 

airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>% 
  mutate(room_id = factor(room_id), host_id = factor(host_id),
         overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA)) %>% 
  select(-country, -survey_id,- bathrooms, -minstay, -location, -last_modified) %>% 
  rename(country = city, city = borough) 

population <- read_excel("population.xlsx","data") %>% 
  mutate(place = replace(place, place == "Brussels", "Brussel"),
         place = replace(place, place == "Ostend", "Oostende"),
         place = replace(place, place == "Mouscron", "Moeskroen"))

airbnb <- left_join(airbnb, population, by = c("city" = "place"))