3.1 Data
3.1.1 Import
You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the .csv file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Let’s import the data:
library(tidyverse)
setwd("c:/Dropbox/work/teaching/R/") # Set your working directory
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>%
mutate(room_id = factor(room_id), host_id = factor(host_id)) %>%
select(-country, -survey_id) %>% # drop country & survey_id, see introduction for why we do this
rename(country = city, city = borough) # rename city & borough, see introduction for why we do thisDon’t forget to save your script in the working directory.
3.1.2 Manipulate
If you open the airbnb data frame in a Viewer tab, you’ll see that bathrooms and minstay are empty columns and that location and last_modified are not very informative. Let’s remove these variables:
airbnb <- airbnb %>%
select (-bathrooms, -minstay, -location, -last_modified)Now, have a look at the overall_satisfaction variable:
# use head() to print only the first few values of a vector, to avoid getting a really long list
# tail() prints only the last few values of a vector
head(airbnb$overall_satisfaction) ## [1] 4.5 0.0 4.0 4.5 5.0 5.0
The second rating is zero. This probably means that the rating is missing rather than that the rating is actually zero. Let’s replace the zero values in overall_satisfaction with NA:
airbnb <- airbnb %>%
mutate(overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA))
# create a "new" variable overall_satisfaction that is equal to overall_satisfaction with NA values where overall_satisfaction is equal to zero.
# Say we wanted to replace NA with 0, then the command becomes: replace(overall_satisfaction, is.na(overall_satisfaction), 0)
# overall_satisfaction == NA won't work
head(airbnb$overall_satisfaction)## [1] 4.5 NA 4.0 4.5 5.0 5.0
3.1.3 Merging datasets
Later on, we’ll test whether price is related to certain characteristics of the room types. Potentially interesting characteristics are: room_type, city, reviews, overall_satisfaction, etc. To make it even more interesting, we can augment the data, for example with publicly available data on the cities. I’ve collected the population sizes of the most populous Belgian cities from this website. Download these data here and import them into R:
population <- read_excel("population.xlsx","data")population## # A tibble: 183 x 2
## place population
## <chr> <dbl>
## 1 Brussels 1019022
## 2 Antwerpen 459805
## 3 Gent 231493
## 4 Charleroi 200132
## 5 Liège 182597
## 6 Brugge 116709
## 7 Namur 106284
## 8 Leuven 92892
## 9 Mons 91277
## 10 Aalst 77534
## # ... with 173 more rows
Now, we want to link these data to our airbnb data frame. This is very easy in R (but a huge pain in, for example, Excel):
airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))
# the first argument is the dataset that we want to augment
# the second argument is where we find the data to augment the first dataset with
# the third argument is the variables that we use to link one dataset with the other (city is a variable in airbnb, place is a variable in population)Check out the most relevant columns of the airbnb.merged data frame:
airbnb.merged %>% select(room_id, city, price, population)## # A tibble: 17,651 x 4
## room_id city price population
## <fct> <chr> <dbl> <dbl>
## 1 5141135 Gent 59 231493
## 2 13128333 Brussel 53 NA
## 3 8298885 Brussel 46 NA
## 4 13822088 Oostende 56 NA
## 5 18324301 Brussel 47 NA
## 6 12664969 Brussel 60 NA
## 7 15452889 Gent 41 231493
## 8 3911778 Brussel 36 NA
## 9 14929414 Verviers 18 52824
## 10 8497852 Brussel 38 NA
## # ... with 17,641 more rows
We see that there is a column population in our airbnb.merged dataset. You can also see this in the Environment pane: airbnb.merged has one variable more than airbnb (but the same number of observations).
Data for Brussels are missing, however. This is because Brussels is spelled in Dutch in the airbnb dataset but in English in the population dataset. Let’s replace Brussels with Brussel in the population dataset (and also change the spelling of two other cities) and link the data again:
population <- population %>%
mutate(place = replace(place, place == "Brussels", "Brussel"),
place = replace(place, place == "Ostend", "Oostende"),
place = replace(place, place == "Mouscron", "Moeskroen"))
airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))
airbnb.merged %>% select(room_id, city, price, population)## # A tibble: 17,651 x 4
## room_id city price population
## <fct> <chr> <dbl> <dbl>
## 1 5141135 Gent 59 231493
## 2 13128333 Brussel 53 1019022
## 3 8298885 Brussel 46 1019022
## 4 13822088 Oostende 56 69011
## 5 18324301 Brussel 47 1019022
## 6 12664969 Brussel 60 1019022
## 7 15452889 Gent 41 231493
## 8 3911778 Brussel 36 1019022
## 9 14929414 Verviers 18 52824
## 10 8497852 Brussel 38 1019022
## # ... with 17,641 more rows
3.1.4 Recap: importing & manipulating
Here’s what we’ve done so far, in one orderly sequence of piped operations (download the data here and here):
library(tidyverse)
setwd("c:/Dropbox/work/teaching/R") # Set your working directory
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>%
mutate(room_id = factor(room_id), host_id = factor(host_id),
overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA)) %>%
select(-country, -survey_id,- bathrooms, -minstay, -location, -last_modified) %>%
rename(country = city, city = borough)
population <- read_excel("population.xlsx","data") %>%
mutate(place = replace(place, place == "Brussels", "Brussel"),
place = replace(place, place == "Ostend", "Oostende"),
place = replace(place, place == "Mouscron", "Moeskroen"))
airbnb <- left_join(airbnb, population, by = c("city" = "place"))