3.1 Data
3.1.1 Import
You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the .csv
file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Let’s import the data:
library(tidyverse)
setwd("c:/Dropbox/work/teaching/R/") # Set your working directory
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>%
mutate(room_id = factor(room_id), host_id = factor(host_id)) %>%
select(-country, -survey_id) %>% # drop country & survey_id, see introduction for why we do this
rename(country = city, city = borough) # rename city & borough, see introduction for why we do this
Don’t forget to save your script in the working directory.
3.1.2 Manipulate
If you open the airbnb
data frame in a Viewer tab, you’ll see that bathrooms
and minstay
are empty columns and that location
and last_modified
are not very informative. Let’s remove these variables:
Now, have a look at the overall_satisfaction
variable:
# use head() to print only the first few values of a vector, to avoid getting a really long list
# tail() prints only the last few values of a vector
head(airbnb$overall_satisfaction)
## [1] 4.5 0.0 4.0 4.5 5.0 5.0
The second rating is zero. This probably means that the rating is missing rather than that the rating is actually zero. Let’s replace the zero values in overall_satisfaction
with NA
:
airbnb <- airbnb %>%
mutate(overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA))
# create a "new" variable overall_satisfaction that is equal to overall_satisfaction with NA values where overall_satisfaction is equal to zero.
# Say we wanted to replace NA with 0, then the command becomes: replace(overall_satisfaction, is.na(overall_satisfaction), 0)
# overall_satisfaction == NA won't work
head(airbnb$overall_satisfaction)
## [1] 4.5 NA 4.0 4.5 5.0 5.0
3.1.3 Merging datasets
Later on, we’ll test whether price is related to certain characteristics of the room types. Potentially interesting characteristics are: room_type
, city
, reviews
, overall_satisfaction
, etc. To make it even more interesting, we can augment the data, for example with publicly available data on the cities. I’ve collected the population sizes of the most populous Belgian cities from this website. Download these data here and import them into R:
## # A tibble: 183 x 2
## place population
## <chr> <dbl>
## 1 Brussels 1019022
## 2 Antwerpen 459805
## 3 Gent 231493
## 4 Charleroi 200132
## 5 Liège 182597
## 6 Brugge 116709
## 7 Namur 106284
## 8 Leuven 92892
## 9 Mons 91277
## 10 Aalst 77534
## # ... with 173 more rows
Now, we want to link these data to our airbnb
data frame. This is very easy in R (but a huge pain in, for example, Excel):
airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))
# the first argument is the dataset that we want to augment
# the second argument is where we find the data to augment the first dataset with
# the third argument is the variables that we use to link one dataset with the other (city is a variable in airbnb, place is a variable in population)
Check out the most relevant columns of the airbnb.merged
data frame:
## # A tibble: 17,651 x 4
## room_id city price population
## <fct> <chr> <dbl> <dbl>
## 1 5141135 Gent 59 231493
## 2 13128333 Brussel 53 NA
## 3 8298885 Brussel 46 NA
## 4 13822088 Oostende 56 NA
## 5 18324301 Brussel 47 NA
## 6 12664969 Brussel 60 NA
## 7 15452889 Gent 41 231493
## 8 3911778 Brussel 36 NA
## 9 14929414 Verviers 18 52824
## 10 8497852 Brussel 38 NA
## # ... with 17,641 more rows
We see that there is a column population
in our airbnb.merged
dataset. You can also see this in the Environment pane: airbnb.merged
has one variable more than airbnb
(but the same number of observations).
Data for Brussels are missing, however. This is because Brussels is spelled in Dutch in the airbnb
dataset but in English in the population
dataset. Let’s replace Brussels with Brussel in the population
dataset (and also change the spelling of two other cities) and link the data again:
population <- population %>%
mutate(place = replace(place, place == "Brussels", "Brussel"),
place = replace(place, place == "Ostend", "Oostende"),
place = replace(place, place == "Mouscron", "Moeskroen"))
airbnb.merged <- left_join(airbnb, population, by = c("city" = "place"))
airbnb.merged %>% select(room_id, city, price, population)
## # A tibble: 17,651 x 4
## room_id city price population
## <fct> <chr> <dbl> <dbl>
## 1 5141135 Gent 59 231493
## 2 13128333 Brussel 53 1019022
## 3 8298885 Brussel 46 1019022
## 4 13822088 Oostende 56 69011
## 5 18324301 Brussel 47 1019022
## 6 12664969 Brussel 60 1019022
## 7 15452889 Gent 41 231493
## 8 3911778 Brussel 36 1019022
## 9 14929414 Verviers 18 52824
## 10 8497852 Brussel 38 1019022
## # ... with 17,641 more rows
3.1.4 Recap: importing & manipulating
Here’s what we’ve done so far, in one orderly sequence of piped operations (download the data here and here):
library(tidyverse)
setwd("c:/Dropbox/work/teaching/R") # Set your working directory
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>%
mutate(room_id = factor(room_id), host_id = factor(host_id),
overall_satisfaction = replace(overall_satisfaction, overall_satisfaction == 0, NA)) %>%
select(-country, -survey_id,- bathrooms, -minstay, -location, -last_modified) %>%
rename(country = city, city = borough)
population <- read_excel("population.xlsx","data") %>%
mutate(place = replace(place, place == "Brussels", "Brussel"),
place = replace(place, place == "Ostend", "Oostende"),
place = replace(place, place == "Mouscron", "Moeskroen"))
airbnb <- left_join(airbnb, population, by = c("city" = "place"))