2.1 Importing data
In this chapter, we will explore a publicly available dataset of Airbnb data. We found these data here. (These are real data “scraped” from airbnb.com in July 2017. This means that the website owner created a script to automatically collect these data from the airbnb.com website. This is one of the many things that you can also do in R. But first let’s learn the basics.) You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the
.csv file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Later on, we will save our script in the same directory.
2.1.1 Importing CSV files
To import data into R, click on Import dataset and then on From text (readr). A new window will pop up. Click on Browse and find your data file. Make sure that First row as names is selected (this tells R to treat the first row of your data as the titles of the columns) and then click on Import. After clicking on import, RStudio opens a Viewer tab. This shows you your data in a spreadsheet.
Some computers save
.csv files with semicolons (
;) instead of commas (
,) as the separators or ‘delimiters’. This usually happens when English is not the first or only language on your computer. If your files are separated by semicolons, click on Import Dataset and find your data file, but now choose Semicolon in the dropdown menu Delimiter.
Note: if you did not save the dataset by right-clicking the link and selecting “Save link as…”, but instead left-clicked the link, then your browser may have ended up opening the dataset. You could then save the dataset by pressing Ctrl+S. Note, however, that your browser may end up saving the dataset as a
.txt file. It’s important to then change the extension of your file in the arguments to the
read_csv command below.
2.1.2 Setting your working directory
After you have imported your data with Import dataset, check the console window. You’ll see the command for opening the Viewer (
View()), and one line above that, you’ll see the command that reads the data. Copy the command that reads the data from the console to your script. In my case it looks like this:
tomslee_airbnb_belgium_1454_2017_07_14 <- read_csv("c:/Dropbox/work/teaching/R/data/tomslee_airbnb_belgium_1454_2017-07-14.csv") # Change .csv to .txt if necessary.
This line reads as follows (from right to left): the
read_csv function should read the
tomslee_airbnb_belgium_1454_2017-07-14.csv file from the
c:/Dropbox/work/teaching/R/data/ directory (you will see a different directory here). Then, R should assign (
<-) these data to an object named
Before we explain each of these concepts, let’s simplify this line of code:
setwd("c:/Dropbox/work/teaching/R/data/") # Set the working directory to where R needs to look for the .csv file airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") # read_csv now does not require a directory anymore and only needs a filename. # We assign the data to an object with a simpler name: airbnb instead of tomslee_airbnb_belgium_1454_2017_07_14
setwd command tells R where your working directory is. Your working directory is a folder on your computer where R will look for data, where plots will be saved, etc. Set your working directory to the folder where you have stored your data. Now, the
read_csv file does not require a directory anymore.
You only need to set your working directory once, at the top of your script. You can check whether it’s correctly set by running
getwd(). Note that on a Windows computer, file paths have backslashes separating the folders (
"C:\folder\data"). However, the filepath you enter into R should use forward slashes (
Save this script in the working directory (in my case:
c:/Dropbox/work/teaching/R/data/). In the future you can just run these lines of code to import your data instead of clicking on Import dataset (running lines of code is much faster than pointing and clicking — one of the advantages of using R).
Don’t forget to load the
tidyverse package at the top of your script (even before setting the working directory) with
2.1.3 Assigning data to objects
<- arrow in the middle of the line that imported the
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv")
<- is the assignment operator. In this case we assign the dataset (i.e., the data we read from the
.csv file) to an object named
airbnb. An object is a data structure. All the objects you create will show up in the Environment pane (the top right window). RStudio provides a shortcut for writing
<-: Alt + - (in Windows). It’s a good idea to learn this shortcut by heart.
When you import data into R, it will become an object called a data frame. A data frame is like a table or an Excel sheet. It has two dimensions: rows and columns. Usually, the rows represent your observations, the columns represent the different variables. When your data consists of only one dimension (e.g., a sequence of numbers or words), it gets stored in a second type of object called a vector. Later on, we’ll learn how to create vectors.
2.1.4 Importing Excel files
R works best with
.csv (comma separated values) files. However, data is often stored as an Excel file (you can download the Airbnb dataset as an Excel file here). R can handle that as well, but you’ll need to load a package called
readxl first (this package is part of the
tidyverse package, but it does not load with
library(tidyverse) because it is not a core
library(readxl) # load the package airbnb.excel <- read_excel(path = "tomslee_airbnb_belgium_1454_2017-07-14.xlsx", sheet = "Sheet1") # make sure the Excel file is saved in your working directory # you can also leave out path = & sheet = # then the command becomes: read_excel("tomslee_airbnb_belgium_1454_2017-07-14.xlsx", "Sheet1")
read_excel is a function from the
readxl package. It takes two arguments: the first one is the filename and the second argument is the name of the Excel sheet that you want to read.
2.1.5 Inspecting the Airbnb dataset
Our dataset contains information on rooms in Belgium listed on airbnb.com. We know for each room (identified by
room_id): who the host is (
host_id), what type of room it is (
room_type), where it is located (
neighborhood, and even the exact
longitude), how many reviews it has received (
reviews), how satisfied people were (
overall_satisfaction), price (
price), and room characteristics (
A really important step is to check that your data were imported correctly. It’s good practice to always inspect your data — do you see any missing values, do the numbers and names make sense? If you start right away with the analysis, you run the risk of having to re-do your analysis because the data weren’t read correctly, or worse, analysing wrong data without noticing.
airbnb # Print the content of the object airbnb.
## # A tibble: 17,651 x 20 ## room_id survey_id host_id room_type country city borough neighborhood ## <int> <int> <int> <chr> <chr> <chr> <chr> <chr> ## 1 5.14e6 1454 2.07e7 Shared r~ <NA> Belg~ Gent Gent ## 2 1.31e7 1454 4.61e7 Shared r~ <NA> Belg~ Brussel Schaarbeek ## 3 8.30e6 1454 3.09e7 Shared r~ <NA> Belg~ Brussel Elsene ## 4 1.38e7 1454 8.14e7 Shared r~ <NA> Belg~ Oosten~ Middelkerke ## 5 1.83e7 1454 1.43e7 Shared r~ <NA> Belg~ Brussel Anderlecht ## 6 1.27e7 1454 6.88e7 Shared r~ <NA> Belg~ Brussel Koekelberg ## 7 1.55e7 1454 9.91e7 Shared r~ <NA> Belg~ Gent Gent ## 8 3.91e6 1454 3.69e6 Shared r~ <NA> Belg~ Brussel Elsene ## 9 1.49e7 1454 3.06e7 Shared r~ <NA> Belg~ Vervie~ Baelen ## 10 8.50e6 1454 4.05e7 Shared r~ <NA> Belg~ Brussel Etterbeek ## # ... with 17,641 more rows, and 12 more variables: reviews <int>, ## # overall_satisfaction <dbl>, accommodates <int>, bedrooms <dbl>, ## # bathrooms <chr>, price <dbl>, minstay <chr>, name <chr>, ## # last_modified <dttm>, latitude <dbl>, longitude <dbl>, location <chr>
R tells us we are dealing with a
tibble (this is just another word for data frame) with 17651 rows or observations and 20 columns or variables. For each column, the type of the variable is given:
dttm (date-time). Integer and double variables store numbers (integer for round numbers, double for numbers with decimals), character variables store letters, date-time variables store dates and/or times.
R only prints the data of the first ten rows and the maximum number of columns that fit on the screen. If, however, you want to inspect the whole dataset, double-click on the
airbnb object in the Environment pane (the top right window) to open open a Viewer tab or run
View(airbnb). Mind the capital V in the
View command. R is always case-sensitive!
You can also use the
# Print 25 rows (set to Inf to print all rows) & set width to 100 to see more columns. # Notice that columns that don't fit on the first screen with 25 rows # are printed below the initial 25 rows. print(airbnb, n = 25, width = 100)
## # A tibble: 17,651 x 20 ## room_id survey_id host_id room_type country city borough neighborhood ## <int> <int> <int> <chr> <chr> <chr> <chr> <chr> ## 1 5.14e6 1454 2.07e7 Shared r~ <NA> Belg~ Gent Gent ## 2 1.31e7 1454 4.61e7 Shared r~ <NA> Belg~ Brussel Schaarbeek ## 3 8.30e6 1454 3.09e7 Shared r~ <NA> Belg~ Brussel Elsene ## 4 1.38e7 1454 8.14e7 Shared r~ <NA> Belg~ Oosten~ Middelkerke ## 5 1.83e7 1454 1.43e7 Shared r~ <NA> Belg~ Brussel Anderlecht ## 6 1.27e7 1454 6.88e7 Shared r~ <NA> Belg~ Brussel Koekelberg ## 7 1.55e7 1454 9.91e7 Shared r~ <NA> Belg~ Gent Gent ## 8 3.91e6 1454 3.69e6 Shared r~ <NA> Belg~ Brussel Elsene ## 9 1.49e7 1454 3.06e7 Shared r~ <NA> Belg~ Vervie~ Baelen ## 10 8.50e6 1454 4.05e7 Shared r~ <NA> Belg~ Brussel Etterbeek ## 11 1.94e7 1454 1.87e7 Shared r~ <NA> Belg~ Tournai Brunehaut ## 12 1.99e7 1454 1.29e8 Shared r~ <NA> Belg~ Brussel Etterbeek ## 13 6.77e6 1454 3.50e7 Shared r~ <NA> Belg~ Gent Gent ## 14 1.39e7 1454 8.18e7 Shared r~ <NA> Belg~ Arlon Arlon ## 15 1.16e7 1454 5.00e7 Shared r~ <NA> Belg~ Kortri~ Waregem ## 16 3.65e6 1454 1.84e7 Shared r~ <NA> Belg~ Antwer~ Boom ## 17 1.20e7 1454 6.37e7 Shared r~ <NA> Belg~ Vervie~ Büllingen ## 18 1.20e7 1454 6.37e7 Shared r~ <NA> Belg~ Vervie~ Büllingen ## 19 4.28e5 1454 1.33e6 Shared r~ <NA> Belg~ Gent Gent ## 20 1.42e7 1454 8.61e7 Shared r~ <NA> Belg~ Brussel Sint-Jans-M~ ## 21 1.93e7 1454 1.07e8 Shared r~ <NA> Belg~ Leuven Rotselaar ## 22 1.21e7 1454 6.21e7 Shared r~ <NA> Belg~ Brugge Jabbeke ## 23 4.42e6 1454 2.29e7 Shared r~ <NA> Belg~ Ath Ath ## 24 1.56e7 1454 2.05e7 Shared r~ <NA> Belg~ Leuven Leuven ## 25 1.33e6 1454 3.51e6 Shared r~ <NA> Belg~ Tonger~ Voeren ## reviews overall_satisfa~ accommodates bedrooms bathrooms price minstay ## <int> <dbl> <int> <dbl> <chr> <dbl> <chr> ## 1 9 4.5 2 1 <NA> 59 <NA> ## 2 2 0 2 1 <NA> 53 <NA> ## 3 12 4 2 1 <NA> 46 <NA> ## 4 19 4.5 4 1 <NA> 56 <NA> ## 5 5 5 2 1 <NA> 47 <NA> ## 6 28 5 4 1 <NA> 60 <NA> ## 7 2 0 2 1 <NA> 41 <NA> ## 8 13 4 2 1 <NA> 36 <NA> ## 9 2 0 8 1 <NA> 18 <NA> ## 10 57 4.5 3 1 <NA> 38 <NA> ## 11 1 0 4 1 <NA> 14 <NA> ## 12 0 0 2 1 <NA> 37 <NA> ## 13 143 5 2 1 <NA> 28 <NA> ## 14 0 0 1 1 <NA> 177 <NA> ## 15 1 0 4 1 <NA> 147 <NA> ## 16 3 4.5 2 1 <NA> 177 <NA> ## 17 0 0 2 1 <NA> 129 <NA> ## 18 0 0 2 1 <NA> 140 <NA> ## 19 9 5 2 1 <NA> 141 <NA> ## 20 0 0 5 1 <NA> 136 <NA> ## 21 1 0 2 1 <NA> 132 <NA> ## 22 0 0 1 1 <NA> 117 <NA> ## 23 0 0 6 1 <NA> 106 <NA> ## 24 3 5 1 1 <NA> 116 <NA> ## 25 13 4.5 2 1 <NA> 106 <NA> ## # ... with 1.763e+04 more rows, and 5 more variables: name <chr>, ## # last_modified <dttm>, latitude <dbl>, longitude <dbl>, location <chr>