2.1 Importing data
In this chapter, we will explore a publicly available dataset of Airbnb data. We found these data here. (These are real data “scraped” from airbnb.com in July 2017. This means that the website owner created a script to automatically collect these data from the airbnb.com website. This is one of the many things that you can also do in R. But first let’s learn the basics.) You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the .csv
file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Later on, we will save our script in the same directory.
2.1.1 Importing CSV files
To import data into R, click on Import dataset and then on From text (readr). A new window will pop up. Click on Browse and find your data file. Make sure that First row as names is selected (this tells R to treat the first row of your data as the titles of the columns) and then click on Import. After clicking on import, RStudio opens a Viewer tab. This shows you your data in a spreadsheet.
Some computers save .csv
files with semicolons (;
) instead of commas (,
) as the separators or ‘delimiters.’ This usually happens when English is not the first or only language on your computer. If your files are separated by semicolons, click on Import Dataset and find your data file, but now choose Semicolon in the dropdown menu Delimiter.
Note: if you did not save the dataset by right-clicking the link and selecting “Save link as…,” but instead left-clicked the link, then your browser may have ended up opening the dataset. You could then save the dataset by pressing Ctrl+S. Note, however, that your browser may end up saving the dataset as a .txt
file. It’s important to then change the extension of your file in the arguments to the read_csv
command below.
2.1.2 Setting your working directory
After you have imported your data with Import dataset, check the console window. You’ll see the command for opening the Viewer (View()
), and one line above that, you’ll see the command that reads the data. Copy the command that reads the data from the console to your script. In my case it looks like this:
<- read_csv("c:/Dropbox/work/teaching/R/data/tomslee_airbnb_belgium_1454_2017-07-14.csv")
tomslee_airbnb_belgium_1454_2017_07_14 # Change .csv to .txt if necessary.
This line reads as follows (from right to left): the read_csv
function should read the tomslee_airbnb_belgium_1454_2017-07-14.csv
file from the c:/Dropbox/work/teaching/R/data/
directory (you will see a different directory here). Then, R should assign (<-
) these data to an object named tomslee_airbnb_belgium_1454_2017_07_14
.
Before we explain each of these concepts, let’s simplify this line of code:
setwd("c:/Dropbox/work/teaching/R/data/") # Set the working directory to where R needs to look for the .csv file
<- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv")
airbnb # read_csv now does not require a directory anymore and only needs a filename.
# We assign the data to an object with a simpler name: airbnb instead of tomslee_airbnb_belgium_1454_2017_07_14
The setwd
command tells R where your working directory is. Your working directory is a folder on your computer where R will look for data, where plots will be saved, etc. Set your working directory to the folder where you have stored your data. Now, the read_csv
file does not require a directory anymore.
You only need to set your working directory once, at the top of your script. You can check whether it’s correctly set by running getwd()
. Note that on a Windows computer, file paths have backslashes separating the folders ("C:\folder\data"
). However, the filepath you enter into R should use forward slashes ("C:/folder/data"
).
Save this script in the working directory (in my case: c:/Dropbox/work/teaching/R/data/
). In the future you can just run these lines of code to import your data instead of clicking on Import dataset (running lines of code is much faster than pointing and clicking — one of the advantages of using R).
Don’t forget to load the tidyverse
package at the top of your script (even before setting the working directory) with library(tidyverse)
.
2.1.3 Assigning data to objects
Note the <-
arrow in the middle of the line that imported the .csv
file:
<- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") airbnb
<-
is the assignment operator. In this case we assign the dataset (i.e., the data we read from the .csv
file) to an object named airbnb
. An object is a data structure. All the objects you create will show up in the Environment pane (the top right window). RStudio provides a shortcut for writing <-
: Alt + - (in Windows). It’s a good idea to learn this shortcut by heart.
When you import data into R, it will become an object called a data frame. A data frame is like a table or an Excel sheet. It has two dimensions: rows and columns. Usually, the rows represent your observations, the columns represent the different variables. When your data consists of only one dimension (e.g., a sequence of numbers or words), it gets stored in a second type of object called a vector. Later on, we’ll learn how to create vectors.
2.1.4 Importing Excel files
R works best with .csv
(comma separated values) files. However, data is often stored as an Excel file (you can download the Airbnb dataset as an Excel file here). R can handle that as well, but you’ll need to load a package called readxl
first (this package is part of the tidyverse
package, but it does not load with library(tidyverse)
because it is not a core tidyverse
package):
library(readxl) # load the package
<- read_excel(path = "tomslee_airbnb_belgium_1454_2017-07-14.xlsx", sheet = "Sheet1")
airbnb.excel # make sure the Excel file is saved in your working directory
# you can also leave out path = & sheet =
# then the command becomes: read_excel("tomslee_airbnb_belgium_1454_2017-07-14.xlsx", "Sheet1")
read_excel
is a function from the readxl
package. It takes two arguments: the first one is the filename and the second argument is the name of the Excel sheet that you want to read.
2.1.5 Inspecting the Airbnb dataset
Our dataset contains information on rooms in Belgium listed on airbnb.com. We know for each room (identified by room_id
): who the host is (host_id
), what type of room it is (room_type
), where it is located (country
, city
, neighborhood
, and even the exact latitude
and longitude
), how many reviews it has received (reviews
), how satisfied people were (overall_satisfaction
), price (price
), and room characteristics (accommodates
, bedrooms
, bathrooms
, minstay
).
A really important step is to check that your data were imported correctly. It’s good practice to always inspect your data — do you see any missing values, do the numbers and names make sense? If you start right away with the analysis, you run the risk of having to re-do your analysis because the data weren’t read correctly, or worse, analysing wrong data without noticing.
# Print the content of the object airbnb. airbnb
## # A tibble: 17,651 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <chr> <chr>
## 1 5141135 1454 20676997 Shared room NA Belgium Gent Gent
## 2 13128333 1454 46098805 Shared room NA Belgium Brussel Schaarbeek
## 3 8298885 1454 30924336 Shared room NA Belgium Brussel Elsene
## 4 13822088 1454 81440431 Shared room NA Belgium Oostende Middelkerke
## 5 18324301 1454 14294054 Shared room NA Belgium Brussel Anderlecht
## 6 12664969 1454 68810179 Shared room NA Belgium Brussel Koekelberg
## 7 15452889 1454 99127166 Shared room NA Belgium Gent Gent
## 8 3911778 1454 3690027 Shared room NA Belgium Brussel Elsene
## 9 14929414 1454 30624501 Shared room NA Belgium Verviers Baelen
## 10 8497852 1454 40513093 Shared room NA Belgium Brussel Etterbeek
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## # last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>
R tells us we are dealing with a tibble
(this is just another word for data frame) with 17651 rows or observations and 20 columns or variables. For each column, the type of the variable is given: int
(integer), chr
(character), dbl
(double), dttm
(date-time). Integer and double variables store numbers (integer for round numbers, double for numbers with decimals), character variables store letters, date-time variables store dates and/or times.
R only prints the data of the first ten rows and the maximum number of columns that fit on the screen. If, however, you want to inspect the whole dataset, double-click on the airbnb
object in the Environment pane (the top right window) to open open a Viewer tab or run View(airbnb)
. Mind the capital V in the View
command. R is always case-sensitive!
You can also use the print
command to ask for more (or less) rows and columns in the console window:
# Print 25 rows (set to Inf to print all rows) & set width to 100 to see more columns.
# Notice that columns that don't fit on the first screen with 25 rows
# are printed below the initial 25 rows.
print(airbnb, n = 25, width = 100)
## # A tibble: 17,651 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <chr> <chr>
## 1 5141135 1454 20676997 Shared room NA Belgium Gent Gent
## 2 13128333 1454 46098805 Shared room NA Belgium Brussel Schaarbeek
## 3 8298885 1454 30924336 Shared room NA Belgium Brussel Elsene
## 4 13822088 1454 81440431 Shared room NA Belgium Oostende Middelkerke
## 5 18324301 1454 14294054 Shared room NA Belgium Brussel Anderlecht
## 6 12664969 1454 68810179 Shared room NA Belgium Brussel Koekelberg
## 7 15452889 1454 99127166 Shared room NA Belgium Gent Gent
## 8 3911778 1454 3690027 Shared room NA Belgium Brussel Elsene
## 9 14929414 1454 30624501 Shared room NA Belgium Verviers Baelen
## 10 8497852 1454 40513093 Shared room NA Belgium Brussel Etterbeek
## 11 19372053 1454 18715177 Shared room NA Belgium Tournai Brunehaut
## 12 19855549 1454 128685860 Shared room NA Belgium Brussel Etterbeek
## 13 6772358 1454 34993608 Shared room NA Belgium Gent Gent
## 14 13852832 1454 81817614 Shared room NA Belgium Arlon Arlon
## 15 11581251 1454 49968974 Shared room NA Belgium Kortrijk Waregem
## 16 3645177 1454 18411614 Shared room NA Belgium Antwerpen Boom
## 17 12032748 1454 63713892 Shared room NA Belgium Verviers Büllingen
## 18 12034268 1454 63713892 Shared room NA Belgium Verviers Büllingen
## 19 427739 1454 1328011 Shared room NA Belgium Gent Gent
## 20 14194882 1454 86071645 Shared room NA Belgium Brussel Sint-Jans-M~
## 21 19298546 1454 107330233 Shared room NA Belgium Leuven Rotselaar
## 22 12133666 1454 62134289 Shared room NA Belgium Brugge Jabbeke
## 23 4419833 1454 22943117 Shared room NA Belgium Ath Ath
## 24 15573750 1454 20497610 Shared room NA Belgium Leuven Leuven
## 25 1334575 1454 3508041 Shared room NA Belgium Tongeren Voeren
## reviews overall_satisfac~ accommodates bedrooms bathrooms price minstay name
## <dbl> <dbl> <dbl> <dbl> <lgl> <dbl> <lgl> <chr>
## 1 9 4.5 2 1 NA 59 NA "Spa~
## 2 2 0 2 1 NA 53 NA "app~
## 3 12 4 2 1 NA 46 NA "YOU~
## 4 19 4.5 4 1 NA 56 NA "stu~
## 5 5 5 2 1 NA 47 NA "NIC~
## 6 28 5 4 1 NA 60 NA "Che~
## 7 2 0 2 1 NA 41 NA "a d~
## 8 13 4 2 1 NA 36 NA "Cal~
## 9 2 0 8 1 NA 18 NA "Ecu~
## 10 57 4.5 3 1 NA 38 NA "Bel~
## 11 1 0 4 1 NA 14 NA "Cha~
## 12 0 0 2 1 NA 37 NA "Cos~
## 13 143 5 2 1 NA 28 NA "Cou~
## 14 0 0 1 1 NA 177 NA "Log~
## 15 1 0 4 1 NA 147 NA "Ter~
## 16 3 4.5 2 1 NA 177 NA "PRI~
## 17 0 0 2 1 NA 129 NA "Spa~
## 18 0 0 2 1 NA 140 NA "Spa~
## 19 9 5 2 1 NA 141 NA "pen~
## 20 0 0 5 1 NA 136 NA "App~
## 21 1 0 2 1 NA 132 NA "Hou~
## 22 0 0 1 1 NA 117 NA "Oud~
## 23 0 0 6 1 NA 106 NA "Cha~
## 24 3 5 1 1 NA 116 NA "The~
## 25 13 4.5 2 1 NA 106 NA "Voe~
## # ... with 17,626 more rows, and 4 more variables: last_modified <chr>, latitude <dbl>,
## # longitude <dbl>, location <chr>
2.1.6 Importing data from Qualtrics
This section explains how to import data from Qualtrics, which is a popular tool for running online surveys. You can skip this section if you’re working through the Airbnb example.
When you run a survey with Qualtrics, you can download its data by going to ‘Data & Analysis,’ pressing ‘Export & Import’ and then ‘Export Data’ (note that a free Qualtrics account will not allow you to download your data, you have to use an institutional account).
Then, you can download your data in different formats.
The format that works best with R is CSV.
You should ‘Download all fields’ and then decide between using numeric values or choice text.
When using numeric values, responses on, e.g., a seven-point Likert scale from 1 = ‘Strongly disagree’ to 7 = ‘Strongly agree’ will be stored as numbers in your .csv
file; when using choice text, these responses will be stored as text.
Then you need to download your .csv
file.
When you open it, you’ll see a file in which the first three rows are headers.
This complicates a straightforward import into R.
We can resolve this issue as follows:
library(tidyverse)
library(magrittr) # install this first with install.packages("magrittr")
# imagine that your .csv file is named data.csv
<- read_csv("data.csv") %>% names() # read the dataset and get the variable names from the first row
variable_names
<- read_csv("data.csv", skip = 3, col_names = FALSE) %>% # then tell R to read the dataset but skip three rows and not assign variable names
data set_names(variable_names) %>% # we set the variable names that we captured before
select(-c("IPAddress", "Progress", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName", "RecipientEmail", "ExternalReference", "DistributionChannel", "UserLanguage")) %>% # remove some variables that you will hardly ever use
rename(duration = `Duration (in seconds)`) # rename one variable so that there are no spaces in its name
# Inspect your dataset. This looks much cleaner than the raw dataset that you get from Qualtrics. data
If you often use Qualtrics, it makes sense to turn the above into a function that you can often use. Check out the function on my Github page.
Another tip: When you collect data with Qualtrics (or any other survey tool), make sure that you will end up with a data file that has meaningful variable names. Working with variables such as “cognitive_load” and “willingness_to_pay,” instead of “Q1” and “Q2,” is much easier and leads to fewer mistakes. Creating meaningful variable names also makes it much more convenient for others to work with your data. In Qualtrics, you can name your questions and these will become your variable names.