2.1 Importing data

In this chapter, we will explore a publicly available dataset of Airbnb data. We found these data here. (These are real data “scraped” from airbnb.com in July 2017. This means that the website owner created a script to automatically collect these data from the airbnb.com website. This is one of the many things that you can also do in R. But first let’s learn the basics.) You can download the dataset by right-clicking on this link, selecting “Save link as…” (or something similar), and saving the .csv file in a directory on your hard drive. As mentioned in the introduction, it’s a good idea to save your work in a directory that is automatically backed up by file-sharing software. Later on, we will save our script in the same directory.

2.1.1 Importing CSV files

To import data into R, click on Import dataset and then on From text (readr). A new window will pop up. Click on Browse and find your data file. Make sure that First row as names is selected (this tells R to treat the first row of your data as the titles of the columns) and then click on Import. After clicking on import, RStudio opens a Viewer tab. This shows you your data in a spreadsheet.

Some computers save .csv files with semicolons (;) instead of commas (,) as the separators or ‘delimiters.’ This usually happens when English is not the first or only language on your computer. If your files are separated by semicolons, click on Import Dataset and find your data file, but now choose Semicolon in the dropdown menu Delimiter.

Note: if you did not save the dataset by right-clicking the link and selecting “Save link as…,” but instead left-clicked the link, then your browser may have ended up opening the dataset. You could then save the dataset by pressing Ctrl+S. Note, however, that your browser may end up saving the dataset as a .txt file. It’s important to then change the extension of your file in the arguments to the read_csv command below.

2.1.2 Setting your working directory

After you have imported your data with Import dataset, check the console window. You’ll see the command for opening the Viewer (View()), and one line above that, you’ll see the command that reads the data. Copy the command that reads the data from the console to your script. In my case it looks like this:

tomslee_airbnb_belgium_1454_2017_07_14 <- read_csv("c:/Dropbox/work/teaching/R/data/tomslee_airbnb_belgium_1454_2017-07-14.csv") 
# Change .csv to .txt if necessary.

This line reads as follows (from right to left): the read_csv function should read the tomslee_airbnb_belgium_1454_2017-07-14.csv file from the c:/Dropbox/work/teaching/R/data/ directory (you will see a different directory here). Then, R should assign (<-) these data to an object named tomslee_airbnb_belgium_1454_2017_07_14.

Before we explain each of these concepts, let’s simplify this line of code:

setwd("c:/Dropbox/work/teaching/R/data/") # Set the working directory to where R needs to look for the .csv file
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") 
# read_csv now does not require a directory anymore and only needs a filename. 
# We assign the data to an object with a simpler name: airbnb instead of tomslee_airbnb_belgium_1454_2017_07_14

The setwd command tells R where your working directory is. Your working directory is a folder on your computer where R will look for data, where plots will be saved, etc. Set your working directory to the folder where you have stored your data. Now, the read_csv file does not require a directory anymore.

You only need to set your working directory once, at the top of your script. You can check whether it’s correctly set by running getwd(). Note that on a Windows computer, file paths have backslashes separating the folders ("C:\folder\data"). However, the filepath you enter into R should use forward slashes ("C:/folder/data").

Save this script in the working directory (in my case: c:/Dropbox/work/teaching/R/data/). In the future you can just run these lines of code to import your data instead of clicking on Import dataset (running lines of code is much faster than pointing and clicking — one of the advantages of using R).

Don’t forget to load the tidyverse package at the top of your script (even before setting the working directory) with library(tidyverse).

2.1.3 Assigning data to objects

Note the <- arrow in the middle of the line that imported the .csv file:

airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv")

<- is the assignment operator. In this case we assign the dataset (i.e., the data we read from the .csv file) to an object named airbnb. An object is a data structure. All the objects you create will show up in the Environment pane (the top right window). RStudio provides a shortcut for writing <-: Alt + - (in Windows). It’s a good idea to learn this shortcut by heart.

When you import data into R, it will become an object called a data frame. A data frame is like a table or an Excel sheet. It has two dimensions: rows and columns. Usually, the rows represent your observations, the columns represent the different variables. When your data consists of only one dimension (e.g., a sequence of numbers or words), it gets stored in a second type of object called a vector. Later on, we’ll learn how to create vectors.

2.1.4 Importing Excel files

R works best with .csv (comma separated values) files. However, data is often stored as an Excel file (you can download the Airbnb dataset as an Excel file here). R can handle that as well, but you’ll need to load a package called readxl first (this package is part of the tidyverse package, but it does not load with library(tidyverse) because it is not a core tidyverse package):

library(readxl) # load the package

airbnb.excel <- read_excel(path = "tomslee_airbnb_belgium_1454_2017-07-14.xlsx", sheet = "Sheet1")
# make sure the Excel file is saved in your working directory

# you can also leave out path = & sheet =
# then the command becomes: read_excel("tomslee_airbnb_belgium_1454_2017-07-14.xlsx", "Sheet1")

read_excel is a function from the readxl package. It takes two arguments: the first one is the filename and the second argument is the name of the Excel sheet that you want to read.

2.1.5 Inspecting the Airbnb dataset

Our dataset contains information on rooms in Belgium listed on airbnb.com. We know for each room (identified by room_id): who the host is (host_id), what type of room it is (room_type), where it is located (country, city, neighborhood, and even the exact latitude and longitude), how many reviews it has received (reviews), how satisfied people were (overall_satisfaction), price (price), and room characteristics (accommodates, bedrooms, bathrooms, minstay).

A really important step is to check that your data were imported correctly. It’s good practice to always inspect your data — do you see any missing values, do the numbers and names make sense? If you start right away with the analysis, you run the risk of having to re-do your analysis because the data weren’t read correctly, or worse, analysing wrong data without noticing.

airbnb # Print the content of the object airbnb.
## # A tibble: 17,651 x 20
##     room_id survey_id  host_id room_type   country city    borough  neighborhood
##       <dbl>     <dbl>    <dbl> <chr>       <lgl>   <chr>   <chr>    <chr>       
##  1  5141135      1454 20676997 Shared room NA      Belgium Gent     Gent        
##  2 13128333      1454 46098805 Shared room NA      Belgium Brussel  Schaarbeek  
##  3  8298885      1454 30924336 Shared room NA      Belgium Brussel  Elsene      
##  4 13822088      1454 81440431 Shared room NA      Belgium Oostende Middelkerke 
##  5 18324301      1454 14294054 Shared room NA      Belgium Brussel  Anderlecht  
##  6 12664969      1454 68810179 Shared room NA      Belgium Brussel  Koekelberg  
##  7 15452889      1454 99127166 Shared room NA      Belgium Gent     Gent        
##  8  3911778      1454  3690027 Shared room NA      Belgium Brussel  Elsene      
##  9 14929414      1454 30624501 Shared room NA      Belgium Verviers Baelen      
## 10  8497852      1454 40513093 Shared room NA      Belgium Brussel  Etterbeek   
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## #   last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>

R tells us we are dealing with a tibble (this is just another word for data frame) with 17651 rows or observations and 20 columns or variables. For each column, the type of the variable is given: int (integer), chr (character), dbl (double), dttm (date-time). Integer and double variables store numbers (integer for round numbers, double for numbers with decimals), character variables store letters, date-time variables store dates and/or times.

R only prints the data of the first ten rows and the maximum number of columns that fit on the screen. If, however, you want to inspect the whole dataset, double-click on the airbnb object in the Environment pane (the top right window) to open open a Viewer tab or run View(airbnb). Mind the capital V in the View command. R is always case-sensitive!

You can also use the print command to ask for more (or less) rows and columns in the console window:

# Print 25 rows (set to Inf to print all rows) & set width to 100 to see more columns. 
# Notice that columns that don't fit on the first screen with 25 rows 
# are printed below the initial 25 rows.

print(airbnb, n = 25, width = 100)
## # A tibble: 17,651 x 20
##     room_id survey_id   host_id room_type   country city    borough   neighborhood
##       <dbl>     <dbl>     <dbl> <chr>       <lgl>   <chr>   <chr>     <chr>       
##  1  5141135      1454  20676997 Shared room NA      Belgium Gent      Gent        
##  2 13128333      1454  46098805 Shared room NA      Belgium Brussel   Schaarbeek  
##  3  8298885      1454  30924336 Shared room NA      Belgium Brussel   Elsene      
##  4 13822088      1454  81440431 Shared room NA      Belgium Oostende  Middelkerke 
##  5 18324301      1454  14294054 Shared room NA      Belgium Brussel   Anderlecht  
##  6 12664969      1454  68810179 Shared room NA      Belgium Brussel   Koekelberg  
##  7 15452889      1454  99127166 Shared room NA      Belgium Gent      Gent        
##  8  3911778      1454   3690027 Shared room NA      Belgium Brussel   Elsene      
##  9 14929414      1454  30624501 Shared room NA      Belgium Verviers  Baelen      
## 10  8497852      1454  40513093 Shared room NA      Belgium Brussel   Etterbeek   
## 11 19372053      1454  18715177 Shared room NA      Belgium Tournai   Brunehaut   
## 12 19855549      1454 128685860 Shared room NA      Belgium Brussel   Etterbeek   
## 13  6772358      1454  34993608 Shared room NA      Belgium Gent      Gent        
## 14 13852832      1454  81817614 Shared room NA      Belgium Arlon     Arlon       
## 15 11581251      1454  49968974 Shared room NA      Belgium Kortrijk  Waregem     
## 16  3645177      1454  18411614 Shared room NA      Belgium Antwerpen Boom        
## 17 12032748      1454  63713892 Shared room NA      Belgium Verviers  Büllingen   
## 18 12034268      1454  63713892 Shared room NA      Belgium Verviers  Büllingen   
## 19   427739      1454   1328011 Shared room NA      Belgium Gent      Gent        
## 20 14194882      1454  86071645 Shared room NA      Belgium Brussel   Sint-Jans-M~
## 21 19298546      1454 107330233 Shared room NA      Belgium Leuven    Rotselaar   
## 22 12133666      1454  62134289 Shared room NA      Belgium Brugge    Jabbeke     
## 23  4419833      1454  22943117 Shared room NA      Belgium Ath       Ath         
## 24 15573750      1454  20497610 Shared room NA      Belgium Leuven    Leuven      
## 25  1334575      1454   3508041 Shared room NA      Belgium Tongeren  Voeren      
##    reviews overall_satisfac~ accommodates bedrooms bathrooms price minstay name 
##      <dbl>             <dbl>        <dbl>    <dbl> <lgl>     <dbl> <lgl>   <chr>
##  1       9               4.5            2        1 NA           59 NA      "Spa~
##  2       2               0              2        1 NA           53 NA      "app~
##  3      12               4              2        1 NA           46 NA      "YOU~
##  4      19               4.5            4        1 NA           56 NA      "stu~
##  5       5               5              2        1 NA           47 NA      "NIC~
##  6      28               5              4        1 NA           60 NA      "Che~
##  7       2               0              2        1 NA           41 NA      "a d~
##  8      13               4              2        1 NA           36 NA      "Cal~
##  9       2               0              8        1 NA           18 NA      "Ecu~
## 10      57               4.5            3        1 NA           38 NA      "Bel~
## 11       1               0              4        1 NA           14 NA      "Cha~
## 12       0               0              2        1 NA           37 NA      "Cos~
## 13     143               5              2        1 NA           28 NA      "Cou~
## 14       0               0              1        1 NA          177 NA      "Log~
## 15       1               0              4        1 NA          147 NA      "Ter~
## 16       3               4.5            2        1 NA          177 NA      "PRI~
## 17       0               0              2        1 NA          129 NA      "Spa~
## 18       0               0              2        1 NA          140 NA      "Spa~
## 19       9               5              2        1 NA          141 NA      "pen~
## 20       0               0              5        1 NA          136 NA      "App~
## 21       1               0              2        1 NA          132 NA      "Hou~
## 22       0               0              1        1 NA          117 NA      "Oud~
## 23       0               0              6        1 NA          106 NA      "Cha~
## 24       3               5              1        1 NA          116 NA      "The~
## 25      13               4.5            2        1 NA          106 NA      "Voe~
## # ... with 17,626 more rows, and 4 more variables: last_modified <chr>, latitude <dbl>,
## #   longitude <dbl>, location <chr>

2.1.6 Importing data from Qualtrics

This section explains how to import data from Qualtrics, which is a popular tool for running online surveys. You can skip this section if you’re working through the Airbnb example.

When you run a survey with Qualtrics, you can download its data by going to ‘Data & Analysis,’ pressing ‘Export & Import’ and then ‘Export Data’ (note that a free Qualtrics account will not allow you to download your data, you have to use an institutional account). Then, you can download your data in different formats. The format that works best with R is CSV. You should ‘Download all fields’ and then decide between using numeric values or choice text. When using numeric values, responses on, e.g., a seven-point Likert scale from 1 = ‘Strongly disagree’ to 7 = ‘Strongly agree’ will be stored as numbers in your .csv file; when using choice text, these responses will be stored as text. Then you need to download your .csv file. When you open it, you’ll see a file in which the first three rows are headers. This complicates a straightforward import into R. We can resolve this issue as follows:

library(tidyverse)
library(magrittr) # install this first with install.packages("magrittr")

# imagine that your .csv file is named data.csv

variable_names <- read_csv("data.csv") %>% names() # read the dataset and get the variable names from the first row

data <- read_csv("data.csv", skip = 3, col_names = FALSE) %>% # then tell R to read the dataset but skip three rows and not assign variable names
  set_names(variable_names) %>% # we set the variable names that we captured before
  select(-c("IPAddress", "Progress", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName", "RecipientEmail", "ExternalReference", "DistributionChannel", "UserLanguage")) %>% # remove some variables that you will hardly ever use
  rename(duration = `Duration (in seconds)`) # rename one variable so that there are no spaces in its name

data # Inspect your dataset. This looks much cleaner than the raw dataset that you get from Qualtrics.

If you often use Qualtrics, it makes sense to turn the above into a function that you can often use. Check out the function on my Github page.

Another tip: When you collect data with Qualtrics (or any other survey tool), make sure that you will end up with a data file that has meaningful variable names. Working with variables such as “cognitive_load” and “willingness_to_pay,” instead of “Q1” and “Q2,” is much easier and leads to fewer mistakes. Creating meaningful variable names also makes it much more convenient for others to work with your data. In Qualtrics, you can name your questions and these will become your variable names.