2.2 Manipulating data frames

2.2.1 Transforming variables

2.2.1.1 Factorizing

Let’s inspect our dataset again:

airbnb

## # A tibble: 17,651 x 20
##     room_id survey_id  host_id room_type   country city    borough  neighborhood
##       <dbl>     <dbl>    <dbl> <chr>       <lgl>   <chr>   <chr>    <chr>       
##  1  5141135      1454 20676997 Shared room NA      Belgium Gent     Gent        
##  2 13128333      1454 46098805 Shared room NA      Belgium Brussel  Schaarbeek  
##  3  8298885      1454 30924336 Shared room NA      Belgium Brussel  Elsene      
##  4 13822088      1454 81440431 Shared room NA      Belgium Oostende Middelkerke 
##  5 18324301      1454 14294054 Shared room NA      Belgium Brussel  Anderlecht  
##  6 12664969      1454 68810179 Shared room NA      Belgium Brussel  Koekelberg  
##  7 15452889      1454 99127166 Shared room NA      Belgium Gent     Gent        
##  8  3911778      1454  3690027 Shared room NA      Belgium Brussel  Elsene      
##  9 14929414      1454 30624501 Shared room NA      Belgium Verviers Baelen      
## 10  8497852      1454 40513093 Shared room NA      Belgium Brussel  Etterbeek   
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## #   last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>

We see that room_id and host_id are ‘identifiers’ or labels that identify the observations. They are names (in this case just numbers) for the specific rooms and hosts. However, we see that R treats them as integers, i.e., as numbers. This means we could add the room_id‘s of two different rooms and get a new number. This would not make a lot of sense though, because the room_id’s are just labels. Let’s make sure R treats the identifiers as labels instead of numbers by ’factorizing’ them. Notice the $ operator. This very important operator allows us to select specific variables from a data frame, in this case room_id and host_id.

airbnb$room_id <- factor(airbnb$room_id)
airbnb$host_id <- factor(airbnb$host_id)

A factor variable is similar to a character variable in that it stores letters. Factors are most useful for variables that can only take on a number of pre-determined categories. They should, for example, be used for categorical dependent variables — e.g., whether a sale was made or not: sale vs. non-sale. You can think of factors as variables that store labels. The actual labels themselves are not that important (we don’t really care whether a sale is called sale or success or something else), we only use them to make a distinction between different categories. It’s very important to factorize integer variables that represent categorical independent or dependent variables, because if we don’t factorize these variables, they will be treated as continuous instead of categorical variables in analyses. For example, a variable can represent a sale as 1 and a non-sale as 0. In that case, it’s important to tell R that this variable should be treated as a categorical instead of a continuous variable.

Character variables are different from factor variables in that they are not just labels for categories. An example of a character variable would be a variable that stores survey respondents’ answers to an open question. Here, the actual content is important (we do care whether someone describes their stay at an Airbnb as very good or excellent or something else).

In the airbnb dataset, the room_id’s are not strictly determined beforehand, but they definitely are labels and should not be treated as numbers, so we tell R to convert them to factors. Let’s have a look at the airbnb dataset again to check whether the type of these variables has changed after factorizing:

airbnb

## # A tibble: 17,651 x 20
##    room_id  survey_id host_id  room_type   country city    borough  neighborhood
##    <fct>        <dbl> <fct>    <chr>       <lgl>   <chr>   <chr>    <chr>       
##  1 5141135       1454 20676997 Shared room NA      Belgium Gent     Gent        
##  2 13128333      1454 46098805 Shared room NA      Belgium Brussel  Schaarbeek  
##  3 8298885       1454 30924336 Shared room NA      Belgium Brussel  Elsene      
##  4 13822088      1454 81440431 Shared room NA      Belgium Oostende Middelkerke 
##  5 18324301      1454 14294054 Shared room NA      Belgium Brussel  Anderlecht  
##  6 12664969      1454 68810179 Shared room NA      Belgium Brussel  Koekelberg  
##  7 15452889      1454 99127166 Shared room NA      Belgium Gent     Gent        
##  8 3911778       1454 3690027  Shared room NA      Belgium Brussel  Elsene      
##  9 14929414      1454 30624501 Shared room NA      Belgium Verviers Baelen      
## 10 8497852       1454 40513093 Shared room NA      Belgium Brussel  Etterbeek   
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## #   last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>

We see that the type of room_id and host_id is now fct (factor).

2.2.1.2 Numerical transformations

Let’s have a look at the ratings of the accommodations:

# I use the head function to make sure R only shows the first few ratings. 
# Otherwise we get a really long list of ratings.
head(airbnb$overall_satisfaction)

## [1] 4.5 0.0 4.0 4.5 5.0 5.0

We see that ratings are on a scale from 0 to 5. If we prefer to have the ratings on a scale from 0 to 100, we could simply multiply the ratings by 20:

airbnb$overall_satisfaction_100 <- airbnb$overall_satisfaction * 20 
# Note that we are creating a new variable overall_satisfaction_100.
# The original variable overall_satisfaction remains unchanged.


# You can also inspect the whole dataset with the Viewer 
# and see that there is a new column all the way on the right.
head(airbnb$overall_satisfaction_100)

## [1]  90   0  80  90 100 100

2.2.1.3 Transforming variables with the mutate function

We can also transform variables with the mutate function:

airbnb <- mutate(airbnb, 
                 room_id = factor(room_id), host_id = factor(host_id),
                 overall_satisfaction_100 = overall_satisfaction * 20)

This tells R to take the airbnb dataset, overwrite the variable room_id with the factorization of room_id, overwrite the variable host_id with the factorization of host_id, and create a new variable overall_satisfaction_100 that should be overall_satisfaction times 20. The dataset with these mutations (transformations) should then be assigned to the airbnb object. Note that we don’t need to use the $ operator here, because the mutate function knows from its first argument (airbnb) where to look for certain variables, and therefore we don’t need to specify it with airbnb$ later on. One advantage of using the mutate function is that it nicely keeps all our desired transformations inside one command. Another big advantage of using mutate will be discussed in the section on the pipe operator.

2.2.2 Including or excluding and renaming variables (columns)

If we look at the data, we can also see that country is NA, which means not available or missing. city is always Belgium (which is wrong because Belgium is a country, not a city) and borough contains the information on the city. Let’s correct these mistakes by dropping the country variable from our dataset and by renaming city and borough. We’ll also drop survey_id because this variable is constant across observations and we won’t use it in the rest of the analysis:

airbnb <- select(airbnb, -country, -survey_id) 
# Tell R to drop country & survey_id from the airbnb data frame by including a minus sign before these variables. 
# Re-assign this new data frame to the airbnb object.
airbnb # You'll now see that country & survey_id are gone.

airbnb <- rename(airbnb, country = city, city = borough) 
# Tell R to rename some variables from the airbnb data frame and re-assign this new data frame to the airbnb object. 
# Note: the syntax is a bit counterintuitive: new variable name (country) = old variable name (city)!
airbnb # country = Belgium now and city refers to cities

2.2.3 Including or excluding observations (rows)

2.2.3.1 Creating a vector with c()

Later on, we’ll make a graph of Airbnb prices in Belgium’s ten largest cities (in terms of population): Brussels, Antwerpen, Gent, Charleroi, Liege, Brugge, Namur, Leuven, Mons, Aalst.

For this, we need to create a data object that only has data for the ten largest cities. To do this, we first need a vector with the names of the ten largest cities, so that in the next section, we can tell R to include only the data from these cities:

topten <- c("Brussel","Antwerpen","Gent","Charleroi","Liege","Brugge","Namur","Leuven","Mons","Aalst") # Create a vector with the top ten largest cities.
topten # Show the vector.

##  [1] "Brussel"   "Antwerpen" "Gent"      "Charleroi" "Liege"     "Brugge"   
##  [7] "Namur"     "Leuven"    "Mons"      "Aalst"

Remember, a vector is a one-dimensional data structure (unlike a data frame which has two dimensions, i.e., columns and rows). We use the c() operator to create a vector that we call topten. c() is an abbreviation of concatenate, which means putting things together. The topten vector is a vector of strings (words). There should be quotation marks around strings. A vector of numbers, however, does not require quotation marks:

number_vector <- c(0,2,4,6)
number_vector

## [1] 0 2 4 6

Any vector that you will create will appear as an object in the Environment pane (top right window).

2.2.3.2 Including or excluding observations with the filter function

To retain only the data from the ten largest cities, we need the %in% operator from package Hmisc:

install.packages("Hmisc")
library(Hmisc)

We can now use the filter function to instruct R to retain the data from only the top ten largest cities:

airbnb.topten <- filter(airbnb, city %in% topten) 
# Filter the airbnb data frame so that we retain only those cities in the topten vector. 
# Store the filtered dataset in an object named airbnb.topten.

# So we're creating a new dataset airbnb.topten which is a subset of the airbnb dataset.
# Check the Environment pane to see that the airbnb.topten dataset has less observations than the airbnb dataset, 
# because it only has the data for the top ten largest cities.