2.2 Manipulating data frames

2.2.1 Transforming variables

2.2.1.1 Factorizing

Let’s inspect our dataset again:

## # A tibble: 17,651 x 20
##    room_id survey_id host_id room_type country city  borough neighborhood
##      <dbl>     <dbl>   <dbl> <chr>     <lgl>   <chr> <chr>   <chr>       
##  1  5.14e6      1454  2.07e7 Shared r~ NA      Belg~ Gent    Gent        
##  2  1.31e7      1454  4.61e7 Shared r~ NA      Belg~ Brussel Schaarbeek  
##  3  8.30e6      1454  3.09e7 Shared r~ NA      Belg~ Brussel Elsene      
##  4  1.38e7      1454  8.14e7 Shared r~ NA      Belg~ Oosten~ Middelkerke 
##  5  1.83e7      1454  1.43e7 Shared r~ NA      Belg~ Brussel Anderlecht  
##  6  1.27e7      1454  6.88e7 Shared r~ NA      Belg~ Brussel Koekelberg  
##  7  1.55e7      1454  9.91e7 Shared r~ NA      Belg~ Gent    Gent        
##  8  3.91e6      1454  3.69e6 Shared r~ NA      Belg~ Brussel Elsene      
##  9  1.49e7      1454  3.06e7 Shared r~ NA      Belg~ Vervie~ Baelen      
## 10  8.50e6      1454  4.05e7 Shared r~ NA      Belg~ Brussel Etterbeek   
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## #   last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>

We see that room_id and host_id are ‘identifiers’ or labels that identify the observations. They are names (in this case just numbers) for the specific rooms and hosts. However, we see that R treats them as integers, i.e., as numbers. This means we could add the room_id‘s of two different rooms and get a new number. This would not make a lot of sense though, because the room_id’s are just labels. Let’s make sure R treats the identifiers as labels instead of numbers by ’factorizing’ them. Notice the $ operator. This very important operator allows us to select specific variables from a data frame, in this case room_id and host_id.

A factor variable is similar to a character variable in that it stores letters. Factors are most useful for variables that can only take on a number of pre-determined categories. They should, for example, be used for categorical dependent variables — e.g., whether a sale was made or not: sale vs. non-sale. You can think of factors as variables that store labels. The actual labels themselves are not that important (we don’t really care whether a sale is called sale or success or something else), we only use them to make a distinction between different categories. It’s very important to factorize integer variables that represent categorical independent or dependent variables, because if we don’t factorize these variables, they will be treated as continuous instead of categorical variables in analyses. For example, a variable can represent a sale as 1 and a non-sale as 0. In that case, it’s important to tell R that this variable should be treated as a categorical instead of a continuous variable.

Character variables are different from factor variables in that they are not just labels for categories. An example of a character variable would be a variable that stores survey respondents’ answers to an open question. Here, the actual content is important (we do care whether someone describes their stay at an Airbnb as very good or excellent or something else).

In the airbnb dataset, the room_id’s are not strictly determined beforehand, but they definitely are labels and should not be treated as numbers, so we tell R to convert them to factors. Let’s have a look at the airbnb dataset again to check whether the type of these variables has changed after factorizing:

## # A tibble: 17,651 x 22
##    room_id survey_id host_id room_type country city  borough neighborhood
##      <dbl>     <dbl>   <dbl> <chr>     <lgl>   <chr> <chr>   <chr>       
##  1  5.14e6      1454  2.07e7 Shared r~ NA      Belg~ Gent    Gent        
##  2  1.31e7      1454  4.61e7 Shared r~ NA      Belg~ Brussel Schaarbeek  
##  3  8.30e6      1454  3.09e7 Shared r~ NA      Belg~ Brussel Elsene      
##  4  1.38e7      1454  8.14e7 Shared r~ NA      Belg~ Oosten~ Middelkerke 
##  5  1.83e7      1454  1.43e7 Shared r~ NA      Belg~ Brussel Anderlecht  
##  6  1.27e7      1454  6.88e7 Shared r~ NA      Belg~ Brussel Koekelberg  
##  7  1.55e7      1454  9.91e7 Shared r~ NA      Belg~ Gent    Gent        
##  8  3.91e6      1454  3.69e6 Shared r~ NA      Belg~ Brussel Elsene      
##  9  1.49e7      1454  3.06e7 Shared r~ NA      Belg~ Vervie~ Baelen      
## 10  8.50e6      1454  4.05e7 Shared r~ NA      Belg~ Brussel Etterbeek   
## # ... with 17,641 more rows, and 14 more variables: reviews <dbl>,
## #   overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## #   bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## #   last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>,
## #   room_id_F <fct>, host_id_F <fct>

We see that the type of room_id and host_id is now fct (factor).

2.2.1.3 Transforming variables with the mutate function

We can also transform variables with the mutate function:

This tells R to take the airbnb dataset, create a new variable room_id_F that should be the factorization of room_id, a new variable host_id_F that should be the factorization of host_id, and a new variable overall_satisfaction_100 that should be overall_satisfaction times 20. The dataset with these mutations (transformations) should then be assigned to the airbnb object. Note that we don’t need to use the $ operator here, because the mutate function knows from its first argument (airbnb) where to look for certain variables, and therefore we don’t need to specify it with airbnb$ later on. One advantage of using the mutate function is that it nicely keeps all our desired transformations inside one command. Another big advantage of using mutate will be discussed in the section on the pipe operator.

2.2.2 Including or excluding and renaming variables (columns)

If we look at the data, we can also see that country is NA, which means not available or missing. city is always Belgium (which is wrong because Belgium is a country, not a city) and borough contains the information on the city. Let’s correct these mistakes by dropping the country variable from our dataset and by renaming city and borough. We’ll also drop survey_id because this variable is constant across observations and we won’t use it in the rest of the analysis:

2.2.3 Including or excluding observations (rows)

2.2.3.1 Creating a vector with c()

Later on, we’ll make a graph of Airbnb prices in Belgium’s ten largest cities (in terms of population): Brussels, Antwerpen, Gent, Charleroi, Liege, Brugge, Namur, Leuven, Mons, Aalst.

For this, we need to create a data object that only has data for the ten largest cities. To do this, we first need a vector with the names of the ten largest cities, so that in the next section, we can tell R to include only the data from these cities:

##  [1] "Brussel"   "Antwerpen" "Gent"      "Charleroi" "Liege"     "Brugge"   
##  [7] "Namur"     "Leuven"    "Mons"      "Aalst"

Remember, a vector is a one-dimensional data structure (unlike a data frame which has two dimensions, i.e., columns and rows). We use the c() operator to create a vector that we call topten. c() is an abbreviation of concatenate, which means putting things together. The topten vector is a vector of strings (words). There should be quotation marks around strings. A vector of numbers, however, does not require quotation marks:

## [1] 0 2 4 6

Any vector that you will create will appear as an object in the Environment pane (top right window).