2.2 Manipulating data frames
2.2.1 Transforming variables
2.2.1.1 Factorizing
Let’s inspect our dataset again:
## # A tibble: 17,651 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <chr> <chr>
## 1 5.14e6 1454 2.07e7 Shared r~ NA Belg~ Gent Gent
## 2 1.31e7 1454 4.61e7 Shared r~ NA Belg~ Brussel Schaarbeek
## 3 8.30e6 1454 3.09e7 Shared r~ NA Belg~ Brussel Elsene
## 4 1.38e7 1454 8.14e7 Shared r~ NA Belg~ Oosten~ Middelkerke
## 5 1.83e7 1454 1.43e7 Shared r~ NA Belg~ Brussel Anderlecht
## 6 1.27e7 1454 6.88e7 Shared r~ NA Belg~ Brussel Koekelberg
## 7 1.55e7 1454 9.91e7 Shared r~ NA Belg~ Gent Gent
## 8 3.91e6 1454 3.69e6 Shared r~ NA Belg~ Brussel Elsene
## 9 1.49e7 1454 3.06e7 Shared r~ NA Belg~ Vervie~ Baelen
## 10 8.50e6 1454 4.05e7 Shared r~ NA Belg~ Brussel Etterbeek
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## # last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>
We see that room_id
and host_id
are ‘identifiers’ or labels that identify the observations. They are names (in this case just numbers) for the specific rooms and hosts. However, we see that R treats them as integers, i.e., as numbers. This means we could add the room_id
‘s of two different rooms and get a new number. This would not make a lot of sense though, because the room_id
’s are just labels. Let’s make sure R treats the identifiers as labels instead of numbers by ’factorizing’ them. Notice the $
operator. This very important operator allows us to select specific variables from a data frame, in this case room_id
and host_id
.
A factor variable is similar to a character variable in that it stores letters. Factors are most useful for variables that can only take on a number of pre-determined categories. They should, for example, be used for categorical dependent variables — e.g., whether a sale was made or not: sale
vs. non-sale
. You can think of factors as variables that store labels. The actual labels themselves are not that important (we don’t really care whether a sale is called sale
or success
or something else), we only use them to make a distinction between different categories. It’s very important to factorize integer variables that represent categorical independent or dependent variables, because if we don’t factorize these variables, they will be treated as continuous instead of categorical variables in analyses. For example, a variable can represent a sale as 1 and a non-sale as 0. In that case, it’s important to tell R that this variable should be treated as a categorical instead of a continuous variable.
Character variables are different from factor variables in that they are not just labels for categories. An example of a character variable would be a variable that stores survey respondents’ answers to an open question. Here, the actual content is important (we do care whether someone describes their stay at an Airbnb as very good
or excellent
or something else).
In the airbnb
dataset, the room_id
’s are not strictly determined beforehand, but they definitely are labels and should not be treated as numbers, so we tell R to convert them to factors. Let’s have a look at the airbnb
dataset again to check whether the type of these variables has changed after factorizing:
## # A tibble: 17,651 x 22
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <chr> <chr>
## 1 5.14e6 1454 2.07e7 Shared r~ NA Belg~ Gent Gent
## 2 1.31e7 1454 4.61e7 Shared r~ NA Belg~ Brussel Schaarbeek
## 3 8.30e6 1454 3.09e7 Shared r~ NA Belg~ Brussel Elsene
## 4 1.38e7 1454 8.14e7 Shared r~ NA Belg~ Oosten~ Middelkerke
## 5 1.83e7 1454 1.43e7 Shared r~ NA Belg~ Brussel Anderlecht
## 6 1.27e7 1454 6.88e7 Shared r~ NA Belg~ Brussel Koekelberg
## 7 1.55e7 1454 9.91e7 Shared r~ NA Belg~ Gent Gent
## 8 3.91e6 1454 3.69e6 Shared r~ NA Belg~ Brussel Elsene
## 9 1.49e7 1454 3.06e7 Shared r~ NA Belg~ Vervie~ Baelen
## 10 8.50e6 1454 4.05e7 Shared r~ NA Belg~ Brussel Etterbeek
## # ... with 17,641 more rows, and 14 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## # last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>,
## # room_id_F <fct>, host_id_F <fct>
We see that the type of room_id
and host_id
is now fct
(factor).
2.2.1.2 Numerical transformations
Let’s have a look at the ratings of the accommodations:
# I use the head function to make sure R only shows the first few ratings.
# Otherwise we get a really long list of ratings.
head(airbnb$overall_satisfaction)
## [1] 4.5 0.0 4.0 4.5 5.0 5.0
We see that ratings are on a scale from 0 to 5. If we prefer to have the ratings on a scale from 0 to 100, we could simply multiply the ratings by 20:
airbnb$overall_satisfaction_100 <- airbnb$overall_satisfaction * 20
# Note that we are creating a new variable overall_satisfaction_100.
# The original variable overall_satisfaction remains unchanged.
# You can also inspect the whole dataset with the Viewer
# and see that there is a new column all the way on the right.
head(airbnb$overall_satisfaction_100)
## [1] 90 0 80 90 100 100
2.2.1.3 Transforming variables with the mutate function
We can also transform variables with the mutate
function:
airbnb <- mutate(airbnb,
room_id_F = factor(room_id), host_id_F = factor(host_id),
overall_satisfaction_100 = overall_satisfaction * 20)
This tells R to take the airbnb
dataset, create a new variable room_id_F
that should be the factorization of room_id
, a new variable host_id_F
that should be the factorization of host_id
, and a new variable overall_satisfaction_100
that should be overall_satisfaction
times 20. The dataset with these mutations (transformations) should then be assigned to the airbnb
object. Note that we don’t need to use the $
operator here, because the mutate
function knows from its first argument (airbnb
) where to look for certain variables, and therefore we don’t need to specify it with airbnb$
later on. One advantage of using the mutate
function is that it nicely keeps all our desired transformations inside one command. Another big advantage of using mutate
will be discussed in the section on the pipe operator.
2.2.2 Including or excluding and renaming variables (columns)
If we look at the data, we can also see that country
is NA
, which means not available or missing. city
is always Belgium (which is wrong because Belgium is a country, not a city) and borough
contains the information on the city. Let’s correct these mistakes by dropping the country
variable from our dataset and by renaming city
and borough
. We’ll also drop survey_id
because this variable is constant across observations and we won’t use it in the rest of the analysis:
airbnb <- select(airbnb, -country, -survey_id)
# Tell R to drop country & survey_id from the airbnb data frame by including a minus sign before these variables.
# Re-assign this new data frame to the airbnb object.
airbnb # You'll now see that country & survey_id are gone.
airbnb <- rename(airbnb, country = city, city = borough)
# Tell R to rename some variables from the airbnb data frame and re-assign this new data frame to the airbnb object.
# Note: the syntax is a bit counterintuitive: new variable name (country) = old variable name (city)!
airbnb # country = Belgium now and city refers to cities
2.2.3 Including or excluding observations (rows)
2.2.3.1 Creating a vector with c()
Later on, we’ll make a graph of Airbnb prices in Belgium’s ten largest cities (in terms of population): Brussels, Antwerpen, Gent, Charleroi, Liege, Brugge, Namur, Leuven, Mons, Aalst.
For this, we need to create a data object that only has data for the ten largest cities. To do this, we first need a vector with the names of the ten largest cities, so that in the next section, we can tell R to include only the data from these cities:
topten <- c("Brussel","Antwerpen","Gent","Charleroi","Liege","Brugge","Namur","Leuven","Mons","Aalst") # Create a vector with the top ten largest cities.
topten # Show the vector.
## [1] "Brussel" "Antwerpen" "Gent" "Charleroi" "Liege" "Brugge"
## [7] "Namur" "Leuven" "Mons" "Aalst"
Remember, a vector is a one-dimensional data structure (unlike a data frame which has two dimensions, i.e., columns and rows). We use the c()
operator to create a vector that we call topten
. c()
is an abbreviation of concatenate, which means putting things together. The topten
vector is a vector of strings (words). There should be quotation marks around strings. A vector of numbers, however, does not require quotation marks:
## [1] 0 2 4 6
Any vector that you will create will appear as an object in the Environment pane (top right window).
2.2.3.2 Including or excluding observations with the filter function
To retain only the data from the ten largest cities, we need the %in%
operator from package Hmisc:
We can now use the filter
function to instruct R to retain the data from only the top ten largest cities:
airbnb.topten <- filter(airbnb, city %in% topten)
# Filter the airbnb data frame so that we retain only those cities in the topten vector.
# Store the filtered dataset in an object named airbnb.topten.
# So we're creating a new dataset airbnb.topten which is a subset of the airbnb dataset.
# Check the Environment pane to see that the airbnb.topten dataset has less observations than the airbnb dataset,
# because it only has the data for the top ten largest cities.