2.2 Manipulating data frames
2.2.1 Transforming variables
2.2.1.1 Factorizing
Let’s inspect our dataset again:
airbnb
## # A tibble: 17,651 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <dbl> <dbl> <dbl> <chr> <lgl> <chr> <chr> <chr>
## 1 5141135 1454 20676997 Shared room NA Belgium Gent Gent
## 2 13128333 1454 46098805 Shared room NA Belgium Brussel Schaarbeek
## 3 8298885 1454 30924336 Shared room NA Belgium Brussel Elsene
## 4 13822088 1454 81440431 Shared room NA Belgium Oostende Middelkerke
## 5 18324301 1454 14294054 Shared room NA Belgium Brussel Anderlecht
## 6 12664969 1454 68810179 Shared room NA Belgium Brussel Koekelberg
## 7 15452889 1454 99127166 Shared room NA Belgium Gent Gent
## 8 3911778 1454 3690027 Shared room NA Belgium Brussel Elsene
## 9 14929414 1454 30624501 Shared room NA Belgium Verviers Baelen
## 10 8497852 1454 40513093 Shared room NA Belgium Brussel Etterbeek
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## # last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>
We see that room_id
and host_id
are ‘identifiers’ or labels that identify the observations. They are names (in this case just numbers) for the specific rooms and hosts. However, we see that R treats them as integers, i.e., as numbers. This means we could add the room_id
‘s of two different rooms and get a new number. This would not make a lot of sense though, because the room_id
’s are just labels. Let’s make sure R treats the identifiers as labels instead of numbers by ’factorizing’ them. Notice the $
operator. This very important operator allows us to select specific variables from a data frame, in this case room_id
and host_id
.
$room_id <- factor(airbnb$room_id)
airbnb$host_id <- factor(airbnb$host_id) airbnb
A factor variable is similar to a character variable in that it stores letters. Factors are most useful for variables that can only take on a number of pre-determined categories. They should, for example, be used for categorical dependent variables — e.g., whether a sale was made or not: sale
vs. non-sale
. You can think of factors as variables that store labels. The actual labels themselves are not that important (we don’t really care whether a sale is called sale
or success
or something else), we only use them to make a distinction between different categories. It’s very important to factorize integer variables that represent categorical independent or dependent variables, because if we don’t factorize these variables, they will be treated as continuous instead of categorical variables in analyses. For example, a variable can represent a sale as 1 and a non-sale as 0. In that case, it’s important to tell R that this variable should be treated as a categorical instead of a continuous variable.
Character variables are different from factor variables in that they are not just labels for categories. An example of a character variable would be a variable that stores survey respondents’ answers to an open question. Here, the actual content is important (we do care whether someone describes their stay at an Airbnb as very good
or excellent
or something else).
In the airbnb
dataset, the room_id
’s are not strictly determined beforehand, but they definitely are labels and should not be treated as numbers, so we tell R to convert them to factors. Let’s have a look at the airbnb
dataset again to check whether the type of these variables has changed after factorizing:
airbnb
## # A tibble: 17,651 x 20
## room_id survey_id host_id room_type country city borough neighborhood
## <fct> <dbl> <fct> <chr> <lgl> <chr> <chr> <chr>
## 1 5141135 1454 20676997 Shared room NA Belgium Gent Gent
## 2 13128333 1454 46098805 Shared room NA Belgium Brussel Schaarbeek
## 3 8298885 1454 30924336 Shared room NA Belgium Brussel Elsene
## 4 13822088 1454 81440431 Shared room NA Belgium Oostende Middelkerke
## 5 18324301 1454 14294054 Shared room NA Belgium Brussel Anderlecht
## 6 12664969 1454 68810179 Shared room NA Belgium Brussel Koekelberg
## 7 15452889 1454 99127166 Shared room NA Belgium Gent Gent
## 8 3911778 1454 3690027 Shared room NA Belgium Brussel Elsene
## 9 14929414 1454 30624501 Shared room NA Belgium Verviers Baelen
## 10 8497852 1454 40513093 Shared room NA Belgium Brussel Etterbeek
## # ... with 17,641 more rows, and 12 more variables: reviews <dbl>,
## # overall_satisfaction <dbl>, accommodates <dbl>, bedrooms <dbl>,
## # bathrooms <lgl>, price <dbl>, minstay <lgl>, name <chr>,
## # last_modified <chr>, latitude <dbl>, longitude <dbl>, location <chr>
We see that the type of room_id
and host_id
is now fct
(factor).
2.2.1.2 Numerical transformations
Let’s have a look at the ratings of the accommodations:
# I use the head function to make sure R only shows the first few ratings.
# Otherwise we get a really long list of ratings.
head(airbnb$overall_satisfaction)
## [1] 4.5 0.0 4.0 4.5 5.0 5.0
We see that ratings are on a scale from 0 to 5. If we prefer to have the ratings on a scale from 0 to 100, we could simply multiply the ratings by 20:
$overall_satisfaction_100 <- airbnb$overall_satisfaction * 20
airbnb# Note that we are creating a new variable overall_satisfaction_100.
# The original variable overall_satisfaction remains unchanged.
# You can also inspect the whole dataset with the Viewer
# and see that there is a new column all the way on the right.
head(airbnb$overall_satisfaction_100)
## [1] 90 0 80 90 100 100
2.2.1.3 Transforming variables with the mutate function
We can also transform variables with the mutate
function:
<- mutate(airbnb,
airbnb room_id = factor(room_id), host_id = factor(host_id),
overall_satisfaction_100 = overall_satisfaction * 20)
This tells R to take the airbnb
dataset, overwrite the variable room_id
with the factorization of room_id
, overwrite the variable host_id
with the factorization of host_id
, and create a new variable overall_satisfaction_100
that should be overall_satisfaction
times 20. The dataset with these mutations (transformations) should then be assigned to the airbnb
object. Note that we don’t need to use the $
operator here, because the mutate
function knows from its first argument (airbnb
) where to look for certain variables, and therefore we don’t need to specify it with airbnb$
later on. One advantage of using the mutate
function is that it nicely keeps all our desired transformations inside one command. Another big advantage of using mutate
will be discussed in the section on the pipe operator.
2.2.2 Including or excluding and renaming variables (columns)
If we look at the data, we can also see that country
is NA
, which means not available or missing. city
is always Belgium (which is wrong because Belgium is a country, not a city) and borough
contains the information on the city. Let’s correct these mistakes by dropping the country
variable from our dataset and by renaming city
and borough
. We’ll also drop survey_id
because this variable is constant across observations and we won’t use it in the rest of the analysis:
<- select(airbnb, -country, -survey_id)
airbnb # Tell R to drop country & survey_id from the airbnb data frame by including a minus sign before these variables.
# Re-assign this new data frame to the airbnb object.
# You'll now see that country & survey_id are gone.
airbnb
<- rename(airbnb, country = city, city = borough)
airbnb # Tell R to rename some variables from the airbnb data frame and re-assign this new data frame to the airbnb object.
# Note: the syntax is a bit counterintuitive: new variable name (country) = old variable name (city)!
# country = Belgium now and city refers to cities airbnb
2.2.3 Including or excluding observations (rows)
2.2.3.1 Creating a vector with c()
Later on, we’ll make a graph of Airbnb prices in Belgium’s ten largest cities (in terms of population): Brussels, Antwerpen, Gent, Charleroi, Liege, Brugge, Namur, Leuven, Mons, Aalst.
For this, we need to create a data object that only has data for the ten largest cities. To do this, we first need a vector with the names of the ten largest cities, so that in the next section, we can tell R to include only the data from these cities:
<- c("Brussel","Antwerpen","Gent","Charleroi","Liege","Brugge","Namur","Leuven","Mons","Aalst") # Create a vector with the top ten largest cities.
topten # Show the vector. topten
## [1] "Brussel" "Antwerpen" "Gent" "Charleroi" "Liege" "Brugge"
## [7] "Namur" "Leuven" "Mons" "Aalst"
Remember, a vector is a one-dimensional data structure (unlike a data frame which has two dimensions, i.e., columns and rows). We use the c()
operator to create a vector that we call topten
. c()
is an abbreviation of concatenate, which means putting things together. The topten
vector is a vector of strings (words). There should be quotation marks around strings. A vector of numbers, however, does not require quotation marks:
<- c(0,2,4,6)
number_vector number_vector
## [1] 0 2 4 6
Any vector that you will create will appear as an object in the Environment pane (top right window).
2.2.3.2 Including or excluding observations with the filter function
To retain only the data from the ten largest cities, we need the %in%
operator from package Hmisc:
install.packages("Hmisc")
library(Hmisc)
We can now use the filter
function to instruct R to retain the data from only the top ten largest cities:
<- filter(airbnb, city %in% topten)
airbnb.topten # Filter the airbnb data frame so that we retain only those cities in the topten vector.
# Store the filtered dataset in an object named airbnb.topten.
# So we're creating a new dataset airbnb.topten which is a subset of the airbnb dataset.
# Check the Environment pane to see that the airbnb.topten dataset has less observations than the airbnb dataset,
# because it only has the data for the top ten largest cities.