2.4 Grouping & summarizing
Let’s work on the full dataset again. So far, your script should look like this:
library(tidyverse)
setwd("c:/Dropbox/work/teaching/R/data/") # Set your working directory
airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>%
mutate(room_id = factor(room_id), host_id = factor(host_id)) %>% # We don't create a new variable room_id_F, but instead overwrite room_id with its factorization. Same for host_id.
select(-country, -survey_id) %>% # drop country & survey_id
rename(country = city, city = borough) # rename city & borough
# We leave out the transformation of overall satisfaction
# and we leave out the filter command to make sure we do not retain only the data of the ten most populated cities
2.4.1 Frequency tables
Each observation in our dataset is a room, so we know that our data contains information on 17651 rooms. Say we want to know how many rooms there are per city:
airbnb %>%
group_by(city) %>% # Use the group_by function to group the airbnb data frame (provided by the pipe on the previous line) by city
summarise(nr_per_city = n()) # Summarize this grouped object (provided by the pipe on the previous line): ask R to create a new variable nr_per_city that has the number of observations in each group (city)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 43 x 2
## city nr_per_city
## <chr> <int>
## 1 Aalst 74
## 2 Antwerpen 1610
## 3 Arlon 46
## 4 Ath 47
## 5 Bastogne 145
## 6 Brugge 1094
## 7 Brussel 6715
## 8 Charleroi 118
## 9 Dendermonde 45
## 10 Diksmuide 27
## # ... with 33 more rows
We tell R to take the airbnb
object, to group it by city
, and to summarise
it. The summary we want is the number of observations per group. In this case the cities form the groups. The groups will always be the first column in our output. We obtain the number of observations per group with the n()
function. These numbers are stored in a new column that we name nr_per_city
.
As you can see, these frequencies are sorted alphabetically by city. We can sort them on the number of rooms per city instead:
airbnb %>%
group_by(city) %>%
summarise(nr_per_city = n()) %>%
arrange(nr_per_city) # Use the arrange function to sort on a column of choice
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 43 x 2
## city nr_per_city
## <chr> <int>
## 1 Tielt 24
## 2 Diksmuide 27
## 3 Moeskroen 28
## 4 Roeselare 41
## 5 Eeklo 43
## 6 Dendermonde 45
## 7 Arlon 46
## 8 Ath 47
## 9 Waremme 51
## 10 Sint-Niklaas 52
## # ... with 33 more rows
It shows the city with the fewest rooms on top. To display the city with the most rooms on top, sort in descending order:
airbnb %>%
group_by(city) %>%
summarise(nr_per_city = n()) %>%
arrange(desc(nr_per_city)) # Sort in descending order
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 43 x 2
## city nr_per_city
## <chr> <int>
## 1 Brussel 6715
## 2 Antwerpen 1610
## 3 Gent 1206
## 4 Brugge 1094
## 5 Liege 667
## 6 Verviers 631
## 7 Oostende 527
## 8 Nivelles 505
## 9 Halle-Vilvoorde 471
## 10 Leuven 434
## # ... with 33 more rows
You’ll see that the capital Brussels has the most rooms on offer, followed by Antwerpen and Gent. Notice that this is a lot like working with PivotTable in Excel. You could have done all this in Excel, but that has several disadvantages, especially when working with large datasets like ours: you have no record of what you clicked on, how you sorted the data, and what you may have copied or deleted. In Excel, it’s easier to make accidental mistakes without noticing than in R. In R, you have your script, so you can go back and check all the steps in your analysis.
Note: you could have also done this without the pipe operator:
airbnb.grouped <- group_by(airbnb, city)
airbnb.grouped.summary <- summarize(airbnb.grouped, nr_per_city = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 43 x 2
## city nr_per_city
## <chr> <int>
## 1 Brussel 6715
## 2 Antwerpen 1610
## 3 Gent 1206
## 4 Brugge 1094
## 5 Liege 667
## 6 Verviers 631
## 7 Oostende 527
## 8 Nivelles 505
## 9 Halle-Vilvoorde 471
## 10 Leuven 434
## # ... with 33 more rows
But hopefully you’ll agree that the code that uses the pipe operator is easier to read. Also, without the pipe operator you’ll end up creating many unnecessary objects such as airbnb.grouped
and airbnb.grouped.summary
.
2.4.2 Descriptive statistics
Say that, in addition to the frequencies per city, we also want the average price per city. We want this sorted in descending order by average price. Also, we now want to store the frequencies and averages in an object (in the previous section we did not store the frequency table in an object):
airbnb.summary <- airbnb %>% # Store this summary into an object called airbnb.summary.
group_by(city) %>%
summarise(nr_per_city = n(), average_price = mean(price)) %>% # Here we tell R to create another variable called average_price that gives us the mean of price per group (city)
arrange(desc(average_price)) # Now sort on average_price and show the highest priced cities on top
## `summarise()` ungrouping output (override with `.groups` argument)
# Check the Environment pane to see that there's now a new object called airbnb.summary.
# Instead of just running airbnb.summary,
# I've wrapped it in a print command and set n to Inf to see all the rows.
print(airbnb.summary, n = Inf)
## # A tibble: 43 x 3
## city nr_per_city average_price
## <chr> <int> <dbl>
## 1 Bastogne 145 181.
## 2 Philippeville 85 162.
## 3 Verviers 631 159.
## 4 Ieper 143 151.
## 5 Waremme 51 150.
## 6 Dinant 286 144.
## 7 Oudenaarde 110 142.
## 8 Neufchâteau 160 141.
## 9 Ath 47 134.
## 10 Tielt 24 129.
## 11 Tongeren 173 127.
## 12 Brugge 1094 126.
## 13 Huy 99 125.
## 14 Marche-en-Famenne 266 124.
## 15 Veurne 350 119.
## 16 Eeklo 43 115.
## 17 Diksmuide 27 114.
## 18 Moeskroen 28 113.
## 19 Mechelen 190 112.
## 20 Namur 286 111.
## 21 Thuin 81 107.
## 22 Kortrijk 107 103.
## 23 Oostende 527 102.
## 24 Hasselt 151 99.6
## 25 Maaseik 93 98.1
## 26 Antwerpen 1610 95.7
## 27 Aalst 74 94.9
## 28 Nivelles 505 94.1
## 29 Gent 1206 90.5
## 30 Sint-Niklaas 52 86.7
## 31 Virton 56 86.5
## 32 Tournai 97 86.4
## 33 Halle-Vilvoorde 471 85.4
## 34 Dendermonde 45 81.4
## 35 Mons 129 79.3
## 36 Liege 667 79.1
## 37 Turnhout 130 78.1
## 38 Soignies 58 77.7
## 39 Charleroi 118 76.9
## 40 Arlon 46 76.0
## 41 Leuven 434 75.7
## 42 Brussel 6715 75.1
## 43 Roeselare 41 74.9
Perhaps surprisingly, the top three most expensive cities are Bastogne, Philippeville, and Verviers. Perhaps the average price for these cities is high because of outliers. Let’s calculate some more descriptive statistics to see whether our hunch is correct:
airbnb %>%
group_by(city) %>%
summarise(nr_per_city = n(),
average_price = mean(price),
median_price = median(price), # calculate the median price per group (city)
max_price = max(price)) %>% # calculate the maximum price per group (city)
arrange(desc(median_price),
desc(max_price)) # sort, in descending order, on median price and then on maximum price
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 43 x 5
## city nr_per_city average_price median_price max_price
## <chr> <int> <dbl> <dbl> <dbl>
## 1 Tielt 24 129. 112 318
## 2 Ieper 143 151. 111 695
## 3 Verviers 631 159. 105 1769
## 4 Brugge 1094 126. 105 1414
## 5 Bastogne 145 181. 100 1650
## 6 Veurne 350 119. 100 943
## 7 Marche-en-Famenne 266 124. 100 472
## 8 Dinant 286 144. 95 1284
## 9 Tongeren 173 127. 95 990
## 10 Neufchâteau 160 141. 95 872
## # ... with 33 more rows
We see that two of the three cities with the highest average price (Verviers and Bastogne) are also in the top five median price cities, so their high average price is not only due to a few extremely high priced rooms (even though the highest priced rooms in these cities are pretty expensive).