2.4 Grouping & summarizing

Let’s work on the full dataset again. So far, your script should look like this:

library(tidyverse)
setwd("c:/Dropbox/work/teaching/R/data/") # Set your working directory 

airbnb <- read_csv("tomslee_airbnb_belgium_1454_2017-07-14.csv") %>% 
  mutate(room_id = factor(room_id), host_id = factor(host_id)) %>% # overwrite room_id with its factorization. Same for host_id. 
  select(-country, -survey_id) %>% # drop country & survey_id
  rename(country = city, city = borough) # rename city & borough

# We leave out the transformation of overall satisfaction
# and we leave out the filter command to make sure we do not retain only the data of the ten most populated cities

2.4.1 Frequency tables

Each observation in our dataset is a room, so we know that our data contains information on 17651 rooms. Say we want to know how many rooms there are per city:

airbnb %>% 
  group_by(city) %>% # Use the group_by function to group the airbnb data frame (provided by the pipe on the previous line) by city
  summarise(nr_per_city = n()) # Summarize this grouped object (provided by the pipe on the previous line): ask R to create a new variable nr_per_city that has the number of observations in each group (city)

## # A tibble: 43 x 2
##    city        nr_per_city
##    <chr>             <int>
##  1 Aalst                74
##  2 Antwerpen          1610
##  3 Arlon                46
##  4 Ath                  47
##  5 Bastogne            145
##  6 Brugge             1094
##  7 Brussel            6715
##  8 Charleroi           118
##  9 Dendermonde          45
## 10 Diksmuide            27
## # ... with 33 more rows

We tell R to take the airbnb object, to group it by city, and to summarise it. The summary we want is the number of observations per group. In this case the cities form the groups. The groups will always be the first column in our output. We obtain the number of observations per group with the n() function. These numbers are stored in a new column that we name nr_per_city.

As you can see, these frequencies are sorted alphabetically by city. We can sort them on the number of rooms per city instead:

airbnb %>% 
  group_by(city) %>%
  summarise(nr_per_city = n()) %>%
  arrange(nr_per_city) # Use the arrange function to sort on a column of choice

## # A tibble: 43 x 2
##    city         nr_per_city
##    <chr>              <int>
##  1 Tielt                 24
##  2 Diksmuide             27
##  3 Moeskroen             28
##  4 Roeselare             41
##  5 Eeklo                 43
##  6 Dendermonde           45
##  7 Arlon                 46
##  8 Ath                   47
##  9 Waremme               51
## 10 Sint-Niklaas          52
## # ... with 33 more rows

It shows the city with the fewest rooms on top. To display the city with the most rooms on top, sort in descending order:

airbnb %>% 
  group_by(city) %>%
  summarise(nr_per_city = n()) %>%
  arrange(desc(nr_per_city)) # Sort in descending order

## # A tibble: 43 x 2
##    city            nr_per_city
##    <chr>                 <int>
##  1 Brussel                6715
##  2 Antwerpen              1610
##  3 Gent                   1206
##  4 Brugge                 1094
##  5 Liege                   667
##  6 Verviers                631
##  7 Oostende                527
##  8 Nivelles                505
##  9 Halle-Vilvoorde         471
## 10 Leuven                  434
## # ... with 33 more rows

You’ll see that the capital Brussels has the most rooms on offer, followed by Antwerpen and Gent. Notice that this is a lot like working with PivotTable in Excel. You could have done all this in Excel, but that has several disadvantages, especially when working with large datasets like ours: you have no record of what you clicked on, how you sorted the data, and what you may have copied or deleted. In Excel, it’s easier to make accidental mistakes without noticing than in R. In R, you have your script, so you can go back and check all the steps in your analysis.

Note: you could have also done this without the pipe operator:

airbnb.grouped <- group_by(airbnb, city)
airbnb.grouped.summary <- summarize(airbnb.grouped, nr_per_city = n())
arrange(airbnb.grouped.summary, desc(nr_per_city))

## # A tibble: 43 x 2
##    city            nr_per_city
##    <chr>                 <int>
##  1 Brussel                6715
##  2 Antwerpen              1610
##  3 Gent                   1206
##  4 Brugge                 1094
##  5 Liege                   667
##  6 Verviers                631
##  7 Oostende                527
##  8 Nivelles                505
##  9 Halle-Vilvoorde         471
## 10 Leuven                  434
## # ... with 33 more rows

But hopefully you’ll agree that the code that uses the pipe operator is easier to read. Also, without the pipe operator you’ll end up creating many unnecessary objects such as airbnb.grouped and airbnb.grouped.summary.

2.4.2 Descriptive statistics

Say that, in addition to the frequencies per city, we also want the average price per city. We want this sorted in descending order by average price. Also, we now want to store the frequencies and averages in an object (in the previous section we did not store the frequency table in an object):

airbnb.summary <- airbnb %>% # Store this summary into an object called airbnb.summary.
  group_by(city) %>%
  summarise(nr_per_city = n(), average_price = mean(price)) %>% # Here we tell R to create another variable called average_price that gives us the mean of price per group (city)
  arrange(desc(average_price)) # Now sort on average_price and show the highest priced cities on top

# Check the Environment pane to see that there's now a new object called airbnb.summary.

# Instead of just running airbnb.summary, 
# I've wrapped it in a print command and set n to Inf to see all the rows.
print(airbnb.summary, n = Inf)

## # A tibble: 43 x 3
##    city              nr_per_city average_price
##    <chr>                   <int>         <dbl>
##  1 Bastogne                  145         181. 
##  2 Philippeville              85         162. 
##  3 Verviers                  631         159. 
##  4 Ieper                     143         151. 
##  5 Waremme                    51         150. 
##  6 Dinant                    286         144. 
##  7 Oudenaarde                110         142. 
##  8 Neufchâteau               160         141. 
##  9 Ath                        47         134. 
## 10 Tielt                      24         129. 
## 11 Tongeren                  173         127. 
## 12 Brugge                   1094         126. 
## 13 Huy                        99         125. 
## 14 Marche-en-Famenne         266         124. 
## 15 Veurne                    350         119. 
## 16 Eeklo                      43         115. 
## 17 Diksmuide                  27         114. 
## 18 Moeskroen                  28         113. 
## 19 Mechelen                  190         112. 
## 20 Namur                     286         111. 
## 21 Thuin                      81         107. 
## 22 Kortrijk                  107         103. 
## 23 Oostende                  527         102. 
## 24 Hasselt                   151          99.6
## 25 Maaseik                    93          98.1
## 26 Antwerpen                1610          95.7
## 27 Aalst                      74          94.9
## 28 Nivelles                  505          94.1
## 29 Gent                     1206          90.5
## 30 Sint-Niklaas               52          86.7
## 31 Virton                     56          86.5
## 32 Tournai                    97          86.4
## 33 Halle-Vilvoorde           471          85.4
## 34 Dendermonde                45          81.4
## 35 Mons                      129          79.3
## 36 Liege                     667          79.1
## 37 Turnhout                  130          78.1
## 38 Soignies                   58          77.7
## 39 Charleroi                 118          76.9
## 40 Arlon                      46          76.0
## 41 Leuven                    434          75.7
## 42 Brussel                  6715          75.1
## 43 Roeselare                  41          74.9

Perhaps surprisingly, the top three most expensive cities are Bastogne, Philippeville, and Verviers. Perhaps the average price for these cities is high because of outliers. Let’s calculate some more descriptive statistics to see whether our hunch is correct:

airbnb %>%
  group_by(city) %>%
  summarise(nr_per_city = n(), 
            average_price = mean(price),
            median_price = median(price), # calculate the median price per group (city)
            max_price = max(price)) %>% # calculate the maximum price per group (city)
  arrange(desc(median_price),
          desc(max_price)) # sort, in descending order, on median price and then on maximum price

## # A tibble: 43 x 5
##    city              nr_per_city average_price median_price max_price
##    <chr>                   <int>         <dbl>        <dbl>     <dbl>
##  1 Tielt                      24          129.          112       318
##  2 Ieper                     143          151.          111       695
##  3 Verviers                  631          159.          105      1769
##  4 Brugge                   1094          126.          105      1414
##  5 Bastogne                  145          181.          100      1650
##  6 Veurne                    350          119.          100       943
##  7 Marche-en-Famenne         266          124.          100       472
##  8 Dinant                    286          144.           95      1284
##  9 Tongeren                  173          127.           95       990
## 10 Neufchâteau               160          141.           95       872
## # ... with 33 more rows

We see that two of the three cities with the highest average price (Verviers and Bastogne) are also in the top five median price cities, so their high average price is not only due to a few extremely high priced rooms (even though the highest priced rooms in these cities are pretty expensive).