# Chapter 4 Merge and Aggregate the data

## 4.1 Merge the data

Remember the column `maxspeed` from the `OSM_sf` table ? We are going to combine the `DT.Uber` table with the `OSM_sf` to compare the Uber `speed_kph_mean` with the OSM `maxspeed`:

first we select only from the `OSM_sf` the fields of interest :

``DT.max.speed <- OSM_sf %>% dplyr::select(osm_id,fclass,name,maxspeed)``

then we set this new table as a `data.table`, convert the column `osm_id` as numeric (because the `osm_way_id` from `DT.Uber` is already a numeric value, and finally define the index column :

``````setDT(DT.max.speed)
DT.max.speed[,osm_id:=as.numeric(osm_id)]
setkey(DT.max.speed,osm_id)``````

And now, we merge this `DT.max.speed` table with the `DT.Uber` :

``DT.max.speed <- merge(DT.max.speed,DT.Uber, by.x='osm_id',by.y='osm_way_id')``

with this new table we add a new column `is.over` that gives 1 if the mean_speed_kph is greater than the maxspeed allowed and 0 otherwise :

``DT.max.speed[,is.over := as.numeric(is.over <- speed_kph_mean>maxspeed)]``

Out of sheer curiosity, how often is the speed limit not respected ?

``DT.max.speed[,(ratio=sum(is.over)/.N)*100, ]``
``## [1] 24.93969``

OK.

Now, we check how many ways have a speed limit defined :

``DT.max.speed[,(ratio=sum(maxspeed>0)/.N)*100, ]``
``## [1] 99.91157``

How does it looks like per maxspeed value :

``````DT.max.speed[, .(count = .N), by = .(maxspeed)][, percentage := count / sum(count) *
100][order(maxspeed)]``````
``````##     maxspeed   count   percentage
##  1:        0    2112  0.088425696
##  2:       10    4350  0.182126789
##  3:       20     236  0.009880902
##  4:       30  349198 14.620301234
##  5:       40    1430  0.059871565
##  6:       50 1846718 77.318808966
##  7:       60   36972  1.547952099
##  8:       70     188  0.007871227
##  9:       80  146229  6.122349009
## 10:      100     972  0.040695917
## 11:      120      41  0.001716597``````

Fine, there are few ways without speed limit. (less than 0.09%).

## 4.2 Aggregate the data

Hang on, we are close the end, we still want to aggregate the data per way and hour of the day. For this, we are going to use the group by functionality via the `by` keyword, and create new variables : (take note that we filter the table on the ways that have an OSM maximum speed defined)

``````DT.max.speed.per.way.hour <-
DT.max.speed[maxspeed > 0][, .(
total_rows = .N,
ratio.over = sum(is.over) / (.N) * 100,
speed_kph_mean = speed_kph_mean <-
mean(speed_kph_mean),
maxspeed = maxspeed <-
mean(maxspeed),
speed_minus_max = speed_kph_mean - maxspeed,
name = first(name)
), by = .(osm_id, hour)]``````

How does this new aggregated table looks like ? (The empty square brackets are a neat `data.table` trick to print it)

``DT.max.speed.per.way.hour[]``
``````##            osm_id hour total_rows ratio.over speed_kph_mean maxspeed
##      1:   4045243    0         60   11.66667       47.80583       50
##      2:   4045243    1         60   30.00000       49.01758       50
##      3:   4045243    2         46   30.43478       49.14030       50
##      4:   4045243    3         58   48.27586       49.89152       50
##      5:   4045243    4         40   42.50000       49.50612       50
##     ---
## 126416: 699295574    2          3    0.00000       35.91000       50
## 126417: 699295574    3          3    0.00000       37.82367       50
## 126418: 699295574   16          1    0.00000       27.88300       50
## 126419: 699295574    4          1    0.00000       38.22100       50
## 126420: 699531029    8          1    0.00000       21.04500       50
##         speed_minus_max              name
##      1:      -2.1941667 Frankfurter Allee
##      2:      -0.9824167 Frankfurter Allee
##      3:      -0.8596957 Frankfurter Allee
##      4:      -0.1084828 Frankfurter Allee
##      5:      -0.4938750 Frankfurter Allee
##     ---
## 126416:     -14.0900000  An der Wuhlheide
## 126417:     -12.1763333  An der Wuhlheide
## 126418:     -22.1170000  An der Wuhlheide
## 126419:     -11.7790000  An der Wuhlheide
## 126420:     -28.9550000 Pasewalker Straße``````

For each OSM way, we have all hours of the day, the percentage of times the median speed was over the maximum speed allowed, and the delta between both.