Chapter 4 Merge and Aggregate the data

4.1 Merge the data

Remember the column maxspeed from the OSM_sf table ? We are going to combine the DT.Uber table with the OSM_sf to compare the Uber speed_kph_mean with the OSM maxspeed:

first we select only from the OSM_sf the fields of interest :

DT.max.speed <- OSM_sf %>% dplyr::select(osm_id,fclass,name,maxspeed)

then we set this new table as a data.table, convert the column osm_id as numeric (because the osm_way_id from DT.Uber is already a numeric value, and finally define the index column :

setDT(DT.max.speed)
DT.max.speed[,osm_id:=as.numeric(osm_id)]
setkey(DT.max.speed,osm_id)

And now, we merge this DT.max.speed table with the DT.Uber :

DT.max.speed <- merge(DT.max.speed,DT.Uber, by.x='osm_id',by.y='osm_way_id')

with this new table we add a new column is.over that gives 1 if the mean_speed_kph is greater than the maxspeed allowed and 0 otherwise :

DT.max.speed[,is.over := as.numeric(is.over <- speed_kph_mean>maxspeed)]

Out of sheer curiosity, how often is the speed limit not respected ?

DT.max.speed[,(ratio=sum(is.over)/.N)*100, ]

## [1] 24.93969

OK.

Now, we check how many ways have a speed limit defined :

DT.max.speed[,(ratio=sum(maxspeed>0)/.N)*100, ]

## [1] 99.91157

How does it looks like per maxspeed value :

DT.max.speed[, .(count = .N), by = .(maxspeed)][, percentage := count / sum(count) *
                                                  100][order(maxspeed)]

##     maxspeed   count   percentage
##  1:        0    2112  0.088425696
##  2:       10    4350  0.182126789
##  3:       20     236  0.009880902
##  4:       30  349198 14.620301234
##  5:       40    1430  0.059871565
##  6:       50 1846718 77.318808966
##  7:       60   36972  1.547952099
##  8:       70     188  0.007871227
##  9:       80  146229  6.122349009
## 10:      100     972  0.040695917
## 11:      120      41  0.001716597

Fine, there are few ways without speed limit. (less than 0.09%).

4.2 Aggregate the data

Hang on, we are close the end, we still want to aggregate the data per way and hour of the day. For this, we are going to use the group by functionality via the by keyword, and create new variables : (take note that we filter the table on the ways that have an OSM maximum speed defined)

DT.max.speed.per.way.hour <-
  DT.max.speed[maxspeed > 0][, .(
    total_rows = .N,
    ratio.over = sum(is.over) / (.N) * 100,
    speed_kph_mean = speed_kph_mean <-
      mean(speed_kph_mean),
    maxspeed = maxspeed <-
      mean(maxspeed),
    speed_minus_max = speed_kph_mean - maxspeed,
    name = first(name)
  ), by = .(osm_id, hour)]

How does this new aggregated table looks like ? (The empty square brackets are a neat data.table trick to print it)

DT.max.speed.per.way.hour[]

##            osm_id hour total_rows ratio.over speed_kph_mean maxspeed
##      1:   4045243    0         60   11.66667       47.80583       50
##      2:   4045243    1         60   30.00000       49.01758       50
##      3:   4045243    2         46   30.43478       49.14030       50
##      4:   4045243    3         58   48.27586       49.89152       50
##      5:   4045243    4         40   42.50000       49.50612       50
##     ---                                                             
## 126416: 699295574    2          3    0.00000       35.91000       50
## 126417: 699295574    3          3    0.00000       37.82367       50
## 126418: 699295574   16          1    0.00000       27.88300       50
## 126419: 699295574    4          1    0.00000       38.22100       50
## 126420: 699531029    8          1    0.00000       21.04500       50
##         speed_minus_max              name
##      1:      -2.1941667 Frankfurter Allee
##      2:      -0.9824167 Frankfurter Allee
##      3:      -0.8596957 Frankfurter Allee
##      4:      -0.1084828 Frankfurter Allee
##      5:      -0.4938750 Frankfurter Allee
##     ---                                  
## 126416:     -14.0900000  An der Wuhlheide
## 126417:     -12.1763333  An der Wuhlheide
## 126418:     -22.1170000  An der Wuhlheide
## 126419:     -11.7790000  An der Wuhlheide
## 126420:     -28.9550000 Pasewalker Straße

For each OSM way, we have all hours of the day, the percentage of times the median speed was over the maximum speed allowed, and the delta between both.