What is saving the world?

Many ways of saying the same thing:

  • 'Policy-led research'
  • 'Impact'
  • 'Socially beneficial research'
  • 'Don't be evil' (Google)

My definition: building an evidence-base for sustainable systems.

  • In the context of climate change that means:
  • Building an evidence-base to transition away from fossil fuels
  • But could also be interpreted in terms of other (quantifiable) social/economic/environmental indicators

Why climate change? I

Why climate change? II + Geography

Why climate change? III + Geocomputation!

What is Geographic Data Science?

  • You tell me!
  • How does it differ from good old 'GIS'?
  • What does the science in the title mean?
  • Why the focus on data rather than information?

Code example:

library(tibble)
# tribble() is the current name for the deprecated frame_data()
d = tribble(
  ~Attribute, ~GIS, ~GDS,
  "Home disciplines", "Geography", "Geography, Computing, Statistics",
  "Software focus", "Graphic User Interface", "Code",
  "Reproducibility", "Minimal", "Maximal"
)

Comparing GDS with GIS

| Attribute | GIS | GDS |
|---|---|---|
| Home disciplines | Geography | Geography, Computing, Statistics |
| Software focus | Graphic User Interface | Code |
| Reproducibility | Minimal | Maximal |

Geographic data science CAN 'save the world'

But only if it's open and scientific


  • Evidence inevitably gets skewed by political aims
  • If the people doing the research are influenced by dominant political forces, findings will be biased for political gain (solved by independent, well-funded public research).
  • People doing policy-relevant research, watch out (regarding politicians):

"Their very spirit undergoes a pervasive transformation," and they finally end up as "experts at exchanging smiles, handshakes, and favors." (Reclus 2013, original: 1898)

Importance of open data and methods

  • If the data underlying policy is hidden, it can be selectively presented to push certain aims (solved by open data)
  • If the data is 'open' but the tools are closed, the results are still open to political influence
  • Which brings us onto our next topic…

Example question: Where will cycling uptake happen?

How to transition to active cities? From this…

To this?

With available resources

Context: 'evidence overload'?

  • Challenge: operationalise data
  • Challenge: make locally specific

Data for walking and cycling investment

  • Travel behaviour data
  • Route network data
  • Existing infrastructure (road widths, traffic, future possibilities)
  • Road safety data
  • Air pollution data
  • Crowdsourced data

The international dimension

~200 km cycle network in Seville, Spain. Source: WHO report at [ATFutures/who](https://github.com/ATFutures/who)

  • Not a UK-specific issue, but benefits of country-specific tools

The Propensity to Cycle Tool (PCT)

Context: from concept to implementation

  • 3 years in the making
  • Origins go back further
  • "An algorithm to decide where to build next"!
  • Internationalisation of methods (World Health Organisation funded project)

The PCT in context (source: Lovelace et al. 2017)

| Tool | Scale | Coverage | Public access | Format of output | Levels of analysis | Software licence |
|---|---|---|---|---|---|---|
| Propensity to Cycle Tool | National | England | Yes | Online map | A, OD, R, RN | Open source |
| Prioritization Index | City | Montreal | No | GIS-based | P, A, R | Proprietary |
| PAT | Local | Parts of Dublin | No | GIS-based | A, OD, R | Proprietary |
| Usage intensity index | City | Belo Horizonte | No | GIS-based | A, OD, R, I | Proprietary |
| Cycling Potential Tool | City | London | No | Static | A, I | Unknown |
| Santa Monica model | City | Santa Monica | No | Static | P, OD, A | Unknown |

Policy feedback

"The PCT is a brilliant example of using Big Data to better plan infrastructure investment. It will allow us to have more confidence that new schemes are built in places and along travel corridors where there is high latent demand."

  • Shane Snow: Head of Seamless Travel Team, Sustainable and Accessible Travel Division

"The PCT shows the country’s great potential to get on their bikes, highlights the areas of highest possible growth and will be a useful innovation for local authorities to get the greatest bang for their buck from cycling investments and realise cycling potential."

  • Andrew Jones, Parliamentary Under Secretary of State for Transport


Included in Cycling and Walking Infrastructure Strategy (CWIS)

Scenario shift in desire lines

Source: Lovelace et al. (2017)

  • Origin-destination data shows 'desire lines'
  • How will these shift with cycling uptake?
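
As a rough illustration of how origin-destination (OD) data become desire lines, the sketch below draws straight lines between origins and destinations, with line width proportional to flow. The zone codes, coordinates (assumed to be projected, in metres) and trip counts are all made up for illustration; they are not PCT data.

```r
# Hypothetical OD pairs: 'all' is the total number of trips between two zones
od = data.frame(
  o = c("A", "A", "B"), d = c("B", "C", "C"),
  all = c(120, 40, 75),
  ox = c(0, 0, 3000), oy = c(0, 0, 4000),           # origin coordinates (m)
  dx = c(3000, 1000, 1000), dy = c(4000, 5000, 5000) # destination coordinates (m)
)

# Straight-line ('desire line') length of each OD pair
od$dist_m = sqrt((od$dx - od$ox)^2 + (od$dy - od$oy)^2)

# Plot zone centroids, then one segment per OD pair, wider where flow is higher
plot(c(od$ox, od$dx), c(od$oy, od$dy), asp = 1, xlab = "x (m)", ylab = "y (m)")
segments(od$ox, od$oy, od$dx, od$dy, lwd = od$all / 20)
```

A scenario of cycling uptake could then be shown by rescaling the line widths with the scenario's trip counts instead of `all`.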

Scenario shift in network load

What can the PCT do? - see www.pct.bike

The front page of the open source, open access Propensity to Cycle Tool (PCT).

The Cycling Infrastructure Prioritisation Toolkit (CyIPT)

Overview of the project

  • 12 month project funded by DfT's Innovation Challenge Fund (ICF)
  • Aim: tackle the challenge that cycling uptake is often limited by infrastructural barriers that could be remedied cost-effectively, yet investment often goes to less cost-effective interventions chosen after assessing only a few options.

  • Project team:
    • Robin Lovelace (University of Leeds)
    • Malcolm Morgan (University of Leeds)
    • John Parkin (University of the West of England)
    • Martin Lucas-Smith (Cyclestreets.net)
    • Adrian Lord (Phil Jones Associates)

Modelling cycling uptake

  • We can use 'backcasting' to estimate long-term potential under ideal conditions (PCT)
  • But transport authorities need forecasts of future uptake from specific interventions
  • There is much existing work on this
  • But none that is 'operationalisable'
  • How to operationalise available data?

Data on infrastructure-uptake at a regional level

  • Clear link between infrastructure and uptake

New datasets:

  • DfT's Transport Direct data
  • 2001 OD data (manipulated and joined with 2011 data)
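
A minimal sketch of what joining 2001 and 2011 OD data could look like, using base R's `merge()`. The zone codes, column names (`o`, `d`, `all`, `bicycle`) and counts are hypothetical; the real datasets will need more cleaning before they can be joined.

```r
# Hypothetical OD tables for two census years
od_2001 = data.frame(o = c("E1", "E1", "E2"), d = c("E2", "E3", "E3"),
                     all = c(100, 50, 80), bicycle = c(5, 1, 4))
od_2011 = data.frame(o = c("E1", "E1", "E2"), d = c("E2", "E3", "E3"),
                     all = c(110, 60, 70), bicycle = c(11, 3, 7))

# Join on the origin and destination codes, keeping both years' columns
od_joined = merge(od_2001, od_2011, by = c("o", "d"),
                  suffixes = c("_2001", "_2011"))

# Change in the proportion of trips made by bicycle between the two years
od_joined$pcycle_change =
  od_joined$bicycle_2011 / od_joined$all_2011 -
  od_joined$bicycle_2001 / od_joined$all_2001
od_joined
```

The resulting change in cycling share per OD pair is the kind of variable that could then be related to infrastructure data.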

Operationalising the data

Wider context: Open source tools

  • Online interfaces reduce barriers
  • But there are benefits of running analysis locally
  • Various software options, including:
    • QGIS mapping software
    • sDNA QGIS plugin
    • R (see upcoming course 26th - 27th April)
  • Key feature of CyIPT and PCT:
    • Open source and provides open data downloads

stplanr lives here: https://github.com/ropensci/stplanr

A few words on data carpentry

Why data carpentry?

  • If you 'hack' or 'munge' data, it won't scale
  • So ultimately it's about being able to handle Big Data
  • We'll cover the basics of data frames and tibbles
  • And the basics of dplyr, an excellent package for data carpentry
    • dplyr is also compatible with the sf package

The data frame

The humble data frame is at the heart of most analysis projects:

d = data.frame(x = 1:3, y = c("A", "B", "C"))
d
##   x y
## 1 1 A
## 2 2 B
## 3 3 C

In reality this is a list, so functions work on each column:

summary(d)
##        x       y    
##  Min.   :1.0   A:1  
##  1st Qu.:1.5   B:1  
##  Median :2.0   C:1  
##  Mean   :2.0        
##  3rd Qu.:2.5        
##  Max.   :3.0
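
The list nature of a data frame can be checked directly. This is a small sketch using the same `d` as above; the result of `sapply(d, class)` for `y` depends on the R version (factor before R 4.0, character from 4.0 onwards).

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

is.list(d)        # a data frame is a list of equal-length columns
length(d)         # so its length is the number of columns, not rows
sapply(d, class)  # functions can be applied column by column
```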


In base R, there are many ways to subset:

d[1,] # the first line
##   x y
## 1 1 A
d[,1] # the first column
## [1] 1 2 3
d$x # the first column
## [1] 1 2 3
d[1] # the first column, as a data frame
##   x
## 1 1
## 2 2
## 3 3
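
One consequence of these many options is that the type of the result depends on how you subset. A short sketch, again using `d`:

```r
d = data.frame(x = 1:3, y = c("A", "B", "C"))

is.data.frame(d[, 1])                # FALSE: a single column is dropped to a vector
is.data.frame(d[, 1, drop = FALSE])  # TRUE: drop = FALSE keeps the data frame
identical(d[["x"]], d$x)             # TRUE: two ways to extract the same vector
```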

The tibble

Recently the data frame has been extended:

library(tibble)
dt = tibble(x = 1:3, y = c("A", "B", "C"))
dt
## # A tibble: 3 x 2
##       x y    
##   <int> <chr>
## 1     1 A    
## 2     2 B    
## 3     3 C

Advantages of the tibble

It comes down to efficiency and usability

  • When printed, a tibble reports the class of each column
  • Character vectors are not coerced into factors
  • When printing a tibble to screen, only the first ten rows are displayed
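
The points above can be checked with a small comparison. This sketch assumes the tibble package is installed; `stringsAsFactors = TRUE` is set explicitly to reproduce the historic data frame default, which changed in R 4.0.

```r
library(tibble)

df_old = data.frame(y = c("A", "B", "C"), stringsAsFactors = TRUE)
tb = tibble(y = c("A", "B", "C"))

class(df_old$y)  # factor: the characters were coerced
class(tb$y)      # character: left as they were

class(df_old[, 1])  # subsetting one column of a data frame returns a vector
class(tb[, 1])      # subsetting one column of a tibble returns a tibble
```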


Like tibbles, dplyr has advantages over historic ways of doing things

  • Type stability (data frame in, data frame out)
  • Consistent functions - named functions, not [, do everything
  • Piping makes complex operations easy
library(dplyr) # dplyr must be loaded first
ghg_ems %>%
  filter(!grepl("World|Europe", Country)) %>% 
  group_by(Country) %>% 
  summarise(Mean = mean(Transportation),
            Growth = diff(range(Transportation))) %>%
  top_n(3, Growth)

Why pipes?

wb_ineq %>% 
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm  = TRUE)) %>%
  arrange(desc(gini)) %>%
  top_n(n = 5)


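# The same operation without pipes must be read from the inside out:
top_n(
  arrange(
    summarise(
      group_by(
        filter(wb_ineq, grepl("g", Country)),
        Year),
      gini = mean(gini, na.rm  = TRUE)),
    desc(gini)),
  n = 5)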
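
To confirm that piped and nested calls are equivalent, the sketch below runs both forms on a tiny made-up dataset. `wb_small` and its values are hypothetical stand-ins for `wb_ineq`, and the `top_n()` step is left out to keep the example short.

```r
library(dplyr)

# Hypothetical inequality data: two of the three country names contain "g"
wb_small = data.frame(
  Country = c("Uganda", "Uganda", "Portugal", "Portugal", "France", "France"),
  Year = c(2000, 2010, 2000, 2010, 2000, 2010),
  gini = c(43, 44, 38, NA, 31, 32)
)

# Piped version: read top to bottom
piped = wb_small %>%
  filter(grepl("g", Country)) %>%
  group_by(Year) %>%
  summarise(gini = mean(gini, na.rm = TRUE)) %>%
  arrange(desc(gini))

# Nested version: read from the innermost call outwards
nested = arrange(
  summarise(
    group_by(filter(wb_small, grepl("g", Country)), Year),
    gini = mean(gini, na.rm = TRUE)),
  desc(gini))

identical(piped, nested)
```

Both forms do the same work; the piped one simply lists the steps in the order they happen.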

Subsetting with dplyr

Only 1 way to do it, making life simpler:

select(dt, x) # select columns
## # A tibble: 3 x 1
##       x
##   <int>
## 1     1
## 2     2
## 3     3
slice(dt, 2) # 'slice' rows
## # A tibble: 1 x 2
##       x y    
##   <int> <chr>
## 1     2 B

Geocomputation for transport

Worked example: Desire lines in Bristol

  • We'll download and visualise some transport data
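library(sp) # assumed: l is an sp object, so sp is needed to plot it
u_pct = "https://github.com/npct/pct-data/raw/master/avon/l.Rds"
download.file(u_pct, "l.Rds", mode = "wb") # binary mode, needed on Windows
l = readRDS("l.Rds")
plot(l) # base map, so later plots can use add = TRUE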

Analysing where people walk

sel_walk = l$foot > 9
l_walk = l[sel_walk,]
plot(l_walk, add = TRUE, col = "red")

library(dplyr) # for next slide...

Doing it with sf (!)

  • See (Lovelace, Nowosad, and Muenchow 2018)
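l_walk1 = l %>% filter(All > 10) # fails: dplyr verbs do not work on sp objects
library(sf)
## Linking to GEOS 3.5.1, GDAL 2.2.2, proj.4 4.9.2
l_sf = st_as_sf(l) # convert to an sf object, which dplyr can handle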

Subsetting with sf

Much easier:

l_walk2 = l_sf %>% 
  filter(foot > 9)
plot(l_walk2, add = T)

Subsetting with sf

A more advanced example

l_sf$distsf = as.numeric(st_length(l_sf))
l_drive_short2 = l_sf %>% 
  filter(distsf < 1000) %>% 
  filter(car_driver > foot)

Result: where people drive short distances?

## tmap mode set to interactive viewing

Discussion: ensuring research is used for the greater good

Points of discussion

It is clear that geographical research can have large policy impacts, and:

  • That researchers can act to maximise the social benefit of the research
  • That this involves getting the evidence out to as many people as possible
  • And using open source, accessible tools - the 'science' in GDS?

But many questions remain:

  • Where to draw the line between impartial research and campaigning?
  • To what extent should researchers open-sourcing their work defend against commercial exploitation?

Further ideas + links

Modelling cycling uptake

  • Hilliness and distance are (relatively) unchanging over time
  • Model based on polynomial logit model of both:

\[ \operatorname{logit}(p_{cycle}) = \alpha + \beta_1 d + \beta_2 d^{0.5} + \beta_3 d^2 + \gamma h + \delta_1 d h + \delta_2 d^{0.5} h \]

logit_pcycle = -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance) ) + (0.008 * distance^2)
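
The logit value can be converted to a proportion with the inverse logit function, `plogis()` in base R. The sketch below uses only the distance terms and coefficients shown above (the hilliness terms in the equation are left out), and the distances are hypothetical values assumed to be in km.

```r
distance = c(1, 2, 5, 10) # hypothetical trip distances, assumed to be in km

# Distance-only part of the model, with the coefficients given above
logit_pcycle = -3.9 + (-0.59 * distance) + (1.8 * sqrt(distance)) +
  (0.008 * distance^2)

# Inverse logit: proportion of trips expected to be cycled at each distance
pcycle = plogis(logit_pcycle)
round(pcycle, 3)
```

Because the inverse logit is bounded between 0 and 1, the result can be read directly as a proportion of trips.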


