Chapter 7 Getting data from APIs

7.1 Overview

This session is all about getting data! In this practical you will develop a crime analysis using data from the UK Police website / API, and then use the Nomis census API to link the crime data to socio-economic variables.

You should note a few things for the crime data:

  • the crime data come from https://data.police.uk/data/ in monthly chunks;
  • the data are downloaded and parsed with functions from the httr and jsonlite packages, and bespoke functions are used to extract the location and crime type;
  • you will extract data on a particular kind of crime.

You should note a few other things for the population Census API:

  • Nomis is the official population census site: https://www.nomisweb.co.uk. You should have a look if you are not familiar.
  • Census data are available for different years, and over different geographies
  • importantly census data are summarised in different ways
  • here, the nomisr package is used to access the Nomis web API
  • variables are extracted to support the creation of a Townsend measure of deprivation

The final section brings the crime and deprivation index together, and in so doing the practical suggests how data from different sources can be combined, to develop geocomputational models of socio-economic processes for example.

7.2 Packages and Data

You will need to load the following packages for this practical. Some may need installing but you are experienced at this now.

library(httr)
library(jsonlite)
library(sf)
library(tmap)
library(tidyverse) 
library(nomisr)
select = dplyr::select # ensure select refers to dplyr's version (it can be masked by other packages)

You will need the following data in your working directory from the VLE: leeds_lsoa.gpkg.

7.3 The Police API

7.3.1 Getting and mapping crime data

The first thing is to download some data from the police API. The code below downloads data for September 2023 for an area around a location 53.7997 North and -1.5492 West (do you know where this is?).

# specify the url - the web address
url = paste0("http://data.police.uk/api/crimes-street/all-crime",
                      "?lat=53.7997", 
                      "&lng=-1.5492",
                      "&date=2023-09")
# use the GET function to "get" the url response object
# (see the help for GET)
x = GET(url)
# finally extract and assign to a data table
crimes <- as_tibble(
      fromJSON(httr::content(x, as = "text", encoding = "utf8"),
        flatten = T
      )
    )

Now before we investigate what has been downloaded, have a look at the web address in url:

# the url - the web page 
url
# the function below opens it in a browser
BROWSE(url)

To understand what is going on here, have a look at how the call to the Police API is formed for street crime at https://data.police.uk/docs/method/crime-street/. In the url object above, notice the use of the ‘?’ and the ‘&’ to construct the query with lat, lng and date.
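As an aside, httr can construct the same query string for you via its query argument, which avoids pasting strings together (x2 is just an illustrative name; the result is equivalent to the GET call above):

# let httr build the ?lat=...&lng=...&date=... query string
x2 = GET("http://data.police.uk/api/crimes-street/all-crime",
         query = list(lat = 53.7997, lng = -1.5492, date = "2023-09"))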

We can examine the crimes object:

crimes
## # A tibble: 1,714 × 13
##    category    location_type context persistent_id     id location_subtype month
##    <chr>       <chr>         <chr>   <chr>          <int> <chr>            <chr>
##  1 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  2 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  3 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  4 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  5 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  6 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  7 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  8 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  9 anti-socia… Force         ""      ""            1.13e8 ""               2023…
## 10 anti-socia… Force         ""      ""            1.13e8 ""               2023…
## # ℹ 1,704 more rows
## # ℹ 6 more variables: location.latitude <chr>, location.longitude <chr>,
## #   location.street.id <int>, location.street.name <chr>,
## #   outcome_status.category <chr>, outcome_status.date <chr>

The data are in tibble format with a number of attributes:

names(crimes)
table(crimes$category)
table(crimes$outcome_status.category)

The two location attributes are of particular interest here. The functions below extract the coordinates and the attributes and render them into a flat data table format:

# 1. Get location
getLonLat <- function(x) {
  df = data.frame(lon = as.numeric(x$location.longitude),
                  lat = as.numeric(x$location.latitude))
  # return the dataframe
  return(df)
}
# 2. Get attributes
getAttr <- function(x) {
  df = data.frame(
    category = x$category,
    street_name = x$location.street.name,
    location_type = x$location_type,
    month = substr(x$month, 6, 7),  # characters 6-7 of "YYYY-MM"
    year = substr(x$month, 1, 4))   # characters 1-4 of "YYYY-MM"
  # return the data.frame
  return(df)
}

The workings of these functions can be investigated for example by assigning crimes to x and running lines of the code:

x = crimes
head(as.numeric(x$location.longitude))
head(x$category)
head(x$location.street.name)

Then they can be applied to the crimes object:

crimes.loc <- getLonLat(crimes)
crimes.attr <- getAttr(crimes)

Finally a function is defined that uses the location information and the attributes to create a spatial object in sf format:

# join together and make a spatial (sf) object
makeSpatial = function(crimes.loc, crimes.attr){
  # create a data frame
  df = data.frame(longitude = crimes.loc[,1], 
                latitude = crimes.loc[,2],
                crimes.attr)
  # convert to sf
  df_sf = st_as_sf(df, coords = c("longitude", "latitude"), 
                 crs = 4326, agr = "constant")
  # return the sf
  return(df_sf)
}
# and apply
crimes_sf = makeSpatial(crimes.loc, crimes.attr)

It is possible to see the counts of different crime types using the table function:

table(crimes_sf$category)
## 
## anti-social-behaviour         bicycle-theft              burglary 
##                    77                    57                   111 
## criminal-damage-arson                 drugs           other-crime 
##                    81                    52                    24 
##           other-theft possession-of-weapons          public-order 
##                   150                    13                   190 
##               robbery           shoplifting theft-from-the-person 
##                    47                   308                    93 
##         vehicle-crime         violent-crime 
##                    86                   425

And using the code below you can extract a crime type and plot it on a map - note the use of the alpha (transparency) parameter so that crime densities are shown:

asb.pts <- crimes_sf[crimes_sf$category=="anti-social-behaviour",]
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(asb.pts)+
  tm_dots(alpha = 0.5, shape = 1, size = 0.1)+
  tm_basemap('OpenStreetMap')
tmap_mode("plot")
## tmap mode set to plotting

7.3.2 Getting and mapping more crime data

In the above example crime data have just been obtained for a single month. It is very easy to get data for a longer period, a year for example.

This is done by putting the operations above into a loop. The code below extends the single month to a year (months 1 to 12 in the loop) simply by passing a different date value in the url given to the GET function, and appends each month's results to the output tables. All crime types are downloaded; particular types (such as bicycle-theft below) can be extracted afterwards.

Note the while loop in the middle, which tests for a server-side error (an HTTP status code of 500). Left unhandled, such an error would break the loop; the while loop traps it and repeatedly re-queries the server until the data are returned.
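If you wanted to be more defensive, the retry could be wrapped in a small helper that caps the number of attempts and pauses between them, rather than looping indefinitely. This is just a sketch (the function name, the cap of 5 tries and the 2 second pause are arbitrary choices):

# a sketch of a bounded retry helper
get_with_retry = function(url, max_tries = 5) {
  x = GET(url)
  tries = 1
  # re-query while the server returns an error, up to max_tries
  while (x$status_code == 500 && tries < max_tries) {
    Sys.sleep(2)  # pause before asking again
    x = GET(url)
    tries = tries + 1
  }
  return(x)
}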

# create empty vectors for the results
# these will convert to a data.table in the first iteration 
# of the loop and then are subsequently added to 
crimes.loc.tab = vector()
crimes.attr.tab = vector()
for (i in 1:12) {
  # create the date
  date.i <- paste0("2022-",i)
  # pass to the API
  url.i <- paste0("http://data.police.uk/api/crimes-street/all-crime",
                     "?lat=53.7997", 
                      "&lng=-1.5492",
                      "&date=", date.i)
  x.i = GET(url.i)
  while (x.i$status_code == 500) {
    x.i = GET(url.i)
  } 
  crimes.i <- as_tibble(
      fromJSON(httr::content(x.i, as = "text", encoding = "utf8"),
        flatten = T
      )
    )
  # add the result to the results
  crimes.loc.tab  <- rbind(crimes.loc.tab, getLonLat(crimes.i))
  crimes.attr.tab <- rbind(crimes.attr.tab, getAttr(crimes.i))
  # print out a little indicator of progress
  cat("downloaded month", i, "\n")
}
## downloaded month 1 
## downloaded month 2 
## downloaded month 3 
## downloaded month 4 
## downloaded month 5 
## downloaded month 6 
## downloaded month 7 
## downloaded month 8 
## downloaded month 9 
## downloaded month 10 
## downloaded month 11 
## downloaded month 12

Then you can have a look at the data, convert to an sf spatial object and map the results:

head(crimes.loc.tab)
##         lon      lat
## 1 -1.543925 53.79574
## 2 -1.550334 53.79696
## 3 -1.541691 53.79601
## 4 -1.565558 53.80603
## 5 -1.542859 53.79891
## 6 -1.544913 53.79843
crimes_sf_2022 = makeSpatial(crimes.loc.tab, crimes.attr.tab)
bike_nickers.pts <- crimes_sf_2022[crimes_sf_2022$category=="bicycle-theft",]
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(bike_nickers.pts)+
  tm_dots(alpha = 0.2, shape = 1, size = 0.1)+
  tm_basemap('OpenStreetMap')
tmap_mode("plot")
## tmap mode set to plotting

7.3.3 Getting and mapping lots of crime data

The above extension got data from the API for a longer time period and grouped the results to show patterns for the year. However this was for an area approximately a mile around a single location. It is possible to further extend this spatially by defining a bounding box or a polygon to get the data. The code below reads in some LSOA data for Leeds and then extracts some contiguous LSOAs as an area to investigate:

# read in a Leeds LSOA 
leeds = st_read("leeds_lsoa.gpkg", quiet = T)
# transform to lat / lon - WGS84
leeds = st_transform(leeds, 4326)
# set up list of LSOAs
codes =  c("E01011351", "E01011352", "E01011353", "E01011356", "E01011359")
# extract from leeds
poly.temp <- leeds %>% filter(code %in% codes)
# have a look 
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(poly.temp)+tm_borders() +tm_text("code") 
tmap_mode("plot")
## tmap mode set to plotting

We can now use these polygons as an area to extract crimes for, by slightly changing the arguments we pass to the API call. First the coordinates of the bounding box for this area are extracted:

bb = st_bbox(poly.temp)
bb
##      xmin      ymin      xmax      ymax 
## -1.542430 53.816416 -1.515997 53.835344

These are used to create a sequence of coordinates describing the box, to be passed to the API - notice how the first coordinate pair is repeated at the end to close the box:

X = round(c(bb[1], bb[1], bb[3], bb[3], bb[1]), 3)
Y = round(c(bb[2], bb[4], bb[4], bb[2], bb[2]), 3)
poly_paste <- paste(paste(Y, X, sep = ","), collapse = ":")
poly_paste
## [1] "53.816,-1.542:53.835,-1.542:53.835,-1.516:53.816,-1.516:53.816,-1.542"

Finally these can be passed to the API using the poly= parameter:

url = paste0("https://data.police.uk/api/crimes-street/all-crime?poly=",
             poly_paste,
             "&date=2023-09")
x = GET(url)
crimes <- as_tibble(
      fromJSON(httr::content(x, as = "text", encoding = "utf8"),
        flatten = T
      )
    )
crimes
## # A tibble: 207 × 13
##    category    location_type context persistent_id     id location_subtype month
##    <chr>       <chr>         <chr>   <chr>          <int> <chr>            <chr>
##  1 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  2 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  3 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  4 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  5 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  6 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  7 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  8 anti-socia… Force         ""      ""            1.13e8 ""               2023…
##  9 anti-socia… Force         ""      ""            1.13e8 ""               2023…
## 10 anti-socia… Force         ""      ""            1.13e8 ""               2023…
## # ℹ 197 more rows
## # ℹ 6 more variables: location.latitude <chr>, location.longitude <chr>,
## #   location.street.id <int>, location.street.name <chr>,
## #   outcome_status.category <chr>, outcome_status.date <chr>

Again you should have a look at the url and even use the BROWSE function to explore it as before.

Now there are limits to what can be passed to the API in terms of the complexity of the polygon definition. So the code below creates a 5km grid for the area of Leeds and passes each grid cell in turn to the API, using yet another for loop.

# transform back to OSGB 
leeds = st_transform(leeds, 27700) 
# create a grid: 1. the geometry
gr_geom = st_make_grid(leeds, cellsize = 5000)
# 2. a data frame of IDs
gr = data.frame(ID = 1:length(gr_geom))
# 3. apply the geometry to the data.frame to make an sf object
st_geometry(gr) = gr_geom

You could examine this:

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(gr)+tm_borders(col = "black")+tm_basemap('OpenStreetMap')
tmap_mode("plot")
## tmap mode set to plotting

The function below extracts the coordinates from each grid cell and formats them so that they can be passed to the API with a poly call.

get_poly_coords = function(x){
  # transform to lat lon
  x = st_transform(x, 4326)
  # extract coordinates
  coords = data.frame(st_coordinates(x)[, c("X", "Y")])
  poly_paste <- paste(paste(coords$Y, coords$X, sep = ","), collapse = ":")
  return(poly_paste)
}

To test this and to show what the above function is doing, examine this for a single grid cell:

coords = get_poly_coords(gr[1,])
coords
## [1] "53.699629746901,-1.80124066773008:53.6994803352637,-1.72550965108028:53.7444201684409,-1.72521713914431:53.7445698244954,-1.80102893467678:53.699629746901,-1.80124066773008"
# URL for leeds area
url=paste0("https://data.police.uk/api/crimes-street/all-crime?poly=", 
           coords,
           "&date=2023-09")
x = GET(url)
crimes <- as_tibble(
      fromJSON(httr::content(x, as = "text", encoding = "utf8"),
        flatten = T
      )
    )
# extract data 
crimes.loc <- getLonLat(crimes)
crimes.attr <- getAttr(crimes)
crimes_sf = makeSpatial(crimes.loc, crimes.attr)
# and map
tmap_mode("view")
tm_shape(gr[1,])+ tm_borders() +
  tm_shape(crimes_sf) + tm_dots(alpha = 0.2, shape = 1, size = 0.1)+
  tm_basemap('OpenStreetMap')
tmap_mode("plot")

Finally this can be put into a loop for all the grid cells. Again note the while loop in the middle to avoid server-side errors.

# define some results tables as before 
crimes.loc.tab = vector()
crimes.attr.tab = vector()
for(i in 1:nrow(gr)){
  coords = get_poly_coords(gr[i,])
  url.i=paste0("https://data.police.uk/api/crimes-street/all-crime?",
               "poly=", coords,
               "&date=2023-09")
  x.i = GET(url.i)
  while (x.i$status_code == 500) {
    x.i = GET(url.i)
  }  
  crimes.i <- as_tibble(
    fromJSON(httr::content(x.i, as = "text", encoding = "utf8"), 
             flatten = T)
    )
  crimes.loc.tab <- rbind(crimes.loc.tab, getLonLat(crimes.i))
  crimes.attr.tab <- rbind(crimes.attr.tab, getAttr(crimes.i))
  # print out a little indicator of progress
  cat("grid cell", i, "done \n")
}
## grid cell 1 done 
## grid cell 2 done 
## grid cell 3 done 
## grid cell 4 done 
## grid cell 5 done 
## grid cell 6 done 
## grid cell 7 done 
## grid cell 8 done 
## grid cell 9 done 
## grid cell 10 done 
## grid cell 11 done 
## grid cell 12 done 
## grid cell 13 done 
## grid cell 14 done 
## grid cell 15 done 
## grid cell 16 done 
## grid cell 17 done 
## grid cell 18 done 
## grid cell 19 done 
## grid cell 20 done 
## grid cell 21 done 
## grid cell 22 done 
## grid cell 23 done 
## grid cell 24 done 
## grid cell 25 done 
## grid cell 26 done 
## grid cell 27 done 
## grid cell 28 done 
## grid cell 29 done 
## grid cell 30 done 
## grid cell 31 done 
## grid cell 32 done 
## grid cell 33 done 
## grid cell 34 done 
## grid cell 35 done 
## grid cell 36 done 
## grid cell 37 done 
## grid cell 38 done 
## grid cell 39 done 
## grid cell 40 done 
## grid cell 41 done 
## grid cell 42 done

And as ever the results can be converted to an sf spatial object and mapped:

crimes_sf = makeSpatial(crimes.loc.tab, crimes.attr.tab)
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(gr)+ tm_borders() +
  tm_shape(crimes_sf) + tm_dots(alpha = 0.2, shape = 1, size = 0.005)+
  tm_basemap('OpenStreetMap')
tmap_mode("plot")
## tmap mode set to plotting

7.3.4 Getting and mapping even more crime data

Of course, getting crime data over a longer period of time and for a large area can be combined. The code below contains a loop within a loop - a nested loop - to get data for different grid cells over a 12 month period. Notice how i is used to index the first loop over grid cells, while the second, nested loop takes each date j directly from dates_list; the same construction could be used to assemble dates across different years. At the time of writing the Police API had data available for September 2023 (the month used above); the loop below gets the 12 months of 2022 for each grid cell.

This will of course take longer, so when this loop is running (i.e. once you see it has done 2 grid cells), take a break or run it after the practical! It may take 10-15 minutes, or longer if your internet is slow.

# define some dates 
dates_list = c(paste0("2022-", 1:12))
# define some results tables
crimes.loc.tab = vector()
crimes.attr.tab = vector()
# start the 1st loop: "for each grid cell do..."
for(i in 1:nrow(gr)){
  coords = get_poly_coords(gr[i,])
  # start the second loop: "for each date do..."
  for (j in dates_list) {
    # pass the date (j) to the API
    url.i=paste0("https://data.police.uk/api/crimes-street/all-crime?",
                 "poly=", coords,
                 "&date=", j)
    x.i = GET(url.i)
    while (x.i$status_code == 500) {
      x.i = GET(url.i)
    }  
    crimes.i <- as_tibble(
      fromJSON(httr::content(x.i, as = "text", encoding = "utf8"), 
               flatten = T)
      )
    crimes.loc.tab <- rbind(crimes.loc.tab, getLonLat(crimes.i))
    crimes.attr.tab <- rbind(crimes.attr.tab, getAttr(crimes.i))
    # end the second loop
  }
  # print out a little indicator of progress
  cat("grid cell", i, "done \n")
}
## grid cell 1 done 
## grid cell 2 done 
## grid cell 3 done 
## grid cell 4 done 
## grid cell 5 done 
## grid cell 6 done 
## grid cell 7 done 
## grid cell 8 done 
## grid cell 9 done 
## grid cell 10 done 
## grid cell 11 done 
## grid cell 12 done 
## grid cell 13 done 
## grid cell 14 done 
## grid cell 15 done 
## grid cell 16 done 
## grid cell 17 done 
## grid cell 18 done 
## grid cell 19 done 
## grid cell 20 done 
## grid cell 21 done 
## grid cell 22 done 
## grid cell 23 done 
## grid cell 24 done 
## grid cell 25 done 
## grid cell 26 done 
## grid cell 27 done 
## grid cell 28 done 
## grid cell 29 done 
## grid cell 30 done 
## grid cell 31 done 
## grid cell 32 done 
## grid cell 33 done 
## grid cell 34 done 
## grid cell 35 done 
## grid cell 36 done 
## grid cell 37 done 
## grid cell 38 done 
## grid cell 39 done 
## grid cell 40 done 
## grid cell 41 done 
## grid cell 42 done
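Because this download is so slow, it is worth saving the results locally once you have them, so they can simply be reloaded in later sessions (a sketch; the file names are arbitrary):

# save the downloaded tables to avoid re-downloading
saveRDS(crimes.loc.tab, "crimes_loc_2022.rds")
saveRDS(crimes.attr.tab, "crimes_attr_2022.rds")
# reload in a later session with, e.g.:
# crimes.loc.tab = readRDS("crimes_loc_2022.rds")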

Then you can process the data as before and have a look. Again this will take some time because there are so many data points. In fact it is probably best to view this in plot mode in tmap.

crimes_sf = makeSpatial(crimes.loc.tab, crimes.attr.tab)
# have a look
crimes_sf
tm_shape(gr)+ tm_borders() +
  tm_shape(crimes_sf) + tm_dots(alpha = 0.2, shape = 1, size = 0.005)

7.3.5 And afterwards?

So what could you do with this data? Some suggestions include:

  • separate out the different crimes (e.g. table(crimes.attr.tab$category) to view and see the construction of bike_nickers.pts above);
  • link it to other data such as census areas like OAs, LSOAs etc (see below!);
  • examine the counts of crimes in each area relative to population using a point in polygon operation (see Week 1 for GIS operations);
  • create a model of socio-economic factors (see Week 5 main practical) for inference (understanding) or prediction;
  • identify areas where particular crimes are more closely associated with local factors (e.g. using a GWR);
  • etc etc.

7.4 The Nomis API

In this section you will first extract a census dataset on age, and then use a similar approach to extract the variables for a Townsend index of deprivation.

The basic approach to extracting data from the Nomis API using nomisr is to first determine which data tables you want to work with.

The code below returns a list of all the available data tables:

x <- nomis_data_info()

If you examine it, the attribute of most interest is probably name.value:

head(x)
names(x)
sort(x$name.value)

You can see that there are lots of types of census tables. It is beyond the scope of this practical to delve into these in detail but they are summarised in the Appendix below. Perhaps most commonly used are Key statistics (starting with KS) and Quick statistics (starting with QS). Others of potential interest are Workplace populations (WP) and Short-term residents (ST).

Here we will work with Topic Summaries (TS) of the 2021 population census. The code below extracts information for these:

# get information about Topic Summaries
x = nomis_search(name = "TS*")

And again the name.value can be inspected to see what tables are available:

x$name.value

The code below looks for all of the Topic Summary tables containing the word ‘Age’:

x |> 
    filter(str_detect(name.value, 'Age')) |>
    select(id, name.value)
## # A tibble: 6 × 2
##   id        name.value                                                        
##   <chr>     <chr>                                                             
## 1 NM_2020_1 TS007A - Age by five-year age bands                               
## 2 NM_2027_1 TS007 - Age by single year                                        
## 3 NM_2038_1 TS018 - Age of arrival in the UK                                  
## 4 NM_2092_1 TS037ASP - General health - Age-standardised proportions          
## 5 NM_2093_1 TS038ASP - Disability - Age-standardised proportions              
## 6 NM_2094_1 TS039ASP - Provision of unpaid care - Age-standardised proportions

The table with the id of NM_2020_1 looks interesting. Information about a dataset can be extracted through its id attribute, together with the features or concepts that relate to it:

nomis_get_metadata(id = "NM_2020_1")
## # A tibble: 5 × 3
##   codelist               conceptref   isfrequencydimension
##   <chr>                  <chr>        <chr>               
## 1 CL_2020_1_GEOGRAPHY    GEOGRAPHY    false               
## 2 CL_2020_1_C2021_AGE_19 C2021_AGE_19 false               
## 3 CL_2020_1_MEASURES     MEASURES     false               
## 4 CL_2020_1_FREQ         FREQ         true                
## 5 CL_2020_1_TIME         TIME         false

This can be expanded to specify particular elements (concepts):

nomis_get_metadata(id = "NM_2020_1", 
                   concept = "GEOGRAPHY", 
                   type = "type") |> 
  print(n = "all")
## # A tibble: 10 × 3
##    id      label.en                                               description.en
##    <chr>   <chr>                                                  <chr>         
##  1 TYPE150 2021 output areas                                      2021 output a…
##  2 TYPE151 2021 super output areas - lower layer                  2021 super ou…
##  3 TYPE152 2021 super output areas - middle layer                 2021 super ou…
##  4 TYPE154 2022 local authorities: districts                      2022 local au…
##  5 TYPE155 2022 local authorities: counties                       2022 local au…
##  6 TYPE423 local authorities: county / unitary (as of April 2023) local authori…
##  7 TYPE424 local authorities: district / unitary (as of April 20… local authori…
##  8 TYPE459 local enterprise partnerships (as of April 2021)       local enterpr…
##  9 TYPE480 regions                                                regions       
## 10 TYPE499 countries                                              countries

This indicates the geographies in which the data are available. Here LSOAs are of interest. The code below extracts the metadata for each of the ~35,000 LSOAs, so this takes some time to run:

y = nomis_get_metadata(id = "NM_2020_1", concept = "GEOGRAPHY", type = "TYPE151")
y
## # A tibble: 35,672 × 3
##    id        label.en        description.en 
##    <chr>     <chr>           <chr>          
##  1 633351271 Hartlepool 001A Hartlepool 001A
##  2 633351284 Hartlepool 001B Hartlepool 001B
##  3 633351285 Hartlepool 001C Hartlepool 001C
##  4 633351286 Hartlepool 001D Hartlepool 001D
##  5 633371500 Hartlepool 001F Hartlepool 001F
##  6 633371502 Hartlepool 001G Hartlepool 001G
##  7 633351269 Hartlepool 002A Hartlepool 002A
##  8 633351270 Hartlepool 002B Hartlepool 002B
##  9 633351303 Hartlepool 002C Hartlepool 002C
## 10 633351304 Hartlepool 002D Hartlepool 002D
## # ℹ 35,662 more rows

The LSOA labels can be used to extract the id values of the LSOAs in the Leeds local authority district and assign these to an index:

# create an index to filter for leeds area
index <- 
    y |> 
    filter(str_detect(label.en, 'Leeds')) |> 
    select(id) |> 
    unlist() |> 
    as.vector() 
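As an aside, dplyr's pull function collapses the select, unlist and as.vector steps into one (this is equivalent to the pipe above):

# the same index, a little more compactly
index <- 
    y |> 
    filter(str_detect(label.en, 'Leeds')) |> 
    pull(id)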

You can inspect this (e.g. by entering head(index)) and it can be used to extract the data:

# get the data
d <- nomis_get_data(id = "NM_2020_1", 
                    time = c("latest"), 
                    geography = index)
# have a look
d |> data.frame() |> head() 

Finally the data, which are returned in long format, are filtered for Value measures in this case (later we will use percentages) and then reformatted to a wide format:

data <- 
    d |> 
    select(GEOGRAPHY_CODE, C2021_AGE_19_NAME, MEASURES_NAME, OBS_VALUE) |>
    filter(MEASURES_NAME == "Value") |> 
    pivot_wider(id_cols = GEOGRAPHY_CODE, 
        values_from = OBS_VALUE, 
        names_from = C2021_AGE_19_NAME) 
data
## # A tibble: 488 × 20
##    GEOGRAPHY_CODE Total `Aged 4 years and under` `Aged 5 to 9 years`
##    <chr>          <dbl>                    <dbl>               <dbl>
##  1 E01011698       1463                       63                  86
##  2 E01011699       1289                       63                  69
##  3 E01011701       1364                       68                  74
##  4 E01011702       1647                      100                  97
##  5 E01011703       1226                       25                  51
##  6 E01011697       2285                       86                  84
##  7 E01011700       1482                       57                  67
##  8 E01011704       2228                      113                  94
##  9 E01011705       1371                       69                  68
## 10 E01011581       1661                       96                  75
## # ℹ 478 more rows
## # ℹ 16 more variables: `Aged 10 to 14 years` <dbl>,
## #   `Aged 15 to 19 years` <dbl>, `Aged 20 to 24 years` <dbl>,
## #   `Aged 25 to 29 years` <dbl>, `Aged 30 to 34 years` <dbl>,
## #   `Aged 35 to 39 years` <dbl>, `Aged 40 to 44 years` <dbl>,
## #   `Aged 45 to 49 years` <dbl>, `Aged 50 to 54 years` <dbl>,
## #   `Aged 55 to 59 years` <dbl>, `Aged 60 to 64 years` <dbl>, …

NB: remember you can unpick what each line of the pipe does by running it in chunks, from d |> up to the last pipe symbol on each line.

7.4.1 Creating a Townsend Index

The Townsend deprivation index is an index of material deprivation. It is simple to calculate and is derived from 4 census variables:

  • % Households without a car
  • % Overcrowded households
  • % Households not owner-occupied
  • % Unemployed

The data are extracted to create the index which is then linked to the crime data. They require some manipulation before they are combined:

  1. a value of 1 is added to the unemployment and overcrowding percentages (+1);
  2. the unemployment and overcrowding percentages (with 1 added) are log transformed using the log function;
  3. each of the 4 variables is then standardized to z-scores (equivalent to subtracting the mean value and dividing by the standard deviation).

These four standardized scores are then summed to create the Townsend deprivation index. Positive values indicate areas of high deprivation, negative values indicate areas of relative affluence, and index scores near 0 indicate areas with overall typical values.
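To make this concrete, the sketch below applies the three steps to a single made-up vector of unemployment percentages (the values are invented purely for illustration):

# hypothetical unemployment percentages for 3 areas
unemp = c(2.1, 5.4, 9.8)
# steps 1 and 2: add 1 and log transform
unemp_t = log(unemp + 1)
# step 3: standardise to z-scores
z = (unemp_t - mean(unemp_t)) / sd(unemp_t)
z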


Some inspection of the Topic Summaries is necessary to determine which tables to use:

# car ownership
x |> 
    filter(str_detect(name.value, 'Car')) |>
    select(id, name.value)
# overcrowding
x |> 
    filter(str_detect(name.value, 'room')) |>
    select(id, name.value)
# home ownership
x |> 
    filter(str_detect(name.value, 'Tenure')) |>
    select(id, name.value)
# unemployment
x |> 
    filter(str_detect(name.value, 'Economic')) |>
    select(id, name.value)

The code used to extract the age data table above can be wrapped in a function to do all the work. This provides an alternative to recreating all the code above 4 times!

# define the function
extractCensus = function(id = "NM_2072_1", perc = TRUE, area_name = "Leeds", data_type = "TYPE151") {
    y = nomis_get_metadata(id = id, concept = "GEOGRAPHY", type = data_type)
    # create an index to filter for leeds area
    index <- 
        y |> 
        filter(str_detect(label.en, area_name)) |> 
        select(id) |> 
        unlist() |> 
        as.vector() 
    # get the data
    d <- nomis_get_data(id = id, 
                        time = c("latest"), 
                        geography = index)
    # % or values?
    if(perc) val = "Percent" 
    if(!perc) val = "Value"                   
    # variable names
    n = names(d)
    data <- 
        d |> 
        # n[14] is assumed to be the category name column
        # (C2021_AGE_19_NAME in the age table above)
        select(GEOGRAPHY_CODE, n[14], MEASURES_NAME, OBS_VALUE) |>
        filter(MEASURES_NAME == val) |> 
        pivot_wider(id_cols = GEOGRAPHY_CODE, 
            values_from = OBS_VALUE, 
            names_from = n[14]) 
    return(data)
}

This is then run for each of the id codes we identified in the x object:

carvan = extractCensus(id = "NM_2063_1", perc = TRUE, area_name = "Leeds", data_type = "TYPE151") 
occup = extractCensus(id = "NM_2071_1", perc = TRUE, area_name = "Leeds", data_type = "TYPE151") 
tenure = extractCensus(id = "NM_2072_1", perc = TRUE, area_name = "Leeds", data_type = "TYPE151") 
econact = extractCensus(id = "NM_2083_1", perc = TRUE, area_name = "Leeds", data_type = "TYPE151") 
## Retrieving additional pages 1 of 1

It is useful to inspect the results to determine the variables that are of interest:

names(carvan)
names(occup)
names(tenure)
names(econact)

Finally, these can be combined to a single data layer and linked to the leeds spatial object as in the code below. Note that we could have broken this up but the basic approach is a series of table joins, with each data table manipulated to create and select the variable before each join:

df_final <-  
    carvan |> rename(nocar = "No cars or vans in household") |> 
        select(GEOGRAPHY_CODE, nocar) |>
    left_join(
        occup |> mutate(overcrowd = `Occupancy rating of rooms: -1` + 
            `Occupancy rating of rooms: -2 or less`) |>
        select(GEOGRAPHY_CODE, overcrowd)
        ) |>
    left_join(
        tenure |> mutate(owned = Owned) |> 
        select(GEOGRAPHY_CODE, owned)
        ) |>
    left_join(
        econact |> 
          rename(unemp = "Economically active (excluding full-time students): Unemployed") |> 
        select(GEOGRAPHY_CODE, unemp)
    ) |>
    rename(code = "GEOGRAPHY_CODE")
## Joining with `by = join_by(GEOGRAPHY_CODE)`
## Joining with `by = join_by(GEOGRAPHY_CODE)`
## Joining with `by = join_by(GEOGRAPHY_CODE)`
# this includes 9 LSOAs outside of the Leeds LAD  
# these get dropped by the join operation
leeds_ti <- 
    leeds |> 
    left_join(df_final) 
## Joining with `by = join_by(code)`

The data need to be further manipulated, as described above. This could have been done in the pipe above but is done here:

leeds_ti <- 
  leeds_ti |>
  # add 1 to unemployment and overcrowding
  # transform by logging
  mutate(unemp = log(unemp+1), overcrowd = log(overcrowd+1)) |>
  # standardized to z-scores using the scale function
  mutate_at(c("nocar", "owned", "overcrowd", "unemp"),scale) |>
  # sum the 4 four standardized scores 
  mutate(ti = nocar + owned + overcrowd+unemp)

You could map this using ggplot:

ggplot(data = leeds_ti) +geom_sf(aes(fill = ti)) +
  scale_fill_binned(type = "viridis", name = "Deprivation")

Figure 7.1: The Townsend Index in the Leeds area.

7.4.2 Linking to crime data

A point in polygon operation can be undertaken to link the crime data and the deprivation index. But some care and consideration is needed:

  • are the counts the object of the analysis, for example for use in a Poisson regression?
  • or would crime rates be better? If so what is the denominator? As a proportion of all crimes? Of selected crimes? Or logged crimes per person or per 1000 people? In which case a population count needs to be added to the data. The object d still contains this and the population can be extracted from it:
d |> data.frame() |> head() 
leeds_ti <- 
  leeds_ti |>
  left_join(
    d |> 
      filter(C2021_AGE_19_NAME == "Total") |>
      filter(MEASURES_NAME == "Value") |>
      rename(code = GEOGRAPHY_CODE, pop = OBS_VALUE) |>
      select(code, pop) 
  )
## Joining with `by = join_by(code)`

Recall from Chapter 2 that a point in polygon operation can be used to determine the counts of crimes that occur inside each LSOA:

# counts
leeds_ti$crime_counts <- 
  leeds_ti |> 
  st_contains(st_transform(crimes_sf, 27700), sparse = F) |>
  rowSums() 
#hist(leeds_ti$crime_counts) 
# rates
leeds_ti <- 
  leeds_ti |>
  mutate(crime_rate = (crime_counts)/pop) 
#hist(leeds_ti$crime_rate) 
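Note that the same counts can be obtained from the sparse index that st_contains returns by default, which avoids building the full dense logical matrix (crime_counts2 is just an illustrative name; the values should match crime_counts above):

# counts via the sparse geometry binary predicate list
leeds_ti$crime_counts2 <- 
  lengths(st_contains(leeds_ti, st_transform(crimes_sf, 27700)))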

Finally some kind of model can be constructed:

summary(lm(crime_rate~ti, data = leeds_ti))
## 
## Call:
## lm(formula = crime_rate ~ ti, data = leeds_ti)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.014824 -0.005231 -0.002051  0.001250  0.208049 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0117268  0.0006415  18.280  < 2e-16 ***
## ti          0.0026982  0.0003466   7.785 4.24e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01417 on 486 degrees of freedom
## Multiple R-squared:  0.1109, Adjusted R-squared:  0.109 
## F-statistic:  60.6 on 1 and 486 DF,  p-value: 4.24e-14

But probably further work is needed!
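For example, picking up the earlier suggestion of modelling the counts directly, a Poisson regression with a population offset could be sketched as below (pois_mod is an arbitrary name; this is an illustration rather than a finished analysis):

# a count model: crime counts with log(population) as an offset
pois_mod = glm(crime_counts ~ ti + offset(log(pop)), 
               family = poisson, data = leeds_ti)
summary(pois_mod)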

7.5 Optional: Joining tables and spatial data

In this self-directed session you will develop your dplyr skills by joining data together in different ways, using the different table joins and data wrangling operations that dplyr supports.

You will need the following packages for this:

library(sf)         # for spatial data - you have this as well
library(tidyverse)  # for dplyr and ggplot graphics - ditto
library(tmap)       # for mapping
library(readxl)     # for reading Excel files
library(car)        # for some regression model manipulations
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some

The first part introduces table joins in dplyr using standard flat data tables. The second part extends these approaches to spatial data.

7.5.1 Joining data tables in dplyr

Data tables and spatial data tables can be joined through an attribute that they have in common. This section of the practical illustrates how that is done.

You will need to load some data. This is from Comber and Brunsdon (2021):

# load the data
load("ch2.RData") 

This loads 2 datasets for the Liverpool area:

  • lsoa: an sf format multipolygon object of Lower Super Output Areas (LSOAs)
  • lsoa_data: a data.frame of socio-economic data, details of which are in the Week 4 main practical

You could examine the objects that have been loaded in the usual way:

lsoa_data
lsoa
plot(st_geometry(lsoa), col = "thistle")

Both have a field called code and this can be used to link the 2 data sets using one of the dplyr join functions:

lsoa_join = inner_join(lsoa, lsoa_data)

There are a few things to note here:

  • By default dplyr joins look for similarly named fields and try to join by them, and in this case both inputs have a field called code;
  • Note also that the join function returned an object in the same format as the first argument, in this case an sf spatial layer. You could try reversing the order in the snippet above to examine this.

It is also possible to specify join fields in situations where the names are not the same. To illustrate this the code below changes the name of code in lsoa_data to ID, specifies the join and then renames the field back to code.

names(lsoa_data)[1] = "ID"
lsoa_join = inner_join(lsoa, lsoa_data, by = c("code" = "ID"))
names(lsoa_data)[1] = "code"

It is important to note that links between 2 tables can be defined in different ways. The different join types available in dplyr are summarised in the table below.

Table 7.1: The different types of join in the dplyr package.

  • inner_join() (mutating): returns all of the records from x with matching values in y, and all fields from both x and y. If there are multiple matches between x and y, all combinations of the matches are returned.
  • left_join() (mutating): returns all of the records from x, and all of the fields from x and y. Any records in x with no match in y will be given NA values in the new fields. If there are multiple matches between x and y, all combinations of the matches are returned.
  • right_join() (mutating): returns all of the records in y, and all of the fields in x and y. Records in y with no match in x are given NA values in the new fields. If there are multiple matches between x and y, all combinations of the matches are returned.
  • full_join() (mutating): returns all records and fields from both x and y. NA is allocated to any non-matching values.
  • semi_join() (filtering): returns all records from x where there are matching values in y, keeping just the fields from x. It never duplicates records in the way an inner join can.
  • anti_join() (filtering): returns all the records from x where there are non-matching values in y, keeping just fields from x.

There are 6 join types included in the dplyr package - four of what are called mutating joins and two filtering joins. Mutating joins combine variables from both the x and y inputs and filtering joins only keep records from the x input. They have a similar syntax:

result <- JOIN_TYPE(x, y)

It is instructive to see how they work with lsoa_data and lsoa, using some mismatched data. The code below samples 250 records from lsoa_data and the inner join means that only records that match are returned, and these are passed to dim to get the object dimensions:

set.seed(1984)  # for reproducibility 
lsoa_mis = sample_n(lsoa_data, 250)
# nest the join inside dim
dim(inner_join(lsoa_mis, lsoa))
## Joining with `by = join_by(code)`
## [1] 250  12

The different join types can return different results. The code snippets below illustrate this using dim():

dim(inner_join(lsoa, lsoa_mis))
dim(left_join(lsoa, lsoa_mis))
dim(right_join(lsoa, lsoa_mis))
dim(semi_join(lsoa, lsoa_mis))
dim(anti_join(lsoa, lsoa_mis))

You should compare the results when x and y are changed, especially for left and right joins. For example, you could replace dim with head. The links between 2 tables can be defined in different ways and cardinality enforced between them through the choice of join method. In a perfect world all joins would result in a seamless, single match between each table through the common attribute, with no unmatched records and no ambiguous matches. Most cases are not like this and the different join types essentially enforce rules for how unmatched records are treated. This is how dplyr enforces cardinality, or relations based on the uniqueness of the values in the data column being joined. In this context, the mutating joins combine variables from both the x and y inputs, filtering the join to the records from the input given prominence (e.g. only matching records in x and y with an inner join, all of y for a right join etc).

The help section for dplyr joins has some other worked examples that you can copy and paste into your console and then run them:

?inner_join

After this practical you should work through the vignette for dplyr joins.

vignette("two-table", package = "dplyr")

As with the dplyr single table verbs, various joins are used in all kinds of data and spatial data analysis.

Putting it all together, single table manipulations and joins can be combined (in fact, they usually are combined). The code below mixes some single table verbs and joining operations to create an unemployment rate, selects the areas within the top 25% and passes the result to qtm:

lsoa %>% 
  inner_join(lsoa_data) %>% 
  mutate(unemp_pc = (econ_active-employed)/emp_pop) %>% 
  filter(unemp_pc > quantile(unemp_pc, 0.75)) %>% 
  qtm("tomato", basemaps = 'OpenStreetMap')
## Joining with `by = join_by(code)`

Figure 7.2: The top 25 percent unemployed areas in Liverpool.

Note that the nested unpiped version of this is horrible - Urgh!

# unpiped!
# Do Not run this! 
qtm(  
  filter(
    mutate(
      inner_join(lsoa, lsoa_data),
    unemp_pc = (econ_active-employed)/emp_pop),
  unemp_pc > quantile(unemp_pc, 0.75)), "tomato", basemaps = 'OpenStreetMap')

Key Points:

  • Data tables can be joined through an attribute that they have in common;
  • This can be done in different directions, preserving the inputs in different ways (compare left_join with right_join and inner_join);
  • Spatial data tables in sf format can be joined in the same way, preserving the spatial properties.

7.5.2 Spatial Joins

All of the preceding has focused on relational joins. It is also possible to join spatial data by doing some kind of overlay operation. The idea here is that the attributes in one spatial data layer are joined to another by the spatial properties they have in common.

The easiest and most straightforward spatial overlay and join is to link points and areas, such that the point data gain the attributes of the area within which they sit.

You should clear your workspace and load the ch7.RData file from the VLE (it should be in your working directory):

# clear workspace
rm(list = ls())
# load the data
load("ch7.RData") 

You have seen this data before, but in brief, this loads 2 spatial datasets in sf format as follows:

  • oa: a multipolygon object of Output Areas (OAs) with some variables;
  • properties: a multipoint object of houses for sale in the Liverpool area with loads of binary variables;

Details of these layers are in the Week 4 main practical. The code below does a bit of processing, as in the Week 4 self-directed practical, to transform the data:

# transform the properties data to OSGB
props = st_transform(properties, 27700)
# subset the properties data for Price and other variables
props = props[, c("Price", "Beds", "Garage", "Terraced")]
# subset for the points in the oa area
props = props[oa, ]

Consider the props layer: it is a point dataset of house sales, with attributes including the house Price. We would like to attach socio-economic data to this and this can be done in 2 ways:

  1. By attaching the data from oa to each point in props;
  2. By summarising the props data over each oa and attaching the summarised information to each area in oa.

Each approach has its own assumptions and challenges, as will be illustrated through the construction of a regression model for each. Attaching polygon data to polygons is a bit more complex; details of how to do this are described in Chapter 5 of Brunsdon and Comber (2018).

First, let’s work with the props data. This can be joined using a spatial intersection, with the result being a point data layer.

props %>% st_intersection(oa) -> props_oa
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

Have a look at the result:

props_oa
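An alternative for attaching polygon attributes to points is sf's st_join function, which performs a join driven by a spatial predicate (props_oa2 is just an illustrative name; note that st_join keeps all points, giving NA attributes to any that fall outside the polygons, whereas st_intersection drops them):

# spatial join: each point gains the attributes of the OA it falls in
props_oa2 = st_join(props, oa, join = st_intersects)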

Now for the model, we need to convert the Beds variable to a number (it is a factor at the moment):

model1 <- lm(log(Price)~unmplyd+Terraced+Garage+as.numeric(Beds) + u65 + OAC, 
             data = props_oa)
summary(model1)
## 
## Call:
## lm(formula = log(Price) ~ unmplyd + Terraced + Garage + as.numeric(Beds) + 
##     u65 + OAC, data = props_oa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3242 -0.2412 -0.0124  0.2273  2.5216 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.950989   0.054693  72.239  < 2e-16 ***
## unmplyd                        -0.023716   0.001856 -12.776  < 2e-16 ***
## TerracedTRUE                   -0.166365   0.020445  -8.137 6.69e-16 ***
## GarageTRUE                      0.218003   0.029909   7.289 4.34e-13 ***
## as.numeric(Beds)                0.308782   0.008470  36.457  < 2e-16 ***
## u65                             0.011216   0.001404   7.989 2.18e-15 ***
## OACCosmopolitans                0.478108   0.040927  11.682  < 2e-16 ***
## OACEthnicity Central            0.266006   0.040922   6.500 9.88e-11 ***
## OACHard-Pressed Living         -0.076656   0.031184  -2.458   0.0140 *  
## OACMulticultural Metropolitans  0.064842   0.039138   1.657   0.0977 .  
## OACSuburbanites                 0.385204   0.041723   9.232  < 2e-16 ***
## OACUrbanites                    0.258367   0.036440   7.090 1.80e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3866 on 2199 degrees of freedom
## Multiple R-squared:  0.6396, Adjusted R-squared:  0.6378 
## F-statistic: 354.8 on 11 and 2199 DF,  p-value: < 2.2e-16

This tells us that the number of bedrooms of the property for sale drives the price, as do the OAC classes of Cosmopolitans and Suburbanites.

The second approach requires us to summarise the house data over OAs and to attach the summaries to the oa layer. This can be done using some of the dplyr options. The first summary is a point in polygon operation to return the counts of properties in each oa area, and then other summaries are made for the mean number of beds and mean price, and the proportions of properties with different characteristics.

# overlay with OAs
props %>% st_intersection(oa) %>% 
  # group by OAs
  group_by(code) %>% 
  # summarise the housing data over OAs
  summarise(house_count = n(),
            Price = mean(Price),
            Beds = mean(as.numeric(Beds)),
            TerracedTRUE = sum(Terraced==TRUE)/n(),
            GarageTRUE = sum(Garage == TRUE)/n()) %>%
  # get rid of the grouping
  ungroup() %>%
  # link back to the original OA data
  right_join(st_drop_geometry(oa)) %>%
  # re-order the variables - effectively this pushes geometry to the end  
  dplyr::select(-geometry) -> oa_props
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
## Joining with `by = join_by(code)`

You should examine the result:

oa_props

And a second model can be fitted:

model2 <- lm(log(Price)~unmplyd+TerracedTRUE+GarageTRUE+as.numeric(Beds) + u65 + OAC, 
             data = oa_props)
summary(model2)
## 
## Call:
## lm(formula = log(Price) ~ unmplyd + TerracedTRUE + GarageTRUE + 
##     as.numeric(Beds) + u65 + OAC, data = oa_props)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26773 -0.22685 -0.01131  0.20252  1.69659 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     3.832383   0.079974  47.921  < 2e-16 ***
## unmplyd                        -0.018688   0.002364  -7.907 7.44e-15 ***
## TerracedTRUE                   -0.210082   0.028174  -7.457 2.03e-13 ***
## GarageTRUE                      0.253567   0.046395   5.465 5.93e-08 ***
## as.numeric(Beds)                0.339593   0.015190  22.357  < 2e-16 ***
## u65                             0.011072   0.002043   5.420 7.58e-08 ***
## OACCosmopolitans                0.498898   0.057553   8.668  < 2e-16 ***
## OACEthnicity Central            0.261618   0.054317   4.816 1.70e-06 ***
## OACHard-Pressed Living         -0.078014   0.037921  -2.057   0.0399 *  
## OACMulticultural Metropolitans  0.113869   0.050398   2.259   0.0241 *  
## OACSuburbanites                 0.275204   0.053805   5.115 3.81e-07 ***
## OACUrbanites                    0.256134   0.048128   5.322 1.29e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3404 on 932 degrees of freedom
##   (640 observations deleted due to missingness)
## Multiple R-squared:  0.6608, Adjusted R-squared:  0.6568 
## F-statistic:   165 on 11 and 932 DF,  p-value: < 2.2e-16

The coefficient estimates are similar but with different magnitudes.

Table 7.2: A comparison of regression models constructed from Point data with OA attributes (model1) and from OA data with point data summaries (model2).
Model 1 SE 1 Pr(>|z|) 1 Model 2 SE 2 Pr(>|z|) 2
(Intercept) 3.951 0.055 0.000 3.832 0.080 0.000
unmplyd -0.024 0.002 0.000 -0.019 0.002 0.000
TerracedTRUE -0.166 0.020 0.000 -0.210 0.028 0.000
GarageTRUE 0.218 0.030 0.000 0.254 0.046 0.000
as.numeric(Beds) 0.309 0.008 0.000 0.340 0.015 0.000
u65 0.011 0.001 0.000 0.011 0.002 0.000
OACCosmopolitans 0.478 0.041 0.000 0.499 0.058 0.000
OACEthnicity Central 0.266 0.041 0.000 0.262 0.054 0.000
OACHard-Pressed Living -0.077 0.031 0.014 -0.078 0.038 0.040
OACMulticultural Metropolitans 0.065 0.039 0.098 0.114 0.050 0.024
OACSuburbanites 0.385 0.042 0.000 0.275 0.054 0.000
OACUrbanites 0.258 0.036 0.000 0.256 0.048 0.000

The coefficients can be compared using the compareCoefs function in the car package, as in the main practical, with the results shown in Table 7.2. The results are similar but with some interesting differences!

compareCoefs(model1,model2,print=T, pvals = T)

Appendix 1: Further Nomis API information

There are a number of resources to support working with the Nomis API and with the nomisr package.

The Census data are organised and summarised in different ways. Some of these are listed below:

  • KS - Key statistics
  • QS - Quick statistics
  • DC - Detailed characteristics
  • LC - Local characteristics
  • ST - Short-term residents
  • WD - Workday populations
  • CT - Census tables
  • WP - Workplace populations
  • TT - Theme table (few)
  • TS - Topic summary
  • SG - Social grade (few)
  • SH - 1961 residential (few)
  • RM - Ready made tables
  • OT - Out of term time population (few)

Appendix 2: Getting OpenStreetMap data

Introduction and Packages

OpenStreetMap (OSM) is a user-contributed mapping activity. The data now underpin all kinds of mapping applications. It has a standard set of procedures for contributors tagging objects - see https://wiki.openstreetmap.org/wiki/. Essentially this wiki shows how OSM organises its data. So, for example, restaurants are tagged as amenity - see https://wiki.openstreetmap.org/wiki/Tag:amenity%3Drestaurant.

We will use the osmdata package which allows you to send queries to OSM through your R session. You will need the following packages loaded:

# load packages into the R session
library(tmap)
library(sf)
library(osmdata)
## Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright
library(dplyr)
library(tidyverse)

You should explore the vignette for this package at some point:

vignette(topic = "osmdata", package = "osmdata")

Extracting OSM point features

The code below extracts the restaurants for the Harrogate area - it will take a few minutes.

q <- opq ("Harrogate, UK") %>%
  add_osm_feature (
    key = "amenity",
    value = "restaurant"
  ) %>% 
  osmdata_sf()

NB if this fails and returns the following error: Error: Overpass query unavailable without internet, then run the code below and then re-run the query above:

curl::has_internet()
assign("has_internet_via_proxy", TRUE, environment(curl::has_internet))

If you examine the result of the query, q, you will see that it is a list with a number of slots:

q
## Object of class 'osmdata' with:
##                  $bbox : 53.9521491,-1.5791039,54.0321491,-1.4991039
##         $overpass_call : The call submitted to the overpass API
##                  $meta : metadata including timestamp and version numbers
##            $osm_points : 'sf' Simple Features Collection with 73 points
##             $osm_lines : NULL
##          $osm_polygons : 'sf' Simple Features Collection with 6 polygons
##        $osm_multilines : NULL
##     $osm_multipolygons : NULL

Here we are interested in the points and we can extract the point layer, assigning it to an sf object:

points_sf <- q$osm_points

If you examine the data you will see that there are many empty fields:

points_sf

Next, we can examine these in more detail to refine our query:

names(points_sf)
# have a look at the amenity values
unique(points_sf$amenity)

We can filter for the restaurant amenity value (note this will reduce the point layer considerably and so you may wish not to do this!).

points_sf <- points_sf %>% filter(amenity == "restaurant" )

We can tidy the data a bit to include just the fields we are interested in:

points_sf %>% dplyr::select(name, cuisine, `addr:city`: `addr:street`) -> points_sf

Finally we can map the layer and write the data to a file for use elsewhere:

# tmap plot
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(points_sf)+tm_dots(col = "red") +tm_basemap("OpenStreetMap")
tmap_mode("plot")
## tmap mode set to plotting
# write to shapefile
st_write(points_sf, dsn = ".", layer = "OSMshops.shp", 
         driver = "ESRI Shapefile", delete_layer = T)
# write to geopackage
st_write(points_sf, "OSMshops.gpkg", delete_layer = T)

Extracting OSM polygon features

The wiki for buildings is here: https://wiki.openstreetmap.org/wiki/Key:building. The code below extracts the building data for Harrogate and undertakes the same tasks but using a slightly different syntax. Again this will take some time to run, depending on your internet speed!

# OSM building  data
q <- opq ("Harrogate, UK") %>%
  add_osm_feature ("building") %>%
  osmdata_sf()

We can then extract shop polygons using the inverse of the is.na function:

osm_shops <- q$osm_polygons[!is.na(q$osm_polygons$shop), ] 
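Equivalently, in the dplyr style used elsewhere in this practical:

# the same extraction with filter()
osm_shops <- q$osm_polygons %>% filter(!is.na(shop))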

And then tidy the data frame after examining the names of the object:

names(osm_shops)
osm_shops %>% dplyr::select(name, shop, `addr:city`: `addr:unit`) -> osm_shops

Finally we can examine the shop types and create a map - you will see that many “shops” are not present - you really do need to work with the OSM wikis!

table(osm_shops$shop)
## 
##           alcohol               art            bakery          bathroom 
##                 1                 1                 1                 1 
##           bicycle building_supplies               car         car_parts 
##                 5                 1                 1                 1 
##        car_repair           clothes       convenience              deli 
##                 3                 2                 8                 1 
##  department_store      doityourself funeral_directors         furniture 
##                 1                 1                 1                 2 
##     garden_centre          hardware         interiors           outdoor 
##                 1                 2                 1                 1 
##               pet       supermarket               yes 
##                 2                13                 1
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(osm_shops)+tm_polygons(col = "red") +tm_basemap("OpenStreetMap")
tmap_mode("plot")
## tmap mode set to plotting

As before you could write this object to a file.
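One reason so many shop types are missing is that shops are often mapped in OSM as points (nodes) rather than as building polygons. A quick check (a sketch, querying the shop key directly rather than buildings) illustrates this:

# query the shop key directly - many shops are tagged on points
q2 <- opq("Harrogate, UK") %>%
  add_osm_feature("shop") %>%
  osmdata_sf()
table(q2$osm_points$shop)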

Extracting OSM line features

The code below extracts the road data for Harrogate and undertakes the same tasks, but filters for roads specifically. The wiki for roads can be found at https://wiki.openstreetmap.org/wiki/Highways.

# OSM street data
q <- opq ("Harrogate, UK") %>%
  add_osm_feature ("highway") %>%
  osmdata_sf()

# create a line sf object
streets <- q$osm_lines[, "highway"] 

# examine types
unique(streets$highway)
##  [1] "trunk"         "residential"   "primary"       "pedestrian"   
##  [5] "footway"       "secondary"     "tertiary"      "service"      
##  [9] "tertiary_link" "unclassified"  "bridleway"     "cycleway"     
## [13] "steps"         "track"         "path"          "primary_link" 
## [17] "living_street" "construction"
# this can be filtered to drop the highway types we do not want
drop_types <- c("unclassified", "pedestrian", "footway", "path",
                "track", "steps", "cycleway", "construction",
                "service", "motorway", "motorway_link", "platform")
index <- !is.na(streets$highway) & !(streets$highway %in% drop_types)
streets <- streets[index, "highway"]

You can inspect what is returned and consider further filtering:

streets
# the highway types
unique(streets$highway)

Again, these can be mapped (this will take some time):

# plot 
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(streets)+tm_lines() +  
  tm_basemap("OpenStreetMap")
tmap_mode("plot")
## tmap mode set to plotting

OSM Wiki

The use of the OSM wiki is critical for constructing your queries. For example, suppose you wish to undertake a public health analysis of Dhaka, Bangladesh. The code below extracts health facilities as point objects, undertakes some filtering and then maps them.

q <- opq ("Dhaka, Bangladesh") %>%
  add_osm_feature ("amenity", "hospital") %>%
  osmdata_sf()
## Warning in fix_columns_list(res[kv_df]): Feature keys clash with id or metadata columns and will be renamed by appending `.n`:
##  osm_id.1
# create sf
hc <- q$osm_points 
# look at the tags
names(hc)
##  [1] "osm_id"                           "name"                            
##  [3] "addr:city"                        "addr:country"                    
##  [5] "addr:district"                    "addr:full"                       
##  [7] "addr:housename"                   "addr:housenumber"                
##  [9] "addr:postcode"                    "addr:state"                      
## [11] "addr:street"                      "addr:suburb"                     
## [13] "alt_name"                         "alt_name:en"                     
## [15] "amenity"                          "authentication:phone_call"       
## [17] "authentication:phone_call:number" "barrier"                         
## [19] "building"                         "check_date"                      
## [21] "contact:phone"                    "currency:Taka"                   
## [23] "description"                      "drive_through"                   
## [25] "email"                            "emergency"                       
## [27] "full_id"                          "health_facility:type"            
## [29] "health_facility_type"             "healthcare"                      
## [31] "healthcare:speciality"            "identifier"                      
## [33] "image"                            "internet_access"                 
## [35] "label"                            "level"                           
## [37] "licensed"                         "max_level"                       
## [39] "min_level"                        "name:bn"                         
## [41] "name:en"                          "note"                            
## [43] "opening_hours"                    "operator"                        
## [45] "operator:type"                    "osm_id.1"                        
## [47] "osm_type"                         "phone"                           
## [49] "ref"                              "source"                          
## [51] "start_date"                       "tourism"                         
## [53] "website"                          "wheelchair"                      
## [55] "wikidata"                         "wikimedia_commons"               
## [57] "wikipedia"                        "geometry"
# inspect the type
unique(hc$health_facility_type)
## [1] NA              "health_centre" "health centre" "hospital"
hc <- hc %>% filter(!is.na(amenity)) %>% 
  dplyr::select(name, amenity, health_facility_type)
# map
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(hc)+
  tm_dots(alpha = 0.2)+
  tm_basemap('OpenStreetMap')

Final comment

OSM has resulted in many unexpected developments, and the creation and provision of an open database of geographic features is amazing! Previously unmapped areas are now mapped, and the OSM initiative has forced national mapping agencies (including the Ordnance Survey) to open up their data to the public. It is routinely used as a backdrop in applications, BUT it also illustrates classic issues with VGI when compared to authoritative data: it has unknown issues of completeness.

Isochrones

Introduction

Isochrone maps show regions of equal travel time from one or more locations. They are particularly useful in urban transport analysis, allowing us to estimate travel times to and from features of interest (railway stations, health facilities, etc.). This section uses the osrm package, which provides an interface to OSRM, the Open Source Routing Machine built on OpenStreetMap data. Effectively, it sends queries off to this service and loads the results into an sf object. The code below loads some packages and creates an isochrone around a location in Berlin. This is straight from the example in the help for the main osrmIsochrone function.

library(osrm)
library(sf)
library(tmap)
iso <- osrmIsochrone(loc = c(13.43,52.47), breaks = seq(0,12,2))
iso
## Simple feature collection with 5 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 13.34397 ymin: 52.41675 xmax: 13.50034 ymax: 52.51447
## Geodetic CRS:  WGS 84
##   id isomin isomax                       geometry
## 1  1      0      4 MULTIPOLYGON (((13.43743 52...
## 2  2      4      6 MULTIPOLYGON (((13.42257 52...
## 3  3      6      8 MULTIPOLYGON (((13.43614 52...
## 4  4      8     10 MULTIPOLYGON (((13.42257 52...
## 5  5     10     12 MULTIPOLYGON (((13.42257 52...

For the mapping, any invalid isochrone polygons have to be dealt with; the option below tells tmap to check and fix them:

tmap_options(check.and.fix = TRUE)
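An alternative (a sketch using sf, rather than asking tmap to fix things on the fly) is to repair the geometries directly; the outputs below assume only the tmap option was set:

# repair any invalid polygons before mapping
iso <- st_make_valid(iso)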

tmap_mode("view")
## tmap mode set to interactive viewing
p1 = tm_shape(iso) + 
  tm_polygons("isomax", title = "Time", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
p1
## Warning: The shape iso is invalid. See sf::st_is_valid
tmap_mode("plot")
## tmap mode set to plotting

Adjusting isochrone parameters

If you look at the help for the osrmIsochrone function, you will see that there are a number of parameters that can be adjusted:

?osrmIsochrone

We can change:

  1. the mode of travel, e.g. to walking
  2. the spatial detail of the isochrones
  3. the isochrone breaks

The code below does each of these in turn:

# mode
iso_w <- osrmIsochrone(loc = c(13.43,52.47), breaks = seq(0,12,2), osrm.profile = "foot")
# detail = takes longer!
iso_d <- osrmIsochrone(loc = c(13.43,52.47), res = 50, breaks = seq(0,12,2), 
                       osrm.profile = "foot")
# breaks and distance
iso_b <- osrmIsochrone(loc = c(13.43,52.47), res = 50, breaks = seq(0,20,2),
                       osrm.profile = "foot")

And we can compare the results

tmap_mode("view")
## tmap mode set to interactive viewing
p2 = tm_shape(iso_w) + 
  tm_polygons("isomax", title = "Mode", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
p3 = tm_shape(iso_d) + 
  tm_polygons("isomax", title = "Spatial Detail", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
p4 = tm_shape(iso_b) + 
  tm_polygons("isomax", title = "Breaks\nDistance", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
tmap_arrange(p1,p2,p3,p4, ncol = 2)
## Warning: The shape iso is invalid. See sf::st_is_valid
## Warning: The shape iso_w is invalid. See sf::st_is_valid
## Warning: The shape iso_d is invalid. See sf::st_is_valid
## Warning: The shape iso_b is invalid. See sf::st_is_valid
tmap_mode("plot")
## tmap mode set to plotting

Multiple locations

So far the isochrones have all been for a single defined location: the default example in Berlin, specified as a vector of longitude and latitude.

Suppose I wanted to know the travel times to hospitals in Leeds, and to explore an accessibility policy of being within 30 minutes' walking time of a hospital. I could use the isochrone approach above to generate isochrones for each hospital, merge them, and then do an overlay with census area data (perhaps cast as population-weighted centroids).

The code below creates some hospital data from OSM and then, by way of example, creates a single isochrone by passing the sf object to the isochrone function. You will need the osmdata package loaded as before:

# 1. get the hospital locations
library(osmdata)
q <- opq ("Leeds UK") %>%
  add_osm_feature (
    key = "amenity",
    value = "hospital"
  ) %>% 
  osmdata_sf()
names(q$osm_polygons)
##  [1] "osm_id"           "name"             "addr:city"        "addr:country"    
##  [5] "addr:housename"   "addr:housenumber" "addr:postcode"    "addr:street"     
##  [9] "addr:suburb"      "amenity"          "building"         "description"     
## [13] "emergency"        "fhrs:id"          "healthcare"       "height"          
## [17] "is_in"            "is_in:town"       "landuse"          "loc_name"        
## [21] "note"             "old_name"         "operator"         "operator:type"   
## [25] "phone"            "source"           "source:addr"      "start_date"      
## [29] "website"          "wheelchair"       "wikidata"         "wikipedia"       
## [33] "geometry"
# have a look at the healthcare labels
unique(q$osm_polygons$healthcare)
## [1] "hospital" NA
q$osm_polygons %>% dplyr::filter(healthcare == "hospital") -> hosp_sf
# convert to points
hosp_sf = st_centroid(hosp_sf)
## Warning: st_centroid assumes attributes are constant over geometries
# 2. get the isochrones
iso_h <- osrmIsochrone(loc = hosp_sf, breaks = c(0,10,20,30), osrm.profile = "foot")
## Only the first row/element of "loc" is used.
# 3. Map the result
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(iso_h) + 
  tm_polygons("isomax", title = "Breaks\nDistance", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
## Warning: The shape iso_h is invalid. See sf::st_is_valid
tmap_mode("plot")
## tmap mode set to plotting

The warning message Only the first row/element of "loc" is used indicates how we can handle multiple locations: in a loop! This will take a bit of time to run, but a counter is provided - there are around 19 hospitals!

for (i in 1:nrow(hosp_sf)) {
  iso.i <- osrmIsochrone(loc = hosp_sf[i,], 
                         breaks = c(0,10,20,30), 
                         osrm.profile = "foot")
  if (i == 1) iso_result = iso.i
  if (i > 1) iso_result = rbind(iso_result, iso.i)
  cat(i, "\t")
}
## 1    2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17  18  19  
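An equivalent, slightly tidier pattern (a sketch that should produce the same iso_result) builds the isochrones in a list and binds them once, rather than growing the object inside the loop:

# build the isochrones in a list, then bind once
iso_list <- lapply(1:nrow(hosp_sf), function(i) {
  osrmIsochrone(loc = hosp_sf[i, ],
                breaks = c(0, 10, 20, 30),
                osrm.profile = "foot")
})
iso_result <- do.call(rbind, iso_list)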

If we map these then we can see some overlaps in the isochrones. It should be possible to unpick these, but at the time of writing I am not sure exactly how (one possibility is sketched after the map)!

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(iso_result) + 
  tm_polygons("isomax", title = "Breaks\nDistance", palette = "YlGnBu", alpha = 0.5) +
  tm_basemap("OpenStreetMap")
## Warning: The shape iso_result is invalid. See sf::st_is_valid
tmap_mode("plot")
## tmap mode set to plotting


  1. Crimes are mapped to a specific location on each street, rather than exactly where they were committed on that street.

  2. see https://data.police.uk/docs/method/stops-street/