Chapter 4 Analysing onemap population data

https://www.onemap.sg/main/v2/ is the authoritative national map of Singapore, developed by the Singapore Land Authority.

It consist of datasets such as: * Land Query (land ownership and landlot number) * School query (finds nearby schools) * Traffic query (traffic cconditions, incidents/alerts/ live camera action/ERP/parking lots availability) * Property query (resale and subletting transaction information) * Population query (population size per planning area by details such as income, age, gender, household size) * Business query (location of registered businesses)

In this exercise, we draw data from the population query from onemap using its API, and attempt to analyse correlaltions between the variables (eg. income, age etc) and perform some basic chloropleths of the population numbers per planning area.

The onemap API documentation can be found here: https://docs.onemap.sg/#introduction The list of population datasets from Department of Statistics, that we can query from onemap, can be found at: https://docs.onemap.sg/#population-query

4.1 Getting the token

After creating your account at https://docs.onemap.sg, using your account email and password and the token_url, we can get the token tied to your account.

base_url <- "https://developers.onemap.sg"
token_url <- paste0(base_url, "/privateapi/auth/post/getToken")
body <- list(email = "jolene_quek@mymail.sutd.edu.sg",
             password = "Jojoquek1") # get an account at https://docs.onemap.sg
token <- POST(token_url, body = body) %>%
  content() %>%
  pluck("access_token") # pluck comes from tidyverse

4.2 Dataset 1: Economic status data

url <- paste0(base_url, "/privateapi/popapi/getEconomicStatus") #list of available options at https://docs.onemap.sg/#population-query 
query = list(
  token = token,
  planningArea = "MARINE PARADE",
  year = 2015
)
data <- GET(url, query = query, verbose()) %>% content()

4.2.1 Getting the data

url <- paste0(base_url, "/publicapi/popapi/getAllEconomicStatus")
query = list(
  token = token
)
data <- GET(url, query = query, verbose()) %>% content()
View(data)

4.2.2 Unnesting the data

This step is needed as the datais currently in the form of a nested list. We need it in a datable for analysis.

data <- data %>%
  purrr::transpose() %>% #inverts the nest
  as_tibble() %>%
  unnest()

4.2.3 Summing up men and women

The data shows 2 rows per planning area and year: one for female and one for male. We want to sum these two rows up for every planning area and year, so that we get a total number per planning area and year.

data <- data %>%
  select(-id, -gender) %>%
  group_by(planning_area, year) %>%
  summarise_all(sum) %>% #sum all the columns, not just one column
  ungroup()

4.2.4 Percentages of employment status per planning area per year

data <- data %>%
  mutate(total = employed + unemployed + inactive) %>%
  mutate(employed_perc = employed / total,
  unemployed_perc = unemployed / (employed + unemployed), 
  inactive_perc = inactive / total)

4.2.5 Distribution of unemployment percentage

data %>%
  filter(year == 2015) %>%
  ggplot(aes(unemployed_perc)) +
  geom_histogram(bins = 10)+
  theme_fivethirtyeight() + 
  labs(title="Distribution of unemployment percentage")+
  theme(plot.title = element_text(size=14, hjust=0.5))

## Warning: Removed 27 rows containing non-finite values (stat_bin).

data %>%
  arrange(desc(unemployed_perc)) %>%
  slice(1:10)

## # A tibble: 10 x 9
##    planning_area  year employed unemployed inactive  total employed_perc
##    <chr>         <int>    <int>      <int>    <int>  <int>         <dbl>
##  1 Yishun         2000    80706       6390    48710 135806         0.594
##  2 Geylang        2000    52315       4137    37402  93854         0.557
##  3 Kallang        2000    42506       3274    29764  75544         0.563
##  4 Ang Mo Kio     2000    85740       6487    56298 148525         0.577
##  5 Toa Payoh      2000    54082       3940    38013  96035         0.563
##  6 Outram         2000    10617        754     8009  19380         0.548
##  7 Queenstown     2000    44585       3151    31361  79097         0.564
##  8 Downtown Core  2000     2154        148     1542   3844         0.560
##  9 Bedok          2000   119417       8086    81307 208810         0.572
## 10 Bukit Merah    2000    70307       4743    48120 123170         0.571
## # ... with 2 more variables: unemployed_perc <dbl>, inactive_perc <dbl>

data_employment <- data

4.3 Dataset 2: Population by Age Group

4.3.1 Getting the data

url <- paste0(base_url, "/publicapi/popapi/getAllPopulationAgeGroup")
query = list(
token = token
)
data <- GET(url, query = query, verbose()) %>% content()
View(data)

4.3.2 Unnesting the data

data <- data %>%
purrr::transpose() %>%
as_tibble() %>%
unnest()

4.3.3 Summing up men and women

data <- data %>%
  select(-id) %>%
  filter(gender == "Total") %>%
  select(-gender)

4.3.4 Plotting population growth per planning area

data %>%
  ggplot(aes(year, total, color = planning_area)) +
  geom_line()+
  theme_fivethirtyeight() + 
  labs(title="Population growth per planning area")+
  theme(plot.title = element_text(size=14, hjust=0.5))

4.3.5 Percentages of age groups per planning area

data_age <- data
data_age <- data_age %>% mutate_at(vars(age_0_4:age_85_over), funs(perc=. / total * 100))

4.4 Dataset 3: Ethnic groups

4.4.1 Getting the data

url <- paste0(base_url, "/publicapi/popapi/getAllEthnicGroup")
query = list(
token = token
)
data <- GET(url, query = query, verbose()) %>% content()

4.4.2 Unnesting the data

data <- data %>%
  purrr::transpose() %>%
  as_tibble() %>%
  unnest()

4.4.3 Summing up men and women

data <- data %>% # similar to the economic data, we need to sum across genders
select(-id, -gender) %>%
group_by(planning_area, year) %>%
summarise_all(sum) %>%
ungroup()

4.4.4 Percentages of ethnic groups per planning area per year

data_ethnicity <- data %>% # normalized percentages are more insightful than absolute totals
mutate(chinese_perc = chinese / total,
malays_perc = malays / total,
indian_perc = indian / total,
others_perc = others / total)

4.5 Dataset 4: Income from work

4.5.1 Getting the data

url <- paste0(base_url, "/publicapi/popapi/getAllIncomeFromWork")
query = list(
token = token
)
data <- GET(url, query = query, verbose()) %>% content()
#View(data)

4.5.2 Unnesting the data

data <- data %>%
purrr::transpose() %>%
as_tibble() %>% 
select(-id)

data[data == "NULL"] <- 0 
data_income <- data %>% 
   unnest()

4.5.3 Percentages of income groups per planning area per year

data_income <- data_income %>% 
  mutate_at(vars(below_sgd_1000:sgd_9000_to_9999), funs(perc=. / total * 100)) %>% 
  select(-sgd_8000_over_perc) %>% 
  select(-sgd_6000_over_perc)

4.6 Dataset 5: Education Status

4.6.1 Getting the data

url <- paste0(base_url, "/publicapi/popapi/getAllEducationAttending")
query = list(
token = token
)
data <- GET(url, query = query, verbose()) %>% content()
View(data)

4.6.2 Unnesting the data

data <- data %>%
purrr::transpose() %>%
as_tibble() %>% 
unnest() %>% 
select(-id)

data_education <- data

4.6.3 Percentages of education status per planning area per year

data_education <- data_education %>% 
  group_by(planning_area,year) %>% 
  mutate(total=sum(pre_primary, primary, secondary, post_secondary, polytechnic, prof_qualification_diploma,university))
data_education <- data_education %>% mutate_at(vars(pre_primary:university), funs(perc=. / total * 100))

4.7 Joining the datasets

To join all 5 datasets:

data <- left_join(data_ethnicity, data_employment, by = c("planning_area", "year")) %>%
left_join(data_age, by = c("planning_area", "year")) %>% 
  left_join(data_income, by = c("planning_area", "year")) %>% 
  left_join(data_education, by = c("planning_area", "year"))

cor(data$malays_perc, data$inactive_perc, use = "pairwise.complete.obs")

## [1] -0.4926429

data %>% # visually
ggplot(aes(malays_perc, inactive_perc)) +
geom_point()

## Warning: Removed 27 rows containing missing values (geom_point).

4.8 Finding all possible correlations

4.8.1 Correlations across all planning areas

#install.packages("corrr")
library(corrr)
cor <- data %>% # look at the correlation matrix
select(ends_with("_perc")) %>%
correlate()

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'

cor[is.na(cor)] <- 0

cor_long <- data %>% # look at the correlation matrix
select(ends_with("_perc")) %>%
correlate()

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'

cor_long[is.na(cor_long)]<- 0

#install.packages("knitr")
library(knitr)

kable(cor_long %>% 
  rearrange() %>% 
  shave() %>% 
  stretch() %>% 
  filter(abs(r)>0.65) %>% 
  filter(substr(x,1,3)!=substr(y,1,3)) %>% 
  arrange(desc(r)))

x	y	r
university_perc	age_55_59_perc	0.8466404
sgd_1500_to_1999_perc	unemployed_perc	0.8065964
age_35_39_perc	pre_primary_perc	0.7996807
university_perc	age_60_64_perc	0.7677187
sgd_2000_to_2499_perc	unemployed_perc	0.7650020
age_25_29_perc	sgd_2000_to_2499_perc	0.7356289
age_5_9_perc	primary_perc	0.7311009
age_25_29_perc	sgd_1500_to_1999_perc	0.7197881
age_25_29_perc	sgd_2500_to_2999_perc	0.7177043
sgd_1000_to_1499_perc	unemployed_perc	0.7086490
age_0_4_perc	pre_primary_perc	0.7064846
age_35_39_perc	primary_perc	0.6993927
age_50_54_perc	university_perc	0.6944299
sgd_2500_to_2999_perc	unemployed_perc	0.6887245
age_55_59_perc	sgd_6000_to_6999_perc	0.6780642
age_30_34_perc	pre_primary_perc	0.6776413
age_25_29_perc	unemployed_perc	0.6747752
age_0_4_perc	primary_perc	0.6672446
university_perc	age_65_69_perc	0.6662950
age_40_44_perc	primary_perc	0.6608485
age_40_44_perc	polytechnic_perc	-0.6562966
sgd_7000_to_7999_perc	unemployed_perc	-0.6597550
age_0_4_perc	post_secondary_perc	-0.6623422
university_perc	sgd_2000_to_2499_perc	-0.6624473
university_perc	sgd_2500_to_2999_perc	-0.6637440
primary_perc	prof_qualification_diploma_perc	-0.6673969
sgd_6000_to_6999_perc	unemployed_perc	-0.6819359
others_perc	sgd_2000_to_2499_perc	-0.6876880
age_50_54_perc	primary_perc	-0.7011344
pre_primary_perc	age_50_54_perc	-0.7022355
age_5_9_perc	university_perc	-0.7030247
age_0_4_perc	university_perc	-0.7041291
others_perc	sgd_1500_to_1999_perc	-0.7091899
others_perc	sgd_2500_to_2999_perc	-0.7206527
others_perc	sgd_1000_to_1499_perc	-0.7388730
primary_perc	age_55_59_perc	-0.7726750
inactive_perc	employed_perc	-0.9769730

So, it seems there is a strong correlation between: * income and others_perc * education and age * income and employment status * age and income * race and income

4.8.2 Correlations within each planning area

It might be interesting to see the difference in the correlation of factors between planning areas. To do this, we have to find the correlations between the factors in each planning area.

cor_area <- data %>% 
  select((ends_with("_perc")),planning_area) %>% 
  group_by(planning_area) %>%
  nest() %>%
  mutate(cor_area = map(data, correlate)) %>%
  unnest(cor_area) 

cor_area[is.na(cor_area)]<- 0
#install.packages("tidyr")
library(tidyr)
cor_area_long <- gather(cor_area, factor, cor_value, chinese_perc:university_perc, factor_key=TRUE)
colnames(cor_area_long)=c("planning_area","x","y","cor_value")
cor_area_long <- cor_area_long %>% 
  filter(cor_value!="0") %>%
  filter(abs(cor_value)>0.65) %>% 
  filter(abs(cor_value)<1) %>% 
  filter(substr(x,1,3)!=substr(y,1,3)) %>% 
  arrange(desc(cor_value))

toDelete <- seq(1, nrow(cor_area_long), 2)
cor_area_long <- cor_area_long[-toDelete ,]

cor_area_long <- cor_area_long %>% 
  mutate(pln_area_n = toupper(planning_area))

4.9 Visualising through maps

4.9.1 Getting the base map

url <- paste0(base_url, "/publicapi/popapi/getMp14PlanningAreaGeojson")
query = list(
token = token
)
planning_areas <- GET(url, query = query, verbose()) %>%
content(as = "text") %>%
geojson_sf()

plot(planning_areas)

## Warning: plotting the first 9 out of 10 attributes; use max.plot = 10 to
## plot all

4.9.2 Joining datasets to spatial data

data <- data %>%
mutate(pln_area_n = toupper(planning_area))

planning_areas <- left_join(planning_areas, data)

## Joining, by = "pln_area_n"

planning_areas <- left_join(planning_areas, cor_area_long)

## Joining, by = c("pln_area_n", "planning_area")

planning_areas <- planning_areas %>% 
  mutate(cor_between=paste0(x,", ",y))

4.9.3 Spatial distribution of ethnic groups in 2015

#tmap_mode("view") #just this line makes your subsequent plot interactive
tmap_mode("plot") #use this to revert to static mode

## tmap mode set to plotting

indian <- planning_areas %>%
  filter(year == 2015) %>%
  tm_shape() +
  tm_fill("indian_perc")+
  tm_facets(by = "year", nrow = 2, ncol=2)

malays <- planning_areas %>%
  filter(year == 2015) %>%
  tm_shape() +
  tm_fill("malays_perc")

chinese <- planning_areas %>%
  filter(year == 2015) %>%
  tm_shape() +
  tm_fill("chinese_perc")

others <- planning_areas %>%
  filter(year == 2015) %>%
  tm_shape() +
  tm_fill("others_perc")

tmap_arrange(indian, malays, chinese, others, ncol = NA, nrow = NA, sync = FALSE, asp = 0,
  outer.margins = 0.02)

4.9.4 Map of correlation between highest earning group and others in planning areas

“sg_12000_over_perc” and “others_perc” has the highest correlation (0.937), when taking rows of all the planning areas. We are interested in how thae correlation between these two variables vary spatially, ie. in each planning area.

The resulting plot seems strange, with only pasir ris showing a negative correlation and the rest of the areas at an extremely high correlation of 0.8 to 1.0. This could be due to how there are only 3 years of data. Thus by doing correlation within each planning area, the limited data easily leads to a high correlation factor. Such an analysis could be more insightful given more years of data.

tmap_mode("plot")

## tmap mode set to plotting

planning_areas %>%
  filter(cor_between == "unemployed_perc, sgd_1500_to_1999_perc") %>% 
  tm_shape() +
  tm_fill("cor_value", breaks = c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1))

## Variable "cor_value" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.