2 An example of a data analytics project
2.1 The data science process
This walkthrough follows the data science process, as described by Hadley Wickham and Garrett Grolemund in R for Data Science.
2.2 New Housing Price Index
This set of scripts creates summary tables and graphs plotting the New Housing Price Index (NHPI) data collected and reported by Statistics Canada.
Our end point will be this chart:
This plot was created in early 2020, when the most recent data available was for the month of November 2019.
2.2.1 Setup
This chunk of R code loads the packages that we will be using
# tidyverse
library(tidyverse)
library(lubridate) # date functions
library(scales) # extending {ggplot2}
library(glue) # for gluing strings together
#
# utilities
library(cansim) # data extract
library(janitor) # for `clean_names()`
library(knitr) # for publication - includes `kable()` for tables
library(kableExtra) # - format kable tables
library(flextable) # another table formatting package
First, we have some code for two custom functions that calculate month-over-month and year-over-year statistics. These will be called later in the scripts. Functions give us a shorthand way to do the same thing multiple times.
get_mom_stats <- function(df) {
df %>%
arrange(ref_date) %>%
mutate(mom_val = lag(value),
mom_pct = ((value / lag(value, n = 1)) - 1) * 100,
mom_chg = (value - lag(value, n = 1)))
}
get_yoy_stats <- function(df) {
df %>%
arrange(ref_date) %>%
mutate(yoy_val = lag(value, n = 12),
yoy_pct = ((value / lag(value, n = 12)) - 1) * 100,
yoy_chg = (value - lag(value, n = 12)))
}
We then create a chart theme that will also be called later. This theme will be applied to all of our charts, and ensures that every one of our charts have exactly the same formatting. Like functions, this streamlines our code and helps reduce the chance of error.
For this chart theme, more than 10 lines of code gets reduced to a simple, memorable name: bida_chart_theme
.
bida_chart_theme <-
theme_bw() +
theme(
panel.border = element_rect(colour="white"),
plot.title = element_text(face="bold"),
legend.position=c(1,0),
legend.justification=c(1,0),
legend.title = element_text(size=12),
legend.text = element_text(size=11),
axis.line = element_line(colour="black"),
axis.title = element_text(size=12),
axis.text = element_text(size=12)
)
2.3 IMPORT
The first step is to import the data. For this, we will access Statistics Canada’s database CANSIM, using an R package {cansim}. The file has 18 variables and over 56,000 rows (as of February 2020) and growing every month.
data source
Table: 18-10-0205-01 (formerly CANSIM 327-0056) https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810020501
Note: “The index base period, for which the New Housing Price Index (NHPI) equals 100, is December 2016.”
This version (updated 2022-09-10) uses the cansim
package to pull the data from CANSIM, and assign it to an object named “thedata”.
table_id <- "18-10-0205-01"
thedata <- get_cansim(table_id)
2.4 TIDY
2.4.1 understanding the data
It’s a good practice before starting any data analysis to understand the structure of your data, by interrogating the variables and the values in those variables.
{cansim} (like many other data-focused packages) provides some utility functions that let us quickly understand the data frame. Here we use the function get_cansim_table_overview
to generate an overview of the table we downloaded.
# from the {cansim} package
get_cansim_table_overview(table_id)
We can also create some summaries with the names of the variables. One way to do this is with the ls.str()
function:
# list the variables in the data
ls.str(thedata)
## Classification Code for New housing price indexes : chr [1:59040] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## COORDINATE : chr [1:59040] "1.1" "1.2" "1.3" "2.1" "2.2" "2.3" "3.1" "3.2" "3.3" "4.1" "4.2" "4.3" "5.1" ...
## Date : Date[1:59040], format: "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" ...
## DECIMALS : chr [1:59040] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
## DGUID : chr [1:59040] "2016A000011124" "2016A000011124" "2016A000011124" "2016A00011" "2016A00011" ...
## GEO : chr [1:59040] "Canada" "Canada" "Canada" "Atlantic Region" "Atlantic Region" "Atlantic Region" ...
## GeoUID : chr [1:59040] "11124" "11124" "11124" "1" "1" "1" "10" "10" "10" "1" "1" "1" "11" "11" "11" ...
## Hierarchy for GEO : chr [1:59040] "1" "1" "1" "1.2" "1.2" "1.2" "1.2.3" "1.2.3" "1.2.3" "1.2.3.4" "1.2.3.4" ...
## Hierarchy for New housing price indexes : chr [1:59040] "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" ...
## New housing price indexes : Factor w/ 3 levels "Total (house and land)",..: 1 2 3 1 2 3 1 2 3 1 ...
## REF_DATE : chr [1:59040] "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" ...
## SCALAR_FACTOR : chr [1:59040] "units" "units" "units" "units" "units" "units" "units" "units" "units" "units" ...
## SCALAR_ID : chr [1:59040] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
## STATUS : chr [1:59040] NA NA "E" ".." ".." ".." ".." ".." ".." NA NA "E" ".." ".." ".." ".." ".." ".." ...
## SYMBOL : chr [1:59040] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## TERMINATED : chr [1:59040] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## UOM : chr [1:59040] "Index, 201612=100" "Index, 201612=100" "Index, 201612=100" "Index, 201612=100" ...
## UOM_ID : chr [1:59040] "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" ...
## val_norm : num [1:59040] 38.2 36.1 40.6 NA NA NA NA NA NA 36.1 ...
## VALUE : num [1:59040] 38.2 36.1 40.6 NA NA NA NA NA NA 36.1 ...
## VECTOR : chr [1:59040] "v111955442" "v111955443" "v111955444" "v111955445" "v111955446" "v111955447" ...
We can explore the individual variables as well—let’s generate a small table counting the instances of the GEO
and GeoUID
variables:
## # A tibble: 40 x 3
## # Groups: GEO [40]
## GEO GeoUID n
## <chr> <chr> <int>
## 1 Alberta 48 1476
## 2 Atlantic Region 1 1476
## 3 British Columbia 59 1476
## 4 Calgary, Alberta 825 1476
## 5 Canada 11124 1476
## 6 Charlottetown, Prince Edward Island 105 1476
## 7 Edmonton, Alberta 835 1476
## 8 Greater Sudbury, Ontario 580 1476
## 9 Guelph, Ontario 550 1476
## 10 Halifax, Nova Scotia 205 1476
## # ... with 30 more rows
The tail()
function shows the last 6 rows of a data table.
tail(thedata)
## # A tibble: 6 × 21
## REF_DATE GEO DGUID UOM UOM_ID SCALAR_FACTOR SCALAR_ID VECTOR COORDINATE VALUE STATUS SYMBOL
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2021-12 Vancouv… 2011… Inde… 347 units 0 v1119… 39.1 125. <NA> <NA>
## 2 2021-12 Vancouv… 2011… Inde… 347 units 0 v1119… 39.2 124. <NA> <NA>
## 3 2021-12 Vancouv… 2011… Inde… 347 units 0 v1119… 39.3 119. E <NA>
## 4 2021-12 Victori… 2011… Inde… 347 units 0 v1119… 40.1 120. <NA> <NA>
## 5 2021-12 Victori… 2011… Inde… 347 units 0 v1119… 40.2 127. <NA> <NA>
## 6 2021-12 Victori… 2011… Inde… 347 units 0 v1119… 40.3 109. E <NA>
## # … with 9 more variables: TERMINATED <chr>, DECIMALS <chr>, GeoUID <chr>,
## # `Hierarchy for GEO` <chr>, `Classification Code for New housing price indexes` <chr>,
## # `Hierarchy for New housing price indexes` <chr>, val_norm <dbl>, Date <date>,
## # `New housing price indexes` <fct>
Based on this interrogation, we can see that the chart is made up of the following variables:
“REF_DATE”, starting in January 2007
“GEO”, with British Columbia & Canada
“New housing price indexes”, but total (not house and land separately)
2.4.2 cleaning
Cleaning up our data table before we start working is often the next step. In this case, we use the {janitor} package to clean the variable names.
The values in the date field is also problematic—they are stored as character strings. We will clean the values by specifying them as a “date” format.
thedata <- janitor::clean_names(thedata)
thedata <- thedata %>%
mutate(ref_date = ymd(ref_date, truncated = 2))
head(thedata)
## # A tibble: 6 x 21
## ref_date geo dguid uom uom_id scalar_factor scalar_id vector coordinate value status symbol
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 1981-01-01 Canada 2016~ Inde~ 347 units 0 v1119~ 1.1 38.2 <NA> <NA>
## 2 1981-01-01 Canada 2016~ Inde~ 347 units 0 v1119~ 1.2 36.1 <NA> <NA>
## 3 1981-01-01 Canada 2016~ Inde~ 347 units 0 v1119~ 1.3 40.6 E <NA>
## 4 1981-01-01 Atlan~ 2016~ Inde~ 347 units 0 v1119~ 2.1 NA .. <NA>
## 5 1981-01-01 Atlan~ 2016~ Inde~ 347 units 0 v1119~ 2.2 NA .. <NA>
## 6 1981-01-01 Atlan~ 2016~ Inde~ 347 units 0 v1119~ 2.3 NA .. <NA>
## # ... with 9 more variables: terminated <chr>, decimals <chr>, geo_uid <chr>,
## # hierarchy_for_geo <chr>, classification_code_for_new_housing_price_indexes <chr>,
## # hierarchy_for_new_housing_price_indexes <chr>, val_norm <dbl>, date <date>,
## # new_housing_price_indexes <fct>
Important Note
The clean_names()
function changes the names of the three variables that are shown in the chart:
Variable: Old Name | New Name |
---|---|
REF_DATE | ref_date |
GEO | geo |
`New housing price indexes` | new_housing_price_indexes |
2.4.3 filtering: BC & Canada data
For our chart, we don’t need all the geographic regions—just B.C. and Canada. And for our purposes, we need the total value and want the chart to start in 2007, not 1981.
To do that, we use two functions from the {dplyr} package: filter
(to pick the rows we want, based on the values of interest) and select
(to pick the columns).
startdate <- as.Date("2007-01-01")
# filter to have BC and Canada
thedata_BC_Can <- thedata %>%
filter(ref_date >= startdate) %>%
filter(geo %in% c("British Columbia", "Canada"),
new_housing_price_indexes == "Total (house and land)") %>%
select(ref_date, geo, new_housing_price_indexes, value)
head(thedata_BC_Can)
## # A tibble: 6 x 4
## ref_date geo new_housing_price_indexes value
## <date> <chr> <fct> <dbl>
## 1 2007-01-01 Canada Total (house and land) 82.3
## 2 2007-01-01 British Columbia Total (house and land) 95.7
## 3 2007-02-01 Canada Total (house and land) 82.7
## 4 2007-02-01 British Columbia Total (house and land) 95.7
## 5 2007-03-01 Canada Total (house and land) 83
## 6 2007-03-01 British Columbia Total (house and land) 96.7
2.5 UNDERSTAND
2.5.1 Visualize - tables
One way to understand your data is to create a summary table that is more easily scanned by the human eye.
In this case, we will put the rows as each year, and the columns as the months January through December.
NHPI_table <- thedata_BC_Can %>%
filter(geo == "British Columbia") %>%
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) %>%
select(year, month, value) %>%
pivot_wider(names_from = month, values_from = value) %>%
arrange(desc(year))
# display the table
head(NHPI_table)
## # A tibble: 6 x 13
## year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 112. 116. 117. 119. 120 120. 120. 121. 122. 123. 124 124.
## 2 2020 106. 106. 106. 106. 106. 106. 107. 108. 110. 112. 112. 112.
## 3 2019 108. 107. 107. 107. 107. 107. 106. 106. 106 106. 105. 106.
## 4 2018 108. 108. 108. 108. 108. 108 108. 108. 108 108 108. 108.
## 5 2017 99.9 100 100. 102. 103. 105. 106. 107. 107. 108. 108. 108.
## 6 2016 96.3 97.1 97.4 97.7 98.8 99.2 99.7 99.7 100 100. 100 100
For the next piece, we will calculate an annual average for each year, and add that column to the table.
This is a good time to say that there are often multiple ways to get to the same result—they are not “right” or “wrong”, just different. Here are three different ways to create a column with the annual average (and there are more):
# how to add annual average
# Julie's genius solution
NHPI_table2 <- NHPI_table %>%
mutate(annual_avg = rowMeans(NHPI_table[-1], na.rm = TRUE))
head(NHPI_table2)
## # A tibble: 6 x 14
## year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec annual_avg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 112. 116. 117. 119. 120 120. 120. 121. 122. 123. 124 124. 120.
## 2 2020 106. 106. 106. 106. 106. 106. 107. 108. 110. 112. 112. 112. 108.
## 3 2019 108. 107. 107. 107. 107. 107. 106. 106. 106 106. 105. 106. 107.
## 4 2018 108. 108. 108. 108. 108. 108 108. 108. 108 108 108. 108. 108.
## 5 2017 99.9 100 100. 102. 103. 105. 106. 107. 107. 108. 108. 108. 104.
## 6 2016 96.3 97.1 97.4 97.7 98.8 99.2 99.7 99.7 100 100. 100 100 98.8
# Stephanie's genius solution
# starts with the raw data table
NHPI_table3 <- thedata_BC_Can %>%
filter(geo == "British Columbia") %>%
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) %>%
select(year, month, value) %>%
group_by(year) %>%
mutate(annual_avg = mean(value, na.rm = TRUE)) %>%
pivot_wider(names_from = month, values_from = value) %>%
arrange(desc(year))
head(NHPI_table3)
## # A tibble: 6 x 14
## # Groups: year [6]
## year annual_avg Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 120. 112. 116. 117. 119. 120 120. 120. 121. 122. 123. 124 124.
## 2 2020 108. 106. 106. 106. 106. 106. 106. 107. 108. 110. 112. 112. 112.
## 3 2019 107. 108. 107. 107. 107. 107. 107. 106. 106. 106 106. 105. 106.
## 4 2018 108. 108. 108. 108. 108. 108. 108 108. 108. 108 108 108. 108.
## 5 2017 104. 99.9 100 100. 102. 103. 105. 106. 107. 107. 108. 108. 108.
## 6 2016 98.8 96.3 97.1 97.4 97.7 98.8 99.2 99.7 99.7 100 100. 100 100
This is also a good example of how R is continuously evolving. When this course was first taught in autumn 2019, this is where the section on adding the annual average column ended.
But that ongoing evolution has given us a new {dplyr} function, rowwise()
. This function allow us to calculate the average of the specified rows as above. What is required is telling R to treat the functions rowwise()
, and then specify what we want using the same sort of functions and syntax that would be required for a column:
NHPI_table4 <- NHPI_table %>%
rowwise() %>%
mutate(annual_avg = mean(c(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec), na.rm = TRUE))
head(NHPI_table4)
## # A tibble: 6 x 14
## # Rowwise:
## year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec annual_avg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 112. 116. 117. 119. 120 120. 120. 121. 122. 123. 124 124. 120.
## 2 2020 106. 106. 106. 106. 106. 106. 107. 108. 110. 112. 112. 112. 108.
## 3 2019 108. 107. 107. 107. 107. 107. 106. 106. 106 106. 105. 106. 107.
## 4 2018 108. 108. 108. 108. 108. 108 108. 108. 108 108 108. 108. 108.
## 5 2017 99.9 100 100. 102. 103. 105. 106. 107. 107. 108. 108. 108. 104.
## 6 2016 96.3 97.1 97.4 97.7 98.8 99.2 99.7 99.7 100 100. 100 100 98.8
In the revised example below, the range of the month variables can be specified using the c_across()
function.
NHPI_table4 <- NHPI_table %>%
rowwise() %>%
# use `c_across()` to specify the range of columns
mutate(annual_avg = mean(c_across(Jan:Dec), na.rm = TRUE))
head(NHPI_table4)
## # A tibble: 6 x 14
## # Rowwise:
## year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec annual_avg
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2021 112. 116. 117. 119. 120 120. 120. 121. 122. 123. 124 124. 120.
## 2 2020 106. 106. 106. 106. 106. 106. 107. 108. 110. 112. 112. 112. 108.
## 3 2019 108. 107. 107. 107. 107. 107. 106. 106. 106 106. 105. 106. 107.
## 4 2018 108. 108. 108. 108. 108. 108 108. 108. 108 108 108. 108. 108.
## 5 2017 99.9 100 100. 102. 103. 105. 106. 107. 107. 108. 108. 108. 104.
## 6 2016 96.3 97.1 97.4 97.7 98.8 99.2 99.7 99.7 100 100. 100 100 98.8
For more information about row-wise operations with {dplyr}, see the following resources:
https://community.rstudio.com/t/dplyr-alternatives-to-rowwise/8071/45
-
https://speakerdeck.com/jennybc/row-oriented-workflows-in-r-with-the-tidyverse
2.6 COMMUNICATE - table
One way we often communicate the data is through a summary table.
The tables above have the data we are interested in, but are not appealing to the eye. Using the {flextable} package, we can apply formatting to the tables above. We can also add things like headers and footnotes.
# print table with {flextable} formatting
NHPI_table4 %>%
flextable() %>%
# add_footer(glue::glue("Source: Statistics Canada, Table {table_id}"))
add_footer_row(values = glue::glue("Source: Statistics Canada, Table {table_id}"),
colwidths = 14)
year |
Jan |
Feb |
Mar |
Apr |
May |
Jun |
Jul |
Aug |
Sep |
Oct |
Nov |
Dec |
annual_avg |
2,021 |
112.3 |
116.3 |
117.2 |
118.9 |
120.0 |
119.7 |
120.3 |
120.8 |
121.9 |
123.3 |
124.0 |
124.1 |
119.90000 |
2,020 |
105.5 |
105.9 |
106.3 |
106.3 |
106.3 |
106.4 |
107.2 |
108.1 |
110.2 |
111.5 |
111.8 |
111.9 |
108.11667 |
2,019 |
107.6 |
107.4 |
107.4 |
107.2 |
106.9 |
106.8 |
106.4 |
106.4 |
106.0 |
105.8 |
105.3 |
105.7 |
106.57500 |
2,018 |
107.8 |
107.8 |
107.9 |
107.9 |
107.9 |
108.0 |
107.9 |
107.8 |
108.0 |
108.0 |
107.8 |
107.7 |
107.87500 |
2,017 |
99.9 |
100.0 |
100.5 |
101.5 |
103.4 |
104.7 |
106.5 |
106.6 |
107.3 |
107.6 |
107.6 |
107.8 |
104.45000 |
2,016 |
96.3 |
97.1 |
97.4 |
97.7 |
98.8 |
99.2 |
99.7 |
99.7 |
100.0 |
100.3 |
100.0 |
100.0 |
98.85000 |
2,015 |
93.6 |
93.6 |
93.6 |
94.2 |
94.2 |
94.4 |
94.7 |
94.7 |
95.1 |
95.3 |
95.8 |
95.9 |
94.59167 |
2,014 |
94.3 |
94.3 |
94.2 |
93.9 |
93.7 |
93.6 |
93.5 |
93.7 |
93.7 |
94.0 |
93.8 |
93.8 |
93.87500 |
2,013 |
95.3 |
95.4 |
95.2 |
95.3 |
95.1 |
94.9 |
95.1 |
94.8 |
94.8 |
94.6 |
94.5 |
94.3 |
94.94167 |
2,012 |
96.3 |
96.2 |
96.0 |
96.0 |
96.0 |
96.1 |
96.1 |
96.0 |
95.9 |
96.1 |
95.7 |
95.4 |
95.98333 |
2,011 |
96.6 |
96.6 |
96.8 |
96.9 |
97.1 |
97.1 |
97.1 |
96.7 |
96.8 |
96.6 |
96.3 |
96.1 |
96.72500 |
2,010 |
96.5 |
97.0 |
97.5 |
97.8 |
98.0 |
97.9 |
97.2 |
97.2 |
96.8 |
96.8 |
96.6 |
96.5 |
97.15000 |
2,009 |
98.1 |
95.4 |
94.3 |
93.3 |
93.6 |
92.8 |
93.4 |
93.6 |
94.8 |
95.3 |
95.6 |
96.1 |
94.69167 |
2,008 |
101.5 |
101.6 |
102.2 |
102.2 |
101.9 |
101.9 |
101.9 |
101.8 |
101.8 |
100.6 |
99.0 |
99.0 |
101.28333 |
2,007 |
95.7 |
95.7 |
96.7 |
97.3 |
99.5 |
100.3 |
100.4 |
100.5 |
100.5 |
101.1 |
101.4 |
101.4 |
99.20833 |
Source: Statistics Canada, Table 18-10-0205-01 |
2.6.1 table: monthly summary
This code produces a monthly summary table with the most recent 13 months of B.C. data, which allows for year-over-year comparisons with the same month in the previous year.
The second part of the code then formats the table ready for publication.
# create summary table
tbl_month_BC <-
thedata_BC_Can %>%
filter(geo == "British Columbia") %>%
arrange(ref_date) %>%
# calculate percent change stats
get_mom_stats() %>%
get_yoy_stats() %>%
# pull year and month
mutate(year = year(ref_date),
month = month(ref_date, label = TRUE)) %>%
# select relevant columns, rename as necessary
select(year, month, value,
"from previous month" = mom_chg,
"from same month, previous year" = yoy_chg) %>%
arrange(desc(year), desc(month)) %>%
# just print rows 1 to 13
slice(1:13)
# print table with {kableExtra} formatting
tbl_month_BC %>%
kable(caption = "NHPI, British Columbia", digits = 1) %>%
kable_styling(bootstrap_options = "striped") %>%
row_spec(0, bold = T, font_size = 14) %>%
row_spec(1, bold = T) %>%
add_header_above(c(" " = 3, "index point change" = 2), font_size = 14)
year | month | value | from previous month | from same month, previous year |
---|---|---|---|---|
2021 | Dec | 124.1 | 0.1 | 12.2 |
2021 | Nov | 124.0 | 0.7 | 12.2 |
2021 | Oct | 123.3 | 1.4 | 11.8 |
2021 | Sep | 121.9 | 1.1 | 11.7 |
2021 | Aug | 120.8 | 0.5 | 12.7 |
2021 | Jul | 120.3 | 0.6 | 13.1 |
2021 | Jun | 119.7 | -0.3 | 13.3 |
2021 | May | 120.0 | 1.1 | 13.7 |
2021 | Apr | 118.9 | 1.7 | 12.6 |
2021 | Mar | 117.2 | 0.9 | 10.9 |
2021 | Feb | 116.3 | 4.0 | 10.4 |
2021 | Jan | 112.3 | 0.4 | 6.8 |
2020 | Dec | 111.9 | 0.1 | 6.2 |
2.6.2 text
pull B.C. values for latest month
# filter a B.C.-only data frame
thedata_BC <- thedata_BC_Can %>%
filter(geo == "British Columbia") %>%
arrange(ref_date) %>%
# calculate percent change stats
get_mom_stats() %>%
get_yoy_stats()
# determine most recent month
latest_ref_date <- max(thedata_BC$ref_date)
this_month <- month(latest_ref_date, label = TRUE)
this_year <- year(latest_ref_date)
this_month_nhpi <- thedata_BC %>%
filter(ref_date == latest_ref_date) %>%
pull(value)
this_month_mom <- thedata_BC %>%
filter(ref_date == latest_ref_date) %>%
pull(mom_chg)
this_month_yoy <- thedata_BC %>%
filter(ref_date == latest_ref_date) %>%
pull(yoy_chg)
Our text:
In Dec 2021, the New Housing Price Index for British Columbia was 124.1. This was a change of 0.1 compared to the previous month, and 12.2 year-over-year.
2.7 UNDERSTAND
2.7.1 Visualize - plot
Making a plot of your data is often a good way to understand what the data can tell us, and helps us think about the story in the data. In this case, we will make a basic plot with just a few lines of code.
We can make it a little more presentable with some additional formatting
#
# with formatting applied
dataplot <- ggplot(thedata_BC_Can, aes(x=ref_date, y=value, colour=geo)) +
geom_line(size=1.5)
dataplot
And with even more formatting it’s starting to look like something we could publish.
dataplot2 <- dataplot +
scale_x_date(date_breaks = "2 years", labels = year) +
scale_y_continuous(labels = comma, limits = c(80, 125)) +
scale_colour_manual(name=NULL,
breaks=c("Canada", "British Columbia"),
labels=c("Canada", "British Columbia"),
values=c("#325A80", "#CCB550")) +
bida_chart_theme
Just a reminder: all of the code below is what’s in the bida_chart_theme
–the last line in the code chunk above.
If we were to make more than one copy of the plot, we would have to repeat this every time. And if we made more than one copy of the plot, we would have to make the same change in every place.
By assigning all of this code to the object bida_chart_theme
, we can make a change in one place, and it will carry through all of the times we need it.
# theme_bw() +
# theme(
# panel.border = element_rect(colour="white"),
# plot.title = element_text(face="bold"),
# legend.position=c(1,0),
# legend.justification=c(1,0),
# legend.title = element_text(size=12),
# legend.text = element_text(size=11),
# axis.line = element_line(colour="black"),
# axis.title = element_text(size=12),
# axis.text = element_text(size=12)
# )
#
dataplot2
2.8 COMMUNICATE - plot
Let’s finish our formatting…
# experiments with ggplot2's new subtitle and caption options
NHPI_title <- as.character("New Housing Price Index, Canada & B.C.")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")
# add titles / X-Y axis labels
NHPI_plot <- dataplot2 +
labs(title = NHPI_title,
subtitle = NHPI_subtitle,
caption = NHPI_caption,
x = NULL, y = "NHPI (Dec. 2016 = 100)")
NHPI_plot
ggsave(filename = "NHPI_plot.png", plot = NHPI_plot,
width = 8, height = 6)
2.9 REPEAT WITH DIFFERENT VALUES
2.9.1 Vancouver house and land version
Now we’ve set up the B.C. comparison to Canada, we can quickly adapt our previous work to look at another region–in this case, a deeper dive into Vancouver, comparing the changes in the “land” and “house” components of the NHPI.
startdate <- as.Date("2007-01-01")
# filter the original data to have Vancouver
thedata_YVR <- thedata %>%
filter(ref_date >= startdate) %>%
filter(geo %in% c("Vancouver, British Columbia"),
new_housing_price_indexes %in% c("House only", "Land only")) %>%
arrange(desc(ref_date))
thedata_YVR
## # A tibble: 360 × 21
## ref_date geo dguid uom uom_id scalar_factor scalar_id vector coordinate value status symbol
## <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 2021-12-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.2 124. <NA> <NA>
## 2 2021-12-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.3 119. E <NA>
## 3 2021-11-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.2 123. <NA> <NA>
## 4 2021-11-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.3 119. E <NA>
## 5 2021-10-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.2 122 <NA> <NA>
## 6 2021-10-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.3 119. E <NA>
## 7 2021-09-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.2 121. <NA> <NA>
## 8 2021-09-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.3 118. E <NA>
## 9 2021-08-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.2 119. <NA> <NA>
## 10 2021-08-01 Vanc… 2011… Inde… 347 units 0 v1119… 39.3 117. E <NA>
## # … with 350 more rows, and 9 more variables: terminated <chr>, decimals <chr>, geo_uid <chr>,
## # hierarchy_for_geo <chr>, classification_code_for_new_housing_price_indexes <chr>,
## # hierarchy_for_new_housing_price_indexes <chr>, val_norm <dbl>, date <date>,
## # new_housing_price_indexes <fct>
the plot
NHPI_title <- as.character("New Housing Price Index, Vancouver: house and land")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")
# with formatting applied
NHPI_Vancouver_plot <-
ggplot(thedata_YVR, aes(x=ref_date, y=value,
colour=new_housing_price_indexes)) +
geom_line(size=1.5) +
#
scale_x_date(date_breaks = "2 years", labels = year) +
scale_y_continuous(labels = comma, limits = c(80, 125)) +
scale_colour_manual(name=NULL,
breaks=c("House only", "Land only"),
labels=c("House only", "Land only"),
values=c("#325A80", "#CCB550")) +
bida_chart_theme +
# set chart titles and labels
labs(title = NHPI_title,
subtitle = NHPI_subtitle,
caption = NHPI_caption,
x = NULL, y = "NHPI (Dec. 2016 = 100)")
NHPI_Vancouver_plot
-30-