3 An example of a data analytics project

3.1 The data science process

This walkthrough follows the data science process, as described by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund in R for Data Science, 2nd ed..

The data science process
The data science process

3.2 New Housing Price Index

This set of scripts creates summary tables and graphs plotting the New Housing Price Index (NHPI) data collected and reported by Statistics Canada.

Our end point will be this chart:

New Housing Price Index, B.C. and Canada
New Housing Price Index, B.C. and Canada

This plot was created in early 2020, when the most recent data available was for the month of November 2019.


3.2.1 Setup

This chunk of R code loads the packages that we will be using.

# tidyverse
library(tidyverse)
library(lubridate)  # date functions
library(scales)     # extending {ggplot2}
library(glue)       # for gluing strings together
#
# utilities
library(cansim)     # data extract
library(janitor)    # for `clean_names()`
library(knitr)      # for publication - includes `kable()` for tables
library(kableExtra) # - format kable tables
library(flextable)  # another table formatting package

First, we have some code for two custom functions that calculate month-over-month and year-over-year statistics. These will be called later in the scripts. Functions give us a shorthand way to do the same thing multiple times.

get_mom_stats <- function(df) {
  df |>
    arrange(ref_date) |>
    mutate(mom_val = lag(value), 
           mom_pct = ((value / lag(value, n = 1)) - 1) * 100,
           mom_chg = (value - lag(value, n = 1)))
}


get_yoy_stats <- function(df) {
  df |>
    arrange(ref_date) |>
    mutate(yoy_val = lag(value, n = 12),
           yoy_pct = ((value / lag(value, n = 12)) - 1) * 100,
           yoy_chg = (value - lag(value, n = 12)))
}

We then create a chart theme that will also be called later. This theme will be applied to all of our charts, and ensures that every one of our charts have exactly the same formatting. Like functions, this streamlines our code and helps reduce the chance of error.

For this chart theme, more than 10 lines of code gets reduced to a simple, memorable name: bida_chart_theme.

bida_chart_theme <-
  theme_bw() +
  theme(
    panel.border = element_rect(colour="white"),
    plot.title = element_text(face="bold"),
    legend.position=c(1,0), 
    legend.justification=c(1,0),
    legend.title = element_text(size=12),
    legend.text = element_text(size=11),
    axis.line = element_line(colour="black"),
    axis.title = element_text(size=12),
    axis.text = element_text(size=12)
  )

3.3 IMPORT

The first step is to import the data. For this, we will access Statistics Canada’s database CANSIM, using an R package {cansim}. The file has 18 variables and over 56,000 rows (as of February 2020) and growing every month.

data source

The Daily, 2019-10-10

Table: 18-10-0205-01 (formerly CANSIM 327-0056) https://www150.statcan.gc.ca/t1/tbl1/en/tv.action?pid=1810020501

Note: “The index base period, for which the New Housing Price Index (NHPI) equals 100, is December 2016.”

This version (updated 2024-03-21) uses the cansim package to pull the data from CANSIM, and assign it to an object named “thedata”.

table_id <- "18-10-0205-01"

thedata <- get_cansim(table_id)

3.4 TIDY

3.4.1 understanding the data

It’s a good practice before starting any data analysis to understand the structure of your data, by interrogating the variables and the values in those variables.

{cansim} (like many other data-focused packages) provides some utility functions that let us quickly understand the data frame. Here we use the function get_cansim_table_overview to generate an overview of the table we downloaded.

# from the {cansim} package
get_cansim_table_overview(table_id)

We can also create some summaries with the names of the variables. One way to do this is with the ls.str() function:

# list the variables in the data
ls.str(thedata)
## Classification Code for New housing price indexes :  chr [1:62160] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## COORDINATE :  chr [1:62160] "1.1" "1.2" "1.3" "2.1" "2.2" "2.3" "3.1" "3.2" "3.3" "4.1" "4.2" "4.3" "5.1" ...
## Date :  Date[1:62160], format: "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" "1981-01-01" ...
## DECIMALS :  chr [1:62160] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" ...
## DGUID :  chr [1:62160] "2016A000011124" "2016A000011124" "2016A000011124" "2016A00011" "2016A00011" ...
## GEO :  chr [1:62160] "Canada" "Canada" "Canada" "Atlantic Region" "Atlantic Region" "Atlantic Region" ...
## GeoUID :  chr [1:62160] "11124" "11124" "11124" "1" "1" "1" "10" "10" "10" "1" "1" "1" "11" "11" "11" ...
## Hierarchy for GEO :  chr [1:62160] "1" "1" "1" "1.2" "1.2" "1.2" "1.2.3" "1.2.3" "1.2.3" "1.2.3.4" "1.2.3.4" ...
## Hierarchy for New housing price indexes :  chr [1:62160] "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" "1" "1.2" "1.3" ...
## New housing price indexes :  Factor w/ 3 levels "Total (house and land)",..: 1 2 3 1 2 3 1 2 3 1 ...
## REF_DATE :  chr [1:62160] "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" "1981-01" ...
## SCALAR_FACTOR :  chr [1:62160] "units" "units" "units" "units" "units" "units" "units" "units" "units" "units" ...
## SCALAR_ID :  chr [1:62160] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" ...
## STATUS :  chr [1:62160] NA NA "E" ".." ".." ".." ".." ".." ".." NA NA "E" ".." ".." ".." ".." ".." ".." ...
## SYMBOL :  chr [1:62160] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## TERMINATED :  chr [1:62160] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
## UOM :  chr [1:62160] "Index, 201612=100" "Index, 201612=100" "Index, 201612=100" "Index, 201612=100" ...
## UOM_ID :  chr [1:62160] "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" "347" ...
## val_norm :  num [1:62160] 38.2 36.1 40.6 NA NA NA NA NA NA 36.1 ...
## VALUE :  num [1:62160] 38.2 36.1 40.6 NA NA NA NA NA NA 36.1 ...
## VECTOR :  chr [1:62160] "v111955442" "v111955443" "v111955444" "v111955445" "v111955446" "v111955447" ...

We can explore the individual variables as well—let’s generate a small table counting the instances of the GEO and GeoUID variables:

thedata |>
  group_by(GEO, GeoUID) |>
  tally()
## # A tibble: 40 × 3
## # Groups:   GEO [40]
##    GEO                                 GeoUID     n
##    <chr>                               <chr>  <int>
##  1 Alberta                             48      1554
##  2 Atlantic Region                     1       1554
##  3 British Columbia                    59      1554
##  4 Calgary, Alberta                    825     1554
##  5 Canada                              11124   1554
##  6 Charlottetown, Prince Edward Island 105     1554
##  7 Edmonton, Alberta                   835     1554
##  8 Greater Sudbury, Ontario            580     1554
##  9 Guelph, Ontario                     550     1554
## 10 Halifax, Nova Scotia                205     1554
## # ℹ 30 more rows

The tail() function shows the last 6 rows of a data table.

tail(thedata)
## # A tibble: 6 × 21
##   REF_DATE GEO      DGUID UOM   UOM_ID SCALAR_FACTOR SCALAR_ID VECTOR COORDINATE VALUE STATUS SYMBOL
##   <chr>    <chr>    <chr> <chr> <chr>  <chr>         <chr>     <chr>  <chr>      <dbl> <chr>  <chr> 
## 1 2024-02  Vancouv… 2011… Inde… 347    units         0         v1119… 39.1        128. <NA>   <NA>  
## 2 2024-02  Vancouv… 2011… Inde… 347    units         0         v1119… 39.2        127. <NA>   <NA>  
## 3 2024-02  Vancouv… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
## 4 2024-02  Victori… 2011… Inde… 347    units         0         v1119… 40.1        120. <NA>   <NA>  
## 5 2024-02  Victori… 2011… Inde… 347    units         0         v1119… 40.2        125. <NA>   <NA>  
## 6 2024-02  Victori… 2011… Inde… 347    units         0         v1119… 40.3        110. E      <NA>  
## # ℹ 9 more variables: TERMINATED <chr>, DECIMALS <chr>, GeoUID <chr>, `Hierarchy for GEO` <chr>,
## #   `Classification Code for New housing price indexes` <chr>,
## #   `Hierarchy for New housing price indexes` <chr>, val_norm <dbl>, Date <date>,
## #   `New housing price indexes` <fct>

Based on this interrogation, we can see that the chart is made up of the following variables:

  • “REF_DATE”, starting in January 2007

  • “GEO”, with British Columbia & Canada

  • “New housing price indexes”, but total (not house and land separately)

3.4.2 cleaning

Cleaning up our data table before we start working is often the next step. In this case, we use the {janitor} package to clean the variable names.

The values in the date field is also problematic—they are stored as character strings. We will clean the values by specifying them as a “date” format.

thedata <- janitor::clean_names(thedata)

thedata <- thedata |>
    mutate(ref_date = ymd(ref_date, truncated = 2)) 

head(thedata)
## # A tibble: 6 × 21
##   ref_date   geo    dguid uom   uom_id scalar_factor scalar_id vector coordinate value status symbol
##   <date>     <chr>  <chr> <chr> <chr>  <chr>         <chr>     <chr>  <chr>      <dbl> <chr>  <chr> 
## 1 1981-01-01 Canada 2016… Inde… 347    units         0         v1119… 1.1         38.2 <NA>   <NA>  
## 2 1981-01-01 Canada 2016… Inde… 347    units         0         v1119… 1.2         36.1 <NA>   <NA>  
## 3 1981-01-01 Canada 2016… Inde… 347    units         0         v1119… 1.3         40.6 E      <NA>  
## 4 1981-01-01 Atlan… 2016… Inde… 347    units         0         v1119… 2.1         NA   ..     <NA>  
## 5 1981-01-01 Atlan… 2016… Inde… 347    units         0         v1119… 2.2         NA   ..     <NA>  
## 6 1981-01-01 Atlan… 2016… Inde… 347    units         0         v1119… 2.3         NA   ..     <NA>  
## # ℹ 9 more variables: terminated <chr>, decimals <chr>, geo_uid <chr>, hierarchy_for_geo <chr>,
## #   classification_code_for_new_housing_price_indexes <chr>,
## #   hierarchy_for_new_housing_price_indexes <chr>, val_norm <dbl>, date <date>,
## #   new_housing_price_indexes <fct>

Important Note

The clean_names() function changes the names of the three variables that are shown in the chart:

Variable: Old Name New Name
REF_DATE ref_date
GEO geo
New housing price indexes new_housing_price_indexes

3.4.3 filtering: BC & Canada data

For our chart, we don’t need all the geographic regions—just B.C. and Canada. And for our purposes, we need the total value and want the chart to start in 2007, not 1981.

To do that, we use two functions from the {dplyr} package: filter (to pick the rows we want, based on the values of interest) and select (to pick the columns).

startdate <- as.Date("2007-01-01")
# filter to have BC and Canada
thedata_BC_Can <- thedata |>
  filter(ref_date >= startdate) |>
  filter(geo %in% c("British Columbia", "Canada"), 
         new_housing_price_indexes == "Total (house and land)") |>
  select(ref_date, geo, new_housing_price_indexes, value)
head(thedata_BC_Can)
## # A tibble: 6 × 4
##   ref_date   geo              new_housing_price_indexes value
##   <date>     <chr>            <fct>                     <dbl>
## 1 2007-01-01 Canada           Total (house and land)     82.3
## 2 2007-01-01 British Columbia Total (house and land)     95.7
## 3 2007-02-01 Canada           Total (house and land)     82.7
## 4 2007-02-01 British Columbia Total (house and land)     95.7
## 5 2007-03-01 Canada           Total (house and land)     83  
## 6 2007-03-01 British Columbia Total (house and land)     96.7

3.5 UNDERSTAND

3.5.1 Visualize - tables

One way to understand your data is to create a summary table that is more easily scanned by the human eye.

In this case, we will put the rows as each year, and the columns as the months January through December.

NHPI_table <- thedata_BC_Can |>
  filter(geo == "British Columbia") |>
  mutate(year = year(ref_date),
         month = month(ref_date, label = TRUE)) |>
  select(year, month, value) |>
  pivot_wider(names_from = month, values_from = value) |>
  arrange(desc(year))

# display the table
head(NHPI_table)
## # A tibble: 6 × 13
##    year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2024  127.  127.   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 
## 2  2023  127.  127.  127.  127.  127   128.  127.  128.  127.  127.  127.  127.
## 3  2022  126.  127   128.  128.  128   128.  129.  129.  129.  128.  128.  128.
## 4  2021  112.  116.  117.  119.  120   120.  120.  121.  122.  123.  124   124.
## 5  2020  106.  106.  106.  106.  106.  106.  107.  108.  110.  112.  112.  112.
## 6  2019  108.  107.  107.  107.  107.  107.  106.  106.  106   106.  105.  106.

For the next piece, we will calculate an annual average for each year, and add that column to the table.

This is a good time to say that there are often multiple ways to get to the same result—they are not “right” or “wrong”, just different. Here are three different ways to create a column with the annual average (and there are more):

# how to add annual average
# Julie's genius solution
NHPI_table2 <- NHPI_table |>
  mutate(annual_avg = rowMeans(NHPI_table[-1], na.rm = TRUE))
head(NHPI_table2)
## # A tibble: 6 × 14
##    year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec annual_avg
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1  2024  127.  127.   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA        127.
## 2  2023  127.  127.  127.  127.  127   128.  127.  128.  127.  127.  127.  127.       127.
## 3  2022  126.  127   128.  128.  128   128.  129.  129.  129.  128.  128.  128.       128.
## 4  2021  112.  116.  117.  119.  120   120.  120.  121.  122.  123.  124   124.       120.
## 5  2020  106.  106.  106.  106.  106.  106.  107.  108.  110.  112.  112.  112.       108.
## 6  2019  108.  107.  107.  107.  107.  107.  106.  106.  106   106.  105.  106.       107.
# Stephanie's genius solution
# starts with the raw data table
NHPI_table3 <- thedata_BC_Can |>
  filter(geo == "British Columbia") |>
  mutate(year = year(ref_date),
         month = month(ref_date, label = TRUE)) |>
  select(year, month, value) |>
  group_by(year) |>
  mutate(annual_avg = mean(value, na.rm = TRUE)) |>
  pivot_wider(names_from = month, values_from = value) |>
  arrange(desc(year))

head(NHPI_table3)
## # A tibble: 6 × 14
## # Groups:   year [6]
##    year annual_avg   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
##   <dbl>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1  2024       127.  127.  127.   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA 
## 2  2023       127.  127.  127.  127.  127.  127   128.  127.  128.  127.  127.  127.  127.
## 3  2022       128.  126.  127   128.  128.  128   128.  129.  129.  129.  128.  128.  128.
## 4  2021       120.  112.  116.  117.  119.  120   120.  120.  121.  122.  123.  124   124.
## 5  2020       108.  106.  106.  106.  106.  106.  106.  107.  108.  110.  112.  112.  112.
## 6  2019       107.  108.  107.  107.  107.  107.  107.  106.  106.  106   106.  105.  106.

This is also a good example of how R is continuously evolving. When this course was first taught in autumn 2019, this is where the section on adding the annual average column ended.

But that ongoing evolution has given us a new {dplyr} function, rowwise(). This function allow us to calculate the average of the specified rows as above. What is required is telling R to treat the functions rowwise(), and then specify what we want using the same sort of functions and syntax that would be required for a column:

NHPI_table4 <- NHPI_table |>
  rowwise() |> 
  mutate(annual_avg = mean(c(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec), na.rm = TRUE))

head(NHPI_table4)
## # A tibble: 6 × 14
## # Rowwise: 
##    year   Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec annual_avg
##   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>      <dbl>
## 1  2024  127.  127.   NA    NA    NA    NA    NA    NA    NA    NA    NA    NA        127.
## 2  2023  127.  127.  127.  127.  127   128.  127.  128.  127.  127.  127.  127.       127.
## 3  2022  126.  127   128.  128.  128   128.  129.  129.  129.  128.  128.  128.       128.
## 4  2021  112.  116.  117.  119.  120   120.  120.  121.  122.  123.  124   124.       120.
## 5  2020  106.  106.  106.  106.  106.  106.  107.  108.  110.  112.  112.  112.       108.
## 6  2019  108.  107.  107.  107.  107.  107.  106.  106.  106   106.  105.  106.       107.

For more information about row-wise operations with {dplyr}, see the following resources:


3.6 COMMUNICATE - table

One way we often communicate the data is through a summary table.

The tables above have the data we are interested in, but are not appealing to the eye. Using the {flextable} package, we can apply formatting to the tables above. We can also add things like headers and footnotes.

# print table with {flextable} formatting  
NHPI_table4 |>
  flextable() |> 
#  add_footer(glue::glue("Source: Statistics Canada, Table {table_id}"))
  add_footer_row(values = glue::glue("Source: Statistics Canada, Table {table_id}"),
                 colwidths = 14)

year

Jan

Feb

Mar

Apr

May

Jun

Jul

Aug

Sep

Oct

Nov

Dec

annual_avg

2,024

127.1

127.2

127.15000

2,023

127.1

126.8

126.8

126.8

127.0

127.5

127.3

127.9

127.4

127.4

127.3

127.3

127.21667

2,022

126.2

127.0

127.7

127.9

128.0

128.1

128.7

128.7

128.7

127.8

127.7

127.6

127.84167

2,021

112.3

116.3

117.2

118.9

120.0

119.7

120.3

120.8

121.9

123.3

124.0

124.1

119.90000

2,020

105.5

105.9

106.3

106.3

106.3

106.4

107.2

108.1

110.2

111.5

111.8

111.9

108.11667

2,019

107.6

107.4

107.4

107.2

106.9

106.8

106.4

106.4

106.0

105.8

105.3

105.7

106.57500

2,018

107.8

107.8

107.9

107.9

107.9

108.0

107.9

107.8

108.0

108.0

107.8

107.7

107.87500

2,017

99.9

100.0

100.5

101.5

103.4

104.7

106.5

106.6

107.3

107.6

107.6

107.8

104.45000

2,016

96.3

97.1

97.4

97.7

98.8

99.2

99.7

99.7

100.0

100.3

100.0

100.0

98.85000

2,015

93.6

93.6

93.6

94.2

94.2

94.4

94.7

94.7

95.1

95.3

95.8

95.9

94.59167

2,014

94.3

94.3

94.2

93.9

93.7

93.6

93.5

93.7

93.7

94.0

93.8

93.8

93.87500

2,013

95.3

95.4

95.2

95.3

95.1

94.9

95.1

94.8

94.8

94.6

94.5

94.3

94.94167

2,012

96.3

96.2

96.0

96.0

96.0

96.1

96.1

96.0

95.9

96.1

95.7

95.4

95.98333

2,011

96.6

96.6

96.8

96.9

97.1

97.1

97.1

96.7

96.8

96.6

96.3

96.1

96.72500

2,010

96.5

97.0

97.5

97.8

98.0

97.9

97.2

97.2

96.8

96.8

96.6

96.5

97.15000

2,009

98.1

95.4

94.3

93.3

93.6

92.8

93.4

93.6

94.8

95.3

95.6

96.1

94.69167

2,008

101.5

101.6

102.2

102.2

101.9

101.9

101.9

101.8

101.8

100.6

99.0

99.0

101.28333

2,007

95.7

95.7

96.7

97.3

99.5

100.3

100.4

100.5

100.5

101.1

101.4

101.4

99.20833

Source: Statistics Canada, Table 18-10-0205-01

3.6.1 table: monthly summary

This code produces a monthly summary table with the most recent 13 months of B.C. data, which allows for year-over-year comparisons with the same month in the previous year.

The second part of the code then formats the table ready for publication.

# create summary table
tbl_month_BC <-
thedata_BC_Can |>
  filter(geo == "British Columbia") |>
  arrange(ref_date) |>
  # calculate percent change stats
  get_mom_stats() |>
  get_yoy_stats() |>
  # pull year and month
  mutate(year = year(ref_date),
         month = month(ref_date, label = TRUE)) |>
  # select relevant columns, rename as necessary
  select(year, month, value, 
         "from previous month" = mom_chg, 
         "from same month, previous year" = yoy_chg) |>
  arrange(desc(year), desc(month)) |>
  # just print rows 1 to 13
  slice(1:13)

# print table with {kableExtra} formatting  
tbl_month_BC |>
  kable(caption = "NHPI, British Columbia", digits = 1) |>
  kable_styling(bootstrap_options = "striped") |>
  row_spec(0, bold = T, font_size = 14) |>
  row_spec(1, bold = T) |>
  add_header_above(c(" " = 3, "index point change" = 2), font_size = 14)
Table 3.1: Table 3.2: NHPI, British Columbia
index point change
year month value from previous month from same month, previous year
2024 Feb 127.2 0.1 0.4
2024 Jan 127.1 -0.2 0.0
2023 Dec 127.3 0.0 -0.3
2023 Nov 127.3 -0.1 -0.4
2023 Oct 127.4 0.0 -0.4
2023 Sep 127.4 -0.5 -1.3
2023 Aug 127.9 0.6 -0.8
2023 Jul 127.3 -0.2 -1.4
2023 Jun 127.5 0.5 -0.6
2023 May 127.0 0.2 -1.0
2023 Apr 126.8 0.0 -1.1
2023 Mar 126.8 0.0 -0.9
2023 Feb 126.8 -0.3 -0.2

3.6.2 text

pull B.C. values for latest month

# filter a B.C.-only data frame

thedata_BC <- thedata_BC_Can |>
  filter(geo == "British Columbia") |>
  arrange(ref_date) |>
  # calculate percent change stats
  get_mom_stats() |>
  get_yoy_stats()

# determine most recent month

latest_ref_date <- max(thedata_BC$ref_date)

this_month <- month(latest_ref_date, label = TRUE)
this_year <- year(latest_ref_date)

this_month_nhpi <- thedata_BC |>
  filter(ref_date == latest_ref_date) |>
  pull(value)
this_month_mom <- thedata_BC |>
  filter(ref_date == latest_ref_date) |>
  pull(mom_chg)
this_month_yoy <- thedata_BC |>
  filter(ref_date == latest_ref_date) |>
  pull(yoy_chg)

Our text:

In Feb 2024, the New Housing Price Index for British Columbia was 127.2. This was a change of 0.1 compared to the previous month, and 0.4 year-over-year.


3.7 UNDERSTAND

3.7.1 Visualize - plot

Making a plot of your data is often a good way to understand what the data can tell us, and helps us think about the story in the data. In this case, we will make a basic plot with just a few lines of code.

# PLOT!
# basic
ggplot(thedata_BC_Can, aes(x=ref_date, y=value, group=geo)) + 
  geom_line()

We can make it a little more presentable with some additional formatting

#
# with formatting applied
dataplot <- ggplot(thedata_BC_Can, aes(x=ref_date, y=value, colour=geo)) + 
  geom_line(size=1.5) 
dataplot

And with even more formatting it’s starting to look like something we could publish.

dataplot2 <- dataplot +
  scale_x_date(date_breaks = "2 years", labels = year) +
  scale_y_continuous(labels = comma, limits = c(80, 125)) +
  scale_colour_manual(name=NULL,
                      breaks=c("Canada", "British Columbia"),
                      labels=c("Canada", "British Columbia"), 
                      values=c("#325A80", "#CCB550")) +
  bida_chart_theme

Just a reminder: all of the code below is what’s in the bida_chart_theme–the last line in the code chunk above.

If we were to make more than one copy of the plot, we would have to repeat this every time. And if we made more than one copy of the plot, we would have to make the same change in every place.

By assigning all of this code to the object bida_chart_theme, we can make a change in one place, and it will carry through all of the times we need it.

#  theme_bw() +
#  theme(
#    panel.border = element_rect(colour="white"),
#    plot.title = element_text(face="bold"),
#    legend.position=c(1,0), 
#    legend.justification=c(1,0),
#    legend.title = element_text(size=12),
#    legend.text = element_text(size=11),
#    axis.line = element_line(colour="black"),
#    axis.title = element_text(size=12),
#    axis.text = element_text(size=12)
#  )
#
dataplot2 

3.8 COMMUNICATE - plot

Let’s finish our formatting…

# experiments with ggplot2's new subtitle and caption options
NHPI_title <- as.character("New Housing Price Index, Canada & B.C.")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")
# add titles / X-Y axis labels
NHPI_plot <- dataplot2 +
  labs(title = NHPI_title,
       subtitle = NHPI_subtitle,
       caption = NHPI_caption, 
       x = NULL, y = "NHPI (Dec. 2016 = 100)") 
NHPI_plot
ggsave(filename = "NHPI_plot.jpg", plot = NHPI_plot,
       width = 8, height = 6)

3.9 REPEAT WITH DIFFERENT VALUES

3.9.1 Vancouver house and land version

Now we’ve set up the B.C. comparison to Canada, we can quickly adapt our previous work to look at another region–in this case, a deeper dive into Vancouver, comparing the changes in the “land” and “house” components of the NHPI.

startdate <- as.Date("2007-01-01")

# filter the original data to have Vancouver
thedata_YVR <- thedata |>
  filter(ref_date >= startdate) |>
  filter(geo %in% c("Vancouver, British Columbia"), 
         new_housing_price_indexes %in% c("House only", "Land only")) |>
  arrange(desc(ref_date))

thedata_YVR
## # A tibble: 412 × 21
##    ref_date   geo   dguid uom   uom_id scalar_factor scalar_id vector coordinate value status symbol
##    <date>     <chr> <chr> <chr> <chr>  <chr>         <chr>     <chr>  <chr>      <dbl> <chr>  <chr> 
##  1 2024-02-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.2        127. <NA>   <NA>  
##  2 2024-02-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
##  3 2024-01-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.2        126. <NA>   <NA>  
##  4 2024-01-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
##  5 2023-12-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.2        127. <NA>   <NA>  
##  6 2023-12-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
##  7 2023-11-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.2        127. <NA>   <NA>  
##  8 2023-11-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
##  9 2023-10-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.2        127. <NA>   <NA>  
## 10 2023-10-01 Vanc… 2011… Inde… 347    units         0         v1119… 39.3        122. E      <NA>  
## # ℹ 402 more rows
## # ℹ 9 more variables: terminated <chr>, decimals <chr>, geo_uid <chr>, hierarchy_for_geo <chr>,
## #   classification_code_for_new_housing_price_indexes <chr>,
## #   hierarchy_for_new_housing_price_indexes <chr>, val_norm <dbl>, date <date>,
## #   new_housing_price_indexes <fct>

the plot

NHPI_title <- as.character("New Housing Price Index, Vancouver: house and land")
NHPI_subtitle <- as.character("December 2016 = 100")
NHPI_caption <- as.character("Source: Statistics Canada, CANSIM table 18-10-0205-01")

# with formatting applied
NHPI_Vancouver_plot <- 
  ggplot(thedata_YVR, aes(x=ref_date, y=value,
                           colour=new_housing_price_indexes)) + 
  geom_line(size=1.5) +
  #
  scale_x_date(date_breaks = "2 years", labels = year) +
  scale_y_continuous(labels = comma, limits = c(80, 125)) +
  scale_colour_manual(name=NULL,
                      breaks=c("House only", "Land only"),
                      labels=c("House only", "Land only"), 
                      values=c("#325A80", "#CCB550")) +
  bida_chart_theme +
  # set chart titles and labels
  labs(title = NHPI_title,
       subtitle = NHPI_subtitle,
       caption = NHPI_caption, 
       x = NULL, y = "NHPI (Dec. 2016 = 100)") 

NHPI_Vancouver_plot

-30-