Geographic Datascience with R

Author

Peter Baumgartner

Published

2025-05-11 11:29

Preface

This is work in progress: Finished about 11%

This Quarto book collects my personal notes, trials, and exercises on Geographic Data Science With R: Visualizing and Analyzing Environmental Change by Michael Wimberly (2023a). The book is also available online (Wimberly, 2023b). The dataset that accompanies the book can be downloaded from figshare (Wimberly, 2022).

I have some experience with the R language but none with geodata. I want to use maps to show how statistical results differ by country. My focus is therefore on geodata; I have skipped chapters 1-4 because they introduce R, which I am already familiar with. I have added further material and experiments as separate annex chapters.

The book has twelve chapters with 267 pages (appendix and bibliography not included). I skipped Chapters 1 to 4 (83 pages) because I am already familiar with their topics. The percentage of completion is therefore calculated from the notes referring to Chapters 5 to 12 (184 pages).

Resource 1 : Resources used for Geographic Datascience with R (GDSWR)

Packages used in this chapter

To refer to an R package I use the format {package name}, e.g., {glossary}. At the beginning of each file I have listed all packages used in the current chapter, as the next box demonstrates for this preface.

The package name is followed by a reference to the package author and a link to a short profile of the package. These profiles are collected in the annex.

Resource 2 : List of packages to install

Personal learning material

I am writing this book as a text for others to read because that forces me to become explicit and explain all my learning outcomes more carefully. Please keep in mind that this text is not written by an expert but by a learner.

I have skipped text passages whose content I am already familiar with. In this case I have started with chapter 5 because chapters 1 to 4 are an introduction to using R, which I am already comfortable with.

Where I needed more in-depth knowledge of a section of the original text, I have elaborated on it and added my own comments resulting from my personal research.

All mistakes are my own responsibility

This is my personal learning material and is therefore neither an accurate replication nor an authoritative textbook.

Although I replicate most of the content, this Quarto book may contain many mistakes. All misapprehensions and errors are, of course, my own responsibility.

Glossary

I am using the {glossary} package to create links to glossary entries. If you hover with your mouse over a double-underlined link, a window opens with the corresponding glossary text. Try this example: GeoJSON. Additionally, you will find a table of all glossary entries used in a chapter at the end of that chapter, immediately before the session info.

I added many of the glossary entries while working through other books, either by taking a text passage from the book I was reading or through internet research in other resources. I have added the source of each glossary entry. Sometimes I have used abbreviations, but I still need to provide a key to what these short references mean.

R Code 1 : Load glossary

Listing / Output 1: Install and load the glossary package with the appropriate glossary.yml file
Code
## Install the glossary package (only once):
## https://debruine.github.io/glossary/

library(glossary)

## If you want to use my glossary.yml file:

## 1. Fork my repo
##    https://github.com/petzi53/glossary-pb

## 2. Download the `glossary.yml` file from
##    https://github.com/petzi53/glossary-pb/blob/master/glossary.yml

## 3. Store the file on your hard disk
##    and change the following path accordingly

glossary::glossary_path("../glossary-pb/glossary.yml")

If you fork the repository of my Geographic Datascience with R book (https://github.com/petzi53/GDSWR), the glossary will not work out of the box. Download the glossary.yml file from my glossary-pb GitHub repo, store it on your hard disk, and change the path in the code chunk above accordingly.

Glossary is my private learning vehicle

What I have said about my personal notes in general also applies to the glossary. The entries represent the state of my personal knowledge and are neither authoritative nor complete.

R Code and Datasets

Download datasets

The data files used in each chapter have to be downloaded from https://doi.org/10.6084/m9.figshare.21301212.

Run the following code chunk only once (manually).

R Code 2 : Download Data Files

Code
base::source(file = "R/helper.R")

## create data folder (only once, i.e., only in this chapter)
baseURL <- here::here()
my_create_folder(base::paste0(baseURL, "/data"))

## download the zipped data files and unpack them
## (mode = "wb" keeps the binary .zip file intact on Windows)
url <- "https://figshare.com/ndownloader/files/39733921"
fn <- base::paste0(baseURL, "/data/gdswr_data.zip")
utils::download.file(url, fn, mode = "wb")
utils::unzip(fn, exdir = base::paste0(baseURL, "/data"))

## delete .zip file
if (base::file.exists(fn)) {
  base::file.remove(fn)
}
(No output is available for this R code chunk.)

One data file exceeds GitHub’s file size limit

After I tried to commit my changes, I learned that the file mesodata_large.csv in “data/Chapter3” has a size of 151.28 MB. This exceeds GitHub’s file size limit of 100.00 MB, so the file can’t be stored on GitHub in the standard way.

It would be possible to use Git Large File Storage (https://git-lfs.github.com). However, I decided against this complexity. Actually, I did not need the files of Chapter1 to Chapter4, as I skipped the starting chapters. But instead of deleting the file from my data subfolder, I changed the content of .gitignore so that this large file is not transferred to GitHub.
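
A check like the following could reveal oversized files before a commit. It is only a sketch: my_find_large_files() is a hypothetical helper that is not part of “R/helper.R”, and it assumes that the limit is counted in units of 1024^2 bytes. The demonstration runs on a temporary folder with a tiny limit so that it does not depend on the downloaded data.

```r
# Hypothetical helper (an assumption, not part of "R/helper.R"):
# list all files below `dir` that are larger than `limit_mb`.
my_find_large_files <- function(dir, limit_mb = 100) {
  files <- base::list.files(dir, recursive = TRUE, full.names = TRUE)
  size_mb <- base::file.size(files) / 1024^2
  large <- base::data.frame(file = files, size_mb = size_mb)
  return(large[large$size_mb > limit_mb, , drop = FALSE])
}

# Demonstration on a temporary folder with two small text files
demo_dir <- base::file.path(base::tempdir(), "large_file_demo")
base::dir.create(demo_dir, showWarnings = FALSE)
base::writeLines(base::rep("x", 5000), base::file.path(demo_dir, "big.csv"))
base::writeLines("x", base::file.path(demo_dir, "small.csv"))

# With a limit of 0.005 MB only the bigger of the two files is reported
large_files <- my_find_large_files(demo_dir, limit_mb = 0.005)
base::print(base::basename(large_files$file))
```

In the project itself a call such as my_find_large_files(here::here("data")) would list the candidates for .gitignore.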

Style guides

Generally I am using the Tidyverse Style Guide for code chunks. I use underscores (_), i.e., snake case, to replace spaces in names, as studies have shown that it is easier to read (Sharif & Maletic, 2010).

Additionally, I will use some of the modifications that the Google style guide makes to the tidyverse style guide:

  • Start the names of private functions with my_ (not with a dot, as recommended in the Google style guide).
  • Don’t use base::attach().
  • No right-hand assignments.
  • Use explicit returns.
  • Qualify namespace.
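
The conventions listed above can be illustrated with a small sketch. The function my_rescale() and the example values are invented for this illustration and do not appear in the book or in my helper file.

```r
# Hypothetical private function: the name starts with my_
my_rescale <- function(x, na.rm = TRUE) {
  # left-hand assignment and namespace-qualified calls only
  value_range <- base::range(x, na.rm = na.rm)
  rescaled <- (x - value_range[1]) / (value_range[2] - value_range[1])
  # explicit return instead of relying on the last evaluated expression
  return(rescaled)
}

scores <- c(10, 20, NA, 40)
scaled_scores <- my_rescale(scores)
base::print(scaled_scores)

# The data argument stays unnamed, the overridden default is named in full
base::print(base::mean(scaled_scores, na.rm = TRUE))
```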

Qualifying namespace

Especially the last point (qualifying namespaces) is important for my learning. Besides preventing conflicts between functions with identical names from different packages, it helps me to learn (or remember) which function belongs to which package. I think this justifies the small overhead and helps to make R code chunks self-sufficient (no previous package loading or library calls in the setup chunk). To foster learning the relation between function and package, I enclose the package name in curly braces and format it in bold.

I also use the package name for functions from the default installation of base R. This wouldn’t be necessary, but it helps me to understand where the base R functions come from. What follows is a list of the base R packages of the system library that are included in every installation and attached (opened) by default:

  • {base}: The R Base Package
  • {datasets}: The R Datasets Package
  • {graphics}: The R Graphics Package
  • {grDevices}: The R Graphics Devices and Support for Colours and Fonts
  • {methods}: Formal Methods and Classes
  • {stats}: The R Stats Package
  • {utils}: The R Utils Package
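
The list above can be compared with the running session, and the benefit of qualified calls can be tried out with a deliberately masked function. This is a minimal sketch that assumes a standard R start-up in which the option defaultPackages has not been changed.

```r
# Packages that R attaches at start-up ({base} is always attached)
default_packages <- c("base", base::getOption("defaultPackages"))

# Packages that are currently on the search path
attached <- base::sub("^package:", "", base::search())
base::print(base::intersect(default_packages, attached))

# A user-defined function can mask a package function of the same name ...
median <- function(x) {
  return("masked")
}
base::print(median(c(1, 3, 10)))

# ... but the qualified call still reaches the original in {stats}
base::print(stats::median(c(1, 3, 10)))
```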

Where the meaning is clear, I will follow the advice from Hadley Wickham:

When you call a function, you typically omit the names of data arguments, because they are used so commonly. If you override the default value of an argument, use the full name (tidyverse style guide).

Native pipe

I am using the native pipe introduced with R 4.1.0. It is important to know that there are small but significant differences between the native pipe and the {magrittr} pipe. One of the consequences is that code provided in the book with the {magrittr} pipe sometimes does not work with the native pipe.

This does not pose a general problem because you can mix both pipes. But wherever possible I have tried to stay with the native pipe.

The differences between the native and the {magrittr} pipe are complex, and I will not go into the details here. But the following reading list provides you with material to study the discrepancies.

Differences between native pipe versus {magrittr} pipe

  • Hadley Wickham in the Tidyverse (“Differences between the base R and magrittr pipes”),
  • Kathie Press (“Replacing the Magrittr Pipe With the Native R Pipe”)
  • Geek for Geeks (“What are the differences between R’s native pipe |> and the {magrittr} pipe %>%?”)
  • Isabella Velásquez (“Understanding the native R pipe |>”)
  • StackOverflow (“What are the differences between R’s native pipe |> and the {magrittr} pipe %>%?”)
  • Yihui Xie (“Substitute the magrittr Pipe %>% with R’s Native Pipe Operator |>”)
  • R Bloggers (The new R pipe)
  • Statistik Dresden (“R 4.1.0: Base R Pipe! |>”) in German.
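
Two of the simpler differences covered in these readings can be sketched with base R alone. The example assumes R 4.2.0 or later, because the underscore placeholder is not available in R 4.1; the variable names are chosen for this illustration only.

```r
squares <- c(1, 4, 9)

# The right-hand side of the native pipe must be a function call with
# parentheses; {magrittr} would also accept `squares %>% sqrt`.
roots <- squares |> base::sqrt()

# The native placeholder is `_` and has to be passed to a named argument
# (R >= 4.2.0); {magrittr} uses `.` and is less strict about its position.
model_placeholder <- datasets::mtcars |> stats::lm(mpg ~ wt, data = _)

# An anonymous function is an alternative that already works in R 4.1.0
model_lambda <- datasets::mtcars |> (\(d) stats::lm(mpg ~ wt, data = d))()

base::print(roots)
base::print(base::all.equal(
  stats::coef(model_placeholder),
  stats::coef(model_lambda)
))
```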

Code linking

Code linking does not work together with code annotation. I am therefore using standard comments to number lines and explain them in ordinary numbered lists after the code chunk. This is not optimal, but for learning purposes it is important to have links to the original documentation of the package functions.

Code snippets

I do not always use the exact code snippets of the book for my replications, because I replicate the code not only to see how it works but also to change parameter values and observe their influence.

Functions starting with my_ are private functions that I developed to facilitate repetitive tasks. To understand what these functions do, inspect the “R/helper.R” file.
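
To give an impression of what such a helper might look like, here is a guess at a function like my_create_folder(), which is called in the download chunk of this preface. The body is an assumption made for illustration; the actual definition in “R/helper.R” may differ.

```r
# Hypothetical sketch; the real my_create_folder() is defined in
# "R/helper.R" and may be implemented differently.
my_create_folder <- function(path) {
  # create the folder (and missing parent folders) only if it is absent
  if (!base::dir.exists(path)) {
    base::dir.create(path, recursive = TRUE)
  }
  return(base::invisible(path))
}

# Demonstration in a temporary folder; the second call changes nothing
demo_folder <- base::file.path(base::tempdir(), "gdswr_demo", "data")
my_create_folder(demo_folder)
my_create_folder(demo_folder)
base::print(base::dir.exists(demo_folder))
```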

There is one exception to the my_ convention: Geospatial data files are special and use a column type not covered by base R. To get data summaries with the skimr::skim() function I had to define a skim function list (sfl) specific to the geometry column.

Skimmers for the sfc data class

No skimmers are available for the {sf} data classes of the geometry column. Possible data types are:

  • sfc_POINT,
  • sfc_LINESTRING,
  • sfc_POLYGON,
  • sfc_MULTIPOINT,
  • sfc_MULTILINESTRING,
  • sfc_MULTIPOLYGON, and
  • sfc_GEOMETRY.

(sfc stands for “simple feature list column”.)

The geometry column combines one of the above data types with the general sfc type. It is possible to adapt {skimr} to user-defined data types either locally with skimr::skim_with() or, more generally, by providing a method for the skimr::get_skimmers() generic, which adds defaults for new data types to the standard skimr::skim() function.

I have written a get_skimmers.sfc() method that returns a skim function list created with skimr::sfl() to get summary data for the sfc data type of the geometry column. You can inspect the details of this function in the already mentioned “R/helper.R” file.

Resource 3 : Defining sfl for sfc data types

Resources I have drawn upon to learn how to develop skimmers for the geometry column:

  • Defining sfl’s for a package: General article that explains how to generate and use sfl’s with user-defined data types. sfl stands for “skimr function list”. It is a list-like data structure used to define custom summary statistics for specific data types.
  • skim of {sf} objects: Discussion specific to the data types in the {sf} package.
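
The pattern described in these resources can be sketched as follows for point geometries only. This is not the code from “R/helper.R”: the choice of the two statistics (n_unique and valid) is an assumption for illustration, and {skimr} and {sf} are assumed to be installed when the method is actually used.

```r
# Sketch of a skimmer method for one {sf} geometry class.
# Assumption: the two summary statistics are picked for illustration only.
get_skimmers.sfc_POINT <- function(column) {
  skimmers <- skimr::sfl(
    skim_type = "sfc_POINT",
    # number of distinct geometries in the column
    n_unique = skimr::n_unique,
    # number of geometries that pass the validity check of {sf}
    valid = ~ base::sum(sf::st_is_valid(.))
  )
  return(skimmers)
}
```

Once such a method is defined in the session, skimr::skim() applied to a data set with a point geometry column should use these statistics instead of falling back to the character skimmers.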

Glossary table

term definition
GeoJSON GeoJSON is an open standard geospatial data interchange format based on JavaScript Object Notation (JSON). It was first published in June 2008 and became a standard specification with the publication of RFC 7946 in August 2016. GeoJSON is used to represent simple geographic features and their non-spatial attributes, such as points, lines, and polygons, and it uses the World Geodetic System 1984 as its geographic coordinate reference system. GeoJSON is widely used in web mapping applications due to its simplicity and compatibility with JavaScript. It supports various geometry types, including Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. GeoJSON files are typically saved with a .geojson extension and have a MIME type of application/geo+json.

Session Info

Code
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.5.0 (2025-04-11)
#>  os       macOS Sequoia 15.4.1
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Europe/Vienna
#>  date     2025-05-11
#>  pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)
#>  quarto   1.8.4 @ /usr/local/bin/quarto
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  cli            3.6.5   2025-04-23 [2] CRAN (R 4.5.0)
#>  commonmark     1.9.5   2025-03-17 [2] CRAN (R 4.5.0)
#>  curl           6.2.2   2025-03-24 [2] CRAN (R 4.5.0)
#>  dichromat      2.0-0.1 2022-05-02 [2] CRAN (R 4.5.0)
#>  digest         0.6.37  2024-08-19 [2] CRAN (R 4.5.0)
#>  evaluate       1.0.3   2025-01-10 [2] CRAN (R 4.5.0)
#>  farver         2.1.2   2024-05-13 [2] CRAN (R 4.5.0)
#>  fastmap        1.2.0   2024-05-15 [2] CRAN (R 4.5.0)
#>  glossary     * 1.0.0   2023-05-30 [2] CRAN (R 4.5.0)
#>  glue           1.8.0   2024-09-30 [2] CRAN (R 4.5.0)
#>  htmltools      0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
#>  htmlwidgets    1.6.4   2023-12-06 [2] CRAN (R 4.5.0)
#>  jsonlite       2.0.0   2025-03-27 [2] CRAN (R 4.5.0)
#>  kableExtra     1.4.0   2024-01-24 [2] CRAN (R 4.5.0)
#>  knitr          1.50    2025-03-16 [2] CRAN (R 4.5.0)
#>  lifecycle      1.0.4   2023-11-07 [2] CRAN (R 4.5.0)
#>  litedown       0.7     2025-04-08 [2] CRAN (R 4.5.0)
#>  magrittr       2.0.3   2022-03-30 [2] CRAN (R 4.5.0)
#>  markdown       2.0     2025-03-23 [2] CRAN (R 4.5.0)
#>  R6             2.6.1   2025-02-15 [2] CRAN (R 4.5.0)
#>  RColorBrewer   1.1-3   2022-04-03 [2] CRAN (R 4.5.0)
#>  rlang          1.1.6   2025-04-11 [2] CRAN (R 4.5.0)
#>  rmarkdown      2.29    2024-11-04 [2] CRAN (R 4.5.0)
#>  rstudioapi     0.17.1  2024-10-22 [2] CRAN (R 4.5.0)
#>  rversions      2.1.2   2022-08-31 [2] CRAN (R 4.5.0)
#>  scales         1.4.0   2025-04-24 [2] CRAN (R 4.5.0)
#>  sessioninfo    1.2.3   2025-02-05 [2] CRAN (R 4.5.0)
#>  stringi        1.8.7   2025-03-27 [2] CRAN (R 4.5.0)
#>  stringr        1.5.1   2023-11-14 [2] CRAN (R 4.5.0)
#>  svglite        2.2.0   2025-05-07 [2] CRAN (R 4.5.0)
#>  systemfonts    1.2.3   2025-04-30 [2] CRAN (R 4.5.0)
#>  textshaping    1.0.1   2025-05-01 [2] CRAN (R 4.5.0)
#>  vctrs          0.6.5   2023-12-01 [2] CRAN (R 4.5.0)
#>  viridisLite    0.4.2   2023-05-02 [2] CRAN (R 4.5.0)
#>  xfun           0.52    2025-04-02 [2] CRAN (R 4.5.0)
#>  xml2           1.3.8   2025-03-14 [2] CRAN (R 4.5.0)
#>  yaml           2.3.10  2024-07-26 [2] CRAN (R 4.5.0)
#> 
#>  [1] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.5-arm64/library
#>  * ── Packages attached to the search path.
#> 
#> ──────────────────────────────────────────────────────────────────────────────

References

DeBruine, L. (2023). Glossary: Glossaries for markdown and quarto documents. https://doi.org/10.32614/CRAN.package.glossary
Dyba, K. (2024). How to load and save vector data in R. https://r-spatial.org/r/2024/06/26/sf-load-save.html
Sharif, B., & Maletic, J. I. (2010). An eye tracking study on camelCase and under_score identifier styles. 2010 IEEE 18th International Conference on Program Comprehension, 196–205. https://doi.org/10.1109/ICPC.2010.41
Wimberly, M. C. (2022). Geographic Data Science with R. [Dataset on Figshare]. https://doi.org/10.6084/m9.figshare.21301212.v3
Wimberly, M. C. (2023a). Geographic Data Science With R: Visualizing and Analyzing Environmental Change (1st ed.). Chapman & Hall/CRC.
Wimberly, M. C. (2023b). Geographic data science with R: Visualizing and analyzing environmental change. https://bookdown.org/mcwimberly/gdswr-book/