Geographic Data Science with R: Visualizing and Analyzing Environmental Change
We live in a time of unprecedented environmental change, driven by the effects of fossil fuels on the Earth’s climate and the expanding footprint of human land use. To mitigate and adapt to these changes, there is a need to understand their myriad impacts on human and natural systems. Achieving this goal requires geospatial data on a variety of environmental factors, including climate, vegetation, biodiversity, soils, terrain, water, and human populations. Consistent monitoring is also necessary to identify where changes are occurring and determine their rates. Large volumes of relevant data are collected by Earth-observing satellites and ground-based sensors. However, the data alone are not enough. Using them effectively requires tools for appropriate manipulation and analysis.
The burgeoning field of data science has provided a wealth of techniques for analyzing large and complex datasets, including methods for descriptive, explanatory, and predictive analytics. However, actually applying these methods is typically a small part of the overall data science workflow. Other critical tasks include screening for suspect data, handling missing values, harmonizing data from multiple sources, summarizing variables for analysis, and visualizing data and analysis results. Although there are now many books available on statistical and machine learning methods, there are fewer that address the broader topic of scientific workflows for geospatial data processing and analysis.
The purpose of Geographic Data Science with R (GDSWR) is to fill this gap. GDSWR provides a series of tutorials aimed at teaching good practices for using time series and geospatial data to address topics related to environmental change. It is based on the R language and environment, which currently provides the best option for working with diverse sources of spatial and non-spatial data using a single platform. The book is not intended to provide a comprehensive overview of R. Instead, it uses an example-based approach to present practical approaches for working on diverse problems using a variety of datasets.
The material in GDSWR was originally developed for upper-level undergraduate and graduate courses in geospatial data science. It is also suitable for individual study by students or professionals who want to expand their capabilities for working with geospatial data in R. Although the book is not intended to be a comprehensive reference manual, it can also be useful for readers who are looking for examples of particular methods that can be modified for new applications. The tutorials focus on physical geography and draw upon a variety of data sources, including weather station data, gridded climate data, classified land cover data, and digital elevation models. It is my sincere hope that GDSWR will help readers increase their proficiency with R so that they can implement more sophisticated data science workflows that make effective use of diverse geographic data sources. These skills will allow them to address pressing scientific questions and develop new geospatial applications that can enhance our understanding of the changing world we inhabit.
GDSWR is aimed at readers who have already taken one or more introductory courses in Geographic Information Sciences (GIS) or who have experience working with GIS software and geospatial data. It assumes that readers are familiar with basic geospatial data structures, such as vector and raster data, along with basic cartographic concepts such as projections and coordinate systems. Readers who have a limited background in GIS and other geospatial technologies should consider reviewing an introductory text such as Paul Bolstad’s GIS Fundamentals (Bolstad 2019).
The R language and environment for statistical computing is one of the most widely used platforms for data analysis. Two of the major advantages of R are that it is available to users at no cost, and that it can be used with a multitude of add-on packages that provide customized methods for working with particular types of data or applying specialized analytical tools. However, this breadth and complexity also present a significant challenge for users who want to learn R and use it to analyze their own data. It is frequently remarked that anything that can be done in R can be accomplished in multiple ways. This flexibility can be a strength in the hands of an experienced user, but to learners, it is often overwhelming and baffling. These observations have strongly influenced the design of my university classes and workshops that incorporate R, and they inform the structure and content of Geographic Data Science with R (GDSWR).
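As a small illustration of this point, the sketch below computes the same group means in three different ways using only base R. The data frame `temps` and its columns are hypothetical values made up for this example; they do not come from the book's datasets.

```r
# Hypothetical weather station data (values invented for illustration)
temps <- data.frame(
  station = c("A", "A", "B", "B"),
  tmax = c(20, 22, 30, 34)
)

# Approach 1: tapply() applies mean() within each station
m1 <- tapply(temps$tmax, temps$station, mean)

# Approach 2: aggregate() with a formula returns a data frame
m2 <- aggregate(tmax ~ station, data = temps, FUN = mean)

# Approach 3: split the column by station, then apply mean() to each piece
m3 <- sapply(split(temps$tmax, temps$station), mean)
```

All three objects hold the same station means, but they differ in structure: `m1` and `m3` are named vectors, whereas `m2` is a data frame. Differences of this kind are one reason the range of options can be confusing to new users.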
The book is not intended to be a comprehensive introduction to R, but it provides a path for new users to quickly master basic skills and move on to more sophisticated data analysis. To this end, GDSWR begins with an introductory chapter that highlights key aspects of the base R language that provide a foundation for working with more specialized packages in the later chapters. Readers who want more extensive knowledge of R should consult one of the many introductory texts that are available, including Tilman Davies’ The Book of R (Davies 2016), and Rob Kabacoff’s R in Action (Kabacoff 2015).
GDSWR introduces several major R packages that facilitate the processing and analysis of tabular data along with vector and raster forms of geospatial data. In particular, the book makes extensive use of the tidyverse collection of R packages. More background on the general concepts of “tidy” data organization and the specific tidyverse packages and functions can be found in Wickham and Grolemund’s R For Data Science (Wickham and Grolemund 2016). The packages used in this book have been selected because they provide powerful and generalizable frameworks for data processing, visualization, and analysis. They are well maintained, have a broad user base, and are useful for a variety of data science applications. In many cases, additional functions from other packages could be incorporated to make particular tasks more convenient or efficient. However, the overall approach of GDSWR is to focus on a more limited set of powerful tools for students to master and eventually apply to a much broader set of problems.
To the extent possible, GDSWR has been organized so that the techniques learned in each chapter continue to be applied and expanded upon in the subsequent chapters. Chapter 1 provides a brief overview of important concepts in R that can serve as an introduction for readers who have not worked with R before or as a refresher for more experienced users. Chapter 2 then introduces the ggplot2 package (Wickham, Chang, et al. 2022) for scientific graphs. This package is used throughout the book to generate charts and maps. Chapter 3 introduces the dplyr and tidyr packages (Wickham, François, et al. 2022; Wickham and Girlich 2022) for the manipulation of non-spatial data frames. Chapter 4 provides an overview of how to work with dates using the lubridate package (Spinu, Grolemund, and Wickham 2021). These techniques are then applied to geospatial data in Chapter 5, which introduces the sf package (Pebesma 2022) for storing and processing spatial features and attributes. Chapters 6 and 7 introduce the terra package (Hijmans 2022) for manipulating and analyzing raster data, and Chapter 8 provides an overview of tools and approaches for integrating geographic datasets with different coordinate reference systems. Chapters 9 and 10 present examples of more complex analyses that combine multiple vector and raster datasets. Finally, Chapters 11 and 12 present examples of specific applications of the techniques covered in the book for analyzing patterns of wildfire severity and modeling species ranges with ecological niche models and climate data.
If you are new to R, it is best to start from the beginning and progress through the chapters sequentially, as each one builds on topics covered previously. More experienced R users, particularly those who have already worked with the packages in the tidyverse collection (Wickham 2021), should be able to skip to specific chapters of interest. Each chapter provides narrative text along with blocks of R code and the outputs from running the code. In the text, package names are in bold text (e.g., ggplot2), and inline code, function arguments, object classes, object names, and filenames are formatted in a typewriter font (e.g., myobject). Function names are followed by parentheses (e.g., ggplot()). There are many excellent learning resources and reference guides that expand on the topics covered in GDSWR. Recommendations for further study are provided at appropriate locations throughout the book.
The data files used in each chapter can be downloaded from https://doi.org/10.6084/m9.figshare.21301212 and used to run the code on your own system. One of the best ways to learn programming is to experiment by modifying existing code to see how the outputs change as a result. At the end of each chapter, there are several suggestions for how to practice by modifying the example code. When running the code, I recommend that readers use the RStudio graphical user interface (GUI) software and create a separate project for each chapter. The code for file input and output assumes that data will be read from and written to the working directory of the RStudio project. The appendix includes instructions on how to set up RStudio projects and associate them with folders in your computer’s file system.
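The working-directory convention described above can be sketched with a few lines of base R. The filename `example_output.csv` and the data frame `obs` are hypothetical and serve only to show that a bare filename is resolved relative to the working directory; they are not among the book's data files.

```r
# Show the current working directory; when a project is opened in RStudio,
# this is normally the project folder
getwd()

# Hypothetical data frame (values invented for illustration)
obs <- data.frame(year = c(2020, 2021), precip_mm = c(812, 765))

# A bare filename is a relative path, so the file is written to
# and read from the working directory
write.csv(obs, "example_output.csv", row.names = FALSE)
obs2 <- read.csv("example_output.csv")

# Remove the example file so the folder is left unchanged
file.remove("example_output.csv")
```

Because the paths are relative, the same code runs without modification on any computer, provided the data files sit in the project folder.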
Finally, readers should not hesitate to use the techniques herein for their own projects with new datasets. For example, simple data manipulation and graphing tasks that are done with a spreadsheet can be automated with R scripts. Spatial analyses that are done interactively with dedicated GIS software can similarly be translated into R code. Once various analysis steps are implemented in R, it is much easier to combine them into well-documented reproducible workflows. Making the transition from tutorials to designing analyses and writing your own code will require considerable trial and error. As with any endeavor, practice and perseverance are ultimately the keys to learning to use R effectively for geographic data science.
My intellectual growth as a scientist and my capabilities as a data analyst have benefited from the many fabulous colleagues with whom I have had the privilege to collaborate. In particular, I owe a debt of gratitude to all the students, postdocs, and research staff who have worked with me over the years. Furthermore, my development as a teacher has been strongly influenced by the many students who have taken my university courses and professional workshops. The approach to geospatial data science presented in this book represents many years of experimentation and refinement in response to their suggestions and feedback. Finally, this book would not have been possible without the unwavering support of my wonderful family. Thank you Anne, Alice, Zach, Slushie, Ivo, Sadie, and Pepper.
Mike Wimberly, Norman, OK