Introduction to Environmental Data Science
Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data (Wikipedia). A data science approach is especially suitable for applications involving large and complex data sets, and environmental data is a prime example, with rapidly growing collections from automated sensors in space and time domains.
Environmental data science is data science applied to environmental science research. In general data science can be seen as being the intersection of math & statistics, computer science/IT, and some research domain, and in this case it’s environmental:
1.1 Environmental Data and Methods
The methods needed for environmental research can include many things since environmental data can include many things, including environmental measurements in time and space domains.
- Data analysis and transformation methods
- importing and other methods to create rectangular data frames
- reorganization and creation of fields
- filtering observations
- data joins
- stratified statistical summaries
- reorganizing data, including pivots
- physical models
- statistical modeling
- models based on machine learning algorithms
- Spatial analysis & maps
- vector and raster spatial analysis, e.g.
- spatial joins
- distance analysis
- overlay analysis
- spatial statistics
- static and interactive maps
- vector and raster spatial analysis, e.g.
- Time series
- analyzing and visualizing long-term data records (e.g. for climate change)
- analyzing and visualizing high-frequency data from loggers
1.2 Goals of this book
While the methodological reach of data science is very great, and the spectrum of environmental data is as well, our goal is to lay the foundation and provide useful introductory methods in the areas outlined above, but as a “live” book be able to extend into more advanced methods and provide a growing suite of research examples with associated data sets. We’ll briefly explore some data mining methods that can be applied to so-called “big data” challenges, but our focus is on exploratory data analysis in general, applied to environmental data in space and time domains. Much of our data will be in fact be quite small, derived from field-based environmental measurements, and it will primarily be in the areas of time-series and imagery, where automated data capture is employed, when we’ll dip our toes into what you might call big-data issues.
1.2.1 Some definitions:
Big data: data having a size or complexity too big to be processed effectively by traditional software
- data with many cases or dimensions (including imagery)
- many applications in environmental science due to the great expansion of automated environmental data capture in space and time domains
- big data challenges exist across the spectrum of the environmental research process, from data capture, storage, sharing, visualization, querying
Data Mining: discovering patterns in large data sets
- databases collected by government agencies
- imagery data from satellite, aerial (including drone) sensors
- time-series data from long-term data records or high-frequency data loggers
- methods may involve machine learning / artificial intelligence / computer vision
Exploratory data analysis: procedures for analyzing data, techniques for interpreting the results of such procedures, ways of structuring data to make its analysis easier
1.3 Exploratory Data Analysis
Just as exploration is a part of what National Geographic has long covered, it’s an important part of geographic and environmental science research.
Exploratory Data Analysis is exploration applied to data, and has grown as an alternative approach to traditional statistical analysis. This basic approach perhaps dates back to the work of Thomas Bayes in the 18th century, but Tukey (1961) may have best articulated the basic goals of this approach in defining the “data analysis” methods he was promoting: “Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
- EDA is an approach to analyzing data via summaries and graphics. The key word is exploratory.
- In contrast to confirmatory statistics
- suggest hypotheses
- assess assumptions on which inference will be based
- select appropriate statistical tools
- guide further data collection
- Led to the development of S, then R
- Built on clear design and extensive, clear graphics, one key to exploring data
- S Developed at Bell Labs by John Chambers, 1976
- R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues.
1.4 Software and data we’ll need
First, we’re going to use the R language. It’s not the only way to do data analysis, and probably Python is the leading data science overall, but R is clearly the leading such language for academic research, especially in the environmental sciences.
For a start, you’ll need to have R and RStudio installed, and at least the following packages:
- tidyverse (to include ggplot2, dplyr, tidyr, stringr, etc.)
There will be other packages we’ll meet along the way.
To install these packages, the following code will work:
install.packages(c("tidyverse", "lubridate", "sf", "raster", "tmap"))
You can always add more packages as needed, do them one at a time, whatever. But
generally don’t reinstall the packages again with the unless you actually want to reinstall it, maybe because it’s been updated. So generally I don’t include
install.packages() in my script. Once installed, you can access the packages with the library function, e.g.
which you will want to include in your script.
We’ll be using data from various sources, including data on CRAN like the code packages
above which you install the same way – so use
We’ve also created a repository on GitHub that includes data we’ve developed in the Institute for Geographic Information Science (iGISc) at SFSU, and you’ll need to install that package a slightly different way.
GitHub packages require a bit more work on the user’s part since we need to first install
remotes2, then use that to install the GitHub data package:
Then you can access it just like other built-in data by including:
To see what’s in it, you’ll see the various datasets listed in:
For instance, the following is a map of California counties using the CA_counties sf (simple features) data:
library(tidyverse); library(iGIScData); library(sf) ggplot(data=CA_counties) + geom_sf()
The package datasets can be used directly as sf data (if the sf library is installed) or data frames (all tibbles). Raw data can also be read from the
extdata folder that is installed on your computer when you install the package, using code such as:
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData") TRI87 <- read_csv(csvPath)
or something similar for shapefiles, such as:
shpPath <- system.file("extdata","trails.shp", package="iGIScData") trails <- st_read(shpPath)
“Dandelion fluff – Ephemeral stalk sheds seeds to the universe”↩︎
Note: you can also use
remotesif you have that installed. They do the same thing;
remotesis a subset of
devtools. If you see a message about Rtools, you can ignore it since that is only needed for building tools from C++ and things like that.↩︎