10 Machine Learning for Imagery Classification

10.1 Introduction

Using machine learning algorithms is one approach to imagery classification, whether that classification is object-based or pixel-based. We’re going to focus on a pixel-based approach to identify land cover from imagery or other continuous raster variables (such as elevation, or elevation-derived rasters like slope, curvature, and roughness), employing samples of those variables called training samples in a “supervised” imagery classification.

It’s useful to realize that the modeling methods we use for this type of imagery classification are at their core no different from methods we’d use with continuous variables that might not even be spatial. Setting the classification process aside for a moment, a machine learning algorithm might be used to predict a response from a series of predictor variables: temperature might be predicted from elevation and latitude, or acceleration from a force acting on a body. So the first model might be used to predict the temperature of any location (within the boundary of the study area) given an elevation and latitude; the second might predict an acceleration given the magnitude of a force applied to a body of a given mass. A classification model varies on this by predicting a nominal variable such as type of land cover; other types of responses might be counts (using a Poisson model) or probabilities (using a logistic model).
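To see the parallel, here is a minimal non-spatial sketch of the first model in R, using a made-up data frame of weather-station readings (the stations object and its temp, elev, and lat columns are hypothetical):

# Hypothetical weather-station readings: temperature (°C), elevation (m), latitude
stations <- data.frame(temp = c(18.2, 15.1, 9.4, 12.7),
                       elev = c(120, 850, 2100, 1400),
                       lat  = c(37.1, 38.0, 38.4, 37.6))
tempModel <- lm(temp ~ elev + lat, data = stations)  # fit the regression model
predict(tempModel, newdata = data.frame(elev = 1000, lat = 37.9))  # predict a new location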

The imagery classification approach adds to this model an input preparation process and an output prediction process:

  • A training set of points and polygons is created representing areas of known classification, such as land cover classes like forest or wetland, and is used to identify values of the predictor variables (e.g. imagery bands)
  • A predicted raster is created by applying the model to the original rasters (both steps are sketched in code below).
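As a minimal sketch of those two steps, assuming a hypothetical multiband raster “predictors.tif” and training samples “train_samples.shp” with a class field (with randomForest as a stand-in classifier; the actual model training is covered later in the chapter):

library(raster)
library(sf)
library(randomForest)
predStack <- brick("predictors.tif")       # hypothetical multiband predictor raster
trainPoly <- st_read("train_samples.shp")  # hypothetical training polygons
# 1. Extract predictor values (band values) at the training locations
trainDF <- extract(predStack, trainPoly, df = TRUE)
trainDF$class <- as.factor(trainPoly$class[trainDF$ID])  # match known classes to extracted cells
# 2. Fit a classifier, then apply it to every cell of the raster
fit <- randomForest(class ~ . - ID, data = trainDF)
predicted <- predict(predStack, fit)       # raster::predict creates the classified raster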

This chapter delves into methods in the subdiscipline of remote sensing, which students are encouraged to learn more about in courses such as GEOG 610 and 611 at San Francisco State University (Department of Geography & Environment, San Francisco State University (n.d.)). Some key things to learn more about are the nature of the electromagnetic spectrum, in particular the bands that are informative for detecting land cover and especially vegetation, and the process of image classification.

10.1.1 Validation, creating a training set and the “overfit model” problem

An important part of the imagery classification process is validation, where we look at how well the model works. The way this is done is pretty easy to understand, and requires having testing data in addition to the training data mentioned above. Testing data can be created in a variety of ways; here we’ll randomly pull out a percentage of our training data and withhold it from training, using it only for testing. Then we’ll employ a confusion matrix and various measures of accuracy to see how well the model predicts the testing data.
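A minimal sketch of that split-and-test workflow with caret (loaded later in this chapter), assuming a hypothetical data frame trainDF of band values with a factor column class:

library(caret)
idx <- createDataPartition(trainDF$class, p = 0.7, list = FALSE)  # 70% of rows for training
training <- trainDF[idx, ]
testing  <- trainDF[-idx, ]
fit  <- train(class ~ ., data = training, method = "rf")  # e.g. a random forest classifier
pred <- predict(fit, newdata = testing)
confusionMatrix(pred, testing$class)  # accuracy measures against the held-out data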

10.1.1.1 The “overfit model” problem

It’s important to realize that the accuracy we determine is based only on the training and testing data. The accuracy of the classification elsewhere in the image will likely be somewhat lower, and if that drop is substantial our model is “overfit.” We don’t actually know how overfit a model truly is, because that depends on how likely the conditions seen in our training and testing data are to occur throughout the rest of the image; if those conditions are common, just not sampled, then the model might actually be pretty well fit.

In thinking about overfit models and selecting training (and testing) sets, it’s useful to consider the purpose of our classification and how important it is that our predictions are absolutely reliable. In choosing training sets, accuracy is also important, so we will want to make sure they are good representatives of the land cover type (assuming that’s our response variable). While some land covers are pretty clear (like streets or buildings), there’s a lot of fuzziness in the world: you might be trying to identify wetland conditions based on the type of vegetation growing, but in a meadow you can commonly find wetland species mixed in with more mesic species. To pick a reliable wetland sample we might want to choose only areas with exclusively wetland species (and this can get tricky, given the challenges of distinguishing “obligate wetland” from “facultative wetland” species). It might actually be more appropriate to use a probability model for this, but we can also take a spatial approach to prediction and accept that some areas will simply tend more toward one class than another.

10.2 Imagery Setup and Visualization Methods

Since machine learning algorithms are very much black boxes, getting them working right mostly requires providing good data inputs and tweaking the variety of parameters that control how model training proceeds. Fortunately there are many good online resources for learning more.

As with everything else in R, the specific coding methods used vary with the coder and what they’re used to. If a method works, it’s certainly OK to keep using it, but where it can be expressed with clearer code, such as from the tidyverse or sf, that may make it easier to understand and improve on.

This code accesses Sentinel-2 imagery downloaded from the Copernicus Open Access Hub (n.d.) covering Red Clover Valley, where we are doing research with The Sierra Fund on biogeochemical cycles and carbon sequestration from hydrologic restoration (Clover Valley Ranch Restoration, The Sierra Fund (n.d.)). The imagery was cropped to that site in a previous script and saved to folders named in the form cropYYYYMMDD, such as “crop20210708”. The following code chunk specifies the date, with multiple cropped images stored in iGIScData.

imgDate <- "20210708"
# other dates stored:  "20200802"
library(rgdal)         # [readOGR]
library(gdalUtils)
library(raster)        # [writeRaster, brick, beginCluster, unique, projection, rasterToPoints]
library(sf)
library(sp)            # [spTransform]
library(RStoolbox)     # Remote Sensing toolbox.  [normImage]
library(getSpatialData)# for reading Sentinel, Landsat, MODIS, SRTM [set_aoi, view_aoi]
# library(rasterVis)   # terra required, creates error in module
                       # "spat" -- function 'Rcpp_precious_remove' not provided by package 'Rcpp'
library(mapview)       # [viewRGB, mapview]
library(RColorBrewer)
library(plotly)        # [plot_ly]
library(grDevices)

library(caret)         # [createDataPartition, createFolds, trainControl, train, confusionMatrix]

library(data.table)    # [setDT, setnames]
library(dplyr)
library(stringr)
library(doParallel)    # [registerDoParallel]
library(snow)          # [makeCluster, stopCluster]
library(parallel)      # [makeCluster, stopCluster, detectCores]

# set the temporary folder for raster package operations
unlink("./cache", recursive = T) # Deletes crop folder if it already exists
dir.create(file.path(".", "cache")) # will create warning if it already exists
dir.create(file.path("./cache", "temp"))
rasterOptions(tmpdir = "./cache/temp")
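Several of the packages loaded above (doParallel, snow, parallel) provide parallel processing, which can substantially speed up model training later in the chapter. A minimal sketch of registering a cluster that caret::train() will then use automatically (the core count is machine-dependent):

cl <- parallel::makeCluster(parallel::detectCores() - 1)  # leave one core free
doParallel::registerDoParallel(cl)                        # register the parallel backend
# ... model training runs here ...
parallel::stopCluster(cl)                                 # release the workers when done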

10.2.1 Read all 10m & 20m bands into a list of 10 bands

As documented at the Copernicus Open Access Hub (n.d.), Sentinel-2 imagery is collected at three resolutions, with the most bands at the coarsest (60 m) resolution. The bands added at that coarsest resolution are not critical for our work, as they relate to oceanographic and atmospheric research, while our focus is on land cover and vegetation in a terrestrial area. So we’ll work with four bands at 10 m and an additional six bands at 20 m resolution:

10 m bands

  • B02 - Blue 0.490 µm
  • B03 - Green 0.560 µm
  • B04 - Red 0.665 µm
  • B08 - NIR 0.842 µm

20 m bands

  • B05 - Red Edge 0.705 µm
  • B06 - Red Edge 0.740 µm
  • B07 - Red Edge 0.783 µm
  • B11 - SWIR 1.610 µm
  • B12 - SWIR 2.190 µm
  • B8A - NIR 0.865 µm

dtaPath <- system.file("extdata", "RCVimagery", package="iGIScData")
imgList <- list.files(paste0(dtaPath, "/crop", imgDate), pattern = "\\.tif$", full.names = TRUE)
rst_lst <- lapply(imgList, FUN = raster) # Read each of the 10 band files into a RasterLayer
names(rst_lst) <- str_extract(sapply(rst_lst, names), "B.{2}")
   # sapply here returns a vector using the names() function, 
   # then str_extract() just gets the part of file name with B followed by any 2 characters 

Note that list.files() returns the file names in alphabetical order, so the 10 m B08 will follow the 20 m B07:

str_extract(sapply(rst_lst, names), "B.{2}") # shows the order of bands, with B08 (NIR) as the 7th
##  [1] "B02" "B03" "B04" "B05" "B06" "B07" "B08" "B11" "B12" "B8A"

10.2.2 Visualize the original imagery

Displaying the imagery three bands at a time (displayed as RGB on our computer screen) is always a good place to start, and two especially useful band sets are RGB itself, looking like conventional color aerial photography, and “false color,” which includes a band normally invisible to our eyes, such as near infrared, which healthy, chlorophyll-rich plants reflect strongly.

A RasterBrick is commonly used to store multiband imagery, and we’ll create a brick from the imagery currently stored in a list.

viewRGB(brick(rst_lst[1:3]), r = 3, g = 2, b = 1)  # [mapview::viewRGB, raster::brick]
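For the false-color view mentioned above, one common composite displays near infrared as red, the red band as green, and the green band as blue. Assuming the band order shown earlier (B03, B04, and B08 are elements 2, 3, and 7 of rst_lst), a sketch:

viewRGB(brick(rst_lst[c(2, 3, 7)]), r = 3, g = 2, b = 1)  # B08 (NIR) as red, B04 as green, B03 as blue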