2 Introduction to R

We’re assuming you’re either new to R or need a refresher. We’ll start with some basic R operations entered directly in the console in RStudio.

RStudio

If you’re new to RStudio or would like to get more out of it, there are plenty of resources available. As with many of the major packages we’ll explore, there’s even a cheat sheet: https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf

You’ll want to know about:

  • The Console where you’ll enter short lines of code, install packages, and get help on functions. Messages created from running code will also be displayed here. There are other tabs in this area (e.g. Terminal, R Markdown) we may explore a bit, but mostly we’ll use the console.
  • The Source Editor where you’ll write full R scripts and R Markdown documents. You should get used to writing complete scripts and R Markdown documents as we go through the book.
  • Various Tab Panes such as the Environment pane where you can explore what variables and more complex objects contain.
  • The Plots pane in the lower right for static plots (graphs and maps that aren’t interactive). The same pane area also has tabs to see a listing of Files and a Viewer for interactive maps and tables.

2.1 Variables

Variables are objects that store values. As in math, every computer language stores values by assigning them constants or the results of expressions: x <- 5 uses R’s standard assignment operator <-, though you can also use =. We’ll use <- because it is more common and avoids some confusion with other syntax.

Variable names must start with a letter, contain no spaces, and should not reuse names that are built into the R language or used in packages, such as reserved words like for or function names like log().

x <- 5
y <- 8
longitude <- -122.4
latitude <- 37.8
my_name <- "Inigo Montoya"

To check the value of a variable or other object, you can just enter its name in the console, or even on its own line in a code chunk.

x
## [1] 5
y
## [1] 8
longitude
## [1] -122.4
latitude
## [1] 37.8
my_name
## [1] "Inigo Montoya"

This differs from how values are normally printed out in programming, and you’ll need to know the programming approach as well: you’ll want to use your code to develop tools that accomplish things, and there are also limits to what you can see by just naming variables.

To see the values of variables in programming mode, use the print() function, or to concatenate character string output, use paste():

print(x)
## [1] 5
print(y)
## [1] 8
print(latitude)
## [1] 37.8
paste("The location is latitude", latitude, "longitude", longitude)
## [1] "The location is latitude 37.8 longitude -122.4"
# Or to get fancy:
deg <- rawToChar(as.raw(176))  # creates a degree symbol
paste0("The location is latitude ", latitude, deg, ", longitude ", longitude, deg)
## [1] "The location is latitude 37.8°, longitude -122.4°"
paste0("My name is ", my_name, ". You killed my father. Prepare to die.")
## [1] "My name is Inigo Montoya. You killed my father. Prepare to die."

Review the code above and what it produces. Without looking it up, what’s the difference between paste() and paste0()?

2.2 Functions

Once you have variables or other objects to work with, most of your work involves functions, such as these well-known math functions:

log10(100)
log(exp(5))
cos(pi)
sin(90 * pi/180)

There are far too many functions to name, even among the base functions, not to mention all the packages we will want to use. You will likely have already used the install.packages() and library() functions, which bring in an array of other functions.

Later in this chapter, we’ll also learn how to write our own functions, a capability that is easy to accomplish and also gives you a sense of what developing your own package might be like.

Arithmetic operators There are of course all the normal arithmetic operators (which are actually functions), like plus + and minus -, as well as the keystroke approximations of the multiply * and divide / operators. You’re probably familiar with these approximations from writing equations in Excel, if not in some other programming language you may have learned. These operators look a bit different from how they’d appear in a nicely formatted equation.

For example, \(\frac{NIR - R}{NIR + R}\) instead has to look like (NIR-R)/(NIR+R). Similarly * must be used to multiply; there’s no implied multiplication that we expect in a math equation like \(x(2+y)\) which would need to be written x*(2+y).

In contrast to those four well-known operators, the symbol used to exponentiate – raise to a power – varies among programming languages. R uses ** (the caret ^ also works), so the Pythagorean theorem \(c^2=a^2+b^2\) would be written c**2 = a**2 + b**2, except for the fact that it wouldn’t make sense as a statement to R.

We’ll need to talk about expressions and statements.

2.3 Expressions and Statements

The concepts of expressions and statements are very important to understand in any programming language.

An expression in R (or any programming language) has a value just like a variable has a value. An expression will commonly combine variables and functions to be evaluated to derive the value of the expression. Here are some examples of expressions:

5
x
x*2
sin(x)
sqrt(a**2 + b**2)
(-b+sqrt(b**2-4*a*c))/(2*a)
paste("My name is", aname)

Note that some of those expressions used previously assigned variables – x, a, b, c, aname.

An expression can be entered in the console to display its current value, and this is commonly done in R for objects of many types and complexity.

cos(pi)
## [1] -1
Nile
## Time Series:
## Start = 1871 
## End = 1970 
## Frequency = 1 
##   [1] 1120 1160  963 1210 1160 1160  813 1230 1370 1140  995  935 1110  994 1020  960 1180  799  958 1140 1100 1210 1150 1250 1260 1220
##  [27] 1030 1100  774  840  874  694  940  833  701  916  692 1020 1050  969  831  726  456  824  702 1120 1100  832  764  821  768  845
##  [53]  864  862  698  845  744  796 1040  759  781  865  845  944  984  897  822 1010  771  676  649  846  812  742  801 1040  860  874
##  [79]  848  890  744  749  838 1050  918  986  797  923  975  815 1020  906  901 1170  912  746  919  718  714  740

A statement in R does something. It represents a directive we’re assigning to the computer, or maybe the environment we’re running on the computer (like RStudio, which then runs R). A simple print() statement seems a lot like what we just did when we entered an expression in the console, but recognize that it does something:

print("Hello, World")
## [1] "Hello, World"

That is the same as what we get by just typing “Hello, World” in the console, but only because the console’s job is to display the value of whatever we enter (there, we are the ones doing something), or whatever our statement asks it to display.

Statements in R are usually put on one line, but you can use a semicolon to have multiple statements on one line, if desired:

x <- 5; print(x); print(x**2)
## [1] 5
## [1] 25

Many (perhaps most) statements don’t actually display anything. For instance:

x <- 5

doesn’t display anything, but it does assign the value 5 to the variable x, so it does something: it’s an assignment statement, using the special assignment operator <-. Most languages just use =, which the designers of R avoided so it wouldn’t be confused with the equal sign meaning “is equal to.”

An assignment statement assigns the value of an expression to a variable. If that variable already exists, it is reused with the new value. It’s completely legitimate (and common in coding) to use the variable being assigned within the expression itself, as when updating a counter variable:

i = i + 1

You’re simply updating the index variable with the next value. This also illustrates why it’s not an equation: \(i=i+1\) doesn’t work as an equation (unless i is actually \(\infty\) but that’s just really weird.)

And c**2 = a**2 + b**2 doesn’t make sense as an R statement because c**2 isn’t a variable to be created. The ** part is interpreted as raise to a power. What is to the left of the assignment operator = must be a variable to be assigned the value of the expression.
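
To make that concrete, here’s a minimal sketch of how the Pythagorean relationship can be written as a valid assignment statement, with a and b given illustrative values (3 and 4, chosen to make a 3-4-5 triangle):

a <- 3
b <- 4
c <- sqrt(a**2 + b**2)  # the expression on the right is evaluated, then assigned to c
c
## [1] 5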

2.4 Data Types

Variables, constants and other data elements in R have data types. Common types are numeric and character.

x <- 5
class(x)
## [1] "numeric"
class(4.5)
## [1] "numeric"
class("Fred")
## [1] "character"

2.4.1 Integers

By default, R creates double-precision floating-point numeric variables. To create integer variables:

  • append an L to a constant, e.g. 5L is an integer 5
  • convert with as.integer()

We’re going to be looking at various as. functions in R, more on that later, but we should look at as.integer() now. Most other languages use int() for this; it converts any number into an integer by truncating it, not rounding it.

as.integer(5)
## [1] 5
as.integer(4.5)
## [1] 4

To round a number, there’s the round() function, or for positive numbers you can instead use as.integer() after adding 0.5:

x <- 4.8
y <- 4.2
as.integer(x + 0.5)
## [1] 5
round(x)
## [1] 5
as.integer(y + 0.5)
## [1] 4
round(y)
## [1] 4
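
One caution with the add-0.5 trick, shown here with a made-up negative value: as.integer() truncates toward zero, so it only mimics rounding for positive numbers, while round() handles negatives correctly:

z <- -4.8
as.integer(z + 0.5)  # -4.3 truncated toward zero
## [1] -4
round(z)
## [1] -5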

Integer division:

5 %/% 2
## [1] 2

Integer remainder from division (the modulo operation, using the %% operator):

5 %% 2
## [1] 1

Surprisingly, the values returned by integer division or the remainder above are not stored as integers, since the inputs were ordinary double-precision numbers; R seems to prefer floating point…
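
A quick check with class() confirms this, and shows that integer inputs do give an integer result:

class(5 %/% 2)    # double inputs give a double (numeric) result
## [1] "numeric"
class(5L %/% 2L)  # integer inputs give an integer result
## [1] "integer"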

2.5 Rectangular data

A common data format used in most types of research is rectangular data such as in a spreadsheet, with rows and columns, where rows might be observations and columns might be variables. We’ll read this type of data in from spreadsheets or, even more commonly, from comma-separated values (CSV) files.

library(iGIScData)
sierraFeb
## # A tibble: 62 x 9
##    STATION_NAME                      COUNTY    ELEVATION LATITUDE LONGITUDE PRECIPITATION TEMPERATURE  resid predict
##    <chr>                             <fct>         <dbl>    <dbl>     <dbl>         <dbl>       <dbl>  <dbl>   <dbl>
##  1 GROVELAND 2, CA US                Tuolumne      853.      37.8     -120.         176.          6.1 -0.574   6.67 
##  2 CANYON DAM, CA US                 Plumas       1390.      40.2     -121.         164.          1.4 -2.00    3.40 
##  3 KERN RIVER PH 3, CA US            Kern          824.      35.8     -118.          67.1         8.9  2.05    6.85 
##  4 DONNER MEMORIAL ST PARK, CA US    Nevada       1810.      39.3     -120.         167.         -0.9 -1.74    0.840
##  5 BOWMAN DAM, CA US                 Nevada       1641.      39.5     -121.         277.          2.9  1.03    1.87 
##  6 GRANT GROVE, CA US                Tulare       2012.      36.7     -119.         186.          1.7  2.09   -0.394
##  7 LEE VINING, CA US                 Mono         2072.      38.0     -119.          71.9         0.4  1.16   -0.760
##  8 OROVILLE MUNICIPAL AIRPORT, CA US Butte          57.9     39.5     -122.         138.         10.3 -1.23   11.5  
##  9 LEMON COVE, CA US                 Tulare        156.      36.4     -119.          62.7        11.3  0.373  10.9  
## 10 CALAVERAS BIG TREES, CA US        Calaveras    1431       38.3     -120.         254           2.7 -0.450   3.15 
## # ... with 52 more rows

2.6 Data Structures in R

We’ve looked briefly at the numeric and character string data types (we’ll abbreviate the latter simply as “string” from here on). We’ll also look at factors and dates/times later on.

2.6.1 Vectors

A vector is an ordered collection of values such as numbers or strings. What we mostly refer to as vectors are formally called atomic vectors, which must be homogeneous sets of a single type: a vector of numbers, or a vector of strings, or a vector of dates/times.

You can create a simple vector with the c() function:

lats <- c(37.5,47.4,29.4,33.4)
lats
## [1] 37.5 47.4 29.4 33.4
states = c("VA", "WA", "TX", "AZ")
states
## [1] "VA" "WA" "TX" "AZ"
zips = c(23173, 98801, 78006, 85001)
zips
## [1] 23173 98801 78006 85001

The class of a vector is the type of data it holds:

temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7)
class(temp)
## [1] "numeric"

Let’s also introduce the handy str() function, which in one step gives you a view of both the class of an item and its content, in other words its structure. We’ll often use it instead of listing a vector and its class separately.

str(temp)
##  num [1:6] 10.7 9.7 7.7 9.2 7.3 6.7

Vectors can only have one data class; if numeric and character values are mixed, the numeric elements are coerced to character:

mixed <- c(1, "fred", 7)
str(mixed)
##  chr [1:3] "1" "fred" "7"
mixed[3]   # gets a subset, example of coercion
## [1] "7"

2.6.1.1 NA

Data science requires dealing with missing data by storing some sort of null value, called various things:

  • null
  • nodata
  • NA, “not available” or “not applicable”

as.numeric(c("1","Fred","5")) # note NA introduced by coercion
## [1]  1 NA  5

Ignoring NA values in statistical summaries is a common need. Normally, a summary statistic applied to data containing NA can only return NA…

mean(as.numeric(c("1", "Fred", "5")))
## [1] NA

… with na.rm=T you can still get the result for all actual data:

mean(as.numeric(c("1", "Fred", "5")), na.rm=T)
## [1] 3

Don’t confuse NA with NaN (“not a number”), which is returned for undefined numeric results such as sqrt(-1), where the answer would otherwise be an imaginary number (explore the help for more on this):

is.na(NA)
## [1] TRUE
is.nan(NA)
## [1] FALSE
is.na(as.numeric(''))
## [1] TRUE
is.nan(as.numeric(''))
## [1] FALSE
i <- sqrt(-1)
is.na(i) # interestingly nan is also na
## [1] TRUE
is.nan(i)
## [1] TRUE

2.6.1.2 Sequences

Sequences are an easy way to make a vector from a regular series of values. The following three examples are equivalent:

seq(1,10)
c(1:10)
c(1,2,3,4,5,6,7,8,9,10)

The seq() function has additional capabilities, like specifying a step with its by parameter:

seq(2,10,2)
## [1]  2  4  6  8 10
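
You can also ask seq() for a given number of values instead of a step size, using its length.out parameter (the values here are just illustrative):

seq(0, 1, length.out = 5)
## [1] 0.00 0.25 0.50 0.75 1.00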

2.6.1.3 Vectorization and vector arithmetic

Arithmetic on vectors operates element-wise:

elev <- c(52,394,510,564,725,848,1042,1225,1486,1775,1899,2551)
elevft <- elev / 0.3048
elevft
##  [1]  170.6037 1292.6509 1673.2283 1850.3937 2378.6089 2782.1522 3418.6352 4019.0289 4875.3281 5823.4908 6230.3150 8369.4226

Another example, with 2 vectors:

temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
tempdiff <- temp03 - temp02
tempdiff
##  [1] 2.4 1.7 1.7 1.7 1.6 1.7 2.7 2.6 1.9 2.7 2.0 2.3
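
The elevft conversion above already relied on recycling: when vectors have different lengths, R repeats the shorter one to match, which is also how a single number gets applied to every element. A small sketch with made-up values:

c(1, 2, 3, 4) * 10            # the single value 10 is recycled across all four elements
## [1] 10 20 30 40
c(1, 2, 3, 4) + c(0, 100)     # the shorter vector is recycled to c(0, 100, 0, 100)
## [1]   1 102   3 104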

2.6.1.4 Plotting vectors

Vectors of Feb temperature, elevation and latitude at stations in the Sierra:

temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)

Plot individually

We’ll use the plot() function to visualize what’s in a vector. The plot() function creates output based on its best guess of what you’re wanting to see, depending on the nature of the data you provide it. We’ll be looking at a lot of ways to visualize data soon, but it’s often useful to just see what plot() gives you. For a single vector, it makes a bivariate plot with the sequential index of the vector (1 through its length) on the x axis and the values on the y axis.

plot(temp)

Figure 2.1: Temperature simply plotted by index

plot(elev)

Figure 2.2: Elevation plotted by index

plot(lat)

Figure 2.3: Latitude plotted by index

Then plot as a scatterplot

If we provide two vectors, we’ll get a more useful bivariate scatter plot.

plot(elev,temp)

Figure 2.4: Temperature~Elevation

2.6.1.5 Named indices

Vector indices can be named.

fips <- c(16, 30, 56)
str(fips)
##  num [1:3] 16 30 56
fips <- c(idaho = 16, montana = 30, wyoming = 56)
str(fips)
##  Named num [1:3] 16 30 56
##  - attr(*, "names")= chr [1:3] "idaho" "montana" "wyoming"

Why? Mainly so you can refer to observations by name instead of index. The following are equivalent:

fips[2]
## montana 
##      30
fips["montana"]
## montana 
##      30

The names() function can be used to display a character vector of names, or assign names from a character vector:

names(fips) # returns a character vector of names
## [1] "idaho"   "montana" "wyoming"
names(fips) <- c("Idaho","Montana","Wyoming")
names(fips)
## [1] "Idaho"   "Montana" "Wyoming"

2.6.2 Lists

Lists can be heterogeneous, with elements of multiple classes. Lists are actually used a lot in R and are created by many operations, so we’ll all need to understand them better eventually, but we’ll mostly avoid them for a while…
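
Just so you can recognize one when an operation hands it to you, here’s a minimal sketch of a list holding elements of different classes (the names and values are made up for illustration):

sta <- list(name = "Oroville", elev = 52, temps = c(10.7, 9.7, 7.7))
str(sta)
## List of 3
##  $ name : chr "Oroville"
##  $ elev : num 52
##  $ temps: num [1:3] 10.7 9.7 7.7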

2.6.3 Matrices

Vectors are commonly used as columns in a matrix (or, as we’ll see, a data frame), where they work like variables:

temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)

Building a matrix from vectors as columns

sierradata <- cbind(temp, elev, lat)
class(sierradata)
## [1] "matrix" "array"

2.6.3.1 Dimensions for arrays and matrices

Note: a matrix is just a two-dimensional array; arrays in general can have 1, 2, 3, or more dimensions.

dim(sierradata)
## [1] 12  3
a <- 1:12
dim(a) <- c(3, 4)   # matrix
class(a)
## [1] "matrix" "array"
dim(a) <- c(2,3,2)  # 3D array
class(a)
## [1] "array"
dim(a) <- 12        # 1D array
class(a)
## [1] "array"
b <- matrix(1:12, ncol=1)  # 1 column matrix is allowed
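
Individual elements of a matrix are accessed with [row, column] subscripts, a form we’ll see again with data frames; a few illustrative lines using the sierradata matrix built above:

sierradata[1, ]         # first row (the first station) as a named vector
sierradata[, "elev"]    # the elev column as a plain vector
sierradata[3, "temp"]   # a single value: the third station's temperature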

2.6.4 Data frames

A data frame is a database-like table with fields (stored as vectors) and records (rows), so it is very important for data analysis and GIS. Data frames are kind of like a spreadsheet with rules (the first row holds the field names, and each field is all one type). So even though they’re more complex than a list, we use them so frequently that they become quite familiar.

Before we get started, we’re going to use the palmerpenguins data set (Horst, Hill, and Gorman (2020)), so you’ll need to install it if you haven’t yet, and I’d encourage you to learn more about it at https://allisonhorst.github.io/palmerpenguins/articles/intro.html. It will be useful for a variety of demonstrations using numerical morphometric variables as well as a couple of categorical factors (species and island).


Figure 2.5: The three penguin species in palmerpenguins. Photos by KB Gorman. Used with permission

Figure 2.6: Palmer Station, Antarctic Peninsula

We’ll use a couple of alternative table display methods, first a simple one…

library(palmerpenguins)
penguins
## # A tibble: 344 x 8
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int> <fct>  <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750 male    2007
##  2 Adelie  Torgersen           39.5          17.4               186        3800 female  2007
##  3 Adelie  Torgersen           40.3          18                 195        3250 female  2007
##  4 Adelie  Torgersen           NA            NA                  NA          NA <NA>    2007
##  5 Adelie  Torgersen           36.7          19.3               193        3450 female  2007
##  6 Adelie  Torgersen           39.3          20.6               190        3650 male    2007
##  7 Adelie  Torgersen           38.9          17.8               181        3625 female  2007
##  8 Adelie  Torgersen           39.2          19.6               195        4675 male    2007
##  9 Adelie  Torgersen           34.1          18.1               193        3475 <NA>    2007
## 10 Adelie  Torgersen           42            20.2               190        4250 <NA>    2007
## # ... with 334 more rows

… then a fancier table display using the DT package, improved a bit with an option:

DT::datatable(penguins, options=list(scrollX=T))

… or we could use knitr’s kable function along with the kableExtra package to provide some other formatting advantages, such as including a numbered table caption for a “knitted” R Markdown document or book like this. In this case we’ll specifically choose just the start of the data frame, called the “head”…

library(kableExtra)
knitr::kable(
  head(penguins, 10), booktabs = TRUE,
  caption = 'first 10 rows of the palmerpenguins data') %>%
  kable_styling("striped")
Table 2.1: first 10 rows of the palmerpenguins data

species  island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex     year
Adelie   Torgersen  39.1            18.7           181                3750         male    2007
Adelie   Torgersen  39.5            17.4           186                3800         female  2007
Adelie   Torgersen  40.3            18.0           195                3250         female  2007
Adelie   Torgersen  NA              NA             NA                 NA           NA      2007
Adelie   Torgersen  36.7            19.3           193                3450         female  2007
Adelie   Torgersen  39.3            20.6           190                3650         male    2007
Adelie   Torgersen  38.9            17.8           181                3625         female  2007
Adelie   Torgersen  39.2            19.6           195                4675         male    2007
Adelie   Torgersen  34.1            18.1           193                3475         NA      2007
Adelie   Torgersen  42.0            20.2           190                4250         NA      2007

… not to be confused with a penguin head, which we’ll show to define a couple of morphometric terms used in the data frame:


Figure 2.7: Diagram of penguin head with indication of bill length and bill depth (from Horst, Hill, and Gorman (2020), used with permission)

You’ll find that things display a bit differently in RStudio than in the R Markdown method seen in this book. There are a lot of options for the knitr::kable and DT::datatable functions.

Creating a data frame out of a matrix

There are many functions that start with as. that convert things to a desired type. We’ll use as.data.frame() to create a data frame out of a matrix.

mydata <- as.data.frame(sierradata)
DT::datatable(mydata)

Then to plot the two variables that are now part of the data frame, we’ll need to access them as vectors again, using the $ accessor.

plot(mydata$elev, mydata$temp)

Figure 2.8: Temperature~Elevation
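
As an aside, the data frame could also be built directly from the three vectors with data.frame(), skipping the matrix step (a minimal sketch); unlike the cbind() route, this would also preserve differing column types if some vectors weren’t numeric:

mydata2 <- data.frame(temp, elev, lat)   # column names are taken from the vector names
plot(mydata2$elev, mydata2$temp)         # produces the same plot as above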

Read a data frame from a CSV

We’ll be looking at this more in the next chapter, but a common need is to read data from a spreadsheet stored in the CSV format. Normally, you’d have that stored with your project and can just specify the file name, but we’ll access CSVs from the iGIScData package. Since you have this installed, it will already be on your computer, but not in your project folder. The path to it can be derived using the system.file() function.

Reading a csv in readr (part of the tidyverse that we’ll be looking at in the next chapter) is done with read_csv():

library(readr)
csvPath <- system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
DT::datatable(TRI87, options=list(scrollX=T))

Sort, Index, & Max/Min

head(sort(TRI87$air_releases))
## [1]  2  5  5  7  9 10
index <- order(TRI87$air_releases)
head(TRI87$FACILITY_NAME[index])   # displays facilities in order of their air releases
## [1] "AIR PRODUCTS MANUFACTURING CORP"          "UNITED FIBERS"                            "CLOROX MANUFACTURING CO"                 
## [4] "ICI AMERICAS INC WESTERN RESEARCH CENTER" "UNION CARBIDE CORP"                       "SCOTTS-SIERRA HORTICULTURAL PRODS CO INC"
i_max <- which.max(TRI87$air_releases)
TRI87$FACILITY_NAME[i_max]   # was NUMMI at the time
## [1] "TESLA INC"

2.6.5 Factors

Factors are vectors with predefined values:

  • Normally used for categorical data
  • Built on an integer vector
  • Levels are the set of predefined values

nut <- factor(c("almond", "walnut", "pecan", "almond"))
str(nut)   # note that levels will be in alphabetical order
##  Factor w/ 3 levels "almond","pecan",..: 1 3 2 1
typeof(nut)
## [1] "integer"

An equivalent conversion:

nutint <- c(1, 2, 3, 2) # equivalent conversion
nut <- factor(nutint, labels = c("almond", "pecan", "walnut"))
str(nut)
##  Factor w/ 3 levels "almond","pecan",..: 1 2 3 2

2.6.5.1 Categorical Data and Factors

While character data might be seen as categorical (e.g. “urban,” “agricultural,” “forest” land covers), to be used as categorical variables they must be made into factors. So that we have something to work with, we’ll generate some random memberships in one of three vegetation moisture categories using the sample() function:

veg_moisture_categories <- c("xeric", "mesic", "hydric")
veg_moisture_char <- sample(veg_moisture_categories, 42, replace = TRUE)
veg_moisture_fact <- factor(veg_moisture_char, levels = veg_moisture_categories)
veg_moisture_char
##  [1] "hydric" "hydric" "mesic"  "hydric" "hydric" "hydric" "mesic"  "mesic"  "mesic"  "xeric"  "xeric"  "hydric" "hydric" "xeric" 
## [15] "mesic"  "xeric"  "xeric"  "mesic"  "mesic"  "hydric" "mesic"  "hydric" "mesic"  "xeric"  "mesic"  "mesic"  "hydric" "xeric" 
## [29] "hydric" "mesic"  "hydric" "mesic"  "xeric"  "hydric" "mesic"  "xeric"  "xeric"  "hydric" "hydric" "xeric"  "mesic"  "hydric"
veg_moisture_fact
##  [1] hydric hydric mesic  hydric hydric hydric mesic  mesic  mesic  xeric  xeric  hydric hydric xeric  mesic  xeric  xeric  mesic 
## [19] mesic  hydric mesic  hydric mesic  xeric  mesic  mesic  hydric xeric  hydric mesic  hydric mesic  xeric  hydric mesic  xeric 
## [37] xeric  hydric hydric xeric  mesic  hydric
## Levels: xeric mesic hydric
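
The levels() function returns the set of predefined categories, and table() counts how many observations fall in each; since sample() draws randomly, your counts will differ from run to run unless you first set a seed with set.seed():

levels(veg_moisture_fact)
## [1] "xeric"  "mesic"  "hydric"
table(veg_moisture_fact)   # counts per category; the totals vary with each random sample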

To make a categorical variable a factor:

nut <- c("almond", "walnut", "pecan", "almond")
farm <- c("organic", "conventional", "organic", "organic")
ag <- as.data.frame(cbind(nut, farm))
ag$nut <- factor(ag$nut)
ag$nut
## [1] almond walnut pecan  almond
## Levels: almond pecan walnut

Factor example

sierraFeb$COUNTY <- factor(sierraFeb$COUNTY)
str(sierraFeb$COUNTY)
##  Factor w/ 20 levels "Amador","Butte",..: 19 14 7 12 12 18 11 2 18 3 ...

2.6.6 Accessors

The use of accessors in R can be confusing to learners without previous programming experience, so this is good to explore. (Thanks to Brown (n.d.) for a blog post on this topic.) An accessor is “a method for accessing data in an object usually an attribute of that object” (Brown (n.d.)); in R these are [], [[]], and $, but as that source notes, it can be confusing to know when to use which one. There are good reasons to have all three for code clarity; however, with a bit of clumsiness you can use [] for all purposes.

We’ve already been using these in this chapter and will continue to use them throughout the book. Let’s look at the various accessors:

2.6.6.1 Subsetting with []

You use this to get a subset of any R object, whether it be a vector, list, or data frame. Normally it will return an object of the same type, although selecting a single variable from a data frame with the [rows, columns] form (as we’ll see below) returns a vector.

lats <- c(37.5,47.4,29.4,33.4)
str(lats[2:3])
##  num [1:2] 47.4 29.4
str(letters[24:26])
##  chr [1:3] "x" "y" "z"
str(cars)
## 'data.frame':    50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...

Getting one element from a data frame with a single index, by contrast, returns a single-variable data frame:

str(cars[1])
## 'data.frame':    50 obs. of  1 variable:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
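
With data frames, [] also accepts a [rows, columns] form; note that selecting rows of a single variable this way drops down to a vector, consistent with what we noted above:

str(cars[1:5, ])        # first five rows, both variables: still a data frame
## 'data.frame':    5 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8
##  $ dist : num  2 10 4 22 16
str(cars[1:5, "dist"])  # first five values of one variable: returned as a vector
##  num [1:5] 2 10 4 22 16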

2.6.6.2 The mysterious double bracket [[]]

Double brackets extract just one element, so just one value from a vector or one vector from a data frame. You’re going one step further into the structure.

str(cars[,"speed"])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
str(cars[["speed"]])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...

Though uncommon, you can also use the index of a variable instead of its name. It may be useful to realize this to help conceptualize variables as members of a set, with indices indicating which one. Since “speed” is the first variable in the cars data frame, these return the same thing:

str(cars[[1]])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
str(cars[["speed"]])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...

2.6.6.3 Accessing a vector from a data frame with $

The $ accessor is really just a shortcut, but any shortcut reduces code and thus increases clarity, so it’s a good idea and is commonly used. Its only limitation is that you can’t use integer indices with it, but that’s rarely needed anyway.

These do the same thing:

str(cars$speed)
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
str(cars[,"speed"])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
str(cars[["speed"]])
##  num [1:50] 4 4 7 7 8 9 10 10 10 11 ...

2.7 Programming scripts in RStudio

Given the exploratory nature of the R language, we sometimes forget that it provides significant capabilities as a programming language where we can solve more complex problems by coding procedures and using logic to control the process and handle a range of possible scenarios.

Programming languages are used for a wide range of purposes, from developing operating systems built from low-level code to high-level scripting used to run existing functions in libraries. R and Python are commonly used for scripting, and you may be familiar with using arcpy to script ArcGIS geoprocessing tools. But whether low- or high-level, some common operational structures are used in all computer programming languages:

  • Functions (defining your own)
  • Conditional operations: if a condition is true, do this
  • Loops

2.7.1 Functions (defining your own)

function(input){Do this and return the resulting expression}

The various packages that we’re installing, all those that aren’t purely data (like iGIScData), are built primarily of functions, and perhaps most of those functions are written in R. Many of these simply make existing R functions work better or at least differently, often for a particular data science task commonly needed in a discipline or application area.

In geospatial environmental research, for instance, we are often dealing with direction, such as the movement of marine mammals that might be influenced by ship traffic. An agent-based model simulation of marine mammal movement might have the animal respond by shifting to the left or right, so we might want a turnLeft() or turnRight() function. Given the nature of circular data, however, the code might be sufficiently complex to warrant writing a function that will make our main code easier to read:

turnright <- function(ang){(ang + 90) %% 360}

Then in our code later on…

id <- c("5A", "12D", "3B")
direction <- c(260, 270, 300)
whale <- dplyr::bind_cols(id = id,direction = direction) # better than cbind
whale
## # A tibble: 3 x 2
##   id    direction
##   <chr>     <dbl>
## 1 5A          260
## 2 12D         270
## 3 3B          300

… we can call this function:

whale$direction <- turnright(whale$direction)
whale
## # A tibble: 3 x 2
##   id    direction
##   <chr>     <dbl>
## 1 5A          350
## 2 12D           0
## 3 3B           30
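
A matching turnleft() would follow the same pattern, with %% 360 keeping results in the 0 to 359 range even when the subtraction goes below zero (a sketch mirroring turnright() above):

turnleft <- function(ang){(ang - 90) %% 360}
turnleft(45)   # 45 - 90 = -45, which wraps around to 315
## [1] 315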

Another function I found useful simplifies the code needed to access the external data in iGIScData. I found I had to keep looking up the syntax for that task, which we use a lot, and the long system.file() call also makes code difficult to read. Adding this function to the top of your code helps with both:

ext <- function(fnam){system.file("extdata",fnam,package="iGIScData")}

Then our code that accesses data is greatly simplified, with read.csv calls looking a lot like reading data stored in our project folder. Where we might read eucoakSites.csv stored in our project folder with:

read.csv("eucoakSites.csv")

we can read it from the data package’s extdata folder with:

read.csv(ext("eucoakSites.csv"))
##   ï..Site      X       Y      long      lat
## 1     TP1 564919 4196290 -122.2615 37.91183
## 2     TP2 564547 4196250 -122.2657 37.91150
## 3     TP3 564161 4196619 -122.2701 37.91485
## 4     TP4 566354 4194690 -122.2453 37.89731
## 5     AB1 541784 4196595 -122.5246 37.91594
## 6     AB2 542063 4196759 -122.5214 37.91741
## 7     KM1 540459 4199159 -122.5396 37.93911
## 8     PR1 546532 4182700 -122.4715 37.79048

2.7.2 Conditional operations

if (condition) {Do this} else {Do this other thing}

Conditional operations probably aren’t used as much in R by most users, as they’re mostly useful for building a new tool (or function) that needs to run as a complete operation. In that case, you’ll need to be able to handle a variety of inputs and avoid errors. Here’s an admittedly trivial example of some short code that avoids a division-by-zero error:

getRatio <- function(x,y){
  if (y!=0){x/y}
  else {1e1000}   # 1e1000 overflows to Inf, a stand-in for an undefined ratio
}

getRatio(5,0)
## [1] Inf
getRatio(2,5)
## [1] 0.4
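
For element-wise choices over whole vectors, base R also offers the vectorized ifelse() function, which avoids an explicit if...else block; a small sketch (the Inf fallback just mirrors getRatio() above, and the vectors are made up):

num <- c(5, 2, 4)
den <- c(0, 5, 2)
ifelse(den != 0, num / den, Inf)   # picks num/den where den isn't zero, Inf where it is
## [1] Inf 0.4 2.0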

2.7.3 Loops

for (counter in list) {Do something}

Loops are very common in traditional computer languages (in FORTRAN they were called Do loops), but they are not used as much in R due to its vectorization approach. They can still be useful in some situations, though. The following is trivial but illustrates a simple loop that prints a series of results:

for(i in 1:10) print(paste(i, 1/i))
## [1] "1 1"
## [1] "2 0.5"
## [1] "3 0.333333333333333"
## [1] "4 0.25"
## [1] "5 0.2"
## [1] "6 0.166666666666667"
## [1] "7 0.142857142857143"
## [1] "8 0.125"
## [1] "9 0.111111111111111"
## [1] "10 0.1"

Here’s a more complex loop that builds river data for a map and profile that we’ll look at again in the visualization chapter. It also includes a conditional operation with if..else, embedded within the for loop (note the { } enclosures for each structure.)

x <- c(1000, 1100, 1300, 1500, 1600, 1800, 1900)
y <- c(500, 700, 800, 1000, 1200, 1300, 1500)
elev <- c(0, 1, 2, 5, 25, 75, 150)
d <- double()      # creates an empty numeric vector 
longd <- double()  # ("double" means double-precision floating point)
s <- double()
for(i in 1:length(x)){
  if(i==1){longd[i] <- 0; d[i] <-0}
  else{
    d[i] <- sqrt((x[i]-x[i-1])^2 + (y[i]-y[i-1])^2)
    longd[i] <- longd[i-1] + d[i]
    s[i-1] <- (elev[i]-elev[i-1])/d[i]
    }
  }
s[length(x)] <- NA
riverData <- as.data.frame(cbind(x=x,y=y,elev=elev,d=d,longd=longd,s=s))
riverData
##      x    y elev        d     longd           s
## 1 1000  500    0   0.0000    0.0000 0.004472136
## 2 1100  700    1 223.6068  223.6068 0.004472136
## 3 1300  800    2 223.6068  447.2136 0.010606602
## 4 1500 1000    5 282.8427  730.0563 0.089442719
## 5 1600 1200   25 223.6068  953.6631 0.223606798
## 6 1800 1300   75 223.6068 1177.2699 0.335410197
## 7 1900 1500  150 223.6068 1400.8767          NA
plot(riverData$longd, riverData$elev)

2.7.4 Free-standing scripts and RStudio projects

As we move forward, we’ll want to develop complete, free-standing scripts that have access to all of the needed libraries and data; your scripts should stand on their own. One example of this, which may seem insignificant, is using print() statements instead of just naming the object or variable in the console. While naming objects is common in exploratory work, we need to learn to create free-standing scripts.

However, “free-standing” still allows for loading libraries of functions we’ll be using. We’re still talking about high-level programming (scripting), not low-level programming, so we can depend on libraries that any user can access by installing the corresponding packages. If we develop our own packages, we just need to give the user the ability to install them.

RStudio projects are going to be the way we’ll want to work for the rest of this book, and you’ll often want to create new ones for particular data sets so things don’t get messy. Since we’re using our iGIScData package, this is less of an issue because at least input data aren’t stored in the project folder. However, you’re going to be creating data, so you’ll want to manage it in individual projects. You may want to start a new project for each data set, using File/New Project, and try to keep things organized (things can get messy fast!).

In this book, we’ll be making a lot of use of data provided for you from various data packages such as built-in data, palmerpenguins (Horst, Hill, and Gorman 2020), or iGIScData, but these data correspond to specific research projects, such as Sierra Climate, to which several data frames and spatial data apply. For this chapter, you can probably just use one project, but later you’ll find it useful to create a separate project for each data set – such as a sierra project – and return to it every time it applies.

In that project, you’ll build a series of scripts, many of which you’ll re-use to develop new methods. When you’re working on your own project with your own data, store the data in a data folder inside the project folder. All paths are then local, and the default working directory is the project folder, so you can specify "data/mydata.csv" as the path to a CSV of that name.

R Markdown

An alternative to writing scripts is writing R Markdown documents, which include both formatted text (such as you’re seeing in this book, like the italics just used above, which were created using asterisks) and code chunks. R lends itself to running code in chunks, as opposed to creating complete tools that run all of the way through. This book is built from R Markdown documents organized in a bookdown structure, and most of the figures are created from R code chunks. There are also many good resources on writing R Markdown documents, including the very thorough R Markdown: The Definitive Guide (Xie, Allaire, and Grolemund 2019).

2.7.5 Subsetting with logic

We’ll use the 2022 USDOE fuel efficiency data to list all of the car lines with at least 50 miles to the gallon:

library(readxl)
excelPath <- system.file("extdata","USDOE_FuelEfficiency2022.xlsx", package="iGIScData")
fuelEff22 <- read_excel(excelPath) 
i <- fuelEff22$`Hwy FE (Guide) - Conventional Fuel` >= 50
paste(fuelEff22$Division[i],fuelEff22$Carline[i])
## [1] "TOYOTA COROLLA HYBRID"                     "HYUNDAI MOTOR COMPANY Elantra Hybrid"      "HYUNDAI MOTOR COMPANY Elantra Hybrid Blue"
## [4] "TOYOTA PRIUS"                              "TOYOTA PRIUS Eco"                          "HYUNDAI MOTOR COMPANY Ioniq"              
## [7] "HYUNDAI MOTOR COMPANY Ioniq Blue"          "HYUNDAI MOTOR COMPANY Sonata Hybrid"       "HYUNDAI MOTOR COMPANY Sonata Hybrid Blue"

which [in a new air_quality project]

library(readr)
TRI87 <- read_csv("data/TRI_1987_BaySites.csv")
i <- which(TRI87$air_releases > 1e6)
TRI87$FACILITY_NAME[i]
## [1] "VALERO REFINING CO-CALI FORNIA BENICIA REFINERY" "TESLA INC"                                      
## [3] "TESORO REFINING & MARKETING CO LLC"              "HGST INC"

%in%

library(readr)
csvPath = system.file("extdata","TRI_1987_BaySites.csv", package="iGIScData")
TRI87 <- read_csv(csvPath)
i <- TRI87$COUNTY %in% c("NAPA","SONOMA")
TRI87$FACILITY_NAME[i]
## [1] "SAWYER OF NAPA"                                "BERINGER VINEYARDS"                           
## [3] "CAL-WOOD DOOR INC"                             "SOLA OPTICAL USA INC"                         
## [5] "KEYSIGHT TECHNOLOGIES INC"                     "SANTA ROSA STAINLESS STEEL"                   
## [7] "OPTICAL COATING LABORATORY INC"                "MGM BRAKES"                                   
## [9] "SEBASTIANI VINEYARDS INC, SONOMA CASK CELLARS"

2.7.6 Apply functions

There are many apply functions in R, and they largely obviate the need for looping. For instance:

  • apply derives values at margins of rows and columns, e.g. to sum across rows or down columns [create the following in a new generic_methods project which you’ll use for a variety of generic methods]
# matrix apply – the same would apply to data frames
matrix12 <- 1:12
dim(matrix12) <- c(3,4)
rowsums <- apply(matrix12, 1, sum)
colsums <- apply(matrix12, 2, sum)
sum(rowsums)
## [1] 78
sum(colsums)
## [1] 78
zero <- sum(rowsums) - sum(colsums)
matrix12
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Apply functions satisfy one of the needs that spreadsheets are used for. Consider how often you use sum, mean, or similar functions in Excel.

sapply

sapply applies functions to either:

  • all elements of a vector – unary functions only
sapply(1:12, sqrt)
##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 3.000000 3.162278 3.316625 3.464102
  • or all variables of a data frame (not a matrix), where it works much like a column-based apply (since variables are columns) but is more easily interpreted, without the need to specify the column margin with 2:
sapply(cars,mean)  # same as apply(cars,2,mean)
## speed  dist 
## 15.40 42.98
temp02 <- c(10.7,9.7,7.7,9.2,7.3,6.7,4.0,5.0,0.9,-1.1,-0.8,-4.4)
temp03 <- c(13.1,11.4,9.4,10.9,8.9,8.4,6.7,7.6,2.8,1.6,1.2,-2.1)
sapply(as.data.frame(cbind(temp02,temp03)),mean) # has to be a data frame
##   temp02   temp03 
## 4.575000 6.658333

While various apply functions are in base R, the purrr package takes these further.
See: purrr cheat sheet

2.8 Exercises

  1. Assign variables for your name, city, state and zip code, and use paste() to combine them, and assign them to the variable me. What is the class of me?

  2. Knowing that trigonometric functions require angles (including azimuth directions) to be provided in radians, and that degrees can be converted into radians by dividing by 180 and multiplying that by pi, derive the sine of 30 degrees with an R expression. (Base R knows what pi is, so you can just use pi)

  3. If two sides of a right triangle on a map can be represented as \(dX\) and \(dY\) and the direct line path between them \(c\), and the coordinates of 2 points on a map might be given as \((x1,y1)\) and \((x2,y2)\), with \(dX=x2-x1\) and \(dY=y2-y1\), use the Pythagorean theorem to derive the distance between them and assign that expression to \(c\).

  4. You can create a vector of uniform random numbers from 0 to 1 using runif(n=30), where n=30 says to make 30 of them. Use the round() function to round each of the values, provide what you created, and explain what happened.

  5. Create two vectors of 10 numbers each with the c() function, assigning them to x and y. Then plot(x,y), and provide the three lines of code you used to do the assignments and plot.

  6. Change your code from #5 so that one value is NA (entered simply as NA, no quotation marks), and derive the mean value for x. Then add ,na.rm=T to the parameters for mean(). Also do this for y. Describe your results and explain what happens.

  7. Create two sequences, a and b, with a all odd numbers from 1 to 99, b all even numbers from 2 to 100. Then derive c through vector division of b/a. Plot a and c together as a scatterplot.

  8. Build the sierradata data frame from the data at the top of the Matrices section, also given here:

temp <- c(10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)
elev <- c(52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551)
lat <- c(39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21)

Create a data frame from it using the same steps, and plot temp against latitude.

  9. From the sierradata matrix built with cbind(), derive column means (colmeans) using apply() with the mean function and the column margin 2.

  10. Do the same thing with the sierradata data frame, using sapply().

References

Brown, Christopher. n.d. “R Accessors Explained.” https://www.r-bloggers.com/2009/10/r-accessors-explained/.
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://allisonhorst.github.io/palmerpenguins/.
Xie, Yihui, J. J. Allaire, and Garrett Grolemund. 2019. R Markdown: The Definitive Guide. 1st ed. Boca Raton, Florida: Chapman & Hall/CRC. https://bookdown.org/yihui/rmarkdown/.