2.1 Basic R Programming

Rodney Dyer (worked example) and Helene Wagner (adaptation)

1. Overview

This worked example is adapted from “Applied Population Genetics” by Rodney Dyer. The entire book is available here: http://dyerlab.github.io/applied_population_genetics/index.html

R is a statistical programming language, and it thus requires users to work with code. This can be intimidating at first. Working through this document will help you get up to speed with basic R concepts and notation. Whether you are new to R or need a refresher, this worked example will get you to the level expected for the ‘Landscape Genetics with R’ lab course. The main topics covered here are:

R data types (how data are stored in R: numeric, character, etc.)
R containers (how data are organized: vectors, data frames, etc.)
R functions (how to tell R what to do)

See also video “Week 0: Intro to R Notebooks” for options how to work this this document as .html or .Rmd.

Install packages needed for this worked example. Note: popgraph needs to be installed before installing gstudio.

if(!requireNamespace("popgraph", quietly = TRUE))
{
  install.packages(c("RgoogleMaps", "geosphere", "proto", "sampling", 
                      "seqinr", "spacetime", "spdep"), dependencies=TRUE)
  remotes::install_github("dyerlab/popgraph")
}
  
if(!requireNamespace("gstudio", quietly = TRUE)) remotes::install_github("dyerlab/gstudio")

## Warning: replacing previous import 'dplyr::union' by 'raster::union' when
## loading 'gstudio'

## Warning: replacing previous import 'dplyr::intersect' by 'raster::intersect'
## when loading 'gstudio'

## Warning: replacing previous import 'dplyr::select' by 'raster::select' when
## loading 'gstudio'

2. Data Types

The data we work with comes in many forms—integers, stratum, categories, genotypes, etc.—all of which we need to be able to work with in our analyses. In this chapter, the basic data types we will commonly use in population genetic analyses. This section covers some of the basic types of data we will use in R. These include numbers, character, factors, and logical data types. We will also introduce the locus object from the gstudio library and see how it is just another data type that we can manipulate in R.

The very first hurdle you need to get over is the oddness in the way in which R assigns values to variables.

variable <- value

Yes that is a less-than and dash character. This is the assignment operator that historically has been used and it is the one that I will stick with. In some cases you can use the ‘=’ to assign variables instead but then it takes away the R-ness of R itself. For decision making, the equality operator (e.g., is this equal to that) is the double equals sign ‘==’. We will get into that below where we talk about logical types and later in decision making.

If you are unaware of what type a particular variable may be, you can always use the type() function and R will tell you.

class( variable )

R also has a pretty good help system built into itself. You can get help for any function by typing a question mark in front of the function name. This is a particularly awesome features because at the end of the help file, there is often examples of its usage, which are priceless. Here is the documentation for the ‘help’ function as given by:

?help

There are also package vignettes available (for most packages you download) that provide additional information on the routines, data sets, and other items included in these packages. You can get a list of vignettes currently installed on your machine by:

vignette()

and vignettes for a particular package by passing the package name as an argument to the function itself.

2.1. Numeric Data Types

a. Numeric data

The quantitative measurements we make are often numeric, in that they can be represented as as a number with a decimal component (think weight, height, latitude, soil moisture, ear wax viscosity, etc.). The most basic type of data in R, is the numeric type and represents both integers and floating point numbers (n.b., there is a strict integer data type but it is often only needed when interfacing with other C libraries and can for what we are doing be disregarded).

Assigning a value to a variable is easy

x <- 3
x

## [1] 3

By default, R automatically outputs whole numbers numbers within decimal values appropriately.

y <- 22/7
y

## [1] 3.142857

If there is a mix of whole numbers and numbers with decimals together in a container such as

c(x,y)

## [1] 3.000000 3.142857

then both are shown with decimals. The c() part here is a function that combines several data objects together into a vector and is very useful. In fact, the use of vectors are are central to working in R and functions almost all the functions we use on individual variables can also be applied to vectors.

A word of caution should be made about numeric data types on any computer. Consider the following example.

x <- .3 / 3
x

## [1] 0.1

which is exactly what we’d expect. However, the way in which computers store decimal numbers plays off our notion of significant digits pretty well. Look what happens when I print out x but carry out the number of decimal places.

print(x, digits=20)

## [1] 0.099999999999999991673

Not quite 0.1 is it? Not that far away from it but not exact. That is a general problem, not one that R has any more claim to than any other language and/or implementation. Does this matter much, probably not in the realm of the kinds of things we do in population genetics, it is just something that you should be aware of. You can make random sets of numeric data by using using functions describing various distributions. For example, some random numbers from the normal distribution are:

rnorm(10)

##  [1]  2.06662667 -0.04966700  0.80074666 -1.73626656  1.50572252 -2.84825623
##  [7]  0.22749453 -0.52537009 -0.09460814  1.22570130

from the normal distribution with designated mean and standard deviation:

rnorm(10,mean=42,sd=12)

##  [1] 52.33096 43.47910 66.84867 39.28119 45.80486 46.65419 47.61229 47.47668
##  [9] 56.21916 37.66977

A poisson distribution with mean 2:

rpois(10,lambda = 2)

##  [1] 4 5 5 0 2 0 1 5 2 4

and the $\chi^2$ distribution with 1 degree of freedom:

rchisq(10, df=1)

##  [1] 1.9875285236 0.1285863056 0.0005875502 1.1408654015 1.3253290633
##  [6] 4.5492867565 0.0940073161 1.8556983125 0.0185083337 0.0738698859

There are several more distributions that if you need to access random numbers, quantiles, probability densities, and cumulative density values are available.

b. Coercion to Numeric

All data types have the potential ability to take another variable and coerce it into their type. Some combinations make sense, and some do not. For example, if you load in a CSV data file using read_csv(), and at some point a stray non-numeric character was inserted into one of the cells on your spreadsheet, R will interpret the entire column as a character type rather than as a numeric type. This can be a very frustrating thing, spreadsheets should generally be considered evil as they do all kinds of stuff behind the scenes and make your life less awesome.

Here is an example of coercion of some data that is initially defined as a set of characters

x <- c("42","99")
x

## [1] "42" "99"

and is coerced into a numeric type using the as.numeric() function.

y <- as.numeric( x )
y

## [1] 42 99

It is a built-in feature of the data types in R that they all have (or should have if someone is producing a new data type and is being courteous to their users) an as.X() function. This is where the data type decides if the values asked to be coerced are reasonable or if you need to be reminded that what you are asking is not possible. Here is an example where I try to coerce a non-numeric variable into a number.

x <- "The night is dark and full of terrors..."
as.numeric( x )

## Warning: NAs introduced by coercion

## [1] NA

By default, the result should be NA (missing data/non-applicable) if you ask for things that are not possible.

2.2. Characters

a. Character data

A collection of letters, number, and or punctuation is represented as a character data type. These are enclosed in either single or double quotes and are considered a single entity. For example, my name can be represented as:

prof <- "Rodney J. Dyer"
prof

## [1] "Rodney J. Dyer"

In R, character variables are considered to be a single entity, that is the entire prof variable is a single unit, not a collection of characters. This is in part due to the way in which vectors of variables are constructed in the language. For example, if you are looking at the length of the variable I assigned my name to you see

length(prof)

## [1] 1

which shows that there is only one ‘character’ variable. If, as is often the case, you are interested in knowing how many characters are in the variable prof, then you use the

nchar(prof)

## [1] 14

function instead. This returns the number of characters (even the non-printing ones like tabs and spaces.

nchar(" \t ")

## [1] 3

As all other data types, you can define a vector of character values using the c() function.

x <- "I am"
y <- "not"
z <- 'a looser'
terms <- c(x,y,z)
terms

## [1] "I am"     "not"      "a looser"

And looking at the length() and nchar() of this you can see how these operations differ.

length(terms)

## [1] 3

nchar(terms)

## [1] 4 3 8

b. Concatenation of Characters

Another common use of characters is concatenating them into single sequences. Here we use the function paste() and can set the separators (or characters that are inserted between entities when we collapse vectors). Here is an example, entirely fictional and only provided for instructional purposes only.

paste(terms, collapse=" ")

## [1] "I am not a looser"

paste(x,z)

## [1] "I am a looser"

paste(x,z,sep=" not ")

## [1] "I am not a looser"

c. Coercion to Characters

A character data type is often the most basal type of data you can work with. For example, consider the case where you have named sample locations. These can be kept as a character data type or as a factor (see below). There are benefits and drawbacks to each representation of the same data (see below). By default (as of the version of R I am currently using when writing this book), if you use a function like read_table() to load in an external file, columns of character data will be treated as factors. This can be good behavior if all you are doing is loading in data and running an analysis, or it can be a total pain in the backside if you are doing more manipulative analyses.

Here is an example of coercing a numeric type into a character type using the as.character() function.

x <- 42
x

## [1] 42

y <- as.character(x)
y

## [1] "42"

2.3. Factors

a. Factor vs. Character

A factor is a categorical data type. If you are coming from SAS, these are class variables. If you are not, then perhaps you can think of them as mutually exclusive classifications. For example, an sample may be assigned to one particular locale, one particular region, and one particular species. Across all the data you may have several species, regions, and locales. These are finite, and defined, sets of categories. One of the more common headaches encountered by people new to R is working with factor types and trying to add categories that are not already defined.

Since factors are categorical, it is in your best interest to make sure you label them in as descriptive as a fashion as possible. You are not saving space or cutting down on computational time to take shortcuts and label the locale for Rancho Santa Maria as RSN or pop3d or 5. Our computers are fast and large enough, and our programmers are cleaver enough, to not have to rename our populations in numeric format to make them work (hello STRUCTURE I’m calling you out here). The only thing you have to loose by adopting a reasonable naming scheme is confusion in your output.

To define a factor type, you use the function factor() and pass it a vector of values.

region <- c("North","North","South","East","East","South","West","West","West")
region <- factor( region )
region

## [1] North North South East  East  South West  West  West 
## Levels: East North South West

When you print out the values, it shows you all the levels present for the factor. If you have levels that are not present in your data set, when you define it, you can tell R to consider additional levels of this factor by passing the optional levels= argument as:

region <- factor( region, levels=c("North","South","East","West","Central"))
region

## [1] North North South East  East  South West  West  West 
## Levels: North South East West Central

If you try to add a data point to a factor list that does not have the factor that you are adding, it will give you an error (or ‘barf’ as I like to say).

region[1] <- "Bob"

## Warning in `[<-.factor`(`*tmp*`, 1, value = "Bob"): invalid factor level, NA
## generated

Now, I have to admit that the Error message in its entirety, with its “[<-.factor(*tmp*, 1, value = “Bob”)“` part is, perhaps, not the most informative. Agreed. However, the “invalid factor level” does tell you something useful. Unfortunately, the programmers that put in the error handling system in R did not quite adhere to the spirit of the “fail loudly” mantra. It is something you will have to get good at. Google is your friend, and if you post a questions to (http://stackoverflow.org) or the R user list without doing serious homework, put on your asbestos shorts!

Unfortunately, the error above changed the first element of the region vector to NA (missing data). I’ll turn it back before we move too much further.

region[1] <- "North"

Factors in R can be either unordered (as say locale may be since locale A is not >, =, or < locale B) or they may be ordered categories as in Small < Medium < Large < X-Large. When you create the factor, you need to indicate if it is an ordered type (by default it is not). If the factors are ordered in some way, you can also create an ordination on the data. If you do not pass a levels= option to the factors() function, it will take the order in which they occur in data you pass to it. If you want to specify an order for the factors specifically, pass the optional levels= and they will be ordinated in the order given there.

region <- factor( region, ordered=TRUE, levels = c("West", "North", "South", "East") )
region

## [1] North North South East  East  South West  West  West 
## Levels: West < North < South < East

b. Missing Levels in Factors

There are times when you have a subset of data that do not have all the potential categories.

subregion <- region[ 3:9 ]
subregion

## [1] South East  East  South West  West  West 
## Levels: West < North < South < East

table( subregion )

## subregion
##  West North South  East 
##     3     0     2     2

2.4. Logical Types

A logical type is either TRUE or FALSE, there is no in-between. It is common to use these types in making decisions (see if-else decisions) to check a specific condition being satisfied. To define logical variables you can either use the TRUE or FALSE directly

canThrow <- c(FALSE, TRUE, FALSE, FALSE, FALSE)
canThrow

## [1] FALSE  TRUE FALSE FALSE FALSE

or can implement some logical condition

stable <- c( "RGIII" == 0, nchar("Marshawn") == 8)
stable

## [1] FALSE  TRUE

on the variables. Notice here how each of the items is actually evaluated as to determine the truth of each expression. In the first case, the character is not equal to zero and in the second, the number of characters (what nchar() does) is indeed equal to 8 for the character string “Marshawn”.

It is common to use logical types to serve as indices for vectors. Say for example, you have a vector of data that you want to select some subset from.

data <- rnorm(20)
data

##  [1] -0.79745223  1.46501894 -1.03047618 -0.21853566 -0.23757372  0.73102901
##  [7]  1.11701584 -0.63453169  1.67139031 -1.71802700 -0.31123567 -0.84801288
## [13] -0.36546077  0.81964648  0.13217699  0.17356375 -0.60772655 -0.98576698
## [19]  1.29962637 -0.02286482

Perhaps you are on interested in the non-negative values

data[ data > 0 ]

## [1] 1.4650189 0.7310290 1.1170158 1.6713903 0.8196465 0.1321770 0.1735638
## [8] 1.2996264

If you look at the condition being passed to as the index

data > 0

##  [1] FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE

you see that individually, each value in the data vector is being evaluated as a logical value, satisfying the condition that it is strictly greater than zero. When you pass that as indices to a vector it only shows the indices that are TRUE.

You can coerce a value into a logical if you understand the rules. Numeric types that equal 0 (zero) are FALSE, always. Any non-zero value is considered TRUE. Here I use the modulus operator, %%, which provides the remainder of a division.

1:20 %% 2

##  [1] 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0

which used as indices give us

data[ (1:20 %% 2) > 0 ]

##  [1] -0.7974522 -1.0304762 -0.2375737  1.1170158  1.6713903 -0.3112357
##  [7] -0.3654608  0.1321770 -0.6077265  1.2996264

You can get as complicated in the creation of indices as you like, even using logical operators such as OR and AND. I leave that as an example for you to play with.

3. Data Containers

We almost never work with a single datum¹, rather we keep lots of data. Moreover, the kinds of data are often heterogeneous, including categorical (Populations, Regions), continuous (coordinates, rainfall, elevation), imagry (hyperspectral, LiDAR), and perhaps even genetic. R has a very rich set of containers into which we can stuff our data as we work with it. Here these container types are examined and the restrictions and benefits associated with each type are explained.

3.1. Vectors

We have already seen several examples of several vectors in action (see the introduction to Numeric data types for example). A vector of objects is simply a collection of them, often created using the c() function (c for combine). Vectorized data is restricted to having homogeneous data types—you cannot mix character and numeric types in the same vector. If you try to mix types, R will either coerce your data into a reasonable type

x <- c(1,2,3)
x

## [1] 1 2 3

y <- c(TRUE,TRUE,FALSE)
y

## [1]  TRUE  TRUE FALSE

z <- c("I","am","not","a","looser")
z

## [1] "I"      "am"     "not"    "a"      "looser"

or coearce them into one type that is amenable to all the types of data that you have given it. In this example, a Logical, Character, Constant, and Function are combined resulting in a vector output of type Character.

w <- c(TRUE, "1", pi, ls())
w

##  [1] "TRUE"             "1"                "3.14159265358979" "canThrow"        
##  [5] "data"             "prof"             "region"           "rmd_file"        
##  [9] "stable"           "subregion"        "terms"            "x"               
## [13] "y"                "yml_metadata"     "z"

class(w)

## [1] "character"

Accessing elements within a vector are done using the square bracket [] notation. All indices (for vectors and matrices) start at 1 (not zero as is the case for some languages). Getting and setting the components within a vector are accomplished using numeric indices with the assignment operators just like we do for variables containing a single value.

## [1] 1 2 3

x[1] <- 2
x[3] <- 1
x

## [1] 2 2 1

x[2]

## [1] 2

A common type of vector is that of a sequences. We use sequences all the time, to iterate through a list, to counting generations, etc. There are a few ways to generate sequences, depending upon the step sequence. For a sequence of whole numbers, the easiest is through the use of the colon operator.

x <- 1:6
x

## [1] 1 2 3 4 5 6

This provides a nice shorthand for getting the values X:Y from X to Y, inclusive. It is also possible to go backwards using this operator, counting down from X to Y as in:

x <- 5:2
x

## [1] 5 4 3 2

The only constraint here is that we are limited to a step size of 1.0. It is possible to use non-integers as the bounds, it will just count up by 1.0 each time.

x <- 3.2:8.4
x

## [1] 3.2 4.2 5.2 6.2 7.2 8.2

If you are interested in making a sequence with a step other than 1.0, you can use the seq() function. If you do not provide a step value, it defaults to 1.0.

y <- seq(1,6)
y

## [1] 1 2 3 4 5 6

But if you do, it will use that instead.

z <- seq(1,20,by=2)
z

##  [1]  1  3  5  7  9 11 13 15 17 19

It is also possible to create a vector of objects as repetitions using the rep() (for repeat) function.

rep("Beetlejuice",3)

## [1] "Beetlejuice" "Beetlejuice" "Beetlejuice"

If you pass a vector of items to rep(), it can repeat these as either a vector being repeated (the default value)

x <- c("No","Free","Lunch")
rep(x,time=3)

## [1] "No"    "Free"  "Lunch" "No"    "Free"  "Lunch" "No"    "Free"  "Lunch"

or as each item in the vector repeated.

rep(x,each=3)

## [1] "No"    "No"    "No"    "Free"  "Free"  "Free"  "Lunch" "Lunch" "Lunch"

3.2. Matrices

A matrix is a 2- or higher dimensional container, most commonly used to store numeric data types. There are some libraries that use matrices in more than two dimensions (rows and columns and sheets), though you will not run across them too often. Here I restrict myself to only 2-dimensional matrices.

You can define a matrix by giving it a set of values and an indication of the number of rows and columns you want. The easiest matrix to try is one with empty values:

matrix(nrow=2, ncol=2)

##      [,1] [,2]
## [1,]   NA   NA
## [2,]   NA   NA

Perhaps more useful is one that is pre-populated with values.

matrix(1:4, nrow=2 )

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Notice that here, there were four entries and I only specified the number of rows required. By default the ‘filling-in’ of the matrix will proceed down column (by-column). In this example, we have the first column with the first two entries and the last two entries down the second column. If you want it to fill by row, you can pass the optional argument

matrix(1:4, nrow=2, byrow=TRUE)

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

and it will fill by-row.

When filling matrices, the default size and the size of the data being added to the matrix are critical. For example, I can create a matrix as:

Y <- matrix(c(1,2,3,4,5,6),ncol=2,byrow=TRUE)
Y

##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6

X <- matrix(c(1,2,3,4,5,6),nrow=2)
X

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

and both produce a similar matrix, only transposed.

X == t(Y)

##      [,1] [,2] [,3]
## [1,] TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE

In the example above, the number of rows (or columns) was a clean multiple of the number of entries. However, if it is not, R will fill in values.

X <- matrix(c(1,2,3,4,5,6),ncol=4, byrow=TRUE)

## Warning in matrix(c(1, 2, 3, 4, 5, 6), ncol = 4, byrow = TRUE): data length [6]
## is not a sub-multiple or multiple of the number of columns [4]

Notice how you get a warning from the interpreter. But that does not stop it from filling in the remaining slots by starting over in the sequence of numbers you passed to it.

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    1    2

The dimensionality of a matrix (and data.frame as we will see shortly) is returned by the dim() function. This will provide the number of rows and columns as a vector.

dim(X)

## [1] 2 4

Accessing elements to retrieve or set their values within a matrix is done using the square brackets just like for a vector but you need to give [row,col] indices. Again, these are 1-based so that

X[1,3]

## [1] 3

is the entry in the 1st row and 3rd column.

You can also use ‘slices’ through a matrix to get the rows

X[1,]

## [1] 1 2 3 4

or columns

X[,3]

## [1] 3 1

of data. Here you just omit the index for the entity you want to span. Notice that when you grab a slice, even if it is a column, is given as a vector.

length(X[,3])

## [1] 2

You can grab a sub-matrix using slices if you give a range (or sequence) of indices.

X[,2:3]

##      [,1] [,2]
## [1,]    2    3
## [2,]    6    1

If you ask for values from a matrix that exceed its dimensions, R will give you an error.

X[1,8]

## Error in X[1, 8] : subscript out of bounds
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted

There are a few cool extensions of the rep() function that can be used to create matrices as well. They are optional values that can be passed to the function.

times=x: This is the default option that was occupied by the ‘3’ in the example above and represents the number of times that first argument will be repeated.
each=x This will take each element in the first argument are repeat them each times.
length.out=x: This make the result equal in length to x.

In combination, these can be quite helpful. Here is an example using numeric sequences in which it is necessary to find the index of all entries in a 3x2 matrix. To make the indices, I bind two columns together using cbind(). There is a matching row binding function, denoted as rbind() (perhaps not so surprisingly). What is returned is a matrix

indices <- cbind( rep(1:2, each=3), rep(1:3,times=2), rep(5,length.out=6)  )
indices

##      [,1] [,2] [,3]
## [1,]    1    1    5
## [2,]    1    2    5
## [3,]    1    3    5
## [4,]    2    1    5
## [5,]    2    2    5
## [6,]    2    3    5

3.3. Lists

A list is a type of vector but is indexed by ‘keys’ rather than by numeric indices. Moreover, lists can contain heterogeneous types of data (e.g., values of different class), which is not possible in a vector type. For example, consider the list

theList <- list( x=seq(2,40, by=2), dog=LETTERS[1:5], hasStyle=logical(5) )
summary(theList)

##          Length Class  Mode     
## x        20     -none- numeric  
## dog       5     -none- character
## hasStyle  5     -none- logical

which is defined with a numeric, a character, and a logical component. Each of these entries can be different in length as well as type. Once defined, the entries may be observed as:

theList

## $x
##  [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
## 
## $dog
## [1] "A" "B" "C" "D" "E"
## 
## $hasStyle
## [1] FALSE FALSE FALSE FALSE FALSE

Once created, you can add variables to the list using the $-operator followed by the name of the key for the new entry.

theList$my_favoriate_number <- 2.9 + 3i

or use double brackets and the name of the variable as a character string.

theList[["lotto numbers"]] <- rpois(7,lambda=42)

The keys currently in the list are given by the names() function

names(theList)

## [1] "x"                   "dog"                 "hasStyle"           
## [4] "my_favoriate_number" "lotto numbers"

Getting and setting values within a list are done the same way using either the $-operator

theList$x

##  [1]  2  4  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

theList$x[2] <- 42
theList$x

##  [1]  2 42  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

or the double brackets

theList[["x"]]

##  [1]  2 42  6  8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40

or using a numeric index, but that numeric index is looks to the results of names() to figure out which key to use.

theList[[2]]

## [1] "A" "B" "C" "D" "E"

The use of the double brackets in essence provides a direct link to the variable in the list whose name is second in the names() function (dog in this case). If you want to access elements within that variable, then you add a second set of brackets on after the double ones.

theList[[1]][3]

## [1] 6

This deviates from the matrix approach as well as from how we access entries in a data.frame (described next). It is not a single square bracket with two indices, that gives you an error:

theList[1,3]

## Error in theList[1, 3] : incorrect number of dimensions
## Calls: <Anonymous> ... handle -> withCallingHandlers -> withVisible -> eval -> eval
## Execution halted

List are rather robust objects that allow you to store a wide variety of data types (including nested lists). Once you get the indexing scheme down, it they will provide nice solutions for many of your computational needs.

3.4. Data Frames

a. Data Frames as spreadsheets

The data.frame is the default data container in R. It is analogous to both a spreadsheet, at least in the way that I have used spreadsheets in the past, as well as a database. If you consider a single spreadsheet containing measurements and observations from your research, you may have many columns of data, each of which may be a different kind of data. There may be factors representing designations such as species, regions, populations, sex, flower color, etc. Other columns may contain numeric data types for items such as latitude, longitude, dbh, and nectar sugar content. You may also have specialized columns such as dates collected, genetic loci, and any other information you may be collecting.

On a spreadsheet, each column has a unified data type, either quantified with a value or as a missing value, NA, in each row. Rows typically represent the sampling unit, perhaps individual or site, along which all of these various items have been measured or determined. A data.frame is similar to this, at least conceptually. You define a data.frame by designating the columns of data to be used. You do not need to define all of them, more can be added later. The values passed can be sequences, collections of values, or computed parameters. For example:

df <- data.frame( ID=1:5, Names=c("Bob","Alice","Vicki","John","Sarah"), Score=100 - rpois(5,lambda=10))
df

##   ID Names Score
## 1  1   Bob    90
## 2  2 Alice    91
## 3  3 Vicki    93
## 4  4  John    96
## 5  5 Sarah    93

You can see that each column is a unified type of data and each row is equivalent to a record. Additional data columns may be added to an existing data.frame as:

df$Passed_Class <- c(TRUE,TRUE,TRUE,FALSE,TRUE)

Since we may have many (thousands?) of rows of observations, a summary() of the data.frame can provide a more compact description.

summary(df)

##        ID       Names               Score      Passed_Class   
##  Min.   :1   Length:5           Min.   :90.0   Mode :logical  
##  1st Qu.:2   Class :character   1st Qu.:91.0   FALSE:1        
##  Median :3   Mode  :character   Median :93.0   TRUE :4        
##  Mean   :3                      Mean   :92.6                  
##  3rd Qu.:4                      3rd Qu.:93.0                  
##  Max.   :5                      Max.   :96.0

We can add columns of data to the data.frame after the fact using the $-operator to indicate the column name. Depending upon the data type, the summary will provide an overview of what is there.

b. Indexing Data Frames

You can access individual items within a data.frame by numeric index such as:

df[1,3]

## [1] 90

You can slide indices along rows (which return a new data.frame for you)

df[1,]

##   ID Names Score Passed_Class
## 1  1   Bob    90         TRUE

or along columns (which give you a vector of data)

df[,3]

## [1] 90 91 93 96 93

or use the $-operator as you did for the list data type to get direct access to a either all the data or a specific subset therein.

df$Names[3]

## [1] "Vicki"

Indices are ordered just like for matrices, rows first then columns. You can also pass a set of indices such as:

df[1:3,]

##   ID Names Score Passed_Class
## 1  1   Bob    90         TRUE
## 2  2 Alice    91         TRUE
## 3  3 Vicki    93         TRUE

It is also possible to use logical operators as indices. Here I select only those names in the data.frame whose score was >90 and they passed popgen.

df$Names[df$Score > 90 & df$Passed_Class==TRUE]

## [1] "Alice" "Vicki" "Sarah"

This is why data.frame objects are very database like. They can contain lots of data and you can extract from them subsets that you need to work on. This is a VERY important feature, one that is vital for reproducible research. Keep you data in one and only one place.

4. Programming

One of the strengths of R as an analysis platform is that it is a language rather than a program. With programs, such as SPSS & JMP, you are limited by the functionality that the designers thought would be necessary to meet the broadest audience. In R, you can rely upon simple functions or you can create entire analysis and simulation programs de novo. To do this, we need to dig into flow control and decision making processes, both of which you need for doing more in-depth programming.

4.1. Function Writing

Here we look at how to create an R function. Writing small functions like this is a huge benefit to you as an analyst and this is a great place to start. A function in R is defined as:

function_name <- function( arguments ) { Stuff you want the function to do }

You define a function name for whatever you like and assign it the stuff to the right. In R, the function named function() is a special one, it tells R that you are about to create a little routine and you want that set of code to be available to you for later use under the name of whatever you named it. This allows a tremendous amount of flexibility as you develop your own set of routines and analyses for your work. The part that actually does stuff is after the function call. It may be that the function that you create does need some data (those are the arguments) or you may not need any input in to the function (in which you pass no arguments). It all depends upon what you are creating.

The key to understanding functions is that they are encapsulations of code—a shortcut for a sequence of instructions if you will not have to type over and over again. The less typing you do, the lower the probability that you will have errors (and all code has errors).

Here is an example of some code that I’m going to develop into a function. This function will allow me to determine if one genotype could possibly be the offspring of the other genotype.

library(gstudio)
loc1 <- locus( c(128,130) )
loc2 <- locus( c(128,128) )
cat( loc1, loc2 )

## 128:130 128:128

We start out with two loci, a 128:130 heterozygote and a 128:128 homozygote. These may represent repeat motifs at a microsatellite locus or some other co-dominant genotype. First, I’ll break the locus into a vector of genotypes.

off.alleles <- alleles( loc1 )
off.alleles

## [1] "128" "130"

mom.alleles <- alleles( loc2 )
mom.alleles

## [1] "128" "128"

To be a valid potential offspring there should be at least one of the alleles in the parent that matches the allele in the offspring. The intersect() function returns the set of values common to both vectors.

shared <- intersect( off.alleles, mom.alleles )
shared

## [1] "128"

If it has at least one of the alleles present (it could have both if parent and offspring are both the same heterozygote) then you cannot exclude this individual as a potential offspring. If there are no alleles in common, then the value returned is an empty vector.

loc3 <- locus( c(132,132))
dad.alleles <- alleles( loc3 )
intersect( mom.alleles, dad.alleles )

## character(0)

This logic can be shoved into a function. You have to wrap it into a set of curly brackets. I use the length of the result from the intersect() to return from the function. Potential values for

potential_offspring <- function( parent, offspring ) {
  off <- alleles( offspring )
  par <- alleles( loc2 )
  shared <- intersect( off, par )
  return( length( shared ) > 0 )
}

Now, you can call this function anytime you need, just passing it two genotypes. If they can be offspring it returns TRUE, as in the comparison between 128:130 and 128:128 genotypes.

potential_offspring(loc1, loc2)

## [1] TRUE

And it returns FALSE for the comparison between 128:128 and 132:132.

potential_offspring(loc2, loc3)

## [1] FALSE

4.2. Variable Scope

There is a lot more information on writing functions and we will get into that as we progress through the text. However, it is important that I bring this up now. The value assigned to a variable is defined by its scope. Consider the following code

x <- 10

and the function defined as

do_it <- function( x ) {
  x <- x + 10
  return( x )
}

When I call the function, the variable x that is the argument of the function is not the same variable that is in the environment that I assigned a value of 10. The x in the function argument is what we call “local to that function” in that within the curly brackets that follow (and any number of curly brackets nested within those, the value of x is given whatever was passed to the function.

4.3. Decision Making

We interact with our data in many ways and introspection of the values we have in the variables we are working with are of prime importance. Decision making in your code is where you evaluate your data and make a choice of outcomes based upon some criteria. Here is some example data that we can use as we explore the basics of if(), if(){} else{}, and if(){} elif(){} else{} coding patterns.

a. The if Pattern

The most basic version of decision making is asking a single question and if the answer is TRUE then do something. The if(){} function does this and has the form

if( CRITERIA ) {
    DO_SOMETHING
}

You pass a logical statement (or something that can be coerced into a logical type) to the function as the CRITERIA and if it evaluates to TRUE, then the contents of the DO_SOMETHING are executed. If the value of CRITERIA is not TRUE the DO_SOMETHING is skipped entirely—it is not even seen by the interpreter.

Here we can test this out using the loci defined above along with the is_heterozygote() function. This function takes one or more locus objects and returns TRUE/FALSE if they are or are not a heterozygote.

is_heterozygote( c(loc1, loc2) )

## [1]  TRUE FALSE

If we shove that function into the if() parameters we can use its evaluation of the heterozygous state of the locus to do something interesting, say tell us it is a heterozygote—it is admittedly a contrived example, but hey you try to make easy examples, it is not easy.

if( is_heterozygote(loc1) ){
  print("It's a het!")
}

## [1] "It's a het!"

If the is_heterozygote() function returns a value of FALSE, then the contents of the if() function (the stuff within the curly brackets is skipped entirely.

if( is_heterozygote(loc2) ){
  print("It's a het!")
}

Notice, there was no indication of any of that code inside the curly brackets. The if-else Pattern If there are more than on thing you want to potentially do when making a decision, you can add an else clause after the if pattern. Here if is_heterozygote() returns FALSE, the contents of the else{} clause will be executed. Here is the heterozygote example

if( is_heterozygote(loc1) ) {
  cat(loc1, "is a heterozygote")
} else {
  cat(loc1, "is a homozygote")
}

## 128:130 is a heterozygote

and the homozygote one

if( is_heterozygote(loc2) ) {
  cat(loc2, "is a heterozygote")
} else {
  cat(loc2, "is a homozygote")
}

## 128:128 is a homozygote

There is a slightly shorter version of this that is available for the lazy programmer and lets be honest, all programmers are lazy and the more you can accomplish with fewer strokes on the keyboard the better (this is how we got emacs and vim). I generally don’t teach the shortcuts up front, but this one is short and readily apparent so it may be more helpful than confusing. The ifelse() function has three parts, the condition, the result if TRUE, and the result if FALSE.

ans <- ifelse( is_heterozygote( c(loc1, loc2)) , "heterozygote", "Not")
ans

## [1] "heterozygote" "Not"

So iterating through the x vector, the condition x>0 is evaluated and if TRUE the sqrt() of the value is returned, else the NA is given. It is compact and easy to use so you may run into it often.

b. The `if-else` Pattern

It is possible to test many conditions in a single sequence by stringing together else-if conditions. The point that is important here is that the first condition that evaluates to TRUE will be executed and all remaining ones will be skipped, even if they also are logically TRUE. This means that it is important to figure out the proper order of asking your conditions. Here is an example function that determines if none, one, or both of the genotypes passed to it are heterozygotes. By default, I step through every one of the potential options of available on this comparison.
1. The first is a heterozygote and the second one isn’t 2. The first one isn’t and the second one is 3. Both are heterozygotes 4. The last state (both are not)

Here is the function.

which_is_het <- function( A, B) {
  if( is_heterozygote(A) & !is_heterozygote(B) ) {
    print("First is heterozygote")
  } else if( !is_heterozygote(A) & is_heterozygote(B) ){
    print("Second is heterozygote")
  } else if( is_heterozygote(A) & is_heterozygote(B) ){
    print("Both are heterozygotes")
  } else {
    print( "Neither are heterozygotes")
  }
}

It is possible that the order of these CRITERIA could be changed, the important thing to remember is that the sequence of if - else if - else if etc. will terminate the very first time one of the CRITERIA is evaluated to be TRUE.

4.4. Flow Control

Flow control is the process of iterating across objects and perhaps doing operations on those objects. The R language has several mechanisms that you can use to control the flow of a script or bit of code.

a. The `for()` Loop

x <- c(3,8,5,4,6)
x

## [1] 3 8 5 4 6

You can iterate through this vector using a for() loop. This is a simple function that has the form:

for( SOME_SEQUENCE ){
  DO_SOMETHING
}

Where the SOME_SEQUENCE component is a sequence of values either specified OR calculated and the DO_SOMETHING is the thing you want to do with each of the values in the sequence. Usually, there is a variable defined in the SOME_SEQUENCE component and the value of that variable is used. Here are a few examples. The first goes through the existing vector directly and assigns (in sequential order) the entries of ‘x’ to the variable val. We can then do whatever we want with the value in val (though if we change it, nothing happens to the original x vector).

for( val in x ){
  print(val)
}

## [1] 3
## [1] 8
## [1] 5
## [1] 4
## [1] 6

We can also specify a sequence directly and then use it as an index. Here I use an index variable named i to take on the integer seqeunce equal in length to the length of the original x variable. Then I can iterate through the original vector and use that index variable to grab the value I want.

for( i in 1:length(x)){
  print( x[i] )
}

## [1] 3
## [1] 8
## [1] 5
## [1] 4
## [1] 6

Both give us the same output, namely a way to go through the variable x. However, there may be a need to use the latter approach in your calculations. For example, perhaps I want to do some other operation on the values. In this very contrived example that follows, I want to perform operations on the values in x depending on if they are even or odd. For the odd ones, I add the corresponding value in y and if not I subtract it. Sure, this is totally contrived and I cannot think of a reason why I would be doing this, but if I need to know what index (row, column or whatever) an entry is during the iteration process, then I need to use this approach over the for( val in x) approach.

y <- 1:5
for( i in 1:length(x)){
  if( x[i] %% 2)
    print( x[i] + y[i])
  else
    print( x[i] - y[i] )
}

## [1] 4
## [1] 6
## [1] 8
## [1] 0
## [1] 1

b. Short Circuiting the Loop

It is possible to short circuit the looping process using the keywords next and break, though in my programming style, I consider their use in my source files as evidence of inelegant code. That said, you may need them on occasion.

The next keyword basically stops all commands after that during the current iteration of the loop. It does not terminate the loop itself, it just stops the commands that follow it this time through. Here is an example that uses the modulus operator, %% (e.g., the remainder after division), to print out only those numbers that are divisible by three.

for( i in 1:20 ){
  if( i %% 3 )
    next
  cat("The value of i =",i,"\n")
}

## The value of i = 3 
## The value of i = 6 
## The value of i = 9 
## The value of i = 12 
## The value of i = 15 
## The value of i = 18

The use of break to exit the loop entirely is perhaps more commonly encountered. When this keyword is encountered, the loop terminates immediately, as if it reached the send of the sequence.

for( i in 1:10){
  if( i > 2 )
    break
  cat("The value of i=",i,"\n")
}

## The value of i= 1 
## The value of i= 2

The word data is plural, datum is singular↩︎