Chapter 2 Data Frames

# It is good practice to load the packages important to the RMD file at the top of 
# the script. Extra syntax is provided here that suppresses the startup messages,
# which for some packages (such as tidyverse) are extraordinarily long.
# For this chapter we might want to load some common packages: dplyr, ggplot2,
# forcats, readr, and stringr. Let's load the packages from the `tidyverse` suite, but
# suppress the tidyverse message output. In future chapters, I will only show the packages
# necessary for that chapter. You may hide this information in your own script/RMD files.
suppressPackageStartupMessages({
  library(tidyverse, quietly = TRUE)
})

Dr. Sonderegger’s Video Companion: Video Lecture.

2.1 Data Frames

Data frames are the fundamental unit of data storage that casual users of R need to work with. Conceptually they are just like a single tab in a spreadsheet (e.g. Excel) file. There are multiple rows and columns and each column is of the same type of information (e.g. numerical values, dates, or character strings) and each row represents a single observation.

Because the columns have meaning and we generally give them column names, it is desirable to access an element by the name of the column as opposed to the column number. While writing formulas in large Excel spreadsheets I often get annoyed trying to remember which column something was in and muttering “Was total biomass in column P or Q?” A system where I could just name the column Total_Biomass and then always refer to it that way is much nicer to work with, and I make fewer mistakes.

In this chapter we will briefly cover the minimal set of tools for working with data frames. First we discuss how to import data sets, both from packages and appropriately formatted Excel and .csv files. Finally we’ll see how to create a data frame “by hand” and to access columns and do simple manipulations.

In this chapter, we will focus on standard R data frame manipulations so that readers gain basic familiarity with non-tidyverse accessor methods. This is our first building block before moving to more advanced functionality. When in doubt, there is always a base-R route to a solution, but we will find that R packages can make our lives much easier when used properly!

2.2 Introduction to Importing Data

2.2.1 From a Package

Many students will be assigned homework that utilizes data sets stored in a package. To access those, we need to first install the package if we haven’t already. Recall that to do that, we can use the RStudio menu bar Tools -> Install Packages... mouse action or the install.packages() function.

Because we might have thousands of packages installed on a computer, and those packages might all have data sets associated with them, they aren’t loaded into memory by default. Instead we have to go through a two-step process of making sure that the package is installed on the computer, and then load the desired data set into the local session of R. Once the package is installed, we can load the data into our session via the following command:

data('alfalfa', package='faraway')   # load the data set 'alfalfa' from the package 'faraway'

Because R tries to avoid loading datasets until it is sure that you need them, the object alfalfa isn’t initially loaded as a data.frame but rather as a “promise” that it eventually will be loaded whenever you first use it. So let’s first access it by viewing it.

View(alfalfa)

There are two ways to view the data: either execute the View() function from the console, or click on either the white table icon or the object name in the Environment tab. It is best to remove any View() calls from your final RMD files, because each one opens a viewer window every time the code is run. Instead, use View() to digest the data, then remove it so that it does not interfere with your final “knit” document.
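
If you still want a quick look at a data set inside a document that will be knit, printing functions such as head(), str(), and dim() are a reasonable substitute for View(). The sketch below assumes the built-in iris data set from the datasets package as a stand-in for alfalfa, only so that it runs without installing anything.

```r
# Load a data set that ships with R (the 'datasets' package). It is used here
# only as a stand-in for 'alfalfa' so that no extra package is required.
data('iris', package = 'datasets')

head(iris, n = 3)   # print the first three rows
str(iris)           # compact summary: column names, types, and first values
dim(iris)           # number of rows, then number of columns
```

Unlike View(), these commands simply print text, so they behave the same way in the console and in a knit document.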

2.2.2 Import from .csv or .xls files

Often data is stored in a Comma Separated Values (CSV) file where the rows in the file represent the data frame rows, and the columns are separated by commas. The first row of the file usually holds the column titles. Alternatively, the data might be stored in an Excel file and we just need to tell R where the file is and which worksheet tab to import.

The hardest part for people who are new to programming is giving the path to the data file. In this case, I recommend students use the data import wizard that RStudio includes, which is accessed via File -> Import Dataset. This will then give you a choice of file types to read from (.csv files are in the “Text” options). Once you have selected the file type to import, you are presented with a file browser window in which to locate the desired file. Once the file is chosen, we can import the file.

Critically, we should notice that the import wizard generates R code that does the actual import. We MUST copy that code into our Rmarkdown file or else the import won’t happen when we try to knit the Rmarkdown into an output document, because knitting always occurs in a completely fresh R session. So only use the import wizard to generate the import code! The code generated by the import wizard ends with a View() command, and I typically remove that as it can interfere with the knitting process. The code that I’ll paste into my Rmarkdown file typically looks like this (be mindful that you cannot run code that requires access to my Dropbox; this is just an example):

library(readxl)
Melioid_IgG <- read_excel("~/Dropbox/NAU/MAGPIX serology/Data/Melioid_IgG.xlsx")
# View(Melioid_IgG)

A nice property of working within an RMD file is that the working directory is set to the folder that contains the RMD file. This can make loading data smoother and is another important reason for using projects and R Markdown files to improve efficiency. You may also wish to try loading the data using simple commands such as read.csv() within your script/RMD file, by placing the data file adjacent to your RMD file. These types of ideas will be covered in more advanced sections later in the textbook; for now feel free to use the import wizard mentioned above, just be sure to copy the code into your RMD file!
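
As a minimal sketch of what read.csv() does, the code below writes a tiny CSV file and reads it back. The file contents are made up for illustration, and tempfile() is used only so the example runs anywhere; in your own work the file would already exist next to your RMD file and you would pass its name directly.

```r
# Write a two-row CSV file to a temporary location (made-up data).
csv_path <- tempfile(fileext = '.csv')
writeLines(c('Name,Age',
             'Ann,34',
             'Bob,29'),
           csv_path)

# Read the file back in. The first row of the file becomes the column names.
small_data <- read.csv(csv_path)
small_data
str(small_data)   # check which type R chose for each column
```

If the file were named my_data.csv (a hypothetical name) and stored beside the RMD file, the call would simply be read.csv('my_data.csv').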

2.3 Data Types

Data frames require that every value within a column have the same type. That is to say, if a column is numeric, you can’t just change one value to a character string. Below are the most common data types that are used within R.

  1. Integers - These are the integer numbers \(\left(\dots,-2,-1,0,1,2,\dots\right)\). To convert a numeric value to an integer you may use the function as.integer().

  2. Numeric - These could be any number (whole number or decimal). To convert another type to numeric you may use the function as.numeric().

  3. Strings - These are a collection of characters (example: Storing a student’s last name). To convert another type to a string, use as.character().

  4. Factors - These are strings that can only take values from a finite set. For example, we might wish to store a variable that records the home department of a student. Since the department can only come from a finite set of possibilities, I would use a factor. Factors are categorical variables, but R calls them factors instead of categorical variables. A vector of values of another type can always be converted to a factor using the as.factor() command. For converting numeric values to factors, I will often use the function cut().

  5. Logicals - This is a special case of a factor that can only take on the values TRUE and FALSE, sometimes referred to as Boolean. (Be careful to always capitalize TRUE and FALSE. Because R is case-sensitive, TRUE is not the same as true.) Using the function as.logical() you can convert numeric values to TRUE and FALSE where 0 is FALSE and anything else is TRUE.
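
The conversion functions named in the list above can be tried directly at the console. The values, break points, and labels below are made up for illustration.

```r
as.integer(3.9)            # drops the decimal part: 3
as.numeric('2.5')          # character string to number: 2.5
as.character(42)           # number to character string: "42"

# Character strings to a factor; levels are sorted alphabetically
dept <- as.factor(c('Math', 'Biology', 'Math'))
levels(dept)               # "Biology" "Math"

# cut() bins numeric values into a factor with one level per interval
ages <- c(3, 6, 39, 42)
age_group <- cut(ages, breaks = c(0, 18, 100), labels = c('child', 'adult'))
age_group

as.logical(c(0, 1, -2))    # FALSE TRUE TRUE
```

Note that as.integer() truncates rather than rounds; use round() first if rounding is what you want.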

Depending on the command, R will coerce your data from one type to another if necessary, but it is a good habit to do the coercion yourself. If a variable is a number, R will automatically assume that it is a continuous numerical variable. If it is a character string, then R will assume it is a factor when doing any statistical analysis. It is always good practice to double check that R is using the data as intended, as this can have major consequences in statistical modeling, where a continuous variable behaves much differently than a categorical variable. Such cases can arise if a value like 1 is actually denoting a group and not a numerical value.
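
The sketch below illustrates the situation described above, using hypothetical group codes that happen to be stored as numbers, along with one example of automatic coercion.

```r
# Hypothetical treatment-group codes stored as numbers
group <- c(1, 2, 1, 3)
class(group)          # "numeric": R would treat this as a continuous variable
mean(group)           # runs without complaint, but the result is meaningless

# Explicitly declare that the codes are categories
group_f <- factor(group)
class(group_f)        # "factor"
levels(group_f)       # "1" "2" "3"

# Automatic coercion: mixing numbers and strings in one vector gives all strings
c(1, 'a')             # "1" "a"
```

Checking class() before fitting a model is a cheap way to confirm that R sees each variable the way you intend.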

Most of these types are familiar to beginning R users except for factors. Factors are how R keeps track of categorical variables. R does this in a two-step pattern. First it figures out how many categories there are and remembers which category each observation belongs to. Second, it keeps a vector of character strings that correspond to the names of each of the categories.

# A character vector
y <- c('B','B','A','A','C')
y
## [1] "B" "B" "A" "A" "C"
# convert the vector of characters into a vector of factors 
z <- factor(y)
str(z)
##  Factor w/ 3 levels "A","B","C": 2 2 1 1 3

Notice that the vector z is actually the combination of the group assignment vector 2,2,1,1,3 and the group names vector “A”,“B”,“C”. So we could convert z to a vector of numerics or to a vector of character strings.

as.numeric(z)
## [1] 2 2 1 1 3
as.character(z)
## [1] "B" "B" "A" "A" "C"

Often we need to know what possible groups there are, and this is done using the levels() command.

levels(z)
## [1] "A" "B" "C"

Notice that the group names were ordered alphabetically, which we did not choose. This ordering of the levels has implications when we do an analysis or make a plot, and R will always display information about the factor levels using this order. It would be nice to be able to change the order. Also it would be really nice to give more descriptive names to the groups rather than just the group code in my raw data. Useful functions for controlling the order and labels of the factor can be found in the forcats package. We will learn how to manipulate factors more in a later chapter, but it is never too early to play around!
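
If you want to play around before reaching the forcats material, base R’s factor() function also accepts levels and labels arguments. The sketch below is a base-R alternative rather than the forcats approach, and the descriptive labels are hypothetical.

```r
y <- c('B', 'B', 'A', 'A', 'C')

# Choose the level order explicitly and attach more descriptive labels.
# The i-th label is applied to the i-th level: C -> Control, B -> Low dose, ...
z2 <- factor(y,
             levels = c('C', 'B', 'A'),
             labels = c('Control', 'Low dose', 'High dose'))
z2
levels(z2)    # "Control" "Low dose" "High dose"
table(z2)     # counts are reported in the new level order
```

The forcats functions fct_relevel() and fct_recode() cover the same two tasks, reordering and relabeling, with a friendlier interface.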

2.4 Basic Manipulation

Occasionally I’ll need to create a small data frame “by hand” to facilitate creating graphs in ggplot2. In this final section, we’ll cover creating a data frame and doing simple manipulations using the base R commands and syntax. To create a data frame, we have to squish together a bunch of column vectors. The command data.frame() does exactly that. The example below provides the names, ages, and heights (in inches) of the Sonderegger family (the Buscaglia family is too small of an example to be useful).

family <- data.frame(
  Names = c('Derek', 'Aubrey', 'Elise', 'Casey'),
  Age   = c(42, 39, 6, 3),
  Height.in = c(64, 66, 43, 39) 
)
family
##    Names Age Height.in
## 1  Derek  42        64
## 2 Aubrey  39        66
## 3  Elise   6        43
## 4  Casey   3        39
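
Once a data frame exists, a few base R commands report its structure. The sketch below rebuilds the same family data frame so that it runs on its own.

```r
family <- data.frame(
  Names = c('Derek', 'Aubrey', 'Elise', 'Casey'),
  Age   = c(42, 39, 6, 3),
  Height.in = c(64, 66, 43, 39)
)

str(family)     # the type of each column and its first few values
names(family)   # the column names
nrow(family)    # number of rows (observations)
ncol(family)    # number of columns (variables)
```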

To access a particular column, we could use the $ operator. We could then do something like calculate the mean or standard deviation. Here are some examples of using the $ operator for statistical computation.

family$Age
## [1] 42 39  6  3
mean( family$Age )
## [1] 22.5
sd( family$Age )
## [1] 20.85665

As an alternative to the “$” operator, we could use the [row, column] notation. The first dimension of an array is always the row and the second is always the column; arrays holding more complex data can have further dimensions (3rd, 4th, 5th, and so on). Always remember: the first entry is the row, the second entry is the column! We can select a particular row or column by either name or location.

family[ , 'Age'] # all the rows of the Age column (extract the Age column)
## [1] 42 39  6  3
family[ 2, 'Age'] # age of person in row 2
## [1] 39
family[ , 2] # extract the second column, same as using 'Age'
## [1] 42 39  6  3
family[ 1, ] # extract the first row
##   Names Age Height.in
## 1 Derek  42        64
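
The same [row, column] notation also accepts several names at once or a logical condition in the row position. The sketch below rebuilds the family data frame so that it runs on its own; the age cutoff of 18 is an arbitrary choice for illustration.

```r
family <- data.frame(
  Names = c('Derek', 'Aubrey', 'Elise', 'Casey'),
  Age   = c(42, 39, 6, 3),
  Height.in = c(64, 66, 43, 39)
)

family[ , c('Names', 'Age')]         # two columns, returned as a data frame
family[ family$Age >= 18, ]          # only the rows where Age is at least 18
family[ family$Age < 18, 'Names']    # names of everyone under 18
family[['Age']]                      # double brackets: same result as family$Age
```

A condition such as family$Age >= 18 produces a vector of TRUE and FALSE values, and only the rows marked TRUE are kept.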

Next we could calculate everyone’s height in centimeters by multiplying the heights by 2.54 and saving the result in an appropriately named column. Notice that the syntax provided below actually creates a new column!

family$Height.cm <- family$Height.in * 2.54  # calculate the heights and save them!
family                                       # view our result!
##    Names Age Height.in Height.cm
## 1  Derek  42        64    162.56
## 2 Aubrey  39        66    167.64
## 3  Elise   6        43    109.22
## 4  Casey   3        39     99.06
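
Assignment works the same way with the bracket notation, and assigning NULL to a column removes it. This is a small sketch that rebuilds the family data frame so it runs on its own; the column name Is.Adult and the cutoff of 18 are made up for illustration.

```r
family <- data.frame(
  Names = c('Derek', 'Aubrey', 'Elise', 'Casey'),
  Age   = c(42, 39, 6, 3),
  Height.in = c(64, 66, 43, 39)
)

family[ , 'Height.cm'] <- family$Height.in * 2.54   # same effect as family$Height.cm <- ...
family$Is.Adult <- family$Age >= 18                  # a new logical column
family$Height.in <- NULL                             # assigning NULL deletes a column
family
```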

2.5 Exercises

Exercise 1

Create a data frame “by hand” with the names, ages, and heights of your own family. If this feels excessively personal, feel free to make up people or include pets.

Exercise 2

Calculate the mean age among your family.

Exercise 3

We can load data directly from the internet. Dr. Sonderegger has many spreadsheet files hosted on GitHub. Let’s use this file https://raw.githubusercontent.com/dereksonderegger/570L/master/data-raw/Example_1.csv. Because the readr package doesn’t care whether a file is on your local computer or on the Internet, we’ll use this file by downloading it directly through our R code.

a) Start the import wizard using: File -> Import Dataset -> From Text (readr) ... and input the above web URL. Click the update button near the top to cause the wizard to preview the result.

b) Save the generated code to your Rmarkdown file and show the first few rows using the head() command.