Chapter 2 Data Frames

# Load my favorite packages: dplyr, ggplot2, forcats, readr, and stringr
# but don't spit out all the messages about what all was loaded and the conflict
# information. It is good to look, though.
suppressPackageStartupMessages({
  library(tidyverse, quietly = TRUE)
})

As always, there is a Video Lecture that accompanies this chapter.

Data frames are the fundamental unit of data storage that casual users of R need to work with. Conceptually they are just like a single tab in a spreadsheet (e.g. Excel) file. There are multiple rows and columns and each column is of the same type of information (e.g. numerical values, dates, or character strings) and each row represents a single observation.

Because the columns have meaning and we generally give them column names, it is desirable to want to access an element by the name of the column as opposed to the column number. While writing formulas in large Excel spreadsheets I often get annoyed trying to remember which column something was in and muttering “Was total biomass in column P or Q?” A system where I could just name the column Total_Biomass and then always refer to it that way, is much nicer to work with and I make fewer dumb mistakes.

In this chapter we will briefly cover the minimal set of tools for working with data frames. First we discuss how to import data sets, both from packages and from appropriately formatted Excel and .csv files. Finally we’ll see how to create a data frame “by hand” and to access columns and do simple manipulations.

In this chapter, we will focus on standard R data frame manipulations so that readers gain basic familiarity with non-tidyverse accessor methods.

2.1 Introduction to Importing Data

2.1.1 From a Package

For many students, they will be assigned homework that utilizes data sets that are stored in some package. To access those, we would need to first install the package if we haven’t already. Recall to do that, we can use the Rstudio menu bar “Tools -> Install Packages…” mouse action.

Because we might have thousands of packages installed on a computer, and those packages might all have data sets associated with them, they aren’t loaded into memory by default. Instead we have to go through a two-step process of making sure that the package is installed on the computer, and then load the desired data set into the running session of R. Once the package is installed, we can load the data into our session via the following command:

data('alfalfa', package='faraway')   # load the data set 'alfalfa' from the package 'faraway'

Because R tries to avoid loading datasets until it is sure that you need them, the object alfalfa isn’t initially loaded as a data.frame but rather as a “promise” that it eventually will be loaded whenever you first use it. So lets first access it by viewing it.

View(alfalfa)

There are two ways to enter the view command. Either executing the View() function from the console, or clicking on either the white table or the object name in the Environment tab.

2.1.2 Import from .csv or .xls files

Often times data is stored in a “Comma Separated Values” file (with the file suffix of .csv) where the rows in the file represent the data frame rows, and the columns are just separated by commas. The first row of the file is usually the column titles.

Alternatively, the data might be stored in an Excel file and we just need to tell R where the file is and which worksheet tab to import.

The hardest part for people that are new to programming is giving the path to the data file. In this case, I recommend students use the data import wizard that RStudio includes which is accessed via ‘File -> Import Dataset.’ This will then give you a choice of file types to read from (.csv files are in the “Text” options). Once you have selected the file type to import, the user is presented with a file browser window where the desired file should be located. Once the file is chosen, we can import the file.

Critically, we should notice that the import wizard generates R code that does the actual import. We MUST copy that code into our Rmarkdown file or else the import won’t happen when we try to knit the Rmarkdown into an output document because knitting always occurs in a completely fresh R session. So only use the import wizard to generate the import code! The code generated by the import wizard ends with a View() command and I typically remove that as it can interfere with the knitting process. The code that I’ll paste into my RMarkdown file typically looks like this:

library(readxl)
Melioid_IgG <- read_excel("~/Dropbox/NAU/MAGPIX serology/Data/Melioid_IgG.xlsx")
# View(Melioid_IgG)

2.2 Data Types

Data frames are required that each column have the same type. That is to say, if a column is numeric, you can’t just change one value to a character string. Below are the most common data types that are used within R.

  1. Integers - These are the integer numbers \(\left(\dots,-2,-1,0,1,2,\dots\right)\). To convert a numeric value to an integer you may use the function as.integer().

  2. Numeric - These could be any number (whole number or decimal). To convert another type to numeric you may use the function as.numeric().

  3. Strings - These are a collection of characters (example: Storing a student’s last name). To convert another type to a string, use as.character().

  4. Factors - These are strings that can only values from a finite set. For example we might wish to store a variable that records home department of a student. Since the department can only come from a finite set of possibilities, I would use a factor. Factors are categorical variables, but R calls them factors instead of categorical variable. A vector of values of another type can always be converted to a factor using the as.factor() command. For converting numeric values to factors, I will often use the function cut().

  5. Logicals - This is a special case of a factor that can only take on the values TRUE and FALSE. (Be careful to always capitalize TRUE and FALSE. Because R is case-sensitive, TRUE is not the same as true.) Using the function as.logical() you can convert numeric values to TRUE and FALSE where 0 is FALSE and anything else is TRUE.

Depending on the command, R will coerce your data from one type to another if necessary, but it is a good habit to do the coercion yourself. If a variable is a number, R will automatically assume that it is continuous numerical variable. If it is a character string, then R will assume it is a factor when doing any statistical analysis.

Most of these types are familiar to beginning R users except for factors. Factors are how R keeps track of categorical variables. R does this in a two step pattern. First it figures out how many categories there are and remembers which category an observation belongs to and second, it keeps a vector of character strings that correspond to the names of each of the categories.

# A character vector
y <- c('B','B','A','A','C')
y
## [1] "B" "B" "A" "A" "C"
# convert the vector of characters into a vector of factors 
z <- factor(y)
str(z)
##  Factor w/ 3 levels "A","B","C": 2 2 1 1 3

Notice that the vector z is actually the combination of group assignment vector 2,2,1,1,3 and the group names vector “A”,”B”,”C”. So we could convert z to a vector of numerics or to a vector of character strings.

as.numeric(z)
## [1] 2 2 1 1 3
as.character(z)
## [1] "B" "B" "A" "A" "C"

Often we need to know what possible groups there are, and this is done using the levels() command.

levels(z)
## [1] "A" "B" "C"

Notice that the order of the group names was done alphabetically, which we did not chose. This ordering of the levels has implications when we do an analysis or make a plot and R will always display information about the factor levels using this order. It would be nice to be able to change the order. Also it would be really nice to give more descriptive names to the groups rather than just the group code in my raw data. Useful functions for controling the order and labels of the factor can be found in the forcats package which we use in a later chapter.

2.3 Basic Manipulation

Occasionally I’ll need to create a small data frame “by hand” to facilitate creating graphs in ggplot2. In this final section, we’ll cover creating a data frame and doing simple manipulations using the base R commands and syntax.

To create a data frame, we have to squish together a bunch of column vectors. The command data.frame() does exactly that. In the example below, I list the names, ages and heights (in inches) of my family.

family <- data.frame(
  Names = c('Derek', 'Aubrey', 'Elise', 'Casey'),
  Age   = c(42, 39, 6, 3),
  Height.in = c(64, 66, 43, 39) 
)
family
##    Names Age Height.in
## 1  Derek  42        64
## 2 Aubrey  39        66
## 3  Elise   6        43
## 4  Casey   3        39

To access a particular column, we could use the $ operator. We could then do something like calculate the mean or standard deviation.

family$Age
## [1] 42 39  6  3
mean( family$Age )
## [1] 22.5
sd( family$Age )
## [1] 20.85665

As an alternative to the “$” operator, we could use the [row, column] notation. To select a particular row or column, we can select them by either name or location.

family[ , 'Age']    # all the rows, Age column
## [1] 42 39  6  3
family[ 2, 'Age']  # age of person in row 2
## [1] 39

Next we could calculate everyone’s height in centimeters by multiplying the heights by 2.54 and saving the result in column appropriately named.

family$Height.cm <- family$Height.in * 2.54  # calculate the heights and save them!
family                                       # view our result!
##    Names Age Height.in Height.cm
## 1  Derek  42        64    162.56
## 2 Aubrey  39        66    167.64
## 3  Elise   6        43    109.22
## 4  Casey   3        39     99.06

2.4 Exercises

  1. Create a data frame “by hand” with the names, ages, and heights of your own family. If this feels excessively personal, feel free to make up people or include pets.

  2. Calculate the mean age among your family.

  3. I have a spreadsheet file hosted on GitHub at https://raw.githubusercontent.com/dereksonderegger/570L/master/data-raw/Example_1.csv. Because the readr package doesn’t care whether a file is on your local computer or on the Internet, we’ll use this file.

    1. Start the import wizard using: “File -> Import Dataset -> From Text (readr) …” and input the above web URL. Click the update button near the top to cause the wizard to preview the result.
    2. Save the generated code to your Rmarkdown file and show the first few rows using the head() command.