Section 4 Dataframes and More

Once you understand the basics of R’s data types, some of its more advanced features start to make sense. Below, we’ll cover a few of them: factors, lists and, in particular, dataframes, which are used to store and analyse multiple rows and columns of data bundled together in a table.

4.1 Factors (categorical data)

Factors are how R represents categorical data. A factor has a fixed set of levels, which is set up when you first create the factor vector:

severity = sample(c("Moderate", "Severe"), 10, replace=TRUE)
# Setting 'levels' also sets the order of the levels
sev_factor = factor(severity, levels = c("Moderate", "Severe"))
sev_factor

##  [1] Moderate Moderate Severe   Moderate Severe   Severe   Moderate
##  [8] Severe   Moderate Severe  
## Levels: Moderate Severe
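To see what a factor stores, it can help to inspect it directly. The sketch below is illustrative only: it uses a hand-written `severity` vector (an assumption, chosen so the results don't depend on `sample()`), and the base R functions `levels()`, `nlevels()`, `as.integer()` and `table()`.

```r
# Hand-written input (illustrative), so the results are the same every run
severity = c("Moderate", "Severe", "Severe", "Moderate", "Severe")
sev_factor = factor(severity, levels = c("Moderate", "Severe"))

levels(sev_factor)       # the labels, in the order given to 'levels'
nlevels(sev_factor)      # the number of levels: 2
as.integer(sev_factor)   # the integer codes stored internally: 1 2 2 1 2
table(sev_factor)        # counts per level: Moderate 2, Severe 3
```

Functions such as `table()` report the levels in the order they were set, not in the order they happen to appear in the data.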

To test the values in a factor, compare them against the level labels:

sev_factor == "Moderate"

##  [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE

Factors can only contain values that match their levels. If you try to assign anything else, R produces a warning and stores NA instead:

# Not one of the levels that was set up when the factor was
#   created:
sev_factor[1] = "Mild"

## Warning in `[<-.factor`(`*tmp*`, 1, value = "Mild"): invalid factor level,
## NA generated

sev_factor

##  [1] <NA>     Moderate Severe   Moderate Severe   Severe   Moderate
##  [8] Severe   Moderate Severe  
## Levels: Moderate Severe
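If you do need a new category, one option is to extend the levels before assigning the new value. This is a minimal sketch rather than the only approach; the input vector is made up for illustration, and the decision to put "Mild" first in the ordering is an assumption.

```r
# Illustrative factor with the two original levels
sev_factor = factor(c("Moderate", "Severe", "Moderate"),
                    levels = c("Moderate", "Severe"))

# Re-create the factor with an extended set of levels ("Mild" placed first)
sev_factor = factor(sev_factor, levels = c("Mild", "Moderate", "Severe"))

# "Mild" is now a valid level, so this assignment gives no warning
sev_factor[1] = "Mild"
sev_factor
```

After this, the first element is Mild and the factor has three levels: Mild, Moderate and Severe.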

4.2 Lists

Lists in R can be used to store multiple different types of data together. Unlike vectors, each element of a list can be a different type:

list(1, FALSE, "c")

## [[1]]
## [1] 1
## 
## [[2]]
## [1] FALSE
## 
## [[3]]
## [1] "c"

One of the most useful features of lists is that you can give each element a name, and then access the data later using that name. This means you don’t have to remember things like “in these results, the 1st element is the mean and the 2nd is the variance”.

To access the named elements of a list, use a dollar sign $ or double square brackets [[]]:

results = list(mean = 4.83, variance = 1.09)
results$mean

## [1] 4.83

results[["variance"]]

## [1] 1.09
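A few related operations are worth knowing when working with named lists. The sketch below reuses the `results` list from above; the `sd` element and the `single` and `value` variables are added purely for illustration.

```r
results = list(mean = 4.83, variance = 1.09)

names(results)                        # "mean" "variance"

# Add a new named element after the list has been created
results$sd = sqrt(results$variance)

# Single brackets keep the list wrapper; double brackets extract the value
single = results["mean"]              # a list of length 1
value = results[["mean"]]             # the number 4.83 itself

is.list(single)                       # TRUE
is.numeric(value)                     # TRUE
```

The difference between `[` and `[[` matters when you pass the result to a function that expects a plain number, such as `sqrt()`.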

4.3 Dataframes!

The most common format for working with data is a table, with data arranged in rows and columns. R’s main format for tables of data is the dataframe. In a dataframe:

Each column is a vector (meaning each column can only contain one type of data: numeric, character, factor etc.)³
Each column has the same length (the number of rows in the dataframe)
Each column can be assigned a name, so you can access individual columns using df$column_name

Most of the time, you’ll read your data from a file (a spreadsheet, an SPSS file, etc.) and it will be read in as a dataframe. You can also create dataframes manually:

outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)
outcomes

##       Group    Sex DepressionScore
## 1   Control   Male               8
## 2 Treatment Female               4
## 3 Treatment   Male               6
## 4   Control Female               5
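Once a dataframe exists, a few base R functions give a quick overview of its shape and column types. This sketch rebuilds the `outcomes` dataframe from above so that it runs on its own; converting `Group` to a factor at the end is an optional, illustrative step.

```r
outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)

nrow(outcomes)       # number of rows: 4
ncol(outcomes)       # number of columns: 3
colnames(outcomes)   # "Group" "Sex" "DepressionScore"
str(outcomes)        # compact summary of each column's type

# Since R 4.0.0, data.frame() keeps text columns as character by default,
# so convert explicitly if you want a factor
outcomes$Group = factor(outcomes$Group, levels = c("Control", "Treatment"))
levels(outcomes$Group)   # "Control" "Treatment"
```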

4.3.1 Accessing parts of dataframes

Accessing a single column

To access a single column from a dataframe, you can use $, which will return a single vector:

outcomes$DepressionScore

## [1] 8 4 6 5
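Because `$` returns an ordinary vector, you can pass a column straight to any function that works on vectors, or assign to a new name to create a column. The sketch below rebuilds `outcomes` so it is self-contained; the `HighScore` column and its cutoff of 5 are hypothetical, chosen only for illustration.

```r
outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)

# A column is a plain vector, so vector functions work on it directly
mean(outcomes$DepressionScore)    # 5.75

# A dataframe is a list of columns, so list-style access also works
outcomes[["DepressionScore"]]     # same vector as outcomes$DepressionScore

# Assigning to a new name adds a column (illustrative cutoff of 5)
outcomes$HighScore = outcomes$DepressionScore > 5
outcomes$HighScore                # TRUE FALSE TRUE FALSE
```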

Selecting rows and columns

Accessing specific parts of a dataframe, e.g. keeping only the rows that pass a test, is similar to accessing parts of vectors. You use square brackets [] to index the dataframe, but you can specify both which rows and which columns you want. The basic syntax is:

# (don't run: just a general example)
df[rows_to_select, cols_to_select]

If you only care about the rows for the current task, you can leave the other part blank (and likewise for columns):

# Leave the columns part blank: keep all columns 
df[rows_to_select, ]
# Leave the rows part blank: keep all rows
df[, cols_to_select]

There are multiple ways to select the rows, just like with vectors. You can use a vector of numbers to select by position:

# First and third row
outcomes[c(1, 3), ]

##       Group  Sex DepressionScore
## 1   Control Male               8
## 3 Treatment Male               6
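Selecting by position supports the same shortcuts as vector indexing. The following sketch (with `outcomes` rebuilt so it runs on its own, and variable names chosen for illustration) shows a range, a negative index and `head()`.

```r
outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)

middle_rows = outcomes[2:3, ]     # a range: rows 2 and 3
without_first = outcomes[-1, ]    # negative index: drop row 1, keep 2 to 4
first_two = head(outcomes, 2)     # head() is a shortcut for the first rows
```

Note that the row names in the result keep their original numbers, so `without_first` has rows labelled 2, 3 and 4.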

You can also use a logical index: a vector of TRUE/FALSE the same length as the number of rows in the dataframe. Most of the time, you’ll create that logical index by testing one or more columns in the dataframe:

outcomes[outcomes$Group == "Treatment", ]

##       Group    Sex DepressionScore
## 2 Treatment Female               4
## 3 Treatment   Male               6
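Tests on several columns can be combined with `&` (and) or `|` (or) to build a single logical index. This is a small sketch using the same `outcomes` data; the particular cutoffs and variable names are arbitrary and only there to illustrate the idea.

```r
outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)

# Both conditions must be TRUE: only row 3 qualifies
treated_high = outcomes[outcomes$Group == "Treatment" &
                        outcomes$DepressionScore > 5, ]

# Either condition may be TRUE: rows 1, 2 and 4 qualify
female_or_high = outcomes[outcomes$Sex == "Female" |
                          outcomes$DepressionScore >= 8, ]
```

Use the single-character operators `&` and `|` here: they work element by element, which is what a row index needs.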

There are also multiple ways to select columns. By position:

outcomes[, c(1, 3)]

##       Group DepressionScore
## 1   Control               8
## 2 Treatment               4
## 3 Treatment               6
## 4   Control               5

By name:

outcomes[, c("Group", "DepressionScore")]

##       Group DepressionScore
## 1   Control               8
## 2 Treatment               4
## 3 Treatment               6
## 4   Control               5
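Row and column selections can be combined in one step. One detail to watch for: selecting a single column with `[, ]` returns a plain vector unless you add `drop = FALSE`. The sketch below illustrates both points with the `outcomes` data; the variable names are chosen for illustration.

```r
outcomes = data.frame(
    Group = c("Control", "Treatment", "Treatment", "Control"),
    Sex = c("Male", "Female", "Male", "Female"),
    DepressionScore = c(8, 4, 6, 5)
)

# Rows by test and columns by name at the same time
female_scores = outcomes[outcomes$Sex == "Female",
                         c("Group", "DepressionScore")]

# A single column is simplified to a vector by default...
as_vector = outcomes[, "DepressionScore"]

# ...unless drop = FALSE is used to keep it as a one-column dataframe
as_dataframe = outcomes[, "DepressionScore", drop = FALSE]
```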

Or using a logical vector (this one is a bit less useful, unless the column names follow a pattern). Here is an example using the built-in iris dataset, which contains measurements of flowers from different species. It assumes the stringr package is installed:

# Find all columns that contain "Petal"
petal_columns = stringr::str_detect(colnames(iris), "Petal")
head(iris[, petal_columns])

##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## 4          1.5         0.2
## 5          1.4         0.2
## 6          1.7         0.4
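If stringr isn't installed, the same logical index can be built with base R. This is an alternative sketch, not the method used above: `grepl()` with `fixed = TRUE` looks for the literal text "Petal", and `startsWith()` is a stricter variant that only matches at the start of the name.

```r
# Base R equivalent of stringr::str_detect() for a literal string
petal_columns = grepl("Petal", colnames(iris), fixed = TRUE)
petal_columns                      # FALSE FALSE TRUE TRUE FALSE

# Stricter alternative: the name must begin with "Petal"
petal_prefix = startsWith(colnames(iris), "Petal")

head(iris[, petal_columns])
```

For the iris column names both approaches select the same two columns, Petal.Length and Petal.Width.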

³ A dataframe is basically just a list of vectors: everything in R is built out of the same basic pieces.