Section 4 Dataframes and More

Once you understand the basics of R’s data types, some of the more advanced features of R start to make sense. Below, we’ll cover some of these features. In particular, we’ll discuss dataframes, which are used to store and analyse multiple rows and columns of data bundled together in a table.

4.1 Factors (categorical data)

Factors are how R represents categorical data. They have a fixed set of levels, which are set up when you first create a factor vector:

##  [1] Moderate Moderate Severe   Moderate Severe   Severe   Moderate
##  [8] Severe   Moderate Severe  
## Levels: Moderate Severe
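The code that produced this output isn't shown, but a minimal sketch along the following lines would print the same thing. The variable name severity is an assumption; the values are copied from the output above.

```r
# Hypothetical variable name; values copied from the printed output above
severity <- factor(c("Moderate", "Moderate", "Severe", "Moderate", "Severe",
                     "Severe", "Moderate", "Severe", "Moderate", "Severe"))

severity          # prints the values, followed by the levels
levels(severity)  # the fixed set of values this factor can hold
```

By default, factor() takes the levels from the unique values in the data and sorts them, which is why only Moderate and Severe appear here.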

When you’re testing a factor, you use the label to test it:

##  [1]  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE
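A test that would give the TRUE/FALSE pattern above might look like the sketch below. It assumes the same hypothetical severity factor, and that the label being tested is "Moderate" (which matches the output shown).

```r
# Hypothetical variable name; values copied from the earlier output
severity <- factor(c("Moderate", "Moderate", "Severe", "Moderate", "Severe",
                     "Severe", "Moderate", "Severe", "Moderate", "Severe"))

# Compare against the label (a character string), not an underlying number
is_moderate <- severity == "Moderate"
is_moderate
```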

Factors can only contain data that matches their levels. If you try to assign a value that isn't one of the levels, R produces a warning and stores NA instead:

## Warning in `[<-.factor`(`*tmp*`, 1, value = "Mild"): invalid factor level,
## NA generated
##  [1] <NA>     Moderate Severe   Moderate Severe   Severe   Moderate
##  [8] Severe   Moderate Severe  
## Levels: Moderate Severe
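The warning text above mentions assigning the value "Mild" to element 1, so the code was probably something like this sketch (again assuming a factor named severity, which is a hypothetical name):

```r
# Hypothetical variable name; values copied from the earlier output
severity <- factor(c("Moderate", "Moderate", "Severe", "Moderate", "Severe",
                     "Severe", "Moderate", "Severe", "Moderate", "Severe"))

# "Mild" is not one of the levels, so this gives a warning
# and the first element becomes NA
severity[1] <- "Mild"
severity
```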

4.2 Lists

Lists in R can be used to store multiple different types of data together. Unlike vectors, each element of a list can be a different type:

## [[1]]
## [1] 1
## 
## [[2]]
## [1] FALSE
## 
## [[3]]
## [1] "c"
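A list like the one printed above can be created with list(). This is a small sketch; the name mixed is just for illustration.

```r
# Hypothetical name; elements match the output above
mixed <- list(1, FALSE, "c")
mixed

# Each element keeps its own type
sapply(mixed, class)
```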

One of the most useful features of lists is that you can give each element a name, and then access the data later using the name. This means you don’t have to remember things like “in these results, the 1st element is the mean and the 2nd is the variance”.

To access the named elements of a list, use a dollar sign $ or double square brackets [[]]:

## [1] 4.83
## [1] 1.09
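As a sketch of how named access works, the example below stores the two numbers from the output above in a named list. The list name results and the element names mean and variance are assumptions for illustration; the original code isn't shown.

```r
# Hypothetical names; the two values are taken from the output above
results <- list(mean = 4.83, variance = 1.09)

results$mean           # access by name with $
results[["variance"]]  # access by name with double square brackets
```

Both forms return the element itself (here, a single number), so you can use the result directly in further calculations.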

4.3 Dataframes!

The most common format for working with data is in a table, with data arranged in rows and columns. R’s main format for tables of data is the dataframe. In a dataframe:

  • Each column is a vector (meaning each column can only contain one type of data: numeric, character, factor etc.) [1]
  • Each column has the same length (the number of rows in the dataframe)
  • Each column can be assigned a name, so you can access individual columns using df$column_name

Most of the time, you’ll read your data from a file (a spreadsheet, an SPSS file, etc.) and it will be read in as a dataframe. You can also create dataframes manually:

##       Group    Sex DepressionScore
## 1   Control   Male               8
## 2 Treatment Female               4
## 3 Treatment   Male               6
## 4   Control Female               5
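A dataframe like the one above can be built with data.frame(), giving one vector per column. This is a sketch: the name df is an assumption, and the values are copied from the printed table.

```r
# Hypothetical name df; values copied from the table above
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)
df
```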

4.3.1 Accessing parts of dataframes

Accessing a single column

To access a single column from a dataframe, you can use $, which will return a single vector:

## [1] 8 4 6 5
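For example, assuming the dataframe is called df (a hypothetical name) and holds the data shown earlier:

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# $ pulls out one column as a plain vector
scores <- df$DepressionScore
scores
mean(scores)  # the vector can be used like any other
```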

Selecting rows and columns

Accessing specific parts of a dataframe, e.g. by filtering out rows based on a test, is similar to accessing parts of vectors. You use square brackets [] to index the dataframe, but you can specify both which rows you want and which columns. The basic syntax is df[rows, columns]: the row selection goes before the comma and the column selection goes after it.

If you only care about the rows for the current task, you can leave the column part blank, as in df[rows, ]. Likewise, you can leave the row part blank to select only columns, as in df[, columns].
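Here is a small sketch of the square-bracket syntax in action, using the example data from earlier (the name df is an assumption):

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

df[2, 3]  # row 2, column 3: a single value
df[2, ]   # row 2, all columns (column part left blank)
df[, 3]   # all rows, column 3 (row part left blank)
```

Note that the comma is still needed even when one side is left blank.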

There are multiple ways to select the rows, just like with vectors. You can use a vector of numbers to select by position:

##       Group  Sex DepressionScore
## 1   Control Male               8
## 3 Treatment Male               6
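The output above shows rows 1 and 3, so the selection was presumably along these lines (a sketch, with df as a hypothetical name):

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Select rows 1 and 3 by position, keeping all columns
selected <- df[c(1, 3), ]
selected
```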

You can also use a logical index: a vector of TRUE/FALSE the same length as the number of rows in the dataframe. Most of the time, you’ll create that logical index by testing one or more columns in the dataframe:

##       Group    Sex DepressionScore
## 2 Treatment Female               4
## 3 Treatment   Male               6
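The rows printed above are the ones in the Treatment group, so a test like the one in this sketch would produce them. The name df is an assumption, and the second example (combining two tests with &) is an extra illustration that isn't in the output above.

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# The test gives one TRUE/FALSE per row; only the TRUE rows are kept
treated <- df[df$Group == "Treatment", ]
treated

# Tests on more than one column can be combined with & (and) or | (or)
treated_males <- df[df$Group == "Treatment" & df$Sex == "Male", ]
treated_males
```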

There are also multiple ways to select columns - by position:

##       Group DepressionScore
## 1   Control               8
## 2 Treatment               4
## 3 Treatment               6
## 4   Control               5
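Selecting the first and third columns by position might look like this sketch (df is a hypothetical name):

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Row part left blank: keep all rows, but only columns 1 and 3
by_position <- df[, c(1, 3)]
by_position
```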

By name:

##       Group DepressionScore
## 1   Control               8
## 2 Treatment               4
## 3 Treatment               6
## 4   Control               5
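The same selection by name could be written as in the sketch below (df is again a hypothetical name). Selecting by name is usually easier to read, and it keeps working if the column order changes.

```r
# Hypothetical name df; values copied from the table shown earlier
df <- data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Use a character vector of column names
by_name <- df[, c("Group", "DepressionScore")]
by_name
```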

Or using a logical vector (this one is a bit less useful, unless the column names have a specific structure to them). Here is an example using the built-in iris dataset, which contains measurements of flowers from three iris species:

##   Petal.Length Petal.Width
## 1          1.4         0.2
## 2          1.4         0.2
## 3          1.3         0.2
## 4          1.5         0.2
## 5          1.4         0.2
## 6          1.7         0.4
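One way to get the output above is to test the column names and use the result as the column index. This is a sketch: using startsWith() to build the logical vector is an assumption, since the original code isn't shown.

```r
# TRUE for columns whose name starts with "Petal", FALSE otherwise
is_petal <- startsWith(colnames(iris), "Petal")
is_petal

# Keep all rows, but only the columns flagged TRUE; head() shows the first 6 rows
petal_data <- iris[, is_petal]
head(petal_data)
```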

  1. A dataframe is basically just a list of vectors - everything in R is built out of the same basic pieces.