Section 4 Dataframes and More
Once you understand the basics of R’s data types, some of its more advanced features start to make sense. Below, we’ll cover a few of them. In particular, we’ll discuss dataframes, which are used to store and analyse multiple rows and columns of data bundled together in a table.
4.1 Factors (categorical data)
Factors are how R represents categorical data. They have a fixed set of levels, which are set up when you first create a factor vector:
severity = sample(c("Moderate", "Severe"), 10, replace=TRUE)
# Setting 'levels' also sets the order of the levels
sev_factor = factor(severity, levels = c("Moderate", "Severe"))
sev_factor
## [1] Moderate Moderate Severe Moderate Severe Severe Moderate
## [8] Severe Moderate Severe
## Levels: Moderate Severe
When you’re testing a factor, you use the label to test it:
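One comparison consistent with the output below is an equality test against the "Moderate" label. This is a minimal sketch: the factor is rebuilt from fixed values (copied from the printed output above) rather than a random sample, and the name is_moderate is only illustrative.

```r
# Fixed values standing in for the random sample used above
severity = c("Moderate", "Moderate", "Severe", "Moderate", "Severe",
             "Severe", "Moderate", "Severe", "Moderate", "Severe")
sev_factor = factor(severity, levels = c("Moderate", "Severe"))

# Compare against the label (a string), not the underlying integer code
is_moderate = sev_factor == "Moderate"
is_moderate
```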
## [1] TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE
Factors can only contain values that match their levels. If you try to assign anything else, R produces a warning and stores NA instead:
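The warning shown below mentions an assignment of "Mild" to element 1, so an assignment along these lines would trigger it. This is a sketch under that assumption, again using fixed values in place of the random sample.

```r
# Fixed values standing in for the random sample used above
severity = c("Moderate", "Moderate", "Severe", "Moderate", "Severe",
             "Severe", "Moderate", "Severe", "Moderate", "Severe")
sev_factor = factor(severity, levels = c("Moderate", "Severe"))

# "Mild" is not one of the levels, so R warns and stores NA instead
sev_factor[1] = "Mild"
sev_factor
```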
## Warning in `[<-.factor`(`*tmp*`, 1, value = "Mild"): invalid factor level,
## NA generated
## [1] <NA> Moderate Severe Moderate Severe Severe Moderate
## [8] Severe Moderate Severe
## Levels: Moderate Severe
4.2 Lists
Lists in R can be used to store multiple different types of data together. Unlike vectors, each element of a list can be a different type:
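A list matching the output below can be built with list(). The variable name mixed is an assumption made for illustration.

```r
# Three elements, three different types: numeric, logical, character
mixed = list(1, FALSE, "c")
mixed
```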
## [[1]]
## [1] 1
##
## [[2]]
## [1] FALSE
##
## [[3]]
## [1] "c"
One of the most useful features of lists is that you can give each element a name, and then access the data later using the name. This means you don’t have to remember things like “in these results, the 1st element is the mean and the 2nd is the variance”.
To access the named elements of a list, use a dollar sign $ or double square brackets [[]]:
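A minimal sketch of both access styles. The list name results and the element names mean and variance are hypothetical; the values are taken from the output shown below.

```r
# Hypothetical names: 'results', 'mean' and 'variance' are illustrative only
results = list(mean = 4.83, variance = 1.09)

results$mean           # access by name with $
results[["variance"]]  # access by name with [[ ]]
```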
## [1] 4.83
## [1] 1.09
4.3 Dataframes!
The most common format for working with data is a table, with data arranged in rows and columns. R’s main format for tables of data is the dataframe. In a dataframe:
- Each column is a vector (meaning each column can only contain one type of data: numeric, character, factor etc.) [3]
- Each column has the same length (the number of rows in the dataframe)
- Each column can be assigned a name, so you can access individual columns using df$column_name
Most of the time, you’ll read your data from a file (a spreadsheet, an SPSS file, etc.) and it will be read in as a dataframe. You can also create dataframes manually:
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)
outcomes
## Group Sex DepressionScore
## 1 Control Male 8
## 2 Treatment Female 4
## 3 Treatment Male 6
## 4 Control Female 5
4.3.1 Accessing parts of dataframes
Accessing a single column
To access a single column from a dataframe, you can use $, which will return a single vector:
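For the outcomes dataframe, pulling out the DepressionScore column might look like this. The dataframe is recreated so the block runs on its own, and the name scores is illustrative.

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# $ returns the column as a plain vector
scores = outcomes$DepressionScore
scores
```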
## [1] 8 4 6 5
Selecting rows and columns
Accessing specific parts of a dataframe, e.g. by filtering out rows based on a test, is similar to accessing parts of vectors. You use square brackets [] to index the dataframe, but you can specify both which rows you want and which columns. The basic syntax is df[rows_to_select, cols_to_select], with the row selector before the comma and the column selector after it.
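As a concrete instance of indexing by rows and columns together, here is a minimal sketch on the outcomes dataframe (recreated so the block runs on its own). The choice of rows and columns, and the name first_two, are arbitrary.

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Rows go before the comma, columns after it
first_two = outcomes[c(1, 2), c("Group", "Sex")]
first_two
```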
If you only care about the rows for the current task, you can leave the other part blank (and likewise for columns):
# Leave the columns part blank: keep all columns
df[rows_to_select, ]
# Leave the rows part blank: keep all rows
df[, cols_to_select]
There are multiple ways to select the rows, just like with vectors. You can use a vector of numbers to select by position:
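The output below shows rows 1 and 3, so a positional index such as c(1, 3) would produce it. A sketch under that assumption, with the illustrative name picked:

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Select the 1st and 3rd rows, keeping all columns
picked = outcomes[c(1, 3), ]
picked
```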
## Group Sex DepressionScore
## 1 Control Male 8
## 3 Treatment Male 6
You can also use a logical index: a vector of TRUE/FALSE values the same length as the number of rows in the dataframe. Most of the time, you’ll create that logical index by testing one or more columns in the dataframe:
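The output below contains only the Treatment rows, which is what a test on the Group column would give. A minimal sketch, assuming that was the test used; is_treatment and treated are illustrative names.

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# One TRUE/FALSE value per row
is_treatment = outcomes$Group == "Treatment"

# Keep only the rows where the test is TRUE
treated = outcomes[is_treatment, ]
treated
```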
## Group Sex DepressionScore
## 2 Treatment Female 4
## 3 Treatment Male 6
There are also multiple ways to select columns - by position:
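Group and DepressionScore are the 1st and 3rd columns of outcomes, so selecting by position could be sketched as follows (by_position is an illustrative name):

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Leave the rows part blank, select columns 1 and 3
by_position = outcomes[, c(1, 3)]
by_position
```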
## Group DepressionScore
## 1 Control 8
## 2 Treatment 4
## 3 Treatment 6
## 4 Control 5
By name:
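The same two columns can be requested with a character vector of column names. A sketch, with the illustrative name by_name:

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

# Leave the rows part blank, select columns by name
by_name = outcomes[, c("Group", "DepressionScore")]
by_name
```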
## Group DepressionScore
## 1 Control 8
## 2 Treatment 4
## 3 Treatment 6
## 4 Control 5
Or using a logical vector (this one is a bit less useful, unless the column names have a specific structure to them). An example using the built-in iris dataset, which contains measurements of different flower species:
# Find all columns that contain "Petal"
petal_columns = stringr::str_detect(colnames(iris), "Petal")
head(iris[, petal_columns])
## Petal.Length Petal.Width
## 1 1.4 0.2
## 2 1.4 0.2
## 3 1.3 0.2
## 4 1.5 0.2
## 5 1.4 0.2
## 6 1.7 0.4
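The example above relies on the stringr package being installed. If it is not available, base R's grepl() should build an equivalent logical vector; this is offered as an alternative sketch, not as the approach used in the original example.

```r
# Base R alternative: TRUE for every column name containing "Petal"
petal_columns = grepl("Petal", colnames(iris))

# Keep only the matching columns and show the first few rows
petal_data = head(iris[, petal_columns])
petal_data
```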
[3] A dataframe is basically just a list of vectors: everything in R is built out of the same basic pieces.
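The "list of vectors" description can be checked directly with a few base R calls. A short sketch using the outcomes dataframe, recreated here so the block runs on its own:

```r
# Same outcomes dataframe as above, recreated so this runs on its own
outcomes = data.frame(
  Group = c("Control", "Treatment", "Treatment", "Control"),
  Sex = c("Male", "Female", "Male", "Female"),
  DepressionScore = c(8, 4, 6, 5)
)

is.list(outcomes)   # a dataframe is stored as a list
length(outcomes)    # one list element per column
outcomes[["Sex"]]   # list-style access returns the column vector
```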