8.5 Slicing dataframes
Once you have a dataset stored as a matrix or dataframe in R, you’ll want to start accessing specific parts of the data based on some criteria. For example, if your dataset contains the result of an experiment comparing different experimental groups, you’ll want to calculate statistics for each experimental group separately. The process of selecting specific rows and columns of data based on some criteria is commonly known as slicing.
8.5.1 Slicing with
Just like vectors, you can access specific data in dataframes using brackets. But now, instead of just using one indexing vector, we use two indexing vectors: one for the rows and one for the columns. To do this, use the notation
data[rows, columns], where
columns are vectors of integers.
# Return row 1 df[1, ] # Return column 5 df[, 5] # Rows 1:5 and column 2 df[1:5, 2]
Let’s try indexing the
ToothGrowth dataframe. Again, the
ToothGrowth dataframe represents the results of a study testing the effectiveness of different types of supplements on the length of guinea pig’s teeth. First, let’s look at the entries in rows 1 through 5, and column 1:
# Give me the rows 1-6 and column 1 of ToothGrowth ToothGrowth[1:6, 1] ##  4.2 11.5 7.3 5.8 6.4 10.0
Because the first column is
len, the primary dependent measure, this means that the tooth lengths in the first 6 observations are 4.2, 11.5, 7.3, 5.8, 6.4, 10.
Of course, you can index matrices and dataframes with longer vectors to get more data. Now, let’s look at the first 3 rows of columns 1 and 3:
# Give me rows 1-3 and columns 1 and 3 of ToothGrowth ToothGrowth[1:3, c(1,3)] ## len dose ## 1 4.2 0.5 ## 2 11.5 0.5 ## 3 7.3 0.5
If you want to look at an entire row or an entire column of a matrix or dataframe, you can leave the corresponding index blank. For example, to see the entire 1st row of the
ToothGrowth dataframe, we can set the row index to 1, and leave the column index blank:
# Give me the 1st row (and all columns) of ToothGrowth ToothGrowth[1, ] ## len supp dose len.cm index ## 1 4.2 VC 0.5 0.42 1
Similarly, to get the entire 2nd column, set the column index to 2 and leave the row index blank:
# Give me the 2nd column (and all rows) of ToothGrowth ToothGrowth[, 2] ##  VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC VC ##  VC VC VC VC VC VC VC OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ ##  OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ OJ ## Levels: OJ VC
Many, if not all, of the analyses you will be doing will be on subsets of data, rather than entire datasets. For example, if you have data from an experiment, you may wish to calculate the mean of participants in one group separately from another. To do this, we’ll use subsetting – selecting subsets of data based on some criteria. To do this, we can use one of two methods: indexing with logical vectors, or the
subset() function. We’ll start with logical indexing first.
8.5.2 Slicing with logical vectors
Indexing dataframes with logical vectors is almost identical to indexing single vectors. First, we create a logical vector containing only TRUE and FALSE values. Next, we index a dataframe (typically the rows) using the logical vector to return only values for which the logical vector is TRUE.
For example, to create a new dataframe called
ToothGrowth.VC containing only data from the guinea pigs who were given the VC supplement, we’d run the following code:
# Create a new df with only the rows of ToothGrowth # where supp equals VC ToothGrowth.VC <- ToothGrowth[ToothGrowth$supp == "VC", ]
Of course, just like we did with vectors, we can make logical vectors based on multiple criteria – and then index a dataframe based on those criteria. For example, let’s create a dataframe called
ToothGrowth.OJ.a that contains data from the guinea pigs who were given an OJ supplement with a dose less than 1.0:
# Create a new df with only the rows of ToothGrowth # where supp equals OJ and dose < 1 ToothGrowth.OJ.a <- ToothGrowth[ToothGrowth$supp == "OJ" & ToothGrowth$dose < 1, ]
Indexing with brackets is the standard way to slice and dice dataframes. However, the code can get a bit messy. A more elegant method is to use the
8.5.3 Slicing with
subset() function is one of the most useful data management functions in R. It allows you to slice and dice datasets just like you would with brackets, but the code is much easier to write: Table 8.5 shows the main arguments to the
||A dataframe you want to subset|
||A logical vector indicating the rows to keep|
||The columns you want to keep|
Let’s use the
subset() function to create a new, subsetted dataset from the
ToothGrowth dataframe containing data from guinea pigs who had a tooth length less than 20cm (
len < 20), given the OJ supplement (
supp == "OJ"), and with a dose greater than or equal to 1 (
dose >= 1):
# Get rows of ToothGrowth where len < 20 AND supp == "OJ" AND dose >= 1 subset(x = ToothGrowth, subset = len < 20 & supp == "OJ" & dose >= 1) ## len supp dose len.cm index ## 41 20 OJ 1 2.0 41 ## 49 14 OJ 1 1.4 49
As you can see, there were only two cases that satisfied all 3 of our selection criteria.
In the example above, I didn’t specify an input to the
select argument because I wanted all columns. However, if you just want certain columns, you can just name the columns you want in the
# Get rows of ToothGrowth where len > 30 AND supp == "VC", but only return the len and dose columns subset(x = ToothGrowth, subset = len > 30 & supp == "VC", select = c(len, dose)) ## len dose ## 23 34 2 ## 26 32 2