Chapter 10 Reordering to match datasets

10.1 Reordering data to match

In the previous lesson, we learned how to determine whether the same data is present in two datasets, in addition to, whether it is in the same order. In this lesson, we will explore how to reorder the data such that the datasets are matching.


Exercise

Now that we know how to reorder using indices, let’s try to use it to reorder the contents of one vector to match the contents of another. Let’s create the vectors first and second as detailed below:

matching4

first <- c("A","B","C","D","E")
second <- c("B","D","E","A","C")  # same letters but different order

How would you reorder the second vector to match first?


If we had large datasets, it would be difficult to reorder them by searching for the indices of the matching elements, and it would be quite easy to make a typo or mistake. To help with matching datasets, there is a function called match().

10.2 The match function

We can use the match() function to match the values in two vectors. We’ll be using it to evaluate which values are present in both vectors, and how to reorder the elements to make the values match.

match() takes 2 arguments. The first argument is a vector of values in the order you want, while the second argument is the vector of values to be reordered such that it will match the first:

  1. a vector of values in the order you want
  2. a vector of values to be reordered

The function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let’s use match() on the first and second vectors we created.

match(first,second)
[1] 4 1 5 2 3

The output is the indices for how to reorder the second vector to match the first. These indices match the indices that we derived manually before.

Now, we can just use the indices to reorder the elements of the second vector to be in the same positions as the matching elements in the first vector:

# Saving indices for how to reorder `second` to match `first`
reorder_idx <- match(first,second) 

Then, we can use those indices to reorder the second vector similar to how we ordered with the manually derived indices.

# Reordering the second vector to match the order of the first vector
second[reorder_idx]

If the output looks good, we can save the reordered vector to a new variable.

# Reordering and saving the output to a variable
second_reordered <- second[reorder_idx]  

matching7

Now that we know how match() works, let’s change vector second so that only a subset are retained:

first <- c("A","B","C","D","E")
second <- c("D","B","A")  # remove values

matching5

And try to match() again:

match(first,second)

[1]  3  2 NA  1 NA

We see that the match() function takes every element in the first vector and finds the position of that element in the second vector, and if that element is not present, will return a missing value of NA. The value NA represents missing data for any data type within R. In this case, we can see that the match() function output represents the value at position 3 as first, which is A, then position 2 is next, which is B, the value coming next is supposed to be C, but it is not present in the second vector, so NA is returned, so on and so forth.

NOTE: For values that don’t match by default return an NA value. You can specify what values you would have it assigned using nomatch argument. Also, if there is more than one matching value found only the first is reported.

If we rearrange second using these indices, then we should see that all the values present in both vectors are in the same positions and NAs are present for any missing values.

second[match(first, second)]

Reordering genomic data using match() function

While the input to the match() function is always going to be to vectors, often we need to use these vectors to reorder the rows or columns of a data frame to match the rows or columns of another dataframe. Let’s explore how to do this with our use case featuring RNA-seq data. To perform differential gene expression analysis, we have a data frame with the expression data or counts for every sample and another data frame with the information about to which condition each sample belongs. For the tools doing the analysis, the samples in the counts data, which are the column names, need to be the same and in the same order as the samples in the metadata data frame, which are the rownames.

We can take a look at these samples in each dataset by using the rownames() and colnames() functions.

# Check row names of the metadata
rownames(metadata)

# Check the column names of the counts data
colnames(rpkm_data)

We see the row names of the metadata are in a nice order starting at sample1 and ending at sample12, while the column names of the counts data look to be the same samples, but are randomly ordered. Therefore, we want to reorder the columns of the counts data to match the order of the row names of the metadata. To do so, we will use the match() function to match the row names of our metadata with the column names of our counts data, so these will be the arguments for match.

To do so, we will use the match function to match the row names of our metadata with the column names of our counts data, so these will be the arguments for match().

Within the match() function, the rownames of the metadata is the vector in the order that we want, so this will be the first argument, while the column names of the count or rpkm data is the vector to be reordered. We will save these indices for how to reorder the column names of the count data such that it matches the rownames of the metadata to a variable called genomic idx.

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))
genomic_idx

The genomic_idx represents how to re-order the column names in our counts data to be identical to the row names in metadata.

Now we can create a new counts data frame in which the columns are re-ordered based on the match() indices. Remember that to reorder the rows or columns in a data frame we give the name of the data frame followed by square brackets, and then the indices for how to reorder the rows or columns.

Our genomic_idx represents how we would need to reorder the columns of our count data such that the column names would be in the same order as the row names of our metadata. Therefore, we need to add our genomic_idx to the columns position. We are going to save the output of the reordering to a new data frame called rpkm_ordered.

# Reorder the counts data frame to have the sample names in the same order as the metadata data frame
rpkm_ordered  <- rpkm_data[ , genomic_idx]

Check and see what happened by clicking on the rpkm_ordered in the Environment window or using the View() function.

# View the reordered counts
View(rpkm_ordered)

We can see the sample names are now in a nice order from sample 1 to 12, just like the metadata.

You can also verify that column names of this new data matrix matches the metadata row names by using the all function:

all(rownames(metadata) == colnames(rpkm_ordered))

Now that our samples are ordered the same in our metadata and counts data, if these were raw counts (not RPKM) we could proceed to perform differential expression analysis with this dataset.


Exercises

  1. After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.

  2. Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.