6 Reading, tidying & joining data

This session will focus on reading information into R and ‘tidying’ it into a form that makes manipulation and plotting as easy as possible.
The tidyr functions within the eponymous tidyverse are designed to help users to convert data into a ‘tidy format’. There are three simple principals for tidy data:

each column represents a single measurement type
each row represents a single observation
each cell contains a single value

The context for this session is that your colleague hears that you have successfully completed the first sessions of the WEHI tidyverse R course 😊, and asks you to plot some data for their upcoming lab talk. They send you results from a differential gene expression experiment, and want you to plot the significant genes according to their Gene Ontology (GO) terms.

They send you four files:

entrez_logFC.xlsx, a results table containing the entrez gene ID, log2 fold-change (logFC) and adjusted P value for each significant gene (Microsoft Excel format)
Hs_entrez_annot.csv, the gene names and descriptions for each entrez gene ID (comma-separated values)
GO_entrez.txt, the entrez IDs corresponding to different GO terms (space-separated values); and,
GO_term_defin.tsv, a table of descriptions for each GO term (tab-separated values)

To access these files, download and unzip this folder into your Desktop WEHI_tidyR_course folder.

Feel free to open the text files (.csv ; .txt ; .tsv) in TextEdit or another editor to see the data, but make sure not to make any changes.

Unfortunately not all of the files come in a tidy form, so we will need to improve them according to the tidy data principals using the functions below, and then join them together to make the final plot.

First let’s create a new .R text file, save as ‘Week_3_tidyverse.R’ in the WEHI_tidyR_course folder, and load the tidyverse package

library(tidyverse)

Now we can read the data files into R.

6.1 Reading in data

6.1.1 read_excel()

Broadly speaking there are two ways to read an excel file into R.

You can click ‘Import Dataset’ in the Environment tab (top right pane) >> ‘From Excel…’ >> ‘Browse…’ >> navigate to your WEHI_tidyR_course folder and select the file ‘entrez_logFC.xlsx’.
In the ‘Options’ section check ‘First row as names’, then click ‘Import’.

The excel table is now imported into a variable called ‘entrez_logFC’ and you will see that code appears in the console similar to this:

library(readxl)
entrez_logFC <- read_excel("~/Desktop/WEHI_tidyR_course/data_files/entrez_logFC.xlsx")

You can copy and paste this code into your R text file so that next time you run your code, the table is imported without you having to go through the point-and-click route.

The alternative way is to directly type and run the code as it appears above.
Importantly, although the readxl library is part of the tidyverse package, it is not active unless we load it directly, using library(readxl).

Your colleague has also provide text files in the form of csv (comma-separated values), tsv (tab-separated values), and txt (space-separated values), which require slightly different read functions.

6.1.2 read_csv()

The read_csv() function from tidyr is used when you select ‘Import Dataset’ >> ‘From text (readr)…’ and set the ‘Import Options’ Delimiter to ‘comma’.
Alternatively you can directly call read_csv() as follows, and assign the output into a new variable ‘Hs_entrez_annot’

Hs_entrez_annot <- read_csv("~/Desktop/WEHI_tidyR_course/data_files/Hs_entrez_annot.csv")

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   entrez_id = col_double(),
##   annot_type = col_character(),
##   annot_text = col_character()
## )

This message indicates the names and type of data that has been assigned for each column. col_double() denotes a type of numeric data.

6.1.3 read_delim()

For the white space-separated file “GO_entrez.txt” (which is far from tidy), it is best to use a more flexible read function called read_delim(). This function allows us to directly specify the delimiter (i.e., the character used to separate the columns). In this case we will use delim=' ' to indicate that the columns are separated by a white space, and assign the output into a new variable ‘GO_entrez’

GO_entrez <- read_delim('~/Desktop/WEHI_tidyR_course/data_files/GO_entrez.txt',
                        delim = ' ')

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   rowname = col_double(),
##   go_entrez = col_character()
## )

We can see that the GO_entrez variable contains two columns, ‘rowname’ and ‘go_entrez’.

6.1.4 read_tsv()

Lastly we will read the “GO_term_defin.tsv” which is tab-separated (aka ‘tab-delimited’), using read_tsv(). The data will be assigned to a new variable ‘GO_terms’

GO_terms <- read_tsv('~/Desktop/WEHI_tidyR_course/data_files/GO_term_defin.tsv')

## 
## ── Column specification ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   GOID = col_character(),
##   DEFINITION = col_character(),
##   BP = col_character(),
##   CC = col_character(),
##   MF = col_character()
## )

We see that the new variable contains a dataframe with five columns, all of which are character data.
Note that this tsv file could also be read in using read_delim(), specifying the tab delimiter as delim = '\t'. The .csv file could be read using the same function, with delim = ','.

6.2 Reshaping data

Let’s have a look at the newly created dataframes. Starting with entrez_logFC

entrez_logFC %>% head()

## # A tibble: 6 x 3
##   entrez_id logFC adj_Pval
##       <dbl> <dbl>    <dbl>
## 1       722  3.54  0.00113
## 2       725  5.62  0.0295 
## 3      1378  3.46  0.0315 
## 4      2302  3.37  0.0486 
## 5      5777  3.93  0.0204 
## 6     55061  2.76  0.0225

This dataframe conforms to the tidy data principals. Each row represents an ‘observation’ (a different entrez gene ID), and each column represents a different measurement type (log fold-change in RNA abundance, and adjusted p. value). Further, each cell contains a single value.
This is what we are aiming for with the other data sets.

Now to check Hs_entrez_annot

Hs_entrez_annot %>% head()

## # A tibble: 6 x 3
##   entrez_id annot_type annot_text                               
##       <dbl> <chr>      <chr>                                    
## 1        21 symbol     ABCA3                                    
## 2        21 gene_name  ATP binding cassette subfamily A member 3
## 3        90 symbol     ACVR1                                    
## 4        90 gene_name  activin A receptor type 1                
## 5        91 symbol     ACVR1B                                   
## 6        91 gene_name  activin A receptor type 1B

This is not tidy, because each observation (the entrez_id) is duplicated into 2 rows. The annotation type (‘annot_type’: symbol or gene name) should ideally be individual columns, containing the corresponding annotation_text (‘annot_text’).

6.2.1 pivot_wider()

To modify this table so that it conforms to the tidy principals, we need to ‘reshape’ it. We need a function that can convert the labels in the ‘annot_type’ column into two new column names, and assort the data in ‘annot_text’ according to their respective new column names.

The function for this job is pivot_wider()

thinkr.fr/wp-content/uploads/long_wide.png

pivot_wider() requires two commands, both of which are column names in the original table.

names_from indicates the column containing the labels which will be come the new column names.
values from indicates the column containing the values that will populate the the new columns.
To transform the illustrative table above at left (df_long) into the result at right, would require:
df_long %>% pivot_wider(names_from = V1, values_from = V2).
Note that the ‘shape’ of the dataframe is being converted from longer (more rows) to wider (more columns).

In our case, to reshape Hs_entrez_annot, we will use the column names annot_type, and annot_text:

Hs_entrez_annot %>% pivot_wider(names_from = annot_type, values_from = annot_text)

## # A tibble: 153 x 3
##    entrez_id symbol gene_name                                    
##        <dbl> <chr>  <chr>                                        
##  1        21 ABCA3  ATP binding cassette subfamily A member 3    
##  2        90 ACVR1  activin A receptor type 1                    
##  3        91 ACVR1B activin A receptor type 1B                   
##  4       156 ADRBK1 adrenergic, beta, receptor kinase 1          
##  5       207 AKT1   v-akt murine thymoma viral oncogene homolog 1
##  6       566 AZU1   azurocidin 1                                 
##  7       596 BCL2   B-cell CLL/lymphoma 2                        
##  8       655 BMP7   bone morphogenetic protein 7                 
##  9       722 C4BPA  complement component 4 binding protein alpha 
## 10       725 C4BPB  complement component 4 binding protein beta  
## # … with 143 more rows

Whereas the original table has 306 rows, the output of pivot_wider() has only 153 rows. This is because the rows are de-duplicated as the new columns are created.
Given that the output is in ‘tidy’ format, we can assign it to a new variable ‘Hs_entz_annot_tidy’

Hs_entz_annot_tidy <- Hs_entrez_annot %>%
  pivot_wider(names_from = annot_type, 
              values_from = annot_text)

Note that because we only have two labels in the annot_type column, we are replacing the existing two columns with only two new columns. As such the shape of the output technically isn’t any wider than the input dataframe. However when there are more than two unique labels in the names_from column, the output will be wider than the input.

6.2.2 pivot_longer()

The function that complements pivot_wider() is of course pivot_longer(). This function does not create tidy data, because it duplicates rows. However the ‘long format’ output from pivot_longer() is often required for ggplot, where each aesthetic or facet category must be a single column of values; and for left_join(), introduced below.

thinkr.fr/wp-content/uploads/wide_long.png

pivot_longer() takes three commands, specifying

a vector of the names of the columns to convert to labels in long form cols =,
a name for the new column containing the labels from 1: names_to =,
a name for the new column containing the values corresponding to 1: values_to =

Note that pivot_wider(), requires the new column names in quotes.

So for the figure above, to convert from the left table (df_wide) to the long table at right, would require: df_wide %>% pivot_longer(cols = c(X2,X3,X4), names_to = 'V1', values_to = 'V2')

Let’s see how this can be applied to the GO_terms dataframe

GO_terms

## # A tibble: 12 x 5
##    GOID    DEFINITION                                                    BP                                    CC           MF                
##    <chr>   <chr>                                                         <chr>                                 <chr>        <chr>             
##  1 GO:190… Any process that stops, prevents or reduces the frequency, r… negative regulation of vesicle trans… <NA>         <NA>              
##  2 GO:004… The directed movement of a vesicle along a microtubule, medi… vesicle transport along microtubule   <NA>         <NA>              
##  3 GO:001… The phosphorylation of peptidyl-threonine to form peptidyl-O… peptidyl-threonine phosphorylation    <NA>         <NA>              
##  4 GO:000… Any process that stops, prevents, or reduces the frequency, … negative regulation of humoral immun… <NA>         <NA>              
##  5 GO:006… The series of molecular signals initiated by binding of Wnt … Wnt signaling pathway involved in ki… <NA>         <NA>              
##  6 GO:000… The chemical reactions and pathways involving ubiquinone, a … ubiquinone metabolic process          <NA>         <NA>              
##  7 GO:009… The lipid bilayer surrounding a lamellar body. A lamellar bo… <NA>                                  lamellar bo… <NA>              
##  8 GO:001… The cell cycle process in which genetic material, in the for… female meiosis chromosome segregation <NA>         <NA>              
##  9 GO:003… Catalysis of the transfer of an amino group from kynurenine … <NA>                                  <NA>         kynurenine aminot…
## 10 GO:190… Any process that activates or increases the frequency, rate … positive regulation of protein impor… <NA>         <NA>              
## 11 GO:001… The chemical reactions and pathways resulting in the breakdo… fucose catabolic process              <NA>         <NA>              
## 12 GO:004… Catalysis of the reaction: RS-CH2-CH(NH3+)COO- = RSH + NH3 +… <NA>                                  <NA>         cysteine-S-conjug…

The BP, CC and MF columns relate to ‘Biological Process’, ‘Cellular Component’ and ‘Molecular Function’ ontologies within the Gene Ontology framework.

We can see there are many NA values in these columns, as each GOID relates to a single type of ontology. That is, if there is text in BP, then CC and MF will be NA values.
If we consider the text in this table as individual ‘values’ then the table is technically tidy, but the information is sparsely populated throughout.

To eliminate the NA values we first use pivot_longer(), with the aim of converting to long format those columns containing ontology descriptions: BP,CC, and MF.
The new column for labels will be ‘ONTOLOGY’, and the new column containing the associated text will be ‘DESCRIPTION’

GO_terms %>% 
  pivot_longer(cols = c(BP,CC,MF),  
               names_to = 'ONTOLOGY',
               values_to = 'DESCRIPTION')

## # A tibble: 36 x 4
##    GOID     DEFINITION                                                                ONTOLOGY DESCRIPTION                                    
##    <chr>    <chr>                                                                     <chr>    <chr>                                          
##  1 GO:1901… Any process that stops, prevents or reduces the frequency, rate or exten… BP       negative regulation of vesicle transport along…
##  2 GO:1901… Any process that stops, prevents or reduces the frequency, rate or exten… CC       <NA>                                           
##  3 GO:1901… Any process that stops, prevents or reduces the frequency, rate or exten… MF       <NA>                                           
##  4 GO:0047… The directed movement of a vesicle along a microtubule, mediated by moto… BP       vesicle transport along microtubule            
##  5 GO:0047… The directed movement of a vesicle along a microtubule, mediated by moto… CC       <NA>                                           
##  6 GO:0047… The directed movement of a vesicle along a microtubule, mediated by moto… MF       <NA>                                           
##  7 GO:0018… The phosphorylation of peptidyl-threonine to form peptidyl-O-phospho-L-t… BP       peptidyl-threonine phosphorylation             
##  8 GO:0018… The phosphorylation of peptidyl-threonine to form peptidyl-O-phospho-L-t… CC       <NA>                                           
##  9 GO:0018… The phosphorylation of peptidyl-threonine to form peptidyl-O-phospho-L-t… MF       <NA>                                           
## 10 GO:0002… Any process that stops, prevents, or reduces the frequency, rate, or ext… BP       negative regulation of humoral immune response…
## # … with 26 more rows

Now the data is in long form, with the GOID and DEFINITION columns duplicated. All of the NA values are now in a single column (DESCRIPTION). We can easily filter out these NA values, and assign the new dataframe to a variable ‘GO_terms_long’

GO_terms_long <- GO_terms %>% 
  pivot_longer(cols      =  c(BP, CC, MF),  
               names_to  = 'ONTOLOGY',
               values_to = 'DESCRIPTION') %>% 
  filter(!is.na(DESCRIPTION))

Importantly, in cases where there are 10s of column names to convert to long format, it is simpler to use cols = to specify those columns that we don’t want to convert, which are usually the left-most columns.
This is done with the - symbol we previously used with select(), and for negative vector sub-setting.
To achieve the same result as above using this approach:

GO_terms_long <- GO_terms %>% 
  pivot_longer(cols      =  -c(GOID, DEFINITION),  
               names_to  = 'ONTOLOGY',
               values_to = 'DESCRIPTION') %>% 
  filter(!is.na(DESCRIPTION))

cols can also take the other helpers used with select(), such as starts_with(), contains() etc.

6.3 Separating and uniting columns

So far we have reshaped data where the input conforms to the third principal of ‘tidiness’: each cell contains a single value. When this is not the case, we may need to split values into several columns, using separate(); or combine values into a single column using unite().

GO_entrez is certainly the messiest of the datasets we have been provided. Let’s take look:

GO_entrez %>% head()

## # A tibble: 6 x 2
##   rowname go_entrez                                                                                                                           
##     <dbl> <chr>                                                                                                                               
## 1       1 GO_0002924_722,725,1378,2302,5777,55061                                                                                             
## 2       2 GO_0006743_3156,4704,10229,23590,27235,29914,51004,51117,51805,56997,57017,57107,84274                                              
## 3       3 GO_0016321_4292,5347,728637                                                                                                         
## 4       4 GO_0018107_90,91,156,207,566,596,655,790,801,805,808,817,983,1020,1111,1116,1195,1859,1950,2011,2039,2475,2641,2932,3093,4092,4294,…
## 5       5 GO_0019317_2523,2524,2526,2527,2528,2529,2530,10690,84750                                                                           
## 6       6 GO_0036137_883,2806,51166,56267

The dataframe has two columns: ‘rownames’ which is a meaningless sequence of integers, and ‘go_entrez’ which appears to contain GO IDs interspersed with underscores, and a long string of comma-separated entrez IDs.

We will need to make several changes to produce a long-format table of GO terms and corresponding entrez IDs in consecutive rows.

First, we should separate the go_entrez column into three, containing the ‘GO’ prefix, the code, and the string of entrez IDs.

6.3.1 separate()

The tidyr separate() function takes a column name as the first command, and separates it into a number of new columns (a vector of names of our choosing, in quotes), according to a particular character delimiter, the ‘separator’ or ‘sep’.

The sep = command has the same role as delim = in read_delim() above.

Let’s split the go_entrez column into three new columns named ‘prefix’,‘code_only’ and ‘entrez_multi’, according to the underscore _ character:

GO_entrez %>% 
  separate(go_entrez, 
           into = c('prefix','code_only','entrez_multi'),
           sep = '_')

## # A tibble: 11 x 4
##    rowname prefix code_only entrez_multi                                                                                                      
##      <dbl> <chr>  <chr>     <chr>                                                                                                             
##  1       1 GO     0002924   722,725,1378,2302,5777,55061                                                                                      
##  2       2 GO     0006743   3156,4704,10229,23590,27235,29914,51004,51117,51805,56997,57017,57107,84274                                       
##  3       3 GO     0016321   4292,5347,728637                                                                                                  
##  4       4 GO     0018107   90,91,156,207,566,596,655,790,801,805,808,817,983,1020,1111,1116,1195,1859,1950,2011,2039,2475,2641,2932,3093,409…
##  5       5 GO     0019317   2523,2524,2526,2527,2528,2529,2530,10690,84750                                                                    
##  6       6 GO     0036137   883,2806,51166,56267                                                                                              
##  7       7 GO     0047496   1176,1201,1780,2647,3064,3799,5048,5590,5861,6845,8120,8546,8943,9001,9371,10125,10239,10947,11127,22920,23160,23…
##  8       8 GO     0061289   2625,55366                                                                                                        
##  9       9 GO     0097232   21,27074                                                                                                          
## 10      10 GO     1901609   254263                                                                                                            
## 11      11 GO     1903638   9141

Note that the code_only column contains only integers but remains encoded as ‘character data’.

Its possible to allow R to guess the data types for newly created columns by including convert = TRUE at the end of the above command.

Here however we want to preserve the leading 0s in code_only which would be dropped if this column was converted to the numeric data type.

Let’s store the output of separate() as a new variable named ‘GO_entrez_sep’

GO_entrez_sep <- GO_entrez %>% 
  separate(go_entrez, 
           into = c('prefix','code_only','entrez_multi'),
           sep = '_')

Next we have to deal with the long string of entrez IDs in the ‘entrez_multi’ column.
This is particularly tricky as there are uneven numbers of entrez IDs associated with each GO term. This column violates the tidy data principle of one value per cell. Using a separate() function will run in to problems when columns are highly uneven, and the maximum number of values is unknown.

6.3.2 separate_rows()

Hopefully you will rarely come across a dataset this messy, but when you do, separate_rows() will be a lifesaver. A previous version of this book introduced str_split() and unnest() which were very involved, and have been superseded by separate_rows() to the relief of R users and instructors alike.

separate_rows() creates a new row for each value in an ‘untidy’ column like entrez_multi. Each individual entrez ID will be inserted into its own row, and the data in other columns will be duplicated in each new row. In this way, separate_rows() essentially performs a pivot_longer() function in order to tidy all of the previously combined, and unusable data.

This function requires the column to be separated and converted into long-format, as well as the value separator (comma, in this case).

GO_entrez_sep %>% separate_rows(entrez_multi , sep=',')

## # A tibble: 155 x 4
##    rowname prefix code_only entrez_multi
##      <dbl> <chr>  <chr>     <chr>       
##  1       1 GO     0002924   722         
##  2       1 GO     0002924   725         
##  3       1 GO     0002924   1378        
##  4       1 GO     0002924   2302        
##  5       1 GO     0002924   5777        
##  6       1 GO     0002924   55061       
##  7       2 GO     0006743   3156        
##  8       2 GO     0006743   4704        
##  9       2 GO     0006743   10229       
## 10       2 GO     0006743   23590       
## # … with 145 more rows

We can see that e.g. the six genes (entrez IDs) associated with the GO code 0002924 each now occupy their own row. The data is tidy, in long format.

6.3.3 rename()

Given that the entrez_multi column no longer contains multiple entrez IDs per row, we can rename it using the rename() function. It simply takes the form ‘new column name’ = ‘existing name’:

GO_entrez_sep %>% 
  separate_rows(entrez_multi, sep = ",") %>% 
  rename(entrez_id = entrez_multi)

## # A tibble: 155 x 4
##    rowname prefix code_only entrez_id
##      <dbl> <chr>  <chr>     <chr>    
##  1       1 GO     0002924   722      
##  2       1 GO     0002924   725      
##  3       1 GO     0002924   1378     
##  4       1 GO     0002924   2302     
##  5       1 GO     0002924   5777     
##  6       1 GO     0002924   55061    
##  7       2 GO     0006743   3156     
##  8       2 GO     0006743   4704     
##  9       2 GO     0006743   10229    
## 10       2 GO     0006743   23590    
## # … with 145 more rows

We can write the results of this chain of commands into a new variable ‘GO_entrez_sep_long’

GO_entrez_sep_long <- GO_entrez_sep %>% 
  separate_rows(entrez_multi,sep=",") %>% 
  rename(entrez_id = entrez_multi)

6.3.4 unite()

The final step is to combine the GO prefix (‘GO’) and the code_only columns to produce a GO ID in the same form as that in the GOID column of GO_terms_long (created above).

To check the desired output:

GO_terms_long %>% select(GOID)

## # A tibble: 12 x 1
##    GOID      
##    <chr>     
##  1 GO:1901609
##  2 GO:0047496
##  3 GO:0018107
##  4 GO:0002924
##  5 GO:0061289
##  6 GO:0006743
##  7 GO:0097232
##  8 GO:0016321
##  9 GO:0036137
## 10 GO:1903638
## 11 GO:0019317
## 12 GO:0047804

The unite() function is the complement of separate(), and requires two arguments: a name for the new column that will contain the united values, and a vector of the names of the columns to unite.
We can optionally provide a separator (sep =) to insert between the combined values. If we don’t add a command here the values will be separated by underscores _.
In this case we want to insert a colon : to reproduce the desired GOID

GO_entrez_sep_long %>% unite(GOID, c(prefix, code_only), sep=":")

## # A tibble: 155 x 3
##    rowname GOID       entrez_id
##      <dbl> <chr>      <chr>    
##  1       1 GO:0002924 722      
##  2       1 GO:0002924 725      
##  3       1 GO:0002924 1378     
##  4       1 GO:0002924 2302     
##  5       1 GO:0002924 5777     
##  6       1 GO:0002924 55061    
##  7       2 GO:0006743 3156     
##  8       2 GO:0006743 4704     
##  9       2 GO:0006743 10229    
## 10       2 GO:0006743 23590    
## # … with 145 more rows

Now we can drop the meaningless rowname column, store the result as “GO_entrez_long”, and we’re finally done with GO_entrez!

GO_entrez_long <- GO_entrez_sep_long %>% 
  unite(GOID, c(prefix, code_only), sep=":") %>% 
  select(-rowname)

6.4 Removing variables

In the process of tidying up the data we created several variables in intermediates states of completeness. We can now remove these variables together with the original untidy input data, using the rm() function.
Happily because all of our commands are saved in a text file (and backed up!!) we can reproduce these variables later on if required.

rm(Hs_entrez_annot)

rm(GO_terms)

rm(GO_entrez)
rm(GO_entrez_sep)
rm(GO_entrez_sep_long)

6.5 Joining dataframes

Now that we have converted the three untidy input datasets into tidy format (Hs_entz_annot_tidy), or long format (GO_entrez_long, GO_terms_long) we can begin joining them together into a single dataframe from which to plot the fold-change per gene, coloured by GO term.

6.5.1 left_join()

Joining two or more different datasets can be a very tricky task in standard spread-sheeting programs, but is vastly simplified in the tidyverse.

The main requirement is to have a ‘joint key’, that is, a column in both dataframes that contains at least one identical value.
left_join() is the workhorse of the tidy R joining functions, so called because the first dataframe that is specified (in code from left to right) appears in the left-most columns of the resulting dataframe.

In the image below, the dataframe ‘b’ is being joined to ‘a’ using left_join(), and the ‘joint key’ is x1.
Note that the x1 column contains two identical values (A and B), but C is unique to ‘a’ and D is unique to ‘b’.
The output of left_join(), will contain the entire left-side dataframe (‘a’), and the additional columns from ‘b’ populated with values for all rows where the joint key is matching. NA will appear in cells for which there is no match in ‘b’.

The other important thing about left_join() is that any rows where the joint key is duplicated in the right-side dataframe, will also be duplicated in the left-side. For example, if ‘b’ contained the rows
‘B’ ‘F’
‘B’ ‘T’

then the result would include the rows
‘B’ ‘2’ ‘F’
‘B’ ‘2’ ‘T’

Our first task is to join the the entrez_logFC results table with the gene annotations in Hs_entz_annot_tidy. We will use the joint key ‘entrez_id’ (outlined in black).

Consistent with the previous sections, we will use the pipe %>% to ‘send’ the left-side dataframe (entrez_logFC) into a left_join() where we specify the right-side dataframe. We give the column name of the joint key, in quotes, in the by = command.

entrez_logFC %>% left_join(Hs_entz_annot_tidy, by = 'entrez_id')

## # A tibble: 19 x 5
##    entrez_id  logFC adj_Pval symbol gene_name                                                
##        <dbl>  <dbl>    <dbl> <chr>  <chr>                                                    
##  1       722  3.54   0.00113 C4BPA  complement component 4 binding protein alpha             
##  2       725  5.62   0.0295  C4BPB  complement component 4 binding protein beta              
##  3      1378  3.46   0.0315  CR1    complement component 3b/4b receptor 1 (Knops blood group)
##  4      2302  3.37   0.0486  FOXJ1  forkhead box J1                                          
##  5      5777  3.93   0.0204  PTPN6  protein tyrosine phosphatase, non-receptor type 6        
##  6     55061  2.76   0.0225  SUSD4  sushi domain containing 4                                
##  7      2523 -1.53   0.00091 FUT1   fucosyltransferase 1 (H blood group)                     
##  8      2524 -1.38   0.0434  FUT2   fucosyltransferase 2                                     
##  9      2526 -1.46   0.0268  FUT4   fucosyltransferase 4                                     
## 10      2527 -0.930  0.00605 FUT5   fucosyltransferase 5                                     
## 11      2528 -2.91   0.0494  FUT6   fucosyltransferase 6                                     
## 12      2529 -0.767  0.0185  FUT7   fucosyltransferase 7 (alpha (1,3) fucosyltransferase)    
## 13      2530 -0.705  0.0213  FUT8   fucosyltransferase 8 (alpha (1,6) fucosyltransferase)    
## 14     10690 -1.32   0.0396  FUT9   fucosyltransferase 9 (alpha (1,3) fucosyltransferase)    
## 15     84750 -0.756  0.00960 FUT10  fucosyltransferase 10 (alpha (1,3) fucosyltransferase)   
## 16       883  9.95   0.0483  CCBL1  cysteine conjugate-beta lyase, cytoplasmic               
## 17      2806  4.26   0.0179  GOT2   glutamic-oxaloacetic transaminase 2                      
## 18     51166  4.1    0.0417  AADAT  aminoadipate aminotransferase                            
## 19     56267  3.73   0.0406  CCBL2  cysteine conjugate-beta lyase 2

The result contains the symbols and gene_names columns from Hs_entz_annot_tidy, appended to the entire entrez_logFC dataframe.

We will store this result in a new variable ‘entrez_logFC_annot’

entrez_logFC_annot <- entrez_logFC %>% left_join(Hs_entz_annot_tidy, by = 'entrez_id')

Let’s try the second of three left_join()s required to complete the data set. We will use the same code to join entrez_logFC_annot and GO_entrez_long, also using ‘entrez_id’ as the joint key.

entrez_logFC_annot %>% left_join(GO_entrez_long, by = 'entrez_id')

## Error: Can't join on `x$entrez_id` x `y$entrez_id` because of incompatible types.
## ℹ `x$entrez_id` is of type <double>>.
## ℹ `y$entrez_id` is of type <character>>.

Okay that didn’t work 😬
It looks like somewhere along the line the entrez_id in GO_entrez_long was converted from numeric data into character data.
The joint key has to contain at least one identical value, down to the data type. To remedy this we will use mutate() to convert the entrez_id column in GO_entrez_long into numeric data, and overwrite the variable.

GO_entrez_long <- GO_entrez_long %>% mutate(entrez_id = as.numeric(entrez_id))

! Always be careful when overwriting variables. If you make a mistake it can produce incorrect results downstream.

Now let’s try the left_join() using the identical data type for the joint key.

entrez_logFC_annot %>% left_join(GO_entrez_long, by = 'entrez_id')

## # A tibble: 19 x 6
##    entrez_id  logFC adj_Pval symbol gene_name                                                 GOID      
##        <dbl>  <dbl>    <dbl> <chr>  <chr>                                                     <chr>     
##  1       722  3.54   0.00113 C4BPA  complement component 4 binding protein alpha              GO:0002924
##  2       725  5.62   0.0295  C4BPB  complement component 4 binding protein beta               GO:0002924
##  3      1378  3.46   0.0315  CR1    complement component 3b/4b receptor 1 (Knops blood group) GO:0002924
##  4      2302  3.37   0.0486  FOXJ1  forkhead box J1                                           GO:0002924
##  5      5777  3.93   0.0204  PTPN6  protein tyrosine phosphatase, non-receptor type 6         GO:0002924
##  6     55061  2.76   0.0225  SUSD4  sushi domain containing 4                                 GO:0002924
##  7      2523 -1.53   0.00091 FUT1   fucosyltransferase 1 (H blood group)                      GO:0019317
##  8      2524 -1.38   0.0434  FUT2   fucosyltransferase 2                                      GO:0019317
##  9      2526 -1.46   0.0268  FUT4   fucosyltransferase 4                                      GO:0019317
## 10      2527 -0.930  0.00605 FUT5   fucosyltransferase 5                                      GO:0019317
## 11      2528 -2.91   0.0494  FUT6   fucosyltransferase 6                                      GO:0019317
## 12      2529 -0.767  0.0185  FUT7   fucosyltransferase 7 (alpha (1,3) fucosyltransferase)     GO:0019317
## 13      2530 -0.705  0.0213  FUT8   fucosyltransferase 8 (alpha (1,6) fucosyltransferase)     GO:0019317
## 14     10690 -1.32   0.0396  FUT9   fucosyltransferase 9 (alpha (1,3) fucosyltransferase)     GO:0019317
## 15     84750 -0.756  0.00960 FUT10  fucosyltransferase 10 (alpha (1,3) fucosyltransferase)    GO:0019317
## 16       883  9.95   0.0483  CCBL1  cysteine conjugate-beta lyase, cytoplasmic                GO:0036137
## 17      2806  4.26   0.0179  GOT2   glutamic-oxaloacetic transaminase 2                       GO:0036137
## 18     51166  4.1    0.0417  AADAT  aminoadipate aminotransferase                             GO:0036137
## 19     56267  3.73   0.0406  CCBL2  cysteine conjugate-beta lyase 2                           GO:0036137

Now we have a GO ID column, the gene annotations and the original entrez_logFC data.

Let’s store the result of left_join() as a variable ‘entrez_logFC_GO’

entrez_logFC_GO <- entrez_logFC_annot %>% left_join(GO_entrez_long, by = 'entrez_id')

Finally, we will create a variable ‘complete_table’ by joining in the GO term descriptions in GO_terms_long. This time, the joint key will be the ‘GOID’ column

complete_table <- entrez_logFC_GO %>% left_join(GO_terms_long, by = 'GOID')

6.6 Plotting challenge

With the input data tidied and joined we have all of the information required to make a nice plot illustrating the change in gene expression for genes belonging to different Gene Ontolgies.
Your colleague has sketched out an idea for a plot. Can you now produce the plot below using complete_table and ggplot?

6.7 Solution

complete_table %>% 
  ggplot(aes(x = logFC, y = symbol)) +
  geom_col(aes(fill = DESCRIPTION), show.legend = FALSE) +
  geom_vline(xintercept = 0, lty=2) +
  facet_wrap(~DESCRIPTION, scales='free_y', ncol=1) +
  xlim(-10,10)

This plot can be refined further by controlling factors in the input data frame, as covered in the factors chapter.

6.8 Summary

Once you have finished the ☕ your grateful colleague has bought you, consider the skills you have acquired so far. You can make a wide variety of plots from datasets large and small; subset, transform and summarize data with dplyr, and tidy and join uncooperative datasets.
Together these skills allow you to answer interesting questions about your data more quickly, and if necessary, reproduce hours of work instantly and exactly.

The final piece of the puzzle is to be able to automate the drudge work of cleaning data and running the same analysis on multiple data sets — the subject of Week 4.