Chapter 17 Codebook for IntroR and RStudio

69 points total

17.1 01 Setting up R and RStudio

Let’s start by going to the environment panel in RStudio and clearing the environment. This is a good habit to get into when starting a new project, as it ensures that you are starting with a clean slate.

17.2 02 Introduction to R and RStudio

.md file = 02-introR-R-and-RStudio.md

17.2.1 Running commands and annotating code:

Add some additional code to the following chunk:

## Intro to R Lesson

3 + 5
## [1] 8

The chunk will run inline and output inline, but I prefer to set the chunk so it will run in the console, because it gives us extra room, especially for plotting. You can set this using the cog wheel at the top of the script window next to the Knit button.

17.2.2 The assignment operator (<-)

The assignment operator assigns values on the right to variables on the left.

In RStudio the keyboard shortcut for the assignment operator <- is Alt + - (Windows) or Option + - (Mac).

x <- 3
y <- 5
x + y
## [1] 8
number <- x + y

number
## [1] 8

Exercise points +2

  1. Try changing the value of the variable x to 5. What happens to number?
# Your code here
  • Ans:
  1. Now try changing the value of variable y to contain the value 10. What do you need to do, to update the variable number?
# Your code here
  • Ans:

17.3 03 R syntax and data structures

.md file = 03-introR-syntax-and-data-structures.md

17.3.1 Vectors

# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths
## [1]     4.6  3000.0 50000.0
# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species
## [1] "ecoli" "human" "corn"

Exercise points +1

Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this. Print the combined vector in the console, what looks different compared to the original vectors?

# Your code here
  • Ans:

17.3.2 Factors

Create a character vector and store the vector as a variable called ‘expression’

expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expression
## [1] "low"    "high"   "medium" "high"   "low"    "medium" "high"

Turn ‘expression’ vector into a factor

expression <- factor(expression)
expression
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
length(expression)
## [1] 7
levels(expression)
## [1] "high"   "low"    "medium"

Exercises points +2

Let’s say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.

  1. Create a vector named samplegroup with nine elements: 3 control (“CTL”) values, 3 knock-out (“KO”) values, and 3 over-expressing (“OE”) values.
# Your code here
  1. Turn samplegroup into a factor data structure.
# Your code here

17.3.3 Dataframes

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors). A dataframe can be created using the data.frame() function.

# Create a data frame and store it as a variable called 'df'
df <- data.frame(species, glengths)
df
##   species glengths
## 1   ecoli      4.6
## 2   human   3000.0
## 3    corn  50000.0

Exercise points +1

Create a data frame called favorite_books with the following vectors as columns:

titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)
# Your code here

17.4 04 Functions in R

.md file = 04-introR-functions-and-arguments.md

17.4.1 Basic functions

We have already used a few examples of basic functions in the previous lessons i.e c(), and factor().

Many of the base functions in R involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let’s try finding the square root of 81:

sqrt(81)
## [1] 9
sqrt(glengths)
## [1]   2.144761  54.772256 223.606798

Round function:

round(3.14159)
## [1] 3
round(3.14159, digits = 2)
## [1] 3.14

You can also round in the opposite direction, for example, to the nearest 100 by setting the digits argument to -2:

round(367.14159, digits = -2)
## [1] 400

17.4.2 Seeking help on functions

`?`(round)
args(round)
## function (x, digits = 0, ...) 
## NULL
example("round")
## 
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2  0  0  2  2  4  4
## [1] -2  0  0  2  2  4  4
## 
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round> 
## round> ( x1 <- seq(-2, 4, by = .5) )
##  [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0
## 
## round> round(x1) #-- IEEE / IEC rounding !
##  [1] -2 -2 -1  0  0  0  1  2  2  2  3  4  4
## 
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
## 
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5  0.5  2.5
## 
## round> (non.int <- ceiling(x1) != floor(x1))
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE
## [13] FALSE
## 
## round> x2 <- pi * 100^(-1:3)
## 
## round> round(x2, 3)
## [1]       0.031       3.142     314.159   31415.927 3141592.654
## 
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06

Exercise points +3

  1. Let’s use base R function to calculate mean value of the glengths vector. You might need to search online to find what function can perform this task.
# Your code here
  1. Create a new vector test <- c(1, NA, 2, 3, NA, 4). Use the same base R function from exercise 1 (with addition of proper argument), and calculate the mean value of the test vector. The output should be 2.5.

NOTE: In R, missing values are represented by the symbol NA (not available).
It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it.
There are ways to ignore NA during statistical calculation, or to remove NA from the vector.

test <- c(1, NA, 2, 3, NA, 4)
# Your code here
  1. Another commonly used base function is sort(). Use this function to sort the glengths vector in descending order.
# Your code here

17.4.3 User-defined Functions

Let’s try creating a simple example function. This function will take in a numeric value as input, and return the squared value.

square_it <- function(x) {
    square <- x * x
    return(square)
}

Once you run the code, you should see a function named square_it in the Environment panel (located at the top right of Rstudio interface). Now, we can use this function as any other base R functions. We type out the name of the function, and inside the parentheses we provide a numeric value x:

square_it(5)
## [1] 25

Pretty simple, right? In this case, we only had one line of code that was run, but in theory you could have many lines of code to get obtain the final results that you want to “return” to the user. We have only scratched the surface here when it comes to creating functions! If you are interested you can also find more detailed information on writing functions R-bloggers site.


Exercise points +2

  1. Write a function called multiply_it, which takes two inputs: a numeric value x, and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x=4, y=6) will return output 24.

Function:

# Your code here

Demo:

# Your code here

17.5 05 Packages and libraries

.md file = 05-introR_packages.md

17.5.1 Packages and Libraries

You can check what libraries (packages) are loaded in your current R session by typing into the console:

sessionInfo()  #Print version information about R, the OS and attached or loaded packages
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] bookdown_0.40   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
##  [5] dplyr_1.1.4     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
##  [9] tidyverse_2.0.0 gplots_3.1.3.1  purrr_1.0.2     ggplot2_3.5.1  
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9         utf8_1.2.4         generics_0.1.3     bitops_1.0-8      
##  [5] KernSmooth_2.23-24 gtools_3.9.5       stringi_1.8.4      hms_1.1.3         
##  [9] digest_0.6.37      magrittr_2.0.3     caTools_1.18.3     timechange_0.3.0  
## [13] evaluate_0.24.0    grid_4.4.1         fastmap_1.2.0      jsonlite_1.8.8    
## [17] tinytex_0.53       formatR_1.14       fansi_1.0.6        scales_1.3.0      
## [21] jquerylib_0.1.4    cli_3.6.3          crayon_1.5.3       rlang_1.1.4       
## [25] bit64_4.0.5        munsell_0.5.1      withr_3.0.1        cachem_1.1.0      
## [29] yaml_2.3.10        parallel_4.4.1     tools_4.4.1        tzdb_0.4.0        
## [33] colorspace_2.1-1   vctrs_0.6.5        R6_2.5.1           lifecycle_1.0.4   
## [37] bit_4.0.5          vroom_1.6.5        archive_1.1.9      pkgconfig_2.0.3   
## [41] pillar_1.9.0       bslib_0.8.0        gtable_0.3.5       rsconnect_1.3.1   
## [45] glue_1.7.0         xfun_0.47          tidyselect_1.2.1   highr_0.11        
## [49] rstudioapi_0.16.0  knitr_1.48         farver_2.1.2       htmltools_0.5.8.1 
## [53] rmarkdown_2.28     labeling_0.4.3     compiler_4.4.1
# OR

search()  #Gives a list of attached packages
##  [1] ".GlobalEnv"        "package:bookdown"  "package:lubridate"
##  [4] "package:forcats"   "package:stringr"   "package:dplyr"    
##  [7] "package:readr"     "package:tidyr"     "package:tibble"   
## [10] "package:tidyverse" "package:gplots"    "package:purrr"    
## [13] "package:ggplot2"   "tools:rstudio"     "package:stats"    
## [16] "package:graphics"  "package:grDevices" "package:utils"    
## [19] "package:datasets"  "package:methods"   "Autoloads"        
## [22] "package:base"

17.5.2 Installing packages

17.5.2.1 Packages from CRAN:

Previously we have introduced you to installing packages. An example is given below for the ggplot2 package that will be required for some plots we will create later on. Run this code to install ggplot2.

Set eval = TRUE if you haven’t installed ggplot2 yet

install.packages("ggplot2")

Alternatively, packages can also be installed from Bioconductor, a repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data.

Set eval = TRUE if you haven’t installed BiocManager yet

install.packages("BiocManager")

17.5.2.2 Packages from Bioconductor:

Now you can use BiocManager::install to install packages available on Bioconductor. Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.

# DO NOT RUN THIS!
BiocManager::install("ggplot2")

17.5.3 Loading libraries/packages

library(ggplot2)

To see the functions in a package you can also type:

ls("package:ggplot2")
##   [1] "%+%"                        "%+replace%"                
##   [3] "aes"                        "aes_"                      
##   [5] "aes_all"                    "aes_auto"                  
##   [7] "aes_q"                      "aes_string"                
##   [9] "after_scale"                "after_stat"                
##  [11] "alpha"                      "annotate"                  
##  [13] "annotation_custom"          "annotation_logticks"       
##  [15] "annotation_map"             "annotation_raster"         
##  [17] "arrow"                      "as_label"                  
##  [19] "as_labeller"                "autolayer"                 
##  [21] "autoplot"                   "AxisSecondary"             
##  [23] "benchplot"                  "binned_scale"              
##  [25] "borders"                    "calc_element"              
##  [27] "check_device"               "combine_vars"              
##  [29] "continuous_scale"           "Coord"                     
##  [31] "coord_cartesian"            "coord_equal"               
##  [33] "coord_fixed"                "coord_flip"                
##  [35] "coord_map"                  "coord_munch"               
##  [37] "coord_polar"                "coord_quickmap"            
##  [39] "coord_radial"               "coord_sf"                  
##  [41] "coord_trans"                "CoordCartesian"            
##  [43] "CoordFixed"                 "CoordFlip"                 
##  [45] "CoordMap"                   "CoordPolar"                
##  [47] "CoordQuickmap"              "CoordRadial"               
##  [49] "CoordSf"                    "CoordTrans"                
##  [51] "cut_interval"               "cut_number"                
##  [53] "cut_width"                  "datetime_scale"            
##  [55] "derive"                     "diamonds"                  
##  [57] "discrete_scale"             "draw_key_abline"           
##  [59] "draw_key_blank"             "draw_key_boxplot"          
##  [61] "draw_key_crossbar"          "draw_key_dotplot"          
##  [63] "draw_key_label"             "draw_key_linerange"        
##  [65] "draw_key_path"              "draw_key_point"            
##  [67] "draw_key_pointrange"        "draw_key_polygon"          
##  [69] "draw_key_rect"              "draw_key_smooth"           
##  [71] "draw_key_text"              "draw_key_timeseries"       
##  [73] "draw_key_vline"             "draw_key_vpath"            
##  [75] "dup_axis"                   "economics"                 
##  [77] "economics_long"             "el_def"                    
##  [79] "element_blank"              "element_grob"              
##  [81] "element_line"               "element_rect"              
##  [83] "element_render"             "element_text"              
##  [85] "enexpr"                     "enexprs"                   
##  [87] "enquo"                      "enquos"                    
##  [89] "ensym"                      "ensyms"                    
##  [91] "expand_limits"              "expand_scale"              
##  [93] "expansion"                  "expr"                      
##  [95] "Facet"                      "facet_grid"                
##  [97] "facet_null"                 "facet_wrap"                
##  [99] "FacetGrid"                  "FacetNull"                 
## [101] "FacetWrap"                  "faithfuld"                 
## [103] "fill_alpha"                 "find_panel"                
## [105] "flip_data"                  "flipped_names"             
## [107] "fortify"                    "Geom"                      
## [109] "geom_abline"                "geom_area"                 
## [111] "geom_bar"                   "geom_bin_2d"               
## [113] "geom_bin2d"                 "geom_blank"                
## [115] "geom_boxplot"               "geom_col"                  
## [117] "geom_contour"               "geom_contour_filled"       
## [119] "geom_count"                 "geom_crossbar"             
## [121] "geom_curve"                 "geom_density"              
## [123] "geom_density_2d"            "geom_density_2d_filled"    
## [125] "geom_density2d"             "geom_density2d_filled"     
## [127] "geom_dotplot"               "geom_errorbar"             
## [129] "geom_errorbarh"             "geom_freqpoly"             
## [131] "geom_function"              "geom_hex"                  
## [133] "geom_histogram"             "geom_hline"                
## [135] "geom_jitter"                "geom_label"                
## [137] "geom_line"                  "geom_linerange"            
## [139] "geom_map"                   "geom_path"                 
## [141] "geom_point"                 "geom_pointrange"           
## [143] "geom_polygon"               "geom_qq"                   
## [145] "geom_qq_line"               "geom_quantile"             
## [147] "geom_raster"                "geom_rect"                 
## [149] "geom_ribbon"                "geom_rug"                  
## [151] "geom_segment"               "geom_sf"                   
## [153] "geom_sf_label"              "geom_sf_text"              
## [155] "geom_smooth"                "geom_spoke"                
## [157] "geom_step"                  "geom_text"                 
## [159] "geom_tile"                  "geom_violin"               
## [161] "geom_vline"                 "GeomAbline"                
## [163] "GeomAnnotationMap"          "GeomArea"                  
## [165] "GeomBar"                    "GeomBlank"                 
## [167] "GeomBoxplot"                "GeomCol"                   
## [169] "GeomContour"                "GeomContourFilled"         
## [171] "GeomCrossbar"               "GeomCurve"                 
## [173] "GeomCustomAnn"              "GeomDensity"               
## [175] "GeomDensity2d"              "GeomDensity2dFilled"       
## [177] "GeomDotplot"                "GeomErrorbar"              
## [179] "GeomErrorbarh"              "GeomFunction"              
## [181] "GeomHex"                    "GeomHline"                 
## [183] "GeomLabel"                  "GeomLine"                  
## [185] "GeomLinerange"              "GeomLogticks"              
## [187] "GeomMap"                    "GeomPath"                  
## [189] "GeomPoint"                  "GeomPointrange"            
## [191] "GeomPolygon"                "GeomQuantile"              
## [193] "GeomRaster"                 "GeomRasterAnn"             
## [195] "GeomRect"                   "GeomRibbon"                
## [197] "GeomRug"                    "GeomSegment"               
## [199] "GeomSf"                     "GeomSmooth"                
## [201] "GeomSpoke"                  "GeomStep"                  
## [203] "GeomText"                   "GeomTile"                  
## [205] "GeomViolin"                 "GeomVline"                 
## [207] "get_alt_text"               "get_element_tree"          
## [209] "get_guide_data"             "gg_dep"                    
## [211] "ggplot"                     "ggplot_add"                
## [213] "ggplot_build"               "ggplot_gtable"             
## [215] "ggplotGrob"                 "ggproto"                   
## [217] "ggproto_parent"             "ggsave"                    
## [219] "ggtitle"                    "Guide"                     
## [221] "guide_axis"                 "guide_axis_logticks"       
## [223] "guide_axis_stack"           "guide_axis_theta"          
## [225] "guide_bins"                 "guide_colorbar"            
## [227] "guide_colorsteps"           "guide_colourbar"           
## [229] "guide_coloursteps"          "guide_custom"              
## [231] "guide_gengrob"              "guide_geom"                
## [233] "guide_legend"               "guide_merge"               
## [235] "guide_none"                 "guide_train"               
## [237] "guide_transform"            "GuideAxis"                 
## [239] "GuideAxisLogticks"          "GuideAxisStack"            
## [241] "GuideAxisTheta"             "GuideBins"                 
## [243] "GuideColourbar"             "GuideColoursteps"          
## [245] "GuideCustom"                "GuideLegend"               
## [247] "GuideNone"                  "GuideOld"                  
## [249] "guides"                     "has_flipped_aes"           
## [251] "is.Coord"                   "is.facet"                  
## [253] "is.ggplot"                  "is.ggproto"                
## [255] "is.theme"                   "label_both"                
## [257] "label_bquote"               "label_context"             
## [259] "label_parsed"               "label_value"               
## [261] "label_wrap_gen"             "labeller"                  
## [263] "labs"                       "last_plot"                 
## [265] "layer"                      "layer_data"                
## [267] "layer_grob"                 "layer_scales"              
## [269] "layer_sf"                   "Layout"                    
## [271] "lims"                       "luv_colours"               
## [273] "map_data"                   "margin"                    
## [275] "max_height"                 "max_width"                 
## [277] "mean_cl_boot"               "mean_cl_normal"            
## [279] "mean_sdl"                   "mean_se"                   
## [281] "median_hilow"               "merge_element"             
## [283] "midwest"                    "mpg"                       
## [285] "msleep"                     "new_guide"                 
## [287] "old_guide"                  "panel_cols"                
## [289] "panel_rows"                 "pattern_alpha"             
## [291] "Position"                   "position_dodge"            
## [293] "position_dodge2"            "position_fill"             
## [295] "position_identity"          "position_jitter"           
## [297] "position_jitterdodge"       "position_nudge"            
## [299] "position_stack"             "PositionDodge"             
## [301] "PositionDodge2"             "PositionFill"              
## [303] "PositionIdentity"           "PositionJitter"            
## [305] "PositionJitterdodge"        "PositionNudge"             
## [307] "PositionStack"              "presidential"              
## [309] "qplot"                      "quickplot"                 
## [311] "quo"                        "quo_name"                  
## [313] "quos"                       "register_theme_elements"   
## [315] "rel"                        "remove_missing"            
## [317] "render_axes"                "render_strips"             
## [319] "reset_theme_settings"       "resolution"                
## [321] "Scale"                      "scale_alpha"               
## [323] "scale_alpha_binned"         "scale_alpha_continuous"    
## [325] "scale_alpha_date"           "scale_alpha_datetime"      
## [327] "scale_alpha_discrete"       "scale_alpha_identity"      
## [329] "scale_alpha_manual"         "scale_alpha_ordinal"       
## [331] "scale_color_binned"         "scale_color_brewer"        
## [333] "scale_color_continuous"     "scale_color_date"          
## [335] "scale_color_datetime"       "scale_color_discrete"      
## [337] "scale_color_distiller"      "scale_color_fermenter"     
## [339] "scale_color_gradient"       "scale_color_gradient2"     
## [341] "scale_color_gradientn"      "scale_color_grey"          
## [343] "scale_color_hue"            "scale_color_identity"      
## [345] "scale_color_manual"         "scale_color_ordinal"       
## [347] "scale_color_steps"          "scale_color_steps2"        
## [349] "scale_color_stepsn"         "scale_color_viridis_b"     
## [351] "scale_color_viridis_c"      "scale_color_viridis_d"     
## [353] "scale_colour_binned"        "scale_colour_brewer"       
## [355] "scale_colour_continuous"    "scale_colour_date"         
## [357] "scale_colour_datetime"      "scale_colour_discrete"     
## [359] "scale_colour_distiller"     "scale_colour_fermenter"    
## [361] "scale_colour_gradient"      "scale_colour_gradient2"    
## [363] "scale_colour_gradientn"     "scale_colour_grey"         
## [365] "scale_colour_hue"           "scale_colour_identity"     
## [367] "scale_colour_manual"        "scale_colour_ordinal"      
## [369] "scale_colour_steps"         "scale_colour_steps2"       
## [371] "scale_colour_stepsn"        "scale_colour_viridis_b"    
## [373] "scale_colour_viridis_c"     "scale_colour_viridis_d"    
## [375] "scale_continuous_identity"  "scale_discrete_identity"   
## [377] "scale_discrete_manual"      "scale_fill_binned"         
## [379] "scale_fill_brewer"          "scale_fill_continuous"     
## [381] "scale_fill_date"            "scale_fill_datetime"       
## [383] "scale_fill_discrete"        "scale_fill_distiller"      
## [385] "scale_fill_fermenter"       "scale_fill_gradient"       
## [387] "scale_fill_gradient2"       "scale_fill_gradientn"      
## [389] "scale_fill_grey"            "scale_fill_hue"            
## [391] "scale_fill_identity"        "scale_fill_manual"         
## [393] "scale_fill_ordinal"         "scale_fill_steps"          
## [395] "scale_fill_steps2"          "scale_fill_stepsn"         
## [397] "scale_fill_viridis_b"       "scale_fill_viridis_c"      
## [399] "scale_fill_viridis_d"       "scale_linetype"            
## [401] "scale_linetype_binned"      "scale_linetype_continuous" 
## [403] "scale_linetype_discrete"    "scale_linetype_identity"   
## [405] "scale_linetype_manual"      "scale_linewidth"           
## [407] "scale_linewidth_binned"     "scale_linewidth_continuous"
## [409] "scale_linewidth_date"       "scale_linewidth_datetime"  
## [411] "scale_linewidth_discrete"   "scale_linewidth_identity"  
## [413] "scale_linewidth_manual"     "scale_linewidth_ordinal"   
## [415] "scale_radius"               "scale_shape"               
## [417] "scale_shape_binned"         "scale_shape_continuous"    
## [419] "scale_shape_discrete"       "scale_shape_identity"      
## [421] "scale_shape_manual"         "scale_shape_ordinal"       
## [423] "scale_size"                 "scale_size_area"           
## [425] "scale_size_binned"          "scale_size_binned_area"    
## [427] "scale_size_continuous"      "scale_size_date"           
## [429] "scale_size_datetime"        "scale_size_discrete"       
## [431] "scale_size_identity"        "scale_size_manual"         
## [433] "scale_size_ordinal"         "scale_type"                
## [435] "scale_x_binned"             "scale_x_continuous"        
## [437] "scale_x_date"               "scale_x_datetime"          
## [439] "scale_x_discrete"           "scale_x_log10"             
## [441] "scale_x_reverse"            "scale_x_sqrt"              
## [443] "scale_x_time"               "scale_y_binned"            
## [445] "scale_y_continuous"         "scale_y_date"              
## [447] "scale_y_datetime"           "scale_y_discrete"          
## [449] "scale_y_log10"              "scale_y_reverse"           
## [451] "scale_y_sqrt"               "scale_y_time"              
## [453] "ScaleBinned"                "ScaleBinnedPosition"       
## [455] "ScaleContinuous"            "ScaleContinuousDate"       
## [457] "ScaleContinuousDatetime"    "ScaleContinuousIdentity"   
## [459] "ScaleContinuousPosition"    "ScaleDiscrete"             
## [461] "ScaleDiscreteIdentity"      "ScaleDiscretePosition"     
## [463] "seals"                      "sec_axis"                  
## [465] "set_last_plot"              "sf_transform_xy"           
## [467] "should_stop"                "stage"                     
## [469] "standardise_aes_names"      "stat"                      
## [471] "Stat"                       "stat_align"                
## [473] "stat_bin"                   "stat_bin_2d"               
## [475] "stat_bin_hex"               "stat_bin2d"                
## [477] "stat_binhex"                "stat_boxplot"              
## [479] "stat_contour"               "stat_contour_filled"       
## [481] "stat_count"                 "stat_density"              
## [483] "stat_density_2d"            "stat_density_2d_filled"    
## [485] "stat_density2d"             "stat_density2d_filled"     
## [487] "stat_ecdf"                  "stat_ellipse"              
## [489] "stat_function"              "stat_identity"             
## [491] "stat_qq"                    "stat_qq_line"              
## [493] "stat_quantile"              "stat_sf"                   
## [495] "stat_sf_coordinates"        "stat_smooth"               
## [497] "stat_spoke"                 "stat_sum"                  
## [499] "stat_summary"               "stat_summary_2d"           
## [501] "stat_summary_bin"           "stat_summary_hex"          
## [503] "stat_summary2d"             "stat_unique"               
## [505] "stat_ydensity"              "StatAlign"                 
## [507] "StatBin"                    "StatBin2d"                 
## [509] "StatBindot"                 "StatBinhex"                
## [511] "StatBoxplot"                "StatContour"               
## [513] "StatContourFilled"          "StatCount"                 
## [515] "StatDensity"                "StatDensity2d"             
## [517] "StatDensity2dFilled"        "StatEcdf"                  
## [519] "StatEllipse"                "StatFunction"              
## [521] "StatIdentity"               "StatQq"                    
## [523] "StatQqLine"                 "StatQuantile"              
## [525] "StatSf"                     "StatSfCoordinates"         
## [527] "StatSmooth"                 "StatSum"                   
## [529] "StatSummary"                "StatSummary2d"             
## [531] "StatSummaryBin"             "StatSummaryHex"            
## [533] "StatUnique"                 "StatYdensity"              
## [535] "summarise_coord"            "summarise_layers"          
## [537] "summarise_layout"           "sym"                       
## [539] "syms"                       "theme"                     
## [541] "theme_bw"                   "theme_classic"             
## [543] "theme_dark"                 "theme_get"                 
## [545] "theme_gray"                 "theme_grey"                
## [547] "theme_light"                "theme_linedraw"            
## [549] "theme_minimal"              "theme_replace"             
## [551] "theme_set"                  "theme_test"                
## [553] "theme_update"               "theme_void"                
## [555] "transform_position"         "translate_shape_string"    
## [557] "txhousing"                  "unit"                      
## [559] "update_geom_defaults"       "update_labels"             
## [561] "update_stat_defaults"       "vars"                      
## [563] "waiver"                     "wrap_dims"                 
## [565] "xlab"                       "xlim"                      
## [567] "ylab"                       "ylim"                      
## [569] "zeroGrob"

Exercise points +1

The tidyverse suite of integrated packages was designed to work together to make common data science operations more user-friendly. We will be using the tidyverse suite in later lessons, and so let’s install it. Use the function from theBiocManager package.

Set eval=TRUE if tidyverse is not yet installed. How can you run the BiocManager::install() command for a package that is already installed without getting an error?

# Your code here
  • Ans:

17.6 06 Subsetting: vectors and factors

.md file = 06-introR-data wrangling.md

17.6.1 Vectors: Selecting using indices

age <- c(15, 22, 45, 52, 73, 81)
age
## [1] 15 22 45 52 73 81
age[5]
## [1] 73

Reverse selection:

age[-5]
## [1] 15 22 45 52 81
age[c(3, 5, 6)]  ## nested
## [1] 45 73 81
# OR

## create a vector first then select
idx <- c(3, 5, 6)  # create vector of the elements of interest
age[idx]
## [1] 45 73 81
age[1:4]
## [1] 15 22 45 52

Exercises points +3, +4 for bonus

  1. Create a vector called alphabets with the following letters, C, D, X, L, F.
# Your code here
  1. Use the associated indices along with [ ] to do the following:

    1. only display C, D and F
# Your code here
b) display all except X
# Your code here
c)  display the letters in the opposite order (F, L, X, D, C) *Bonus points for using a function instead of the indices (Hint: use Google)
# Your code here

17.6.2 Selecting using logical operators

age
## [1] 15 22 45 52 73 81
age > 50
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
age[age > 50]  ## nested
## [1] 52 73 81
age > 50 | age < 18  ## returns logical values for each element
## [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
age
## [1] 15 22 45 52 73 81
age[age > 50 | age < 18]  ## nested
## [1] 15 52 73 81
# OR

## create a vector first then select
idx <- age > 50 | age < 18
idx
## [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
age[idx]
## [1] 15 52 73 81

17.6.3 Logical operators with which()

which(age > 50 | age < 18)  ## returns only the indices of the TRUE values
## [1] 1 4 5 6
age[which(age > 50 | age < 18)]  ## nested
## [1] 15 52 73 81
# OR

## create a vector first then select
idx_num <- which(age > 50 | age < 18)
age[idx_num]
## [1] 15 52 73 81

17.6.4 Factors

expression <- c("low", "high", "medium", "high", "low", "medium", "high")

expression <- factor(expression)

expression
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
str(expression)
##  Factor w/ 3 levels "high","low","medium": 2 1 3 1 2 3 1
expression[expression == "high"]  ## This will only return those elements in the factor equal to 'high'
## [1] high high high
## Levels: high low medium
exp <- expression[expression == "high"]
exp
## [1] high high high
## Levels: high low medium

Notice here that exp is a factor with only one level, “high”, yet Levels still shows all three levels. This is because the factor is a categorical variable, and the levels are the categories. In this case, the only category is “high”. If you want to remove unused levels, you can use droplevels(), or since there is only one level, you can use as.character() to convert it to a character vector.

droplevels(exp)  # this will remove the unused levels
## [1] high high high
## Levels: high
as.character(exp)  # this will convert the factor to a character vector
## [1] "high" "high" "high"

Exercise points +1

Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).

samplegroup <- c(rep("CTL", 3), rep("KO", 3), rep("OE", 3))

samplegroup <- factor(samplegroup)

# Your code here

17.6.4.1 Releveling factors

expression  # the default of the factor function is to order the levels alphabetically
## [1] low    high   medium high   low    medium high  
## Levels: high low medium
expression <- factor(expression, levels = c("low", "medium", "high"))  # you can re-factor a factor 

str(expression)
##  Factor w/ 3 levels "low","medium",..: 1 3 2 3 1 2 3

Exercise points +1

Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.

samplegroup <- c(rep("CTL", 3), rep("KO", 3), rep("OE", 3))

samplegroup <- factor(samplegroup)

# Your code here

17.7 07 Reading and data inspection

.md file = 07-reading_and_data_inspection.md

17.7.1 Reading data into R:

`?`(read.csv)
### Reading data into R
metadata <- read.csv(file = "data/mouse_exp_design.csv")

We can see if it has successfully been read in by running:

metadata
##          genotype celltype replicate
## sample1        Wt    typeA         1
## sample2        Wt    typeA         2
## sample3        Wt    typeA         3
## sample4        KO    typeA         1
## sample5        KO    typeA         2
## sample6        KO    typeA         3
## sample7        Wt    typeB         1
## sample8        Wt    typeB         2
## sample9        Wt    typeB         3
## sample10       KO    typeB         1
## sample11       KO    typeB         2
## sample12       KO    typeB         3

Exercise 1 points +2

  1. Read “project-summary.txt” in to R using read.table() with the approriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
  • all the columns in the input text file have column name/headers

  • you want the first column of the text file to be used as row names

  • hint: look up the input for the row.names = argument in read.table()

# Your code here
  1. Display the contents of proj_summary in your console
# Your code here

17.7.2 Inspecting data structures:

Getting info on environmental variables:

Let’s use the metadata file that we created to test out data inspection functions.

head(metadata)  # shows the first 6 rows of the data frame
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3
str(metadata)  # shows the structure of the data frame
## 'data.frame':    12 obs. of  3 variables:
##  $ genotype : chr  "Wt" "Wt" "Wt" "KO" ...
##  $ celltype : chr  "typeA" "typeA" "typeA" "typeA" ...
##  $ replicate: int  1 2 3 1 2 3 1 2 3 1 ...
dim(metadata)
## [1] 12  3
nrow(metadata)  # shows the number of rows in the data frame
## [1] 12
ncol(metadata)
## [1] 3
class(metadata)
## [1] "data.frame"
names(metadata)
## [1] "genotype"  "celltype"  "replicate"

Exercise 2 points +2

  • What is the class of each column in metadata (use one command)?
# Your code here
  • What is the median of the replicates in metadata (use one command)?
# Your code here

Exercise 3 points +6

Use the class() function on glengths and metadata, how does the output differ between the two?

# Your code here
  • Ans:

Use the summary() function on the proj_summary dataframe, what is the median “Intronic_Rate?

# Your code here
  • Ans:

How long is the samplegroup factor?

# Your code here
  • Ans:

What are the dimensions of the proj_summary dataframe?

# Your code here
  • Ans:

When you use the rownames() function on metadata, what is the data structure of the output?

# Your code here
  • Ans:

How many elements in (how long is) the output of colnames(proj_summary)? Don’t count, but use another function to determine this.

# Your code here
  • Ans:

17.8 08 Data frames and matrices

**.md file = 08-introR-_data_wrangling2.md**

17.8.1 Data wrangling: dataframes

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from it we need to specify the “coordinates” we want from it. We use the same square bracket notation but rather than providing a single index, there are two indices required. Within the square bracket, row numbers come first followed by column numbers (and the two are separated by a comma).

Extracting values from data.frames:

# Extract value 'Wt'
metadata[1, 1]
## [1] "Wt"
# Extract third row
metadata[3, ]
##         genotype celltype replicate
## sample3       Wt    typeA         3
# Extract third column
metadata[, 3]
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
# Extract third column as a data frame
metadata[, 3, drop = FALSE]
##          replicate
## sample1          1
## sample2          2
## sample3          3
## sample4          1
## sample5          2
## sample6          3
## sample7          1
## sample8          2
## sample9          3
## sample10         1
## sample11         2
## sample12         3
# Dataframe containing first two columns
metadata[, 1:2]
##          genotype celltype
## sample1        Wt    typeA
## sample2        Wt    typeA
## sample3        Wt    typeA
## sample4        KO    typeA
## sample5        KO    typeA
## sample6        KO    typeA
## sample7        Wt    typeB
## sample8        Wt    typeB
## sample9        Wt    typeB
## sample10       KO    typeB
## sample11       KO    typeB
## sample12       KO    typeB
# Data frame containing first, third and sixth rows
metadata[c(1, 3, 6), ]
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample3       Wt    typeA         3
## sample6       KO    typeA         3
# Extract the celltype column for the first three samples
metadata[c("sample1", "sample2", "sample3"), "celltype"]
## [1] "typeA" "typeA" "typeA"
# Check column names of metadata data frame
colnames(metadata)
## [1] "genotype"  "celltype"  "replicate"
# names and colnames are the same for data frames

# Check row names of metadata data frame
rownames(metadata)
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6" 
##  [7] "sample7"  "sample8"  "sample9"  "sample10" "sample11" "sample12"
# Extract the genotype column
metadata$genotype
##  [1] "Wt" "Wt" "Wt" "KO" "KO" "KO" "Wt" "Wt" "Wt" "KO" "KO" "KO"
# Extract the first five values/elements of the genotype column using the $
# operator
metadata$genotype[1:5]
## [1] "Wt" "Wt" "Wt" "KO" "KO"

Exercises points +3

Return a data frame with only the genotype and replicate column values for sample2 and sample8.

# Your code here

Return the fourth and ninth values of the replicate column.

# Your code here

Extract the replicate column as a data frame.

# Your code here

17.8.1.1 Selecting using logical operators:

For example, if we want to return only those rows of the data frame with the celltype column having a value of typeA, we would perform two steps:

  1. Identify which rows in the celltype column have a value of typeA.

  2. Use those TRUE values to extract those rows from the data frame.

To do this we would extract the column of interest as a vector, with the first value corresponding to the first row, the second value corresponding to the second row, so on and so forth. We use that vector in the logical expression. Here we are looking for values to be equal to typeA,

So our logical expression would be:

names(metadata)
## [1] "genotype"  "celltype"  "replicate"
metadata$celltype
##  [1] "typeA" "typeA" "typeA" "typeA" "typeA" "typeA" "typeB" "typeB" "typeB"
## [10] "typeB" "typeB" "typeB"
table(metadata$celltype)
## 
## typeA typeB 
##     6     6
metadata$celltype == "typeA"
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
logical_idx <- metadata$celltype == "typeA"

If we want to keep all the rows that have celltype == “typeA”: Any row in the logical_idx that is TRUE will be kept, those that are FALSE will be discarded.

metadata[logical_idx, ]
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3
idx <- which(metadata$celltype == "typeA")
metadata[idx, ]
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3
# OR you can use nesting
metadata[which(metadata$celltype == "typeA"), ]
##         genotype celltype replicate
## sample1       Wt    typeA         1
## sample2       Wt    typeA         2
## sample3       Wt    typeA         3
## sample4       KO    typeA         1
## sample5       KO    typeA         2
## sample6       KO    typeA         3

Let’s try another subsetting. Extract the rows of the metadata data frame for only the replicates 2 and 3. First, let’s create the logical expression for the column of interest (replicate):

table(metadata$replicate)
## 
## 1 2 3 
## 4 4 4
idx <- which(metadata$replicate > 1)
idx
## [1]  2  3  5  6  8  9 11 12
metadata[idx, ]
##          genotype celltype replicate
## sample2        Wt    typeA         2
## sample3        Wt    typeA         3
## sample5        KO    typeA         2
## sample6        KO    typeA         3
## sample8        Wt    typeB         2
## sample9        Wt    typeB         3
## sample11       KO    typeB         2
## sample12       KO    typeB         3

This should return the indices for the rows in the replicate column within metadata that have a value of 2 or 3. Now, we can save those indices to a variable and use that variable to extract those corresponding rows from the metadata table. Alternatively, you could do this in one line:

sub_meta <- metadata[which(metadata$replicate > 1), ]

Exercise points +1

Subset the metadata dataframe to return only the rows of data with a genotype of KO.

# Your code here

17.9 09 Logical operators for matching

.md file = 09-identifying-matching-elements.md

17.9.1 Logical operators to match elements

rpkm_data <- read.csv("data/counts.rpkm.csv")  # note the pathname is relative to the working directory if run in console but not if run inline

# rpkm_data <- read.csv('../data/counts.rpkm.csv')
head(rpkm_data)  # this is our counts data
##                      sample2    sample5  sample7   sample8   sample9   sample4
## ENSMUSG00000000001 19.265000 23.7222000 2.611610 5.8495400 6.5126300 24.076700
## ENSMUSG00000000003  0.000000  0.0000000 0.000000 0.0000000 0.0000000  0.000000
## ENSMUSG00000000028  1.032290  0.8269540 1.134410 0.6987540 0.9251170  0.827891
## ENSMUSG00000000031  0.000000  0.0000000 0.000000 0.0298449 0.0597726  0.000000
## ENSMUSG00000000037  0.056033  0.0473238 0.000000 0.0685938 0.0494147  0.180883
## ENSMUSG00000000049  0.258134  1.0730200 0.252342 0.2970320 0.2082800  2.191720
##                       sample6   sample12   sample3   sample11  sample10
## ENSMUSG00000000001 20.8198000 26.9158000 20.889500 24.0465000 24.198100
## ENSMUSG00000000003  0.0000000  0.0000000  0.000000  0.0000000  0.000000
## ENSMUSG00000000028  1.1686300  0.6735630  0.892183  0.9753270  1.045920
## ENSMUSG00000000031  0.0511932  0.0204382  0.000000  0.0000000  0.000000
## ENSMUSG00000000037  0.1438840  0.0662324  0.146196  0.0206405  0.017004
## ENSMUSG00000000049  1.6853800  0.1161970  0.421286  0.0634322  0.369550
##                       sample1
## ENSMUSG00000000001 19.7848000
## ENSMUSG00000000003  0.0000000
## ENSMUSG00000000028  0.9377920
## ENSMUSG00000000031  0.0359631
## ENSMUSG00000000037  0.1514170
## ENSMUSG00000000049  0.2567330
colnames(rpkm_data)
##  [1] "sample2"  "sample5"  "sample7"  "sample8"  "sample9"  "sample4" 
##  [7] "sample6"  "sample12" "sample3"  "sample11" "sample10" "sample1"
rownames(metadata)  # this is our metadata describing the count data samples
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6" 
##  [7] "sample7"  "sample8"  "sample9"  "sample10" "sample11" "sample12"

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it’s hard to tell since they are not in the same order.

ncol(rpkm_data)
## [1] 12
nrow(metadata)
## [1] 12

17.9.2 The %in% operator

The %in% operator will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size. This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:

A <- c(1, 3, 5, 7, 9, 11)  # odd numbers
B <- c(2, 4, 6, 8, 10, 12)  # even numbers

# test to see if each of the elements of A is in B
A %in% B
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Let’s change a couple of numbers inside vector B to match vector A:

A <- c(1, 3, 5, 7, 9, 11)  # odd numbers
B <- c(2, 4, 6, 8, 1, 5)  # add some odd numbers in 

# test to see if each of the elements of A is in B
A %in% B
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
intersection <- A %in% B
intersection
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
A[intersection]
## [1] 1 5
any(A %in% B)
## [1] TRUE
all(A %in% B)
## [1] FALSE

Exercise points +2

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A
# Your code here
  1. Subset the B vector to only return those values that are also in A.
# Your code here

Suppose we had two vectors containing same values. How can we check if those values are in the same order in each vector? In this case, we can use == operator to compare each element of the same position from two vectors. Unlike %in% operator, == operator requires that two vectors are of equal length.

A <- c(10, 20, 30, 40, 50)
B <- c(50, 40, 30, 20, 10)  # same numbers but backwards 

# test to see if each element of A is in B
A %in% B
## [1] TRUE TRUE TRUE TRUE TRUE
# test to see if each element of A is in the same position in B
A == B
## [1] FALSE FALSE  TRUE FALSE FALSE
# use all() to check if they are a perfect match
all(A == B)
## [1] FALSE

Let’s try this on our genomic data, and see whether we have metadata information for all samples in our expression data.

x <- rownames(metadata)
y <- colnames(rpkm_data)
all(x %in% y)
## [1] TRUE

Now we can replace x and y by their real values to get the same answer:

all(rownames(metadata) %in% colnames(rpkm_data))
## [1] TRUE
x == y
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
all(x == y)
## [1] FALSE

Looks like all of the samples are there, but they need to be reordered. To reorder our genomic samples, we will learn different ways to reorder data in our next lesson.

Exercise points +3

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.

First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

  1. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.
# Your code here
  1. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.
# Your code here
  1. Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.
# Your code here

17.10 10 Reordering to match datasets

.md file = 10-reordering-to-match-datasets.md

Exercise points +1

We know how to reorder using indices, let’s try to use it to reorder the contents of one vector to match the contents of another. Let’s create the vectors first and second as detailed below:

first <- c("A", "B", "C", "D", "E")
second <- c("B", "D", "E", "A", "C")  # same letters but different order

first
## [1] "A" "B" "C" "D" "E"
second
## [1] "B" "D" "E" "A" "C"

How would you reorder the second vector to match first?

# Your code here

17.10.1 The match function

match() takes 2 arguments. The first argument is a vector of values in the order you want, while the second argument is the vector of values to be reordered such that it will match the first:

1st vector: values in the order you want 2nd vector: values to be reordered

The match function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let’s use match() on the first and second vectors we created.

# Saving indices for how to reorder `second` to match `first`
reorder_idx <- match(first, second)
reorder_idx
## [1] 4 1 5 2 3
# Reordering the second vector to match the order of the first vector
second[reorder_idx]
## [1] "A" "B" "C" "D" "E"
# Reordering and saving the output to a variable
second_reordered <- second[reorder_idx]
second_reordered == first
## [1] TRUE TRUE TRUE TRUE TRUE

Matching with vectors of different lengths:

first <- c("A", "B", "C", "D", "E")
second <- c("D", "B", "A")  # remove values
match(first, second)
## [1]  3  2 NA  1 NA
second[match(first, second)]
## [1] "A" "B" NA  "D" NA

NOTE: For values that don’t match by default return an NA value. You can specify what values you would have it assigned using nomatch argument.

Example:

mtch <- match(first, second, nomatch = 0)
mtch
## [1] 3 2 0 1 0
second[mtch]
## [1] "A" "B" "D"
# Check row names of the metadata
rownames(metadata)
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6" 
##  [7] "sample7"  "sample8"  "sample9"  "sample10" "sample11" "sample12"
# Check the column names of the counts data
colnames(rpkm_data)
##  [1] "sample2"  "sample5"  "sample7"  "sample8"  "sample9"  "sample4" 
##  [7] "sample6"  "sample12" "sample3"  "sample11" "sample10" "sample1"

Reordering genomic data using the match() function:

# Use the match function to reorder the counts data frame to have the sample
# names in the same order as the metadata data frame, so rownames(metadata)
# comes first, order is important
genomic_idx <- match(rownames(metadata), colnames(rpkm_data))

genomic_idx
##  [1] 12  1  9  6  2  7  3  4  5 11 10  8
# Reorder the counts data frame to have the sample names in the same order as
# the metadata data frame
rpkm_ordered <- rpkm_data[, genomic_idx]
# View the reordered counts View(rpkm_ordered)

# I prefer to use `head` -- and set the output to a limited number of columns
# so that it fits in the console

head(rpkm_ordered[, 1:5])
##                       sample1   sample2   sample3   sample4    sample5
## ENSMUSG00000000001 19.7848000 19.265000 20.889500 24.076700 23.7222000
## ENSMUSG00000000003  0.0000000  0.000000  0.000000  0.000000  0.0000000
## ENSMUSG00000000028  0.9377920  1.032290  0.892183  0.827891  0.8269540
## ENSMUSG00000000031  0.0359631  0.000000  0.000000  0.000000  0.0000000
## ENSMUSG00000000037  0.1514170  0.056033  0.146196  0.180883  0.0473238
## ENSMUSG00000000049  0.2567330  0.258134  0.421286  2.191720  1.0730200

Now we can check if the rownames of metadata is in the same order as the colnames of rpkm_ordered:

all(rownames(metadata) == colnames(rpkm_ordered))
## [1] TRUE
# important to check that the order is correct!!!

Exercises points +2

After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.

# Your code here

Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.

# Your code here

17.11 11 Plotting and data visualization

.md file = 11-setting_up_to_plot.md10-reordering-to-match-datasets.md

17.11.1 Setup a data frame for visualization

mean(rpkm_ordered$sample1)
## [1] 10.2661

But what we want is a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?

To get the mean of all the samples in a single line of code the map() family of function is a good option.

17.11.2 The map family of functions

We can use map functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a “double” or numeric vector.
  • map_chr() creates a character vector.

The syntax for the map() family of functions is:

  • map(object, function_to_apply)

We can use the map_dbl() to get the mean of each column of rpkm_ordered in just one command:

# install purrr if you need to by uncommenting the following line:
# BiocManager::install(purrr)
library(purrr)  # Load the purrr

samplemeans <- map_dbl(rpkm_ordered, mean)

The output of map_dbl() is a named vector of length 12.

17.11.3 Adding to a metadata object

Before we add samplemeans as a new column to metadata, let’s create a vector with the ages of each of the mice in our data set.

# Create a numeric vector with ages. Note that there are 12 elements here

age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)

Now, we are ready to combine the metadata data frame with the 2 new vectors to create a new data frame with 5 columns:

# Add the new vectors as the last columns to the metadata

new_metadata <- data.frame(metadata, samplemeans, age_in_days)

# Take a look at the new_metadata object
head(new_metadata)
##         genotype celltype replicate samplemeans age_in_days
## sample1       Wt    typeA         1   10.266102          40
## sample2       Wt    typeA         2   10.849759          32
## sample3       Wt    typeA         3    9.452517          38
## sample4       KO    typeA         1   15.833872          35
## sample5       KO    typeA         2   15.590184          41
## sample6       KO    typeA         3   15.551529          32

17.12 12 Data visualization with ggplot2

.md file = 12-ggplot2.md

## load the new_metadata data frame into your environment from a .RData object,
## if you need to set eval = TRUE
load("data/new_metadata.RData")
# this data frame should have 12 rows and 5 columns
head(new_metadata)
##         genotype celltype replicate samplemeans age_in_days
## sample1       Wt    typeA         1   10.266102          40
## sample2       Wt    typeA         2   10.849759          32
## sample3       Wt    typeA         3    9.452517          38
## sample4       KO    typeA         1   15.833872          35
## sample5       KO    typeA         2   15.590184          41
## sample6       KO    typeA         3   15.551529          32
library(ggplot2)

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot using additional functions one after the other and combine them using the + operator; the functions in the resulting code chunk are called layers.

1st layer: plot window

ggplot(new_metadata)  # what happens? we get a plot window

The geom (geometric) object is the layer that specifies what kind of plot we want to draw. A plot must have at least one geom; there is no upper limit. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

2nd layer: geometries

ggplot(new_metadata) + geom_point()  # note what happens here

17.12.1 Geometries

Geoms (e.g.) in layers: - geom_point() - geom_boxplot() - geom_bar() - geom_density() - geom_dotplot()

17.12.2 Aesthetics

We get an error because each type of geom usually has a required set of aesthetics. “Aesthetics” are set with the aes() function and can be set within geom_point() or within ggplot().

The aes() function can be used to specify many plot elements including the following:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • etc.

3rd layer: aesthetics

ggplot(new_metadata, aes(x = age_in_days, y = samplemeans)) + geom_point()  # what happens here? we initialize the plot window with the axes

Add color to aesthetics:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype))

Add shape to aesthetics:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
    shape = celltype))

Add point size to geometry:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
    shape = celltype), size = 2.25)

The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis label aesthetics
  • Plot background
  • Facet label background
  • Legend appearance

5th layer: theme

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
    shape = celltype), size = 3) + theme_bw()

Do the axis labels or the tick labels get any larger by changing themes?

6th layer: axes title size

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
    shape = celltype), size = 2.25) + theme_bw() + theme(axis.title = element_text(size = rel(1.5)))  # increase the size of the axis titles     

Exercise points +5

  1. The current axis label text defaults to what we gave as input to geom_point (i.e the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled “Age (days)” and the y-axis is labeled “Mean expression”.

  2. Use the ggtitle layer to add a plot title of your choice.

  3. Add the following new layer to the code chunk: theme(plot.title = element_text(hjust =0.5))

# Your code here

What does it change?

  • Ans:

How many theme() layers can be added to a ggplot code chunk, in your estimation?

  • Ans:

17.13 13 Boxplot exercise points +16

.md file = 13-boxplot_exercise.md

  1. Generate a boxplot using the data in the new_metadata dataframe. Create a ggplot2 code chunk with the following instructions:
  • Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.

  • Use the fill aesthetic to look at differences in sample means between the celltypes within each genotype.

  • Add a title to your plot.

  • Add labels, ‘Genotype’ for the x-axis and ‘Mean expression’ for the y-axis.

  • Make the following theme() changes:

    1. Use the theme_bw() function to make the background white.
    2. Change the size of your axes labels to 1.25x larger than the default.
    3. Change the size of your plot title to 1.5x larger than default.
    4. Center the plot title.
# Your code here
  1. Change genotype order: Factor the new_metadata$genotype column without creating any extra variables/objects and change the levels to c("Wt", "KO")
# Your code here
  1. Re-run the boxplot code chunk you created in 1.
# Your code here
  1. Change default colors

Add a new layer scale_color_manual(values=c(“purple”,“orange”)). Do you observe a change?

  • Ans:

Replace scale_color_manual(values=c(“purple”,“orange”)) with scale_fill_manual(values=c(“purple”,“orange”)). Do you observe a change?

  • Ans:

In the scatterplot we drew in class, add a new layer scale_color_manual(values=c(“purple”,“orange”)), do you observe a difference?

  • Ans:

What do you think is the difference between scale_color_manual() and scale_fill_manual()?

  • Ans:

  • Boxplot using “color” instead of “fill”. Use purple and orange for your colors.

# Your code here
  • Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.
# Your code here

Are there any colors that you tried that did not work?

  • Ans:
  1. OPTIONAL Exercise

Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.

# Hint: the gplots package. If gplots is not already installed, uncomment the
# following line of code

# BiocManager::install('gplots')

library(gplots)
# Your code here

17.14 14 Saving data and plots to file

.md file = 14-exporting_data_and_plots.md

17.14.1 Writing data to file

# Save a data frame to file
write.csv(sub_meta, file = "data/subset_meta.csv")

17.14.2 Exporting figures to file:

# opens the graphics device for writing to a file
pdf(file = "figures/scatterplot.pdf")

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype,
    shape = celltype), size = rel(3))

dev.off()  # closes the graphics device
## RStudioGD 
##         2

17.15 15 Finding help

.md file = 15-finding_help.md

17.15.1 Saving variables

In order to find help on line you must provide a reproducible example. Google and R sites such as Stackoverflow are good sources. To provide an example you may want to share an object with someone else, you can save any or all R data structures that you have in your environment to a file.

The function save.image() saves all environmental variables in your current R session to a file called “.RData” in your working directory. This is a command that you should use often when in an R session to save your work.

save.image()

# OR give it a memorable name

save.image("2024_IntroR_and_RStudio.RData")

# OR for a single object

save.image(metadata, "metadata.RData")

# you can also use

saveRDS(metadata, file = "metadata.RDS")  # we'll see how to load this later

17.15.2 Loading data

load(file = ".RData")  # to load .RData files

readRDS(file = "metadata.RDS")  # to load .rds files

17.15.3 sessionInfo()

Lists details about your current R session: R version and platform; locale and timezone; all packages and versions that are loaded as we saw before.

# you'll also often be required to provide your session info
sessionInfo()

17.16 16 Tidyverse data wrangling

.md file = 16-tidyverse.md

# If tidyverse is not already installed, uncomment the following line of code
# BiocManager::install('tidyverse')

library(tidyverse)

Tidyverse basics

17.16.1 Pipes: The pipe (%>%) allows the output of a previous command to be used as input to another command instead of using nested functions.

NOTE: Shortcut to write the pipe is shift + command + m on Macos; shift + ctrl + m on Windows

## A single command
sqrt(83)
## [1] 9.110434
## Base R method of running more than one command
round(sqrt(83), digits = 2)
## [1] 9.11
## Running more than one command with piping
sqrt(83) %>%
    round(digits = 2)
## [1] 9.11

Exercise points +2

  1. Create a vector of random numbers using the code below:
random_numbers <- c(81, 90, 65, 43, 71, 29)
  1. Use the pipe (%>%) to perform two steps in a single line:

    1. Take the mean of random_numbers using the mean() function.
    2. Round the output to three digits using the round() function.
# Your code here

17.16.2 Tibbles

Experimental data

The dataset: gprofiler_results_Mov10oe.tsv

The gene list represents the functional analysis results of differences between control mice and mice over-expressing a gene involved in RNA splicing. We will focus on gene ontology (GO) terms, which describe the roles of genes and gene products organized into three controlled vocabularies/ontologies (domains):

  • biological processes (BP)

  • cellular components (CC)

  • molecular functions (MF)

that are over-represented in our list of genes.

17.16.3 Analysis goal and workflow

Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values and create a plot.

1. Read in the functional analysis results

# Read in the functional analysis results
functional_GO_results <- read.delim(file = "data/gprofiler_results_Mov10oe.tsv")

# Take a look at the results
functional_GO_results[1:3, 1:12]
##   query.number significant p.value term.size query.size overlap.size recall
## 1            1        TRUE 0.00434       111       5850           52  0.009
## 2            1        TRUE 0.00330       110       5850           52  0.009
## 3            1        TRUE 0.02970        39       5850           21  0.004
##   precision    term.id domain subgraph.number
## 1     0.468 GO:0032606     BP             237
## 2     0.473 GO:0032479     BP             237
## 3     0.538 GO:0032480     BP             237
##                                             term.name
## 1                        type I interferon production
## 2          regulation of type I interferon production
## 3 negative regulation of type I interferon production
names(functional_GO_results)
##  [1] "query.number"    "significant"     "p.value"         "term.size"      
##  [5] "query.size"      "overlap.size"    "recall"          "precision"      
##  [9] "term.id"         "domain"          "subgraph.number" "term.name"      
## [13] "relative.depth"  "intersection"
names(functional_GO_results)
##  [1] "query.number"    "significant"     "p.value"         "term.size"      
##  [5] "query.size"      "overlap.size"    "recall"          "precision"      
##  [9] "term.id"         "domain"          "subgraph.number" "term.name"      
## [13] "relative.depth"  "intersection"

2. Extract only the GO biological processes (BP) of interest

# Return only GO biological processes
bp_oe <- functional_GO_results %>%
    dplyr::filter(domain == "BP")

bp_oe[1:3, 1:12]
##   query.number significant p.value term.size query.size overlap.size recall
## 1            1        TRUE 0.00434       111       5850           52  0.009
## 2            1        TRUE 0.00330       110       5850           52  0.009
## 3            1        TRUE 0.02970        39       5850           21  0.004
##   precision    term.id domain subgraph.number
## 1     0.468 GO:0032606     BP             237
## 2     0.473 GO:0032479     BP             237
## 3     0.538 GO:0032480     BP             237
##                                             term.name
## 1                        type I interferon production
## 2          regulation of type I interferon production
## 3 negative regulation of type I interferon production

Note: the dplyr::filter is necessary because the stats library gets loaded with ggplot2, rmarkdown and bookdown and stats also has a filter function

Exercise points +1

We would like to perform an additional round of filtering to only keep the most specific GO terms.

  1. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
  2. Save output to overwrite our bp_oe variable.
bp_oe <- bp_oe %>%
    dplyr::filter(relative.depth > 4)

17.16.4 Select

  1. Select only the colnames needed for visualization
# Selecting columns to keep for visualization
names(bp_oe)
##  [1] "query.number"    "significant"     "p.value"         "term.size"      
##  [5] "query.size"      "overlap.size"    "recall"          "precision"      
##  [9] "term.id"         "domain"          "subgraph.number" "term.name"      
## [13] "relative.depth"  "intersection"
bp_oe <- bp_oe %>%
    select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)

head(bp_oe[, 1:6])
##      term.id                                 term.name  p.value query.size
## 1 GO:0090672               telomerase RNA localization 2.41e-02       5850
## 2 GO:0090670            RNA localization to Cajal body 2.41e-02       5850
## 3 GO:0090671 telomerase RNA localization to Cajal body 2.41e-02       5850
## 4 GO:0071702               organic substance transport 7.90e-11       5850
## 5 GO:0015931  nucleobase-containing compound transport 4.43e-05       5850
## 6 GO:0050657                    nucleic acid transport 3.67e-06       5850
##   term.size overlap.size
## 1        16           11
## 2        16           11
## 3        16           11
## 4      2629          973
## 5       200           93
## 6       166           83
dim(bp_oe)
## [1] 668   7

17.16.5 Arrange

Let’s sort the rows by adjusted p-value with the arrange() function.

# Order by adjusted p-value ascending
bp_oe <- bp_oe %>%
    arrange(p.value)

17.16.6 Rename

Let’s rename the term.id and term.name columns.

# Provide better names for columns
bp_oe <- bp_oe %>%
    dplyr::rename(GO_id = term.id, GO_term = term.name)

Exercise points +1

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

# Your code here

17.16.7 Mutate

Create additional metrics for plotting (e.g. gene ratios)

# Create gene ratio column based on other columns in dataset
bp_oe <- bp_oe %>%
    mutate(gene_ratio = overlap.size/query.size)

Exercise points +1

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)

# Your code here

Gives you a sense of how many of the genes in the GO term are differentially expressed.

Exercise points +4

Use ggplot to plot the first 30 significant GO terms vs gene_ratio. Color the points by p.value as shown in the figure in below:

dotplot6

sub = bp_oe[1:30, ]

Change the p.value gradient colors to red-orange-yellow

# to help linearize the p.values for the colors; not essential for answer
sub$`-log10(p.value)` = -log10(sub$p.value)
# Your code here

See ggplot for more detail.