Chapter 17 Codebook

17.1 02 Introduction to R and RStudio

.md file = 02-introR-R-and-RStudio.md

17.1.1 Running commands and annotating code:

Add some additional code to the following chunk:

# Intro to R Lesson

3 + 5
## [1] 8

The chunk will run inline, but I prefer to set the chunk so it will run in the console, because it gives us extra room, especially for plotting. You can set this using the cog wheel at the top of the script window next to the Knit button.

17.1.2 The assignment operator (<-)

The assignment operator assigns values on the right to variables on the left.

In RStudio the keyboard shortcut for the assignment operator <- is Alt + - (Windows) or Option + - (Mac).

x <- 3
y <- 5
x + y
## [1] 8
number <- x + y

Exercise

  1. Try changing the value of the variable x to 5. What happens to number?
x <- 5
number
## [1] 8
number <- x + y

number
## [1] 10
  • Ans:
  1. Now try changing the value of variable y to contain the value 10. What do you need to do, to update the variable number?
  • Ans:

17.2 03 R syntax and data structures

.md file = 03-introR-syntax-and-data-structures.md

17.2.1 Vectors

# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths
## [1]     4.6  3000.0 50000.0
# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species
## [1] "ecoli" "human" "corn"

Exercise

Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this. Print the combined vector in the console, what looks different compared to the original vectors?

combined <- c(glengths, species)
combined
## [1] "4.6"   "3000"  "50000" "ecoli" "human" "corn"
  • Ans:

17.2.2 Factors

# Create a character vector and store the vector as a variable called 'expression'
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expression
## [1] "low"    "high"   "medium" "high"   "low"    "medium" "high"
# Turn 'expression' vector into a factor
expression <- factor(expression)
expression
## [1] low    high   medium high   low    medium high  
## Levels: high low medium

Exercises

Let’s say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.

  1. Create a vector named samplegroup with nine elements: 3 control (“CTL”) values, 3 knock-out (“KO”) values, and 3 over-expressing (“OE”) values.

  2. Turn samplegroup into a factor data structure.

17.2.3 Dataframes

A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors).

# Create a data frame and store it as a variable called 'df'
df <- data.frame(species, glengths)
df
species glengths
ecoli 4.6
human 3000.0
corn 50000.0

Exercise

Create a data frame called favorite_books with the following vectors as columns:

titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)

17.3 04 Functions in R

.md file = 04-introR-functions-and-arguments.md

17.3.1 Basic functions

We have already used a few examples of basic functions in the previous lessons i.e c(), and factor().

Many of the base functions in R involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let's try finding the square root of 81:

sqrt(81)
## [1] 9
sqrt(glengths)
## [1]   2.144761  54.772256 223.606798
round(3.14159)
## [1] 3

17.3.2 Seeking help on functions

`?`(round)
args(round)
## function (x, digits = 0) 
## NULL
example("round")
## 
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2  0  0  2  2  4  4
## [1] -2  0  0  2  2  4  4
## 
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round> 
## round> ( x1 <- seq(-2, 4, by = .5) )
##  [1] -2.0 -1.5 -1.0 -0.5  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0
## 
## round> round(x1) #-- IEEE / IEC rounding !
##  [1] -2 -2 -1  0  0  0  1  2  2  2  3  4  4
## 
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
## 
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5  0.5  2.5
## 
## round> (non.int <- ceiling(x1) != floor(x1))
##  [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
## 
## round> x2 <- pi * 100^(-1:3)
## 
## round> round(x2, 3)
## [1]       0.031       3.142     314.159   31415.927 3141592.654
## 
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06
round(3.14159, digits = 2)
## [1] 3.14

Exercise

  1. Let’s use base R function to calculate mean value of the glengths vector. You might need to search online to find what function can perform this task.
mean(glengths)
## [1] 17668.2
  1. Create a new vector test <- c(1, NA, 2, 3, NA, 4). Use the same base R function from exercise 1 (with addition of proper argument), and calculate the mean value of the test vector. The output should be 2.5. > NOTE: In R, missing values are represented by the symbol NA (not available). It’s a way to make sure that users know they have missing data, and make a conscious decision on how to deal with it. There are ways to ignore NA during statistical calculation, or to remove NA from the vector. If you want more information related to missing data or NA you can go to the NA help page (please note that there are many advanced concepts on that page that have not been covered in class).
test <- c(1, NA, 2, 3, NA, 4)

mean(test, na.rm = TRUE)
## [1] 2.5
  1. Another commonly used base function is sort(). Use this function to sort the glengths vector in descending order.
sort(glengths)
## [1]     4.6  3000.0 50000.0

17.3.3 User-defined Functions

Let's try creating a simple example function. This function will take in a numeric value as input, and return the squared value.

square_it <- function(x) {
    square <- x * x
    return(square)
}

Once you run the code, you should see a function named square_it in the Environment panel (located at the top right of Rstudio interface). Now, we can use this function as any other base R functions. We type out the name of the function, and inside the parentheses we provide a numeric value x:

square_it(5)

Pretty simple, right? In this case, we only had one line of code that was run, but in theory you could have many lines of code to get obtain the final results that you want to "return" to the user. We have only scratched the surface here when it comes to creating functions! If you are interested you can also find more detailed information on writing functions R-bloggers site.


Exercise

  1. Write a function called multiply_it, which takes two inputs: a numeric value x, and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x=4, y=6) will return output 24.
multiply_it <- function(x, y) {
    product <- x * y
    return(product)
}
multiply_it(x = 4, y = 6)
## [1] 24

17.4 05 Packages and libraries

.md file = 05-introR_packages.md

17.4.1 Packages and Libraries

You can check what libraries (packages) are loaded in your current R session by typing into the console:

sessionInfo()  #Print version information about R, the OS and attached or loaded packages
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6.5
## 
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rmarkdown_2.20 knitr_1.42     bookdown_0.32 
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.31   R6_2.5.1        jsonlite_1.8.4  formatR_1.14    evaluate_0.20   cachem_1.0.6    rlang_1.0.6    
##  [8] cli_3.6.0       rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2     tools_4.2.1     xfun_0.36       yaml_2.3.7     
## [15] fastmap_1.1.0   compiler_4.2.1  htmltools_0.5.4 sass_0.4.5
# OR

search()  #Gives a list of attached packages
##  [1] ".GlobalEnv"        "package:rmarkdown" "package:knitr"     "package:bookdown"  "tools:rstudio"     "package:stats"    
##  [7] "package:graphics"  "package:grDevices" "package:utils"     "package:datasets"  "package:methods"   "Autoloads"        
## [13] "package:base"

17.4.2 Installing packages

17.4.2.1 Packages from CRAN:

Previously we have introduced you to installing packages. An example is given below for the ggplot2 package that will be required for some plots we will create later on. Run this code to install ggplot2.

Set eval = TRUE if you haven't installed ggplot2 yet

install.packages("ggplot2")

Alternatively, packages can also be installed from Bioconductor, a repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data.

Set eval = TRUE if you haven't installed BiocManager yet

install.packages("BiocManager")

17.4.2.2 Packages from Bioconductor:

Now you can use BiocManager::install to install packages available on Bioconductor. Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.

# DO NOT RUN THIS!
BiocManager::install("ggplot2")

17.4.3 Loading libraries/packages

library(ggplot2)

To see the functions in a package you can also type:

ls("package:ggplot2")
##   [1] "%+%"                        "%+replace%"                 "aes"                        "aes_"                      
##   [5] "aes_all"                    "aes_auto"                   "aes_q"                      "aes_string"                
##   [9] "after_scale"                "after_stat"                 "alpha"                      "annotate"                  
##  [13] "annotation_custom"          "annotation_logticks"        "annotation_map"             "annotation_raster"         
##  [17] "arrow"                      "as_label"                   "as_labeller"                "autolayer"                 
##  [21] "autoplot"                   "AxisSecondary"              "benchplot"                  "binned_scale"              
##  [25] "borders"                    "calc_element"               "combine_vars"               "continuous_scale"          
##  [29] "Coord"                      "coord_cartesian"            "coord_equal"                "coord_fixed"               
##  [33] "coord_flip"                 "coord_map"                  "coord_munch"                "coord_polar"               
##  [37] "coord_quickmap"             "coord_sf"                   "coord_trans"                "CoordCartesian"            
##  [41] "CoordFixed"                 "CoordFlip"                  "CoordMap"                   "CoordPolar"                
##  [45] "CoordQuickmap"              "CoordSf"                    "CoordTrans"                 "cut_interval"              
##  [49] "cut_number"                 "cut_width"                  "derive"                     "diamonds"                  
##  [53] "discrete_scale"             "draw_key_abline"            "draw_key_blank"             "draw_key_boxplot"          
##  [57] "draw_key_crossbar"          "draw_key_dotplot"           "draw_key_label"             "draw_key_linerange"        
##  [61] "draw_key_path"              "draw_key_point"             "draw_key_pointrange"        "draw_key_polygon"          
##  [65] "draw_key_rect"              "draw_key_smooth"            "draw_key_text"              "draw_key_timeseries"       
##  [69] "draw_key_vline"             "draw_key_vpath"             "dup_axis"                   "economics"                 
##  [73] "economics_long"             "el_def"                     "element_blank"              "element_grob"              
##  [77] "element_line"               "element_rect"               "element_render"             "element_text"              
##  [81] "enexpr"                     "enexprs"                    "enquo"                      "enquos"                    
##  [85] "ensym"                      "ensyms"                     "expand_limits"              "expand_scale"              
##  [89] "expansion"                  "expr"                       "Facet"                      "facet_grid"                
##  [93] "facet_null"                 "facet_wrap"                 "FacetGrid"                  "FacetNull"                 
##  [97] "FacetWrap"                  "faithfuld"                  "find_panel"                 "flip_data"                 
## [101] "flipped_names"              "fortify"                    "Geom"                       "geom_abline"               
## [105] "geom_area"                  "geom_bar"                   "geom_bin_2d"                "geom_bin2d"                
## [109] "geom_blank"                 "geom_boxplot"               "geom_col"                   "geom_contour"              
## [113] "geom_contour_filled"        "geom_count"                 "geom_crossbar"              "geom_curve"                
## [117] "geom_density"               "geom_density_2d"            "geom_density_2d_filled"     "geom_density2d"            
## [121] "geom_density2d_filled"      "geom_dotplot"               "geom_errorbar"              "geom_errorbarh"            
## [125] "geom_freqpoly"              "geom_function"              "geom_hex"                   "geom_histogram"            
## [129] "geom_hline"                 "geom_jitter"                "geom_label"                 "geom_line"                 
## [133] "geom_linerange"             "geom_map"                   "geom_path"                  "geom_point"                
## [137] "geom_pointrange"            "geom_polygon"               "geom_qq"                    "geom_qq_line"              
## [141] "geom_quantile"              "geom_raster"                "geom_rect"                  "geom_ribbon"               
## [145] "geom_rug"                   "geom_segment"               "geom_sf"                    "geom_sf_label"             
## [149] "geom_sf_text"               "geom_smooth"                "geom_spoke"                 "geom_step"                 
## [153] "geom_text"                  "geom_tile"                  "geom_violin"                "geom_vline"                
## [157] "GeomAbline"                 "GeomAnnotationMap"          "GeomArea"                   "GeomBar"                   
## [161] "GeomBlank"                  "GeomBoxplot"                "GeomCol"                    "GeomContour"               
## [165] "GeomContourFilled"          "GeomCrossbar"               "GeomCurve"                  "GeomCustomAnn"             
## [169] "GeomDensity"                "GeomDensity2d"              "GeomDensity2dFilled"        "GeomDotplot"               
## [173] "GeomErrorbar"               "GeomErrorbarh"              "GeomFunction"               "GeomHex"                   
## [177] "GeomHline"                  "GeomLabel"                  "GeomLine"                   "GeomLinerange"             
## [181] "GeomLogticks"               "GeomMap"                    "GeomPath"                   "GeomPoint"                 
## [185] "GeomPointrange"             "GeomPolygon"                "GeomQuantile"               "GeomRaster"                
## [189] "GeomRasterAnn"              "GeomRect"                   "GeomRibbon"                 "GeomRug"                   
## [193] "GeomSegment"                "GeomSf"                     "GeomSmooth"                 "GeomSpoke"                 
## [197] "GeomStep"                   "GeomText"                   "GeomTile"                   "GeomViolin"                
## [201] "GeomVline"                  "get_alt_text"               "get_element_tree"           "gg_dep"                    
## [205] "ggplot"                     "ggplot_add"                 "ggplot_build"               "ggplot_gtable"             
## [209] "ggplotGrob"                 "ggproto"                    "ggproto_parent"             "ggsave"                    
## [213] "ggtitle"                    "guide_axis"                 "guide_bins"                 "guide_colorbar"            
## [217] "guide_colorsteps"           "guide_colourbar"            "guide_coloursteps"          "guide_gengrob"             
## [221] "guide_geom"                 "guide_legend"               "guide_merge"                "guide_none"                
## [225] "guide_train"                "guide_transform"            "guides"                     "has_flipped_aes"           
## [229] "is.Coord"                   "is.facet"                   "is.ggplot"                  "is.ggproto"                
## [233] "is.theme"                   "label_both"                 "label_bquote"               "label_context"             
## [237] "label_parsed"               "label_value"                "label_wrap_gen"             "labeller"                  
## [241] "labs"                       "last_plot"                  "layer"                      "layer_data"                
## [245] "layer_grob"                 "layer_scales"               "layer_sf"                   "Layout"                    
## [249] "lims"                       "luv_colours"                "map_data"                   "margin"                    
## [253] "max_height"                 "max_width"                  "mean_cl_boot"               "mean_cl_normal"            
## [257] "mean_sdl"                   "mean_se"                    "median_hilow"               "merge_element"             
## [261] "midwest"                    "mpg"                        "msleep"                     "panel_cols"                
## [265] "panel_rows"                 "Position"                   "position_dodge"             "position_dodge2"           
## [269] "position_fill"              "position_identity"          "position_jitter"            "position_jitterdodge"      
## [273] "position_nudge"             "position_stack"             "PositionDodge"              "PositionDodge2"            
## [277] "PositionFill"               "PositionIdentity"           "PositionJitter"             "PositionJitterdodge"       
## [281] "PositionNudge"              "PositionStack"              "presidential"               "qplot"                     
## [285] "quickplot"                  "quo"                        "quo_name"                   "quos"                      
## [289] "register_theme_elements"    "rel"                        "remove_missing"             "render_axes"               
## [293] "render_strips"              "reset_theme_settings"       "resolution"                 "Scale"                     
## [297] "scale_alpha"                "scale_alpha_binned"         "scale_alpha_continuous"     "scale_alpha_date"          
## [301] "scale_alpha_datetime"       "scale_alpha_discrete"       "scale_alpha_identity"       "scale_alpha_manual"        
## [305] "scale_alpha_ordinal"        "scale_color_binned"         "scale_color_brewer"         "scale_color_continuous"    
## [309] "scale_color_date"           "scale_color_datetime"       "scale_color_discrete"       "scale_color_distiller"     
## [313] "scale_color_fermenter"      "scale_color_gradient"       "scale_color_gradient2"      "scale_color_gradientn"     
## [317] "scale_color_grey"           "scale_color_hue"            "scale_color_identity"       "scale_color_manual"        
## [321] "scale_color_ordinal"        "scale_color_steps"          "scale_color_steps2"         "scale_color_stepsn"        
## [325] "scale_color_viridis_b"      "scale_color_viridis_c"      "scale_color_viridis_d"      "scale_colour_binned"       
## [329] "scale_colour_brewer"        "scale_colour_continuous"    "scale_colour_date"          "scale_colour_datetime"     
## [333] "scale_colour_discrete"      "scale_colour_distiller"     "scale_colour_fermenter"     "scale_colour_gradient"     
## [337] "scale_colour_gradient2"     "scale_colour_gradientn"     "scale_colour_grey"          "scale_colour_hue"          
## [341] "scale_colour_identity"      "scale_colour_manual"        "scale_colour_ordinal"       "scale_colour_steps"        
## [345] "scale_colour_steps2"        "scale_colour_stepsn"        "scale_colour_viridis_b"     "scale_colour_viridis_c"    
## [349] "scale_colour_viridis_d"     "scale_continuous_identity"  "scale_discrete_identity"    "scale_discrete_manual"     
## [353] "scale_fill_binned"          "scale_fill_brewer"          "scale_fill_continuous"      "scale_fill_date"           
## [357] "scale_fill_datetime"        "scale_fill_discrete"        "scale_fill_distiller"       "scale_fill_fermenter"      
## [361] "scale_fill_gradient"        "scale_fill_gradient2"       "scale_fill_gradientn"       "scale_fill_grey"           
## [365] "scale_fill_hue"             "scale_fill_identity"        "scale_fill_manual"          "scale_fill_ordinal"        
## [369] "scale_fill_steps"           "scale_fill_steps2"          "scale_fill_stepsn"          "scale_fill_viridis_b"      
## [373] "scale_fill_viridis_c"       "scale_fill_viridis_d"       "scale_linetype"             "scale_linetype_binned"     
## [377] "scale_linetype_continuous"  "scale_linetype_discrete"    "scale_linetype_identity"    "scale_linetype_manual"     
## [381] "scale_linewidth"            "scale_linewidth_binned"     "scale_linewidth_continuous" "scale_linewidth_date"      
## [385] "scale_linewidth_datetime"   "scale_linewidth_discrete"   "scale_linewidth_ordinal"    "scale_radius"              
## [389] "scale_shape"                "scale_shape_binned"         "scale_shape_continuous"     "scale_shape_discrete"      
## [393] "scale_shape_identity"       "scale_shape_manual"         "scale_shape_ordinal"        "scale_size"                
## [397] "scale_size_area"            "scale_size_binned"          "scale_size_binned_area"     "scale_size_continuous"     
## [401] "scale_size_date"            "scale_size_datetime"        "scale_size_discrete"        "scale_size_identity"       
## [405] "scale_size_manual"          "scale_size_ordinal"         "scale_type"                 "scale_x_binned"            
## [409] "scale_x_continuous"         "scale_x_date"               "scale_x_datetime"           "scale_x_discrete"          
## [413] "scale_x_log10"              "scale_x_reverse"            "scale_x_sqrt"               "scale_x_time"              
## [417] "scale_y_binned"             "scale_y_continuous"         "scale_y_date"               "scale_y_datetime"          
## [421] "scale_y_discrete"           "scale_y_log10"              "scale_y_reverse"            "scale_y_sqrt"              
## [425] "scale_y_time"               "ScaleBinned"                "ScaleBinnedPosition"        "ScaleContinuous"           
## [429] "ScaleContinuousDate"        "ScaleContinuousDatetime"    "ScaleContinuousIdentity"    "ScaleContinuousPosition"   
## [433] "ScaleDiscrete"              "ScaleDiscreteIdentity"      "ScaleDiscretePosition"      "seals"                     
## [437] "sec_axis"                   "set_last_plot"              "sf_transform_xy"            "should_stop"               
## [441] "stage"                      "standardise_aes_names"      "stat"                       "Stat"                      
## [445] "stat_align"                 "stat_bin"                   "stat_bin_2d"                "stat_bin_hex"              
## [449] "stat_bin2d"                 "stat_binhex"                "stat_boxplot"               "stat_contour"              
## [453] "stat_contour_filled"        "stat_count"                 "stat_density"               "stat_density_2d"           
## [457] "stat_density_2d_filled"     "stat_density2d"             "stat_density2d_filled"      "stat_ecdf"                 
## [461] "stat_ellipse"               "stat_function"              "stat_identity"              "stat_qq"                   
## [465] "stat_qq_line"               "stat_quantile"              "stat_sf"                    "stat_sf_coordinates"       
## [469] "stat_smooth"                "stat_spoke"                 "stat_sum"                   "stat_summary"              
## [473] "stat_summary_2d"            "stat_summary_bin"           "stat_summary_hex"           "stat_summary2d"            
## [477] "stat_unique"                "stat_ydensity"              "StatAlign"                  "StatBin"                   
## [481] "StatBin2d"                  "StatBindot"                 "StatBinhex"                 "StatBoxplot"               
## [485] "StatContour"                "StatContourFilled"          "StatCount"                  "StatDensity"               
## [489] "StatDensity2d"              "StatDensity2dFilled"        "StatEcdf"                   "StatEllipse"               
## [493] "StatFunction"               "StatIdentity"               "StatQq"                     "StatQqLine"                
## [497] "StatQuantile"               "StatSf"                     "StatSfCoordinates"          "StatSmooth"                
## [501] "StatSum"                    "StatSummary"                "StatSummary2d"              "StatSummaryBin"            
## [505] "StatSummaryHex"             "StatUnique"                 "StatYdensity"               "summarise_coord"           
## [509] "summarise_layers"           "summarise_layout"           "sym"                        "syms"                      
## [513] "theme"                      "theme_bw"                   "theme_classic"              "theme_dark"                
## [517] "theme_get"                  "theme_gray"                 "theme_grey"                 "theme_light"               
## [521] "theme_linedraw"             "theme_minimal"              "theme_replace"              "theme_set"                 
## [525] "theme_test"                 "theme_update"               "theme_void"                 "transform_position"        
## [529] "txhousing"                  "unit"                       "update_geom_defaults"       "update_labels"             
## [533] "update_stat_defaults"       "vars"                       "waiver"                     "wrap_dims"                 
## [537] "xlab"                       "xlim"                       "ylab"                       "ylim"                      
## [541] "zeroGrob"

Exercise

The tidyverse suite of integrated packages was designed to work together to make common data science operations more user-friendly. We will be using the tidyverse suite in later lessons, and so let’s install it. Use the function from theBiocManager package.

Set eval=TRUE if tidyverse is not yet installed. How can you run the BiocManager::install() command for a package that is already installed without getting an error?

  • Ans:

17.5 06 Subsetting: vectors and factors

.md file = 06-introR-data wrangling.md

17.5.1 Vectors: Selecting using indices

age <- c(15, 22, 45, 52, 73, 81)
age
## [1] 15 22 45 52 73 81
age[5]
## [1] 73

Reverse selection:

age[-5]
## [1] 15 22 45 52 81
age[c(3, 5, 6)]  ## nested
## [1] 45 73 81
# OR

## create a vector first then select
idx <- c(3, 5, 6)  # create vector of the elements of interest
age[idx]
## [1] 45 73 81
age[1:4]
## [1] 15 22 45 52

Exercises

  1. Create a vector called alphabets with the following letters, C, D, X, L, F.

  2. Use the associated indices along with [ ] to do the following:

    1. only display C, D and F

    2. display all except X

    3. display the letters in the opposite order (F, L, X, D, C) *Bonus points for using a function instead of the indices (Hint: use Google)

17.5.2 Selecting using logical operators

age > 50
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE
age > 50 | age < 18
## [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
age
## [1] 15 22 45 52 73 81
age[age > 50 | age < 18]  ## nested
## [1] 15 52 73 81
# OR

## create a vector first then select
idx <- age > 50 | age < 18
idx
## [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
age[idx]
## [1] 15 52 73 81

17.5.3 Logical operators with which()

which(age > 50 | age < 18)
## [1] 1 4 5 6
age[which(age > 50 | age < 18)]  ## nested
## [1] 15 52 73 81
# OR

## create a vector first then select
idx_num <- which(age > 50 | age < 18)
age[idx_num]
## [1] 15 52 73 81

17.5.4 Factors

expression <- c("low", "high", "medium", "high", "low", "medium", "high")

expression <- factor(expression)

str(expression)
##  Factor w/ 3 levels "high","low","medium": 2 1 3 1 2 3 1
expression[expression == "high"]  ## This will only return those elements in the factor equal to 'high'
## [1] high high high
## Levels: high low medium

Exercise

Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).

17.5.4.1 Releveling factors

expression <- factor(expression, levels = c("low", "medium", "high"))  # you can re-factor a factor 

str(expression)
##  Factor w/ 3 levels "low","medium",..: 1 3 2 3 1 2 3

Exercise

Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.

It's always a good idea to periodically save your work. You can do this by using the command save.image():

save.image()  # saves all environmental variables in your current R session to a file called '.RData' in your working directory -- this is a command that you should use often when in an R session to save your work.

# OR give it a memorable name

save.image("intro-to-R.RData")

17.6 07 Reading and data inspection

.md file = 07-reading_and_data_inspection.md

17.6.1 Reading data into R:

`?`(read.csv)
metadata <- read.csv(file = "data/mouse_exp_design.csv")

We can see if it has successfully been read in by running:

metadata
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
sample7 Wt typeB 1
sample8 Wt typeB 2
sample9 Wt typeB 3
sample10 KO typeB 1
sample11 KO typeB 2
sample12 KO typeB 3

Exercise 1

  1. Read "project-summary.txt" in to R using read.table() with the approriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
  • all the columns in the input text file have column name/headers

  • you want the first column of the text file to be used as row names

  • hint: look up the input for the row.names = argument in read.table()

  1. Display the contents of proj_summary in your console

17.6.2 Inspecting data structures:

Getting info on environmental variables:

Let's use the metadata file that we created to test out data inspection functions.

head(metadata)
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
str(metadata)
## 'data.frame':    12 obs. of  3 variables:
##  $ genotype : chr  "Wt" "Wt" "Wt" "KO" ...
##  $ celltype : chr  "typeA" "typeA" "typeA" "typeA" ...
##  $ replicate: int  1 2 3 1 2 3 1 2 3 1 ...
dim(metadata)
## [1] 12  3
nrow(metadata)
## [1] 12
ncol(metadata)
## [1] 3
class(metadata)
## [1] "data.frame"
colnames(metadata)
## [1] "genotype"  "celltype"  "replicate"

Exercise 2

  • What is the class of each column in metadata (use one command)?
str(metadata)
## 'data.frame':    12 obs. of  3 variables:
##  $ genotype : chr  "Wt" "Wt" "Wt" "KO" ...
##  $ celltype : chr  "typeA" "typeA" "typeA" "typeA" ...
##  $ replicate: int  1 2 3 1 2 3 1 2 3 1 ...
  • What is the median of the replicates in metadata (use one command)?

Exercise 3

Use the class() function on glengths and metadata, how does the output differ between the two?

class(glengths)
## [1] "numeric"
class(metadata)
## [1] "data.frame"

Use the summary() function on the proj_summary dataframe, what is the median “Intronic_Rate?

  • Ans:

How long is the samplegroup factor?

  • Ans:

What are the dimensions of the proj_summary dataframe?

When you use the rownames() function on metadata, what is the data structure of the output?

  • Ans:

How many elements in (how long is) the output of colnames(proj_summary)? Don’t count, but use another function to determine this.

17.7 08 Data frames and matrices

**.md file = 08-introR-_data_wrangling2.md**

17.7.1 Data wrangling: dataframes

Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from it we need to specify the “coordinates” we want from it. We use the same square bracket notation but rather than providing a single index, there are two indices required. Within the square bracket, row numbers come first followed by column numbers (and the two are separated by a comma).

Extracting values from data.frames:

# Extract value 'Wt'
metadata[1, 1]
## [1] "Wt"
# Extract third row
metadata[3, ]
genotype celltype replicate
sample3 Wt typeA 3
# Extract third column
metadata[, 3]
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3
# Extract third column as a data frame
metadata[, 3, drop = FALSE]
replicate
sample1 1
sample2 2
sample3 3
sample4 1
sample5 2
sample6 3
sample7 1
sample8 2
sample9 3
sample10 1
sample11 2
sample12 3
# Dataframe containing first two columns
metadata[, 1:2]
genotype celltype
sample1 Wt typeA
sample2 Wt typeA
sample3 Wt typeA
sample4 KO typeA
sample5 KO typeA
sample6 KO typeA
sample7 Wt typeB
sample8 Wt typeB
sample9 Wt typeB
sample10 KO typeB
sample11 KO typeB
sample12 KO typeB
# Data frame containing first, third and sixth rows
metadata[c(1, 3, 6), ]
genotype celltype replicate
sample1 Wt typeA 1
sample3 Wt typeA 3
sample6 KO typeA 3
# Extract the celltype column for the first three samples
metadata[c("sample1", "sample2", "sample3"), "celltype"]
## [1] "typeA" "typeA" "typeA"
# Check column names of metadata data frame
colnames(metadata)
## [1] "genotype"  "celltype"  "replicate"
# Check row names of metadata data frame
rownames(metadata)
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6"  "sample7"  "sample8"  "sample9"  "sample10" "sample11"
## [12] "sample12"
# Extract the genotype column
metadata$genotype
##  [1] "Wt" "Wt" "Wt" "KO" "KO" "KO" "Wt" "Wt" "Wt" "KO" "KO" "KO"
# Extract the first five values/elements of the genotype column
metadata$genotype[1:5]
## [1] "Wt" "Wt" "Wt" "KO" "KO"

Exercises

Return a data frame with only the genotype and replicate column values for sample2 and sample8.

Return the fourth and ninth values of the replicate column.

Extract the replicate column as a data frame.

17.7.1.1 Selecting using logical operators:

For example, if we want to return only those rows of the data frame with the celltype column having a value of typeA, we would perform two steps:

  1. Identify which rows in the celltype column have a value of typeA.

  2. Use those TRUE values to extract those rows from the data frame.

To do this we would extract the column of interest as a vector, with the first value corresponding to the first row, the second value corresponding to the second row, so on and so forth. We use that vector in the logical expression. Here we are looking for values to be equal to typeA,

So our logical expression would be:

metadata$celltype == "typeA"
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
logical_idx <- metadata$celltype == "typeA"

If we want to keep all the rows that have celltype == "typeA": Any row in the logical_idx that is TRUE will be kept, those that are FALSE will be discarded.

metadata[logical_idx, ]
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
idx <- which(metadata$celltype == "typeA")
metadata[idx, ]
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3
# OR you can use nesting
metadata[which(metadata$celltype == "typeA"), ]
genotype celltype replicate
sample1 Wt typeA 1
sample2 Wt typeA 2
sample3 Wt typeA 3
sample4 KO typeA 1
sample5 KO typeA 2
sample6 KO typeA 3

Let’s try another subsetting. Extract the rows of the metadata data frame for only the replicates 2 and 3. First, let’s create the logical expression for the column of interest (replicate):

idx <- which(metadata$replicate > 1)

metadata[idx, ]
genotype celltype replicate
sample2 Wt typeA 2
sample3 Wt typeA 3
sample5 KO typeA 2
sample6 KO typeA 3
sample8 Wt typeB 2
sample9 Wt typeB 3
sample11 KO typeB 2
sample12 KO typeB 3

This should return the indices for the rows in the replicate column within metadata that have a value of 2 or 3. Now, we can save those indices to a variable and use that variable to extract those corresponding rows from the metadata table. Alternatively, you could do this in one line:

sub_meta <- metadata[which(metadata$replicate > 1), ]

Exercise

Subset the metadata dataframe to return only the rows of data with a genotype of KO.

17.8 09 Logical operators for matching

.md file = 09-identifying-matching-elements.md

17.8.1 Logical operators to match elements

rpkm_data <- read.csv("data/counts.rpkm.csv")
head(rpkm_data)
sample2 sample5 sample7 sample8 sample9 sample4 sample6 sample12 sample3 sample11 sample10 sample1
ENSMUSG00000000001 19.265000 23.7222000 2.611610 5.8495400 6.5126300 24.076700 20.8198000 26.9158000 20.889500 24.0465000 24.198100 19.7848000
ENSMUSG00000000003 0.000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000 0.0000000 0.0000000 0.000000 0.0000000 0.000000 0.0000000
ENSMUSG00000000028 1.032290 0.8269540 1.134410 0.6987540 0.9251170 0.827891 1.1686300 0.6735630 0.892183 0.9753270 1.045920 0.9377920
ENSMUSG00000000031 0.000000 0.0000000 0.000000 0.0298449 0.0597726 0.000000 0.0511932 0.0204382 0.000000 0.0000000 0.000000 0.0359631
ENSMUSG00000000037 0.056033 0.0473238 0.000000 0.0685938 0.0494147 0.180883 0.1438840 0.0662324 0.146196 0.0206405 0.017004 0.1514170
ENSMUSG00000000049 0.258134 1.0730200 0.252342 0.2970320 0.2082800 2.191720 1.6853800 0.1161970 0.421286 0.0634322 0.369550 0.2567330
colnames(rpkm_data)
##  [1] "sample2"  "sample5"  "sample7"  "sample8"  "sample9"  "sample4"  "sample6"  "sample12" "sample3"  "sample11" "sample10"
## [12] "sample1"
rownames(metadata)
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6"  "sample7"  "sample8"  "sample9"  "sample10" "sample11"
## [12] "sample12"

It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order.

ncol(rpkm_data)
## [1] 12
nrow(metadata)
## [1] 12

17.8.2 The %in% operator

The %in% operator will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size. This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:

A <- c(1, 3, 5, 7, 9, 11)  # odd numbers
B <- c(2, 4, 6, 8, 10, 12)  # even numbers

# test to see if each of the elements of A is in B
A %in% B
## [1] FALSE FALSE FALSE FALSE FALSE FALSE

Let’s change a couple of numbers inside vector B to match vector A:

A <- c(1, 3, 5, 7, 9, 11)  # odd numbers
B <- c(2, 4, 6, 8, 1, 5)  # add some odd numbers in 

# test to see if each of the elements of A is in B
A %in% B
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
intersection <- A %in% B
intersection
## [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
A[intersection]
## [1] 1 5
any(A %in% B)
## [1] TRUE
all(A %in% B)
## [1] FALSE

Exercise

  1. Using the A and B vectors created above, evaluate each element in B to see if there is a match in A
B %in% A
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE
  1. Subset the B vector to only return those values that are also in A.
B[B %in% A]
## [1] 1 5

Suppose we had two vectors containing same values. How can we check if those values are in the same order in each vector? In this case, we can use == operator to compare each element of the same position from two vectors. Unlike %in% operator, == operator requires that two vectors are of equal length.

A <- c(10, 20, 30, 40, 50)
B <- c(50, 40, 30, 20, 10)  # same numbers but backwards 

# test to see if each element of A is in B
A %in% B
## [1] TRUE TRUE TRUE TRUE TRUE
# test to see if each element of A is in the same position in B
A == B
## [1] FALSE FALSE  TRUE FALSE FALSE
# use all() to check if they are a perfect match
all(A == B)
## [1] FALSE

Let’s try this on our genomic data, and see whether we have metadata information for all samples in our expression data.

x <- rownames(metadata)
y <- colnames(rpkm_data)
all(x %in% y)
## [1] TRUE

Now we can replace x and y by their real values to get the same answer:

all(rownames(metadata) %in% colnames(rpkm_data))
## [1] TRUE
x == y
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
all(x == y)
## [1] FALSE

Looks like all of the samples are there, but they need to be reordered. To reorder our genomic samples, we will learn different ways to reorder data in our next lesson.

Exercise

We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.

First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:

important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619", "ENSMUSG00000047945",
    "ENSMUSG00000081010", "ENSMUSG00000030970")
  1. Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.

  2. Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.

  3. Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.

17.9 10 Reordering to match datasets

.md file = 10-reordering-to-match-datasets.md

Exercise

We know how to reorder using indices, let’s try to use it to reorder the contents of one vector to match the contents of another. Let’s create the vectors first and second as detailed below:

first <- c("A", "B", "C", "D", "E")
second <- c("B", "D", "E", "A", "C")  # same letters but different order

How would you reorder the second vector to match first?

17.9.1 The match function

match() takes 2 arguments. The first argument is a vector of values in the order you want, while the second argument is the vector of values to be reordered such that it will match the first:

1st vector: values in the order you want 2nd vector: values to be reordered

The match function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let's use match() on the first and second vectors we created.

# Saving indices for how to reorder `second` to match `first`
reorder_idx <- match(first, second)
reorder_idx
## [1] 4 1 5 2 3
# Reordering the second vector to match the order of the first vector
second[reorder_idx]
## [1] "A" "B" "C" "D" "E"
# Reordering and saving the output to a variable
second_reordered <- second[reorder_idx]

Matching with vectors of different lengths:

first <- c("A", "B", "C", "D", "E")
second <- c("D", "B", "A")  # remove values
match(first, second)
## [1]  3  2 NA  1 NA
second[match(first, second)]
## [1] "A" "B" NA  "D" NA

NOTE: For values that don't match by default return an NA value. You can specify what values you would have it assigned using nomatch argument.

Example:

mtch <- match(first, second, nomatch = 0)
mtch
## [1] 3 2 0 1 0
second[mtch]
## [1] "A" "B" "D"
# Check row names of the metadata
rownames(metadata)
##  [1] "sample1"  "sample2"  "sample3"  "sample4"  "sample5"  "sample6"  "sample7"  "sample8"  "sample9"  "sample10" "sample11"
## [12] "sample12"
# Check the column names of the counts data
colnames(rpkm_data)
##  [1] "sample2"  "sample5"  "sample7"  "sample8"  "sample9"  "sample4"  "sample6"  "sample12" "sample3"  "sample11" "sample10"
## [12] "sample1"

Reordering genomic data using the match() function:

genomic_idx <- match(rownames(metadata), colnames(rpkm_data))
genomic_idx
##  [1] 12  1  9  6  2  7  3  4  5 11 10  8
# Reorder the counts data frame to have the sample names in the same order as the metadata data
# frame
rpkm_ordered <- rpkm_data[, genomic_idx]
# View the reordered counts
View(rpkm_ordered)

# I prefer to use `head` -- and set the output to a limited number of columns so that it fits in
# the console

head(rpkm_ordered[, 1:5])
sample1 sample2 sample3 sample4 sample5
ENSMUSG00000000001 19.7848000 19.265000 20.889500 24.076700 23.7222000
ENSMUSG00000000003 0.0000000 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000000028 0.9377920 1.032290 0.892183 0.827891 0.8269540
ENSMUSG00000000031 0.0359631 0.000000 0.000000 0.000000 0.0000000
ENSMUSG00000000037 0.1514170 0.056033 0.146196 0.180883 0.0473238
ENSMUSG00000000049 0.2567330 0.258134 0.421286 2.191720 1.0730200

Now we can check if the rownames of metadata is in the same order as the colnames of rpkm_ordered:

all(rownames(metadata) == colnames(rpkm_ordered))
## [1] TRUE

Exercises

After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.

Use the match() function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.

17.10 11 Plotting and data visualization

.md file = 11-setting_up_to_plot.md10-reordering-to-match-datasets.md

17.10.1 Setup a data frame for visualization

mean(rpkm_ordered$sample1)
## [1] 10.2661

But what we want is a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?

To get the mean of all the samples in a single line of code the map() family of function is a good option.

17.10.2 The map family of functions

We can use map functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.

  • map() creates a list.
  • map_lgl() creates a logical vector.
  • map_int() creates an integer vector.
  • map_dbl() creates a “double” or numeric vector.
  • map_chr() creates a character vector.

The syntax for the map() family of functions is:

  • map(object, function_to_apply)

We can use the map_dbl() to get the mean of each column of rpkm_ordered in just one command:

# install purrr if you need to by uncommenting the following line: BiocManager::install(purrr)
library(purrr)  # Load the purrr

samplemeans <- map_dbl(rpkm_ordered, mean)

The output of map_dbl() is a named vector of length 12.

17.10.3 Adding to a metadata object

Before we add samplemeans as a new column to metadata, let's create a vector with the ages of each of the mice in our data set.

# Create a numeric vector with ages. Note that there are 12 elements here

age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)

Now, we are ready to combine the metadata data frame with the 2 new vectors to create a new data frame with 5 columns:

# Add the new vectors as the last columns to the metadata

new_metadata <- data.frame(metadata, samplemeans, age_in_days)

# Take a look at the new_metadata object
head(new_metadata)
genotype celltype replicate samplemeans age_in_days
sample1 Wt typeA 1 10.266102 40
sample2 Wt typeA 2 10.849759 32
sample3 Wt typeA 3 9.452517 38
sample4 KO typeA 1 15.833872 35
sample5 KO typeA 2 15.590184 41
sample6 KO typeA 3 15.551529 32
names(new_metadata)
## [1] "genotype"    "celltype"    "replicate"   "samplemeans" "age_in_days"

17.11 12 Data visualization with ggplot2

.md file = 12-ggplot2.md

## load the new_metadata data frame into your environment from a .RData object, if you need to -
## set eval=TRUE
load("data/new_metadata.RData")
# this data frame should have 12 rows and 5 columns
dim(new_metadata)
## [1] 12  5
library(ggplot2)

The ggplot() function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot using additional functions one after the other and combine them using the + operator; the functions in the resulting code chunk are called layers.

1st layer: plot window

ggplot(new_metadata)  # what happens? we get a plot window

The geom (geometric) object is the layer that specifies what kind of plot we want to draw. A plot must have at least one geom; there is no upper limit. Examples include:

  • points (geom_point, geom_jitter for scatter plots, dot plots, etc)
  • lines (geom_line, for time series, trend lines, etc)
  • boxplot (geom_boxplot, for, well, boxplots!)

2nd layer: geometries

ggplot(new_metadata) + geom_point()  # note what happens here

17.11.1 Geometries

Geoms (e.g.) in layers: - geom_point() - geom_boxplot() - geom_bar() - geom_density() - geom_dotplot()

17.11.2 Aesthetics

We get an error because each type of geom usually has a required set of aesthetics. “Aesthetics” are set with the aes() function and can be set within geom_point() or within ggplot().

The aes() function can be used to specify many plot elements including the following:

  • position (i.e., on the x and y axes)
  • color (“outside” color)
  • fill (“inside” color)
  • shape (of points)
  • etc.

3rd layer: aesthetics

ggplot(new_metadata, aes(x = age_in_days, y = samplemeans)) + geom_point()  # what happens here? we initialize the plot window with the axes

Add color to aesthetics:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype))

Add shape to aesthetics:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype))

Add point size to geometry:

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
    size = 2.25)

The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:

  • Axis label aesthetics
  • Plot background
  • Facet label background
  • Legend appearance

5th layer: theme

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
    size = 3) + theme_bw()

Do the axis labels or the tick labels get any larger by changing themes?

6th layer: axes title size

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
    size = 2.25) + theme_bw() + theme(axis.title = element_text(size = rel(1.5)))

Exercise

  1. The current axis label text defaults to what we gave as input to geom_point (i.e the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled “Age (days)” and the y-axis is labeled “Mean expression”.

  2. Use the ggtitle layer to add a plot title of your choice.

  3. Add the following new layer to the code chunk theme(plot.title=element_text(hjust=0.5))

What does it change?

  • Ans:

How many theme() layers can be added to a ggplot code chunk, in your estimation?

  • Ans:

17.12 13 Boxplot exercise

.md file = 13-boxplot_exercise.md

  1. Generate a boxplot using the data in the new_metadata dataframe. Create a ggplot2 code chunk with the following instructions:
  • Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.

  • Use the fill aesthetic to look at differences in sample means between the celltypes within each genotype.

  • Add a title to your plot.

  • Add labels, ‘Genotype’ for the x-axis and ‘Mean expression’ for the y-axis.

  • Make the following theme() changes:

    1. Use the theme_bw() function to make the background white.
    2. Change the size of your axes labels to 1.25x larger than the default.
    3. Change the size of your plot title to 1.5x larger than default.
    4. Center the plot title.
  1. Change genotype order: Factor the new_metadata$genotype column without creating any extra variables/objects and change the levels to c("Wt", "KO")

  2. Re-run the boxplot code chunk you created in 1.

  3. Change default colors

Add a new layer scale_color_manual(values=c("purple","orange")). Do you observe a change?

  • Ans:

Replace scale_color_manual(values=c("purple","orange")) with scale_fill_manual(values=c("purple","orange")). Do you observe a change?

  • Ans:

In the scatterplot we drew in class, add a new layer scale_color_manual(values=c("purple","orange")), do you observe a difference?

  • Ans:

What do you think is the difference between scale_color_manual() and scale_fill_manual()?

  • Ans:

  • Boxplot using "color" instead of "fill". Use purple and orange for your colors.

  • Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.

Are there any colors that you tried that did not work?

  • Ans:
  1. OPTIONAL Exercise

Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.

# Hint: the gplots package. If gplots is not already installed, uncomment the following line of
# code

# BiocManager::install('gplots')

library(gplots)
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

17.13 14 Saving data and plots to file

.md file = 14-exporting_data_and_plots.md

17.13.1 Writing data to file

# Save a data frame to file
write.csv(sub_meta, file = "data/subset_meta.csv")

17.13.2 Exporting figures to file:

## Open device for writing
pdf(file = "figures/scatterplot.pdf")

ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
    size = rel(3))
## Closing the device is essential to save the temporary file created by pdf()/png()
dev.off()
## quartz_off_screen 
##                 2

17.14 15 Finding help

.md file = 15-finding_help.md

17.14.1 Saving variables

In order to find help on line you must provide a reproducible example. Google and R sites such as Stackoverflow are good sources. To provide an example you may want to share an object with someone else, you can save any or all R data structures that you have in your environment to a file:

save.image()  # saves all environmental variables in your current R session to a file called '.RData' in your working directory -- this is a command that you should use often when in an R session to save your work.

# OR give it a memorable name

save.image("intro-to-R.RData")

# OR for a single object

save.image(metadata, "metadata.RData")

# you can also use

saveRDS(metadata, file = "metadata.rds")

17.14.2 Loading data

load(file = ".RData")  # to load .RData files

readRDS(file = "metadata.rds")  # to load .rds files

17.14.3 sessionInfo()

Lists details about your current R session: R version and platform; locale and timezone; all packages and versions that are loaded as we saw before.

# you'll also often be required to provide your session info
sessionInfo()

17.15 16 Tidyverse data wrangling

.md file = 16-tidyverse.md

# If tidyverse is not already installed, uncomment the following line of code
# BiocManager::install('tidyverse')

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Tidyverse basics

17.15.1 Pipes: The pipe (%>%) allows the output of a previous command to be used as input to another command instead of using nested functions.

NOTE: Shortcut to write the pipe is shift + command + m on Macos; shift + ctrl + m on Windows

## A single command
sqrt(83)
## [1] 9.110434
## Base R method of running more than one command
round(sqrt(83), digits = 2)
## [1] 9.11
## Running more than one command with piping
sqrt(83) %>%
    round(digits = 2)
## [1] 9.11

Exercise

  1. Create a vector of random numbers using the code below:
random_numbers <- c(81, 90, 65, 43, 71, 29)
  1. Use the pipe (%>%) to perform two steps in a single line:

    1. Take the mean of random_numbers using the mean() function.

    2. Round the output to three digits using the round() function.

random_numbers %>%
    mean %>%
    round(digits = 3)
## [1] 63.167

17.15.2 Tibbles

Experimental data

The dataset: gprofiler_results_Mov10oe.tsv

The gene list represents the functional analysis results of differences between control mice and mice over-expressing a gene involved in RNA splicing. We will focus on gene ontology (GO) terms, which describe the roles of genes and gene products organized into three controlled vocabularies/ontologies (domains):

  • biological processes (BP)

  • cellular components (CC)

  • molecular functions (MF)

that are over-represented in our list of genes.

17.15.3 Analysis goal and workflow

Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values and create a plot.

1. Read in the functional analysis results

# Read in the functional analysis results
functional_GO_results <- read.delim(file = "data/gprofiler_results_Mov10oe.tsv")

# Take a look at the results
functional_GO_results[1:3, 1:12]
query.number significant p.value term.size query.size overlap.size recall precision term.id domain subgraph.number term.name
1 TRUE 0.00434 111 5850 52 0.009 0.468 GO:0032606 BP 237 type I interferon production
1 TRUE 0.00330 110 5850 52 0.009 0.473 GO:0032479 BP 237 regulation of type I interferon production
1 TRUE 0.02970 39 5850 21 0.004 0.538 GO:0032480 BP 237 negative regulation of type I interferon production
names(functional_GO_results)
##  [1] "query.number"    "significant"     "p.value"         "term.size"       "query.size"      "overlap.size"    "recall"         
##  [8] "precision"       "term.id"         "domain"          "subgraph.number" "term.name"       "relative.depth"  "intersection"
names(functional_GO_results)
##  [1] "query.number"    "significant"     "p.value"         "term.size"       "query.size"      "overlap.size"    "recall"         
##  [8] "precision"       "term.id"         "domain"          "subgraph.number" "term.name"       "relative.depth"  "intersection"

2. Extract only the GO biological processes (BP) of interest

# Return only GO biological processes
bp_oe <- functional_GO_results %>%
    dplyr::filter(domain == "BP")

bp_oe[1:3, 1:12]
query.number significant p.value term.size query.size overlap.size recall precision term.id domain subgraph.number term.name
1 TRUE 0.00434 111 5850 52 0.009 0.468 GO:0032606 BP 237 type I interferon production
1 TRUE 0.00330 110 5850 52 0.009 0.473 GO:0032479 BP 237 regulation of type I interferon production
1 TRUE 0.02970 39 5850 21 0.004 0.538 GO:0032480 BP 237 negative regulation of type I interferon production

Note: the dplyr::filter is necessary because the stats library gets loaded with ggplot2, rmarkdown and bookdown and stats also has a filter function

Exercise

We would like to perform an additional round of filtering to only keep the most specific GO terms.

  1. For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
  2. Save output to overwrite our bp_oe variable.
bp_oe <- bp_oe %>%
    dplyr::filter(relative.depth > 4)

17.15.4 Select

  1. Select only the colnames needed for visualization
# Selecting columns to keep for visualization
names(bp_oe)
##  [1] "query.number"    "significant"     "p.value"         "term.size"       "query.size"      "overlap.size"    "recall"         
##  [8] "precision"       "term.id"         "domain"          "subgraph.number" "term.name"       "relative.depth"  "intersection"
bp_oe <- bp_oe %>%
    select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)

head(bp_oe[, 1:6])
term.id term.name p.value query.size term.size overlap.size
GO:0090672 telomerase RNA localization 2.41e-02 5850 16 11
GO:0090670 RNA localization to Cajal body 2.41e-02 5850 16 11
GO:0090671 telomerase RNA localization to Cajal body 2.41e-02 5850 16 11
GO:0071702 organic substance transport 0.00e+00 5850 2629 973
GO:0015931 nucleobase-containing compound transport 4.43e-05 5850 200 93
GO:0050657 nucleic acid transport 3.70e-06 5850 166 83
dim(bp_oe)
## [1] 668   7

17.15.5 Arrange

Let’s sort the rows by adjusted p-value with the arrange() function.

# Order by adjusted p-value ascending
bp_oe <- bp_oe %>%
    arrange(p.value)

17.15.6 Rename

Let’s rename the term.id and term.name columns.

# Provide better names for columns
bp_oe <- bp_oe %>%
    dplyr::rename(GO_id = term.id, GO_term = term.name)

Exercise

Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.

bp_oe <- bp_oe %>%
    dplyr::rename(genes = intersection)

17.15.7 Mutate

Create additional metrics for plotting (e.g. gene ratios)

# Create gene ratio column based on other columns in dataset
bp_oe <- bp_oe %>%
    mutate(gene_ratio = overlap.size/query.size)

Exercise

Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)

bp_oe <- bp_oe %>%
    mutate(term_percent = overlap.size/term.size)

Homework

Use ggplot to plot the first 30 significant GO terms vs gene_ratio. Color the points by p.value as shown in the figure in below:

dotplot6

sub = bp_oe[1:30, ]

Change the p.value gradient colors to red-orange-yellow

# to help linearize the p.values for the colors; not essential for answer
sub$`-log10(p.value)` = -log10(sub$p.value)

See ggplot for more detail.