Chapter 17 Codebook
17.1 02 Introduction to R and RStudio
.md file = 02-introR-R-and-RStudio.md
17.1.1 Running commands and annotating code:
Add some additional code to the following chunk:

# Intro to R Lesson
3 + 5
## [1] 8
The chunk will run inline, but I prefer to set the chunk so it will run in the console, because it gives us extra room, especially for plotting. You can set this using the cog wheel at the top of the script window next to the Knit button.
17.1.2 The assignment operator (<-)
The assignment operator assigns values on the right to variables on the left.
In RStudio the keyboard shortcut for the assignment operator <- is Alt + - (Windows) or Option + - (Mac).
x <- 3
y <- 5
x + y
## [1] 8
number <- x + y
Exercise
- Try changing the value of the variable x to 5. What happens to number?
x <- 5
number
## [1] 8
number <- x + y
number
## [1] 10
- Ans:
- Now try changing the value of variable y to contain the value 10. What do you need to do to update the variable number?
- Ans:
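A sketch of one possible answer to the second question, assuming the `x` and `y` variables from above: `number` does not update on its own, so after changing `y` you must rerun the assignment.

```r
x <- 5           # value of x from the previous exercise
y <- 10          # change y to contain 10
number <- x + y  # number must be reassigned to pick up the new value of y
number
## [1] 15
```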
17.2 03 R syntax and data structures
.md file = 03-introR-syntax-and-data-structures.md
17.2.1 Vectors
# Create a numeric vector and store the vector as a variable called 'glengths'
glengths <- c(4.6, 3000, 50000)
glengths
## [1] 4.6 3000.0 50000.0
# Create a character vector and store the vector as a variable called 'species'
species <- c("ecoli", "human", "corn")
species
## [1] "ecoli" "human" "corn"
Exercise
Try to create a vector of numeric and character values by combining the two vectors that we just created (glengths and species). Assign this combined vector to a new variable called combined. Hint: you will need to use the combine c() function to do this. Print the combined vector in the console; what looks different compared to the original vectors?
combined <- c(glengths, species)
combined
## [1] "4.6" "3000" "50000" "ecoli" "human" "corn"
- Ans:
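As a pointer for the question above: a vector can hold only one data type, so R coerces the numeric values to character strings (note the quotes in the printed output). A quick check, as a sketch:

```r
glengths <- c(4.6, 3000, 50000)
species <- c("ecoli", "human", "corn")
combined <- c(glengths, species)
class(combined)  # the numeric values have been coerced to character
## [1] "character"
```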
17.2.2 Factors
# Create a character vector and store the vector as a variable called 'expression'
expression <- c("low", "high", "medium", "high", "low", "medium", "high")
expression
## [1] "low" "high" "medium" "high" "low" "medium" "high"
# Turn 'expression' vector into a factor
expression <- factor(expression)
expression
## [1] low high medium high low medium high
## Levels: high low medium
Exercises
Let’s say that in our experimental analyses, we are working with three different sets of cells: normal, cells knocked out for geneA (a very exciting gene), and cells overexpressing geneA. We have three replicates for each celltype.
Create a vector named samplegroup with nine elements: 3 control (“CTL”) values, 3 knock-out (“KO”) values, and 3 over-expressing (“OE”) values.
Turn samplegroup into a factor data structure.
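One possible solution sketch; `rep()` repeats each value so you do not have to type it three times:

```r
# 3 control, 3 knock-out and 3 over-expressing replicates
samplegroup <- c(rep("CTL", 3), rep("KO", 3), rep("OE", 3))

# turn samplegroup into a factor
samplegroup <- factor(samplegroup)
```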
17.2.3 Dataframes
A data.frame is the de facto data structure for most tabular data and what we use for statistics and plotting. A data.frame is similar to a matrix in that it’s a collection of vectors of the same length and each vector represents a column. However, in a dataframe each vector can be of a different data type (e.g., characters, integers, factors).
# Create a data frame and store it as a variable called 'df'
df <- data.frame(species, glengths)
df

| species | glengths |
|---------|----------|
| ecoli   | 4.6      |
| human   | 3000.0   |
| corn    | 50000.0  |
Exercise
Create a data frame called favorite_books with the following vectors as columns:
titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)
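With those two vectors defined, the data frame itself is a single call to data.frame() (a sketch of one possible answer):

```r
titles <- c("Catch-22", "Pride and Prejudice", "Nineteen Eighty Four")
pages <- c(453, 432, 328)

# combine the two vectors as columns of a data frame
favorite_books <- data.frame(titles, pages)
favorite_books
```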
17.3 04 Functions in R
.md file = 04-introR-functions-and-arguments.md
17.3.1 Basic functions
We have already used a few examples of basic functions in the previous lessons, e.g. c() and factor().
Many of the base functions in R involve mathematical operations. One example would be the function sqrt(). The input/argument must be a number, and the output is the square root of that number. Let's try finding the square root of 81:
sqrt(81)
## [1] 9
sqrt(glengths)
## [1] 2.144761 54.772256 223.606798
round(3.14159)
## [1] 3
17.3.2 Seeking help on functions
?round
args(round)
## function (x, digits = 0)
## NULL
example("round")
##
## round> round(.5 + -2:4) # IEEE / IEC rounding: -2 0 0 2 2 4 4
## [1] -2 0 0 2 2 4 4
##
## round> ## (this is *good* behaviour -- do *NOT* report it as bug !)
## round>
## round> ( x1 <- seq(-2, 4, by = .5) )
## [1] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
##
## round> round(x1) #-- IEEE / IEC rounding !
## [1] -2 -2 -1 0 0 0 1 2 2 2 3 4 4
##
## round> x1[trunc(x1) != floor(x1)]
## [1] -1.5 -0.5
##
## round> x1[round(x1) != floor(x1 + .5)]
## [1] -1.5 0.5 2.5
##
## round> (non.int <- ceiling(x1) != floor(x1))
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
##
## round> x2 <- pi * 100^(-1:3)
##
## round> round(x2, 3)
## [1] 0.031 3.142 314.159 31415.927 3141592.654
##
## round> signif(x2, 3)
## [1] 3.14e-02 3.14e+00 3.14e+02 3.14e+04 3.14e+06
round(3.14159, digits = 2)
## [1] 3.14
Exercise
- Let's use a base R function to calculate the mean value of the glengths vector. You might need to search online to find which function can perform this task.
mean(glengths)
## [1] 17668.2
- Create a new vector test <- c(1, NA, 2, 3, NA, 4). Use the same base R function from exercise 1 (with the addition of the proper argument) and calculate the mean value of the test vector. The output should be 2.5.

> NOTE: In R, missing values are represented by the symbol NA (not available). It's a way to make sure that users know they have missing data and make a conscious decision on how to deal with it. There are ways to ignore NA during statistical calculations, or to remove NA from the vector. If you want more information related to missing data or NA, you can go to the NA help page (please note that there are many advanced concepts on that page that have not been covered in class).
test <- c(1, NA, 2, 3, NA, 4)
mean(test, na.rm = TRUE)
## [1] 2.5
- Another commonly used base function is sort(). Use this function to sort the glengths vector in descending order.
sort(glengths, decreasing = TRUE)
## [1] 50000.0  3000.0     4.6
17.3.3 User-defined Functions
Let's try creating a simple example function. This function will take in a numeric value as input, and return the squared value.
square_it <- function(x) {
    square <- x * x
    return(square)
}
Once you run the code, you should see a function named square_it in the Environment panel (located at the top right of the RStudio interface). Now we can use this function like any other base R function. We type out the name of the function, and inside the parentheses we provide a numeric value x:
square_it(5)
Pretty simple, right? In this case, we only had one line of code that was run, but in theory you could have many lines of code to obtain the final result that you want to "return" to the user. We have only scratched the surface here when it comes to creating functions! If you are interested, you can find more detailed information on writing functions on the R-bloggers site.
Exercise
- Write a function called multiply_it, which takes two inputs: a numeric value x and a numeric value y. The function will return the product of these two numeric values, which is x * y. For example, multiply_it(x = 4, y = 6) will return the output 24.
multiply_it <- function(x, y) {
    product <- x * y
    return(product)
}
multiply_it(x = 4, y = 6)
## [1] 24
17.4 05 Packages and libraries
.md file = 05-introR_packages.md
17.4.1 Packages and Libraries
You can check what libraries (packages) are loaded in your current R session by typing into the console:
sessionInfo() #Print version information about R, the OS and attached or loaded packages
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Monterey 12.6.5
##
## Matrix products: default
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rmarkdown_2.20 knitr_1.42 bookdown_0.32
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.31 R6_2.5.1 jsonlite_1.8.4 formatR_1.14 evaluate_0.20 cachem_1.0.6 rlang_1.0.6
## [8] cli_3.6.0 rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.2 tools_4.2.1 xfun_0.36 yaml_2.3.7
## [15] fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.4 sass_0.4.5
# OR
search() #Gives a list of attached packages
## [1] ".GlobalEnv" "package:rmarkdown" "package:knitr" "package:bookdown" "tools:rstudio" "package:stats"
## [7] "package:graphics" "package:grDevices" "package:utils" "package:datasets" "package:methods" "Autoloads"
## [13] "package:base"
17.4.2 Installing packages
17.4.2.1 Packages from CRAN:
Previously we have introduced you to installing packages. An example is given below for the ggplot2 package, which will be required for some plots we will create later on. Run this code to install ggplot2.
# Set eval = TRUE if you haven't installed ggplot2 yet
install.packages("ggplot2")
Alternatively, packages can also be installed from Bioconductor, a repository of packages which provides tools for the analysis and comprehension of high-throughput genomic data. To install from Bioconductor you first need the BiocManager package, which itself comes from CRAN:
# Set eval = TRUE if you haven't installed BiocManager yet
install.packages("BiocManager")
17.4.2.2 Packages from Bioconductor:
Now you can use BiocManager::install to install packages available on Bioconductor. Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.
# DO NOT RUN THIS!
BiocManager::install("ggplot2")
17.4.3 Loading libraries/packages
library(ggplot2)
To see the functions in a package you can also type:
ls("package:ggplot2")
## [1] "%+%" "%+replace%" "aes" "aes_"
## [5] "aes_all" "aes_auto" "aes_q" "aes_string"
## [9] "after_scale" "after_stat" "alpha" "annotate"
## [13] "annotation_custom" "annotation_logticks" "annotation_map" "annotation_raster"
## [17] "arrow" "as_label" "as_labeller" "autolayer"
## [21] "autoplot" "AxisSecondary" "benchplot" "binned_scale"
## [25] "borders" "calc_element" "combine_vars" "continuous_scale"
## [29] "Coord" "coord_cartesian" "coord_equal" "coord_fixed"
## [33] "coord_flip" "coord_map" "coord_munch" "coord_polar"
## [37] "coord_quickmap" "coord_sf" "coord_trans" "CoordCartesian"
## [41] "CoordFixed" "CoordFlip" "CoordMap" "CoordPolar"
## [45] "CoordQuickmap" "CoordSf" "CoordTrans" "cut_interval"
## [49] "cut_number" "cut_width" "derive" "diamonds"
## [53] "discrete_scale" "draw_key_abline" "draw_key_blank" "draw_key_boxplot"
## [57] "draw_key_crossbar" "draw_key_dotplot" "draw_key_label" "draw_key_linerange"
## [61] "draw_key_path" "draw_key_point" "draw_key_pointrange" "draw_key_polygon"
## [65] "draw_key_rect" "draw_key_smooth" "draw_key_text" "draw_key_timeseries"
## [69] "draw_key_vline" "draw_key_vpath" "dup_axis" "economics"
## [73] "economics_long" "el_def" "element_blank" "element_grob"
## [77] "element_line" "element_rect" "element_render" "element_text"
## [81] "enexpr" "enexprs" "enquo" "enquos"
## [85] "ensym" "ensyms" "expand_limits" "expand_scale"
## [89] "expansion" "expr" "Facet" "facet_grid"
## [93] "facet_null" "facet_wrap" "FacetGrid" "FacetNull"
## [97] "FacetWrap" "faithfuld" "find_panel" "flip_data"
## [101] "flipped_names" "fortify" "Geom" "geom_abline"
## [105] "geom_area" "geom_bar" "geom_bin_2d" "geom_bin2d"
## [109] "geom_blank" "geom_boxplot" "geom_col" "geom_contour"
## [113] "geom_contour_filled" "geom_count" "geom_crossbar" "geom_curve"
## [117] "geom_density" "geom_density_2d" "geom_density_2d_filled" "geom_density2d"
## [121] "geom_density2d_filled" "geom_dotplot" "geom_errorbar" "geom_errorbarh"
## [125] "geom_freqpoly" "geom_function" "geom_hex" "geom_histogram"
## [129] "geom_hline" "geom_jitter" "geom_label" "geom_line"
## [133] "geom_linerange" "geom_map" "geom_path" "geom_point"
## [137] "geom_pointrange" "geom_polygon" "geom_qq" "geom_qq_line"
## [141] "geom_quantile" "geom_raster" "geom_rect" "geom_ribbon"
## [145] "geom_rug" "geom_segment" "geom_sf" "geom_sf_label"
## [149] "geom_sf_text" "geom_smooth" "geom_spoke" "geom_step"
## [153] "geom_text" "geom_tile" "geom_violin" "geom_vline"
## [157] "GeomAbline" "GeomAnnotationMap" "GeomArea" "GeomBar"
## [161] "GeomBlank" "GeomBoxplot" "GeomCol" "GeomContour"
## [165] "GeomContourFilled" "GeomCrossbar" "GeomCurve" "GeomCustomAnn"
## [169] "GeomDensity" "GeomDensity2d" "GeomDensity2dFilled" "GeomDotplot"
## [173] "GeomErrorbar" "GeomErrorbarh" "GeomFunction" "GeomHex"
## [177] "GeomHline" "GeomLabel" "GeomLine" "GeomLinerange"
## [181] "GeomLogticks" "GeomMap" "GeomPath" "GeomPoint"
## [185] "GeomPointrange" "GeomPolygon" "GeomQuantile" "GeomRaster"
## [189] "GeomRasterAnn" "GeomRect" "GeomRibbon" "GeomRug"
## [193] "GeomSegment" "GeomSf" "GeomSmooth" "GeomSpoke"
## [197] "GeomStep" "GeomText" "GeomTile" "GeomViolin"
## [201] "GeomVline" "get_alt_text" "get_element_tree" "gg_dep"
## [205] "ggplot" "ggplot_add" "ggplot_build" "ggplot_gtable"
## [209] "ggplotGrob" "ggproto" "ggproto_parent" "ggsave"
## [213] "ggtitle" "guide_axis" "guide_bins" "guide_colorbar"
## [217] "guide_colorsteps" "guide_colourbar" "guide_coloursteps" "guide_gengrob"
## [221] "guide_geom" "guide_legend" "guide_merge" "guide_none"
## [225] "guide_train" "guide_transform" "guides" "has_flipped_aes"
## [229] "is.Coord" "is.facet" "is.ggplot" "is.ggproto"
## [233] "is.theme" "label_both" "label_bquote" "label_context"
## [237] "label_parsed" "label_value" "label_wrap_gen" "labeller"
## [241] "labs" "last_plot" "layer" "layer_data"
## [245] "layer_grob" "layer_scales" "layer_sf" "Layout"
## [249] "lims" "luv_colours" "map_data" "margin"
## [253] "max_height" "max_width" "mean_cl_boot" "mean_cl_normal"
## [257] "mean_sdl" "mean_se" "median_hilow" "merge_element"
## [261] "midwest" "mpg" "msleep" "panel_cols"
## [265] "panel_rows" "Position" "position_dodge" "position_dodge2"
## [269] "position_fill" "position_identity" "position_jitter" "position_jitterdodge"
## [273] "position_nudge" "position_stack" "PositionDodge" "PositionDodge2"
## [277] "PositionFill" "PositionIdentity" "PositionJitter" "PositionJitterdodge"
## [281] "PositionNudge" "PositionStack" "presidential" "qplot"
## [285] "quickplot" "quo" "quo_name" "quos"
## [289] "register_theme_elements" "rel" "remove_missing" "render_axes"
## [293] "render_strips" "reset_theme_settings" "resolution" "Scale"
## [297] "scale_alpha" "scale_alpha_binned" "scale_alpha_continuous" "scale_alpha_date"
## [301] "scale_alpha_datetime" "scale_alpha_discrete" "scale_alpha_identity" "scale_alpha_manual"
## [305] "scale_alpha_ordinal" "scale_color_binned" "scale_color_brewer" "scale_color_continuous"
## [309] "scale_color_date" "scale_color_datetime" "scale_color_discrete" "scale_color_distiller"
## [313] "scale_color_fermenter" "scale_color_gradient" "scale_color_gradient2" "scale_color_gradientn"
## [317] "scale_color_grey" "scale_color_hue" "scale_color_identity" "scale_color_manual"
## [321] "scale_color_ordinal" "scale_color_steps" "scale_color_steps2" "scale_color_stepsn"
## [325] "scale_color_viridis_b" "scale_color_viridis_c" "scale_color_viridis_d" "scale_colour_binned"
## [329] "scale_colour_brewer" "scale_colour_continuous" "scale_colour_date" "scale_colour_datetime"
## [333] "scale_colour_discrete" "scale_colour_distiller" "scale_colour_fermenter" "scale_colour_gradient"
## [337] "scale_colour_gradient2" "scale_colour_gradientn" "scale_colour_grey" "scale_colour_hue"
## [341] "scale_colour_identity" "scale_colour_manual" "scale_colour_ordinal" "scale_colour_steps"
## [345] "scale_colour_steps2" "scale_colour_stepsn" "scale_colour_viridis_b" "scale_colour_viridis_c"
## [349] "scale_colour_viridis_d" "scale_continuous_identity" "scale_discrete_identity" "scale_discrete_manual"
## [353] "scale_fill_binned" "scale_fill_brewer" "scale_fill_continuous" "scale_fill_date"
## [357] "scale_fill_datetime" "scale_fill_discrete" "scale_fill_distiller" "scale_fill_fermenter"
## [361] "scale_fill_gradient" "scale_fill_gradient2" "scale_fill_gradientn" "scale_fill_grey"
## [365] "scale_fill_hue" "scale_fill_identity" "scale_fill_manual" "scale_fill_ordinal"
## [369] "scale_fill_steps" "scale_fill_steps2" "scale_fill_stepsn" "scale_fill_viridis_b"
## [373] "scale_fill_viridis_c" "scale_fill_viridis_d" "scale_linetype" "scale_linetype_binned"
## [377] "scale_linetype_continuous" "scale_linetype_discrete" "scale_linetype_identity" "scale_linetype_manual"
## [381] "scale_linewidth" "scale_linewidth_binned" "scale_linewidth_continuous" "scale_linewidth_date"
## [385] "scale_linewidth_datetime" "scale_linewidth_discrete" "scale_linewidth_ordinal" "scale_radius"
## [389] "scale_shape" "scale_shape_binned" "scale_shape_continuous" "scale_shape_discrete"
## [393] "scale_shape_identity" "scale_shape_manual" "scale_shape_ordinal" "scale_size"
## [397] "scale_size_area" "scale_size_binned" "scale_size_binned_area" "scale_size_continuous"
## [401] "scale_size_date" "scale_size_datetime" "scale_size_discrete" "scale_size_identity"
## [405] "scale_size_manual" "scale_size_ordinal" "scale_type" "scale_x_binned"
## [409] "scale_x_continuous" "scale_x_date" "scale_x_datetime" "scale_x_discrete"
## [413] "scale_x_log10" "scale_x_reverse" "scale_x_sqrt" "scale_x_time"
## [417] "scale_y_binned" "scale_y_continuous" "scale_y_date" "scale_y_datetime"
## [421] "scale_y_discrete" "scale_y_log10" "scale_y_reverse" "scale_y_sqrt"
## [425] "scale_y_time" "ScaleBinned" "ScaleBinnedPosition" "ScaleContinuous"
## [429] "ScaleContinuousDate" "ScaleContinuousDatetime" "ScaleContinuousIdentity" "ScaleContinuousPosition"
## [433] "ScaleDiscrete" "ScaleDiscreteIdentity" "ScaleDiscretePosition" "seals"
## [437] "sec_axis" "set_last_plot" "sf_transform_xy" "should_stop"
## [441] "stage" "standardise_aes_names" "stat" "Stat"
## [445] "stat_align" "stat_bin" "stat_bin_2d" "stat_bin_hex"
## [449] "stat_bin2d" "stat_binhex" "stat_boxplot" "stat_contour"
## [453] "stat_contour_filled" "stat_count" "stat_density" "stat_density_2d"
## [457] "stat_density_2d_filled" "stat_density2d" "stat_density2d_filled" "stat_ecdf"
## [461] "stat_ellipse" "stat_function" "stat_identity" "stat_qq"
## [465] "stat_qq_line" "stat_quantile" "stat_sf" "stat_sf_coordinates"
## [469] "stat_smooth" "stat_spoke" "stat_sum" "stat_summary"
## [473] "stat_summary_2d" "stat_summary_bin" "stat_summary_hex" "stat_summary2d"
## [477] "stat_unique" "stat_ydensity" "StatAlign" "StatBin"
## [481] "StatBin2d" "StatBindot" "StatBinhex" "StatBoxplot"
## [485] "StatContour" "StatContourFilled" "StatCount" "StatDensity"
## [489] "StatDensity2d" "StatDensity2dFilled" "StatEcdf" "StatEllipse"
## [493] "StatFunction" "StatIdentity" "StatQq" "StatQqLine"
## [497] "StatQuantile" "StatSf" "StatSfCoordinates" "StatSmooth"
## [501] "StatSum" "StatSummary" "StatSummary2d" "StatSummaryBin"
## [505] "StatSummaryHex" "StatUnique" "StatYdensity" "summarise_coord"
## [509] "summarise_layers" "summarise_layout" "sym" "syms"
## [513] "theme" "theme_bw" "theme_classic" "theme_dark"
## [517] "theme_get" "theme_gray" "theme_grey" "theme_light"
## [521] "theme_linedraw" "theme_minimal" "theme_replace" "theme_set"
## [525] "theme_test" "theme_update" "theme_void" "transform_position"
## [529] "txhousing" "unit" "update_geom_defaults" "update_labels"
## [533] "update_stat_defaults" "vars" "waiver" "wrap_dims"
## [537] "xlab" "xlim" "ylab" "ylim"
## [541] "zeroGrob"
Exercise
The tidyverse suite of integrated packages was designed to work together to make common data science operations more user-friendly. We will be using the tidyverse suite in later lessons, so let's install it. Use the function from the BiocManager package.

Set eval = TRUE if tidyverse is not yet installed. How can you run the BiocManager::install() command for a package that is already installed without getting an error?
- Ans:
17.5 06 Subsetting: vectors and factors
.md file = 06-introR-data wrangling.md
17.5.1 Vectors: Selecting using indices
age <- c(15, 22, 45, 52, 73, 81)
age
## [1] 15 22 45 52 73 81
age[5]
## [1] 73
Reverse selection:
age[-5]
## [1] 15 22 45 52 81
age[c(3, 5, 6)]  ## nested
## [1] 45 73 81
# OR
## create a vector first then select
idx <- c(3, 5, 6)  # create vector of the elements of interest
age[idx]
## [1] 45 73 81
age[1:4]
## [1] 15 22 45 52
Exercises
Create a vector called alphabets with the following letters, C, D, X, L, F.
Use the associated indices along with [ ] to do the following:
only display C, D and F
display all except X
display the letters in the opposite order (F, L, X, D, C). Bonus points for using a function instead of the indices (hint: use Google).
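A sketch of one possible solution; `rev()` is one function that handles the bonus:

```r
alphabets <- c("C", "D", "X", "L", "F")
alphabets[c(1, 2, 5)]  # only display C, D and F
alphabets[-3]          # all except X
rev(alphabets)         # opposite order: F, L, X, D, C
```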
17.5.2 Selecting using logical operators
age > 50
## [1] FALSE FALSE FALSE TRUE TRUE TRUE
age > 50 | age < 18
## [1] TRUE FALSE FALSE TRUE TRUE TRUE
age
## [1] 15 22 45 52 73 81
age[age > 50 | age < 18]  ## nested
## [1] 15 52 73 81
# OR
## create a vector first then select
idx <- age > 50 | age < 18
idx
## [1] TRUE FALSE FALSE TRUE TRUE TRUE
age[idx]
## [1] 15 52 73 81
17.5.3 Logical operators with which()
which(age > 50 | age < 18)
## [1] 1 4 5 6
age[which(age > 50 | age < 18)]  ## nested
## [1] 15 52 73 81
# OR
## create a vector first then select
idx_num <- which(age > 50 | age < 18)
age[idx_num]
## [1] 15 52 73 81
17.5.4 Factors
expression <- c("low", "high", "medium", "high", "low", "medium", "high")

expression <- factor(expression)
str(expression)
## Factor w/ 3 levels "high","low","medium": 2 1 3 1 2 3 1
expression[expression == "high"]  ## This will only return those elements in the factor equal to 'high'
## [1] high high high
## Levels: high low medium
Exercise
Extract only those elements in samplegroup that are not KO (nesting the logical operation is optional).
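One possible sketch, assuming the `samplegroup` factor created in the earlier exercise; `!=` builds the logical vector and the subset drops the KO elements:

```r
samplegroup <- factor(c(rep("CTL", 3), rep("KO", 3), rep("OE", 3)))
samplegroup[samplegroup != "KO"]  # keeps the CTL and OE elements
```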
17.5.4.1 Releveling factors
expression <- factor(expression, levels = c("low", "medium", "high"))  # you can re-factor a factor
str(expression)
## Factor w/ 3 levels "low","medium",..: 1 3 2 3 1 2 3
Exercise
Use the samplegroup factor we created in a previous lesson, and relevel it such that KO is the first level followed by CTL and OE.
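A sketch of one way to relevel, assuming `samplegroup` exists in your environment (recreated here so the chunk is self-contained):

```r
samplegroup <- factor(c(rep("CTL", 3), rep("KO", 3), rep("OE", 3)))

# reorder the levels: KO first, then CTL, then OE
samplegroup <- factor(samplegroup, levels = c("KO", "CTL", "OE"))
levels(samplegroup)
## [1] "KO"  "CTL" "OE"
```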
It's always a good idea to periodically save your work. You can do this with the command save.image():
save.image() # saves all environmental variables in your current R session to a file called '.RData' in your working directory -- this is a command that you should use often when in an R session to save your work.
# OR give it a memorable name
save.image("intro-to-R.RData")
17.6 07 Reading and data inspection
.md file = 07-reading_and_data_inspection.md
17.6.1 Reading data into R:
?read.csv

metadata <- read.csv(file = "data/mouse_exp_design.csv")
We can see if it has successfully been read in by running:
metadata
|          | genotype | celltype | replicate |
|----------|----------|----------|-----------|
| sample1  | Wt       | typeA    | 1         |
| sample2  | Wt       | typeA    | 2         |
| sample3  | Wt       | typeA    | 3         |
| sample4  | KO       | typeA    | 1         |
| sample5  | KO       | typeA    | 2         |
| sample6  | KO       | typeA    | 3         |
| sample7  | Wt       | typeB    | 1         |
| sample8  | Wt       | typeB    | 2         |
| sample9  | Wt       | typeB    | 3         |
| sample10 | KO       | typeB    | 1         |
| sample11 | KO       | typeB    | 2         |
| sample12 | KO       | typeB    | 3         |
Exercise 1
- Read "project-summary.txt" into R using read.table() with the appropriate arguments and store it as the variable proj_summary. To figure out the appropriate arguments to use with read.table(), keep the following in mind:
all the columns in the input text file have column name/headers
you want the first column of the text file to be used as row names
hint: look up the input for the row.names = argument in read.table()
- Display the contents of proj_summary in your console
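A sketch of one possible call, assuming the file sits in your data/ directory; header = TRUE uses the first line as column names and row.names = 1 uses the first column as row names:

```r
proj_summary <- read.table(file = "data/project-summary.txt",
                           header = TRUE,  # the columns have headers
                           row.names = 1)  # first column becomes the row names
proj_summary
```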
17.6.2 Inspecting data structures:
Getting info on environmental variables:
Let's use the metadata data frame that we created to test out data inspection functions.
head(metadata)
|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample1 | Wt       | typeA    | 1         |
| sample2 | Wt       | typeA    | 2         |
| sample3 | Wt       | typeA    | 3         |
| sample4 | KO       | typeA    | 1         |
| sample5 | KO       | typeA    | 2         |
| sample6 | KO       | typeA    | 3         |
str(metadata)
## 'data.frame': 12 obs. of 3 variables:
## $ genotype : chr "Wt" "Wt" "Wt" "KO" ...
## $ celltype : chr "typeA" "typeA" "typeA" "typeA" ...
## $ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
dim(metadata)
## [1] 12 3
nrow(metadata)
## [1] 12
ncol(metadata)
## [1] 3
class(metadata)
## [1] "data.frame"
colnames(metadata)
## [1] "genotype" "celltype" "replicate"
Exercise 2
- What is the class of each column in metadata (use one command)?
str(metadata)
## 'data.frame': 12 obs. of 3 variables:
## $ genotype : chr "Wt" "Wt" "Wt" "KO" ...
## $ celltype : chr "typeA" "typeA" "typeA" "typeA" ...
## $ replicate: int 1 2 3 1 2 3 1 2 3 1 ...
- What is the median of the replicates in metadata (use one command)?
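One possible one-command answer, a sketch using the metadata data frame from above:

```r
# assumes the metadata data frame loaded earlier is in your environment
median(metadata$replicate)  # replicate is numeric, so median() applies directly
## [1] 2
```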
Exercise 3
Use the class() function on glengths and metadata; how does the output differ between the two?
class(glengths)
## [1] "numeric"
class(metadata)
## [1] "data.frame"
Use the summary() function on the proj_summary data frame; what is the median "Intronic_Rate"?
- Ans:
How long is the samplegroup factor?
- Ans:
What are the dimensions of the proj_summary dataframe?
When you use the rownames() function on metadata, what is the data structure of the output?
- Ans:
How many elements in (how long is) the output of colnames(proj_summary)? Don’t count, but use another function to determine this.
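A sketch of one possible answer, assuming proj_summary was read in earlier; length() on the output of colnames() gives the count without counting by hand:

```r
# assumes proj_summary was created with read.table() earlier
length(colnames(proj_summary))
# ncol(proj_summary) returns the same count
```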
17.7 08 Data frames and matrices
.md file = 08-introR-_data_wrangling2.md
17.7.1 Data wrangling: dataframes
Dataframes (and matrices) have 2 dimensions (rows and columns), so if we want to select some specific data from it we need to specify the “coordinates” we want from it. We use the same square bracket notation but rather than providing a single index, there are two indices required. Within the square bracket, row numbers come first followed by column numbers (and the two are separated by a comma).
Extracting values from data.frames:
# Extract value 'Wt'
metadata[1, 1]
## [1] "Wt"
# Extract third row
metadata[3, ]

|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample3 | Wt       | typeA    | 3         |
# Extract third column
metadata[, 3]
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
# Extract third column as a data frame
metadata[, 3, drop = FALSE]

|          | replicate |
|----------|-----------|
| sample1  | 1         |
| sample2  | 2         |
| sample3  | 3         |
| sample4  | 1         |
| sample5  | 2         |
| sample6  | 3         |
| sample7  | 1         |
| sample8  | 2         |
| sample9  | 3         |
| sample10 | 1         |
| sample11 | 2         |
| sample12 | 3         |
# Dataframe containing first two columns
metadata[, 1:2]

|          | genotype | celltype |
|----------|----------|----------|
| sample1  | Wt       | typeA    |
| sample2  | Wt       | typeA    |
| sample3  | Wt       | typeA    |
| sample4  | KO       | typeA    |
| sample5  | KO       | typeA    |
| sample6  | KO       | typeA    |
| sample7  | Wt       | typeB    |
| sample8  | Wt       | typeB    |
| sample9  | Wt       | typeB    |
| sample10 | KO       | typeB    |
| sample11 | KO       | typeB    |
| sample12 | KO       | typeB    |
# Data frame containing first, third and sixth rows
metadata[c(1, 3, 6), ]

|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample1 | Wt       | typeA    | 1         |
| sample3 | Wt       | typeA    | 3         |
| sample6 | KO       | typeA    | 3         |
# Extract the celltype column for the first three samples
metadata[c("sample1", "sample2", "sample3"), "celltype"]
## [1] "typeA" "typeA" "typeA"
# Check column names of metadata data frame
colnames(metadata)
## [1] "genotype" "celltype" "replicate"
# Check row names of metadata data frame
rownames(metadata)
## [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6" "sample7" "sample8" "sample9" "sample10" "sample11"
## [12] "sample12"
# Extract the genotype column
metadata$genotype
## [1] "Wt" "Wt" "Wt" "KO" "KO" "KO" "Wt" "Wt" "Wt" "KO" "KO" "KO"
# Extract the first five values/elements of the genotype column
metadata$genotype[1:5]
## [1] "Wt" "Wt" "Wt" "KO" "KO"
Exercises
Return a data frame with only the genotype and replicate column values for sample2 and sample8.
Return the fourth and ninth values of the replicate column.
Extract the replicate column as a data frame.
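Possible solution sketches for the three exercises above, assuming the metadata data frame is loaded:

```r
# genotype and replicate values for sample2 and sample8
metadata[c("sample2", "sample8"), c("genotype", "replicate")]

# fourth and ninth values of the replicate column
metadata$replicate[c(4, 9)]

# the replicate column as a data frame (drop = FALSE keeps the two-dimensional structure)
metadata[, "replicate", drop = FALSE]
```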
17.7.1.1 Selecting using logical operators:
For example, if we want to return only those rows of the data frame with the celltype column having a value of typeA, we would perform two steps:
Identify which rows in the celltype column have a value of typeA.
Use those TRUE values to extract those rows from the data frame.
To do this we would extract the column of interest as a vector, with the first value corresponding to the first row, the second value corresponding to the second row, and so on. We use that vector in the logical expression. Here we are looking for values equal to typeA, so our logical expression would be:
metadata$celltype == "typeA"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
logical_idx <- metadata$celltype == "typeA"
If we want to keep all the rows that have celltype == "typeA": any row where logical_idx is TRUE will be kept, and those that are FALSE will be discarded.
metadata[logical_idx, ]
|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample1 | Wt       | typeA    | 1         |
| sample2 | Wt       | typeA    | 2         |
| sample3 | Wt       | typeA    | 3         |
| sample4 | KO       | typeA    | 1         |
| sample5 | KO       | typeA    | 2         |
| sample6 | KO       | typeA    | 3         |
idx <- which(metadata$celltype == "typeA")
metadata[idx, ]
|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample1 | Wt       | typeA    | 1         |
| sample2 | Wt       | typeA    | 2         |
| sample3 | Wt       | typeA    | 3         |
| sample4 | KO       | typeA    | 1         |
| sample5 | KO       | typeA    | 2         |
| sample6 | KO       | typeA    | 3         |
# OR you can use nesting
metadata[which(metadata$celltype == "typeA"), ]

|         | genotype | celltype | replicate |
|---------|----------|----------|-----------|
| sample1 | Wt       | typeA    | 1         |
| sample2 | Wt       | typeA    | 2         |
| sample3 | Wt       | typeA    | 3         |
| sample4 | KO       | typeA    | 1         |
| sample5 | KO       | typeA    | 2         |
| sample6 | KO       | typeA    | 3         |
Let’s try another subsetting. Extract the rows of the metadata data frame for only the replicates 2 and 3. First, let’s create the logical expression for the column of interest (replicate):
idx <- which(metadata$replicate > 1)
metadata[idx, ]
|          | genotype | celltype | replicate |
|----------|----------|----------|-----------|
| sample2  | Wt       | typeA    | 2         |
| sample3  | Wt       | typeA    | 3         |
| sample5  | KO       | typeA    | 2         |
| sample6  | KO       | typeA    | 3         |
| sample8  | Wt       | typeB    | 2         |
| sample9  | Wt       | typeB    | 3         |
| sample11 | KO       | typeB    | 2         |
| sample12 | KO       | typeB    | 3         |
This should return the indices for the rows in the replicate column within metadata that have a value of 2 or 3. Now, we can save those indices to a variable and use that variable to extract those corresponding rows from the metadata table. Alternatively, you could do this in one line:
sub_meta <- metadata[which(metadata$replicate > 1), ]
Exercise
Subset the metadata dataframe to return only the rows of data with a genotype of KO.
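One possible sketch, nesting which() inside the square brackets as in the examples above (assumes metadata is loaded):

```r
# keep only the rows where genotype is KO
metadata[which(metadata$genotype == "KO"), ]
```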
17.8 09 Logical operators for matching
.md file = 09-identifying-matching-elements.md
17.8.1 Logical operators to match elements
rpkm_data <- read.csv("data/counts.rpkm.csv")
head(rpkm_data)
|                    | sample2   | sample5    | sample7  | sample8   | sample9   | sample4   | sample6    | sample12   | sample3   | sample11   | sample10  | sample1    |
|--------------------|-----------|------------|----------|-----------|-----------|-----------|------------|------------|-----------|------------|-----------|------------|
| ENSMUSG00000000001 | 19.265000 | 23.7222000 | 2.611610 | 5.8495400 | 6.5126300 | 24.076700 | 20.8198000 | 26.9158000 | 20.889500 | 24.0465000 | 24.198100 | 19.7848000 |
| ENSMUSG00000000003 | 0.000000  | 0.0000000  | 0.000000 | 0.0000000 | 0.0000000 | 0.000000  | 0.0000000  | 0.0000000  | 0.000000  | 0.0000000  | 0.000000  | 0.0000000  |
| ENSMUSG00000000028 | 1.032290  | 0.8269540  | 1.134410 | 0.6987540 | 0.9251170 | 0.827891  | 1.1686300  | 0.6735630  | 0.892183  | 0.9753270  | 1.045920  | 0.9377920  |
| ENSMUSG00000000031 | 0.000000  | 0.0000000  | 0.000000 | 0.0298449 | 0.0597726 | 0.000000  | 0.0511932  | 0.0204382  | 0.000000  | 0.0000000  | 0.000000  | 0.0359631  |
| ENSMUSG00000000037 | 0.056033  | 0.0473238  | 0.000000 | 0.0685938 | 0.0494147 | 0.180883  | 0.1438840  | 0.0662324  | 0.146196  | 0.0206405  | 0.017004  | 0.1514170  |
| ENSMUSG00000000049 | 0.258134  | 1.0730200  | 0.252342 | 0.2970320 | 0.2082800 | 2.191720  | 1.6853800  | 0.1161970  | 0.421286  | 0.0634322  | 0.369550  | 0.2567330  |
colnames(rpkm_data)
## [1] "sample2" "sample5" "sample7" "sample8" "sample9" "sample4" "sample6" "sample12" "sample3" "sample11" "sample10"
## [12] "sample1"
rownames(metadata)
## [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6" "sample7" "sample8" "sample9" "sample10" "sample11"
## [12] "sample12"
It looks as if the sample names (header) in our data matrix are similar to the row names of our metadata file, but it's hard to tell since they are not in the same order.
ncol(rpkm_data)
## [1] 12
nrow(metadata)
## [1] 12
17.8.2 The `%in%` operator
The %in%
operator will take each element from vector1 as input, one at a time, and evaluate if the element is present in vector2. The two vectors do not have to be the same size. This operation will return a vector containing logical values to indicate whether or not there is a match. The new vector will be of the same length as vector1. Take a look at the example below:
A <- c(1, 3, 5, 7, 9, 11)   # odd numbers
B <- c(2, 4, 6, 8, 10, 12)  # even numbers
# test to see if each of the elements of A is in B
A %in% B
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
Let’s change a couple of numbers inside vector B to match vector A:
A <- c(1, 3, 5, 7, 9, 11)  # odd numbers
B <- c(2, 4, 6, 8, 1, 5)   # add some odd numbers in
# test to see if each of the elements of A is in B
A %in% B
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
intersection <- A %in% B
intersection
## [1] TRUE FALSE TRUE FALSE FALSE FALSE
A[intersection]
## [1] 1 5
any(A %in% B)
## [1] TRUE
all(A %in% B)
## [1] FALSE
Exercise
- Using the A and B vectors created above, evaluate each element in B to see if there is a match in A
B %in% A
## [1] FALSE FALSE FALSE FALSE TRUE TRUE
- Subset the B vector to only return those values that are also in A.
B[B %in% A]
## [1] 1 5
Suppose we had two vectors containing the same values. How can we check whether those values are in the same order in each vector? In this case, we can use the == operator to compare the elements at each position of the two vectors. Unlike the %in% operator, the == operator requires that the two vectors be of equal length.
A <- c(10, 20, 30, 40, 50)
B <- c(50, 40, 30, 20, 10)  # same numbers but backwards
# test to see if each element of A is in B
A %in% B
## [1] TRUE TRUE TRUE TRUE TRUE
# test to see if each element of A is in the same position in B
A == B
## [1] FALSE FALSE TRUE FALSE FALSE
# use all() to check if they are a perfect match
all(A == B)
## [1] FALSE
Let’s try this on our genomic data, and see whether we have metadata information for all samples in our expression data.
x <- rownames(metadata)
y <- colnames(rpkm_data)
all(x %in% y)
## [1] TRUE
Now we can replace x and y by their real values to get the same answer:
all(rownames(metadata) %in% colnames(rpkm_data))
## [1] TRUE
x == y
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
all(x == y)
## [1] FALSE
Looks like all of the samples are there, but they are not in the same order. We will learn different ways to reorder data in our next lesson.
Exercise
We have a list of 6 marker genes that we are very interested in. Our goal is to extract count data for these genes using the %in% operator from the rpkm_data data frame, instead of scrolling through rpkm_data and finding them manually.
First, let’s create a vector called important_genes with the Ensembl IDs of the 6 genes we are interested in:
important_genes <- c("ENSMUSG00000083700", "ENSMUSG00000080990", "ENSMUSG00000065619",
                     "ENSMUSG00000047945", "ENSMUSG00000081010", "ENSMUSG00000030970")
Use the %in% operator to determine if all of these genes are present in the row names of the rpkm_data data frame.
Extract the rows from rpkm_data that correspond to these 6 genes using [] and the %in% operator. Double check the row names to ensure that you are extracting the correct rows.
Bonus question: Extract the rows from rpkm_data that correspond to these 6 genes using [], but without using the %in% operator.
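A sketch of one way to work through this exercise; it assumes the rpkm_data and important_genes objects created above, and the variable names important_counts / important_counts2 are my own choices:

```r
# Check that all 6 genes are present among the row names
all(important_genes %in% rownames(rpkm_data))

# Extract the matching rows using [] and the %in% operator
important_counts <- rpkm_data[rownames(rpkm_data) %in% important_genes, ]
rownames(important_counts)  # double check that we got the right rows

# Bonus: without %in%, index the rows directly by name
important_counts2 <- rpkm_data[important_genes, ]
```

Note one subtle difference: indexing with `%in%` returns the rows in the order they appear in rpkm_data, while indexing by name returns them in the order of important_genes.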
17.9 10 Reordering to match datasets
.md file = 10-reordering-to-match-datasets.md
Exercise
We know how to reorder using indices, let’s try to use it to reorder the contents of one vector to match the contents of another. Let’s create the vectors first and second as detailed below:
first <- c("A", "B", "C", "D", "E")
second <- c("B", "D", "E", "A", "C")  # same letters but different order
How would you reorder the second vector to match first?
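Before reaching for a function, here is a manual sketch of the idea: find the position of each element of first inside second, then index second with those positions (the index vector here was worked out by eye):

```r
first <- c("A", "B", "C", "D", "E")
second <- c("B", "D", "E", "A", "C")

# "A" sits at position 4 of second, "B" at 1, "C" at 5, "D" at 2, "E" at 3
idx <- c(4, 1, 5, 2, 3)
second[idx]
## [1] "A" "B" "C" "D" "E"
```

Working these positions out by hand does not scale, which is exactly what the `match()` function below automates.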
17.9.1 The `match()` function
`match()` takes 2 arguments. The first argument is a vector of values in the order you want, while the second argument is the vector of values to be reordered such that it will match the first:
1st vector: values in the order you want
2nd vector: values to be reordered
The `match()` function returns the position of the matches (indices) with respect to the second vector, which can be used to re-order it so that it matches the order in the first vector. Let's use match() on the first and second vectors we created.
# Saving indices for how to reorder `second` to match `first`
reorder_idx <- match(first, second)
reorder_idx
## [1] 4 1 5 2 3
# Reordering the second vector to match the order of the first vector
second[reorder_idx]
## [1] "A" "B" "C" "D" "E"
# Reordering and saving the output to a variable
second_reordered <- second[reorder_idx]
Matching with vectors of different lengths:
first <- c("A", "B", "C", "D", "E")
second <- c("D", "B", "A")  # remove values
match(first, second)
## [1] 3 2 NA 1 NA
second[match(first, second)]
## [1] "A" "B" NA "D" NA
NOTE: Values that don't match return an NA value by default. You can specify what value should be assigned instead using the nomatch argument.
Example:
mtch <- match(first, second, nomatch = 0)
mtch
## [1] 3 2 0 1 0
second[mtch]
## [1] "A" "B" "D"
# Check row names of the metadata
rownames(metadata)
## [1] "sample1" "sample2" "sample3" "sample4" "sample5" "sample6" "sample7" "sample8" "sample9" "sample10" "sample11"
## [12] "sample12"
# Check the column names of the counts data
colnames(rpkm_data)
## [1] "sample2" "sample5" "sample7" "sample8" "sample9" "sample4" "sample6" "sample12" "sample3" "sample11" "sample10"
## [12] "sample1"
Reordering genomic data using the `match()` function:
genomic_idx <- match(rownames(metadata), colnames(rpkm_data))
genomic_idx
## [1] 12 1 9 6 2 7 3 4 5 11 10 8
# Reorder the counts data frame to have the sample names in the same order as the metadata data
# frame
rpkm_ordered <- rpkm_data[, genomic_idx]
# View the reordered counts
View(rpkm_ordered)
# I prefer to use `head` -- and set the output to a limited number of columns so that it fits in
# the console
head(rpkm_ordered[, 1:5])
sample1 | sample2 | sample3 | sample4 | sample5 | |
---|---|---|---|---|---|
ENSMUSG00000000001 | 19.7848000 | 19.265000 | 20.889500 | 24.076700 | 23.7222000 |
ENSMUSG00000000003 | 0.0000000 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 |
ENSMUSG00000000028 | 0.9377920 | 1.032290 | 0.892183 | 0.827891 | 0.8269540 |
ENSMUSG00000000031 | 0.0359631 | 0.000000 | 0.000000 | 0.000000 | 0.0000000 |
ENSMUSG00000000037 | 0.1514170 | 0.056033 | 0.146196 | 0.180883 | 0.0473238 |
ENSMUSG00000000049 | 0.2567330 | 0.258134 | 0.421286 | 2.191720 | 1.0730200 |
Now we can check whether the row names of metadata are in the same order as the column names of rpkm_ordered:
all(rownames(metadata) == colnames(rpkm_ordered))
## [1] TRUE
Exercises
After talking with your collaborator, it becomes clear that sample2 and sample9 were actually from a different mouse background than the other samples and should not be part of our analysis. Create a new variable called subset_rpkm that has these columns removed from the rpkm_ordered data frame.
Use the `match()` function to subset the metadata data frame so that the row names of the metadata data frame match the column names of the subset_rpkm data frame.
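One way to approach both steps, sketched; subset_rpkm is the name the exercise asks for, while idx and subset_metadata are hypothetical names of my own:

```r
# Remove the sample2 and sample9 columns from rpkm_ordered
subset_rpkm <- rpkm_ordered[, !(colnames(rpkm_ordered) %in% c("sample2", "sample9"))]

# Match the metadata row names to the remaining sample columns and subset
idx <- match(colnames(subset_rpkm), rownames(metadata))
subset_metadata <- metadata[idx, ]

# Sanity check: the names should now line up
all(rownames(subset_metadata) == colnames(subset_rpkm))
```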
17.10 11 Plotting and data visualization
.md file = 11-setting_up_to_plot.md
17.10.1 Setup a data frame for visualization
mean(rpkm_ordered$sample1)
## [1] 10.2661
But what we want is a vector of 12 values that we can add to the metadata data frame. What is the best way to do this?
To get the mean of all the samples in a single line of code, the `map()` family of functions is a good option.
17.10.2 The `map()` family of functions
We can use map functions to execute some task/function on every element in a vector, or every column in a dataframe, or every component of a list, and so on.
- `map()` creates a list.
- `map_lgl()` creates a logical vector.
- `map_int()` creates an integer vector.
- `map_dbl()` creates a "double" or numeric vector.
- `map_chr()` creates a character vector.
The syntax for the `map()` family of functions is: `map(object, function_to_apply)`
We can use `map_dbl()` to get the mean of each column of `rpkm_ordered` in just one command:
# Install purrr if you need to by uncommenting the following line:
# BiocManager::install("purrr")
library(purrr)  # Load the purrr package
samplemeans <- map_dbl(rpkm_ordered, mean)
The output of `map_dbl()` is a named vector of length 12.
17.10.3 Adding to a metadata object
Before we add `samplemeans` as a new column to metadata, let's create a vector with the ages of each of the mice in our data set.
# Create a numeric vector with ages. Note that there are 12 elements here
age_in_days <- c(40, 32, 38, 35, 41, 32, 34, 26, 28, 28, 30, 32)
Now, we are ready to combine the `metadata` data frame with the 2 new vectors to create a new data frame with 5 columns:
# Add the new vectors as the last columns to the metadata
new_metadata <- data.frame(metadata, samplemeans, age_in_days)
# Take a look at the new_metadata object
head(new_metadata)
genotype | celltype | replicate | samplemeans | age_in_days | |
---|---|---|---|---|---|
sample1 | Wt | typeA | 1 | 10.266102 | 40 |
sample2 | Wt | typeA | 2 | 10.849759 | 32 |
sample3 | Wt | typeA | 3 | 9.452517 | 38 |
sample4 | KO | typeA | 1 | 15.833872 | 35 |
sample5 | KO | typeA | 2 | 15.590184 | 41 |
sample6 | KO | typeA | 3 | 15.551529 | 32 |
names(new_metadata)
## [1] "genotype" "celltype" "replicate" "samplemeans" "age_in_days"
17.11 12 Data visualization with ggplot2
.md file = 12-ggplot2.md
## load the new_metadata data frame into your environment from a .RData object, if you need to -
## set eval=TRUE
load("data/new_metadata.RData")
# this data frame should have 12 rows and 5 columns
dim(new_metadata)
## [1] 12 5
library(ggplot2)
The `ggplot()` function is used to initialize the basic graph structure, then we add to it. The basic idea is that you specify different parts of the plot using additional functions one after the other and combine them using the `+` operator; the functions in the resulting code chunk are called layers.
1st layer: plot window
ggplot(new_metadata) # what happens? we get a plot window
The geom (geometric) object is the layer that specifies what kind of plot we want to draw. A plot must have at least one `geom`; there is no upper limit. Examples include:
- points (`geom_point`, `geom_jitter`) for scatter plots, dot plots, etc.
- lines (`geom_line`) for time series, trend lines, etc.
- boxplot (`geom_boxplot`) for, well, boxplots!
2nd layer: geometries
ggplot(new_metadata) + geom_point() # note what happens here
17.11.1 Geometries
Examples of geoms used in layers:
- geom_point()
- geom_boxplot()
- geom_bar()
- geom_density()
- geom_dotplot()
17.11.2 Aesthetics
We get an error because each type of geom usually has a required set of aesthetics. "Aesthetics" are set with the `aes()` function and can be set within `geom_point()` or within `ggplot()`.
The aes() function can be used to specify many plot elements including the following:
- position (i.e., on the x and y axes)
- color (“outside” color)
- fill (“inside” color)
- shape (of points)
- etc.
3rd layer: aesthetics
ggplot(new_metadata, aes(x = age_in_days, y = samplemeans)) + geom_point() # what happens here? we initialize the plot window with the axes
Add color to aesthetics:
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype))
Add shape to aesthetics:
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype))
Add point size to geometry:
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
size = 2.25)
The labels on the x- and y-axis are also quite small and hard to read. To change their size, we need to add an additional theme layer. The ggplot2 theme system handles non-data plot elements such as:
- Axis label aesthetics
- Plot background
- Facet label background
- Legend appearance
5th layer: theme
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
size = 3) + theme_bw()
Do the axis labels or the tick labels get any larger by changing themes?
6th layer: axes title size
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
size = 2.25) + theme_bw() + theme(axis.title = element_text(size = rel(1.5)))
Exercise
The current axis label text defaults to what we gave as input to geom_point (i.e., the column headers). We can change this by adding additional layers called xlab() and ylab() for the x- and y-axis, respectively. Add these layers to the current plot such that the x-axis is labeled "Age (days)" and the y-axis is labeled "Mean expression".
Use the ggtitle layer to add a plot title of your choice.
Add the following new layer to the code chunk theme(plot.title=element_text(hjust=0.5))
What does it change?
- Ans:
How many theme() layers can be added to a ggplot code chunk, in your estimation?
- Ans:
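A possible answer to the labeling exercise above, sketched; the title text is my own:

```r
library(ggplot2)

ggplot(new_metadata) +
  geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
             size = 2.25) +
  theme_bw() +
  theme(axis.title = element_text(size = rel(1.5))) +
  xlab("Age (days)") +
  ylab("Mean expression") +
  ggtitle("Expression with age") +
  theme(plot.title = element_text(hjust = 0.5))  # centers the plot title
```

Note that multiple theme() layers stack: later layers add to (or override) earlier ones, which answers the last question above.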
17.12 13 Boxplot exercise
.md file = 13-boxplot_exercise.md
- Generate a boxplot using the data in the new_metadata dataframe. Create a ggplot2 code chunk with the following instructions:
Use the geom_boxplot() layer to plot the differences in sample means between the Wt and KO genotypes.
Use the fill aesthetic to look at differences in sample means between the celltypes within each genotype.
Add a title to your plot.
Add labels, ‘Genotype’ for the x-axis and ‘Mean expression’ for the y-axis.
Make the following theme() changes:
- Use the theme_bw() function to make the background white.
- Change the size of your axes labels to 1.25x larger than the default.
- Change the size of your plot title to 1.5x larger than default.
- Center the plot title.
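One boxplot that satisfies the instructions above, sketched (the title text is my own):

```r
library(ggplot2)

ggplot(new_metadata) +
  geom_boxplot(aes(x = genotype, y = samplemeans, fill = celltype)) +
  ggtitle("Sample means by genotype and cell type") +
  xlab("Genotype") +
  ylab("Mean expression") +
  theme_bw() +
  theme(axis.title = element_text(size = rel(1.25)),
        plot.title = element_text(size = rel(1.5), hjust = 0.5))
```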
Change genotype order: Factor the `new_metadata$genotype` column without creating any extra variables/objects and change the levels to `c("Wt", "KO")`
Re-run the boxplot code chunk you created in 1.
Change default colors
Add a new layer scale_color_manual(values=c("purple","orange")). Do you observe a change?
- Ans:
Replace scale_color_manual(values=c("purple","orange")) with scale_fill_manual(values=c("purple","orange")). Do you observe a change?
- Ans:
In the scatterplot we drew in class, add a new layer scale_color_manual(values=c("purple","orange")), do you observe a difference?
- Ans:
What do you think is the difference between scale_color_manual() and scale_fill_manual()?
Ans:
Boxplot using "color" instead of "fill". Use purple and orange for your colors.
Back in your boxplot code, change the colors in the scale_fill_manual() layer to be your 2 favorite colors.
Are there any colors that you tried that did not work?
- Ans:
- OPTIONAL Exercise
Find the hexadecimal code for your 2 favourite colors (from exercise 3 above) and replace the color names with the hexadecimal codes within the ggplot2 code chunk.
# Hint: the gplots package. If gplots is not already installed, uncomment the following line of
# code
# BiocManager::install('gplots')
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
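A sketch using the col2hex() function from gplots to look up hexadecimal codes for named colors; purple and orange here are just placeholders for your own favourites:

```r
library(gplots)

# Convert R color names to their hexadecimal codes
my_colors <- col2hex(c("purple", "orange"))
my_colors

# The hex codes can then be dropped into the ggplot2 layer, e.g.:
# scale_fill_manual(values = my_colors)
```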
17.13 14 Saving data and plots to file
.md file = 14-exporting_data_and_plots.md
17.13.1 Writing data to file
# Save a data frame to file
write.csv(sub_meta, file = "data/subset_meta.csv")
17.13.2 Exporting figures to file:
## Open device for writing
pdf(file = "figures/scatterplot.pdf")
ggplot(new_metadata) + geom_point(aes(x = age_in_days, y = samplemeans, color = genotype, shape = celltype),
size = rel(3))
## Closing the device is essential to save the temporary file created by pdf()/png()
dev.off()
## quartz_off_screen
## 2
17.14 15 Finding help
.md file = 15-finding_help.md
17.14.1 Saving variables
In order to find help online you must provide a reproducible example. Google and R sites such as Stack Overflow are good sources. To provide an example you may want to share an object with someone else; you can save any or all R data structures that you have in your environment to a file:
save.image() # saves all environmental variables in your current R session to a file called '.RData' in your working directory -- this is a command that you should use often when in an R session to save your work.
# OR give it a memorable name
save.image("intro-to-R.RData")
# OR for a single object
save(metadata, file = "metadata.RData")
# you can also use
saveRDS(metadata, file = "metadata.rds")
17.15 16 Tidyverse data wrangling
.md file = 16-tidyverse.md
# If tidyverse is not already installed, uncomment the following line of code
# BiocManager::install('tidyverse')
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 1.0.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Tidyverse basics
17.15.1 Pipes: The pipe (`%>%`) allows the output of a previous command to be used as input to another command instead of using nested functions.
NOTE: The keyboard shortcut for the pipe is Shift + Command + M on macOS; Shift + Ctrl + M on Windows.
## A single command
sqrt(83)
## [1] 9.110434
## Base R method of running more than one command
round(sqrt(83), digits = 2)
## [1] 9.11
## Running more than one command with piping
sqrt(83) %>%
round(digits = 2)
## [1] 9.11
Exercise
- Create a vector of random numbers using the code below:
random_numbers <- c(81, 90, 65, 43, 71, 29)
Use the pipe (%>%) to perform two steps in a single line:
Take the mean of random_numbers using the mean() function.
Round the output to three digits using the round() function.
random_numbers %>%
  mean() %>%
  round(digits = 3)
## [1] 63.167
17.15.2 Tibbles
Experimental data
The dataset: gprofiler_results_Mov10oe.tsv
The gene list represents the functional analysis results of differences between control mice and mice over-expressing a gene involved in RNA splicing. We will focus on gene ontology (GO) terms, which describe the roles of genes and gene products organized into three controlled vocabularies/ontologies (domains):
biological processes (BP)
cellular components (CC)
molecular functions (MF)
that are over-represented in our list of genes.
17.15.3 Analysis goal and workflow
Goal: Visually compare the most significant biological processes (BP) based on the number of associated differentially expressed genes (gene ratios) and significance values and create a plot.
1. Read in the functional analysis results
# Read in the functional analysis results
functional_GO_results <- read.delim(file = "data/gprofiler_results_Mov10oe.tsv")

# Take a look at the results
functional_GO_results[1:3, 1:12]
query.number | significant | p.value | term.size | query.size | overlap.size | recall | precision | term.id | domain | subgraph.number | term.name |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | TRUE | 0.00434 | 111 | 5850 | 52 | 0.009 | 0.468 | GO:0032606 | BP | 237 | type I interferon production |
1 | TRUE | 0.00330 | 110 | 5850 | 52 | 0.009 | 0.473 | GO:0032479 | BP | 237 | regulation of type I interferon production |
1 | TRUE | 0.02970 | 39 | 5850 | 21 | 0.004 | 0.538 | GO:0032480 | BP | 237 | negative regulation of type I interferon production |
names(functional_GO_results)
## [1] "query.number" "significant" "p.value" "term.size" "query.size" "overlap.size" "recall"
## [8] "precision" "term.id" "domain" "subgraph.number" "term.name" "relative.depth" "intersection"
2. Extract only the GO biological processes (BP) of interest
# Return only GO biological processes
bp_oe <- functional_GO_results %>%
  dplyr::filter(domain == "BP")

bp_oe[1:3, 1:12]
query.number | significant | p.value | term.size | query.size | overlap.size | recall | precision | term.id | domain | subgraph.number | term.name |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | TRUE | 0.00434 | 111 | 5850 | 52 | 0.009 | 0.468 | GO:0032606 | BP | 237 | type I interferon production |
1 | TRUE | 0.00330 | 110 | 5850 | 52 | 0.009 | 0.473 | GO:0032479 | BP | 237 | regulation of type I interferon production |
1 | TRUE | 0.02970 | 39 | 5850 | 21 | 0.004 | 0.538 | GO:0032480 | BP | 237 | negative regulation of type I interferon production |
Note: the `dplyr::` prefix is necessary because the stats package, which gets loaded along with ggplot2, rmarkdown, and bookdown, also has a `filter()` function.
Exercise
We would like to perform an additional round of filtering to only keep the most specific GO terms.
- For bp_oe, use the filter() function to only keep those rows where the relative.depth is greater than 4.
- Save output to overwrite our bp_oe variable.
bp_oe <- bp_oe %>%
  dplyr::filter(relative.depth > 4)
17.15.4 Select
- Select only the columns needed for visualization
# Selecting columns to keep for visualization
names(bp_oe)
## [1] "query.number" "significant" "p.value" "term.size" "query.size" "overlap.size" "recall"
## [8] "precision" "term.id" "domain" "subgraph.number" "term.name" "relative.depth" "intersection"
bp_oe <- bp_oe %>%
  select(term.id, term.name, p.value, query.size, term.size, overlap.size, intersection)
head(bp_oe[, 1:6])
term.id | term.name | p.value | query.size | term.size | overlap.size |
---|---|---|---|---|---|
GO:0090672 | telomerase RNA localization | 2.41e-02 | 5850 | 16 | 11 |
GO:0090670 | RNA localization to Cajal body | 2.41e-02 | 5850 | 16 | 11 |
GO:0090671 | telomerase RNA localization to Cajal body | 2.41e-02 | 5850 | 16 | 11 |
GO:0071702 | organic substance transport | 0.00e+00 | 5850 | 2629 | 973 |
GO:0015931 | nucleobase-containing compound transport | 4.43e-05 | 5850 | 200 | 93 |
GO:0050657 | nucleic acid transport | 3.70e-06 | 5850 | 166 | 83 |
dim(bp_oe)
## [1] 668 7
17.15.5 Arrange
Let’s sort the rows by adjusted p-value with the arrange() function.
# Order by adjusted p-value ascending
bp_oe <- bp_oe %>%
  arrange(p.value)
17.15.6 Rename
Let’s rename the term.id and term.name columns.
# Provide better names for columns
bp_oe <- bp_oe %>%
  dplyr::rename(GO_id = term.id, GO_term = term.name)
Exercise
Rename the intersection column to genes to reflect the fact that these are the DE genes associated with the GO process.
bp_oe <- bp_oe %>%
  dplyr::rename(genes = intersection)
17.15.7 Mutate
Create additional metrics for plotting (e.g. gene ratios)
# Create gene ratio column based on other columns in dataset
bp_oe <- bp_oe %>%
  mutate(gene_ratio = overlap.size/query.size)
Exercise
Create a column in bp_oe called term_percent to determine the percent of DE genes associated with the GO term relative to the total number of genes associated with the GO term (overlap.size / term.size)
bp_oe <- bp_oe %>%
  mutate(term_percent = overlap.size/term.size)
Homework
Use ggplot to plot the first 30 significant GO terms vs gene_ratio. Color the points by p.value as shown in the figure below:
![](img/dotplot6.png)
sub <- bp_oe[1:30, ]
Change the p.value gradient colors to red-orange-yellow
# to help linearize the p.values for the colors; not essential for answer
sub$`-log10(p.value)` <- -log10(sub$p.value)
See the ggplot2 documentation for more detail.
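A sketch of one way to draw the homework dotplot; it assumes the sub data frame built above (with the GO_term, gene_ratio, and -log10(p.value) columns from the earlier exercises), and the aesthetic mapping and gradient colors are my guesses at the intended figure:

```r
library(ggplot2)

ggplot(sub, aes(x = gene_ratio, y = GO_term, color = `-log10(p.value)`)) +
  geom_point() +
  scale_color_gradientn(colours = c("yellow", "orange", "red")) +
  theme_bw() +
  xlab("Gene ratio") +
  ylab("GO term")
```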