Chapter 1 Introduction

Note: this book is a work in progress. All source code for this project are available on my GitHub, which is linked in 1.4.

This book serves as a collection of R Markdown files helps users in learning the practical syntax and usage of R for data science. Mainly, code snippets and workflow aimed at tackling everyday tasks in data science will be covered, which includes data cleaning, data wrangling, iterations, machine learning with caret, data visualization, and web app design using Shiny. Each broad topic will be split into chapters, though there will be some overlap.

1.1 Preface

This book assumes readers have a basic grasp of R programming language. There are tons of useful resources that provides the first few steps into programming with R and I hope this book can serve as a useful complement to using R - a pocket reference of sorts!

It depends on who you ask, but many users emphasize the importance of learning the syntax of base R before diving into commonly used packages like dplyr and data.table. It is definitely a good idea to get the hang of base R if you’re developing an app or a package for example - this would lessen the number of dependencies your program has!

However, similarly to pandas in Python, the popularity of dplyr and associated packages within the tidyverse suite has soared and you wouldn’t be surprised to see tidyverse solutions as the top answers in forums like Stack Overflow (this can be frustrating if you’re a base R purist). Using tidyverse for data science can definitely make your life easy - I find their syntax more pretty intuitive - but I’d like to sit on the fence on the base R vs. tidyverse debate; you should know both! For that reason, in this book I will try to use both interchangeably.

1.2 R syntax in this book

Code chunks will be presented in a typical Markdown format as such, with the code output below:

runif(n = 20, min = 0, max = 100)
##  [1] 86.5544839  4.1597651 71.9618412 72.2593990 49.5171120
##  [6] 43.7057444 22.0336953 68.1447566 81.9539577  0.9975925
## [11] 32.0767239 85.5006679 58.4824134  2.7996030 57.6497652
## [16] 77.2525484 38.6410517 47.0563594 80.6563789 21.4674578

When using commands outside of base R, the loading of the parent package will be explicitly shown to avoid confusion:

library(microbenchmark)
microbenchmark::microbenchmark(runif(n = 20, min = 0, max = 100))
## Warning in microbenchmark::microbenchmark(runif(n = 20, min =
## 0, max = 100)): less accurate nanosecond times to avoid
## potential integer overflows
## Unit: nanoseconds
##                               expr min  lq    mean median   uq
##  runif(n = 20, min = 0, max = 100) 861 943 1127.09   1066 1107
##   max neval
##  8159   100

Typically in longer chains of code, I will use %>% from magrittr as a pipe. This is usually standard practice in code using packages from the tidyverse so it’s a good habit to start using it. However, keep in mind - as of the recent R version (shown below), there is a native R pipe |> which behave almost - but not always - in a similar fashion.

Finally, here is the R version I am currently using:

version
##                _                           
## platform       aarch64-apple-darwin20      
## arch           aarch64                     
## os             darwin20                    
## system         aarch64, darwin20           
## status                                     
## major          4                           
## minor          3.1                         
## year           2023                        
## month          06                          
## day            16                          
## svn rev        84548                       
## language       R                           
## version.string R version 4.3.1 (2023-06-16)
## nickname       Beagle Scouts

1.3 R packages commonly used in this book

  • tidyverse: a collection of packages for data science, including dplyr, purrr, stringr, forcats, readr, and ggplot.

  • caret: package for implementation of machine learning models, with support for algorithms such as ranger, rpart, xgbTree, and svmLinear.

  • mlbench: package for benchmarks and datasets in machine learning.

  • broom: package for summarizing of model estimates.

  • ggpubr: package for publication-ready data visualizations.

  • Shiny: package for implementation and designing of interactive web apps.

1.4 Installing R packages

R packages found in this book are available on CRAN and thus can be installed simply by running install.packages(). For packages not on CRAN (or if you want to download developmental versions of a package), you can install packages straight from a GitHub repository by running devtools::install_github().

1.5 Code availability

All code used to compile this book as well as the individual markdown files are available on my repository here

1.6 Website hosting

This book is hosted on the shinyapps server, deployed with the R package rsconnect. The markdown files are compiled in this book format using the R package bookdown.