# 1 Getting Started with R

R is a popular programming language within the Data Scientist’s arsenal of tools. Over the course, we will start to get more familiar with R and show how it might be used to complete complicated tasks.

## 1.1 Launch RStudio

Here, we aim to introduce you to some base functionality within R and showcase how to perform some basic statistical tasks.

For starters, you can interface directly with the R terminal and enter in basic calculations:


(2*3-1)^2 - 5
#>  20

Often we will want to save a value. To do this we use one of the assignment operators: <-, ->, or =:

x <- 2*3
2 -> y
z = 5
print(x)
#>  6
print(y)
#>  2
print(z)
#>  5
w <- (x-1)^y-z

print(w)
#>  20

R can also assess the validity of logical statements:

TRUE == FALSE
#>  FALSE

R can deal with integers or decimals:

45L # add L if you want R to think of this as an integer
#>  45

3.14
#>  3.14

The class function can let you know the data type:

class(45L)
#>  "integer"
class(45)
#>  "numeric"
class(3.14)
#>  "numeric"
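
class reports "numeric" for both 45 and 3.14, which hides how each value is stored. As a small sketch (typeof and as.integer are base R functions; the variable names are illustrative only), the storage type can be inspected and converted like this:

```r
whole_number <- 45L   # stored as an integer because of the L suffix
plain_number <- 45    # stored as a double, even though it looks whole

type_of_whole <- typeof(whole_number)   # "integer"
type_of_plain <- typeof(plain_number)   # "double"

# as.integer() drops the fractional part (it truncates rather than rounds)
truncated <- as.integer(3.99)           # 3L

print(type_of_whole)
print(type_of_plain)
print(truncated)
```

The distinction rarely matters for arithmetic, but it can matter when comparing types or when reading data from files.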

## 1.2 Summary of basic R Data Types

| Example | Type |
|---|---|
| "male", "Diabetes" | Character / String |
| 3, 20.6, 100.222 | Numeric |
| 26L (add an 'L' to denote integer) | Integer |
| TRUE, FALSE | Logical |


## 1.3 Lists and Vectors

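We can make vectors within R by using the c() function. Strictly speaking these are atomic vectors rather than R lists, although the variable names below use the word "list":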


list_of_ints <- c(2L,4L,6L,8L)

list_of_strings <- c("Data", "Science", "is a", 'blast!')

list_of_logicals <- c(TRUE, FALSE, TRUE, FALSE)

list_of_mixed_type <- c(3.14, 1L, "cat", TRUE)

list_of_numbers <- c(22/7, 18, 42, 65.2)

list_of_sexes <- c("Male","Female","Female","Male")
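
Because an atomic vector holds a single type, c() silently converts mixed input to one common type, whereas list() keeps each element as supplied. A minimal sketch of the difference (the names mixed_vector and mixed_list are illustrative, not part of the tutorial's data):

```r
# c() builds an atomic vector: every element is coerced to one common type
mixed_vector <- c(3.14, 1L, "cat", TRUE)
vector_class <- class(mixed_vector)           # "character"

# list() keeps each element with the type it was supplied as
mixed_list <- list(3.14, 1L, "cat", TRUE)
element_classes <- sapply(mixed_list, class)  # "numeric" "integer" "character" "logical"

print(vector_class)
print(element_classes)
```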

### 1.3.1 length

The length function will tell you the length of the vector / list:


length(list_of_strings)
#>  4
class(list_of_strings)
#>  "character"

length(list_of_ints)
#>  4
class(list_of_ints)
#>  "integer"

length(list_of_numbers)
#>  4
class(list_of_numbers)
#>  "numeric"

length(list_of_mixed_type)
#>  4
class(list_of_mixed_type)
#>  "character"

### 1.3.2 Accessing elements in a list

We can access certain elements from within the lists by passing the location of the element of interest within the list:


list_of_mixed_type[1]
#>  "3.14"
list_of_mixed_type[2]
#>  "1"
list_of_mixed_type[3]
#>  "cat"
list_of_mixed_type[4]
#>  "TRUE"
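
Positions in R start at 1, and the square brackets accept more than a single position. The following sketch, using an illustrative vector called words, shows a few common forms of indexing:

```r
words <- c("Data", "Science", "is a", "blast!")

second_word    <- words[2]                            # one position
first_and_last <- words[c(1, length(words))]          # several positions at once
without_first  <- words[-1]                           # a negative index drops that position
odd_positions  <- words[c(TRUE, FALSE, TRUE, FALSE)]  # a logical vector keeps the TRUE positions

print(second_word)
print(first_and_last)
print(without_first)
print(odd_positions)
```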


## 1.4 The Data Frame

A data frame is a table whose columns are vectors of equal length. Unlike a matrix, each column can have its own type:

my_dataframe <- data.frame(list_of_ints,
list_of_strings,
list_of_logicals,
list_of_mixed_type,
list_of_numbers,
list_of_sexes)

my_dataframe
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1            2            Data             TRUE               3.14
#> 2            4         Science            FALSE                  1
#> 3            6            is a             TRUE                cat
#> 4            8          blast!            FALSE               TRUE
#>   list_of_numbers list_of_sexes
#> 1        3.142857          Male
#> 2       18.000000        Female
#> 3       42.000000        Female
#> 4       65.200000          Male

We can see the first few rows of a dataframe with the head function:

head(my_dataframe)
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1            2            Data             TRUE               3.14
#> 2            4         Science            FALSE                  1
#> 3            6            is a             TRUE                cat
#> 4            8          blast!            FALSE               TRUE
#>   list_of_numbers list_of_sexes
#> 1        3.142857          Male
#> 2       18.000000        Female
#> 3       42.000000        Female
#> 4       65.200000          Male

#### 1.4.0.1 The dim function

We can determine the dimension by using the dim function:

dim(my_dataframe)
#>  4 6

So we can see that this data frame has 4 rows and 6 columns. We can also get those values by using nrow and ncol:

nrow(my_dataframe)
#>  4

ncol(my_dataframe)
#>  6

### 1.4.1 Matrix Notation

We can use familiar matrix notation to select specific elements from the data frame:


my_dataframe[3,5]
#>  42

We can use similar notation to select a row:


my_dataframe[3,]
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 3            6            is a             TRUE                cat
#>   list_of_numbers list_of_sexes
#> 3              42        Female

Or a column:


my_dataframe[,5]
#>   3.142857 18.000000 42.000000 65.200000
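
The same bracket notation accepts ranges, column names, and logical conditions. As a hedged sketch on a small made-up data frame (scores and its columns are hypothetical and not part of the tutorial's data):

```r
# hypothetical example data
scores <- data.frame(
  id    = 1:4,
  group = c("a", "b", "b", "a"),
  value = c(3.5, 18, 42, 65.2)
)

# rows 2 to 3, and only the id and value columns
middle_rows <- scores[2:3, c("id", "value")]

# every row whose value exceeds 10
large_values <- scores[scores$value > 10, ]

# selecting a single column returns a plain vector by default;
# drop = FALSE keeps the result as a one-column data frame
value_vector <- scores[, "value"]
value_frame  <- scores[, "value", drop = FALSE]

print(middle_rows)
print(large_values)
```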

### 1.4.2 Data Frames have colnames

Often our data frame will have meaningful column names:

colnames(my_dataframe)
#>  "list_of_ints"       "list_of_strings"    "list_of_logicals"
#>  "list_of_mixed_type" "list_of_numbers"    "list_of_sexes"

So it is also helpful to be able to pass in these column names to select a column, rather than recall which column number the information is associated with:

my_dataframe[,'list_of_ints']
#>  2 4 6 8

We may also use the following to access a column in a dataframe:

my_dataframe$list_of_strings
#>  "Data"    "Science" "is a"    "blast!"

Note that the column keeps the class of the original vector:

class(my_dataframe[,"list_of_strings"])
#>  "character"

and

class(list_of_strings)
#>  "character"

## 1.5 Factors

In versions of R before 4.0.0, data.frame converted strings into factors by default. Since R 4.0.0 the default is stringsAsFactors = FALSE, which is why the column above is still a character vector. We can also set this behaviour explicitly by passing the additional parameter stringsAsFactors = FALSE into the data.frame function:

my_dataframe2 <- data.frame(list_of_ints,
list_of_strings,
list_of_logicals,
list_of_mixed_type,
list_of_numbers,
list_of_sexes,
stringsAsFactors = FALSE)

my_dataframe2
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1            2            Data             TRUE               3.14
#> 2            4         Science            FALSE                  1
#> 3            6            is a             TRUE                cat
#> 4            8          blast!            FALSE               TRUE
#>   list_of_numbers list_of_sexes
#> 1        3.142857          Male
#> 2       18.000000        Female
#> 3       42.000000        Female
#> 4       65.200000          Male

Now notice that:

class(list_of_strings) == class(my_dataframe2[,"list_of_strings"])
#>  TRUE

### 1.5.1 levels and ordered levels

We can access the levels of a factor with the levels function. Because list_of_sexes is a character column rather than a factor, the result here is NULL:

levels(my_dataframe[,'list_of_sexes'])
#> NULL

Ordered factors let us place an ordering on the levels:

list_of_costs <- c('$0 - $100', '$100 - $200', '$200 - $300', '$300 - $400')

ordered_list_of_costs <- ordered(list_of_costs, levels = list_of_costs)

my_dataframe <- data.frame(list_of_ints,
list_of_strings,
list_of_logicals,
list_of_mixed_type,
list_of_numbers,
list_of_sexes,
ordered_list_of_costs)

my_dataframe
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1            2            Data             TRUE               3.14
#> 2            4         Science            FALSE                  1
#> 3            6            is a             TRUE                cat
#> 4            8          blast!            FALSE               TRUE
#>   list_of_numbers list_of_sexes ordered_list_of_costs
#> 1        3.142857          Male             $0 - $100
#> 2       18.000000        Female           $100 - $200
#> 3       42.000000        Female           $200 - $300
#> 4       65.200000          Male           $300 - $400

my_dataframe[,'ordered_list_of_costs']
#>  $0 - $100   $100 - $200 $200 - $300 $300 - $400
#> Levels: $0 - $100 < $100 - $200 < $200 - $300 < $300 - $400

## 1.6 Structure

The str function compactly displays the internal structure of an R object. To see the help page for a function, use ? before the function name; for example, try ?str.

str(my_dataframe)
#> 'data.frame':    4 obs. of  7 variables:
#>  $ list_of_ints         : int  2 4 6 8
#>  $ list_of_strings      : chr  "Data" "Science" "is a" "blast!"
#>  $ list_of_logicals     : logi  TRUE FALSE TRUE FALSE
#>  $ list_of_mixed_type   : chr  "3.14" "1" "cat" "TRUE"
#>  $ list_of_numbers      : num  3.14 18 42 65.2
#>  $ list_of_sexes        : chr  "Male" "Female" "Female" "Male"
#>  $ ordered_list_of_costs: Ord.factor w/ 4 levels "$0 - $100"<"$100 - $200"<..: 1 2 3 4
str(list_of_costs)
#>  chr [1:4] "$0 - $100" "$100 - $200" "$200 - $300" "$300 - $400"
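
The str output reports ordered_list_of_costs as an ordered factor stored as the integer codes 1 2 3 4. A minimal sketch of what that means, using small illustrative vectors (sexes and sizes here are stand-ins, not the tutorial's columns):

```r
# a plain factor: the levels are the distinct values, sorted alphabetically by default
sexes <- factor(c("Male", "Female", "Female", "Male"))
sex_levels <- levels(sexes)       # "Female" "Male"
sex_codes  <- as.integer(sexes)   # each value stored as the position of its level

# an ordered factor: the order of the levels argument defines the ranking
sizes <- ordered(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"))
low_before_high <- sizes[1] < sizes[2]   # comparisons follow the level order

print(sex_levels)
print(sex_codes)
print(low_before_high)
```

Comparisons such as < are only meaningful for ordered factors; on an unordered factor R returns NA with a warning.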


## 1.7 summary

summary is a generic function used to produce summaries of the results of various model-fitting functions, as well as of data objects such as vectors and data frames.

See ?summary for more information.

In the case of a data frame, the summary function gives summary-level information for each column. For continuous variables it displays the minimum, first quartile, median, mean, third quartile and maximum; for factors it displays counts of each level; for character columns it displays only the length, class and mode:

summary(my_dataframe)
#>   list_of_ints list_of_strings    list_of_logicals list_of_mixed_type
#>  Min.   :2.0   Length:4           Mode :logical    Length:4
#>  1st Qu.:3.5   Class :character   FALSE:2          Class :character
#>  Median :5.0   Mode  :character   TRUE :2          Mode  :character
#>  Mean   :5.0
#>  3rd Qu.:6.5
#>  Max.   :8.0
#>  list_of_numbers  list_of_sexes      ordered_list_of_costs
#>  Min.   : 3.143   Length:4           $0 - $100  :1
#>  1st Qu.:14.286   Class :character   $100 - $200:1
#>  Median :30.000   Mode  :character   $200 - $300:1
#>  Mean   :32.086                      $300 - $400:1
#>  3rd Qu.:47.800
#>  Max.   :65.200
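
To see where the numeric part of this output comes from, the figures for list_of_ints can be reproduced with base R's quantile and mean functions. This is a sketch on a standalone copy of the values (the name ints is illustrative):

```r
ints <- c(2L, 4L, 6L, 8L)

# quantile() returns the minimum, quartiles and maximum by default
five_numbers <- quantile(ints)
average <- mean(ints)

print(five_numbers)
print(average)

# summary() on the vector combines both into one display
print(summary(ints))
```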

We can also get these statistics by class. Here we use the information in the data frame column list_of_sexes:

by(my_dataframe, my_dataframe$list_of_sexes, summary)
#> my_dataframe$list_of_sexes: Female
#>   list_of_ints list_of_strings    list_of_logicals list_of_mixed_type
#>  Min.   :4.0   Length:2           Mode :logical    Length:2
#>  1st Qu.:4.5   Class :character   FALSE:1          Class :character
#>  Median :5.0   Mode  :character   TRUE :1          Mode  :character
#>  Mean   :5.0
#>  3rd Qu.:5.5
#>  Max.   :6.0
#>  list_of_numbers list_of_sexes      ordered_list_of_costs
#>  Min.   :18      Length:2           $0 - $100  :0
#>  1st Qu.:24      Class :character   $100 - $200:1
#>  Median :30      Mode  :character   $200 - $300:1
#>  Mean   :30                         $300 - $400:0
#>  3rd Qu.:36
#>  Max.   :42
#> ------------------------------------------------------------
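
The by call applies summary to every column within each group. When only one statistic per group is needed, base R's tapply is a lighter alternative. A minimal sketch on standalone copies of two of the columns (the names values and sexes are illustrative):

```r
values <- c(2L, 4L, 6L, 8L)
sexes  <- c("Male", "Female", "Female", "Male")

# apply one function to the values within each group
mean_by_sex <- tapply(values, sexes, mean)
max_by_sex  <- tapply(values, sexes, max)

print(mean_by_sex)
print(max_by_sex)
```

The result is a named vector with one entry per group, which is often easier to reuse in later code than the printed output of by.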
sample(library()$results[,1], 20, replace = FALSE)
#>  "BH"           "tcltk"        "chron"        "googlesheets4"
#>  "cowplot"      "gsubfn"       "qap"          "askpass"
#>  "TSP"          "processx"     "translations" "ranger"
#>  "lifecycle"    "methods"      "grid"         "ggsignif"
#>  "rsconnect"    "httpuv"       "yardstick"    "backports"

The following function checks whether each package in a list is already installed and installs any that are missing:

# install package if missing
install_if_not <- function( list.of.packages ) {
  # keep only the packages that are not yet installed
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages)) {
    install.packages(new.packages, repos = "http://cran.us.r-project.org")
  } else {
    print(paste0("the package '", list.of.packages , "' is already installed"))
  }
}

You can use the function like this:

# test function
install_if_not(c("tidyverse"))
#>  "the package 'tidyverse' is already installed"

Additional information on the installed packages, including the version, can be found with:

tibble::as_tibble(installed.packages())
#> # A tibble: 305 x 16
#>   Package  LibPath  Version Priority Depends Imports LinkingTo Suggests Enhances
#>   <chr>    <chr>    <chr>   <chr>    <chr>   <chr>   <chr>     <chr>    <chr>
#> 1 abind    C:/User~ 1.4-5   <NA>     R (>= ~ method~ <NA>      <NA>     <NA>
#> 2 Amelia   C:/User~ 1.8.0   <NA>     R (>= ~ foreig~ Rcpp (>=~ "tcltk,~ <NA>
#> 3 AMR      C:/User~ 1.7.1   <NA>     R (>= ~ <NA>    <NA>      "cleane~ <NA>
#> 4 arsenal  C:/User~ 3.6.3   <NA>     R (>= ~ knitr ~ <NA>      "broom ~ <NA>
#> 5 askpass  C:/User~ 1.1     <NA>     <NA>    sys (>~ <NA>      "testth~ <NA>
#> 6 assertt~ C:/User~ 0.2.1   <NA>     <NA>    tools   <NA>      "testth~ <NA>
#> # ... with 299 more rows, and 7 more variables: License <chr>,
#> #   License_is_FOSS <chr>, License_restricts_use <chr>, OS_type <chr>,
#> #   MD5sum <chr>, NeedsCompilation <chr>, Built <chr>

### 1.14.2 loaded packages

To check which packages are loaded you can use:

# loaded packages
(.packages())
#>  "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
#>  "base"

Make sure that the tidyverse and dplyr packages are installed. You can run install.packages(c('tidyverse','dplyr')) to install both.

### 1.14.3 read sample data

my_dataframe <- readRDS('my_dataframe.RDS' )

We can use the head command to see the first few rows:

head(my_dataframe)
#>   list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1            2            Data             TRUE               3.14
#> 2            4         Science            FALSE                  1
#> 3            6            is a             TRUE                cat
#> 4            8          blast!            FALSE               TRUE
#>   list_of_numbers list_of_sexes ordered_list_of_costs
#> 1        3.142857          Male             $0 - $100
#> 2       18.000000        Female           $100 - $200
#> 3       42.000000        Female           $200 - $300
#> 4       65.200000          Male           $300 - $400

## 1.15 Using Packages

We make the functions in an installed package available to our code by using the library command:

loaded_package_before <- (.packages())

# everything below here can call functions in dplyr package
library('dplyr')
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union

glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints          <int> 2, 4, 6, 8
#> $ list_of_strings       <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals      <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type    <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers       <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes         <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400

(.packages())
#>  "dplyr"     "stats"     "graphics"  "grDevices" "utils"     "datasets"
#>  "methods"   "base"

#>  "dplyr"
detach(package:dplyr)
# the dplyr package has now been detached; calls to its functions may now fail

packages_cur <- (.packages())

#>  "dplyr"

Additionally, we can access a function from a library without loading the entire library by using a command of the form package::function. This notation is needed in any instance where two or more loaded packages have at least one function with the same name. It is also useful when developing functions and packages.

For instance, the glimpse function from the dplyr package can also be accessed by using the following command:

# glimpse the data
dplyr::glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints          <int> 2, 4, 6, 8
#> $ list_of_strings       <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals      <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type    <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers       <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes         <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400

And just notice that without the library loaded or the dplyr:: prefix in front, the call may produce an error:

# error
glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints          <int> 2, 4, 6, 8
#> $ list_of_strings       <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals      <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type    <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers       <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes         <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
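
The package::function notation also resolves name clashes. As a self-contained sketch that needs no extra packages, the code below deliberately defines a function named union, which then hides base R's union until the base:: prefix is used (the replacement function is purely illustrative):

```r
# a user-defined function that deliberately reuses the name of base::union
union <- function(x, y) "my own union"

masked_result    <- union(1:2, 2:3)        # calls the function defined above
qualified_result <- base::union(1:2, 2:3)  # the prefix reaches the base R version

print(masked_result)
print(qualified_result)

# remove our version so that a plain union() finds base::union again
rm(union)
restored_result <- union(1:2, 2:3)
print(restored_result)
```

The same mechanism is what the "masked from" messages describe when dplyr is attached: stats::filter still exists, but a plain filter call finds the dplyr version first.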

After first detaching a package with detach(package:package.name.here), we can install its latest CRAN release from the console with install.packages(c("package.name.here")):

install.packages(c('dplyr'), repos = "http://cran.us.r-project.org")
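
Before reinstalling, it can help to check whether a package is present and which version is installed. A minimal sketch using requireNamespace and packageVersion, both of which ship with R; the stats package is used here only because it is always available, and package_name is an illustrative variable:

```r
package_name <- "stats"  # ships with R, so this sketch runs without installing anything

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
is_installed <- requireNamespace(package_name, quietly = TRUE)

if (is_installed) {
  installed_version <- as.character(packageVersion(package_name))
  message(package_name, " is installed, version ", installed_version)
} else {
  message(package_name, " is not installed; install.packages() would be needed")
}
```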


## 1.16 Welcome to the tidyverse

The tidyverse is a collection of R packages that have been grouped together in order to make “data wrangling” more efficient:

library('tidyverse')
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.3     v purrr   0.3.4
#> v tibble  3.1.2     v stringr 1.4.0
#> v tidyr   1.1.3     v forcats 0.5.1
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::lag()    masks stats::lag()

### 1.16.1 and friends

• broom make outputs tidy
• lubridate working with dates
• readxl the readxl package contains the read_excel to read in .xls or .xlsx files
• knitr produce outputs such as HTML, PDF, Docs, PowerPoint, with Rmarkdown
• shiny build interactive web apps straight from R
• flexdashboard build dashboards with R
• furrr parallel mapping
• yardstick for model evaluation metrics
tidyverse_friends <- c('broom','lubridate','readxl','knitr','shiny','furrr','flexdashboard','yardstick')
install.packages(tidyverse_friends)

### 1.16.2 pipe operator

The pipe %>% operator originates from the magrittr package. The pipe takes the value on its left and passes it as the first argument to the function on its right:

f(x) is the same as x %>% f()

Notice how x gets piped into a function f

glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints          <int> 2, 4, 6, 8
#> $ list_of_strings       <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals      <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type    <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers       <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes         <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400

is the same as

my_dataframe %>% glimpse()
#> Rows: 4
#> Columns: 7
#> $ list_of_ints          <int> 2, 4, 6, 8
#> $ list_of_strings       <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals      <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type    <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers       <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes         <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400

Note that the pipe has become so popular that R, as of version 4.1.0, comes equipped with a native pipe operator of its own:

1:10 |> mean()
#>  5.5
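
Pipes are most useful when several steps are chained, because the code then reads left to right in the order the steps happen. A small sketch comparing nested calls with the native pipe (this assumes R 4.1.0 or later; squares is an illustrative vector):

```r
squares <- c(1, 4, 9, 16)

# nested calls are read from the inside out
nested_total <- sum(sqrt(squares))

# the piped version reads in the order the steps are applied
piped_total <- squares |> sqrt() |> sum()

print(nested_total)
print(piped_total)
```

Both expressions compute the same value; the pipe only changes how the calls are written.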


### 1.16.3 Other Packages

Other packages we might make use of:

• devtools develop R packages

• arsenal compare dataframes ; create summary tables

• skimr automate Exploratory Data Analysis

• DataExplorer automate Exploratory Data Analysis

• rsq computes Adjusted R2 for various model types

• RSQLite R package for interfacing with SQLite database

• dbplyr database back-end for dplyr

• plotly interactive HTML plots

• DT contains datatable function for interactive HTML datatable

• GGally contains ggcorr for correlation plots and ggpairs for other data-plots

• corrr correlation matrix as a data-frame

• AMR Principal Component Plots

• factoextra for k-means clustering

• randomForest fit a Random Forest model

• caret (Classification And Regression Training) is a set of functions that attempt to streamline the process for creating predictive models.

other_packages <- c('devtools','rsq','arsenal','skimr','DataExplorer',
'RSQLite','dbplyr','plotly','DT','GGally','corrr',
'AMR','factoextra','caret','randomForest')
install.packages(other_packages)


## 1.17 Package Versions

We already mentioned that install.packages will update the package from CRAN:

install.packages( c( tidyverse_friends , other_packages ), repos = "http://cran.us.r-project.org")

We can also use devtools to install the most up-to-date development version of a package from GitHub, for example:

devtools::install_github("tidyverse/tidyverse")

Here are the versions installed on this system, compare with your own:

as_tibble(installed.packages()) %>%
select(Package, Version, Depends) %>%
filter(Package %in% c( c('tidyverse'),
c( tidyverse_friends , other_packages )  )) %>%
knitr::kable()

| Package | Version | Depends |
|---|---|---|
| AMR | 1.7.1 | R (>= 3.0.0) |
| arsenal | 3.6.3 | R (>= 3.4.0), stats (>= 3.4.0) |
| broom | 0.7.6 | R (>= 3.1) |
| caret | 6.0-88 | R (>= 3.2.0), lattice (>= 0.20), ggplot2 |
| corrr | 0.4.3 | R (>= 3.3.0) |
| DataExplorer | 0.8.2 | R (>= 3.6) |
| dbplyr | 2.1.1 | R (>= 3.1) |
| devtools | 2.4.2 | R (>= 3.0.2), usethis (>= 2.0.1) |
| dplyr | 1.0.6 | R (>= 3.3.0) |
| DT | 0.18 | NA |
| factoextra | 1.0.7 | R (>= 3.1.2), ggplot2 (>= 2.2.0) |
| flexdashboard | 0.5.2 | R (>= 3.0.2) |
| forcats | 0.5.1 | R (>= 3.2) |
| furrr | 0.2.2 | future (>= 1.19.1), R (>= 3.2.0) |
| GGally | 2.1.1 | R (>= 3.1), ggplot2 (>= 3.3.0) |
| ggplot2 | 3.3.3 | R (>= 3.2) |
| knitr | 1.33 | R (>= 3.2.3) |
| lubridate | 1.7.10 | methods, R (>= 3.2) |
| plotly | 4.9.4 | R (>= 3.2.0), ggplot2 (>= 3.0.0) |
| purrr | 0.3.4 | R (>= 3.2) |
| randomForest | 4.6-14 | R (>= 3.2.2), stats |
| rsq | 2.2 | NA |
| RSQLite | 2.2.7 | R (>= 3.1.0) |
| shiny | 1.6.0 | R (>= 3.0.2), methods |
| skimr | 2.1.3 | R (>= 3.1.2) |
| stringr | 1.4.0 | R (>= 3.1) |
| tibble | 3.1.2 | R (>= 3.1.0) |
| tidyr | 1.1.3 | R (>= 3.1) |
| tidyverse | 1.3.1 | R (>= 3.3) |
| yardstick | 0.0.8 | R (>= 2.10) |

You can install specific versions of packages with: devtools::install_version("my.package.name", version = "0.9.1")
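
When comparing versions in code, it is safer to use R's version objects than plain strings, because strings compare character by character. A minimal sketch with base R's package_version and packageVersion; the version numbers are illustrative, and the minimum of "3.0.0" is an arbitrary assumption for the example:

```r
# as plain strings, "1.7.10" sorts before "1.7.9" because "1" < "9"
string_comparison <- "1.7.10" > "1.7.9"

# as version objects, the components are compared as numbers
version_comparison <- package_version("1.7.10") > package_version("1.7.9")

# the same comparison works on the version of an installed package
stats_version <- packageVersion("stats")
meets_minimum <- stats_version >= "3.0.0"

print(string_comparison)
print(version_comparison)
print(meets_minimum)
```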

## 1.18 Details on this machine’s version of R

tibble::enframe(Sys.info()) %>%
filter(name %in% c('sysname','release','version','machine')) %>%
knitr::kable()

| name | value |
|---|---|
| sysname | Windows |
| release | 10 x64 |
| version | build 19042 |
| machine | x86-64 |

tibble::as_tibble(R.Version()) %>%
pivot_longer(everything()) %>%
knitr::kable()

| name | value |
|---|---|
| platform | x86_64-w64-mingw32 |
| arch | x86_64 |
| os | mingw32 |
| system | x86_64, mingw32 |
| status | |
| major | 4 |
| minor | 1.0 |
| year | 2021 |
| month | 05 |
| day | 18 |
| svn rev | 80317 |
| language | R |
| version.string | R version 4.1.0 (2021-05-18) |
| nickname | Camp Pontanezen |
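
The same version information can be used programmatically, for example to check whether the running R is new enough for a feature such as the native pipe. A short sketch using base R's getRversion and R.version.string (the variable names are illustrative):

```r
current_version <- getRversion()

# the native pipe |> was introduced in R 4.1.0
has_native_pipe <- current_version >= "4.1.0"

version_text <- R.version.string

print(current_version)
print(has_native_pipe)
print(version_text)
```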

### 1.18.1 sessionInfo

All the details about the current running session of R:

sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#>  LC_COLLATE=English_United States.1252
#>  LC_CTYPE=English_United States.1252
#>  LC_MONETARY=English_United States.1252
#>  LC_NUMERIC=C
#>  LC_TIME=English_United States.1252
#>
#> attached base packages:
#>  stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>   bookdown_0.22     digest_0.6.27     R6_2.5.0          jsonlite_1.7.2
#>   magrittr_2.0.1    evaluate_0.14     stringi_1.6.1     rlang_0.4.11
#>   jquerylib_0.1.4   bslib_0.2.5.1     rmarkdown_2.8     tools_4.1.0
#>  stringr_1.4.0     xfun_0.23         yaml_2.2.1        rsconnect_0.8.18
#>  compiler_4.1.0    htmltools_0.5.1.1 knitr_1.33        sass_0.4.0