1 Getting Started with R
R is a popular programming language within the Data Scientist’s arsenal of tools. Over the course, we will become more familiar with R and show how it can be used to complete complicated tasks.
1.1 Launch RStudio
Here, we aim to introduce you to some base functionality within R and showcase how to perform some basic statistical tasks.
For starters, you can interface directly with the R terminal and enter in basic calculations:
(2*3-1)^2 - 5
#> [1] 20
Often we will want to save a value. To do this we use one of the assignment operators: <-, ->, or =:
x <- 2*3
2 -> y
z = 5
print(x)
#> [1] 6
print(y)
#> [1] 2
print(z)
#> [1] 5
w <- (x-1)^y-z
print(w)
#> [1] 20
R can also assess the validity of logical statements:
TRUE == FALSE
#> [1] FALSE
R can deal with integers or decimals:
# add L if you want R to think of this as an integer
45L
#> [1] 45
3.14
#> [1] 3.14
The class function can let you know the data type:
class(45L)
#> [1] "integer"
class(45)
#> [1] "numeric"
class(3.14)
#> [1] "numeric"
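As a rough illustration of how these types can be checked and converted, the sketch below uses is.integer and as.integer from base R. The variable names are made up for this example and do not appear elsewhere in the chapter.

```r
# Hypothetical values used only for this illustration
whole   <- 45L   # integer literal (note the L suffix)
decimal <- 45    # numeric (double) literal

whole_is_int   <- is.integer(whole)     # TRUE
decimal_is_int <- is.integer(decimal)   # FALSE

# as.integer() converts a numeric to an integer by dropping the fractional part
truncated <- as.integer(3.99)           # 3
class(truncated)                        # "integer"
```

Note that as.integer truncates rather than rounds, so round() is the safer choice when the nearest whole number is wanted.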
1.2 Summary of basic R Data Types
Example | Type |
---|---|
"male", "Diabetes" | Character / String |
3, 20.6, 100.222 | Numeric |
26L (add an 'L' to denote integer) | Integer |
TRUE, FALSE | Logical |
1.3 Lists and Vectors
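We can make vectors within R by using the c() function. Strictly speaking, c() returns an atomic vector rather than an R list; the variable names below use "list" informally: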
list_of_ints <- c(2L,4L,6L,8L)
list_of_strings <- c("Data", "Science", "is a", 'blast!')
list_of_logicals <- c(TRUE, FALSE, TRUE, FALSE)
list_of_mixed_type <- c(3.14, 1L, "cat", TRUE)
list_of_numbers <- c(22/7, 18, 42, 65.2)
list_of_sexes <- c("Male","Female","Female","Male")
1.3.1 length
The length function will tell you the length of the vector / list:
length(list_of_strings)
#> [1] 4
class(list_of_strings)
#> [1] "character"
length(list_of_ints)
#> [1] 4
class(list_of_ints)
#> [1] "integer"
length(list_of_numbers)
#> [1] 4
class(list_of_numbers)
#> [1] "numeric"
length(list_of_mixed_type)
#> [1] 4
class(list_of_mixed_type)
#> [1] "character"
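The mixed-type vector above reports class "character" because c() coerces all of its inputs to a single common type. A minimal sketch of that coercion order (logical, then integer, then numeric, then character) is given below; the variable names are assumptions made for this example.

```r
# c() promotes every element to the most general type present
class_lgl_int <- class(c(TRUE, 2L))       # "integer"
class_int_num <- class(c(2L, 3.14))       # "numeric"
class_num_chr <- class(c(3.14, "cat"))    # "character"

# Converting the mixed vector back to numbers only works for
# elements that look like numbers; the others become NA
mixed <- c(3.14, 1L, "cat", TRUE)
back_to_numeric <- suppressWarnings(as.numeric(mixed))   # 3.14 1 NA NA
```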
1.3.2 Accessing elements in a list
We can access certain elements from within the lists by passing the location of the element of interest within the list:
list_of_mixed_type[1]
#> [1] "3.14"
list_of_mixed_type[2]
#> [1] "1"
list_of_mixed_type[3]
#> [1] "cat"
list_of_mixed_type[4]
#> [1] "TRUE"
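Indexing is not limited to a single position. The sketch below shows a few other base R indexing forms on a small hypothetical vector (letters_vec is not part of the original examples).

```r
letters_vec <- c("a", "b", "c", "d")   # hypothetical example vector

range_pick   <- letters_vec[2:3]                          # "b" "c"
several_pick <- letters_vec[c(1, 4)]                      # "a" "d"
drop_first   <- letters_vec[-1]                           # "b" "c" "d"
mask_pick    <- letters_vec[c(TRUE, FALSE, TRUE, FALSE)]  # "a" "c"
```

Positions in R start at 1, and a negative index removes that position rather than counting from the end.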
1.4 The Data Frame
A Data Frame is a table whose columns are vectors of equal length:
my_dataframe <- data.frame(list_of_ints,
                           list_of_strings,
                           list_of_logicals,
                           list_of_mixed_type,
                           list_of_numbers,
                           list_of_sexes)
my_dataframe
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1 2 Data TRUE 3.14
#> 2 4 Science FALSE 1
#> 3 6 is a TRUE cat
#> 4 8 blast! FALSE TRUE
#> list_of_numbers list_of_sexes
#> 1 3.142857 Male
#> 2 18.000000 Female
#> 3 42.000000 Female
#> 4 65.200000 Male
We can see the first few rows of a dataframe with the head function:
head(my_dataframe)
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1 2 Data TRUE 3.14
#> 2 4 Science FALSE 1
#> 3 6 is a TRUE cat
#> 4 8 blast! FALSE TRUE
#> list_of_numbers list_of_sexes
#> 1 3.142857 Male
#> 2 18.000000 Female
#> 3 42.000000 Female
#> 4 65.200000 Male
1.4.0.1 The dim function
We can determine the dimension by using the dim function:
dim(my_dataframe)
#> [1] 4 6
So we can see that this data frame has 4 rows and 6 columns. We can also get those values by using nrow and ncol:
nrow(my_dataframe)
#> [1] 4
ncol(my_dataframe)
#> [1] 6
1.4.1 Matrix Notation
We can use familiar matrix notation to select specific elements from the data frame:
my_dataframe[3,5]
#> [1] 42
We can use similar notation to select a row:
my_dataframe[3,]
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 3 6 is a TRUE cat
#> list_of_numbers list_of_sexes
#> 3 42 Female
Or a column:
my_dataframe[,5]
#> [1] 3.142857 18.000000 42.000000 65.200000
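The same bracket notation also accepts ranges, column names, and logical conditions. A minimal sketch follows, using a small hypothetical data frame (demo_df and its columns are assumptions for this illustration, not objects defined earlier).

```r
# Hypothetical data frame used only for this illustration
demo_df <- data.frame(id    = 1:4,
                      score = c(3.5, 18, 42, 65.2),
                      group = c("A", "B", "B", "A"))

# Rows that satisfy a condition, all columns
high_scores <- demo_df[demo_df$score > 10, ]

# Rows that satisfy a condition, selected columns by name
group_b <- demo_df[demo_df$group == "B", c("id", "score")]

# A range of rows from a single column returns a plain vector
middle_scores <- demo_df[2:3, "score"]   # 18 42
```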
1.4.2 Data Frames have colnames
Often our data frame will have meaningful column names:
colnames(my_dataframe)
#> [1] "list_of_ints" "list_of_strings" "list_of_logicals"
#> [4] "list_of_mixed_type" "list_of_numbers" "list_of_sexes"
so it is also helpful to be able to pass in these column names to select a column rather than recall which column number the information is associated with:
my_dataframe[,'list_of_ints']
#> [1] 2 4 6 8
We may also use the following to access a column in a dataframe:
my_dataframe$list_of_strings
#> [1] "Data" "Science" "is a" "blast!"
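One interesting thing of note is the class of a column pulled back out of the data frame:
class(my_dataframe[,"list_of_strings"])
#> [1] "character"
which, in R 4.0.0 and later, matches the class of the original vector:
class(list_of_strings)
#> [1] "character"
In versions of R before 4.0.0 the first call would instead have returned "factor".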
1.5 Factors
In versions of R before 4.0.0, data.frame would turn strings into factors by default; from R 4.0.0 onward the default is stringsAsFactors = FALSE. We can make this choice explicit by passing the additional parameter stringsAsFactors = FALSE into the data.frame function:
my_dataframe2 <- data.frame(list_of_ints,
                            list_of_strings,
                            list_of_logicals,
                            list_of_mixed_type,
                            list_of_numbers,
                            list_of_sexes, stringsAsFactors = FALSE)
my_dataframe2
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1 2 Data TRUE 3.14
#> 2 4 Science FALSE 1
#> 3 6 is a TRUE cat
#> 4 8 blast! FALSE TRUE
#> list_of_numbers list_of_sexes
#> 1 3.142857 Male
#> 2 18.000000 Female
#> 3 42.000000 Female
#> 4 65.200000 Male
Now notice that:
class(list_of_strings) == class(my_dataframe2[,"list_of_strings"])
#> [1] TRUE
1.5.1 levels and ordered levels
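We can access the levels of a factor with the levels function. The result below is NULL because list_of_sexes is stored in my_dataframe as a character column rather than a factor: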
levels(my_dataframe[,'list_of_sexes'])
#> NULL
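To see actual levels, the column has to be a factor. The following sketch, which assumes R 4.0.0 or later and uses made-up object names (sexes, df_chr, df_fct), contrasts the two settings of stringsAsFactors.

```r
sexes <- c("Male", "Female", "Female", "Male")   # hypothetical example vector

df_chr <- data.frame(sexes, stringsAsFactors = FALSE)  # column stays character
df_fct <- data.frame(sexes, stringsAsFactors = TRUE)   # column becomes a factor

chr_levels <- levels(df_chr$sexes)      # NULL: character vectors have no levels
fct_levels <- levels(df_fct$sexes)      # "Female" "Male" (sorted alphabetically)
fct_codes  <- as.integer(df_fct$sexes)  # 2 1 1 2: position of each value in the levels
```

An existing character column can also be converted after the fact with factor().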
Having factors enables us to place ordering on different levels:
list_of_costs <- c('$0 - $100',
                   '$100 - $200',
                   '$200 - $300',
                   '$300 - $400')
ordered_list_of_costs <- ordered(list_of_costs, levels = list_of_costs)
my_dataframe <- data.frame(list_of_ints,
                           list_of_strings,
                           list_of_logicals,
                           list_of_mixed_type,
                           list_of_numbers,
                           list_of_sexes,
                           ordered_list_of_costs)
my_dataframe
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1 2 Data TRUE 3.14
#> 2 4 Science FALSE 1
#> 3 6 is a TRUE cat
#> 4 8 blast! FALSE TRUE
#> list_of_numbers list_of_sexes ordered_list_of_costs
#> 1 3.142857 Male $0 - $100
#> 2 18.000000 Female $100 - $200
#> 3 42.000000 Female $200 - $300
#> 4 65.200000 Male $300 - $400
my_dataframe[,'ordered_list_of_costs']
#> [1] $0 - $100 $100 - $200 $200 - $300 $300 - $400
#> Levels: $0 - $100 < $100 - $200 < $200 - $300 < $300 - $400
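One practical consequence of ordered levels is that comparison operators and sorting follow the declared order instead of alphabetical order. A small sketch with a hypothetical sizes factor (not from the text above):

```r
# Hypothetical ordered factor: the levels argument sets the ordering
sizes <- ordered(c("small", "large", "medium"),
                 levels = c("small", "medium", "large"))

below_large <- sizes < "large"              # TRUE FALSE TRUE
largest     <- as.character(max(sizes))     # "large"
sorted      <- as.character(sort(sizes))    # "small" "medium" "large"
```

With a plain (unordered) factor, the comparison with < is not meaningful and returns NA with a warning.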
1.6 Structure
The R structure function, str, will compactly display the internal structure of an R object.
To see the help page for a function use ? before the function name; for example, try: ?str
str(my_dataframe)
#> 'data.frame': 4 obs. of 7 variables:
#> $ list_of_ints : int 2 4 6 8
#> $ list_of_strings : chr "Data" "Science" "is a" "blast!"
#> $ list_of_logicals : logi TRUE FALSE TRUE FALSE
#> $ list_of_mixed_type : chr "3.14" "1" "cat" "TRUE"
#> $ list_of_numbers : num 3.14 18 42 65.2
#> $ list_of_sexes : chr "Male" "Female" "Female" "Male"
#> $ ordered_list_of_costs: Ord.factor w/ 4 levels "$0 - $100"<"$100 - $200"<..: 1 2 3 4
str(list_of_costs)
#> chr [1:4] "$0 - $100" "$100 - $200" "$200 - $300" "$300 - $400"
1.7 summary
summary is a generic function used to produce summaries of the results of various model fitting functions.
See ?summary for more information.
In the case of a dataframe, the summary function will give summary-level information on each column: for continuous variables it will display the minimum, first quartile, median, mean, third quartile, and maximum; for factors it will display counts of each of the levels; character columns only report their length, class, and mode:
summary(my_dataframe)
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> Min. :2.0 Length:4 Mode :logical Length:4
#> 1st Qu.:3.5 Class :character FALSE:2 Class :character
#> Median :5.0 Mode :character TRUE :2 Mode :character
#> Mean :5.0
#> 3rd Qu.:6.5
#> Max. :8.0
#> list_of_numbers list_of_sexes ordered_list_of_costs
#> Min. : 3.143 Length:4 $0 - $100 :1
#> 1st Qu.:14.286 Class :character $100 - $200:1
#> Median :30.000 Mode :character $200 - $300:1
#> Mean :32.086 $300 - $400:1
#> 3rd Qu.:47.800
#> Max. :65.200
We can also get these statistics by class; here we group by the information in the dataframe column list_of_sexes:
by(my_dataframe, my_dataframe$list_of_sexes, summary)
#> my_dataframe$list_of_sexes: Female
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> Min. :4.0 Length:2 Mode :logical Length:2
#> 1st Qu.:4.5 Class :character FALSE:1 Class :character
#> Median :5.0 Mode :character TRUE :1 Mode :character
#> Mean :5.0
#> 3rd Qu.:5.5
#> Max. :6.0
#> list_of_numbers list_of_sexes ordered_list_of_costs
#> Min. :18 Length:2 $0 - $100 :0
#> 1st Qu.:24 Class :character $100 - $200:1
#> Median :30 Mode :character $200 - $300:1
#> Mean :30 $300 - $400:0
#> 3rd Qu.:36
#> Max. :42
#> ------------------------------------------------------------
#> my_dataframe$list_of_sexes: Male
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> Min. :2.0 Length:2 Mode :logical Length:2
#> 1st Qu.:3.5 Class :character FALSE:1 Class :character
#> Median :5.0 Mode :character TRUE :1 Mode :character
#> Mean :5.0
#> 3rd Qu.:6.5
#> Max. :8.0
#> list_of_numbers list_of_sexes ordered_list_of_costs
#> Min. : 3.143 Length:2 $0 - $100 :1
#> 1st Qu.:18.657 Class :character $100 - $200:0
#> Median :34.171 Mode :character $200 - $300:0
#> Mean :34.171 $300 - $400:1
#> 3rd Qu.:49.686
#> Max. :65.200
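When only one statistic per group is needed, tapply or aggregate from base R can be more compact than by. The sketch below rebuilds two of the columns under hypothetical names (grp_df, sex, value) so that it runs on its own.

```r
# Hypothetical stand-alone copy of two columns from the example data
grp_df <- data.frame(sex   = c("Male", "Female", "Female", "Male"),
                     value = c(22/7, 18, 42, 65.2))

# Mean of value within each sex, returned as a named vector
group_means <- tapply(grp_df$value, grp_df$sex, mean)
group_means[["Female"]]   # 30

# The same computation returned as a data frame
group_means_df <- aggregate(value ~ sex, data = grp_df, FUN = mean)
```

The Female mean of 30 agrees with the Mean reported for list_of_numbers in the Female block of the by output above.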
1.8 Save and Read RDS files
You can save important data, variables, or models as R data files using saveRDS:
saveRDS(my_dataframe,'my_dataframe.RDS')
By default this saves to your working directory.
1.9 getwd()
You can see your current working directory path by using getwd():
getwd()
#> [1] "C:/Users/jkyle/Documents/GitHub/Jeff_Data_Wrangling"
You can alter the paths in a number of different ways.
In this example, we will want to save results to a sub-folder called y_data.
First, I will tell R to create a directory:
new_path <- file.path(getwd(),'y_data')
dir.create(new_path)
#> Warning in dir.create(new_path): 'C:
#> \Users\jkyle\Documents\GitHub\Jeff_Data_Wrangling\y_data' already exists
Now I can save y within this folder:
saveRDS(y, file.path(new_path,'y.RDS'))
1.10 readRDS
You can load R data files by using readRDS:
z <- readRDS(file.path(new_path, 'y.RDS'))
y == z
#> [1] TRUE
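The save-and-restore round trip can also be tried without touching the working directory. This sketch writes to a temporary file from tempfile(); the object names (scores, tmp_file, restored) are assumptions for this illustration.

```r
scores   <- c(a = 1.5, b = 2.5)           # hypothetical object to save
tmp_file <- tempfile(fileext = ".RDS")    # throwaway path in the session temp dir

saveRDS(scores, tmp_file)
restored <- readRDS(tmp_file)

round_trip_ok <- identical(scores, restored)   # TRUE: values and names survive
unlink(tmp_file)                               # clean up the temporary file
```

Unlike save and load, readRDS returns the object, so it can be assigned to any name.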
1.11 Remove Items from your environment
WARNING: once something is removed from your environment it is GONE! SAVE WHAT YOU NEED with saveRDS; otherwise you will need to rerun portions of code.
You can remove items in your R environment by using the rm function.
rm(my_dataframe2)
1.11.1 Remove all but a list
Let’s say that we want to remove everything except for my_dataframe and new_path, then we might do something like:
keep <- c('my_dataframe','new_path')
rm(list = setdiff(ls(), keep))
Above we are using multiple functions in conjunction with one another:
- We make a variable called keep to contain the items we wish to maintain
- ls is returning a list of items within the environment
- setdiff is taking the set-difference between ls, which returns all the items within the environment, and the items that we have specified within keep.
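The keep-list pattern can be tried safely in a scratch environment so that nothing in the real workspace is lost. The sketch below assumes a throwaway environment called sandbox and three dummy objects; none of these names come from the text above.

```r
# A scratch environment, so the global workspace is untouched
sandbox <- new.env()
assign("a", 1, envir = sandbox)
assign("b", 2, envir = sandbox)
assign("c", 3, envir = sandbox)

keep_names <- c("a", "c")

# Remove every object in sandbox whose name is not in keep_names
rm(list = setdiff(ls(sandbox), keep_names), envir = sandbox)

remaining <- ls(sandbox)   # "a" "c"
```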
1.11.2 Delete Files / Folders
unlink deletes a file; to delete a directory we must also pass recursive = TRUE:
unlink(new_path, recursive = TRUE)
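The effect of the recursive argument can be checked on a throwaway folder. This is a minimal sketch that assumes a hypothetical directory named unlink_demo inside the session's tempdir().

```r
demo_dir <- file.path(tempdir(), "unlink_demo")   # hypothetical scratch folder
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, "a.txt"))

# Without recursive = TRUE, unlink leaves directories in place
unlink(demo_dir)
still_there <- dir.exists(demo_dir)    # TRUE

# With recursive = TRUE, the directory and its contents are deleted
unlink(demo_dir, recursive = TRUE)
gone_now <- !dir.exists(demo_dir)      # TRUE
```

Because the deletion is permanent, it is worth printing the path and checking it before calling unlink with recursive = TRUE.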
1.12 Functions
A function is an R object that takes in arguments or parameters and follows the collection of statements to provide the requested return.
1.12.1 Example R functions.
The R function set.seed is designed to assist with pseudo-random number generation.
The R function rnorm will produce a given number of randomly generated numbers with a provided mean and standard deviation. Below we draw four samples, each of 1000 numbers:
- X, Y, and Z will be drawn from a distribution with mean of 0 and standard deviation of 1.
- X and Z will be drawn with the same seed, whereas Y will be drawn with a different seed.
- W will be drawn from a distribution with mean of 5 and standard deviation of 2.
{set.seed(12345)
X <- rnorm(1000, mean = 0, sd = 1)
}
{set.seed(5)
Y <- rnorm(1000, mean = 0, sd = 1)
W <- rnorm(1000, mean = 5, sd = 2)
}
{set.seed(12345)
Z <- rnorm(1000, mean = 0, sd = 1)
}
Here are the first 10 from each of the distributions:
cat("X \n")
#> X
X[1:10]
#> [1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875 -1.8179560
#> [7] 0.6300986 -0.2761841 -0.2841597 -0.9193220
cat("Y \n")
#> Y
Y[1:10]
#> [1] -0.84085548 1.38435934 -1.25549186 0.07014277 1.71144087 -0.60290798
#> [7] -0.47216639 -0.63537131 -0.28577363 0.13810822
cat("W \n")
#> W
W[1:10]
#> [1] 2.089891 7.489126 4.136106 5.013739 5.249102 4.180743 6.126827 8.213916
#> [9] 2.996097 6.018652
cat("Z \n")
#> Z
Z[1:10]
#> [1] 0.5855288 0.7094660 -0.1093033 -0.4534972 0.6058875 -1.8179560
#> [7] 0.6300986 -0.2761841 -0.2841597 -0.9193220
Note that X and Z are exactly the same since we used the same seed to generate the two sets of numbers, whereas Y is different even though we used the same parameters. For example:
X[546] == Z[546]
#> [1] TRUE
X[546] == Y[546]
#> [1] FALSE
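Rather than comparing one element at a time, identical can compare whole vectors. The sketch below repeats the idea with runif and an arbitrary seed of 2024; the seed value and the names first, second, and third are assumptions made for this example.

```r
set.seed(2024)
first <- runif(5)

set.seed(2024)       # resetting the seed restarts the same sequence
second <- runif(5)

third <- runif(5)    # no reset: these are the next draws in the sequence

same_after_reset    <- identical(first, second)   # TRUE
same_without_reset  <- identical(first, third)    # FALSE
```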
Even though X and Y are not the same, they are sampled from the same distribution, so we expect a t-test to find no significant difference between their means; here the p-value is well above .05:
t.test(X,Y)
#>
#> Welch Two Sample t-test
#>
#> data: X and Y
#> t = 0.6405, df = 1997.7, p-value = 0.5219
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -0.05938059 0.11697799
#> sample estimates:
#> mean of x mean of y
#> 0.04619816 0.01739946
By comparison, X and W were not sampled from the same distribution, so the p-value should be close to 0:
t.test(X,W)
#>
#> Welch Two Sample t-test
#>
#> data: X and W
#> t = -72.569, df = 1474.2, p-value < 2.2e-16
#> alternative hypothesis: true difference in means is not equal to 0
#> 95 percent confidence interval:
#> -5.237795 -4.962086
#> sample estimates:
#> mean of x mean of y
#> 0.04619816 5.14613865
We can see the difference between the distributions by using the hist function:
hist(X, col=rgb(1,0,0,0.5), main="Overlapping Histogram") # red
hist(Y, col=rgb(0,1,0,0.5), add=TRUE) # green
hist(W, col=rgb(0,0,1,0.5), add=TRUE) # blue
1.13 User defined functions
Many of you may recall the quadratic formula from algebra; given a general quadratic equation of the form:
\[ax^2 + bx + c = 0 \]
with \(a\), \(b\), and \(c\) representing constants with \(a \neq 0\) then
\[x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}\]
are the two solutions or roots of the quadratic equation.
We can program an R function that takes in as inputs the coefficients from a quadratic equation and returns the two solutions:
quadratic_formula <- function(a,b,c){
  x_1 <- (-b + sqrt(b^2 - 4*a*c))/(2*a)
  x_2 <- (-b - sqrt(b^2 - 4*a*c))/(2*a)
  return(c(x_2,x_1))
}
quadratic_formula(1,5,6)
#> [1] -3 -2
quadratic_formula(5,3,-10)
#> [1] -1.745683 1.145683
quadratic_formula(2,5,3)
#> [1] -1.5 -1.0
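The function above returns NaN with a warning when the discriminant b^2 - 4ac is negative, because sqrt of a negative real number is undefined in R. One possible extension, sketched here under the hypothetical name quadratic_roots, converts the discriminant to a complex number first so that complex roots are returned instead.

```r
# Hypothetical variant that also handles negative discriminants
quadratic_roots <- function(a, b, c) {
  stopifnot(a != 0)                    # otherwise the equation is not quadratic
  discriminant <- b^2 - 4 * a * c
  # sqrt() of a complex input returns a complex result instead of NaN
  root <- sqrt(as.complex(discriminant))
  c((-b - root) / (2 * a), (-b + root) / (2 * a))
}

real_case    <- quadratic_roots(1, 5, 6)   # -3+0i -2+0i
complex_case <- quadratic_roots(1, 0, 1)   # 0-1i 0+1i
```

The results are always of complex type; Re() can be applied when the imaginary parts are known to be zero.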
1.14 Libraries & Packages
Sometimes you may want to use newer or more specialized packages or libraries to read in or handle data. First you will need to make sure the library is installed with install.packages('package_name'); then you can load the library with library('package_name'), and you will have access to all of the functionality within the package.
Additionally, we can access a function from a library without loading the entire library. This can be done by using a command such as package::function; for instance, readr::read_csv tells R to look in the readr package for the function read_csv. This can be useful in programming functions and packages as well as if multiple packages contain functions with the same name.
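The package::function notation can be tried with packages that ship with every R installation, so nothing needs to be installed first. The example below uses stats::sd and utils::head; the input values are made up for this illustration.

```r
# Hypothetical sample values
values <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Call sd from the stats package and head from the utils package explicitly
spread      <- stats::sd(values)
first_three <- utils::head(letters, 3)   # "a" "b" "c"
```

Writing the package name explicitly also documents where each function comes from, which can help readers of longer scripts.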
1.14.1 installed packages
To see which packages are installed you can use:
# which packages are installed
library()$results[,1]
sample(library()$results[,1], 20, replace = FALSE)
#> [1] "BH" "tcltk" "chron" "googlesheets4"
#> [5] "cowplot" "gsubfn" "qap" "askpass"
#> [9] "TSP" "processx" "translations" "ranger"
#> [13] "lifecycle" "methods" "grid" "ggsignif"
#> [17] "rsconnect" "httpuv" "yardstick" "backports"
The following function will check whether a list of packages is already installed; if not, it will install them:
# install package if missing
install_if_not <- function( list.of.packages ) {
  new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
  if(length(new.packages)) { install.packages(new.packages, repos = "http://cran.us.r-project.org") } else { print(paste0("the package '", list.of.packages , "' is already installed")) }
}
You can use the function like this:
# test function
install_if_not(c("tidyverse"))
#> [1] "the package 'tidyverse' is already installed"
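A lighter-weight way to test for a single package, without building the full installed.packages() table, is requireNamespace from base R. In this sketch, "notARealPackage12345" is a deliberately made-up name that is assumed not to be installed.

```r
# TRUE: the stats package ships with R
has_stats <- requireNamespace("stats", quietly = TRUE)

# FALSE, assuming no package with this made-up name is installed
has_fake <- requireNamespace("notARealPackage12345", quietly = TRUE)
```

requireNamespace loads the package namespace when it is found but does not attach it, so the search path is left unchanged.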
Some additional information on the installed packages, including the version, can be found with:
tibble::as_tibble(installed.packages())
#> # A tibble: 305 x 16
#> Package LibPath Version Priority Depends Imports LinkingTo Suggests Enhances
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 abind C:/User~ 1.4-5 <NA> R (>= ~ method~ <NA> <NA> <NA>
#> 2 Amelia C:/User~ 1.8.0 <NA> R (>= ~ foreig~ Rcpp (>=~ "tcltk,~ <NA>
#> 3 AMR C:/User~ 1.7.1 <NA> R (>= ~ <NA> <NA> "cleane~ <NA>
#> 4 arsenal C:/User~ 3.6.3 <NA> R (>= ~ knitr ~ <NA> "broom ~ <NA>
#> 5 askpass C:/User~ 1.1 <NA> <NA> sys (>~ <NA> "testth~ <NA>
#> 6 assertt~ C:/User~ 0.2.1 <NA> <NA> tools <NA> "testth~ <NA>
#> # ... with 299 more rows, and 7 more variables: License <chr>,
#> # License_is_FOSS <chr>, License_restricts_use <chr>, OS_type <chr>,
#> # MD5sum <chr>, NeedsCompilation <chr>, Built <chr>
1.14.2 loaded packages
To check which packages are loaded you can use:
# loaded packages
(.packages())
#> [1] "stats" "graphics" "grDevices" "utils" "datasets" "methods"
#> [7] "base"
Make sure that the tidyverse and dplyr packages are installed. You can run install.packages(c('tidyverse','dplyr')) to install both.
1.14.3 read sample data
my_dataframe <- readRDS('my_dataframe.RDS')
We can use the head command to see the first few rows:
head(my_dataframe)
#> list_of_ints list_of_strings list_of_logicals list_of_mixed_type
#> 1 2 Data TRUE 3.14
#> 2 4 Science FALSE 1
#> 3 6 is a TRUE cat
#> 4 8 blast! FALSE TRUE
#> list_of_numbers list_of_sexes ordered_list_of_costs
#> 1 3.142857 Male $0 - $100
#> 2 18.000000 Female $100 - $200
#> 3 42.000000 Female $200 - $300
#> 4 65.200000 Male $300 - $400
1.15 Using Packages
We make the functions in an installed package available to the code that follows by using the library command:
loaded_package_before <- (.packages())
# everything below here can call functions in dplyr package
library('dplyr')
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints <int> 2, 4, 6, 8
#> $ list_of_strings <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
(.packages())
#> [1] "dplyr" "stats" "graphics" "grDevices" "utils" "datasets"
#> [7] "methods" "base"
loaded_package_after <- (.packages())
setdiff(loaded_package_after, loaded_package_before)
#> [1] "dplyr"
detach(package:dplyr)
# the dplyr package has now been detached; calls to its functions may now error
packages_cur <- (.packages())
setdiff(loaded_package_after,packages_cur)
#> [1] "dplyr"
Additionally, we can access a function from a library without loading the entire library. This can be done by using a command such as package::function. This notation is needed in any instance where two or more loaded packages have at least one function with the same name. This notation is also useful in development of functions and packages.
For instance, the glimpse function from the dplyr package can also be accessed by using the following command:
# glimpse the data
dplyr::glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints <int> 2, 4, 6, 8
#> $ list_of_strings <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
Notice that without the library loaded or the dplyr:: prefix in front, a call like the following can produce an error:
# error
glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints <int> 2, 4, 6, 8
#> $ list_of_strings <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
After first detaching a package with detach(package:package.name.here) we can check for an update from the console with install.packages(c("package.name.here")):
install.packages(c('dplyr'), repos = "http://cran.us.r-project.org")
1.16 Welcome to the tidyverse
The tidyverse is a collection of R packages that have been grouped together in order to make “data wrangling” more efficient:
library('tidyverse')
#> -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
#> v ggplot2 3.3.3 v purrr 0.3.4
#> v tibble 3.1.2 v stringr 1.4.0
#> v tidyr 1.1.3 v forcats 0.5.1
#> v readr 1.4.0
#> -- Conflicts ------------------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
- tibble is a more modern R version of a data frame
- readr read in data-files such as CSV as tibbles or data-frames, write data-frames or tibbles to CSV or supported data-file type.
- dplyr is a popular package to manage many common data manipulation tasks and data summaries
- tidyr reshape your data but keep it ‘tidy’
- stringr functions to support working with text strings.
- purrr functional programming for R
- ggplot2 is an interface for R to create numerous types of graphics of data
- forcats tools for dealing with categorical data
1.16.1 and friends
- broom make outputs tidy
- lubridate working with dates
- readxl the readxl package contains the read_excel function to read in .xls or .xlsx files
- knitr produce outputs such as HTML, PDF, Docs, PowerPoint, with Rmarkdown
- shiny build interactive web apps straight from R
- flexdashboard build dashboards with R
- furrr parallel mapping
- yardstick for model evaluation metrics
tidyverse_friends <- c('broom','lubridate','readxl','knitr','shiny','furrr','flexdashboard','yardstick')
install.packages(tidyverse_friends)
1.16.2 pipe operator
The pipe %>% operator originates from the magrittr package. The pipe takes the value on its left and passes it as the first argument to the function on its right:
f(x) is the same as x %>% f()
Notice how x gets piped into a function f. For example:
glimpse(my_dataframe)
#> Rows: 4
#> Columns: 7
#> $ list_of_ints <int> 2, 4, 6, 8
#> $ list_of_strings <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
is the same as
my_dataframe %>%
  glimpse()
#> Rows: 4
#> Columns: 7
#> $ list_of_ints <int> 2, 4, 6, 8
#> $ list_of_strings <chr> "Data", "Science", "is a", "blast!"
#> $ list_of_logicals <lgl> TRUE, FALSE, TRUE, FALSE
#> $ list_of_mixed_type <chr> "3.14", "1", "cat", "TRUE"
#> $ list_of_numbers <dbl> 3.142857, 18.000000, 42.000000, 65.200000
#> $ list_of_sexes <chr> "Male", "Female", "Female", "Male"
#> $ ordered_list_of_costs <ord> $0 - $100, $100 - $200, $200 - $300, $300 - $400
Note that this pipe operation has become so popular that R, as of version 4.1.0, now comes equipped with a pipe operator of its own:
1:10 |> mean()
#> [1] 5.5
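Pipes become more useful when several steps are chained, since the code then reads left to right instead of inside out. A minimal sketch with the native pipe, which assumes R 4.1.0 or later and uses made-up input values:

```r
# Nested form: read from the inside out
nested_total <- sum(sqrt(c(1, 4, 9, 16)))          # 10

# Piped form: the same steps, read left to right
piped_total <- c(1, 4, 9, 16) |> sqrt() |> sum()   # 10
```

Unlike %>%, the native pipe needs no package to be loaded, but the function on the right-hand side must be written as a call with parentheses.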
1.16.3 Other Packages
Other packages we might make use of:
- devtools develop R packages
  - Note that devtools on Windows also requires Rtools https://cran.r-project.org/bin/windows/Rtools/
- arsenal compare dataframes; create summary tables
- skimr automate Exploratory Data Analysis
- DataExplorer automate Exploratory Data Analysis
- rsq computes Adjusted R2 for various model types
- RSQLite R package for interfacing with SQLite database
- dbplyr database back-end for dplyr
- plotly interactive HTML plots
- DT contains datatable function for interactive HTML datatable
- GGally contains ggcorr for correlation plots and ggpairs for other data-plots
- corrr correlation matrix as a data-frame
- AMR Principal Component Plots
- factoextra for k-means clustering
- randomForest fit a Random Forest model
- caret (Classification And Regression Training) is a set of functions that attempt to streamline the process for creating predictive models.
other_packages <- c('devtools','rsq','arsenal','skimr','DataExplorer',
                    'RSQLite','dbplyr','plotly','DT','GGally','corrr',
                    'AMR','factoextra','caret','randomForest')
install.packages(other_packages)
1.17 Package Versions
We already mentioned that install.packages will update the package from CRAN:
install.packages( c( tidyverse_friends , other_packages ), repos = "http://cran.us.r-project.org")
We can also use devtools to install the most up-to-date package from GitHub, for example:
devtools::install_github("tidyverse/tidyverse")
Here are the versions installed on this system, compare with your own:
as_tibble(installed.packages()) %>%
  select(Package, Version, Depends) %>%
  filter(Package %in% c( c('tidyverse'),
                         c('tibble','readr','dplyr','tidyr','stringr','purrr','ggplot2','forcats'),
                         c( tidyverse_friends , other_packages ) )) %>%
  knitr::kable()
Package | Version | Depends |
---|---|---|
AMR | 1.7.1 | R (>= 3.0.0) |
arsenal | 3.6.3 | R (>= 3.4.0), stats (>= 3.4.0) |
broom | 0.7.6 | R (>= 3.1) |
caret | 6.0-88 | R (>= 3.2.0), lattice (>= 0.20), ggplot2 |
corrr | 0.4.3 | R (>= 3.3.0) |
DataExplorer | 0.8.2 | R (>= 3.6) |
dbplyr | 2.1.1 | R (>= 3.1) |
devtools | 2.4.2 | R (>= 3.0.2), usethis (>= 2.0.1) |
dplyr | 1.0.6 | R (>= 3.3.0) |
DT | 0.18 | NA |
factoextra | 1.0.7 | R (>= 3.1.2), ggplot2 (>= 2.2.0) |
flexdashboard | 0.5.2 | R (>= 3.0.2) |
forcats | 0.5.1 | R (>= 3.2) |
furrr | 0.2.2 | future (>= 1.19.1), R (>= 3.2.0) |
GGally | 2.1.1 | R (>= 3.1), ggplot2 (>= 3.3.0) |
ggplot2 | 3.3.3 | R (>= 3.2) |
knitr | 1.33 | R (>= 3.2.3) |
lubridate | 1.7.10 | methods, R (>= 3.2) |
plotly | 4.9.4 | R (>= 3.2.0), ggplot2 (>= 3.0.0) |
purrr | 0.3.4 | R (>= 3.2) |
randomForest | 4.6-14 | R (>= 3.2.2), stats |
readr | 1.4.0 | R (>= 3.1) |
readxl | 1.3.1 | NA |
rsq | 2.2 | NA |
RSQLite | 2.2.7 | R (>= 3.1.0) |
shiny | 1.6.0 | R (>= 3.0.2), methods |
skimr | 2.1.3 | R (>= 3.1.2) |
stringr | 1.4.0 | R (>= 3.1) |
tibble | 3.1.2 | R (>= 3.1.0) |
tidyr | 1.1.3 | R (>= 3.1) |
tidyverse | 1.3.1 | R (>= 3.3) |
yardstick | 0.0.8 | R (>= 2.10) |
You can install specific versions of packages with: devtools::install_version("my.package.name", version = "0.9.1")
1.18 Details on this machine’s version of R
tibble::enframe(Sys.info()) %>%
  filter(name %in% c('sysname','release','version','machine')) %>%
  knitr::kable()
name | value |
---|---|
sysname | Windows |
release | 10 x64 |
version | build 19042 |
machine | x86-64 |
tibble::as_tibble(R.Version()) %>%
  pivot_longer(everything()) %>%
  knitr::kable()
name | value |
---|---|
platform | x86_64-w64-mingw32 |
arch | x86_64 |
os | mingw32 |
system | x86_64, mingw32 |
status | |
major | 4 |
minor | 1.0 |
year | 2021 |
month | 05 |
day | 18 |
svn rev | 80317 |
language | R |
version.string | R version 4.1.0 (2021-05-18) |
nickname | Camp Pontanezen |
1.18.1 sessionInfo
All the details about the current running session of R:
sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 19042)
#>
#> Matrix products: default
#>
#> locale:
#> [1] LC_COLLATE=English_United States.1252
#> [2] LC_CTYPE=English_United States.1252
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C
#> [5] LC_TIME=English_United States.1252
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] bookdown_0.22 digest_0.6.27 R6_2.5.0 jsonlite_1.7.2
#> [5] magrittr_2.0.1 evaluate_0.14 stringi_1.6.1 rlang_0.4.11
#> [9] jquerylib_0.1.4 bslib_0.2.5.1 rmarkdown_2.8 tools_4.1.0
#> [13] stringr_1.4.0 xfun_0.23 yaml_2.2.1 rsconnect_0.8.18
#> [17] compiler_4.1.0 htmltools_0.5.1.1 knitr_1.33 sass_0.4.0