Chapter 3 R Basics
We now start with the basics of R. If you have any experience at all with R, you can probably skip this section.
First, make sure you work with the RStudio IDE. Some useful pointers for this IDE include:
- Ctrl+Return(Enter) to run lines from editor.
- Alt+Shift+k for RStudio keyboard shortcuts.
- Ctrl+r to browse the command history.
- Alt+Shift+j to navigate between code sections.
- Tab for auto-completion.
- Ctrl+1 to skip to editor.
- Ctrl+2 to skip to console.
- Ctrl+8 to skip to the environment list.
- Code Folding:
  - Alt+l collapse chunk.
  - Alt+Shift+l unfold chunk.
  - Alt+o collapse all.
  - Alt+Shift+o unfold all.
- Alt+“-” for the assignment operator <-.
3.0.1 Other IDEs
Currently, I recommend RStudio, but here are some other IDEs:
Jupyter Lab: a very promising IDE, originally designed for Python, that also supports R. At the time of writing, it seems that RStudio is more convenient for R, but it is definitely an IDE to follow closely. See Max Woolf’s review.
Eclipse: If you are a Java programmer, you are probably familiar with Eclipse, which does have an R plugin: StatEt.
Emacs: If you are an Emacs fan, you can find an R plugin: ESS.
Vim: Vim-R.
Visual Studio also supports R. If you need R for commercial purposes, it may be worthwhile trying Microsoft’s R, instead of the usual R. See here for installation instructions.
3.1 File types
The file types you need to know when using R are the following:
- .R: An ASCII text file containing R scripts only.
- .Rmd: An ASCII text file. If opened in RStudio, it can be run as an R-Notebook or compiled using knitr, bookdown, etc.
3.2 Simple calculator
R can be used as a simple calculator. Create a new R Notebook (.Rmd file) within RStudio using File-> New -> R Notebook, and run the following commands.
## [1] 15
## [1] 5670
## [1] 16
## [1] 16
## [1] 2.302585
## [1] 4
## [1] 3
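The commands behind the output above are not shown. As a minimal sketch, the following calculator-style expressions (the specific expressions are an assumption) yield the same values:

```r
# Assumed expressions; each comment gives the value R prints.
10 + 5          # 15
70 * 81         # 5670
2^4             # 16 (exponentiation)
2**4            # 16 (`**` is parsed as a synonym of `^`)
log(10)         # 2.302585 (natural logarithm)
log(16, 2)      # 4 (logarithm of 16 to base 2)
log(1000, 10)   # 3 (logarithm of 1000 to base 10)
```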
3.3 Probability calculator
R can be used as a probability calculator. You probably wish you knew this when you did your Intro To Probability classes.
The Binomial density function:
## [1] 0.1171875
Notice that arguments do not need to be named explicitly:
## [1] 0.1171875
The Binomial cumulative distribution function (CDF):
## [1] 0.171875
The Binomial quantile function:
## [1] 3
Generate random variables:
## [1] 4 4 5 7 4 7 7 6 6 3
R has many built-in distributions. Their names may change, but the prefixes do not:
- d prefix for the density function.
- p prefix for the cumulative distribution function (CDF).
- q prefix for the quantile function (i.e., the inverse CDF).
- r prefix to generate random samples.
Demonstrating this idea, using the CDF of several popular distributions:
- pbinom() for the Binomial CDF.
- ppois() for the Poisson CDF.
- pnorm() for the Gaussian CDF.
- pexp() for the Exponential CDF.
For more information see ?distributions.
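The calls behind the output in this section are not shown. A sketch of the four prefixes at work follows; the Binomial parameters (size 10, probability 0.5) are an assumption, chosen because they are consistent with the values printed above.

```r
# Binomial(size = 10, prob = 0.5); parameter values are assumed.
dbinom(x = 3, size = 10, prob = 0.5)       # density: P(X = 3) = 0.1171875
dbinom(3, 10, 0.5)                         # same call, positional arguments
pbinom(q = 3, size = 10, prob = 0.5)       # CDF: P(X <= 3) = 0.171875
qbinom(p = 0.1718, size = 10, prob = 0.5)  # quantile: smallest x with CDF >= p, here 3
set.seed(1)                                # fix the seed so the draws can be repeated
rbinom(n = 10, size = 10, prob = 0.5)      # ten random draws

# The same prefixes work for other distributions.
ppois(q = 2, lambda = 1)   # Poisson CDF
pnorm(q = 0)               # standard Gaussian CDF: 0.5
pexp(q = 1, rate = 1)      # Exponential CDF: 1 - exp(-1)
```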
3.4 Getting Help
One of the most important parts of working with a language is knowing where to find help. R has several in-line facilities, besides the various help resources in the R ecosystem.
Get help for a particular function:
If you don’t know the name of the function you are looking for, search local help files for a particular string:
Or load a menu where you can navigate local help in a web-based fashion:
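The three facilities just described can be sketched as follows. The function name `mean` and the search string "quantile" are arbitrary choices made for illustration.

```r
?mean                     # help page of a function whose name you know
help("mean")              # the same request, written as a function call
help.search("quantile")   # search the local help files for a string (same as ??quantile)
if (interactive()) {
  help.start()            # open the browsable HTML help index
}
```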
3.5 Variable Assignment
Assignment of some output into an object named “x”:
If you are familiar with other programming languages you may prefer the = assignment rather than the <- assignment.
We recommend you make the effort to change your preferences.
This is because thinking with <- helps when reading your code and distinguishes between assignments and function arguments: think of function(argument=value) versus function(argument<-value).
It also helps to understand special assignment operators such as <<- and ->.
To print the contents of an object just type its name:
## [1] 7 4 6 3 4 5 2 5 7 4
which is an implicit call to
## [1] 7 4 6 3 4 5 2 5 7 4
Alternatively, you can assign and print simultaneously using parentheses:
## [1] 5 5 5 4 6 6 6 3 6 5
Operate on the object:
## [1] 5.1
## [1] 0.9888889
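The commands for this section are not shown. A minimal sketch of the workflow described above follows; generating `x` with `rbinom` and summarizing it with `mean` and `var` are assumptions, and the random draws will differ from the output printed above.

```r
set.seed(1)                                   # only to make the draws repeatable
x <- rbinom(n = 10, size = 10, prob = 0.5)    # assign ten Binomial draws to x
x                                             # typing the name prints the object
print(x)                                      # the explicit call that typing x triggers
(x <- rbinom(n = 10, size = 10, prob = 0.5))  # parentheses: assign and print at once
mean(x)                                       # operate on the object: sample mean
var(x)                                        # sample variance
```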
R saves every object you create in RAM. The collection of all such objects is the workspace, which you can inspect with
## [1] "a" "A"
## [3] "allphat" "allphat_m"
## [5] "allq" "allqp"
## [7] "allqp_m" "auto"
## [9] "b" "B"
## [11] "big1" "borders"
## [13] "borders_data" "bst1"
## [15] "bst2" "bstSparse"
## [17] "c" "clean.example"
## [19] "compute_remoteness" "condition1"
## [21] "condition2" "coords"
## [23] "coords_trn" "coords_tst"
## [25] "createmap" "CV_LME_LGP"
## [27] "CV_split" "d"
## [29] "db" "db_test"
## [31] "db_train" "db_trn"
## [33] "db_tst" "db0us"
## [35] "df" "di"
## [37] "distributions1" "distributions2"
## [39] "distributions3" "dtest"
## [41] "dtest_nw" "dtrain"
## [43] "dtrain_nw" "e"
## [45] "estimate_dist" "EuStockMarkets"
## [47] "evalerror" "example1"
## [49] "example1_1" "example1_2"
## [51] "example1_3" "example2"
## [53] "features" "fib"
## [55] "fileURL" "frame1"
## [57] "glm.1" "grid_n"
## [59] "i" "in_train"
## [61] "intest" "intrn"
## [63] "j" "ker"
## [65] "long" "loss"
## [67] "m" "m_glm"
## [69] "m_lm" "m_lmm"
## [71] "m_xgb" "make_data"
## [73] "make_distributions" "missing.example"
## [75] "my.data" "my.sum"
## [77] "my.sum.2" "nontrn"
## [79] "num_round" "p"
## [81] "p_xgb" "p1"
## [83] "p1_xgb" "p2"
## [85] "p2_xgb" "p3_lmm"
## [87] "param1" "param2"
## [89] "phat" "Pima.te"
## [91] "pp" "preproc"
## [93] "q" "quantile.25"
## [95] "results_remotness_plot" "risk"
## [97] "rlogo" "run_cv"
## [99] "s" "sample_over_space"
## [101] "sample_P" "sample_Q0"
## [103] "scoping" "sigmoid"
## [105] "start" "stn"
## [107] "stn_trn" "stn_tst"
## [109] "t" "temp.file.name"
## [111] "test_data" "test_inla_LGP"
## [113] "test_labels" "the.list"
## [115] "time1" "time2"
## [117] "time3" "tirgul1"
## [119] "today" "train_data"
## [121] "train_inla_LGP" "train_labels"
## [123] "train_weights" "watchlist1"
## [125] "watchlist2" "weight.factor"
## [127] "weightedloss" "x"
## [129] "X" "X_tst"
## [131] "xindex" "y"
## [133] "y_hat_glm" "y_hat_glm_binar"
## [135] "y_hat_lm" "y_hat_lm_binar"
## [137] "y_tst" "yhat_glm"
## [139] "yhat_glm_binar" "yhat_glm_binar_0.3"
## [141] "yindex"
or with Ctrl+8 in RStudio.
If you lost your object, you can use ls with a text pattern to search for it:
## [1] "clean.example" "example1" "example1_1"
## [4] "example1_2" "example1_3" "example2"
## [7] "m_xgb" "missing.example" "p_xgb"
## [10] "p1_xgb" "p2_xgb" "x"
## [13] "xindex" "yindex"
To remove objects from the workspace:
## [1] "a" "A"
## [3] "allphat" "allphat_m"
## [5] "allq" "allqp"
## [7] "allqp_m" "auto"
## [9] "b" "B"
## [11] "big1" "borders"
## [13] "borders_data" "bst1"
## [15] "bst2" "bstSparse"
## [17] "c" "clean.example"
## [19] "compute_remoteness" "condition1"
## [21] "condition2" "coords"
## [23] "coords_trn" "coords_tst"
## [25] "createmap" "CV_LME_LGP"
## [27] "CV_split" "d"
## [29] "db" "db_test"
## [31] "db_train" "db_trn"
## [33] "db_tst" "db0us"
## [35] "df" "di"
## [37] "distributions1" "distributions2"
## [39] "distributions3" "dtest"
## [41] "dtest_nw" "dtrain"
## [43] "dtrain_nw" "e"
## [45] "estimate_dist" "EuStockMarkets"
## [47] "evalerror" "example1"
## [49] "example1_1" "example1_2"
## [51] "example1_3" "example2"
## [53] "features" "fib"
## [55] "fileURL" "frame1"
## [57] "glm.1" "grid_n"
## [59] "i" "in_train"
## [61] "intest" "intrn"
## [63] "j" "ker"
## [65] "long" "loss"
## [67] "m" "m_glm"
## [69] "m_lm" "m_lmm"
## [71] "m_xgb" "make_data"
## [73] "make_distributions" "missing.example"
## [75] "my.data" "my.sum"
## [77] "my.sum.2" "nontrn"
## [79] "num_round" "p"
## [81] "p_xgb" "p1"
## [83] "p1_xgb" "p2"
## [85] "p2_xgb" "p3_lmm"
## [87] "param1" "param2"
## [89] "phat" "Pima.te"
## [91] "pp" "preproc"
## [93] "q" "quantile.25"
## [95] "results_remotness_plot" "risk"
## [97] "rlogo" "run_cv"
## [99] "s" "sample_over_space"
## [101] "sample_P" "sample_Q0"
## [103] "scoping" "sigmoid"
## [105] "start" "stn"
## [107] "stn_trn" "stn_tst"
## [109] "t" "temp.file.name"
## [111] "test_data" "test_inla_LGP"
## [113] "test_labels" "the.list"
## [115] "time1" "time2"
## [117] "time3" "tirgul1"
## [119] "today" "train_data"
## [121] "train_inla_LGP" "train_labels"
## [123] "train_weights" "watchlist1"
## [125] "watchlist2" "weight.factor"
## [127] "weightedloss" "X"
## [129] "X_tst" "xindex"
## [131] "y" "y_hat_glm"
## [133] "y_hat_glm_binar" "y_hat_lm"
## [135] "y_hat_lm_binar" "y_tst"
## [137] "yhat_glm" "yhat_glm_binar"
## [139] "yhat_glm_binar_0.3" "yindex"
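The listings above come from the author's own workspace, and the commands that produced them are not shown. On a small workspace, the same steps can be sketched as follows; the objects `x` and `example1` are created here only for illustration.

```r
x <- 1:3
example1 <- "a"      # hypothetical objects for this sketch
ls()                 # names of all objects in the workspace
ls(pattern = "x")    # names matching the regular expression "x": here x and example1
rm(x)                # remove x from the workspace
ls()                 # x no longer appears
```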
You may think that if an object is removed then its memory is freed. This is almost true, and depends on a negotiation mechanism between R and the operating system. R’s memory management is discussed in Chapter ??.
3.6 Missing
Unlike in typical programming exercises, when working with real-life data you may have missing values: measurements that were simply not recorded, not stored, etc. R has rather sophisticated mechanisms to deal with missing values. It distinguishes between the following types:
- NA: Not Available entries.
- NaN: Not a number.
R tries to defend the analyst, and returns an error, or NA, when the presence of missing values invalidates the calculation:
## [1] NA
Most functions have an inner mechanism to deal with these. In the mean function, there is an na.rm argument, telling R whether to remove NAs.
## [1] 11
A more general mechanism is removing these manually:
## [1] 11
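The commands behind the three outputs above are not shown. A sketch that reproduces them, assuming a small vector named `missing.example` with one missing entry:

```r
missing.example <- c(10, 11, 12, NA)             # assumed data; NA marks the missing entry
mean(missing.example)                            # NA: the missing value propagates
mean(missing.example, na.rm = TRUE)              # 11: NAs are dropped before averaging
mean(missing.example[!is.na(missing.example)])   # 11: manual removal with is.na()

0 / 0         # NaN: an undefined numerical result
is.na(NaN)    # TRUE: NaN also counts as missing
is.nan(NA)    # FALSE: NA is not NaN
```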
3.7 Piping
Because R originates in Unix and Linux environments, it inherits much of their flavor. Piping is an idea taken from the Linux shell, which allows using the output of one expression as the input to another. Piping thus makes code easier to read and write.
Prerequisites:
library(magrittr) # load the piping functions
x <- rbinom(n=1000, size=10, prob=0.5) # generate some toy data
Examples
x %>% var() # Instead of var(x)
x %>% hist() # Instead of hist(x)
x %>% mean() %>% round(2) %>% add(10)
The next example demonstrates the benefits of piping. The next two chunks of code do the same thing. Try parsing them in your mind:
# Functional (onion) style
car_data <-
  transform(aggregate(. ~ cyl,
                      data = subset(mtcars, hp > 100),
                      FUN = function(x) round(mean(x), 2)),
            kpl = mpg * 0.4251)
# Piping (magrittr) style
car_data <-
  mtcars %>%
  subset(hp > 100) %>%
  aggregate(. ~ cyl, data = ., FUN = . %>% mean %>% round(2)) %>%
  transform(kpl = mpg %>% multiply_by(0.4251)) %>%
  print
Tip: RStudio has a keyboard shortcut for the %>% operator. Try Ctrl+Shift+m.
3.8 Vector Creation and Manipulation
The most basic building block in R is the vector. We will now see how to create them, and access their elements (i.e. subsetting). Here are three ways to create the same arbitrary vector:
c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21) # manually
10:21 # the `:` operator
seq(from=10, to=21, by=1) # the seq() function with "by"
seq(from=10, to=21, len=12) # the seq() function with "len"
Let’s assign it to the object named “x”:
Operations usually work element-wise:
## [1] 12 13 14 15 16 17 18 19 20 21 22 23
## [1] 20 22 24 26 28 30 32 34 36 38 40 42
## [1] 100 121 144 169 196 225 256 289 324 361 400 441
## [1] 3.162278 3.316625 3.464102 3.605551 3.741657 3.872983 4.000000
## [8] 4.123106 4.242641 4.358899 4.472136 4.582576
## [1] 2.302585 2.397895 2.484907 2.564949 2.639057 2.708050 2.772589
## [8] 2.833213 2.890372 2.944439 2.995732 3.044522
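The commands behind the element-wise output above are not shown. Given the vector 10:21 defined earlier, the following sketch (the exact expressions are an assumption) yields the same five results:

```r
x <- 10:21   # assign the vector to x
x + 2        # add 2 to every element
x * 2        # double every element
x^2          # square every element
sqrt(x)      # element-wise square root
log(x)       # element-wise natural logarithm
```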
3.9 Search Paths and Packages
R can be easily extended with packages, which are merely sets of documented functions that can be loaded or unloaded conveniently.
Let’s look at the function read.csv.
We can see its contents by typing its name without parentheses:
## function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
## fill = TRUE, comment.char = "", ...)
## read.table(file = file, header = header, sep = sep, quote = quote,
## dec = dec, fill = fill, comment.char = comment.char, ...)
## <bytecode: 0x55fc9026d6a0>
## <environment: namespace:utils>
Never mind what the function does.
Note the environment: namespace:utils line at the end.
It tells us that this function is part of the utils package.
We did not need to know this because it is loaded by default.
Here are some packages that I have currently loaded:
## [1] ".GlobalEnv" "package:gridExtra" "package:reshape2"
## [4] "package:RCurl" "package:bitops" "package:mlbench"
Other packages can be loaded via the library function, or downloaded from the internet using the install.packages function before loading with library.
R’s package import mechanism is quite powerful, and is one of the reasons for R’s success.
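The commands for this section are not shown. A short sketch of inspecting the search path and loading packages follows; the package name `magrittr` is only an example, and the install and load lines are commented out because they need an internet connection or an installed package.

```r
search()                         # attached packages and environments, in lookup order
environment(read.csv)            # the namespace a function lives in: utils
# install.packages("magrittr")   # download and install from CRAN (run once)
# library(magrittr)              # attach an installed package to the search path
utils::read.csv                  # pkg::name reaches a function without attaching pkg
```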
3.10 Simple Plotting
R has many plotting facilities as we will further detail in the Plotting Chapter 6.
We start with the simplest facilities, namely, the plot function from the graphics package, which is loaded by default.
Given an x argument and a y argument, plot tries to present a scatter plot.
We call this the x,y syntax.
R has another unique syntax to state functional relations.
We call y~x the “tilde” syntax, which originates in the work of Wilkinson and Rogers (1973) and was adopted in the early days of S.
The syntax y~x is read as “y is a function of x”.
We will prefer the y~x syntax over the x,y syntax since it is easier to read, and will be very useful when we discuss more complicated models.
Here are some arguments that control the plot’s appearance.
We use type to control the plot type, and main to control the main title.
We use xlab for the x-axis label, and ylab for the y-axis label.
We use pch to control the point type.
We use col to control the color and cex for the point size. The abline function adds a straight line to an existing plot.
For more plotting options run these
When your plotting gets serious, go to Chapter 6.
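The plotting commands of this section are not shown. The following sketch exercises the two syntaxes and the appearance arguments named above; the toy data and all argument values are assumptions.

```r
x <- 1:100
y <- 3 + sin(x)              # toy data, assumed for this sketch

plot(x = x, y = y)           # the x,y syntax
plot(y ~ x)                  # the tilde syntax: y as a function of x
plot(y ~ x, type = "l", main = "A connected line")
plot(y ~ x, pch = 10, col = "blue", cex = 1.5,
     xlab = "x values", ylab = "3 + sin(x)")
abline(h = 3, col = "red")   # add a horizontal line at y = 3 to the last plot

# For more plotting options, try:
# example(plot); ?plot.default; ?par
```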
3.11 Object Types
We already saw that the basic building block of R objects is the vector. Vectors can be of the following types:
- character Where each element is a string, i.e., a sequence of alphanumeric symbols.
- numeric Where each element is a real number in double precision floating point format.
- integer Where each element is an integer.
- logical Where each element is either TRUE, FALSE, or NA.
- complex Where each element is a complex number.
- list Where each element is an arbitrary R object.
- factor Factors are not actually vector objects, but they feel like such. They are used to encode any finite set of values. This will be very useful when fitting linear models because they include information on contrasts, i.e., on the encoding of the factor’s levels. You should always be alert and recall whether you are dealing with a factor or with a character vector. They have different behaviors.
- Date for dates.
- POSIXct and POSIXt for times.
Vectors can be combined into larger objects.
A matrix can be thought of as the binding of several vectors of the same type.
In reality, a matrix is merely a vector with a dimension attribute, that tells R to read it as a matrix and not a vector.
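As an illustration of the types listed above, and of a matrix being a vector with a dimension attribute, here is a small sketch. All values are arbitrary examples.

```r
class("a")                   # "character"
class(1.5)                   # "numeric"
class(1L)                    # "integer" (the L suffix requests an integer)
class(TRUE)                  # "logical"
class(1 + 2i)                # "complex"
class(list(1, "a"))          # "list"
class(factor(c("a", "b")))   # "factor"
class(Sys.Date())            # "Date"

v <- 1:6
dim(v) <- c(2, 3)            # adding a dimension attribute...
is.matrix(v)                 # ...makes R read the vector as a 2 x 3 matrix
attributes(v)                # the only attribute is dim
```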
3.12 Vectors and Matrices Linear Algebra
Vectors can be involved in algebraic operations. Here are some examples of vector multiplication:
x <- rnorm(4) # generating 4 standard normal random variables
y <- exp(-x) # taking the exponential of -x
x * y
## [1] -1.5412664 0.2173503 -2.1436542 0.1689432
## [,1]
## [1,] -3.298627
## [,1]
## [1,] -3.298627
## [,1] [,2] [,3] [,4]
## [1,] -1.5412664 -0.5513476 -1.7862644 -0.5988587
## [2,] 0.6075926 0.2173503 0.7041748 0.2360800
## [3,] -1.8496379 -0.6616595 -2.1436542 -0.7186764
## [4,] 0.4348046 0.1555399 0.5039206 0.1689432
Note that * implements an entrywise product, whereas %*% is a dot product. When it is not specified whether the vectors are horizontal or vertical, R treats the operation as \((1 \times n) * (n \times 1)\). Note that t() is the vector/matrix transpose.
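Only the first command of this example survives in the text. A sketch of commands consistent with the four outputs above (a vector, two 1 x 1 matrices, and a 4 x 4 matrix) follows; the last three expressions are assumptions, and because the data is random the numbers will differ.

```r
set.seed(1)      # only for repeatability; the output above came from another draw
x <- rnorm(4)
y <- exp(-x)

x * y            # entrywise product: a vector of length 4
x %*% y          # inner product: a 1 x 1 matrix
t(x) %*% y       # the same inner product, with the transpose written out
x %*% t(y)       # outer product: a 4 x 4 matrix
```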
The same holds for vector-matrix multiplication. Take a look at the following examples:
## [1] 1 1 1 1 1
(A <- matrix(data = rep(1:5,5), nrow = 5, ncol = 5, byrow = TRUE)) # generating a 5 by 5 matrix, where each row is the vector 1:5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 1 2 3 4 5
## [3,] 1 2 3 4 5
## [4,] 1 2 3 4 5
## [5,] 1 2 3 4 5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 1 2 3 4 5
## [3,] 1 2 3 4 5
## [4,] 1 2 3 4 5
## [5,] 1 2 3 4 5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 5 10 15 20 25
## [,1]
## [1,] 15
## [2,] 15
## [3,] 15
## [4,] 15
## [5,] 15
Note that R transforms x to the correct (possible) dimension. If you want to fix the dimension of a vector, turn it into a column/row matrix (as.matrix(x)/as.matrix(t(x))). We now move to some matrix multiplications:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 1 1.5 2 2.5
## [2,] 0.5 1 1.5 2 2.5
## [3,] 0.5 1 1.5 2 2.5
## [4,] 0.5 1 1.5 2 2.5
## [5,] 0.5 1 1.5 2 2.5
## [,1] [,2] [,3] [,4] [,5]
## [1,] 55 55 55 55 55
## [2,] 55 55 55 55 55
## [3,] 55 55 55 55 55
## [4,] 55 55 55 55 55
## [5,] 55 55 55 55 55
## [,1]
## [1,] 75
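Most commands of this example are missing from the text. The following sketch is consistent with the printed results (5 10 15 20 25, a column of 15s, the halved matrix, the matrix of 55s, and 75); the exact expressions are assumptions.

```r
x <- rep(1, 5)                                  # a vector of five ones
A <- matrix(data = rep(1:5, 5), nrow = 5, ncol = 5, byrow = TRUE)

x %*% A            # x read as a row vector: 1 x 5 result, 5 10 15 20 25
A %*% x            # x read as a column vector: 5 x 1 result, all 15
0.5 * A            # scalar multiplication acts entrywise
A %*% t(A)         # matrix product: every entry is 1 + 4 + 9 + 16 + 25 = 55
t(x) %*% A %*% x   # quadratic form: a 1 x 1 matrix holding 75
```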
We can perform algebraic operations even with vectors and matrices that are of the logical class:
## [1] FALSE TRUE TRUE
## [,1] [,2] [,3]
## [1,] TRUE FALSE FALSE
## [2,] FALSE TRUE TRUE
## [3,] TRUE FALSE FALSE
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 1 1
## [3,] 1 0 0
Note that R transforms the output to a 0-1 integer matrix.
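The commands for the logical example are not shown. One reading that is consistent with the three outputs above is an entrywise product of a logical vector and a logical matrix; the object names `y` and `B` and the use of `*` are assumptions.

```r
y <- c(FALSE, TRUE, TRUE)
B <- matrix(c(TRUE, FALSE, TRUE,
              FALSE, TRUE, FALSE,
              FALSE, TRUE, FALSE), nrow = 3)   # filled column by column
y * B            # entrywise product; y is recycled down each column
typeof(y * B)    # "integer": TRUE and FALSE were coerced to 1 and 0
```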
3.13 Data Frames
If vectors of different types (but the same length) are bound together, we get a data.frame, which is the most fundamental object in R for data analysis.
Data frames are brilliant, but a lot has been learned since their invention.
They have thus been extended in recent years with the tbl class, pronounced Tibble (https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html), and the data.table class.
The latter is discussed in Chapter 4, and is strongly recommended.
Creating a simple data frame:
Let’s inspect our data frame:
## x sin
## 1 1 3.841471
## 2 2 3.909297
## 3 3 3.141120
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
Now using the RStudio Excel-like viewer:
We highly advise against editing the data this way since there will be no documentation of the changes you made. Always transform your data using scripts, so that everything is documented.
Verifying this is a data frame:
## [1] "data.frame"
Check the dimension of the data
## [1] 10 2
Note that checking the dimension of a vector is different than checking the dimension of a data frame.
## [1] 10
The length of a data.frame is merely the number of columns.
## [1] 2
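The commands of this section are not shown. The printed values are consistent with a column `sin` equal to 3 + sin(x) for x from 1 to 10, so a sketch of the whole sequence, under that assumption, is:

```r
x <- 1:10
y <- 3 + sin(x)
frame1 <- data.frame(x = x, sin = y)   # bind the two vectors into a data frame

head(frame1)      # first six rows
# View(frame1)    # the RStudio Excel-like viewer (interactive sessions only)
class(frame1)     # "data.frame"
dim(frame1)       # 10 2: rows and columns
dim(x)            # NULL: a plain vector has no dimension attribute
length(x)         # 10: use length() for vectors
length(frame1)    # 2: for a data frame, the number of columns
```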
3.14 Extraction
R provides many ways to subset and extract elements from vectors and other objects. The basics are fairly simple, but not paying attention to the “personality” of each extraction mechanism may cause you a lot of headache.
For starters, extraction is done with the [ operator.
The operator can take vectors of many types.
Extracting an element by integer index:
## [1] 3.841471
Extract column by index:
## [1] 1 2 3 4 5 6 7 8 9 10
Extract row by index:
## x sin
## 7 7 3.656987
Extract column by name:
## [1] 3.841471 3.909297 3.141120 2.243198 2.041076 2.720585 3.656987
## [8] 3.989358 3.412118 2.455979
As a general rule, extraction with [ will conserve the class of the parent object.
There are, however, exceptions.
Notice the extraction mechanism and the class of the output in the following examples.
## [1] "numeric"
## [1] "data.frame"
## [1] "data.frame"
## [1] "data.frame"
## [1] "data.frame"
## [1] "numeric"
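The extraction commands above are not shown. The following sketch rebuilds `frame1` (assuming, as the printed values suggest, that its `sin` column is 3 + sin(x)) and lists calls that match the described extractions and the six classes printed; the exact expressions are assumptions.

```r
frame1 <- data.frame(x = 1:10, sin = 3 + sin(1:10))

frame1[1, 2]       # a single element, by row and column index
frame1[, 1]        # a column, by index
frame1[7, ]        # a row, by index
frame1[, "sin"]    # a column, by name

class(frame1[, "sin"])   # "numeric": a single column taken with [ , ] drops to a vector
class(frame1["sin"])     # "data.frame"
class(frame1[, 1:2])     # "data.frame"
class(frame1[2])         # "data.frame"
class(frame1[2, ])       # "data.frame"
class(frame1$sin)        # "numeric"
```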
The subset() function does the same.
We can also subset by conditioning on variable values. Here are two examples:
## x sin
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
## 10 10 2.455979
## x sin
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
## 10 10 2.455979
In the second example, we created a logical vector and used it for subsetting. Let’s see this:
## [1] FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
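The two subsetting commands are not shown. Rows 4, 5, 6 and 10 are exactly those whose `sin` value lies below the column mean, so the examples were plausibly of the following form (an assumption):

```r
frame1 <- data.frame(x = 1:10, sin = 3 + sin(1:10))

subset(frame1, sin < mean(sin))            # condition evaluated inside the data frame
frame1[frame1$sin < mean(frame1$sin), ]    # the same rows, via a logical vector
frame1$sin < mean(frame1$sin)              # the logical vector itself
```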
We can use this approach to create conditions. Here are more examples:
condition1 <- frame1$sin < mean(frame1$sin)
condition2 <- frame1$x %in% c(2,3,4,9,10) # the %in% operator is a logical for whether the elements of the first vector are also elements in the second.
frame1[condition1,] # only condition1==TRUE
## x sin
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
## 10 10 2.455979
## x sin
## 1 1 3.841471
## 2 2 3.909297
## 3 3 3.141120
## 7 7 3.656987
## 8 8 3.989358
## 9 9 3.412118
## x sin
## 2 2 3.909297
## 3 3 3.141120
## 4 4 2.243198
## 9 9 3.412118
## 10 10 2.455979
## x sin
## 4 4 2.243198
## 10 10 2.455979
## x sin
## 2 2 3.909297
## 3 3 3.141120
## 4 4 2.243198
## 5 5 2.041076
## 6 6 2.720585
## 9 9 3.412118
## 10 10 2.455979
If you want to force the stripping of the class attribute when extracting, try the [[ mechanism instead of [.
a <- frame1[1] # [ extraction
b <- frame1[[1]] # [[ extraction
class(a)==class(b) # objects have differing classes
## [1] FALSE
## x
## [1,] TRUE
## [2,] TRUE
## [3,] TRUE
## [4,] TRUE
## [5,] TRUE
## [6,] TRUE
## [7,] TRUE
## [8,] TRUE
## [9,] TRUE
## [10,] TRUE
The different types of output classes cause different behaviors. Compare the behavior of [ on seemingly identical objects.
## x
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## 9 9
## 10 10
## [1] 1
Adding variables to a data.frame can be done by assigning new vectors. The power of the data.frame object is that it accepts almost any type of vector, e.g. integer, numeric, logical, factor and character.
frame1$newvar1 <- frame1$x - frame1$sin
frame1$newvar2 <- frame1$newvar1 > 0
frame1$newvar3 <- ifelse(frame1$newvar2, "good", "bad")
frame1$newvar4 <- factor(frame1$newvar1>3 & frame1$newvar3=="good",
labels = c("level1","level2"))
head(frame1)
## x sin newvar1 newvar2 newvar3 newvar4
## 1 1 3.841471 -2.841471 FALSE bad level1
## 2 2 3.909297 -1.909297 FALSE bad level1
## 3 3 3.141120 -0.141120 FALSE bad level1
## 4 4 2.243198 1.756802 TRUE good level1
## 5 5 2.041076 2.958924 TRUE good level1
## 6 6 2.720585 3.279415 TRUE good level2
You can get a quick summary of the data.frame with the summary function. A more compact display can be achieved with the str function, which also provides the class of each column, the first values, and some more useful information.
## x sin newvar1 newvar2
## Min. : 1.00 Min. :2.041 Min. :-2.8415 Mode :logical
## 1st Qu.: 3.25 1st Qu.:2.522 1st Qu.: 0.3334 FALSE:3
## Median : 5.50 Median :3.277 Median : 3.1192 TRUE :7
## Mean : 5.50 Mean :3.141 Mean : 2.3589
## 3rd Qu.: 7.75 3rd Qu.:3.795 3rd Qu.: 3.8437
## Max. :10.00 Max. :3.989 Max. : 7.5440
## newvar3 newvar4
## Length:10 level1:5
## Class :character level2:5
## Mode :character
##
##
##
## 'data.frame': 10 obs. of 6 variables:
## $ x : int 1 2 3 4 5 6 7 8 9 10
## $ sin : num 3.84 3.91 3.14 2.24 2.04 ...
## $ newvar1: num -2.841 -1.909 -0.141 1.757 2.959 ...
## $ newvar2: logi FALSE FALSE FALSE TRUE TRUE TRUE ...
## $ newvar3: chr "bad" "bad" "bad" "good" ...
## $ newvar4: Factor w/ 2 levels "level1","level2": 1 1 1 1 1 2 2 2 2 2
If you want to learn more about subsetting see Hadley’s guide.
3.15 Augmentations of the data.frame class
As previously mentioned, the data.frame class has been extended in recent years.
The best known extensions are the data.table and the tbl.
For beginners, it is important to know R’s basics, so we keep focusing on data frames.
For more advanced users, I recommend learning the (amazing) data.table syntax.
3.16 Data Import and Export
For any practical purpose, you will not be generating your data manually. R comes with many importing and exporting mechanisms which we now present. If, however, you do a lot of data “munging”, make sure to see Hadley-verse Chapter ??. If you work with MASSIVE data sets, read the Memory Efficiency Chapter ??.
Sometimes our data is saved in our local directory. Before moving on to data importing, we need to know how to get and set R’s working directory:
getwd() #What is the working directory?
setwd("~/folder1/folder2") #Setting the working directory in Linux
setwd("./folder3")
The dot sign "." indicates a path relative to the current working directory, and "~" denotes the home directory. In Windows this is a bit different, e.g. setwd("C:\\Users\\User\\folder1").
3.16.1 Reading data files
The read.table function is the main importing workhorse.
It can import directly from the web.
URL <- 'http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/bone.data'
tirgul1 <- read.table(URL)
Alternatively, load from a local directory:
Always look at the imported result!
## V1 V2 V3 V4
## 1 idnum age gender spnbmd
## 2 1 11.7 male 0.01808067
## 3 1 12.7 male 0.06010929
## 4 1 13.75 male 0.005857545
## 5 2 13.25 male 0.01026393
## 6 2 14.3 male 0.2105263
Oh dear. read.table tried to guess the structure of the input, but failed to recognize the header row. Set it manually with header=TRUE:
Other important parameters:
- na.strings: the character strings that represent a missing value.
- nrows: how many rows to read.
- skip: number of lines to skip before reading.
A faster alternative for reading data frames is fst::read.fst() from the fst package, but first make sure that the data file is in the .fst format.
3.16.2 Export as CSV
Let’s write a simple file so that we have something to import
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
temp.file.name <- tempfile() # get some arbitrary file name
write.csv(x = airquality, file = temp.file.name) # export
Now let’s import the exported file. Being a .csv file, I can use read.csv instead of read.table.
## X Ozone Solar.R Wind Temp Month Day
## 1 1 41 190 7.4 67 5 1
## 2 2 36 118 8.0 72 5 2
## 3 3 12 149 12.6 74 5 3
## 4 4 18 313 11.5 62 5 4
## 5 5 NA NA 14.3 56 5 5
## 6 6 28 NA 14.9 66 5 6
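The import command is not shown. A sketch of the full round trip follows, using the built-in airquality data as above; the object names `my.data` and `my.data2` and the `.csv` file extension are assumptions. It also shows where the extra X column in the output comes from.

```r
temp.file.name <- tempfile(fileext = ".csv")       # an arbitrary temporary file name
write.csv(x = airquality, file = temp.file.name)   # export; row names are written too

my.data <- read.csv(file = temp.file.name)         # import
head(my.data)                                      # column X holds the exported row names

# To avoid the extra column, drop the row names when exporting:
write.csv(x = airquality, file = temp.file.name, row.names = FALSE)
my.data2 <- read.csv(file = temp.file.name)
names(my.data2)                                    # same columns as airquality
```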
3.16.3 Export non-CSV files
You can export your R objects in endlessly many ways:
If instead of the comma delimiter in .csv you want other column delimiters, look into ?write.table.
If you are exporting only for R users, you can consider exporting as binary objects with saveRDS, feather::write_feather, or fst::write.fst.
See (http://www.fstpackage.org/) for a comparison.
3.16.4 Reading source files
With the RCurl package we can read the source of web pages. The getURL function is used to retrieve the source of a webpage, which is especially useful for retrieving pages for data processing (also known as web scraping / web harvesting / web data extraction). Example:
What can we do with this data? Well, if we know how this text file is written we can extract some relevant data. Web scraping is cool, but not in our scope. More on this and on extracting data from the web can be found here
3.16.5 Reading From Text Files
Sometimes the data has the shape of a data.frame but is stored as a text file. We can also call the read.table function to import such text files.
If you care about your sanity, see ?read.table before starting imports.
Some notable properties of the function:
- By default, read.table splits columns on white space (spaces and tabs); use the sep argument for other separators, such as commas.
- read.table will try to guess if a header row is present.
- In R versions before 4.0.0, read.table converts character vectors to factors unless told not to using the stringsAsFactors=FALSE argument. Since R 4.0.0, stringsAsFactors=FALSE is the default.
- The output of read.table needs to be explicitly assigned to an object for it to be saved.
3.16.6 Download a file from the web
If you have the file URL, you can download a file and save it in a desired destination:
3.16.7 Writing Data to Text Files
The function write.table is the exporting counterpart of read.table.
3.16.8 .XLS(X) files
It is strongly recommended to convert to .csv in Excel, and then import as csv. If you still insist, see the xlsx package.
3.16.9 Massive files
The above importing and exporting mechanisms were not designed for massive files. See the section on the data.table package (4), or read the sections Sparse Representation and Memory Efficiency (Out-of-Ram Algorithms) in Johnathan’s R BGU course for more on working with massive data files.
3.16.10 Databases
R does not need to read from text files; it can read directly from a database. This is very useful since it allows the filtering, selecting and joining operations to rely on the database’s optimized algorithms. Then again, if you will only be analyzing your data with R, you are probably better off working from a file, without the database’s overhead. See the SQL Chapter for more on this matter.
3.16.11 Reading images data files
See the excellent magick package.
3.17 Functions
One of the most basic building blocks of programming is the ability to write your own functions. A function in R, like everything else, is an object accessible using its name. We first define a simple function that sums its two arguments:
## [1] 12
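The definition itself is not shown. The notes that follow refer to a function named `my.sum` with arguments `(x,y)`, so a minimal sketch is given below; the call `my.sum(10, 2)` is an assumption chosen to give the printed value 12.

```r
my.sum <- function(x, y) {
  x + y              # the last evaluated expression is the return value
}
my.sum(10, 2)        # 12
z <- my.sum(10, 2)   # assign the output to keep it
```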
From this example you may notice that:
- The function function tells R to construct a function object.
- Unlike some programming languages, a period (.) is allowed as part of an object’s name.
- The arguments of the function, i.e. (x,y), need to be named but we are not required to specify their class. This makes writing functions very easy, but it is also the source of many bugs, and of the slowness of R compared to type-declaring languages (C, Fortran, Java, …).
- A typical R function does not change objects but rather creates new ones. To save the output of my.sum we will need to assign it using the <- operator.
Here is a (slightly) more advanced function:
my.sum.2 <- function(x, y, absolute=FALSE) {
  if(absolute==TRUE) {
    result <- abs(x+y)
  } else {
    result <- x+y
  }
  result
}
my.sum.2(-10,2,TRUE)
## [1] 8
Things to note:
- if(condition){expression1} else{expression2} does just what the name suggests.
- The function will output its last evaluated expression. You don’t need to use the return function explicitly.
- Using absolute=FALSE sets the default value of absolute to FALSE. This is overridden if absolute is stated explicitly in the function call.
An important behavior of R is its scoping rules. This refers to the way R looks up variables used in functions. As a rule of thumb, R will first look for variables inside the function and, if not found, will search for them in outer environments. Think of the next example.
## [1] 24
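The example itself is not shown. A sketch of the scoping rule, with names and values assumed so that the result is the printed 24:

```r
a <- 1
b <- 2
x <- 3
scoping <- function(a, b) {
  a + b + x        # a and b are the arguments; x is not defined inside the
}                  # function, so R finds it in the enclosing environment
scoping(10, 11)    # 10 + 11 + 3 = 24; the outer a and b are ignored
```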
3.18 Looping
The real power of scripting is when repeated operations are done by iteration.
R supports the usual for, while, and repeat loops.
Here is an embarrassingly simple example:
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
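The loop that printed the numbers above is not shown. The sketch below writes it as a for loop, and adds while and repeat versions for comparison; all three are assumptions about what the example looked like.

```r
for (i in 1:5) {      # for: iterate over the elements of a vector
  print(i)
}

i <- 1
while (i <= 5) {      # while: iterate as long as the condition holds
  print(i)
  i <- i + 1
}

i <- 1
repeat {              # repeat: iterate until an explicit break
  print(i)
  i <- i + 1
  if (i > 5) break
}
```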
A slightly more advanced example is vector multiplication.
For vector and matrix products, prefer existing facilities such as %*%, sum(), etc.
For parallel work, consider foreach loops in the foreach package, rather than regular for loops.
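The vector multiplication example is not shown in the text. A sketch of an inner product computed with an explicit loop, next to the built-in alternatives, follows; the vectors and their length are assumptions.

```r
n <- 1e3
x <- 1:n
y <- (1:n) / n

result <- 0
for (i in 1:n) {                  # explicit loop, for illustration only
  result <- result + x[i] * y[i]
}
result

sum(x * y)                        # the same value using vectorized operations
as.numeric(x %*% y)               # or using the matrix product
```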
3.19 Apply
For applying the same function to a set of elements, there is no need to write an explicit loop.
This is such an elementary operation that every programming language provides some facility to apply, or map, a function to all elements of a set.
R provides several facilities to perform this.
The most basic of these is lapply, which applies a function over all elements of a list and returns a list of outputs:
the.list <- list(1,'a',mean) # a list of 3 elements from different classes
lapply(X = the.list, FUN = class) # apply the function `class` to each element
## [[1]]
## [1] "numeric"
##
## [[2]]
## [1] "character"
##
## [[3]]
## [1] "function"
## [1] "numeric" "character" "function"
What if the function you are using requires some arguments? One useful trick is to create your own function that takes only one argument:
## Murder.25% Assault.25% UrbanPop.25% Rape.25%
## 4.075 109.000 54.500 15.075
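The commands behind the last two outputs are not shown. The names in the printed result point to the built-in USArrests data and a 25% quantile, so a plausible sketch (the helper name `quantile.25` is an assumption) is:

```r
the.list <- list(1, 'a', mean)
sapply(X = the.list, FUN = class)    # like lapply, but simplified to a character vector

quantile.25 <- function(x) quantile(x, probs = 0.25)  # wrapper taking a single argument
sapply(USArrests, quantile.25)       # 25% quantile of every column

# Alternative: extra arguments after FUN are passed on to the function.
sapply(USArrests, quantile, probs = 0.25)
```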
R provides many variations on lapply to facilitate programming.
Here is a partial list:
- sapply: The same as lapply but tries to arrange output in a vector or matrix, and not an unstructured list.
- vapply: A safer version of sapply, where the output class is pre-specified.
- apply: For applying over the rows or columns of matrices.
- mapply: For applying functions with more than a single input.
- tapply: For splitting vectors and applying functions on subsets.
- rapply: A recursive version of lapply.
- eapply: Like lapply, only operates on environments instead of lists.
- Map+Reduce: For a Common Lisp look and feel of lapply.
- parallel::parLapply: A parallel version of lapply from the package parallel.
- parallel::parLapplyLB: A parallel version of lapply, with load balancing, from the package parallel.
3.20 Recursion
R is really not designed for recursion, and you will rarely need it.
See the Rcpp Chapter in R-BGU-course for linking C++ code, which is better suited for recursion.
If you insist on writing recursive functions in R, make sure to use the Recall function, which, as the name suggests, recalls the function in which it is placed.
Here is a demonstration with the Fibonacci series.
## [1] 8
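The original chunk is not shown here. One possible sketch of a recursive Fibonacci function using Recall, assuming the convention that the first two terms equal 1, is:

```r
fib <- function(n) {
  if (n <= 2) {
    1                                # fib(1) = fib(2) = 1
  } else {
    Recall(n - 1) + Recall(n - 2)    # Recall() calls the enclosing function
  }
}

fib(6)   # 1, 1, 2, 3, 5, 8: the sixth term is 8
```

Because Recall does not refer to the function by name, the recursion keeps working even if fib is later renamed.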
3.21 Dates and Times
R provides several options for dealing with date and date/time data. We will start with the default options from R’s base library.
Unless you tell it so, R doesn’t know that something is a date. The as.Date function converts values to dates. You can pass it a character, a numeric, or a POSIXct (we’ll soon explain what it is).
Let’s see some examples:
## [1] "character"
## [1] "Date"
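A sketch consistent with the two outputs above, where the variable name start is our own choice:

```r
start <- "1948-05-14"
class(start)              # a plain character string

start <- as.Date(start)   # the default format is yyyy-mm-dd
class(start)              # now a Date object
```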
But what if our date is not in the yyyy-mm-dd format? We can tell R the format of the character date.
## [1] "1948-05-14"
## [1] "1948-05-14"
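As an illustration, two differently formatted strings can be mapped to the same date through the format argument. The input strings are our own examples, chosen to avoid locale-dependent month names.

```r
# %d = day of month, %m = month number, %Y = four-digit year.
d1 <- as.Date("14/05/1948", format = "%d/%m/%Y")
d2 <- as.Date("1948.05.14", format = "%Y.%m.%d")
d1
d2
```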
See ?strptime for more options for date-time conversion functions to and from character.
Another option is to set the date given the number of days from the origin, e.g.:
## [1] "4686-04-10"
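A small sketch of the origin mechanism, with day counts and origins of our own choosing rather than those behind the output above:

```r
# Ten days after the Unix epoch.
d.epoch <- as.Date(10, origin = "1970-01-01")
d.epoch

# 366 days after 2000-01-01; the year 2000 is a leap year.
d.leap <- as.Date(366, origin = "2000-01-01")
d.leap
```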
Once R knows it’s a date, some functions and operators adapt themselves:
## [1] "2020-12-02"
## [1] "2020-12-03"
## Time difference of 26500 days
## [1] "1948-05-14"
## [1] 12
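The calls behind these outputs are not shown. The sketch below reproduces similar results, assuming a fixed date in place of the current date so that it is reproducible; the last line is our guess at a call returning the month number.

```r
today <- as.Date("2020-12-02")   # fixed stand-in for Sys.Date()
start <- as.Date("1948-05-14")

today + 1                        # adding a number shifts the date by days
today - start                    # subtraction returns a difftime in days
min(start, today)                # comparisons and min/max work on dates
as.numeric(format(today, "%m"))  # extract the month as a number
```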
We now move to times, which have a more quantitative meaning. You can set a time object in R just as with dates, but now you have to specify more units (or R will set them to their defaults). POSIXct is R’s object class for time, and it expects strings in the format YYYY-MM-DD HH:MM:SS. With POSIXct you can also specify the time zone, e.g., "Asia/Jerusalem".
## [1] "POSIXct" "POSIXt"
## Time difference of 3 days
## [1] "difftime"
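A sketch along these lines, with time stamps of our own choosing. We use tz = "UTC" here so the example does not depend on the time zone database installed on your system; any valid zone name can be substituted.

```r
time1 <- as.POSIXct("2020-12-02 13:04:39", tz = "UTC")
time2 <- as.POSIXct("2020-12-05 13:04:39", tz = "UTC")

class(time1)          # "POSIXct" "POSIXt"
time2 - time1         # a difftime, here of 3 days
class(time2 - time1)  # "difftime"
```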
If you convert a POSIXct object to a POSIXlt object, time components can be extracted easily:
## $class
## [1] "POSIXct" "POSIXt"
## [1] 13
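For instance, assuming a time stamp of our own choosing, the components of a POSIXlt object can be read with the $ operator:

```r
time1 <- as.POSIXct("2020-12-02 13:04:39", tz = "UTC")
time1.lt <- as.POSIXlt(time1)

time1.lt$hour          # 13
time1.lt$min           # 4
time1.lt$year + 1900   # years are stored as an offset from 1900
time1.lt$mon + 1       # months are stored as 0-11
```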
We can get more specific information about time difference:
## Time difference of 72 hours
## Time difference of 0.4285714 weeks
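The units of a time difference can be requested through the units argument of difftime. A sketch with two time stamps three days apart (our own example values):

```r
time1 <- as.POSIXct("2020-12-02 13:04:39", tz = "UTC")
time2 <- as.POSIXct("2020-12-05 13:04:39", tz = "UTC")

diff.hours <- difftime(time2, time1, units = "hours")   # 72 hours
diff.weeks <- difftime(time2, time1, units = "weeks")   # 3/7 of a week
diff.hours
diff.weeks
```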
And we can also generate sequences with times:
## [1] "2020-12-02 13:04:39 IST" "2020-12-03 13:04:39 IST"
## [3] "2020-12-04 13:04:39 IST" "2020-12-05 13:04:39 IST"
## [1] "2020-12-02 13:04:39 IST" "2021-01-02 13:04:39 IST"
## [3] "2021-02-02 13:04:39 IST" "2021-03-02 13:04:39 IST"
## [5] "2021-04-02 13:04:39 IDT" "2021-05-02 13:04:39 IDT"
## [7] "2021-06-02 13:04:39 IDT" "2021-07-02 13:04:39 IDT"
## [9] "2021-08-02 13:04:39 IDT" "2021-09-02 13:04:39 IDT"
## [11] "2021-10-02 13:04:39 IDT" "2021-11-02 13:04:39 IST"
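Sequences like the ones above can be generated with seq, which accepts a by argument such as "day" or "month". The starting time and the UTC time zone below are our own assumptions, so the zone labels will differ from the output above.

```r
time1 <- as.POSIXct("2020-12-02 13:04:39", tz = "UTC")

daily   <- seq(time1, by = "day",   length.out = 4)    # four consecutive days
monthly <- seq(time1, by = "month", length.out = 12)   # same day and time, twelve months
daily
monthly
```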
Beyond base R, the lubridate package offers efficient and convenient functionality for dates and times.
You can create a date by encoding the order in which year, month, and day appear in your dates in the function name. For instance, ymd() means (left to right): year, month, day. You can also add the time of day with ymd_hms(). Once the order is specified, the exact format does not matter:
## [1] "2017-01-31"
## [1] "2017-01-31"
## [1] "2017-01-31"
## [1] "2000-01-01 00:00:01 UTC"
## [1] "2000-01-01 00:00:01 UTC"
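A sketch of these parsers, assuming the lubridate package is installed. The input strings are our own and use only numeric formats, to avoid locale-dependent month names.

```r
library(lubridate)

d1 <- ymd("2017/01/31")    # year, month, day
d2 <- mdy("01-31-2017")    # month, day, year
d3 <- dmy("31.01.2017")    # day, month, year
t1 <- ymd_hms("2000-01-01 00:00:01")   # date plus time of day, UTC by default

d1
d2
d3
t1
```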
Another nice thing in lubridate is that periods can be created with a number of friendly constructor functions, which you can combine with time objects. E.g.:
## [1] "1S"
## [1] "2M 0S" "3M 0S"
## [1] "4H 0M 0S"
## [1] "5d 0H 0M 0S"
## [1] "6m 0d 0H 0M 0S" "7m 0d 0H 0M 0S" "8m 0d 0H 0M 0S"
## [1] "63d 0H 0M 0S"
## [1] "10y 0m 0d 0H 0M 0S"
## [1] "2000-01-01 00:00:01 UTC"
## [1] "2000-01-01 00:00:02 UTC"
## [1] "2010-01-01 00:02:01 UTC" "2010-01-01 00:03:01 UTC"
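The constructor calls are not shown above. A sketch that appears consistent with the printed periods, assuming lubridate is installed, could be:

```r
library(lubridate)

# Period constructors; each returns a Period object.
seconds(1)
minutes(c(2, 3))
hours(4)
days(5)
months(c(6, 7, 8))
weeks(9)             # stored as 63 days
years(10)

# Periods can be added to date-times.
x <- ymd_hms("2000-01-01 00:00:01")
x.plus.second <- x + seconds(1)
x.shifted     <- x + years(10) + minutes(c(2, 3))
x.plus.second
x.shifted
```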
And you can also extract and assign the time components:
## [1] "2000-01-01 00:00:01 UTC"
## [1] 1
## [1] "2000-01-01 00:00:26 UTC"
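In lubridate, accessor functions such as second(), minute(), hour(), day(), month(), and year() both read and set a component. A sketch, assuming lubridate is installed and using a replacement value chosen to match the output above:

```r
library(lubridate)

x <- ymd_hms("2000-01-01 00:00:01")
x
second(x)         # read the seconds component
second(x) <- 26   # assign a new value to it
x
```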
There are many more things to know about working with dates and times and how to analyze temporal data. If you are interested in time-series analysis, look into the tseries, forecast, and zoo packages.
3.22 Working with strings
Strings are common in R, especially as file names, paths (directories), names of elements in the data, text elements (e.g. the title of a plot), and more.
Here are some examples of manipulating strings.
The paste function does what its name suggests, with the sep argument for the character string that separates the terms.
## [1] TRUE
## [1] "good morning"
## [1] "good.morning"
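A sketch that appears consistent with the outputs above (the variable names a and b are our own, and the first call is a guess at what produced TRUE):

```r
a <- "good"
b <- "morning"

is.character(a)                         # TRUE
greeting     <- paste(a, b)             # default separator is a single space
greeting.dot <- paste(a, b, sep = ".")  # custom separator
greeting.raw <- paste0(a, b)            # paste0 uses no separator at all
greeting
greeting.dot
```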
The substr function extracts or replaces substrings in a character vector. E.g.:
## [1] "ood"
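For example, assuming the input string "good", extraction and replacement could look like this:

```r
s <- "good"
tail.part <- substr(s, start = 2, stop = 4)   # characters 2 to 4: "ood"
tail.part

# The replacement form overwrites the selected characters in place.
substr(s, 1, 1) <- "h"
s                                             # "hood"
```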
The grep function searches for matches to the argument pattern within each element of a character vector.
Its relatives (grepl, regexpr, etc.) differ in the format and amount of detail in their results. Here are some examples:
## [1] "good" "morning" "good.evening"
## [1] 1 3
grep("good",d, value=TRUE, ignore.case=TRUE) # with value=TRUE we get the values of the matching elements. Here we also ask for case-insensitive matching
## [1] "good" "good.evening"
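A sketch of the calls that appear to lie behind these outputs, with the vector d reconstructed from the first printed line:

```r
d <- c("good", "morning", "good.evening")
d

idx     <- grep("good", d)                 # indices of matching elements: 1 3
matches <- grep("good", d, value = TRUE)   # the matching elements themselves
flags   <- grepl("good", d)                # one logical value per element
idx
matches
flags
```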
We can use gsub to replace characters in a string object:
## [1] "gqqd" "mqrning" "gqqd.evening"
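The output above is what one gets by replacing every "o" with "q". A sketch, with d assumed to be the vector from the grep examples:

```r
d <- c("good", "morning", "good.evening")

all.replaced   <- gsub("o", "q", d)   # gsub replaces every match
first.replaced <- sub("o", "q", d)    # sub replaces only the first match per element
all.replaced
first.replaced
```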
The strsplit function splits the elements of a character vector into a list:
## a b c
## "thiszis" "justzan" "example"
## $a
## [1] "this" "is"
##
## $b
## [1] "just" "an"
##
## $c
## [1] "example"
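A sketch matching the output above, assuming the named vector x was built as follows and split on the letter "z":

```r
x <- c(a = "thiszis", b = "justzan", c = "example")
x

parts <- strsplit(x, split = "z")   # one list element per input string
parts
```

The names of x are carried over to the list, so parts$a holds the pieces of the first string.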
Lastly, here are some more simple and useful functions in R to work with strings:
## a b c
## 7 7 7
## a b c
## "THISZIS" "JUSTZAN" "EXAMPLE"
## a b c
## "thiszis" "justzan" "example"
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
## the sum of 1 and 2 is 3
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout"
abbreviate(long, minlength = 3, method = "both.sides") # try the abbreviate function to shorten them while maintaining the meaning
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## "MRX" "MRW" "D71" "H4D"
## Hornet Sportabout
## "HrS"
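Most of the calls behind these outputs are not shown. A sketch that appears consistent with them, where x is assumed to be the named vector from the strsplit example and long the first five row names of mtcars:

```r
x <- c(a = "thiszis", b = "justzan", c = "example")

n.chars <- nchar(x)          # number of characters in each string
upper   <- toupper(x)        # convert to upper case
lower   <- tolower(upper)    # and back to lower case
n.chars
upper
lower

letters[1:10]                # built-in lower-case alphabet
LETTERS[1:10]                # built-in upper-case alphabet

# sprintf fills placeholders (%d for integers) in a template string.
msg <- sprintf("the sum of %d and %d is %d", 1L, 2L, 1L + 2L)
cat(msg, "\n")

long <- rownames(mtcars)[1:5]
long
```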
If you need more than this, look for the stringr package that provides a set of internally consistent tools.
3.23 Bibliographic Notes
There are countless introductory texts on R. For a list of free resources see CrossValidated. I personally recommend the official introduction Venables et al. (2004), available online, or anything else Bill Venables writes.
For Importing and Exporting see (https://cran.r-project.org/doc/manuals/r-release/R-data.html). For working with databases see (https://rforanalytics.wordpress.com/useful-links-for-r/odbc-databases-for-r/).
For advanced R programming see Wickham (2014), available online, or anything else Hadley Wickham writes.
For a curated list of recommended packages see here.
3.24 Practice Yourself
- Load the package MASS. That was easy. Now load ggplot2, after looking into install.packages().
- Save the numbers 1 to 1,000,000 (1e6) into an object named object.
- Write a function that computes the mean of its input. Write a version that uses sum(), and another that uses a for loop and the summation +. Try checking which is faster using system.time. Is the difference considerable? Ask me about it in class.
- Write a function that returns TRUE if a number is divisible by 13, FALSE if not, and a nice warning to the user if the input is not an integer number.
- Apply the previous function to all the numbers in object. Try using a for loop, but also a mapping/apply function.
- Make a matrix of random numbers using A <- matrix(rnorm(40), ncol=10, nrow=4). Compute the mean of each column. Do it using your own loop and then do the same with lapply or apply.
- Make a data frame (dataA) with three columns and 100 rows. The first column holds 100 numbers generated from the \(\mathcal{N}(10,1)\) distribution, the second column holds samples from \(Poiss(\lambda=4)\), and the third column contains only 1.
  Make another data frame (dataB) with three columns and 100 rows, now with \(\mathcal{N}(10,0.5^2)\), \(Poiss(\lambda=4)\) and 2. Combine the two data frames into an object named dataAB with rbind. Make a scatter plot of dataAB where the x-axis is the first column, the y-axis is the second, and the shape of the points is defined by the third column.
- In a sample of 1,000 observations generated from the \(\mathcal{N}(10,1)\) distribution:
  - What is the proportion of samples smaller than \(12.4\)?
  - What is the \(0.23\) percentile of the sample?
- Nothing like cleaning a dataset to practice your R basics. Rachael Tatman collected several datasets which BADLY need some cleansing; have a look.
References
Venables, William N, David M Smith, R Development Core Team, and others. 2004. “An Introduction to R.” Network Theory Limited.
Wickham, Hadley. 2014. Advanced R. CRC Press.
Wilkinson, GN, and CE Rogers. 1973. “Symbolic Description of Factorial Models for Analysis of Variance.” Applied Statistics, 392–99.