Chapter 2 Getting Started
2.1 A Brief History of R
- R is a scripting language for statistical data manipulation and analysis.
- It was inspired by the statistical language S (for Statistics), developed at AT&T.
- S was later developed into S-Plus, which added a GUI.
- R, sometimes called GNU S, has become more popular than S or S-Plus because it is free and open source, and more people contribute to it.
R is an extremely versatile open source programming language for statistics and data science.
2.2 Install R and RStudio
- You can download and install a copy of the latest version of R for free on your own computer from CRAN.
- We use RStudio, an IDE for the R statistical programming language. Download RStudio from its website. R needs to be installed before installing RStudio.
- Open RStudio. Your interface is made up of four panes:
  - script/code pane
    - Where we keep records of our work
    - Write scripts/code
    - Scripts can be saved
  - console pane
    - Contains the command line
    - Executes quick commands
    - Displays executed code
    - Displays results of executed code
  - workspace pane
    - Shows what is loaded in memory, e.g., data
    - Stores any object, function or data you create during your R session (we will cover what those are)
    - The history tab keeps a record of all previously submitted commands
  - files/plots/packages/help pane
    - The files tab lists the files in the current working directory. You can also navigate to other directories
    - The plots tab displays any graphs/figures created during the R session
    - The packages tab lists the add-on packages currently installed (more can be installed)
    - The help tab provides information about R and its commands.
The layout of the panes can be reorganized via the menu options View > Panes > …
2.3 How to Run R
R operates in two modes: interactive and batch.
2.3.1 Interactive Mode
- Type R in a terminal (Mac) or start R by double-clicking the R icon (Windows).
- In RStudio, the console pane is the interactive session. (We use RStudio.)
- Example: consider a random variable X∼N(0,1). Generate a new random variable Y=|X|, which no longer follows the standard normal distribution.
set.seed(100) # setting seed number for a random process
x = rnorm(100) # generate 100 samples of X
y = abs(x) # generate 100 samples of Y
mean(y) # calculate the sample average of y
Note that anything after ‘#’ on a line is a comment and is not executed as R code. You can condense the three computation steps into one as follows:
set.seed(100)
mean(abs(rnorm(100)))
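Save the first version (the four lines that create x and y) as an R script file, for example z.r or z.R, and run it with source(). The file name must be quoted:
source("z.R")
After sourcing the script, you can check in the workspace pane that x and y have been created.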
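The save-and-source step can be tried without touching your own files. This is a minimal sketch that assumes a temporary file stands in for z.R; the name script is hypothetical and not part of the original notes.

```r
# Write the script lines to a temporary file (a stand-in for z.R)
script <- tempfile(fileext = ".R")
writeLines(c(
  "set.seed(100)",
  "x <- rnorm(100)",
  "y <- abs(x)"
), script)

# source() runs the file line by line, as if it were typed at the console
source(script)

# x and y now exist in the workspace
exists("x")
exists("y")
length(y)
```

By default source() evaluates the file in the global environment, which is why the objects show up in the workspace pane.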
2.3.2 Batch Mode
Let’s continue editing the z.R script file by adding the following command lines.
pdf('figure.pdf')
hist(x)
dev.off()
The updated z.R script file can be run without entering R’s interactive mode by invoking R from an operating system shell, for example with Rscript z.R or R CMD BATCH z.R. You can confirm that this worked by checking for the file figure.pdf in your working directory.
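A batch run can also be launched from inside R, which is convenient for a quick experiment. The sketch below is only an illustration: the temporary script and output file are stand-ins for z.R and figure.pdf, and passing the output path as a command-line argument is an assumption made here so that the example is self-contained.

```r
# A temporary script that plays the role of z.R
script <- tempfile(fileext = ".R")
writeLines(c(
  "args <- commandArgs(trailingOnly = TRUE)  # output path given on the command line",
  "set.seed(100)",
  "x <- rnorm(100)",
  "pdf(args[1])",
  "hist(x)",
  "dev.off()"
), script)

# Temporary output file that plays the role of figure.pdf
out <- tempfile(fileext = ".pdf")

# Locate the Rscript executable that belongs to the running R installation
rscript <- file.path(R.home("bin"), "Rscript")

# Run the script in a separate, non-interactive R process
status <- system2(rscript, c(shQuote(script), shQuote(out)))

status            # exit status of the batch run; 0 means success
file.exists(out)  # whether the PDF was written
```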
2.4 A First R Session
Let’s create a vector named x.
x <- c(1,2,3)
‘<-’ is the standard assignment operator in R. ‘=’ also works in most situations, but not all. The function c stands for concatenate, so here we are concatenating three one-element vectors. Try:
y <- c(1,2,x,c(3,4,5))
y
Individual elements of a vector can be accessed via [ ], which is called indexing.
y[2]
y[1:3]
y[3:5]
We can also replace elements of a vector by assigning to an indexed subset.
y[3:5] <- c(33,44,55)
y
We can apply predefined R functions.
m_y <- mean(y)
m_y
sd(y)
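For reference, the session so far can be traced step by step. The sketch below repeats the commands above with the values they produce written as comments; nothing new is introduced.

```r
x <- c(1, 2, 3)
y <- c(1, 2, x, c(3, 4, 5))  # nested vectors are flattened: 1 2 1 2 3 3 4 5

y[2]      # 2
y[1:3]    # 1 2 1
y[3:5]    # 1 2 3

y[3:5] <- c(33, 44, 55)      # overwrite positions 3, 4 and 5
y         # 1 2 33 44 55 3 4 5

m_y <- mean(y)
m_y       # 18.375, i.e. 147 / 8
```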
Let’s work on one of R’s internal data sets. You can get a list by typing the following.
data()
Let’s try the Nile data.
?Nile # to check info on the data
Nile # print data (don't do this for big data)
mean(Nile)
sd(Nile)
hist(Nile)
This is the end of our five-minute intro to R. Quit R by calling the q() function.
When prompted, save the workspace image if you want to keep your variables and resume work later. This is especially useful after loading large datasets.
2.5 Introduction to Functions
A function is a group of instructions that takes inputs, uses them to compute other values, and returns a result. Let’s make a function that counts the odd numbers in an input vector.
oddcount <- function(x) {
  k <- 0  # initialize the count
  for (i in x) {
    if (i %% 2 == 1) k <- k + 1  # %% is the remainder (modulo) operator
  }
  k
}
Then try:
oddcount(c(1,3,5,2,4,5,7))
oddcount(c(2,2,2,2))
oddcount(c('k',2,2)) # 'k' makes the whole vector character, so an error occurs
When we included a non-numeric element, the function produced an error. Let’s change the function so that it checks its input first.
oddcount_mod <- function(x) {
  #### x: a numeric input vector
  if (is.numeric(x)) {  # is.numeric() returns a single TRUE/FALSE for the whole vector
    k <- 0  # initialize the count
    for (i in x) {
      if (i %% 2 == 1) k <- k + 1  # %% is the remainder (modulo) operator
    }
    return(k)
  } else {
    print('The elements of input vector should be numeric type')
  }
}
oddcount_mod(c(1,3,5,2,4,5,7))
oddcount_mod(c(2,2,2,2))
oddcount_mod(c('k',2,2)) # 'k' is a character, so the message is printed instead of an error
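Printing a message lets the caller carry on as if nothing had gone wrong. An alternative, shown here only as a sketch, is to signal an error with stop() and let the caller decide how to handle it with tryCatch(). The name oddcount_strict is hypothetical and not part of the original notes.

```r
oddcount_strict <- function(x) {
  # Refuse non-numeric input with a real error rather than a printed message
  if (!is.numeric(x)) {
    stop("x must be a numeric vector")
  }
  k <- 0
  for (i in x) {
    if (i %% 2 == 1) k <- k + 1
  }
  k
}

n_odd <- oddcount_strict(c(1, 3, 5, 2, 4, 5, 7))
n_odd  # 5

# tryCatch() intercepts the error and returns its message instead
msg <- tryCatch(
  oddcount_strict(c("k", 2, 2)),
  error = function(e) conditionMessage(e)
)
msg    # "x must be a numeric vector"
```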
2.5.1 Variable Scope
i # i is a local variable defined within the function; it is not visible outside, so this raises an error
z = c(2,3,4)
oddcount_mod(z)
Let’s see the following example.
f <- function(k) k+kk # kk is not defined as an input
f
k # local variable
f(2) # error
kk <- 3 # define kk, the global variable
f(2) # works
A global variable can be defined or modified within a function by using the superassignment operator, <<-. Try it by making your own examples.
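One possible example of superassignment is a simple counter. This is a minimal sketch; the names counter, increment and local_only are hypothetical and chosen only for illustration.

```r
counter <- 0  # variable defined outside any function

increment <- function() {
  counter <<- counter + 1  # <<- modifies the existing outer variable
  invisible(counter)
}

local_only <- function() {
  counter <- 100  # <- creates a new local variable; the outer one is untouched
  counter
}

increment()
increment()
counter       # 2

local_only()  # 100
counter       # still 2
```

The contrast between the two functions is the point: ordinary assignment inside a function never changes a variable outside it, while <<- does.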
2.5.2 Default Arguments
You can set a default value for the input of a function:
evenoddcount <- function(x, odd = TRUE) {
  #### x: a numeric input vector
  #### odd: TRUE if odd numbers are counted, FALSE if even numbers are counted
  if (is.numeric(x)) {  # is.numeric() returns a single TRUE/FALSE for the whole vector
    k <- 0  # initialize the count
    if (odd) {
      for (i in x) {
        if (i %% 2 == 1) k <- k + 1  # %% is the remainder (modulo) operator
      }
    } else {
      for (i in x) {
        if (i %% 2 == 0) k <- k + 1
      }
    }
    return(k)
  } else {
    print('The elements of input vector should be numeric type')
  }
}
evenoddcount(c(1,2,3)) # odd numbers counted by default
evenoddcount(c(1,2,3),odd=FALSE) # even numbers counted by changing the odd input
Note that R allows the logical values TRUE and FALSE to be abbreviated to T and F. However, T and F are ordinary variables that can be overwritten, so the full names are safer. Think about where you put “if (odd)” to make the function efficient. Do you think it is optimally written?
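One possible answer to the efficiency question, offered as a sketch rather than the intended solution: decide once which remainder to look for, then let R’s vectorized operators do the counting without an explicit loop. The name evenoddcount_vec is hypothetical, and this version stops with an error on non-numeric input instead of printing a message.

```r
evenoddcount_vec <- function(x, odd = TRUE) {
  stopifnot(is.numeric(x))          # raise an error on non-numeric input
  remainder <- if (odd) 1 else 0    # the odd/even decision is made only once
  sum(x %% 2 == remainder)          # TRUE counts as 1 when summed
}

evenoddcount_vec(c(1, 2, 3))                # 2 odd numbers
evenoddcount_vec(c(1, 2, 3), odd = FALSE)   # 1 even number
```

Here x %% 2 == remainder produces a logical vector with one entry per element, and sum() counts the TRUE values.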
2.6 More Functions in Packages
R has a base system and a package system. Functions in the base system are available as soon as you install R. A vast array of user-written functions is available from the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown exponentially!
One package that we will make extensive use of is the ‘tidyverse’ package. It uses a somewhat different dialect from the base system but is useful for data processing and visualization. It is actually a wrapper for a number of other packages (see the Tidyverse website). We type the following into the R console:
?install.packages
install.packages('tidyverse') # install from the CRAN repository
Or you can use the packages pane in RStudio. R will download the packages from CRAN and install them on your computer. You will not be able to use the functions, objects, or help files in a package until you load it. Once you have installed a package, you can load it using the library() function:
library(tidyverse)
The printed output tells you that tidyverse loads multiple packages, including dplyr, purrr, and ggplot2. These are considered the core of the tidyverse because you’ll use them in almost every analysis within tidyverse. Packages in the tidyverse change fairly frequently. You can see if updates are available by running:
tidyverse_update()
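Installing is done once, while loading is done in every session. A common pattern, sketched here, is to install a package only when it is missing and then load it. The built-in stats package is used as a stand-in so that the example runs without downloading anything; in practice you would put 'tidyverse' in its place.

```r
pkg <- "stats"  # stand-in package name; replace with "tidyverse" in practice

# requireNamespace() returns TRUE if the package is installed and can be loaded
installed <- requireNamespace(pkg, quietly = TRUE)
if (!installed) {
  install.packages(pkg)  # only runs when the package is missing
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)

# An attached package appears on the search path as "package:<name>"
loaded <- paste0("package:", pkg) %in% search()
loaded
```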
2.7 Misc
You can check and change your working directory by:
getwd()
setwd("..")
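The working directory matters because relative file paths are resolved against it. The sketch below switches to R’s temporary directory, writes a small file using a relative path, and then switches back; the file name note.txt is hypothetical.

```r
# setwd() returns the previous working directory invisibly, so it can be saved
old_dir <- setwd(tempdir())

# A relative path is interpreted relative to the current working directory
writeLines("hello", "note.txt")
file.exists("note.txt")

# Restore the original working directory
setwd(old_dir)

# The file is still there, but now it has to be addressed through its directory
file.exists(file.path(tempdir(), "note.txt"))
```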
Getting help:
help(lm) # function
?lm # function
?"<" # operator
?"for" # loops
Getting examples:
example(lm) # this will actually run examples
If you don’t know quite what you’re looking for:
help.search('multivariate normal')
??'multivariate normal'
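Once the search points you to a function, you can open its help file. Here mvrnorm belongs to the MASS package, so name the package unless it is already loaded:
?MASS::mvrnorm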