Chapter 2 Basics
2.1 Operators
Operator | Description |
---|---|
+
|
addition |
-
|
subtraction |
*
|
multiplication |
/
|
division |
^ or ** | exponentiation |
x%%y | modulus (x mod y) 5%%2 is 1 |
x%/%y | integer division 5%/%2 is 2 |
Operator | Description |
---|---|
< | less than |
<= | less than or equal to |
> | greater than |
>= | greater than or euqal to |
== | exactly equal to |
!= | not equal to |
!x | Not x |
x|y | x OR y |
x&y | x AND y |
isTRUE(x) | test if x is TRUE |
2.2 Functions
Built-in Functions
- R has many built in functions that compute different statistical procedures.
- Functions in R are followed by ( ).
- Inside the parenthesis we write the object (vector, array, matrix, dataframe) to which we want to apply the function.
Function | Description |
---|---|
abs(x) | absolute value |
sqrt(x) | square root |
ceiling(x) | ceiling(3.475) is 4 |
floor(x) | floor(3.475) is 3 |
trunc(x) | trunc(5.99) is 5 |
round(x, digits=n) | round(3.475, digits=2) is 3.48 |
signif(x, digits=n) | signif(3.475, digits=2) is 3.5 |
cos(x), sin(x), tan(x) | also acos(x), cosh(x), acosh(x), etc. |
log(x) | natural logarithm |
log10(x) | common logarithm |
exp(x) | e^x |
Function | Description |
---|---|
substr(x, start=n1, stop=n2) | Extract or replace substrings in a character vector. x <- “abcdef”, substr(x, 2, 4) is “bcd” |
grep(pattern, x, ignore.case=FALSE, fixed=FALSE) | Search for pattern in x. If fixed =FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices. grep(“A”, c(“b”,“A”,“c”), fixed=TRUE) returns 2 |
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) | Find pattern in x and replace with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed = T then pattern is a text string. sub(“”,“.”,“Hello There”) returns “Hello.There” |
strsplit(x, split) | Split the elements of character vector x at split. strsplit(“abc”, “”) returns 3 element vector “a”,“b”,“c” |
paste(…, sep=“”) | Concatenate strings after using sep string to seperate them. paste(“x”,1:3,sep=“”) returns c(“x1”,“x2” “x3”) paste(“x”,1:3,sep=“M”) returns c(“xM1”,“xM2” “xM3”) paste(“Today is”, date()) |
toupper(x) | Uppercase |
tolower(x) | Lowercase |
The following tables describe functions related to probability distributions. For random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.
Function | Description |
---|---|
dnorm(x) | normal density function (by default m=0 sd=1) # plot standard normal curve x <- pretty(c(-3,3), 30) y <- dnorm(x) plot(x, y, type=“l”, xlab=“Normal Deviate”, ylab=“Density”, yaxs=“i”) |
pnorm(q) | cumulative normal probability for q (area under the normal curve to the right of q) pnorm(1.96) is 0.975 |
qnorm(p) | normal quantile. value at the p percentile of normal distribution qnorm(.9) is 1.28 # 90th percentile |
rnorm(n, m=0, sd=1) | n random normal deviates with mean m and standard deviation sd. #50 random normal variates with mean=50, sd=10x <- rnorm(50, m=50, sd=10) |
dbinom(x, size, prob), pbinom(p, size,prob), qbinom(q,size,prob), rbinom(n,size,prob) | binomial distribution where size is the sample size and prob is the probability of a heads (pi) # prob of 0 to 5 heads of fair coin out of 10 flips dbinom(0:5, 10, .5) # prob of 5 or less heads of fair coin out of 10 flips pbinom(5, 10, .5) |
dpois(x, lamda), ppois(q,lamda), qpois(p,lamda), rpois(n,lamda) | poisson distribution with m=std=lamda #probability of 0,1, or 2 events with lamda=4 dpois(0:2, 4) # probability of at least 3 events with lamda=4 1- ppois(2,4) |
dunif(x,min,max=1) | uniform distribution, follows the same pattern |
punif(q,min=0,max=1) | as the normal distribution above. |
qunif(p,min=0,max=1) | #10 uniform random variates |
runif(n,min=0,max=1) | x <- runif(10) |
mean(x,trim=0, na.rm=FALSE) | mean of object x, # trimmed mean, removing any missing values and # 5 percent of highest and lowest scores mx <- mean(x,trim=.05,na.rm=TRUE) |
sd(x) | standard deviation of object(x). also look at var(x) for variance and mad(x) for median absolute deviation. |
median(x) | median |
quantile(x) | quantiles where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1]. # 30th and 84th percentiles of x, y <- quantile(x, c(.3,.84)) |
range(x) | range |
sum(x) | sum |
diff(x,lag=1) | lagged differences, with lag indicating which lag to use |
min(x) | minimum |
max(x) | maximum |
scale(x, center=TRUE, scale=TRUE) | column center or standardize a matrix |
Function | Description |
---|---|
seq(from, to, by) | generate a sequence indices <- seq(1,10,2) #indices is c(1, 3, 5, 7, 9) |
rep(x,ntimes) | repeat x n times y <- rep(1:3, 2) # y is c(1, 2, 3, 1, 2, 3) |
cut(x,n) | divide continuous variable in factor with n levels y <- cut(x, 5) |
length(object) | number of elements or components |
str(object) | structure of an object |
class(object) | class or type of an object |
names(object) | names |
c(object, object,…) | combine objects into a vector |
cbind(object, object,…) | combine objects as columns |
rbind(object, object,…) | combine objects as rows |
ls() | list current objects |
rm(object) | delete an object |
newobject <- edit(object) | create a new object |
fix(object) | edit an object in place |
Functions Applied
R as a Calculator
1250 + 1000
[1] 2250
1250 - 1000
[1] 250
99/3
[1] 33
3^3
[1] 27
4%%2
[1] 0
1+1; 4*5; 6-2
[1] 2 [1] 20 [1] 4
Dealing with NAN and NA’s.
- NAN (not a number)
- NA (missing value)
x <- c(1:8, NA)
####NA is the result
mean(x)
[1] NA
####na.rm removes the NA, so the calculation may be performed
mean(x, na.rm=TRUE)
[1] 4.5
2.3 Data Types
String Characters
- In R, string variables are defined by double quotation marks.
letters <- c("A", "B", "C")
x |
---|
A |
B |
C |
Objects in R
- Objects in R obtain values by assignment.
- This is achieved by the gets arrow, <-, and not the equal sign, =.
- Objects can be of different kinds.
- Vectors, Arrays, Matrices, Subscripts, Dataframes
Vector
- A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components.
- Here is a vector containing three numeric values 2, 3 and 5.
vector <- c(2,3,5)
x |
---|
2 |
3 |
5 |
Array
Arrays are numeric objects with dimension attributes. The difference between a matrix and an array is that arrays have more than two dimensions. The following example creates an array of two 3x3 matrices each with 3 rows and 3 columns.
Create two vectors of different lengths.
vector1 <- c(5,9,3,7,2)
vector2 <- c(10:17)
Take these vectors as into the array.
result <- array(c(vector1, vector2), dim=c(3,3,2))
V1 | V2 | V3 | V4 | V5 | V6 |
---|---|---|---|---|---|
5 | 7 | 11 | 14 | 17 | 3 |
9 | 2 | 12 | 15 | 5 | 7 |
3 | 10 | 13 | 16 | 9 | 2 |
Matrix
A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. The following is an example of a matrix with 2 rows and 3 columns.
matrix <- matrix(c(2,4,3,1,5,7), # the data elements
nrow=2, #number of rows
ncol = 3, #number of columns
byrow = TRUE) #fill matrix by rows
2 | 4 | 3 |
1 | 5 | 7 |
Subscript
- Select only one or some of the elements in a vector, a matrix or an array.
- We can do this by using subscripts in square brackets [ ].
- In matrices or dataframes the first subscript refers to the row and the second to the column.
- R has several ways to subscript (that is, extract specific elements from a vector). The most common way is directly using the square bracket operator:
vector1[4]
[1] 7 In this example, the user has said “give me the fourth element of vector1”.
Here is a similar question: “what are the second and fifth elements of vector1?”
vector1[c(2,5)]
[1] 9 2
Here the c(), of course, constructs the vector (2,5) to be used as the index; then we extract the second and fifth entries of vector1.
Dataframe
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b
n <- c(2,3,5)
s <- c("aa","bb","cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n,s,b) #df is a dataframe
n | s | b |
---|---|---|
2 | aa | TRUE |
3 | bb | FALSE |
5 | cc | TRUE |
Let’s create a sample data set, summarize the data and perform some basic manipulations.
####Create a vector
x <- (1:5)
####Summarize the vector
summary <- summary(x)
####Calculate the mean and median and check if they are equal
mean <- mean(x)
median <- median(x)
equal <- mean==median
####Transform to a data frame
df <- as.data.frame(x)
####Add a calculated column
df$New <- df$x/2
####Rename the columns
names(df)[names(df)=="x"] <- "Column1"
names(df)[names(df)=="New"] <- "Column2"
Column1 | Column2 |
---|---|
1 | 0.5 |
2 | 1.0 |
3 | 1.5 |
4 | 2.0 |
5 | 2.5 |
Tips
- R is case-sensitive.
- Comment your code so you remember what it does; comments are preceded with #.
- R scripts are simply text files with a .R extension.
- Use Ctrl + R to submit code.
- Use the Tab key to let R/R Studio finish typing commands for you.
- Use Shift + down arrow to highlight lines or blocks of code.
- In R Studio: Ctrl + 1 and Ctrl + 2 switches between script and console.
- Use up and down arrows to cycle through previous commands in console.
- Don’t be afraid of errors; you won’t break R.
- If you get stuck, Google is your friend.
2.4 Loops
For loops
In R a while takes this form, where variable is the name of your iteration variable, and sequence is a vector or list of values:
for (variable in sequence) expression
The expression can be a single R command - or several lines of commands wrapped in curly brackets:
for (variable in sequence) { expression expression expression } Here is a quick trivial example, printing the square root of the integers one to ten:
for (x in c(1:10)) print(sqrt(x))
[1] 1 [1] 1.414214 [1] 1.732051 [1] 2 [1] 2.236068 [1] 2.44949 [1] 2.645751 [1] 2.828427 [1] 3 [1] 3.162278
While loops
In R While takes this form, where condition evaluates to a boolean (True/False) and must be wrapped in ordinary brackets:
while (condition) expression
As with a for loop, expression can be a single R command - or several lines of commands wrapped in curly brackets:
while (condition) { expression expression expression }
We’ll start by using a “while loop” to print out the first few Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, … where each number is the sum of the previous two numbers. Create a new R script file, and copy this code into it:
a <- 0
b <- 1
print(a)
[1] 0
while (b < 50) {
print(b)
temp <- a + b
a <- b
b <- temp
}
[1] 1 [1] 1 [1] 2 [1] 3 [1] 5 [1] 8 [1] 13 [1] 21 [1] 34
This next version builds up the answer gradually using a vector, which it prints at the end:
x <- c(0,1)
while (length(x) < 10) {
position <- length(x)
new <- x[position] + x[position-1]
x <- c(x,new)
}
print(x)
To understand how this manages to append the new value to the end of the vector x, try this at the command prompt:
x <- c(1,2,3,4)
c(x,5)
[1] 1 2 3 4 5
Writing Functions
This following script uses the function() command to create a function (based on the code above) which is then stored as an object with the name Fibonacci:
Fibonacci <- function(n) {
x <- c(0,1)
while (length(x) < n) {
position <- length(x)
new <- x[position] + x[position-1]
x <- c(x,new)
}
return(x)
}
Once you run this code, there will be a new function available which we can now test:
Fibonacci(10)
[1] 0 1 1 2 3 5 8 13 21 34
Fibonacci(3)
[1] 0 1 1
Fibonacci(2)
[1] 0 1
Fibonacci(1)
[1] 0 1
That seems to work nicely - except in the case n == 1 where the function is returning the first two Fibonacci numbers! This gives us an excuse to introduce the if statement.
The If statement In order to fix our function we can do this:
Fibonacci <- function(n) {
if (n==1) return(0)
x <- c(0,1)
while (length(x) < n) {
position <- length(x)
new <- x[position] + x[position-1]
x <- c(x,new)
}
return(x)
}
In the above example we are using the simplest possible if statement:
if (condition) expression The if statement can also be used like this:
if (condition) expression else expression And, much like the while and for loops the expression can be multiline with curly brackets:
Fibonacci <- function(n) {
if (n==1) {
x <- 0
} else {
x <- c(0,1)
while (length(x) < n) {
position <- length(x)
new <- x[position] + x[position-1]
x <- c(x,new)
}
}
return(x)
}
Fibonacci(1)
[1] 0