Business Analytics with R

2.1 Operators

Table 2.1: Arithmetic Operators
Operator	Description
`+`	addition
`-`	subtraction
`*`	multiplication
`/`	division
^ or **	exponentiation
x%%y	modulus (x mod y) 5%%2 is 1
x%/%y	integer division 5%/%2 is 2
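
The integer division and modulus operators are easiest to understand together. The following is a small sketch with made-up numbers (the variable names are illustrative only) showing that the two results recombine into the original value:

```r
# Pack 17 items into boxes of 5 (arbitrary example numbers)
items <- 17
box_size <- 5

full_boxes <- items %/% box_size   # integer division: 3 full boxes
left_over  <- items %% box_size    # modulus: 2 items left over

# Quotient and remainder recombine into the original value
recombined <- full_boxes * box_size + left_over   # 17

# ^ and ** are two spellings of the same exponentiation operator
same_power <- (2^10 == 2**10)   # TRUE
```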

Table 2.2: Logical Operators
Operator	Description
<	less than
<=	less than or equal to
>	greater than
>=	greater than or equal to
==	exactly equal to
!=	not equal to
!x	Not x
x|y	x OR y
x&y	x AND y
isTRUE(x)	test if x is TRUE
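
Logical operators work element by element on vectors. As a brief sketch (the sales figures are invented for illustration), the comparisons below each return one TRUE or FALSE per element, while isTRUE() only returns TRUE for a single TRUE value:

```r
sales <- c(120, 80, 200, 150)   # made-up sales figures

above_100 <- sales > 100                   # TRUE FALSE TRUE TRUE
mid_range <- sales >= 100 & sales <= 150   # TRUE FALSE FALSE TRUE
extreme   <- sales < 100 | sales > 180     # FALSE TRUE TRUE FALSE
not_above <- !above_100                    # FALSE TRUE FALSE FALSE

# isTRUE() expects a single value, so a length-4 vector gives FALSE
whole_vector <- isTRUE(above_100)          # FALSE
any_above    <- isTRUE(any(above_100))     # TRUE
all_above    <- isTRUE(all(above_100))     # FALSE

# TRUE counts as 1, so sum() counts the matches
n_above <- sum(above_100)                  # 3
```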

2.2 Functions

Built-in Functions

R has many built-in functions that compute different statistical procedures.
Functions in R are followed by parentheses ( ).
Inside the parentheses we write the object (vector, array, matrix, dataframe) to which we want to apply the function.

Table 2.3: Numeric Functions
Function	Description
abs(x)	absolute value
sqrt(x)	square root
ceiling(x)	ceiling(3.475) is 4
floor(x)	floor(3.475) is 3
trunc(x)	trunc(5.99) is 5
round(x, digits=n)	round(3.475, digits=2) is 3.48
signif(x, digits=n)	signif(3.475, digits=2) is 3.5
cos(x), sin(x), tan(x)	also acos(x), cosh(x), acosh(x), etc.
log(x)	natural logarithm
log10(x)	common logarithm
exp(x)	e^x
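
The rounding functions in Table 2.3 differ most visibly on negative numbers. The sketch below uses arbitrary values to show the difference, along with a few of the other numeric functions:

```r
# trunc() drops the fractional part, floor() rounds down, ceiling() rounds up
t_neg <- trunc(-5.99)     # -5
f_neg <- floor(-5.99)     # -6
c_neg <- ceiling(-5.99)   # -5

# round() works on decimal places, signif() on significant digits
r2 <- round(3.14159, digits = 2)   # 3.14
s2 <- signif(123456, digits = 2)   # 120000

# log() and exp() are inverses; log10() is the base-10 logarithm
back  <- log(exp(2))      # 2
power <- log10(1000)      # 3
root  <- sqrt(abs(-16))   # 4
```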

Table 2.4: Character Functions
Function	Description
substr(x, start=n1, stop=n2)	Extract or replace substrings in a character vector. x <- "abcdef"; substr(x, 2, 4) is "bcd"
grep(pattern, x, ignore.case=FALSE, fixed=FALSE)	Search for pattern in x. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. Returns matching indices. grep("A", c("b","A","c"), fixed=TRUE) returns 2
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)	Find the first match of pattern in x and replace it with replacement text. If fixed=FALSE then pattern is a regular expression. If fixed=TRUE then pattern is a text string. sub("\\s", ".", "Hello There") returns "Hello.There"
strsplit(x, split)	Split the elements of character vector x at split. strsplit("abc", "") returns a list whose single element is the 3-element vector "a","b","c"
paste(..., sep=" ")	Concatenate strings, using the sep string to separate them. paste("x",1:3,sep="") returns c("x1","x2","x3"); paste("x",1:3,sep="M") returns c("xM1","xM2","xM3"); paste("Today is", date())
toupper(x)	Uppercase
tolower(x)	Lowercase
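
The character functions are often chained together. As an illustrative sketch (the product name and the resulting code are invented for this example), the following splits a string, abbreviates each part and joins the result:

```r
product <- "Widget-Blue-Large"   # hypothetical product name

# strsplit() returns a list, so [[1]] extracts the character vector
parts <- strsplit(product, "-", fixed = TRUE)[[1]]   # "Widget" "Blue" "Large"

# substr() and toupper() are applied to every element
codes <- toupper(substr(parts, 1, 3))                # "WID" "BLU" "LAR"

# collapse joins the elements of one vector into a single string
sku <- paste(codes, collapse = "")                   # "WIDBLULAR"

# sub() replaces only the first match
first_only <- sub("-", " ", product, fixed = TRUE)   # "Widget Blue-Large"

# grep() returns the index of the matching element
blue_at <- grep("Blue", parts, fixed = TRUE)         # 2
```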

The following table describes functions related to probability distributions, along with common summary statistics. For the random number generators below, you can use set.seed(1234) or some other integer to create reproducible pseudo-random numbers.

Table 2.5: Statistical / Probability Functions
Function	Description
dnorm(x)	normal density function (by default mean=0, sd=1). Example, plotting the standard normal curve: x <- pretty(c(-3,3), 30); y <- dnorm(x); plot(x, y, type="l", xlab="Normal Deviate", ylab="Density", yaxs="i")
pnorm(q)	cumulative normal probability for q (area under the normal curve to the left of q). pnorm(1.96) is 0.975
qnorm(p)	normal quantile: the value at the p-th quantile of the normal distribution. qnorm(.9) is 1.28 (the 90th percentile)
rnorm(n, mean=0, sd=1)	n random normal deviates with the given mean and standard deviation sd. Example, 50 random normal variates with mean=50, sd=10: x <- rnorm(50, mean=50, sd=10)
dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob), rbinom(n, size, prob)	binomial distribution where size is the number of trials and prob is the probability of a head on each trial. Probabilities of exactly 0, 1, ..., 5 heads of a fair coin out of 10 flips: dbinom(0:5, 10, .5). Probability of 5 or fewer heads of a fair coin out of 10 flips: pbinom(5, 10, .5)
dpois(x, lambda), ppois(q, lambda), qpois(p, lambda), rpois(n, lambda)	Poisson distribution with mean = variance = lambda. Probabilities of exactly 0, 1, or 2 events with lambda=4: dpois(0:2, 4). Probability of at least 3 events with lambda=4: 1 - ppois(2, 4)
dunif(x, min=0, max=1), punif(q, min=0, max=1), qunif(p, min=0, max=1), runif(n, min=0, max=1)	uniform distribution; follows the same pattern as the normal distribution above. Example, 10 uniform random variates: x <- runif(10)
mean(x, trim=0, na.rm=FALSE)	mean of object x. Example, a trimmed mean that removes any missing values and the 5 percent highest and lowest scores: mx <- mean(x, trim=.05, na.rm=TRUE)
sd(x)	standard deviation of object x. Also look at var(x) for variance and mad(x) for median absolute deviation.
median(x)	median
quantile(x, probs)	quantiles, where x is the numeric vector whose quantiles are desired and probs is a numeric vector with probabilities in [0,1]. Example, the 30th and 84th percentiles of x: y <- quantile(x, c(.3,.84))
range(x)	range
sum(x)	sum
diff(x,lag=1)	lagged differences, with lag indicating which lag to use
min(x)	minimum
max(x)	maximum
scale(x, center=TRUE, scale=TRUE)	column center or standardize a matrix
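
The d/p/q/r naming pattern in Table 2.5 can be checked with a few lines of code. The following is a minimal sketch (the variable names are illustrative) showing that set.seed() makes random draws repeatable, that pnorm() and qnorm() undo each other, and that a cumulative probability is the sum of the individual densities:

```r
# The same seed gives the same pseudo-random draws
set.seed(1234)
draws1 <- rnorm(5, mean = 50, sd = 10)
set.seed(1234)
draws2 <- rnorm(5, mean = 50, sd = 10)
same_draws <- identical(draws1, draws2)   # TRUE

# pnorm() maps a value to a probability, qnorm() maps it back
p <- pnorm(1.96)   # about 0.975
q <- qnorm(p)      # about 1.96

# P(5 or fewer heads in 10 flips), computed two ways
cumulative <- pbinom(5, 10, 0.5)
summed     <- sum(dbinom(0:5, 10, 0.5))
```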

Table 2.6: Other Functions
Function	Description
seq(from, to, by)	generate a sequence. indices <- seq(1,10,2) gives c(1, 3, 5, 7, 9)
rep(x, times)	repeat x the given number of times. y <- rep(1:3, 2) gives c(1, 2, 3, 1, 2, 3)
cut(x, n)	divide a continuous variable into a factor with n levels. y <- cut(x, 5)
length(object)	number of elements or components
str(object)	structure of an object
class(object)	class or type of an object
names(object)	names
c(object, object,…)	combine objects into a vector
cbind(object, object,…)	combine objects as columns
rbind(object, object,…)	combine objects as rows
ls()	list current objects
rm(object)	delete an object
newobject <- edit(object)	edit a copy of an object and save it as a new object
fix(object)	edit an object in place
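
Several of the functions in Table 2.6 are typically used together when building small data sets. The sketch below uses invented ages and group labels to show seq(), rep(), cbind(), rbind() and cut() side by side:

```r
ages <- c(23, 35, 47, 59, 61)   # made-up values

id    <- seq(1, 9, by = 2)                      # 1 3 5 7 9
group <- rep(c("a", "b"), times = c(2, 3))      # "a" "a" "b" "b" "b"

by_col <- cbind(id, ages)   # 5 rows, 2 columns
by_row <- rbind(id, ages)   # 2 rows, 5 columns

# cut() turns a numeric vector into a factor with the requested number of levels
age_band <- cut(ages, 3)

n_ages <- length(ages)      # 5
kind   <- class(age_band)   # "factor"
```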

Functions Applied

R as a Calculator

1250 + 1000

[1] 2250

1250 - 1000

[1] 250

99/3

[1] 33

3^3

[1] 27

4%%2

[1] 0

1+1; 4*5; 6-2

[1] 2
[1] 20
[1] 4

Dealing with NaN and NA values

NaN (not a number)
NA (missing value)

x <- c(1:8, NA)

####NA is the result
mean(x)

[1] NA

####na.rm removes the NA, so the calculation may be performed
mean(x, na.rm=TRUE)

[1] 4.5
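
NA and NaN can be told apart with is.na() and is.nan(). As a small sketch (the vector is invented for illustration), note that is.na() is TRUE for both kinds of value, while is.nan() is TRUE only for NaN:

```r
y <- c(1, NA, 0/0, 4)   # 0/0 produces NaN

missing_any <- is.na(y)    # FALSE TRUE TRUE FALSE
not_number  <- is.nan(y)   # FALSE FALSE TRUE FALSE

# Count the problem values before deciding how to handle them
n_missing <- sum(is.na(y))   # 2

# na.rm = TRUE drops both NA and NaN
clean_mean <- mean(y, na.rm = TRUE)   # 2.5
```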

2.3 Data Types

String Characters

In R, character strings are written between quotation marks. Double quotes are the usual convention, and single quotes also work.

letters <- c("A", "B", "C")

Table 2.7: Character String
x
A
B
C

Objects in R

Objects in R obtain values by assignment.
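By convention this is done with the gets arrow, <-, rather than the equal sign, =, which also assigns at the top level but is normally reserved for function arguments.
Objects can be of different kinds. The sections below cover vectors, arrays, matrices, subscripts and dataframes.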

Vector

A vector is a sequence of data elements of the same basic type. Members in a vector are officially called components.
Here is a vector containing three numeric values 2, 3 and 5.

vector <- c(2,3,5)

Table 2.8: Vector
x
2
3
5

Array

Arrays are objects with a dimension attribute. A matrix is the two-dimensional special case, while arrays in general can have more than two dimensions. The following example creates an array holding two matrices, each with 3 rows and 3 columns.

Create two vectors of different lengths.

vector1 <- c(5,9,3,7,2)
vector2 <- c(10:17)

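Use these vectors as the input to the array. Together they supply 13 values for 18 cells, so R recycles the values from the start to fill the remainder.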

result <- array(c(vector1, vector2), dim=c(3,3,2))

Table 2.9: Array (columns V1 to V3 hold the first matrix, V4 to V6 the second)
V1	V2	V3	V4	V5	V6
5	7	11	14	17	3
9	2	12	15	5	7
3	10	13	16	9	2
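
The third subscript selects one of the stacked matrices. The following sketch rebuilds the array from above and pulls out individual slices and cells; the values in the comments follow from Table 2.9:

```r
vector1 <- c(5, 9, 3, 7, 2)
vector2 <- c(10:17)
result <- array(c(vector1, vector2), dim = c(3, 3, 2))

dims <- dim(result)      # 3 3 2

# Leave a subscript empty to keep everything along that dimension
first <- result[, , 1]   # the first 3x3 matrix

# row 2, column 1 of the second matrix
cell_a <- result[2, 1, 2]   # 15

# row 1, column 3 of the second matrix holds a recycled value from vector1
cell_b <- result[1, 3, 2]   # 3
```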

Matrix

A matrix is a collection of data elements arranged in a two-dimensional rectangular layout. The following is an example of a matrix with 2 rows and 3 columns.

matrix <- matrix(c(2,4,3,1,5,7), # the data elements
            nrow=2, #number of rows
            ncol = 3, #number of columns
            byrow = TRUE) #fill matrix by rows

Table 2.10: Matrix
2	4	3
1	5	7
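
The byrow argument controls the order in which the data elements fill the matrix. As a brief sketch (the object names by_row and by_col are chosen for this example), the same six values give different matrices depending on that setting:

```r
values <- c(2, 4, 3, 1, 5, 7)

by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(values, nrow = 2, ncol = 3)   # default is byrow = FALSE

row1_by_row <- by_row[1, ]   # 2 4 3
row1_by_col <- by_col[1, ]   # 2 3 5

# t() transposes a matrix, swapping rows and columns
flipped <- dim(t(by_row))    # 3 2
```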

Subscript

Subscripts let us select only one or some of the elements in a vector, a matrix or an array.
We can do this by using subscripts in square brackets [ ].
In matrices or dataframes the first subscript refers to the row and the second to the column.
R has several ways to subscript (that is, extract specific elements from a vector). The most common way is directly using the square bracket operator:

vector1[4]

[1] 7

In this example, the user has said “give me the fourth element of vector1”.

Here is a similar question: “what are the second and fifth elements of vector1?”

vector1[c(2,5)]

[1] 9 2

Here the c(), of course, constructs the vector (2,5) to be used as the index; then we extract the second and fifth entries of vector1.
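
Beyond positive positions, square brackets also accept negative numbers and logical conditions, and take two subscripts for a matrix. The following is a short sketch reusing the values from the earlier examples (the matrix is named m here to keep the example self-contained):

```r
vector1 <- c(5, 9, 3, 7, 2)

# A negative subscript drops that element
without_first <- vector1[-1]        # 9 3 7 2

# A logical condition keeps the elements where it is TRUE
large <- vector1[vector1 > 4]       # 5 9 7

# For a matrix: [row, column]; an empty subscript means "all"
m <- matrix(c(2, 4, 3, 1, 5, 7), nrow = 2, byrow = TRUE)
one_cell   <- m[2, 3]   # 7
second_col <- m[, 2]    # 4 5
first_row  <- m[1, ]    # 2 4 3
```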

Dataframe

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s and b.

n <- c(2,3,5)
s <- c("aa","bb","cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n,s,b) #df is a dataframe

n	s	b
2	aa	TRUE
3	bb	FALSE
5	cc	TRUE
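
Columns and rows of a data frame can be reached with $ and with square brackets. The sketch below rebuilds df and shows a few common access patterns. Setting stringsAsFactors = FALSE is an assumption added here so that the s column stays character in every R version:

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc")
b <- c(TRUE, FALSE, TRUE)
df <- data.frame(n, s, b, stringsAsFactors = FALSE)

col_n <- df$n          # 2 3 5
cell  <- df[2, "s"]    # "bb"

# A logical column can be used directly to filter rows
true_rows <- df[df$b, ]   # rows 1 and 3

size <- c(nrow(df), ncol(df))   # 3 3
```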

Let’s create a sample data set, summarize the data and perform some basic manipulations.

####Create a vector
x <- 1:5
####Summarize the vector
x_summary <- summary(x)
####Calculate the mean and median and check if they are equal
####(the x_ prefix avoids masking the built-in functions of the same name)
x_mean <- mean(x)
x_median <- median(x)
equal <- x_mean == x_median
####Transform to a data frame
df <- as.data.frame(x)
####Add a calculated column
df$New <- df$x/2
####Rename the columns
names(df)[names(df)=="x"] <- "Column1"
names(df)[names(df)=="New"] <- "Column2"

Table 2.11: Dataframe
Column1	Column2
1	0.5
2	1.0
3	1.5
4	2.0
5	2.5

Tips

R is case-sensitive.
Comment your code so you remember what it does; comments are preceded with #.
R scripts are simply text files with a .R extension.
Use Ctrl + R (in the Windows R GUI) or Ctrl + Enter (in R Studio) to submit code.
Use the Tab key to let R/R Studio finish typing commands for you.
Use Shift + down arrow to highlight lines or blocks of code.
In R Studio: Ctrl + 1 and Ctrl + 2 switch between script and console.
Use up and down arrows to cycle through previous commands in console.
Don’t be afraid of errors; you won’t break R.
If you get stuck, Google is your friend.

2.4 Loops

For loops

In R a for loop takes this form, where variable is the name of your iteration variable, and sequence is a vector or list of values:

for (variable in sequence) expression

The expression can be a single R command - or several lines of commands wrapped in curly brackets:

for (variable in sequence) {
    expression
    expression
    expression
}

Here is a quick trivial example, printing the square root of the integers one to ten:

for (x in c(1:10)) print(sqrt(x))

[1] 1
[1] 1.414214
[1] 1.732051
[1] 2
[1] 2.236068
[1] 2.44949
[1] 2.645751
[1] 2.828427
[1] 3
[1] 3.162278
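
A for loop is often used to build up a result one step at a time. As a minimal sketch (the prices are invented, and seq_along() is used here to generate the positions 1, 2, 3), the loop below computes a running total and compares it with the built-in cumsum():

```r
prices <- c(10, 20, 30)   # made-up values

running_total <- numeric(length(prices))   # a vector of zeros to fill in
total <- 0
for (i in seq_along(prices)) {
    total <- total + prices[i]
    running_total[i] <- total
}
# running_total is now 10 30 60

# Many loops have a vectorised equivalent that needs no loop at all
same_result <- identical(running_total, cumsum(prices))   # TRUE
roots <- sqrt(1:10)   # the ten square roots from the example above
```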

While loops

In R a while loop takes this form, where condition evaluates to a logical value (TRUE/FALSE) and must be wrapped in parentheses:

while (condition) expression

As with a for loop, expression can be a single R command - or several lines of commands wrapped in curly brackets:

while (condition) {
    expression
    expression
    expression
}

We’ll start by using a “while loop” to print out the first few Fibonacci numbers: 0, 1, 1, 2, 3, 5, 8, 13, … where each number is the sum of the previous two numbers. Create a new R script file, and copy this code into it:

a <- 0
b <- 1
print(a)

[1] 0

while (b < 50) {
    print(b)
    temp <- a + b
    a <- b
    b <- temp
}

[1] 1
[1] 1
[1] 2
[1] 3
[1] 5
[1] 8
[1] 13
[1] 21
[1] 34

This next version builds up the answer gradually using a vector, which it prints at the end:

x <- c(0,1)
while (length(x) < 10) {
    position <- length(x)
    new <- x[position] + x[position-1]
    x <- c(x,new)
}
print(x)

To understand how this manages to append the new value to the end of the vector x, try this at the command prompt:

x <- c(1,2,3,4)
c(x,5)

[1] 1 2 3 4 5
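
Appending with c() copies the vector each time it grows. When the final length is known in advance, an alternative is to create the full vector first and fill it by position. The following is a sketch of that approach for the first ten Fibonacci numbers (the name fib is chosen for this example and is not part of the scripts above):

```r
n <- 10
fib <- numeric(n)   # ten zeros; fib[1] is already the first Fibonacci number, 0
fib[2] <- 1

# Fill the remaining positions from the two preceding values
for (i in 3:n) {
    fib[i] <- fib[i - 1] + fib[i - 2]
}
# fib is now 0 1 1 2 3 5 8 13 21 34
```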

Writing Functions

The following script uses the function() command to create a function (based on the code above) which is then stored as an object with the name Fibonacci:

Fibonacci <- function(n) {
    x <- c(0,1)
    while (length(x) < n) {
        position <- length(x)
        new <- x[position] + x[position-1]
        x <- c(x,new)
    }
    return(x)
}

Once you run this code, there will be a new function available which we can now test:

Fibonacci(10)

[1] 0 1 1 2 3 5 8 13 21 34

Fibonacci(3)

[1] 0 1 1

Fibonacci(2)

[1] 0 1

Fibonacci(1)

[1] 0 1

That seems to work nicely - except in the case n == 1 where the function is returning the first two Fibonacci numbers! This gives us an excuse to introduce the if statement.

The If statement

In order to fix our function we can do this:

Fibonacci <- function(n) {
    if (n==1) return(0)
    x <- c(0,1)
    while (length(x) < n) {
        position <- length(x)
        new <- x[position] + x[position-1]
        x <- c(x,new)
    }
    return(x)
}

In the above example we are using the simplest possible if statement:

if (condition) expression

The if statement can also be used like this:

if (condition) expression else expression

And, much like the while and for loops, the expression can be multiline with curly brackets:

Fibonacci <- function(n) {
    if (n==1) {
        x <- 0
    } else {
        x <- c(0,1)
        while (length(x) < n) {
            position <- length(x)
            new <- x[position] + x[position-1]
            x <- c(x,new)
        }
    }
    return(x)
}

Fibonacci(1)

[1] 0
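
Several conditions can be chained with else if. The function above still returns two numbers when n is 0, so the following sketch adds one more branch for that case. The name Fibonacci2 and the choice to return an empty vector for n < 1 are assumptions made for this illustration:

```r
# Hypothetical variant of Fibonacci() with an extra branch for n < 1
Fibonacci2 <- function(n) {
    if (n < 1) {
        x <- numeric(0)   # assumed behaviour: nothing to return
    } else if (n == 1) {
        x <- 0
    } else {
        x <- c(0, 1)
        while (length(x) < n) {
            position <- length(x)
            new <- x[position] + x[position - 1]
            x <- c(x, new)
        }
    }
    return(x)
}

none  <- Fibonacci2(0)   # empty numeric vector
one   <- Fibonacci2(1)   # 0
five  <- Fibonacci2(5)   # 0 1 1 2 3
```

Only the first branch whose condition is TRUE is run, so the order of the tests matters: the n < 1 check has to come before the general case.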
