# 5 Basics

R can create advanced outputs. This section aims to familiarize readers with the basic principles of R before creating or reproducing advanced outputs.

## 5.1 Functions

A programmable calculator enables its users to write and save functions. It is useful for an R user to understand how R functions work.

### 5.1.1 R as a Basic Calculator

R can be used as a calculator. Each example below shows an R expression followed by its output.

\[\begin{equation} 1+1=2 \tag{5.1} \end{equation}\]

```
1+1
## [1] 2
```

```
1-1
## [1] 0
```

```
1 + (2 / 3) - (2 * 6.5)
## [1] -11.33333
```

```
sin(30) + 4^3 + log(4) + exp(3) + sqrt(7)
## [1] 87.12955
```

When typed and run in R, Equations (5.1) through (5.4) are calculated but the results are not kept. If the outcome of an operation will be used later, it should be assigned a name so that it can be recalled easily. Assigned objects stay in the R environment throughout the session. Assignment can be done with “=”, “<-” or “<<-”; the last one is mainly used inside functions to modify an object defined outside them. This book uses “=”. Let’s assign names to the outputs of Equations (5.1) through (5.4).

```
a=1 + 1
b=1 - 1
c=1 + (2 / 3) - (2 * 6.5)
d=sin(30) + 4^3 + log(4) + exp(3) + sqrt(7)
```
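
The assignment operators mentioned above give the same result at the top level of a script, but *<<-* behaves differently inside a function. The following is a minimal sketch of that difference; the names *x1*, *x2*, *counter*, *add_one* and *local_copy* are hypothetical and used only for illustration.

```r
x1 = 1 + 1      # assignment with =
x2 <- 1 + 1     # assignment with <-, same result at the top level
identical(x1, x2)
## [1] TRUE

counter = 0
add_one = function() {
  counter <<- counter + 1   # <<- modifies 'counter' defined outside the function
  local_copy = counter      # = inside a function creates a local object
  return(local_copy)
}
add_one()
## [1] 1
counter                     # the outside object has changed
## [1] 1
exists("local_copy")        # the local object is not kept after the call
## [1] FALSE
```

Because *<<-* changes objects outside the function that uses it, it is usually reserved for special cases; “=” or “<-” is enough for everyday work.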

It is possible to operate with these assigned variables.

```
a+b+c+d
## [1] 77.79622
```

It is possible to overwrite an existing variable.

```
e=3+2
e
## [1] 5
e=e+10
e
## [1] 15
```

It is possible to store a value under a new name. (Note: R is case sensitive, so *Equation1_output* and *equation1_output* would be different names.)

```
Equation1_output=a
Equation1_output + b + c + d #is equal to a+b+c+d
## [1] 77.79622
```

### 5.1.2 R as a Programmable Calculator

A function basically has 3 parts: an input, a process and an output. As an analogy, assume that the functions below are created by a teacher to examine test scores.

#### 5.1.2.1 Single input - Single output

A simple function named *constant5* is given below. Let’s assume it adds 5 points to each score. The *constant5* function takes a value, adds 5 and produces an output.

```
constant5=function(input){
output=input+5
return(output)
}
constant5(input=50)
## [1] 55
constant5(100)
## [1] 105
constant5(120)
## [1] 125
```

With the above code, we use R as a programmable calculator. We define *constant5* as a *function* that takes an input, processes it by adding 5 *(input+5)*, creates an output *(output=input+5)* and reports it *(return(output))*. All these steps should be given inside *{ }*.

Another simple function, *systematic1*, adds 1% to each score. It takes a value, adds 1% and produces an output.

```
systematic1=function(input){
output=input+(input/100)
return(output)
}
systematic1(input=50)
## [1] 50.5
systematic1(100)
## [1] 101
systematic1(120)
## [1] 121.2
```

#### 5.1.2.2 Multiple input - Single output

The two examples above use a single value as input. Let us use two values for the *nomistake* function. In this example, let’s say the teacher cuts 0.2 points for each spelling mistake. For example, a grade of 90 goes down to 88.8 if there are 6 spelling errors. The *nomistake* function asks for a grade and the number of spelling errors to calculate the reduced grade.

```
nomistake=function(grade, nserror){
output=grade - (0.2 * nserror)
return(output)
}
nomistake(grade=90,nserror=6)
## [1] 88.8
nomistake(90,17)
## [1] 86.6
```

Inputs for an R function are generally called *arguments*. *nomistake* is programmed to receive 2 arguments to calculate one single output. It is possible to create functions with multiple arguments and multiple outcomes.

#### 5.1.2.3 Multiple input - Multiple output

The *feedback* function asks for the number of correct responses and the points for each to calculate a total score. It also reports how many more correct responses are needed for a full score of 100.

```
feedback=function(correct, point){
total=correct*point
remained=(100-total)/point
output=paste("score:", total, "missed items:", remained)
return(output)
}
feedback(correct=20,point=2)
## [1] "score: 40 missed items: 30"
feedback(27,2)
## [1] "score: 54 missed items: 23"
```
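
The *feedback* function above packs both results into a single character string. If the two values are needed separately for later calculations, one option is to return a named list. The sketch below is an illustrative variant and not part of the original example; the name *feedback_list* and the list names *score* and *missed* are hypothetical.

```r
feedback_list = function(correct, point) {
  total = correct * point
  remained = (100 - total) / point
  # a named list keeps each output as a separate numeric value
  return(list(score = total, missed = remained))
}
result = feedback_list(correct = 20, point = 2)
result$score
## [1] 40
result$missed
## [1] 30
```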

#### 5.1.2.4 Basic error

R functions need their arguments to work. See the error below, which appears if you forget to supply the *point* argument to the *feedback* function.

```
feedback=function(correct, point){
total=correct*point
remained=(100-total)/point
output=paste("score:", total, "missed items:", remained)
return(output)
}
feedback(correct=20)
## Error in feedback(correct = 20): argument "point" is missing, with no default
```
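
The phrase "with no default" in the error message hints at one way to avoid it: an argument can be given a default value when the function is defined. The sketch below assumes, purely for illustration, that each item is usually worth 2 points; the name *feedback_default* is hypothetical.

```r
feedback_default = function(correct, point = 2) {
  total = correct * point
  remained = (100 - total) / point
  output = paste("score:", total, "missed items:", remained)
  return(output)
}
feedback_default(correct = 20)             # point falls back to its default, 2
## [1] "score: 40 missed items: 30"
feedback_default(correct = 20, point = 5)  # the default can still be overridden
## [1] "score: 100 missed items: 0"
```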

#### 5.1.2.5 Basic warning

R functions can produce warnings. Let us create *nomistake2*, which calculates the remaining score after cuts.

```
nomistake2=function(grade, nserror){
output=grade - (0.2 * nserror)
return(output)
}
nomistake2(grade=50,nserror=10)
## [1] 48
```

We can produce a warning if the final score is lower than 0.

```
nomistake2=function(grade, nserror){
output=grade - (0.2 * nserror)
if (output<0)
warning("Final score is lower than 0")
return(output)
}
nomistake2(grade=10,nserror=60)
## Warning in nomistake2(grade = 10, nserror = 60): Final score is lower than
## 0
## [1] -2
```

#### 5.1.2.6 Basic failure

A function can stop. Let us create *nomistake3*, which calculates the final score. However, this time it stops if the score is lower than 20 to avoid further cuts.

```
nomistake3=function(grade, nserror){
if ((grade)<(20))
stop("Score is already low")
output=grade - (0.2 * nserror)
return(output)
}
nomistake3(10,9)
## Error in nomistake3(10, 9): Score is already low
```
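
When a stopping function is called inside a longer script, the error halts the whole script. One common way to keep going is to catch the error with *tryCatch*. The following is a minimal sketch; the wrapper name *safe_nomistake3* and the choice to return *NA* on failure are assumptions made for illustration.

```r
nomistake3 = function(grade, nserror) {
  if (grade < 20)
    stop("Score is already low")
  output = grade - (0.2 * nserror)
  return(output)
}
# hypothetical wrapper: returns NA instead of stopping the script
safe_nomistake3 = function(grade, nserror) {
  tryCatch(
    nomistake3(grade, nserror),
    error = function(e) {
      message("Caught: ", conditionMessage(e))  # report what went wrong
      return(NA)
    }
  )
}
safe_nomistake3(10, 9)   # the error is caught and NA is returned
## Caught: Score is already low
## [1] NA
safe_nomistake3(90, 6)   # no error, so the usual result is returned
## [1] 88.8
```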

### 5.1.3 Help!

Although applied R users do not need to write new functions, they should know the principles of how R functions work. Whenever an R function throws a warning or an error, it generally is caused by the users (or their data) rather than the function itself.

R basically runs on functions. Researchers write functions, place them in R packages and make them available. There are currently 10000+ R packages available via the Comprehensive R Archive Network (CRAN). R version 3.3.1 downloads to your computer with 30 packages that include thousands of functions.

One of the main packages downloaded to your computer is called *base*, and it has 1200+ functions. For example, this package has the *mean* function to calculate the arithmetic mean. Packages and functions are generally well documented. Users should make effective use of the documentation via the *help* function, *?* or *??*. The *example* function may also be helpful.

```
help("base") # see description, you can click on index at the bottom to see 1200+ functions
help(mean) # see the mean function and its arguments
?mean # see the mean function and its arguments
??mean # search all installed help pages for the word "mean"
example(mean) # see an example
```

## 5.2 R Data Types

Vectors, matrices, variable types, factors, missing values and data frames are briefly introduced.

### 5.2.1 Vectors

R can create vectors using the *c()* function. Let’s create grades for 10 students.

```
grades=c(40,50,53,65,72,77,79,81,86,90)
grades
## [1] 40 50 53 65 72 77 79 81 86 90
```

R can operate with vectors.

```
grades=c(40,50,53,65,72,77,79,81,86,90)
grades+10
## [1] 50 60 63 75 82 87 89 91 96 100
grades+(grades*0.10)
## [1] 44.0 55.0 58.3 71.5 79.2 84.7 86.9 89.1 94.6 99.0
grades*grades
## [1] 1600 2500 2809 4225 5184 5929 6241 6561 7396 8100
grades2=c(30,40,46,58,64,66,69,72,74,81)
(grades+grades2)/2
## [1] 35.0 45.0 49.5 61.5 68.0 71.5 74.0 76.5 80.0 85.5
grades*0.4 + grades2*0.6
## [1] 34.0 44.0 48.8 60.8 67.2 70.4 73.0 75.6 78.8 84.6
```
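
Functions also work on whole vectors. As a small sketch, the *constant5* function from Section 5.1.2.1 is redefined here so the block runs on its own and is applied to all ten grades at once, followed by two built-in summaries.

```r
grades = c(40,50,53,65,72,77,79,81,86,90)
constant5 = function(input){
  output = input + 5
  return(output)
}
constant5(grades)   # 5 points are added to every element
## [1] 45 55 58 70 77 82 84 86 91 95
mean(grades)        # arithmetic mean of the ten grades
## [1] 69.3
length(grades)      # number of elements in the vector
## [1] 10
```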

There are useful functions to create vectors. For example, the *rep* function (see *example(rep)*) is helpful for repeating values.

The *rnorm* function can create random numbers. If you run *?rnorm* you will see it has three arguments, *rnorm(n, mean = 0, sd = 1)*. The number of observations (*n*) must be provided. By default the mean is 0 and the standard deviation is 1. You can change the defaults, for example by running *rnorm(12,mean=10,sd=2)* to create 12 observations from a normal distribution with mean 10 and standard deviation 2. A similar function is *runif(n, min = 0, max = 1)*, which generates *n* observations from a uniform distribution on the interval from minimum=0 to maximum=1. You can change the interval, for example by running *runif(12, min = 10, max = 37)*.

```
a=1:12 # a is a regular sequence from 1 to 12 created with ':'
rep(0,12) # repeat zero 12 times
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
rep(1:5,each=3) # repeat 1 to 5 each 3 times
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
rep(1:5,times=3) # repeat 1 to 5 , 3 times
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
seq(from=1,to=12) # create 1 to 12 sequence
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
seq(1,25,by=2) # create 1 to 25 by 2
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25
seq(1,6,by=0.5) # create 1 to 6 by 0.5
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
rnorm(12) # create 12 random observations from ~N(0,1)
## [1] 0.327915830 0.145675615 0.181046686 0.001756333 -0.977429381
## [6] 0.841226040 -0.205038775 -0.234106083 0.026919073 -1.637883752
## [11] 1.785370740 0.798137173
rnorm(12,mean=10,sd=2) # create 12 random observations from a normal distribution with mean 10 and sd 2
## [1] 12.602158 11.233674 12.676267 9.936139 11.202955 12.030426 12.059646
## [8] 10.079494 12.360848 9.077320 8.599441 10.032944
runif(12, min = 10, max = 37) # create 12 random observations from a uniform distribution.
## [1] 24.31173 25.63679 23.49106 12.28763 15.90228 14.10212 35.71000
## [8] 20.34887 33.24387 13.69444 26.13435 29.05195
```
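
Because *rnorm* and *runif* draw random numbers, the values printed above will differ each time the code is run. If the same draws are needed again, the random number generator can be seeded with *set.seed*. The sketch below uses an arbitrary seed, 123, chosen only for illustration.

```r
names(formals(rnorm))    # the three arguments of rnorm
## [1] "n" "mean" "sd"
set.seed(123)            # 123 is an arbitrary seed
first = rnorm(3)
set.seed(123)            # resetting the same seed restarts the sequence
second = rnorm(3)
identical(first, second)
## [1] TRUE
u = runif(5, min = 10, max = 37)
all(u >= 10 & u <= 37)   # the draws stay inside the requested interval
## [1] TRUE
```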

### 5.2.2 Matrices

R can create matrices and operate on them.

```
A=matrix(1:16,ncol=4,nrow=4) #create a 4 x 4 matrix
A
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
B=matrix(runif(16,min=20,max=40),ncol=4) #create a 4 x 4 matrix
# example operations
A+B # add
## [,1] [,2] [,3] [,4]
## [1,] 26.44971 35.97621 44.47561 37.47592
## [2,] 32.32376 41.15661 31.59914 48.07615
## [3,] 38.53369 31.55911 39.86506 49.87883
## [4,] 25.03194 46.45229 51.89366 38.31500
A*B # element-wise multiplication
## [,1] [,2] [,3] [,4]
## [1,] 25.44971 154.8810 319.2805 318.1870
## [2,] 60.64752 210.9397 215.9914 477.0661
## [3,] 106.60108 171.9137 317.5156 523.1824
## [4,] 84.12774 307.6183 478.7239 357.0401
A%*%B # matrix multiplication
## [,1] [,2] [,3] [,4]
## [1,] 770.2869 927.671 921.8744 798.8612
## [2,] 882.6260 1056.815 1047.7079 914.6071
## [3,] 994.9651 1185.959 1173.5413 1030.3530
## [4,] 1107.3042 1315.104 1299.3748 1146.0989
t(B) # transpose
## [,1] [,2] [,3] [,4]
## [1,] 25.44971 30.32376 35.53369 21.03194
## [2,] 30.97621 35.15661 24.55911 38.45229
## [3,] 35.47561 21.59914 28.86506 39.89366
## [4,] 24.47592 34.07615 34.87883 22.31500
```
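
Since *B* above is random, the difference between *\** and *%\*%* is easier to verify by hand on a small fixed example. The sketch below uses two hypothetical 2 x 2 matrices, *A2* and *B2*, where *B2* is twice the identity matrix.

```r
A2 = matrix(1:4, ncol = 2)            # filled column by column: rows are (1,3) and (2,4)
B2 = matrix(c(2, 0, 0, 2), ncol = 2)  # 2 times the identity matrix
A2 * B2      # element-wise product: rows are (2,0) and (0,8)
A2 %*% B2    # matrix product: every element of A2 is doubled, rows are (2,6) and (4,8)
t(A2)        # transpose: rows are (1,2) and (3,4)
dim(A2 %*% B2)
## [1] 2 2
```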

### 5.2.3 Variables

It is important to know the data before running basic or sophisticated analyses. In an R environment, a variable subject to an analysis is generally defined as nominal, ordered, continuous, missing or date variable.

#### 5.2.3.1 Nominal

In R, a nominal variable can be represented alphanumerically. However, the interpretation of a nominal variable is not numeric. It is helpful for naming a characteristic rather than quantifying it. The commands below create nominal vectors.

```
address=c("AAX","BBZ","CBT","DBA","DDC","XZT")
gender=c("M","F","F","M","F","M")
id=sample(letters,6)
treatment=rep(c("cntrl","trt"),each=3)
city=as.character(1:6)
```

#### 5.2.3.2 Ordered

An ordered variable carries more information than a nominal variable. It represents order, but the difference between values is not informative. The commands below create ordered variables. The *levels* argument of an ordered factor provides the order. If the *levels* argument is not provided, R by default sorts the unique set of given values into increasing order.

```
item1=ordered(c("poor","average","good","good","poor","poor"),
levels=c("poor","average","good"))
ses=ordered(c(1,3,2,2,1,3),levels=c("1","2","3"))
```
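
The default sorting matters for text labels, because increasing order for character values is alphabetical. The sketch below contrasts the default with explicitly supplied levels; the name *item_default* is hypothetical.

```r
# without 'levels', the order is alphabetical, which is rarely what is meant
item_default = ordered(c("poor","average","good"))
levels(item_default)
## [1] "average" "good" "poor"
# with 'levels', the intended order is respected
item1 = ordered(c("poor","average","good","good","poor","poor"),
                levels=c("poor","average","good"))
levels(item1)
## [1] "poor" "average" "good"
item1 < "good"   # ordered factors support comparisons
## [1] TRUE TRUE FALSE FALSE TRUE TRUE
```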

#### 5.2.3.3 Continuous

An interval or ratio (true-zero) variable provides more information than an ordinal or nominal variable. The difference between values is informative. The commands below create continuous variables.

```
grade=c(52,75,39,62,24,86)
score=rnorm(n=6,mean=160,sd=5)
```

#### 5.2.3.4 Date Variable

One of several ways to create a date variable is the *as.Date()* function, which tries to convert what is provided into a date. It is flexible: the *format* argument tells R how the dates are written. By default it looks for the *YYYY-MM-DD* format. Another convenient way to enter a date might be the *MM/DD/YYYY* format, which is possible with *format="%m/%d/%Y"*. Note the capital *%Y* for a four-digit year; the lowercase *%y* reads only two digits, so "01/01/2016" would be read as year 20, that is 2020. The *Sys.Date()* function gives today’s date in *YYYY-MM-DD* format. You can operate with dates; for example, the *Sys.Date( )-birthday* command below calculates the number of days between the provided birthdays and today.

```
birthday=as.Date(c("1984-06-01","1988-10-20","1990-12-01",
"1978-03-23","1974-08-22","1994-11-04"))
birthday
## [1] "1984-06-01" "1988-10-20" "1990-12-01" "1978-03-23" "1974-08-22"
## [6] "1994-11-04"
holidays=as.Date(c("01/01/2016","04/23/2016","05/19/2016","08/30/2016","09/29/2016"),
format="%m/%d/%Y")
holidays
## [1] "2016-01-01" "2016-04-23" "2016-05-19" "2016-08-30" "2016-09-29"
Sys.Date( )
## [1] "2017-04-06"
Sys.Date( )-birthday
## Time differences in days
## [1] 11997 10395 9623 14259 15568 8189
```
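
The year code in the *format* argument deserves care: *%Y* expects a four-digit year and *%y* a two-digit year. The short sketch below shows both, and also how to turn a difference between dates into a plain number; the names *d4* and *d2* are hypothetical.

```r
d4 = as.Date("04/23/2016", format = "%m/%d/%Y")  # %Y reads a four-digit year
d4
## [1] "2016-04-23"
d2 = as.Date("04/23/16", format = "%m/%d/%y")    # %y reads a two-digit year
d2
## [1] "2016-04-23"
d4 - as.Date("2016-01-01")                       # days since New Year's Day
## Time difference of 113 days
as.numeric(d4 - as.Date("2016-01-01"))           # drop the unit, keep the number
## [1] 113
```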

#### 5.2.3.5 Logical variable

A logical variable takes a value of either TRUE or FALSE. When forced to be numeric, TRUE becomes 1 and FALSE becomes 0. The command below tests whether each element of the grade variable is larger than the mean.

```
grade=c(52,75,39,62,24,86) # create grades
grade>mean(grade) # create TRUE-FALSE by testing if the grade is larger than the mean
## [1] FALSE TRUE FALSE TRUE FALSE TRUE
as.numeric(grade>mean(grade)) # force the logical variable to be 1 and 0.
## [1] 0 1 0 1 0 1
```
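
Because TRUE counts as 1 and FALSE as 0, logical vectors can be summed or averaged directly, and they can also be used to select elements. A brief sketch, where the name *above* is hypothetical:

```r
grade = c(52,75,39,62,24,86)
above = grade > mean(grade)
sum(above)      # number of grades above the mean
## [1] 3
mean(above)     # proportion of grades above the mean
## [1] 0.5
grade[above]    # keep only the elements where the test is TRUE
## [1] 75 62 86
```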

### 5.2.4 Factors

R has a data type of *factor*. It can be considered as a general frame for nominal and ordered variables.

```
course=factor(c("Cook","Plumber","Designer","Plumber","Cook","Plumber"))
ga1=factor(c(1,1,3,4,2,3),levels = 1:4,
labels=c("StronglyDisagree","Disagree","Agree","StronglyAgree"))
ga2=factor(c(1,3,4,4,2,3),ordered = TRUE)
ga3=gl(n=3,k=2,labels=c("A","B","C"),ordered=FALSE)
```

Factors are important data types, and the levels of a factor should be examined. It might be necessary to drop levels that are not used in the variable. For example, suppose the main data has a factor, let’s say *Color*, with the levels “blue”, “green” and “yellow”. If a subset chosen from the data has only “blue” and “yellow”, R will still treat it as a factor with 3 levels. This can cause problems, for example an empty “green” category in tables and plots.

The *droplevels* function drops unused levels. Examine the code below:

```
#the ga4 factor is defined with 4 levels A,B,C and D.
#BUT the data has only 1s,2s and 3s, the level D is not used.
ga4=factor(c(1,1,3,2,2,3),levels = 1:4,labels=c("A","B","C","D"))
ga4
## [1] A A C B B C
## Levels: A B C D
droplevels(ga4)
## [1] A A C B B C
## Levels: A B C
```
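
The *Color* scenario described above can be sketched in a few lines. The data values and the names *color* and *subset_color* are made up for illustration.

```r
color = factor(c("blue","green","yellow","blue"))
subset_color = color[color != "green"]   # keep only blue and yellow
table(subset_color)                      # green is still listed, with a count of 0
nlevels(subset_color)
## [1] 3
table(droplevels(subset_color))          # green is no longer listed
nlevels(droplevels(subset_color))
## [1] 2
```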

### 5.2.5 Missing Values

The data might be incomplete. R uses **NA** (not available) to represent missing values.

```
incomeSource=c("wage","wage","pension",NA,NA,"wage")
houseMember=c(3,2,3,NA,NA,4)
```
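
Missing values propagate through most calculations, so a summary of a vector containing **NA** is itself **NA** unless the missing values are removed. A short sketch using the *houseMember* vector from above:

```r
houseMember = c(3,2,3,NA,NA,4)
mean(houseMember)                 # the result is unknown when values are missing
## [1] NA
mean(houseMember, na.rm = TRUE)   # remove missing values before averaging
## [1] 3
sum(is.na(houseMember))           # number of missing values
## [1] 2
```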

NOTE: Missing data indicators might be confusing. Notice the difference between NA, 'NA', " ", -99 and "-99" in the example below.

```
temp = factor(c('wage','pension', NA, 'NA'," ",-99,"-99"))
#for a factor or a character variable NA is shown as <NA> to specify a true missing cell
# NA without < > represents a factor level
# " " also represents a factor level
#-99 and "-99" represents the same factor level
is.na(temp) # identifies only the third element as a missing value.
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#A possible solution
temp[temp=='NA' | temp==" "| temp== -99 | temp== "-99"]=NA
#check
is.na(temp)
## [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#DO NOT forget to drop levels
temp=droplevels(temp)
```
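
When there are several missing-data codes, a chain of *==* comparisons becomes long. As an alternative sketch, the codes can be collected in one vector and matched with *%in%*. The names *temp2* and *missing_codes* are hypothetical, and the list of codes is an assumption that would need to be adapted to the actual data.

```r
temp2 = factor(c('wage','pension', NA, 'NA', " ", "-99"))
missing_codes = c("NA", " ", "-99")    # assumed codes that mean "missing"
temp2[temp2 %in% missing_codes] = NA   # recode them as true missing values
temp2 = droplevels(temp2)              # drop the now unused levels
is.na(temp2)
## [1] FALSE FALSE TRUE TRUE TRUE TRUE
levels(temp2)
## [1] "pension" "wage"
```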

### 5.2.6 Data Frames

A data frame stores variables as columns and observations as rows. Since a social scientist is generally interested in the relationships between variables, the data frame is their main R structure. The command below creates a data frame using some of the variables created earlier.

```
# reminder
# id=sample(letters,6)
# treatment=rep(c("cntrl","trt"),each=3)
# gender=c("M","F","F","M","F","M")
# item1=ordered(c("poor","average","good","good","poor","poor"),
# levels=c("poor","average","good"))
# ses=ordered(c(1,3,2,2,1,3),levels=c("1","2","3"))
# grade=c(52,75,39,62,24,86)
# incomeSource=c("wage","wage","pension",NA,NA,"wage")
# birthday=as.Date(c("1984-06-01","1988-10-20","1990-12-01",
# "1978-03-23","1974-08-22","1994-11-04"))
# course=factor(c("Cook","Plumber","Designer","Plumber","Cook","Plumber"))
basic_data=data.frame(id,treatment,gender,item1,ses,
grade,incomeSource,birthday,course)
basic_data
## id treatment gender item1 ses grade incomeSource birthday course
## 1 b cntrl M poor 1 52 wage 1984-06-01 Cook
## 2 l cntrl F average 3 75 wage 1988-10-20 Plumber
## 3 m cntrl F good 2 39 pension 1990-12-01 Designer
## 4 k trt M good 2 62 <NA> 1978-03-23 Plumber
## 5 c trt F poor 1 24 <NA> 1974-08-22 Cook
## 6 n trt M poor 3 86 wage 1994-11-04 Plumber
```

Data can be entered manually into R, but this is generally not the case. When data are transferred into the R environment, a useful function to check their internal structure is *str*.

```
str(basic_data)
## 'data.frame': 6 obs. of 9 variables:
## $ id : Factor w/ 6 levels "b","c","k","l",..: 1 4 5 3 2 6
## $ treatment : Factor w/ 2 levels "cntrl","trt": 1 1 1 2 2 2
## $ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2
## $ item1 : Ord.factor w/ 3 levels "poor"<"average"<..: 1 2 3 3 1 1
## $ ses : Ord.factor w/ 3 levels "1"<"2"<"3": 1 3 2 2 1 3
## $ grade : num 52 75 39 62 24 86
## $ incomeSource: Factor w/ 2 levels "pension","wage": 2 2 1 NA NA 2
## $ birthday : Date, format: "1984-06-01" "1988-10-20" ...
## $ course : Factor w/ 3 levels "Cook","Designer",..: 1 3 2 3 1 3
```
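
In the *str* output above, character vectors such as *gender* appear as factors. That was the default behaviour of *data.frame* in the R version this book refers to; since R 4.0.0 the default keeps them as character. To get the same result in any version, the *stringsAsFactors* argument can be set explicitly, as in the sketch below. The data frame names *as_factor* and *as_text* are hypothetical.

```r
gender = c("M","F","F","M","F","M")
grade = c(52,75,39,62,24,86)
as_factor = data.frame(gender, grade, stringsAsFactors = TRUE)
as_text = data.frame(gender, grade, stringsAsFactors = FALSE)
class(as_factor$gender)   # $ extracts one column by name
## [1] "factor"
class(as_text$gender)
## [1] "character"
sapply(as_text, class)    # class of every column at once
```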

## 5.3 R Packages

R version 3.3.1 downloads to a computer with 30 packages that include thousands of functions. These packages are stored under the *system library*. Other useful functions are created by R users and made available to the R community. For example, linear mixed-effects models can be analyzed in R using the *lme4* (Bates et al. 2015) package. This package has been cited more than 1500 times and downloaded more than 60000 times. You can use Figure 3.2 to check its current usage. R packages are generally available via CRAN, and they generally remain there as long as they are properly maintained. You can download R packages to your computer and store them locally, under the *user library*. You need to load (activate) a package in each session before you can use it.

You have probably noticed that R, RStudio and R packages are interconnected. When you download RStudio after you have downloaded R, RStudio scans your computer, locates R and connects to it. Both R and RStudio can locate your libraries unless you have manually changed the file locations. If you wonder where your R packages are located, you can run the *.libPaths()* function.

R packages located on CRAN can easily be downloaded to your machine using RStudio’s **Packages** tab, or you can directly type *install.packages("packagename")*. When you open a new session, some of the main packages are loaded automatically. When a package is loaded, you can see the tick in the *Packages* tab. If the package you plan to use in a session is not loaded, you can click the box or you can directly type *library("packagename")*. You can see these steps in (Video4 5.1).
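
The same steps can be written as code. The sketch below checks where packages are stored and whether a package is installed before trying to load it. The install line is commented out so that nothing is downloaded unless you choose to; *lme4* is used only because it is the example mentioned above.

```r
lib_dirs = .libPaths()      # folders where R looks for installed packages
lib_dirs
# requireNamespace() returns TRUE or FALSE without loading the package
has_stats = requireNamespace("stats", quietly = TRUE)  # 'stats' ships with R
has_stats
## [1] TRUE
pkg = "lme4"
if (!requireNamespace(pkg, quietly = TRUE)) {
  # install.packages(pkg)   # uncomment to download the package from CRAN
  message(pkg, " is not installed yet")
} else {
  library(pkg, character.only = TRUE)  # load (activate) it for this session
}
```

The *character.only = TRUE* argument is needed here because the package name is stored in a variable rather than typed directly.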

## 5.4 The Workspace

When a session is started by opening an R script, every operation takes place in the workspace. Every step is recorded and can be seen in the *History* tab of RStudio. The workspace can be saved when closing the session. The objects created in a session are kept in the workspace. You can use the *ls()* function to see the objects in your workspace, or check the *Environment* tab of RStudio.

The objects in a workspace can easily be saved into the working directory as separate files. Likewise, files in the working directory can easily be loaded into the workspace. Here *easily* means that no path needs to be provided. If a path is provided, you can save or load objects from other directories.

You can run the *getwd()* command to see your working directory. You can change the working directory within a session using the *setwd()* function. Alternatively, you can change the directory using the *Session* menu of RStudio. Data input and output are covered more broadly in the next chapter.
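
As a small sketch of these ideas, the code below saves one object to a file, removes it from the workspace and loads it back. A temporary folder is used so that the working directory is left untouched; the object name *scores* and the file name *scores.RData* are hypothetical.

```r
scores = c(52, 75, 39)
ls()                                         # objects currently in the workspace
getwd()                                      # current working directory
path = file.path(tempdir(), "scores.RData")  # a full path, so any folder can be used
save(scores, file = path)                    # write the object to disk
rm(scores)                                   # remove it from the workspace
"scores" %in% ls()
## [1] FALSE
load(path)                                   # read it back into the workspace
scores
## [1] 52 75 39
```

If only a file name such as "scores.RData" were given instead of a full path, the file would be written to and read from the working directory.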

### References

Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” *Journal of Statistical Software* 67 (1): 1–48. doi:10.18637/jss.v067.i01.