5 Basics
R can create advanced outputs. This section aims to familiarize readers with the basic principles of R before creating or reproducing advanced outputs.
5.1 Functions
A programmable calculator enables its users to write and save functions. It is useful for an R user to understand how R functions work.
5.1.1 R as a Basic Calculator
R can calculate. See the examples below followed by R syntax.
\[\begin{equation} 1+1=2 \tag{5.1} \end{equation}\]
1+1
## [1] 2
1-1
## [1] 0
1 + (2 / 3) - (2 * 6.5)
## [1] -11.33333
sin(30) + 4^3 + log(4) + exp(3) + sqrt(7)
## [1] 87.12955
When typed and run in R, Equations (5.1) through (5.4) are calculated but their results are not kept. If the outcome of an R operation will be used later, it should be assigned a name so that it can be recalled easily. Assigned objects stay in the R environment throughout the session. Assignment can be done with “=” or “<-”; a third operator, “<<-”, assigns in an enclosing environment and is normally used only inside functions. This book uses “=”. Let’s assign names to the outputs of Equations (5.1) through (5.4).
a=1 + 1
b=1 - 1
c=1 + (2 / 3) - (2 * 6.5)
d=sin(30) + 4^3 + log(4) + exp(3) + sqrt(7)
It is possible to operate with these assigned variables.
a+b+c+d
## [1] 77.79622
It is possible to overwrite.
e=3+2
e
## [1] 5
e=e+10
e
## [1] 15
It is possible to copy a value to a new name. (Note: R is case sensitive.)
Equation1_output=a
Equation1_output + b + c + d #is equal to a+b+c+d
## [1] 77.79622
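The two points above, assignment and case sensitivity, can be illustrated with a minimal sketch. The names total and Total are made up for this illustration and do not appear elsewhere in the book.

```r
# "=" and "<-" both assign a name at the top level of a session
total = 3 + 2
Total <- 10                 # a different object, because R is case sensitive

total                       # 5
Total                       # 10
identical(total, Total)     # FALSE: the two names point to different values
```

Since the two operators behave the same way here, the choice between them is largely a matter of style.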
5.1.2 R as a Programmable Calculator
A function basically has 3 parts: an input, a process and an output. Let’s use an analogy and assume that the functions below are created by a teacher to examine test scores.
5.1.2.1 Single input - Single output
A simple function named constant5 is given below. Let’s assume it adds 5 points to each score: it takes a value, adds 5 and produces an output.
constant5=function(input){
output=input+5
return(output)
}
constant5(input=50)
## [1] 55
constant5(100)
## [1] 105
constant5(120)
## [1] 125
With the above code, we use R as a programmable calculator. We define constant5 as a function that takes an input, processes it by adding 5 (input+5), creates an output (output=input+5) and reports it (return(output)). All these steps must be given inside { }.
Another simple function, systematic1, adds 1% to each score. It takes a value, adds 1% and produces an output.
systematic1=function(input){
output=input+(input/100)
return(output)
}
systematic1(input=50)
## [1] 50.5
systematic1(100)
## [1] 101
systematic1(120)
## [1] 121.2
5.1.2.2 Multiple input - Single output
The two examples above use a single value as input. Let us use two values for the nomistake function. In this example, let’s say the teacher deducts 0.2 points for each spelling mistake. For example, a grade of 90 goes down to 88.8 if there are 6 spelling errors. The nomistake function asks for a grade and the number of spelling errors to calculate the reduced grade.
nomistake=function(grade, nserror){
output=grade - (0.2 * nserror)
return(output)
}
nomistake(grade=90,nserror=6)
## [1] 88.8
nomistake(90,17)
## [1] 86.6
Inputs for an R function are generally called arguments. nomistake is programmed to receive two arguments and calculate a single output. It is also possible to create functions with multiple arguments and multiple outputs.
5.1.2.3 Multiple input - Multiple output
The feedback function asks for the number of correct responses and the points for each to calculate a total score. It also reports how many more correct responses are needed for a full score of 100.
feedback=function(correct, point){
total=correct*point
remained=(100-total)/point
output=c(paste("score:", total," missed items:",remained))
return(output)
}
feedback(correct=20,point=2)
## [1] "score: 40 missed items: 30"
feedback(27,2)
## [1] "score: 54 missed items: 23"
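The feedback function above combines its two results into one character string. When the results are needed separately, one possible alternative is to return a named list. The sketch below assumes a hypothetical variant called feedback_list; it is not part of the original example.

```r
# Hypothetical variant of feedback() that returns its results as a named list
feedback_list = function(correct, point){
  total = correct * point              # total score
  remained = (100 - total) / point     # correct responses still needed for 100
  output = list(score = total, missed = remained)
  return(output)
}

result = feedback_list(correct = 20, point = 2)
result$score     # 40
result$missed    # 30
```

Each element can then be used in later calculations without having to pull numbers out of a text string.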
5.1.2.4 Basic error
R functions need their arguments to work. See the following error, which occurs if you forget to pass the point argument to the feedback function.
feedback=function(correct, point){
total=correct*point
remained=(100-total)/point
output=c(paste("score:", total," missed items:",remained))
return(output)
}
feedback(correct=20)
## Error in feedback(correct = 20): argument "point" is missing, with no default
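The phrase "with no default" in the error message hints at one way to avoid it: an argument can be given a default value in the function definition. The sketch below assumes a hypothetical function named feedback_default, in which point defaults to 2.

```r
# Hypothetical variant: point has a default value of 2
feedback_default = function(correct, point = 2){
  total = correct * point
  return(total)
}

feedback_default(correct = 20)             # point is not given, so 2 is used: 40
feedback_default(correct = 20, point = 5)  # the default is overridden: 100
```

A default is only reasonable when one value is clearly the usual choice; otherwise the error is a useful reminder.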
5.1.2.5 Basic warning
R functions can produce warnings. Let us create nomistake2, which calculates the remaining score after deductions.
nomistake2=function(grade, nserror){
output=grade - (0.2 * nserror)
return(output)
}
nomistake2(grade=50,nserror=10)
## [1] 48
We can produce a warning if the final score is lower than 0.
nomistake2=function(grade, nserror){
output=grade - (0.2 * nserror)
if (output<0)
warning("Final score is lower than 0")
return(output)
}
nomistake2(grade=10,nserror=60)
## Warning in nomistake2(grade = 10, nserror = 60): Final score is lower than
## 0
## [1] -2
5.1.2.6 Basic failure
A function can stop. Let us create nomistake3, which calculates the final score. However, this time it stops if the grade is lower than 20 to avoid further deductions.
nomistake3=function(grade, nserror){
if ((grade)<(20))
stop("Score is already low")
output=grade - (0.2 * nserror)
return(output)
}
nomistake3(10,9)
## Error in nomistake3(10, 9): Score is already low
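An error raised by stop() ends the calculation. If the rest of a script should keep running, the call can be wrapped in tryCatch(), which is part of base R. The sketch below repeats nomistake3 so that it is self-contained; the wrapper name safe_score is an assumption made for this illustration.

```r
nomistake3 = function(grade, nserror){
  if (grade < 20)
    stop("Score is already low")
  output = grade - (0.2 * nserror)
  return(output)
}

# Hypothetical wrapper: report the error message and return NA instead of stopping
safe_score = function(grade, nserror){
  tryCatch(
    nomistake3(grade, nserror),
    error = function(e){
      message("Skipped: ", conditionMessage(e))
      return(NA)
    }
  )
}

safe_score(90, 6)   # 88.8
safe_score(10, 9)   # NA, after the message "Skipped: Score is already low"
```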
5.1.3 Help!
Although applied R users do not need to write new functions, they should know the principles of how R functions work. Whenever an R function throws a warning or an error, the cause is generally the user (or the data) rather than the function itself.
R basically runs on functions. Researchers write functions, place them in R packages and make them available. There are currently 10000+ R packages available via the Comprehensive R Archive Network (CRAN). R version 3.3.1 downloads to your computer with 30 packages that include thousands of functions.
One of the main packages that has been downloaded to your computer is called base, and it has 1200+ functions. For example, this package has the mean function to calculate the arithmetic mean. Packages and functions are generally well documented. Users should make effective use of the documentation via the help function, ? or ??. The example function may also be helpful.
help("base") # see description, you can click on index at the bottom to see 1200+ functions
help(mean) # see the mean function and its arguments
?mean # see the mean function and its arguments
??mean # search all installed help pages for the word "mean"
example(mean) # see an example
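Besides the help pages, the arguments of a function can be inspected directly from the console with args() and formals(), both from base R. The short sketch below uses the sd function as an example; the name arg_names is made up for this illustration.

```r
# args() prints the argument list of a function
args(mean)                       # function (x, ...)
args(sd)                         # function (x, na.rm = FALSE)

# formals() returns the arguments as a list, so their names can be extracted
arg_names = names(formals(sd))
arg_names                        # "x" "na.rm"
```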
5.2 R Data Types
Vectors, matrices, variable types, factors, missing values and data frames are briefly introduced.
5.2.1 Vectors
R can create vectors using the c() function. Let’s create grades for 10 students.
grades=c(40,50,53,65,72,77,79,81,86,90)
grades
## [1] 40 50 53 65 72 77 79 81 86 90
R can operate with vectors.
grades=c(40,50,53,65,72,77,79,81,86,90)
grades+10
## [1] 50 60 63 75 82 87 89 91 96 100
grades+(grades*0.10)
## [1] 44.0 55.0 58.3 71.5 79.2 84.7 86.9 89.1 94.6 99.0
grades*grades
## [1] 1600 2500 2809 4225 5184 5929 6241 6561 7396 8100
grades2=c(30,40,46,58,64,66,69,72,74,81)
(grades+grades2)/2
## [1] 35.0 45.0 49.5 61.5 68.0 71.5 74.0 76.5 80.0 85.5
grades*0.4 + grades2*0.6
## [1] 34.0 44.0 48.8 60.8 67.2 70.4 73.0 75.6 78.8 84.6
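Because arithmetic in R works element by element, a function written for a single score, such as constant5 from Section 5.1.2.1, can usually be applied to a whole vector without any change. The sketch below repeats the function definition so that it runs on its own.

```r
constant5 = function(input){
  output = input + 5
  return(output)
}

grades = c(40,50,53,65,72,77,79,81,86,90)
constant5(grades)    # 45 55 58 70 77 82 84 86 91 95
length(grades)       # 10, the number of elements
mean(grades)         # 69.3
```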
There are useful functions to create vectors. For example, the rep function (see example(rep)) repeats values.
The rnorm function creates random values. If you run ?rnorm you will see that it has three arguments, rnorm(n, mean = 0, sd = 1). The number of observations (n) must be provided. By default the mean is 0 and the standard deviation is 1. You can change the defaults, for example by running rnorm(12,mean=10,sd=2) to create 12 observations from a normal distribution with mean 10 and standard deviation 2. A similar function is runif(n, min = 0, max = 1), which generates n observations from a uniform distribution on the interval from min=0 to max=1. You can change the interval, for example by running runif(12, min = 10, max = 37).
a=1:12 # a is a regular sequence from 1 to 12 created with ':'
rep(0,12) # repeat zero 12 times
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
rep(1:5,each=3) # repeat 1 to 5 each 3 times
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
rep(1:5,times=3) # repeat 1 to 5 , 3 times
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
seq(from=1,to=12) # create 1 to 12 sequence
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
seq(1,25,by=2) # create 1 to 25 by 2
## [1] 1 3 5 7 9 11 13 15 17 19 21 23 25
seq(1,6,by=0.5) # create 1 to 6 by 0.5
## [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
rnorm(12) # create 12 random observations from ~N(0,1)
## [1] 0.327915830 0.145675615 0.181046686 0.001756333 -0.977429381
## [6] 0.841226040 -0.205038775 -0.234106083 0.026919073 -1.637883752
## [11] 1.785370740 0.798137173
rnorm(12,mean=10,sd=2) # create 12 random observations from a normal distribution with mean 10 and sd 2
## [1] 12.602158 11.233674 12.676267 9.936139 11.202955 12.030426 12.059646
## [8] 10.079494 12.360848 9.077320 8.599441 10.032944
runif(12, min = 10, max = 37) # create 12 random observations from a uniform distribution.
## [1] 24.31173 25.63679 23.49106 12.28763 15.90228 14.10212 35.71000
## [8] 20.34887 33.24387 13.69444 26.13435 29.05195
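The random values shown above will differ every time the code is run. When a result needs to be reproduced exactly, the random number generator can be given a starting point with set.seed(). The seed value 123 below is arbitrary, and the names first and second are made up for this sketch.

```r
set.seed(123)            # fix the starting point of the random number generator
first = rnorm(3)

set.seed(123)            # reset to the same starting point
second = rnorm(3)

identical(first, second) # TRUE: the same seed gives the same values
```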
5.2.2 Matrices
R can create matrices and operate on them.
A=matrix(1:16,ncol=4,nrow=4) #create a 4 x 4 matrix
A
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
B=matrix(runif(16,min=20,max=40),ncol=4) #create a 4 x 4 matrix of random uniform values
# example operations
A+B # add
## [,1] [,2] [,3] [,4]
## [1,] 26.44971 35.97621 44.47561 37.47592
## [2,] 32.32376 41.15661 31.59914 48.07615
## [3,] 38.53369 31.55911 39.86506 49.87883
## [4,] 25.03194 46.45229 51.89366 38.31500
A*B # element-wise multiplication
## [,1] [,2] [,3] [,4]
## [1,] 25.44971 154.8810 319.2805 318.1870
## [2,] 60.64752 210.9397 215.9914 477.0661
## [3,] 106.60108 171.9137 317.5156 523.1824
## [4,] 84.12774 307.6183 478.7239 357.0401
A%*%B # matrix multiplication
## [,1] [,2] [,3] [,4]
## [1,] 770.2869 927.671 921.8744 798.8612
## [2,] 882.6260 1056.815 1047.7079 914.6071
## [3,] 994.9651 1185.959 1173.5413 1030.3530
## [4,] 1107.3042 1315.104 1299.3748 1146.0989
t(B) # transpose
## [,1] [,2] [,3] [,4]
## [1,] 25.44971 30.32376 35.53369 21.03194
## [2,] 30.97621 35.15661 24.55911 38.45229
## [3,] 35.47561 21.59914 28.86506 39.89366
## [4,] 24.47592 34.07615 34.87883 22.31500
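The difference between * and %*% is easier to verify by hand on a small matrix without random values. The sketch below uses a 2 x 2 matrix; the names M and I2 are made up for this illustration.

```r
M = matrix(1:4, ncol = 2)   # filled column by column: first column 1,2 and second column 3,4
I2 = diag(2)                # 2 x 2 identity matrix

dim(M)                      # 2 2
M * M                       # element-wise: each entry is squared (1, 4, 9, 16)
M %*% I2                    # matrix product with the identity returns M
M %*% M                     # matrix product: rows (7, 15) and (10, 22)

matrix(1:4, ncol = 2, byrow = TRUE)  # fill row by row instead
```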
5.2.3 Variables
It is important to know the data before running basic or sophisticated analyses. In an R environment, a variable subject to analysis is generally defined as a nominal, ordered, continuous, missing or date variable.
5.2.3.1 Nominal
In R, a nominal variable can be represented alphanumerically. However, the interpretation of a nominal variable is not numeric. It names a characteristic rather than quantifying it. The commands below create nominal vectors.
address=c("AAX","BBZ","CBT","DBA","DDC","XZT")
gender=c("M","F","F","M","F","M")
id=sample(letters,6)
treatment=rep(c("cntrl","trt"),each=3)
city=as.character(1:6)
5.2.3.2 Ordered
An ordered variable includes more information than a nominal variable. It represents order, but the difference between values is not informative. The commands below create ordered variables. The levels argument of an ordered factor provides the order. If the levels argument is not provided, R by default sorts the unique values into increasing order.
item1=ordered(c("poor","average","good","good","poor","poor"),
levels=c("poor","average","good"))
ses=ordered(c(1,3,2,2,1,3),levels=c("1","2","3"))
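Because the order is stored with the variable, ordered factors can be compared with operators such as <. The sketch below repeats item1 and also shows what happens when the levels argument is left out; the name item_default is made up for this illustration.

```r
item1 = ordered(c("poor","average","good","good","poor","poor"),
                levels = c("poor","average","good"))

item1 < "good"         # TRUE TRUE FALSE FALSE TRUE TRUE
max(item1)             # good

# Without the levels argument the unique values are sorted, here alphabetically,
# which is rarely the intended order for rating categories
item_default = ordered(c("poor","average","good"))
levels(item_default)   # "average" "good" "poor"
```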
5.2.3.3 Continuous
An interval or ratio (true-zero) variable provides more information than ordinal and nominal variables. The difference between values is informative. The commands below create continuous variables.
grade=c(52,75,39,62,24,86)
score=rnorm(n=6,mean=160,sd=5)
5.2.3.4 Date Variable
One of several ways to create a date variable is the as.Date() function. It tries to convert what is provided into a date. This is a flexible function, and you can use the format argument to describe how the date is entered. By default it looks for the YYYY-MM-DD format. Another convenient way to input a date might be the MM/DD/YYYY format, which is possible with format="%m/%d/%Y" (an uppercase %Y reads a four-digit year, while a lowercase %y reads only two digits). The Sys.Date() function gives today’s date in YYYY-MM-DD format. You can operate with dates; for example, the Sys.Date( )-birthday command below calculates the number of days between the provided birthdays and today.
birthday=as.Date(c("1984-06-01","1988-10-20","1990-12-01",
"1978-03-23","1974-08-22","1994-11-04"))
birthday
## [1] "1984-06-01" "1988-10-20" "1990-12-01" "1978-03-23" "1974-08-22"
## [6] "1994-11-04"
holidays=as.Date(c("01/01/2016","04/23/2016","05/19/2016","08/30/2016","09/29/2016"),
format="%m/%d/%Y")
holidays
## [1] "2016-01-01" "2016-04-23" "2016-05-19" "2016-08-30" "2016-09-29"
Sys.Date( )
## [1] "2017-04-06"
Sys.Date( )-birthday
## Time differences in days
## [1] 11997 10395 9623 14259 15568 8189
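Results that involve Sys.Date() change from day to day. The sketch below uses two fixed dates instead, so the outcome can be checked by hand; the names holiday, new_year and gap are made up for this illustration.

```r
holiday = as.Date("04/23/2016", format = "%m/%d/%Y")   # %Y reads a four-digit year
new_year = as.Date("2016-01-01")                       # default YYYY-MM-DD format

gap = holiday - new_year
gap                           # Time difference of 113 days
as.numeric(gap)               # 113, as a plain number

format(holiday, "%d.%m.%Y")   # "23.04.2016", the same date printed in another layout
```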
5.2.3.5 Logical variable
A logical variable takes a value of either TRUE or FALSE. When forced to be numeric, a logical variable takes the values 1 and 0. The command below tests whether each element of the grade variable is larger than the mean.
grade=c(52,75,39,62,24,86) # create grades
grade>mean(grade) # create TRUE-FALSE by testing if the grade is larger than the mean
## [1] FALSE TRUE FALSE TRUE FALSE TRUE
as.numeric(grade>mean(grade)) # force the logical variable to be 1 and 0.
## [1] 0 1 0 1 0 1
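Since TRUE counts as 1 and FALSE as 0, logical vectors are handy for counting and for selecting elements. In the sketch below the name above is made up for this illustration.

```r
grade = c(52,75,39,62,24,86)
above = grade > mean(grade)   # TRUE where the grade exceeds the mean

sum(above)                    # 3, the number of grades above the mean
mean(above)                   # 0.5, the proportion of grades above the mean
grade[above]                  # 75 62 86, only the grades above the mean
```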
5.2.4 Factors
R has a data type of factor. It can be considered as a general frame for nominal and ordered variables.
course=factor(c("Cook","Plumber","Designer","Plumber","Cook","Plumber"))
ga1=factor(c(1,1,3,4,2,3),levels = 1:4,
labels=c("StronglyDisagree","Disagree","Agree","StronglyAgree"))
ga2=factor(c(1,3,4,4,2,3),ordered = TRUE)
ga3=gl(n=3,k=2,labels=c("A","B","C"),ordered=FALSE)
Factors are important data types, and the levels of a factor should be examined. It might be necessary to drop levels that are not used in the variable. For example, suppose the main data has a factor, let’s say Color, with the levels “blue”, “green” and “yellow”. If a subset chosen from the data has only “blue” and “yellow”, R will still treat it as a factor with 3 levels. This can cause problems, for example empty categories in tables.
The droplevels function drops unused levels. Examine the code below.
#the ga4 factor is defined with 4 levels A,B,C and D.
#BUT the data has only 1s,2s and 3s, the level D is not used.
ga4=factor(c(1,1,3,2,2,3),levels = 1:4,labels=c("A","B","C","D"))
ga4
## [1] A A C B B C
## Levels: A B C D
droplevels(ga4)
## [1] A A C B B C
## Levels: A B C
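The Color scenario described above can be reproduced in a few lines. The values and the names color and color_subset below are made up for this illustration.

```r
color = factor(c("blue","green","yellow","blue","yellow"))

# Choose a subset that contains no "green"
color_subset = color[color != "green"]

levels(color_subset)               # still "blue" "green" "yellow"
table(color_subset)                # "green" is listed with a count of 0
levels(droplevels(color_subset))   # "blue" "yellow"
```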
5.2.5 Missing Values
The data might be incomplete. R uses NA (not available) to represent missing values.
incomeSource=c("wage","wage","pension",NA,NA,"wage")
houseMember=c(3,2,3,NA,NA,4)
NOTE: Missing data indicators might be confusing. Notice the difference between NA, 'NA', " ", -99 and "-99" in the code below.
temp = factor(c('wage','pension', NA, 'NA'," ",-99,"-99"))
#for a factor or a character variable, a true missing cell is shown as <NA>
#'NA' (shown as NA without < >) is a factor level
#" " is also a factor level
#-99 and "-99" represent the same factor level
is.na(temp) # identifies only the third element as a missing value.
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#A possible solution
temp[temp=='NA' | temp==" "| temp== -99 | temp== "-99"]=NA
#check
is.na(temp)
## [1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE
#DO NOT forget to drop levels
temp=droplevels(temp)
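Missing values also affect calculations: most summary functions return NA when any element is missing, unless they are told to remove missing values first. The sketch below uses the houseMember vector created above; the name complete_members is made up for this illustration.

```r
houseMember = c(3,2,3,NA,NA,4)

mean(houseMember)                 # NA, because two values are missing
mean(houseMember, na.rm = TRUE)   # 3, the mean of the four observed values
sum(is.na(houseMember))           # 2, the number of missing values

# Keep only the observed values
complete_members = houseMember[!is.na(houseMember)]
complete_members                  # 3 2 3 4
```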
5.2.6 Data Frames
A data frame holds variables as its columns. Because a social scientist is generally interested in the relationships between variables, the data frame is their main R structure. The command below creates a data frame using some of the variables created earlier.
# reminder
# id=sample(letters,6)
# treatment=rep(c("cntrl","trt"),each=3)
# gender=c("M","F","F","M","F","M")
# item1=ordered(c("poor","average","good","good","poor","poor"),
# levels=c("poor","average","good"))
# ses=ordered(c(1,3,2,2,1,3),levels=c("1","2","3"))
# grade=c(52,75,39,62,24,86)
# incomeSource=c("wage","wage","pension",NA,NA,"wage")
# birthday=as.Date(c("1984-06-01","1988-10-20","1990-12-01",
# "1978-03-23","1974-08-22","1994-11-04"))
# course=factor(c("Cook","Plumber","Designer","Plumber","Cook","Plumber"))
basic_data=data.frame(id,treatment,gender,item1,ses,
grade,incomeSource,birthday,course)
basic_data
## id treatment gender item1 ses grade incomeSource birthday course
## 1 b cntrl M poor 1 52 wage 1984-06-01 Cook
## 2 l cntrl F average 3 75 wage 1988-10-20 Plumber
## 3 m cntrl F good 2 39 pension 1990-12-01 Designer
## 4 k trt M good 2 62 <NA> 1978-03-23 Plumber
## 5 c trt F poor 1 24 <NA> 1974-08-22 Cook
## 6 n trt M poor 3 86 wage 1994-11-04 Plumber
Data can be entered manually into R, but this is generally not the case. After data are transferred into the R environment, the str function is useful for checking their internal structure.
str(basic_data)
## 'data.frame': 6 obs. of 9 variables:
## $ id : Factor w/ 6 levels "b","c","k","l",..: 1 4 5 3 2 6
## $ treatment : Factor w/ 2 levels "cntrl","trt": 1 1 1 2 2 2
## $ gender : Factor w/ 2 levels "F","M": 2 1 1 2 1 2
## $ item1 : Ord.factor w/ 3 levels "poor"<"average"<..: 1 2 3 3 1 1
## $ ses : Ord.factor w/ 3 levels "1"<"2"<"3": 1 3 2 2 1 3
## $ grade : num 52 75 39 62 24 86
## $ incomeSource: Factor w/ 2 levels "pension","wage": 2 2 1 NA NA 2
## $ birthday : Date, format: "1984-06-01" "1988-10-20" ...
## $ course : Factor w/ 3 levels "Cook","Designer",..: 1 3 2 3 1 3
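In the str output above, character vectors such as id and gender appear as factors. That was the default behavior of data.frame() in the R version used for this book; from R 4.0.0 onward character columns stay as characters unless stringsAsFactors = TRUE is requested. The sketch below sets the argument explicitly and shows a few common ways to access a data frame. The name small_data and its values are made up for this illustration.

```r
small_data = data.frame(id = c("a","b","c"),
                        treatment = c("cntrl","trt","trt"),
                        grade = c(52, 75, 39),
                        stringsAsFactors = TRUE)  # convert character columns to factors

small_data$grade                       # one column as a vector: 52 75 39
small_data[small_data$grade > 50, ]    # rows with a grade above 50
nrow(small_data)                       # 3 rows
ncol(small_data)                       # 3 columns
class(small_data$treatment)            # "factor"
```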
5.3 R Packages
R version 3.3.1 downloads to a computer with 30 packages that include thousands of functions. These packages are stored in the system library. Other useful functions are created by R users and made available to the R community. For example, linear mixed-effects models can be analyzed in R using the lme4 package (Bates et al. 2015). This package has been cited more than 1500 times and downloaded more than 60000 times. You can use Figure 3.2 to check its current usage. R packages are generally available via CRAN, and they remain there without being archived as long as they are properly maintained. You can download R packages to your computer and store them locally, in the user library. You need to load (activate) a package in each session before you can use it.
You have probably noticed that R, RStudio and R packages are interconnected. When you download RStudio after you have downloaded R, RStudio scans your computer, locates R and connects to it. Both R and RStudio can locate your libraries unless you have manually changed the file locations. If you wonder where your R packages are located, you can run the .libPaths() function.
R packages located on CRAN can easily be downloaded to your machine using RStudio’s Packages tab, or you can directly type install.packages("packagename"). When you open a new session, some of the main packages are loaded automatically. When a package is loaded, you can see a tick next to it in the Packages tab. If the package you plan to use in a session is not loaded, you can tick its box or directly type library("packagename"). You can see these steps in (Video 5.1).
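The install-then-load steps can also be written as a short script. The sketch below is only one possible pattern: it uses the stats package as a stand-in because it ships with R, so nothing is actually downloaded. The names pkg and is_installed are made up for this illustration.

```r
pkg = "stats"   # stand-in package name; replace it with the package you need

# requireNamespace() returns TRUE if the package is already installed
is_installed = requireNamespace(pkg, quietly = TRUE)

# Install from CRAN only when the package is missing
if (!is_installed) {
  install.packages(pkg)
}

# Load the package for the current session
library(pkg, character.only = TRUE)

.libPaths()     # folders where R looks for installed packages
```

Installing is needed once per computer, while loading with library() is needed in every new session.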
5.4 The Workspace
When a session is started by opening an R script, every operation takes place in the workspace. Every executed command is recorded and can be seen in the History tab of RStudio. The workspace can be saved when closing the session. The objects created in a session are kept in the workspace. You can use the ls() function to see the objects in your workspace, or check the Environment tab of RStudio.
The objects in a workspace can easily be saved into the working directory as separate outputs, and objects in the working directory can easily be loaded into the workspace. Here, easily means that no path needs to be provided. If a path is provided, you can save objects to or load objects from other directories.
You can run the getwd() command to see your working directory. You can change the working directory within a session using the setwd() function. Alternatively, you can change the directory using the Session menu of RStudio. Data input and output are covered more broadly in the next chapter.
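The save and load steps described above can be sketched with the base R functions save() and load(). To avoid writing into your working directory, the sketch stores the file in a temporary folder given by tempdir(); the object name scores and the file name scores.RData are made up for this illustration.

```r
scores = c(40, 50, 53)          # an object in the workspace
ls()                            # lists the objects in the workspace

# Save the object to a file; a full path is given, so the working directory is not used
path = file.path(tempdir(), "scores.RData")
save(scores, file = path)

rm(scores)                      # remove the object from the workspace
load(path)                      # bring it back from the file
scores                          # 40 50 53

getwd()                         # the current working directory
```

If only a file name such as "scores.RData" were given instead of a full path, the file would be written to and read from the working directory.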
References
Bates, Douglas, Martin Mächler, Ben Bolker, and Steve Walker. 2015. “Fitting Linear Mixed-Effects Models Using lme4.” Journal of Statistical Software 67 (1): 1–48. doi:10.18637/jss.v067.i01.