2 Tutorial 2: Workflow in R

Tutorial 2 will not yet deal with how to write code and do your own analyses in R (don’t worry, we’ll get there soon!).

Before you write your own code, you should understand the basic workflow when working with R and RStudio - regardless of whether you want to calculate a regression model, do an automated content analysis, or visualize results of an analysis.

After working through Tutorial 2, you’ll…

  • understand the basic workflow in R.

2.1 Defining your working directory

The first step of any type of analysis is to define your working directory. You may wonder: What’s that?

Your working directory is the folder from which data can be imported into R or to which you can export and save data created with R.

Create a folder that you want to use as your working directory for this tutorial (or use an existing one, that also works). Go to that folder and copy the path to it:

Image: Working Directory

Now you know where this working directory is located - but R should know, too! Telling R from which folder to import data or where to export data to is called setting your working directory. We use a function called setwd() (you guessed right: short for “set working directory”), which allows us to do exactly that.

Important: The way this working directory is set differs between Windows and Mac operating systems.

Windows: The slashes need to point in the right direction. R expects forward slashes (“/”), so if you simply copy the path to the folder, you may need to replace each backslash “\” with “/”.

setwd("C:/Users/vhase/Documents/Text as Data Seminar")
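If you do not want to replace the backslashes by hand, R can do it for you. The following is a minimal sketch, not the only way to do this; the object names copied_path and r_path are made up for illustration:

```r
# Path as copied from the Windows file explorer. Inside an R string,
# every backslash has to be typed twice ("\\") to be read as a single "\".
copied_path <- "C:\\Users\\vhase\\Documents\\Text as Data Seminar"

# Replace each backslash with a forward slash.
# fixed = TRUE tells gsub() to treat the pattern as plain text, not as a regular expression.
r_path <- gsub("\\", "/", copied_path, fixed = TRUE)
print(r_path)
## [1] "C:/Users/vhase/Documents/Text as Data Seminar"
```

The resulting string can then be passed to setwd().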

Mac: Paths start with a “/”, so you may need to add one at the beginning like so:

setwd("/Users/vhase/Documents/Text as Data Seminar")

If you have forgotten where you set your working directory, you can also ask R about the path of your current working directory with getwd():

getwd()
## [1] "C:/Users/vhase/Documents/Text as Data Seminar"
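A common stumbling block is pointing setwd() at a folder that does not exist. The following sketch checks for the folder first and creates it if necessary. It assumes a hypothetical folder inside R's temporary directory so that it runs on any machine; in practice you would use your own path:

```r
# Hypothetical folder inside R's temporary directory, used only for this sketch.
my_folder <- file.path(tempdir(), "Text as Data Seminar")

# Create the folder if it does not exist yet.
if (!dir.exists(my_folder)) {
  dir.create(my_folder)
}

old_wd <- setwd(my_folder)  # setwd() invisibly returns the previous working directory
basename(getwd())           # name of the folder we are in now
## [1] "Text as Data Seminar"

setwd(old_wd)               # switch back to where we started
```

file.path() joins folder names with “/”, so you do not have to think about the direction of slashes.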

2.2 Packages

I’ve mentioned packages before: While base R, i.e., the standard version of R, already includes many helpful functions, you may at times need other, additional functions. For instance, in the case of automated content analysis - the method we’ll focus on in this seminar - we’ll need to use specific packages that include additional functions.

Packages are collections of topic-specific functions that extend the functions implemented in base R.

In the spirit of “open science”, anyone can write and publish these additional functions and related packages and anyone can also access the code used to do so.

You’ll find a list of all R packages here. In this seminar, we’ll for instance use packages like quanteda or stm for automated content analysis.

2.2.1 Installing packages

To use a package, you have to install it first. Let’s say you’re interested in using the package Quanteda. Using the command install.packages(), you can install the package on your computer. You’ll have to give the function the name of the package you are interested in installing.

install.packages("quanteda")

Now the package has been installed on your computer and is accessible locally. We only have to use install.packages() once for each package. Afterwards, the only thing you’ll have to do after opening R is to activate the already installed package - which we’ll learn next.
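Since installing is only needed once, it can be handy to let R check which packages are still missing before installing anything. This is a sketch under assumptions: the vector needed uses the base packages stats and utils as placeholders so that the code runs anywhere without downloading anything; in practice you would list packages such as "quanteda" instead.

```r
# Placeholder package names; replace them with the packages you actually need.
needed <- c("stats", "utils")

# requireNamespace() returns TRUE if a package is installed and FALSE otherwise.
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_packages <- needed[!installed]

# Install only what is missing (nothing happens if everything is already there).
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}
```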

2.2.2 Activating packages

Before we are able to use a package, we need to activate it in each session. Thus, you should not only define a working directory at the beginning of each session but also activate the packages you want to use via the library() command. Again, you’ll have to give R the name of the package you want to activate:

library("quanteda")

Alternatively, you can write the name of the package followed by two colons (::) directly before one of its functions. This calls the function without activating the whole package first. For instance, I do not need to activate the quanteda package using the library() command to use the function tokens() if I use the following command:

quanteda::tokens("This is an example sentence.")
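To try the two-colon notation without installing anything, here is a small sketch that uses the tools package, which ships with R (it is chosen only for illustration and is not part of this seminar's toolkit):

```r
# file_ext() belongs to the tools package.
# Because the package name is given explicitly, no library(tools) call is needed.
tools::file_ext("MyCode.R")
## [1] "R"
```

Writing the package name in front of the function also makes it obvious to readers of your code where a function comes from.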

2.2.3 Getting information about packages

The package is installed and activated - but how can we use it? To get an overview of functions included in a given package, you can consult its corresponding “reference manual” (overview document containing all of a package’s functions) or, if available, its “vignette” (tutorials on how to use selected functions for the corresponding package) provided by a package’s author on a website called “CRAN”.

The easiest way to find these manuals/vignettes is Google: Simply google CRAN Quanteda, for instance, and you’ll be guided to the following website:

Image: Cran Overview Quanteda package

The first paragraph (circled in red) gives you an overview of aspects for which this package may be useful. The second red-circled area links to the reference manual and the vignette. You can, for instance, check out the reference manual to get an idea of the many functions the quanteda package contains.

Another way of getting there is to simply use the help()-function provided by R, which we’ll get to now.
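You can also ask R directly which functions a package provides. The sketch below uses the stats package, which comes with R, as a stand-in so that it runs without installing anything; the object name stats_functions is made up for illustration:

```r
# getNamespaceExports() returns the names of everything a package makes available.
stats_functions <- sort(getNamespaceExports("stats"))
head(stats_functions)  # show the first few names

# Check whether a particular function is part of the package.
"median" %in% stats_functions
## [1] TRUE
```

For an installed package such as quanteda, replacing "stats" with "quanteda" should work the same way.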

2.3 Help?!

The one thing you can count on in this seminar is that many things will not work right away: You’ll forget commands or what to use them for, forget the names of packages you need, or be confronted with error messages that you need to understand to fix a given problem. This happens to everyone: from beginners to those who have worked with R for many years. In this case, you need: help().

2.3.1 Finding information about packages

If you’re interested in a specific package, you can also use R and the help() function (or simply use ?, which leads to the same result):

help(quanteda) #Version 1 of asking for help
?quanteda #Version 2 of asking for help

In turn, you’ll get more information via the window “Help”:

Image: Cran Overview for the Quanteda package

2.3.2 Finding information about functions

Oftentimes, you need help with a specific function.

I’ll give you an example: Let’s say I teach a seminar with 10 students. I have asked all of them about their age. I have now saved their answers (i.e., 10 different numbers) in an object called age. This object is a vector, i.e. an object that consists of several values of the same data type - we’ll get to this in Tutorial 3.

age <- c(23, 26, 19, 28, 24, 22, 21, 27, 24, 24)

Now we want R to compute the mean age of students in the seminar. To do so, we call the function mean() and specify the arguments it needs to run - here x = age, i.e., R should compute the mean of all values in the vector age:

mean(x = age)
## [1] 23.8

That looks good - R tells us that the mean age of our students is 23.8 years. Let’s say I did the same thing for a different seminar: I also asked students about their age. While most chose to answer, some refused. Thus, I recorded missing answers as NA (NA is used to record missing values, short for “not available”).

age <- c(23, 26, NA, 28, 24, 22, 21, NA, 24, NA)
mean(x = age)
## [1] NA

However, when trying to get the students’ mean age, R tells us that the mean is NA (i.e., missing). But do we really only have missing values? Let’s inspect our data again:

age
##  [1] 23 26 NA 28 24 22 21 NA 24 NA

That’s not true: 7 out of 10 students told us their age; only 3 refused to answer (here recorded as NA). So why does R tell us that the overall mean is missing - shouldn’t the function simply ignore NAs and tell us the mean age of those 7 students who answered our question?

To do some troubleshooting, we use the help() function. We specify for which function we need help:

?mean

This is where our fourth window comes into play, as the results of our search for help are shown there (the page shown is the reference manual entry for the mean() function).

Image: Help for error with mean()-function

It includes important information on the function (of which we’ll discuss only some, namely those circled in red):

  • Description: explains for which types of tasks the function mean() should be used
  • Usage: explains how the function mean() should be used
  • Arguments: explains which elements need to be or can be defined for using mean() and how these elements need to be specified
  • Examples: exemplifies how the function mean() can be used

When inspecting the section “Arguments”, we’ll soon discover something very important: mean() is a function that needs an object x for which the mean should be calculated. In this case, we specified x to consist of the vector age by typing x = age.

mean(x = age)

Upon further inspection, however, we see something else: The mean() function needs more information. In particular, we have to specify how R should deal with missing values, here NAs (see the section circled in red). This wasn’t a problem in the first example (since we had no NAs), but seems to be a problem for the second example. The manual reads as follows:

  • “na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds”.

This indicates that if our x contains any NAs, we need to tell R and the mean() function how to deal with these. We haven’t specified this yet, which is why R includes all missing values for calculation and thus tells us that - given that some values are missing - the mean is missing. If we want R to ignore all NAs, we need to actively set na.rm (short for removal of NAs) to TRUE. This tells R that the mean should be computed for all of those values for x that are not missing.
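The default value of na.rm can also be checked from the console. This sketch looks at mean.default(), the version of mean() that R uses for ordinary numeric vectors; mean() itself only lists x and ... as arguments, which is why mean.default() is inspected here:

```r
# args() prints the arguments of a function together with their default values.
args(mean.default)
## function (x, trim = 0, na.rm = FALSE, ...) 
## NULL

# The default for na.rm is FALSE: missing values are kept unless we say otherwise.
formals(mean.default)$na.rm
## [1] FALSE
```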

The following command therefore gives us the mean age of all those students who chose to answer the question:

mean(age, na.rm = TRUE)
## [1] 24
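Before removing missing values, it is often worth checking how many there are. A short sketch using the same vector as above:

```r
age <- c(23, 26, NA, 28, 24, 22, 21, NA, 24, NA)

sum(is.na(age))   # number of missing answers
## [1] 3
sum(!is.na(age))  # number of students who answered
## [1] 7

# Dropping the NAs by hand gives the same result as na.rm = TRUE.
mean(age[!is.na(age)])
## [1] 24
```

is.na() returns TRUE for every missing value, and sum() counts each TRUE as 1.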

2.3.3 Searching for help online

For some questions, using the help()-function won’t cut it. In this case, Google is your new best friend.

I have almost never encountered a problem with R that someone else had not already run into and asked about online (and, most often, had already gotten a helpful response to).

When googling, look out for the following websites that often offer help for statistical/programming issues:

2.3.3.1 Make sure to use relevant search terms

When googling, make sure to use all relevant search terms. This includes at least:

  • parts of the error message you are receiving or descriptions of the error
  • the search term “R” (there are a lot of other programming languages and you should make sure that your answers are tailored to R)
  • the function throwing the error

Let’s say you are trying to find out how to set your working directory since your R throws the following error: “cannot find directory”. Googling for help via search terms such as “directory programming define” will likely lead to insufficient results because: (a) the specific command you are having trouble with is missing, (b) the specific error message you are getting is missing, (c) the search request does not specify that you need answers for the programming language R.

A better way to go about this would be something like: “setwd() R error message cannot find directory”: (a) you are specifying the command that gives you trouble, (b) you are specifying the error message, and (c) you are specifying that you want answers for R.

2.3.3.2 Don’t trust every result you get

While most Google searches will get you a multitude of different answers for your questions, not all of them are necessarily right for your specific problem. Moreover, there may be different solutions for the same problem - so don’t be confused when people propose different approaches. Contrary to common conception, the internet is not always right - you may also get answers that are wrong or inefficient. It’s often best to scroll through some search results and then try the solution that seems most understandable and/or suitable for you.

2.3.3.3 Make your problem reproducible

It is often vital that others can reproduce your problem: To help, others need to see exactly which lines of code created an error message, what the error message looked like, which data you used, and on which type of machine/system you ran the analysis.

For instance: Nobody is likely going to be able to help you with a request like this:

"If I try to set my working directory, my computer tells me that I can't (the error says: Error: unexpected input in setwd(C:\). What is the problem?"

This isn’t great because no one knows the code that created the problem or the machine/system you used. Thus, you need to make your error replicable by giving the exact command and potentially information about your machine via sessionInfo():

"I am trying to set my working directory on a Windows System using the following code:
setwd(C:\Users\vhase\Documents\Text as Data Seminar)

While the path to the folder that I want to be my working directory is definitely correct, R gives me the following error message:
Error: unexpected input in setwd(C:\."
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## Random number generation:
##  RNG:     Mersenne-Twister 
##  Normal:  Inversion 
##  Sample:  Rounding 
##  
## locale:
## [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] caret_6.0-88              lattice_0.20-44           stm_1.3.6                 reshape2_1.4.4           
##  [5] quanteda.textstats_0.94.1 quanteda.textplots_0.94   quanteda_3.1.0            readtext_0.81            
##  [9] extrafont_0.17            bookdown_0.22             rsconnect_0.8.24          pagedown_0.15            
## [13] spacyr_1.2.1              sjPlot_2.8.10             performance_0.8.0         lmerTest_3.1-3           
## [17] lme4_1.1-27.1             Matrix_1.3-3              httr_1.4.2                purrr_0.3.4              
## [21] readr_1.4.0               lubridate_1.7.10          stringr_1.4.0             ggplot2_3.3.5            
## [25] readxl_1.3.1              dplyr_1.0.7              
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.2.1      fastmatch_1.1-3      servr_0.22           plyr_1.8.6           proxyC_0.2.2        
##   [6] splines_4.1.0        websocket_1.4.0      SnowballC_0.7.0      TH.data_1.0-10       digest_0.6.27       
##  [11] foreach_1.5.1        htmltools_0.5.1.1    fansi_0.5.0          magrittr_2.0.1       sna_2.6             
##  [16] recipes_0.1.16       modelr_0.1.8         gower_0.2.2          RcppParallel_5.1.4   matrixStats_0.59.0  
##  [21] sandwich_3.0-1       extrafontdb_1.0      askpass_1.1          colorspace_2.0-2     rappdirs_0.3.3      
##  [26] ggrepel_0.9.1        xfun_0.24            crayon_1.4.1         jsonlite_1.7.2       survival_3.2-11     
##  [31] zoo_1.8-9            iterators_1.0.13     glue_1.4.2           stopwords_2.2        gtable_0.3.0        
##  [36] ipred_0.9-11         emmeans_1.6.2-1      sjstats_0.18.1       sjmisc_2.8.9         Rttf2pt1_1.3.8      
##  [41] scales_1.1.1         mvtnorm_1.1-2        DBI_1.1.1            ggeffects_1.1.1      Rcpp_1.0.7          
##  [46] xtable_1.8-4         reticulate_1.22      proxy_0.4-26         stats4_4.1.0         lava_1.6.9          
##  [51] prodlim_2019.11.13   datawizard_0.2.1     ellipsis_0.3.2       pkgconfig_2.0.3      farver_2.1.0        
##  [56] nnet_7.3-16          sass_0.4.0           utf8_1.2.1           tidyselect_1.1.1     labeling_0.4.2      
##  [61] rlang_0.4.11         later_1.2.0          effectsize_0.5       munsell_0.5.0        cellranger_1.1.0    
##  [66] tools_4.1.0          cli_3.0.0            generics_0.1.0       statnet.common_4.5.0 sjlabelled_1.1.8    
##  [71] broom_0.7.8          evaluate_0.14        yaml_2.2.1           ModelMetrics_1.2.2.2 processx_3.5.2      
##  [76] knitr_1.33           nlme_3.1-152         slam_0.1-48          compiler_4.1.0       rstudioapi_0.13     
##  [81] curl_4.3.1           png_0.1-7            e1071_1.7-7          tibble_3.1.2         bslib_0.2.5.1       
##  [86] stringi_1.6.2        highr_0.9            ps_1.6.0             parameters_0.15.0    nloptr_1.2.2.2      
##  [91] vctrs_0.3.8          pillar_1.6.1         lifecycle_1.0.0      jquerylib_0.1.4      estimability_1.3    
##  [96] data.table_1.14.0    insight_0.14.5       httpuv_1.6.1         R6_2.5.1             promises_1.2.0.1    
## [101] network_1.17.1       codetools_0.2-18     boot_1.3-28          MASS_7.3-54          assertthat_0.2.1    
## [106] openssl_1.4.4        withr_2.4.2          multcomp_1.4-17      bayestestR_0.11.5    hms_1.1.0           
## [111] ISOcodes_2021.02.24  grid_4.1.0           rpart_4.1-15         timeDate_3043.102    tidyr_1.1.3         
## [116] coda_0.19-4          class_7.3-19         minqa_1.2.4          rmarkdown_2.9        nsyllable_1.0       
## [121] pROC_1.17.0.1        numDeriv_2016.8-1.1
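The full sessionInfo() output can get long. For many questions, a few lines of R are enough to collect the essentials: a small piece of example data, the exact error message, and your R version. The following is a sketch under assumptions; the folder name is hypothetical and is assumed not to exist on your machine:

```r
# Small example data instead of your full data set.
age <- c(23, 26, NA, 28)

# dput() prints code that recreates the object, ready to paste into a question.
dput(age)
## c(23, 26, NA, 28)

# Capture the exact error message instead of retyping it from memory.
error_text <- tryCatch(
  setwd("a/folder/that/does/not/exist"),       # hypothetical, non-existent folder
  error = function(e) conditionMessage(e)      # return the message as text
)
print(error_text)

# A single line describing your R version.
R.version.string
```

Copying the printed output into your question helps avoid typos in the error message.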

2.3.4 Interrupting R

Some commands run for a longer time - and you may realize while they are running that the code still contains an error. In this case, you may want to stop R from executing the command. If you want to do this manually, you can use the stop button in the window “Console” (only visible while R is executing code).

Image: Interrupting R

Else, you can use the menu via Session / Interrupt R.

2.4 Saving, loading & cleaning code/results

2.4.1 Saving your code

A great feature of R is that it makes analyses easily reproducible - given that you save your code. When reopening RStudio and your script, you can simply “rerun” the code with one click and your analysis will be reproduced.

To save code, you have two options:

  • Choose the menu option File/Save as. Important: Code needs to be saved with the ending “.R”.
  • Choose the Save-button in the source window and save your code in the correct format, for instance as “MyCode.R”.

Image: Saving code

2.4.2 Saving your results

You have successfully executed all commands and now want R to save your results/working environment? Saving your results is especially useful if it takes some time to have R run through the code and reproduce results - in this case, you only need to save results once and can then load them for the next session.

Again, there are several options for saving your results:

  • Use the save.image()-command:
save.image("MyData.RDATA")
  • Use the save-button in the environment window and save your results in the correct format, for instance as “MyData.RDATA”.

Image: Saving results

2.4.3 Loading working spaces

Having saved results in a previous session, you can now easily import them in a new session. Using the load()-command, you can import working spaces into a new R session. Here, we first define the working directory in which R may find our results to then import results:

setwd("C:/Users/vhase/Documents/Text as Data Seminar")
load("MyData.RDATA")
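save.image() stores everything in your environment. If you only want to keep selected objects, save() and saveRDS() are possible alternatives. The sketch below walks through one full round of saving and loading; the file names and the temporary folder are assumptions chosen so that the code runs on any machine:

```r
age <- c(23, 26, 19, 28, 24)

# Hypothetical file locations inside R's temporary directory.
rdata_file <- file.path(tempdir(), "MyData.RDATA")
rds_file   <- file.path(tempdir(), "age.rds")

save(age, file = rdata_file)   # save() stores selected objects together with their names
saveRDS(age, file = rds_file)  # saveRDS() stores a single object without its name

rm(age)                        # remove the object to mimic a fresh session

restored <- load(rdata_file)   # load() brings objects back and returns their names
restored
## [1] "age"

age_copy <- readRDS(rds_file)  # readRDS() lets you choose the object name yourself
identical(age, age_copy)
## [1] TRUE
```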

2.4.4 Clean your working space

After some time, your environment may be a bit messy: You may have defined objects you no longer need, which may lead to losing sight of the things that are important. In this case, you can easily sort through relevant data and clean your working space. For instance, the rm()-command deletes specific objects in your environment. Say we want to delete the object age since we no longer need it:

rm(age)

If you want a fresh start, you can also delete all objects in your environment. The function ls() returns the names of all objects currently in your environment; passing these names to rm() via the argument list deletes all of them:

rm(list = ls())
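There is also a middle ground between deleting one object and deleting everything. The following sketch removes only objects whose names start with a certain prefix; the object names and the prefix tmp_ are made up for illustration:

```r
tmp_counts   <- c(1, 2, 3)
tmp_labels   <- c("a", "b", "c")
final_result <- 42

# ls() with a pattern returns only the names that start with "tmp_".
to_delete <- ls(pattern = "^tmp_")
rm(list = to_delete)

exists("final_result")  # the object we wanted to keep is still there
## [1] TRUE
exists("tmp_counts")    # the temporary object is gone
## [1] FALSE
```

Naming temporary objects with a shared prefix makes this kind of cleanup easy.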

2.4.5 Take Aways

  • Working Directory: The folder from which data can be imported into R or to which you can export and save data created with R. Should be defined at the beginning of each session. Commands: setwd(), getwd()
  • Packages: Collections of topic-specific functions that extend the functions implemented in base R. You only need to install them once on your computer - but you have to activate packages at the beginning of each session. Otherwise, R will not be able to find related functions. Commands: install.packages(), library()
  • Help: The thing everyone working with R needs. It’s normal to run into errors when working with R - don’t get frustrated too easily. Commands: ?, help()
  • Saving, loading, and cleaning code/results: You should save your code/results from time to time to be able to replicate analyses. Commands: save.image(), load(), rm()