The purpose of computing is insight, not numbers.
Richard W. Hamming (1962)
When first starting to learn a programming language, it may seem that computing only creates more numbers out of those that we have collected or obtained as data. However, the real purpose of computing and data analysis lies not in generating new numbers, but in grasping and understanding the phenomena that are represented by these numbers. Thus, even when our initial attention may be captured by abstract objects and functions, we must not forget that computing is not an end in itself, but a means towards achieving a better understanding of some domain. If we succeed, the end result of computing is not numbers, but obtaining answers to questions and gaining insights into topics that we care about. Our vehicle and catalyst for promoting these insights is R.
R is a programming language, but also a software environment (R Core Team, 2021). To start learning R, we first need to install some software programs on our computer. The names and purposes of these programs may initially seem a bit confusing, but we must not allow this to discourage us. Once this infrastructure is in place, we will simply open a program and start using R to write analysis scripts, text documents, and visualizations.
After working through this chapter, you should be able to:
- explain why R is and is not like a Swiss knife;
- categorize R objects into data vs. functions;
- distinguish between different shapes (e.g., scalars, vectors, rectangles) and types (e.g., numeric, character, logical) of data;
- create and change R objects (by assignment);
- apply arithmetic functions to numeric data objects;
- create and modify vectors and rectangular tables of data;
- select elements from vectors and rectangular tables of data (by indexing);
- recognize some more advanced issues (e.g., factors, lists, random sampling, conditionals, and loops).
This chapter assumes the following:
Software: You have installed the software prerequisites specified in the Introduction.
Packages are the way in which R developers share code with a wider community of users. They typically assume some common environment (e.g., a recent version of R, or other packages) and provide more specialized functions that are only of interest to a subset of all R users.
Once R and RStudio are installed and running, additional R packages can be installed by evaluating the function install.packages() in the Console of your R interface, with the name of the desired package enclosed in quotation marks:[10]
# Installing packages:
install.packages("tidyverse")
install.packages("ds4psy")
Packages only need to be installed once (unless you want to install an updated version), but need to be loaded every time they are used. Loading the two packages just installed can be achieved by the following commands:
# Loading packages:
library("tidyverse")
library("ds4psy")
The purpose and terminology of R packages are explained in more detail in Section 1.1.3 below.
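Because installing happens once but loading happens in every session, a script can first check whether a package is available. The following is a minimal sketch, not part of the book's own code: the helper name is_installed() is a hypothetical choice, and it only wraps the base R function requireNamespace().

```r
# Hypothetical helper (for illustration only):
# TRUE if the namespace of package `pkg` can be loaded, i.e., if the package is installed.
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

# stats is a core package that comes with every R installation:
is_installed("stats")

# Install only when missing, then load
# (commented out, as installing requires a network connection):
# if (!is_installed("ds4psy")) install.packages("ds4psy")
# library("ds4psy")
```

Note that requireNamespace() only checks whether a package can be loaded. It does not attach the package, so library() is still needed before its functions can be used without a prefix.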
Note that other introductions to R may require slightly different packages. For instance, see the software requirements of Chapter 1.4 of r4ds and tidyverse.org for current information on the tidyverse packages.
Readings: This book assumes a running version of R and RStudio and familiarity with the basic features and layout of RStudio, as described in the introductory chapters of the r4ds book (Wickham & Grolemund, 2017).
This implies that you understand and can do the following:
- Enter and run R commands at the prompt in the Console window of RStudio, and check their results;
- Use R as a calculator for arithmetic expressions;
- Assign numeric values and characters to named objects;
- Call simple R functions on objects;
- Enter and run R scripts in the Editor window of RStudio;
- Collect and store all course-related files in a dedicated directory and corresponding RStudio project.
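As a quick self-check of these prerequisites, the following sketch uses R as a calculator, assigns values to named objects, and calls simple functions on them. The object names n_items and course are made up for this illustration.

```r
# Use R as a calculator:
(1 + 2) * 3

# Assign a numeric value and a character string to named objects
# (hypothetical names, chosen for this example):
n_items <- 12
course  <- "ds4psy"

# Call simple functions on these objects:
sqrt(n_items + 4)   # square root of 16
nchar(course)       # number of characters in "ds4psy"
toupper(course)     # the same text in upper case
```

If each of these lines can be entered at the Console prompt and its result makes sense, the basic workflow assumed here is in place.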
A difficulty when first learning a programming language is that it forces us to simultaneously tackle two problems:
On the one hand, we need to grasp and understand the abstract concepts and tools that the language provides for solving tasks that we care about. For instance, if we have data from an online survey, we want to read in this data, screen it, combine and recode some variables, run some tests, and communicate our results via tables and visualizations. Irrespective of the tool or language used, achieving this task requires making a range of different choices regarding appropriate data structures and functions.
On the other hand, the technical implementation of each language requires an infrastructure that creates additional challenges. In R, the technical overhead imposed by its organization into systems and environments is considerable. Although the software products and their overall organization may be amazing and wonderful, the concepts and dependencies required to navigate the technology are a constant source of confusion for novices.
Beyond a confusion about software products, both the programming language R and its infrastructure provide many conceptual challenges. We can distinguish between two types of concepts:
Computer science concepts
Key concepts in any programming language are the distinction between data and functions, and between different data types, shapes, and data structures. As these terms are extensively discussed in Section 1.2, we only mention here that functions are operators that read and transform data.
Data objects can be characterized further in terms of their type and shape. Data structures are constructs for storing specific combinations of data shapes and types. For instance, most data files that we work with in this book will assume the shape of rectangular tables and contain both numeric and text data. Data structures that allow storing such data in R are so-called data frames and lists.
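To make the distinction between shapes and types more concrete, here is a small sketch in base R. All names and values are invented for this illustration, and the argument stringsAsFactors = FALSE is set explicitly so that text is stored as character data in older R versions as well.

```r
# Shapes: a scalar, a vector, and a rectangular table (all values are made up):
age    <- 21
ages   <- c(21, 34, 19)
people <- data.frame(name  = c("Ann", "Ben", "Cat"),
                     age   = ages,
                     adult = ages >= 18,
                     stringsAsFactors = FALSE)

# Types of data:
typeof(ages)           # numeric data (stored as "double")
typeof(people$name)    # text data
typeof(people$adult)   # logical data

# Shape of the table (number of rows and columns):
dim(people)

# A list can store elements of different types and lengths:
misc <- list(id = 1L, tags = c("a", "b"), ok = TRUE)
length(misc)
```

The data frame combines a rectangular shape with columns of different types, whereas the list imposes no common shape on its elements.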
Using particular data structures when programming or analyzing data is like adhering to the rules of grammar or spelling in a spoken or written conversation. As long as we can fluently speak, write, or read, we typically focus on the content and do not care about these rules. In fact, we may not even be aware of their existence, let alone their names and properties. But whenever an error occurs, detecting and fixing the problem in a competent manner requires a surprising amount of knowledge and experience regarding the underlying rules and practices. Hence, it is useful to acquire some abstract and conceptual background knowledge, even when we hope to ignore it most of the time.
Working through this book requires several software products (listed in the introductory chapter and above) that provide an R infrastructure. To understand the need for installing and loading these components, the following distinctions are important:
R core vs. contributed packages: All R code installed on your computer is stored in modules known as R packages. A package is a collection of code that serves some purpose and is developed, shared, and maintained by one or more programmers. A working installation of R can be thought of as your R engine and consists of many different packages (aka libraries). About 30 of them belong to a set of core packages that are maintained by established R developers and provide the essential functionality that automatically comes with every R installation. We often refer to these packages as “base R,” even though the package base is only one of these core packages. By contrast, the Comprehensive R Archive Network (aka CRAN) is an online catalogue and global distribution platform for over 19,000 additional packages, which can be thought of as more specialized tools that are collected in a distributed archive. And as the official guidelines for writing R extensions can be intimidating, many R authors choose not to submit their packages to CRAN and instead provide them as archives in other places. Thus, the actual number of R packages is much larger than the number listed on CRAN.
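The distinction between core and contributed packages can be inspected on your own system. The following sketch relies on the base R function installed.packages() and its priority argument. It selects only packages with the priority "base", which is a narrower set than the roughly 30 core packages mentioned above, and the object name core_pkgs is a made-up choice.

```r
# Names of the installed packages with priority "base":
core_pkgs <- rownames(installed.packages(priority = "base"))
core_pkgs

# The package base is only one of them:
"base" %in% core_pkgs

# Number of all packages currently installed (varies between systems):
nrow(installed.packages())
```

The difference between the two counts corresponds to the recommended and contributed packages that have been added to the core of your installation.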
The hierarchy of R packages implies different levels of generality and quality: Whereas the set of core packages is written and checked by experts on the R development core team, the vast majority of existing packages have been contributed by committed R developers and users. Consequently, R is like a Swiss knife insofar as it consists of a set of basic tools, plus thousands of more specialized tools that can be added when you happen to work on a corresponding task. But beware: Just as you do not trust every article on someone’s website, you should not blindly trust every R package. In this respect, R is quite similar to Wikipedia: Both are results of the collaborative effort of many volunteers that is administered by a team of highly dedicated experts. And although both products come for free, without any a priori guarantees, and could potentially be abused and undermined by evil interests, there are mechanisms to recognize and reward quality over time.
Installing vs. loading R packages: When starting R on your computer, a small set of core packages (typically including base, datasets, and graphics, but also methods and stats) is loaded by default.[11] However, the majority of R packages that we will use need to be installed additionally on your computer (once, typically via the install.packages() command) and loaded (every time) before they can be used. The need to load packages before using them is the reason why many R programs begin with one or more library(pkg) commands to load a package named pkg.[12] More specifically, when some package named pkg defines a command fancy_fun(), we can only use fancy_fun() in our code after installing and loading pkg. Alternatively, we can install a package (e.g., named pkg) on our system and then use the command pkg::fancy_fun(). The prefix pkg:: instructs R to look for the fancy_fun() command of the package pkg.[13] Again, the Swiss knife analogy is helpful: In order to use some specific tool, we first need to have it available (or installed) on our knife. But to actually use a tool, we still need to open and activate (or load) it first.[14]
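The difference between an installed and a loaded package can be tried out with a package that comes with R but is normally not loaded at startup. The sketch below uses the tools package and its file_ext() function as a stand-in for pkg and fancy_fun(). This choice is ours, not the book's, and it assumes a default R session in which tools has not been loaded yet.

```r
# tools is installed with R, but usually not attached when R starts:
"package:tools" %in% search()

# Calling one of its functions via the pkg:: prefix works without library():
tools::file_ext("01_basics.R")

# After loading the package, the prefix is no longer needed:
library(tools)
file_ext("01_basics.R")
"package:tools" %in% search()
```

The function search() lists the packages and environments in which R currently looks for objects, so it provides a quick way of checking what has been loaded.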
R vs. graphical user interfaces (GUIs): By default, R is an interpreted language that assumes that commands are entered at a prompt (typically shown as >) and are then evaluated by the underlying program. Over time, this basic way of interaction has been supported by graphical user interfaces (GUIs) that provide tools and separate windows for editing programs, displaying outputs, showing system information and libraries, etc. On most platforms, R comes with some GUI pre-installed, but the most versatile platforms to interact with R are so-called integrated development environments (IDEs) that need to be installed separately. Here, we will use the currently most popular and powerful IDE provided by RStudio (in its free, open source Desktop edition).[15]
Figure 1.2 shows an example of a mechanical toolbox that provides a range of tools to tackle common tasks. By analogy, R can be characterized as a software toolbox: A collection of tools for solving a wide array of tasks. However, as R, its functions, packages, and the RStudio IDE all are tools on some level, we are facing a rather wild medley of boxes and tools.
- Explain in which respects R is or is not like a Swiss knife.
- What would be a fitting analogy for a GUI or IDE?
If packages are viewed as tools for specific tasks, the obvious candidate would be a toolbox. However, as individual R functions can also be viewed as tools (see Section 1.2.5), every package is a toolbox in itself. Thus, this line of thinking leads to an elaborate system of Matryoshka dolls: Swiss-knife-like tools in toolboxes, that are contained in more elaborate toolboxes. However, the same task (e.g., creating a visualization) can also be addressed and solved on different levels (e.g., by using base R functions or the functions from another R package). Before we get too dizzy, we should remind ourselves that computers essentially are universal machines with many layers of inter-related systems, each of which can be described in terms of the tool vs. toolbox analogy. Thus, all such descriptions are somewhat arbitrary and crucially depend on our current perspective and interests.
- Find out your current R version, the packages currently installed in your R library, and the packages that are being pre-loaded when starting R.
Hint: There are R functions for this, but your GUI/IDE also provides a lot of information about R and its packages.
The current R version is typically shown in the Console when starting a new R session. The list of all currently installed packages is available in the Packages tab of the RStudio IDE. Currently loaded packages are marked by a check mark, and clicking on a package name opens the corresponding documentation. The same window also offers options for installing or updating packages (assuming a working network connection).
As always in R, all these tasks (and many more) can also be performed by evaluating dedicated R functions:
# Library and packages:
library()  # prints the location of your library and the names and description of packages
getOption("defaultPackages")  # prints the names of packages loaded by default

# Info on current R version and session:
R.version
sessionInfo()  # prints currently loaded packages

# Information on a specific package:
?ggplot2  # shows package documentation (in the Help tab)
- There are multiple R packages that define a filter() function. Evaluate ?filter() to find at least two corresponding packages. How could you call the corresponding function of each package?
library(tidyverse)  # as part of the tidyverse packages vs.
library(dplyr)      # load only the dplyr package

# provided that dplyr has been loaded by one of the commands above,
# the following shows that dplyr and stats define this command:
?filter

# Calling commands from installed packages (arguments omitted):
dplyr::filter()  # would call the filter() command of dplyr
stats::filter()  # would call the filter() command of stats
Note: When loading a package, any conflicts with pre-loaded objects are displayed in the Console (as “masked” objects). For instance, when starting R and then only loading the dplyr package, we see that it re-defines several objects from R’s base and stats packages.
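Which version of a masked function is currently in use can also be checked in code. The following sketch defines a hypothetical helper defined_in() (our name, not an existing R function) that reports the package in which a function was defined. It works for ordinary R functions, but not for primitive functions such as sum().

```r
# Hypothetical helper: name of the environment (usually a package)
# in which the function f was defined:
defined_in <- function(f) {
  environmentName(environment(f))
}

# With an explicit prefix, the answer does not depend on loaded packages:
defined_in(stats::filter)

# Without a prefix, the result depends on the session:
# the stats version in a fresh session, the dplyr version after library(dplyr).
defined_in(filter)

# Names of objects that exist more than once on the current search path:
conflicts()
```

Using the pkg:: prefix in scripts avoids any dependence on the order in which packages were loaded.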
1.1.4 Getting ready
We start our first session by creating an R script (with the file extension .R) and loading the R packages of the tidyverse and ds4psy.
To facilitate finding information in a script, always structure it by inserting explicit headings, plenty of space between different parts, and meaningful comments (i.e., lines preceded by the # symbol).
A neat feature of the editor in RStudio is its foldable sections, which automatically appear when a commented line ends with 4 or more consecutive dashes (i.e., # ----) and allow closing and opening the corresponding section (by clicking on the small triangle on the left or using the Cmd + Alt + O and Cmd + Alt + Shift + O keyboard shortcuts).
Here is an example of what your initial R script could look like:
## R basics | ds4psy
## Your Name | 2022 July 15
## ----------------------------

## Preparations: ----------

# Load additional packages for this session:
library(tidyverse)
library(ds4psy)

## Topic: ----------

# ...

## End of file (eof). ----------
Save your R script (e.g., as 01_basics.R in the R folder of your project) and remember to save it regularly as you keep adding content to it.
[10] Installing a package requires an internet connection. More specifically, your system downloads packages from a mirror of the Comprehensive R Archive Network (aka CRAN), which is a nifty way of making over 18,000 packages available worldwide.
[11] getOption("defaultPackages") in your R console lists the packages that belong to this exclusive set.
[12] library() without the name of a specific package prints the location of your package library and a list of all packages currently installed in it.
[13] The technical term for instructing R to “look for a command within a particular package” is “namespace.” Any R object is defined within some particular range (or “scope”) and must first be activated to become available.
[14] The analogy breaks down when it comes to using multiple tools at once: In R, we typically load many packages in parallel. On a Swiss knife, this would be difficult or dangerous.