1.6 Advanced issues
This section introduces some more advanced issues:
- Importing data allows reading data from files or online locations;
- Factors are variables (vectors) that contain categorical data;
- Lists are hierarchical (recursive) vectors;
- Random sampling creates vectors by drawing elements from populations or distributions;
- Flow control provides a glimpse of conditional statements and loops in R.
As most of these topics are covered in greater detail in later chapters, this section only serves as a brief introduction or “sneak preview”. So do not panic if some of these topics remain a bit puzzling at this point. Essentially, our current goal is only to become familiar with these concepts and distinctions. It is good to be aware of their existence and recognize them later, even if some details remain a bit fuzzy at this stage.
1.6.1 Importing data
In most cases, we do not generate the data that we analyze, but obtain a table or dataset from somewhere else. Typical locations of data include:
- data included in R packages,
- data stored on a local or remote hard drive,
- data stored on online servers.
R and RStudio provide many ways of reading in data from various sources. Which way is suited to a particular dataset depends mostly on the location of the file and the format in which the data is stored. We will examine different ways of importing datasets in Chapter 6. Here, we show the 2 most common ways of importing a dataset.
The dataset we import stems from an article in the Journal of Clinical Psychology (Woodworth, O’Brien-Malone, Diamond, & Schüz, 2017). An influential paper by Seligman et al. (Seligman, Steen, Park, & Peterson, 2005) found that several positive psychology interventions reliably increased happiness and decreased depressive symptoms in a placebo-controlled internet study. Woodworth et al. (2017) re-examined this claim by measuring the long-term effectiveness of different web-based positive psychology interventions and published their data in another article (Woodworth, O’Brien-Malone, Diamond, & Schüz, 2018) (see Appendix B.1 for details).
Data from a file
When loading data that is stored as a file, there are 2 questions to answer:
- Location: Where is the file stored?
- Format: In which format is the file stored?
To illustrate how data can be imported from an online source, we store a copy of the participant data from Woodworth et al. (2018) as a text file in CSV (comma-separated-value) format on a web server at http://rpository.com/ds4psy/data/posPsy_participants.csv.
Given this setup, we can load the dataset into an R object p_info by evaluating the following command (from the package readr, which is part of the tidyverse):
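Given the URL above, a corresponding command could look as follows (a sketch assuming that the readr package is installed):

```r
library(readr)  # part of the tidyverse

# Read the CSV file from its online location into p_info:
p_info <- read_csv("http://rpository.com/ds4psy/data/posPsy_participants.csv")
```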
Note the feedback message provided by the read_csv() function of the readr package: It tells us the names of the variables (columns) that were read and the data type of these variables (here: numeric variables of type “double”). To obtain basic information about the newly created tibble p_info, we can simply evaluate its name:
```r
p_info
#> # A tibble: 295 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows
```
Data from an R package
An even simpler way of obtaining a data file is available when datasets are stored in and provided by R packages. In this case, some R programmer has typically saved the data in a compressed format (as an .rda file) and it can be accessed by installing and loading the corresponding R package. Provided that we have installed and loaded this package, we can easily access the corresponding dataset.
In our case, we need to load the ds4psy package, which contains the participant data as an R object posPsy_p_info.
We can treat and manipulate such data objects just like any other R object. For instance, we can copy the dataset posPsy_p_info into another R object by assigning it to p_info_2:
Having loaded the same data in 2 different ways, we should verify that we obtained the same result both times.
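In code (assuming that the ds4psy package has been installed):

```r
library(ds4psy)            # load the package

p_info_2 <- posPsy_p_info  # copy the dataset into another R object
```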
We can verify that p_info and p_info_2 are equal by using the all.equal() function:
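For instance (assuming that both objects have been created as above):

```r
all.equal(p_info, p_info_2)  # should be TRUE if both contain the same data
```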
Throughout this book, we will primarily rely on the datasets provided by the ds4psy package or the example datasets included in various tidyverse packages. Loading files from different locations and commands for writing files in various formats will be discussed in detail in Chapter 6 on Importing data.
Checking a dataset
To get an initial idea about the contents of a dataset (often called a data frame, table, or tibble), we typically inspect its dimensions, print it, ask for its structure (by using the base R command str()), or take a glimpse() at its variables and values:
```r
dim(p_info)  # 295 rows, 6 columns
#> 295 6

p_info  # prints a summary of the table/tibble
#> # A tibble: 295 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows

str(p_info)  # shows the structure of an R object
#> tibble [295 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ id          : num [1:295] 1 2 3 4 5 6 7 8 9 10 ...
#> $ intervention: num [1:295] 4 1 4 3 2 1 3 2 1 2 ...
#> $ sex         : num [1:295] 2 1 1 1 2 1 1 1 1 1 ...
#> $ age         : num [1:295] 35 59 51 50 58 31 44 57 36 45 ...
#> $ educ        : num [1:295] 5 1 4 5 5 5 5 4 4 4 ...
#> $ income      : num [1:295] 3 1 3 2 2 1 2 2 3 3 ...
#> - attr(*, "spec")=
#>   .. cols(
#>   ..   id = col_double(),
#>   ..   intervention = col_double(),
#>   ..   sex = col_double(),
#>   ..   age = col_double(),
#>   ..   educ = col_double(),
#>   ..   income = col_double()
#>   .. )

tibble::glimpse(p_info)  # shows the types and initial values of all variables (columns)
#> Rows: 295
#> Columns: 6
#> $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ intervention <dbl> 4, 1, 4, 3, 2, 1, 3, 2, 1, 2, 2, 2, 4, 4, 4, 4, 3, 2, 1,…
#> $ sex          <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,…
#> $ age          <dbl> 35, 59, 51, 50, 58, 31, 44, 57, 36, 45, 56, 46, 34, 41, …
#> $ educ         <dbl> 5, 1, 4, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 1, 2, 1, 4, 5, 3,…
#> $ income       <dbl> 3, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 3, 3, 2, 2, 1, 2, 2, 1,…
```
Understanding a dataset
When loading a new data file, it is crucial to always obtain a description of the variables and values contained in the file (often called a Codebook).
For the dataset loaded into p_info, this description looks as follows:
posPsy_participants.csv contains demographic information of 295 participants:

id: participant ID.

intervention: 3 positive psychology interventions, plus 1 control condition:
- 1 = “Using signature strengths”,
- 2 = “Three good things”,
- 3 = “Gratitude visit”,
- 4 = “Recording early memories” (control condition).

sex:
- 1 = female,
- 2 = male.

age: participant’s age (in years).

educ: level of education:
- 1 = Less than Year 12,
- 2 = Year 12,
- 3 = Vocational training,
- 4 = Bachelor’s degree,
- 5 = Postgraduate degree.

income:
- 1 = below average,
- 2 = average,
- 3 = above average.
Beyond conveniently loading datasets, another advantage of using data provided by R packages is that the details of a dataset are easily accessible via the standard R help system. For instance, provided that the ds4psy package is installed and loaded, we can obtain the codebook and background information of the posPsy_p_info data by evaluating ?posPsy_p_info.
Of course, being aware of the source, structure, and variables of a dataset are only the first steps of data analysis. Ultimately, understanding a dataset is the motivation and purpose of its analysis. The next steps on this path involve phases of data screening, cleaning, and transformation (e.g., checking for missing or extreme cases, viewing variable distributions, and obtaining summaries or visualizations of key variables). This process and mindset are described in the chapter on Exploring data (Chapter 4).
Using the data in p_info, create a new variable that is TRUE if and only if a person has a Bachelor’s or Postgraduate degree.
Use R commands to obtain (the row data of) the youngest person with a university degree.
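One possible solution sketch (the variable name uni is our choice, not prescribed by the text; it relies on the codebook values educ = 4 and educ = 5):

```r
# A logical variable that is TRUE iff a person has a university degree:
p_info$uni <- p_info$educ >= 4  # 4 = Bachelor's, 5 = Postgraduate degree

# The row data of the youngest person with a university degree:
p_info[p_info$uni & p_info$age == min(p_info$age[p_info$uni]), ]
```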
We will examine this data file further in Exercise 8 (see Section 1.8.8).
1.6.2 Factors

Whenever creating a new vector or a table with several variables (i.e., a data frame or tibble), we need to ask ourselves whether the variable(s) containing character strings or numeric values are to be considered as factors. A factor is a categorical variable (i.e., a vector or column in a data frame) that is measured on a nominal scale (i.e., distinguishes between different levels, but attaches no meaning to their order or distance). Factors may sort their levels in some order, but this order merely serves illustrative purposes (e.g., arranging the entries in a table or graph). Typical examples of factor variables are gender (e.g., “male”, “female”, “other”) or eye color (“blue”, “brown”, “green”, etc.). A good test for a categorical (or nominal) variable is that its levels can be re-arranged (without changing the meaning of the measurement) and that averaging makes no sense (as the distances between levels are undefined).
stringsAsFactors = FALSE
Let’s revisit our example of creating a data frame df from the four vectors name, gender, age, and height (see Section 1.5.2).
The columns of this table were originally provided as vectors of type “character”. In many cases, character variables simply contain text data that we do not want to treat as factors. When we want to prevent R from converting character variables into factors when creating a new data frame, we can explicitly set the option stringsAsFactors = FALSE in the data.frame() function:
Let’s inspect the resulting gender variable of df: This shows that df$gender is a vector of type “character”, rather than a factor. Moreover, df$gender is equal to the original vector gender, as that vector also was of type “character”.
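A minimal sketch of these steps (the vector values below are our invention, in the spirit of Section 1.5.2):

```r
# Hypothetical vectors (stand-ins for those of Section 1.5.2):
name   <- c("Anna", "Ben")
gender <- c("female", "male")
age    <- c(24, 31)
height <- c(166, 180)

# Prevent conversion of character vectors into factors:
df <- data.frame(name, gender, age, height,
                 stringsAsFactors = FALSE)

df$gender             # a character vector, not a factor
is.factor(df$gender)  # FALSE
typeof(df$gender)     # "character"
```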
stringsAsFactors = TRUE
Up to R version 4.0.0 (released on 2020-04-24), the default setting of the data.frame() function was stringsAsFactors = TRUE. Thus, when creating the data frame df above, the variables consisting of character strings (here: name and gender) were — to the horror of generations of students — automatically converted into factors.19
We can still re-create the previous default behavior by setting stringsAsFactors = TRUE:
```r
df <- data.frame(name, gender, age, height,
                 stringsAsFactors = TRUE)

df$gender  # Levels are not quoted and ordered alphabetically
#> male   female female female female male   male
#> Levels: female male
is.factor(df$gender)
#> TRUE
typeof(df$gender)
#> "integer"
unclass(df$gender)
#> 2 1 1 1 1 2 2
#> attr(,"levels")
#> "female" "male"
```
In this data frame df, all character variables were converted into factors. When inspecting a factor variable, its levels are printed by their text labels. But when looking closely at the output of df$gender, we see that the labels female and male are not quoted (as the text elements in a character variable would be) and that the factor levels are printed below the vector.
Factors are similar to character variables insofar as they identify cases by a text label.
However, the different factor values are internally saved as integer values that are mapped to the different values of the character variable in a particular order (here: alphabetically). The unclass() command shows that the values of the name variable of df are internally saved as integers from 1 to 7 (with the original names as labels of the seven factor levels). Similarly, the gender variable of df contains only integer values of 1 and 2, with 1 corresponding to “female” and 2 corresponding to “male”.
Importantly, a factor value denotes solely the identity of a particular factor level, rather than its magnitude. For instance, elements with an internal value of 2 (here: “male”) are different from those with a value of 1 (here: “female”), but not larger or twice as large.
To examine the differences between factors and character variables, we can use the as.character() and as.factor() functions to switch between factors and character variables.
as.character() turns factors into character variables
Using the as.character() function on a factor and assigning the result to the same variable turns a factor into a character variable:
```r
# (1) gender as a character variable:
df$gender <- as.character(df$gender)
df$gender
#> "male"   "female" "female" "female" "female" "male"   "male"
is.factor(df$gender)  # not a factor
#> FALSE
levels(df$gender)     # no levels
#> NULL
typeof(df$gender)     # a character variable
#> "character"
unclass(df$gender)
#> "male"   "female" "female" "female" "female" "male"   "male"

# as.integer(df$gender)  # would yield an error, as undefined for character variables.
```
We see that using the as.character() function on the factor df$gender created a character variable. Each element of the character variable is printed in quotation marks (i.e., with quotes surrounding each text element).
as.factor() turns character variables into factors
When now using the as.factor() function on a character variable, we turn the character string variable into a factor.
Internally, this changes the variable in several ways:
Each distinct character value is turned into a particular factor level, the levels are sorted (here: alphabetically), and the levels are mapped to an underlying numeric representation (here: consecutive integer values, starting at 1):
```r
# (2) gender as a factor variable:
df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> male   female female female female male   male
#> Levels: female male
is.factor(df$gender)   # a factor
#> TRUE
levels(df$gender)      # 2 levels
#> "female" "male"
typeof(df$gender)      # an integer variable
#> "integer"
as.integer(df$gender)  # convert factor into numeric variable
#> 2 1 1 1 1 2 2
```
Thus, using as.factor() on a character variable created a factor that is internally represented as integer values. When printing the variable, its levels are identified by the factor labels, but not printed in quotation marks (i.e., without quotes surrounding each element).
factor() defines factor variables with levels
Suppose we knew not only the current age, but also the exact date of birth (DOB) of the people in df. The new variable df$DOB (i.e., a column of df) is of type “Date”, which is another type of data that we will encounter in detail in Chapter 10 on Dates and times. A nice feature of date and time variables is that they can be used to extract date- and time-related elements, like the names of months and weekdays.
Given the DOB values of df, the corresponding variables month and wday are as follows:
The two new variables df$month and df$wday presently are character vectors:
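How such variables can be created is sketched below, using the base R functions months() and weekdays() for “Date” objects (the dates here are hypothetical, not those used in the text; note that the names returned are locale-dependent):

```r
# Hypothetical dates of birth:
dob <- as.Date(c("2000-01-01", "1999-12-31"))

months(dob, abbreviate = TRUE)    # abbreviated month names (e.g., "Jan" "Dec")
weekdays(dob, abbreviate = TRUE)  # abbreviated weekday names

# Analogously, for a data frame df with a DOB column:
# df$month <- months(df$DOB, abbreviate = TRUE)
# df$wday  <- weekdays(df$DOB, abbreviate = TRUE)
```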
If we ever wanted to categorize people by the month or weekday on which they were born or create a visualization that used the names of months or weekdays as an axis, it would make more sense to define these variables as factors.
However, simply using as.factor() on the month and wday columns of df would fall short:
The resulting vectors would be factors, but their levels are incomplete (as not all months and weekdays were present in the data) and would be sorted in alphabetical order (e.g., the month of “Apr” would precede “Feb”).
The factor() function allows defining a factor variable from scratch. As most factors can assume only a fixed number of levels, the function contains an argument levels for providing the possible factor levels (as a character vector):
```r
# Define all possible factor levels (as character vectors):
all_months <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
all_wdays  <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Define factors (with levels):
factor(df$month, levels = all_months)
#> Mar Jul Apr Oct Dec Jan Dec
#> Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
factor(df$wday, levels = all_wdays)
#> Sun Sun Fri Sat Wed Tue Fri
#> Levels: Mon Tue Wed Thu Fri Sat Sun
```
As we explicitly defined all_months and all_wdays (as character vectors) and specified them as the levels of the factor() function, the resulting vectors are factors that know about values that did not occur in the data.
At this point, the factor levels are listed in a specific order, but are only considered to be different from each other. If we explicitly wanted to express that the levels are ordered, we can set the argument ordered = TRUE:
```r
factor(df$month, levels = all_months, ordered = TRUE)
#> Mar Jul Apr Oct Dec Jan Dec
#> 12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
factor(df$wday, levels = all_wdays, ordered = TRUE)
#> Sun Sun Fri Sat Wed Tue Fri
#> Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun
```
Thus, if we wanted to convert the month and wday variables of df into ordered factors, we would need to re-assign them as follows:
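Using the level vectors defined above, the re-assignment could look as follows:

```r
# Re-assign both variables as ordered factors:
df$month <- factor(df$month, levels = all_months, ordered = TRUE)
df$wday  <- factor(df$wday,  levels = all_wdays,  ordered = TRUE)
```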
Appearance vs. representation
We just encoded the character variables month and wday of a data frame df as ordered factors. However, the resulting data frame df looks and prints as it did before:
Thus, defining the character variables of df as ordered factors affects how these variables are represented and treated by R functions, but not necessarily how they appear. The difference between the properties of an underlying representation and its (often indiscriminate and possibly deceptive) appearance on the surface is an important aspect of representations that will pop up repeatedly throughout this book and course.
Note that turning variables into factors affects what we can do with them.
For instance, if month and wday were numeric variables, we could use them for arithmetic comparisons and indexing. As these variables are factors, however, such statements yield errors or unexpected results:
as.numeric() turns factor levels into numbers, but…
We could fix this by using the as.numeric() function for converting factor levels into numbers: This seems to work, but requires that we are aware of how the factor levels were defined (e.g., that the 7th level of wday corresponds to Sunday). As it is easy to make mistakes when interpreting factors as numbers, we should always check and double-check the result of such conversions. Actually, converting factor levels into numbers is often a sign that a variable should not have been encoded as a factor.
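For instance, assuming the ordered wday factor from above (with "Sun" as its 7th level), Sundays could be identified numerically:

```r
as.numeric(df$wday)       # factor levels as integers (1 = "Mon", ..., 7 = "Sun")
as.numeric(df$wday) == 7  # TRUE for people born on a Sunday
```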
The subtle, but profound lesson here is that variables that may appear the same (e.g., when printing a variable or inspecting a data table) may still differ in important ways. Crucially, the type and status of a variable affects what can be done with it and how it is treated by R commands. For instance, calling summary() on a factor yields a case count for each factor level (including levels without any cases):
By contrast, calling summary() on the corresponding numeric variables yields descriptive statistics of their numbers, and the summary() of the corresponding character variables merely describes the vector of text labels:
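A sketch of the three contrasting summary() calls (assuming the df variables from above):

```r
summary(df$wday)                # factor: a count per level (incl. empty levels)
summary(df$age)                 # numeric: Min., quartiles, Mean, Max.
summary(as.character(df$wday))  # character: only length, class, and mode
```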
Thus, even when variables may look the same when inspecting the data, it really matters how they are internally represented.
In factors, the difference between a variable’s appearance and its underlying representation is further complicated by the option of supplying a label (as a character object) for each factor level. Unless there are good reasons for an additional layer of abstraction (e.g., for labeling groups in a graph), we recommend only defining factor levels.
At this point, we do not need to understand the details of factors. But as factors occasionally appear accidentally — as stringsAsFactors = TRUE was the default for many decades, until it was changed in R 4.0.0 (on 2020-04-24)20 — and are useful when analyzing and visualizing empirical data (e.g., for distinguishing between different experimental conditions), it is good to know that factors exist and can be dealt with.
Alternative factors: In many cultures and regions, people consider Sunday to be the first day of a new week, rather than the last day of the previous week.
Define a corresponding ordered factor wday_2 (for the data frame df) that treats Sunday as the first day of the week.
In which ways are the factor variables wday and wday_2 the same or different?
Use both factor variables to identify people born on a Sunday (by vector indexing).
- Defining an ordered factor wday_2 (for the data frame df):
- Comparing the factors wday and wday_2:
All the values of df$wday_2 remain the same. But in wday_2, Sunday (Sun) is now the first factor level, rather than the last:
If the two factor variables were re-interpreted as numbers, their results would no longer be identical (even though the differences are fairly subtle in this example):
- Using vector indexing on both factor variables to identify people born on a Sunday:
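A sketch covering all three steps (assuming the df and wday variables from above; the level order is the only substantive change):

```r
# (a) Define wday_2 with Sunday as the first level:
all_wdays_2 <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
df$wday_2   <- factor(df$wday, levels = all_wdays_2, ordered = TRUE)

# (b) Compare the two factors (same values, different level order):
as.numeric(df$wday)    # "Sun" is level 7
as.numeric(df$wday_2)  # "Sun" is level 1

# (c) Identify people born on a Sunday (by vector indexing):
df$name[df$wday   == "Sun"]
df$name[df$wday_2 == "Sun"]
```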
1.6.3 Lists

Beyond scalars, vectors, and tables, R provides “lists” as yet another shape in which data can be represented.
Lists are sequential data structures in which each element can have an internal structure. Thus, they are similar to vectors (e.g., in having a certain length()), but every element of a list can be a complex (rather than an elementary) object.
We can create a list by applying the list() function to a sequence of elements. The objects l_1 and l_2 are both lists and contain the same three numeric elements, but differ in the representation’s shape:
- l_1 is a list with three elements, each of which is a scalar (i.e., a vector of length 1).
- l_2 is a list with two elements. The first is a scalar, but the second is a (numeric) vector (of length 2).
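For instance, the two lists just described could be defined as follows (the numeric values 1, 2, 3 are our assumption):

```r
l_1 <- list(1, 2, 3)     # three elements, each a scalar
l_2 <- list(1, c(2, 3))  # two elements: a scalar and a vector of length 2

length(l_1)  # 3
length(l_2)  # 2
```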
Lists are more flexible, but also more complex than the other data shapes we encountered so far. Unlike atomic vectors, lists can contain a mix of data shapes and types:
The ability to store a mix of data shapes and types in a list allows creating complex representations:
As lists can contain other lists, they can be used to construct arbitrarily complex data structures (like tables or tree-like hierarchies):
Finally, the elements of lists can be named. As with vectors, the names() function is used to both retrieve and assign names:
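A brief sketch (with a hypothetical list l_3 that mixes data types):

```r
l_3 <- list(1, "two", TRUE)     # a list mixing data types
names(l_3) <- c("a", "b", "c")  # assign names
names(l_3)                      # retrieve names: "a" "b" "c"
l_3$b                           # access an element by its name: "two"
```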
The is.list() function allows checking whether some R object is a list, but note that lists are also considered to be vectors. As lists are structured objects, a useful function for inspecting them is str():
Lists are powerful structures for representing data, but can also easily get complicated. In practice, we will rarely need lists, as vectors and tables are typically sufficient for our purposes. However, as we occasionally will encounter lists (e.g., as the output of statistical functions), it is good to be aware of them and know how to access their elements.
Accessing list elements
Accessing list elements is similar to indexing vectors, but needs to account for an additional layer of complexity. This is achieved by distinguishing between single square brackets (i.e., [ ]) and double square brackets (i.e., [[ ]]):
- x[i] returns the i-th sub-list of a list x (as a list);
- x[[i]] removes a level of the hierarchy and returns the i-th element of a list x (as an object).
The distinction between single and double square brackets is important when working with lists: x[i] always returns a smaller list, whereas x[[i]] removes a hierarchy level to return a list element. Thus, what is achieved by x[i] with vectors is achieved by x[[i]] with lists. An example illustrates the difference:
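For instance (with a hypothetical list l_4):

```r
l_4 <- list(1, c(2, 3), "four")

l_4[2]    # single brackets: a (smaller) list containing the 2nd element
l_4[[2]]  # double brackets: the 2nd element itself (a numeric vector)

is.list(l_4[2])    # TRUE
is.list(l_4[[2]])  # FALSE
```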
For named lists, there is another way of accessing list elements that is similar to accessing the named variables (columns) of a data frame: x$n selects the list element with the name n (like x[["n"]]). Both x[[i]] and x$n return list elements that can be of various data types and shapes. In the case of l_5, the 2nd element named “two” happens to be a list:
Vectors and data frames as lists
Due to their immense flexibility, the data structures used earlier in this chapter can be re-represented as lists. In Section 1.4.1, we used the analogy of a train with a series of waggons to introduce vectors. The train vector could easily be transformed into a list with 15 elements:
However, this re-representation would only add complexity without a clear benefit. As long as we just want to record the load of each waggon, storing
train in the shape of a vector is not only sufficient, but simpler (and typically better) than using a list.
What could justify using a list? Suppose we wanted to store not just the load of each waggon, but also its weight. A corresponding list train_ls could look as follows:

The list train_ls contains two vectors: One records the load of each waggon (as a vector with 15 elements, stored as a factor with 3 levels) and the other one records its weight (as a vector with 15 numeric elements).
Does the addition of a second vector justify using a list? Not really. If both vectors have the same length, a simpler (and typically better) way to store this data would be a data frame:
This data frame provides convenient access to each column (i.e., the two variables that were the vectors of the list), each row (i.e., each waggon), and cell (i.e., any combination of a variable and waggon).
Thus, the lesson to be learned here is that lists are flexible, but rarely needed. As long as data is systematic and rectangular, vectors and data frames are sufficient and to be preferred. Not only are simpler data structures typically better, but many R functions are written and optimized for vectors and data frames, rather than lists. As a consequence, lists are only used when data requires mixing data types or the types or shapes of data are so complex, irregular, or unpredictable that they do not fit into a rectangular table.
- List features:
What is similar for lists and vectors, and what distinguishes them?
What is similar for lists and tables (data frames/tibbles), and what distinguishes them?
- List access:
Someone makes the following claims about R data structures:
“What [i] does with vectors is done with [[i]] with lists.”
“What t$n does with tables is done with ls$n with lists.”
Explain what is “what” in both cases and construct an example that illustrates the comparison.
- The “what” in “What [i] does with vectors is done with [[i]] with lists.” refers to numeric indexing of vector/list elements.
- The “what” in “What df$n does with tables is done with ls$n with lists.” refers to named indexing (of table columns/list elements).
- Lists vs. vectors:
Do x <- list(1:3) and y <- list(1, 2, 3) define identical or different objects? Describe both objects in terms of lists and vectors.
How can we access and retrieve the element “3” in the objects x and y?
How can we change the element “3” in the objects x and y?
Do x <- list(1:3) and y <- list(1, 2, 3) define identical or different objects? Describe both objects in terms of lists and vectors.
Both commands define a list, but two different ones:
- `x`: `list(1:3)` is identical to `list(c(1, 2, 3))` and defines a list with _one_ element, which is the numeric vector `1:3`.
- `y`: `list(1, 2, 3)` defines a list with _three_ elements, each of which is a numeric scalar (i.e., a vector of length 1).
- How can we access and retrieve the element “3” in the objects x and y?
- How can we change the element “3” in the objects x and y?
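A sketch of both access and change (based on the definitions above):

```r
x <- list(1:3)      # 1 element: the vector 1:3
y <- list(1, 2, 3)  # 3 elements: three scalars

# Accessing the element "3":
x[[1]][3]  # 3rd element of the vector in x's (only) list element
y[[3]]     # 3rd list element of y

# Changing the element "3":
x[[1]][3] <- 33
y[[3]]    <- 33
```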
- A table vs. list of people:
Take one of the data frames df (describing people) from above (e.g., from Section 1.5.2) and convert it into a list df_ls. Then solve the following tasks for both the table df and the list df_ls:
- Get the vector of all names.
- Get the gender of the 3rd and the 6th person.
- Get the maximum age of all persons.
- Compute the mean height of all persons.
- Get all data about Nero.
- Converting df into a list df_ls:
- Get the vector of all names:
- Get the gender of the 3rd and the 6th person:
- Get the maximum age of all persons:
- Compute the mean height of all persons:
- Get all data for Nero:
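A self-contained solution sketch (the df values below are our invention, in the spirit of Section 1.5.2; only the name “Nero” is given by the text):

```r
# Hypothetical data frame (stand-in for the df of Section 1.5.2):
df <- data.frame(name   = c("Anna", "Ben", "Cleo", "Dana", "Eva", "Nero", "Zeno"),
                 gender = c("male", "female", "female", "female",
                            "female", "male", "male"),
                 age    = c(21, 23, 22, 19, 21, 18, 24),
                 height = c(165, 170, 168, 172, 158, 185, 182),
                 stringsAsFactors = FALSE)

df_ls <- as.list(df)  # convert the data frame into a list

df$name;             df_ls$name             # (a) all names
df$gender[c(3, 6)];  df_ls$gender[c(3, 6)]  # (b) gender of 3rd and 6th person
max(df$age);         max(df_ls$age)         # (c) maximum age
mean(df$height);     mean(df_ls$height)     # (d) mean height

# (e) all data about Nero:
df[df$name == "Nero", ]                   # easy for a table (one row)
lapply(df_ls, `[`, df_ls$name == "Nero")  # more cumbersome for a list
```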
We see that accessing variables (in columns) is simple and straightforward for both tables and lists. However, accessing individual records in rows is much easier for tables than for lists — which is why we typically record our data as tables.
1.6.4 Random sampling
Random sampling creates vectors by randomly drawing objects from some population of elements or a mathematical distribution. Two basic ways of sampling consist in (A) drawing objects from a given population, and (B) drawing values from a distribution that is described by its mathematical properties.
A. Sampling from a population
A common task in psychology and statistics is drawing a sample from a given set of objects.
In R, the sample() function allows drawing a sample of size size from a population x. A logical argument replace specifies whether the sample is to be drawn with or without replacement.

Not surprisingly, the population x is provided as a vector of elements, and the result of sample() is another vector of length size:
```r
# Sampling vector elements (with sample):
sample(x = 1:3, size = 10, replace = TRUE)
#> 2 1 3 3 1 3 1 2 2 3

# Note:
# sample(1:3, 10)  # would yield an error (as replace = FALSE by default).

# Binary sample (coin flip):
coin <- c("H", "T")    # 2 events: Heads or Tails
sample(coin, 5, TRUE)  # is short for:
#> "H" "T" "H" "H" "T"
sample(x = coin, size = 5, replace = TRUE)  # flip coin 5 times
#> "H" "T" "T" "T" "H"
sample(x = coin, size = 1000, replace = TRUE)  # flip coin 1000 times
#> "H" "H" "H" "T" "H" "H" "T" "T" "T" "H" "T" "H" "H" "H" "T" "T" "T" "T" "H"
#> "T" "H" "H" "H" "H" "T" "T" "H" "H" "H" "T" "T" "H" "H" "H" "T" "H" "T" "H"
#> "T" "T" "T" "T" "T" "H" "H" "H" "H" "H" "T" "H" "H" "H" "T" "T" "T" "T" "T"
#> "T" "H" "H" "H" "H" "T" "H" "H" "T" "T" "H" "T" "T" "T" "T" "H" "H" "H"
#> [ reached getOption("max.print") -- omitted 925 entries ]
```
B. Sampling from a distribution
As the sample() function required specifying a population x, it assumes that we can provide the set of elements from which our samples are to be drawn.
When creating artificial data (e.g., for practice purposes or simulations), we often cannot or do not want to specify all elements, but want to draw samples from a specific distribution. A distribution is typically described by its type and its corresponding mathematical properties — parameters that determine the location of values and thus the density and shape of their distribution. The most common distributions for psychological variables are:
- Binomial distribution (discrete values): The number of times an event with a binary (yes/no) outcome and a probability of prob occurs in size trials (see Wikipedia: Binomial distribution for details).
- Normal distribution (aka. Gaussian/bell curve; continuous values): Values that are symmetrical around a given mean with a given standard deviation sd (see Wikipedia: Normal distribution for details).
- Poisson distribution (discrete values): The number of times an event occurs, given a constant mean rate lambda (see Wikipedia: Poisson distribution for details).
- Uniform distribution (continuous values): An arbitrary outcome value within given bounds min and max (see Wikipedia: Uniform distribution for details).
R makes it very easy to sample random values from these distributions. The corresponding functions, their key parameters, and examples of typical measurements are listed in Table 1.1.
| Name | R function | Key parameters | Example variables |
| --- | --- | --- | --- |
| Binomial distribution | rbinom() | n, size, prob | binomial gender (female vs. non-female) |
| Normal distribution | rnorm() | n, mean, sd | height values, test scores |
| Poisson distribution | rpois() | n, lambda | number of times a certain age value occurs |
| Uniform distribution | runif() | n, min, max | arbitrary value in range |
As an example, assume that we want to generate test data for the age, gender (e.g., female vs. non-female), IQ scores, and some random test value for a sample of 1000 participants of a study. The actual distribution of each of these measures is subject to many empirical and theoretical considerations. For instance, if our sample consists of students of psychology, the number of females is likely to be higher than the number of non-females (at least as far as the University of Konstanz is concerned). However, the following definitions will provide us with plausible approximations (provided that the parameter values are plausible as well):
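A sketch of such definitions is shown below. The distribution choices follow the table above, but all parameter values (mean age, proportion of females, IQ standard deviation, test score range) are our assumptions, not prescribed by the text:

```r
n <- 1000  # sample size

# Generate plausible (but entirely artificial) data:
age    <- rpois(n, lambda = 24)            # ages, with an assumed mean of 24 years
gender <- rbinom(n, size = 1, prob = 2/3)  # 1 = female, assuming p(female) = 2/3
IQs    <- rnorm(n, mean = 100, sd = 10)    # IQ scores (sd = 10 assumed here)
tests  <- runif(n, min = 0, max = 100)     # arbitrary test values in 0 to 100
```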
Each of these simple functions generated a vector whose values we can now probe with other functions. For instance, here are some ways to inspect the vector IQs:
```r
# Describing the vector:
length(IQs)
#> 1000
str(IQs)
#> num [1:1000] 99.7 95 104 94 115.3 ...

# Describing vector values:
mean(IQs)   # arithmetic mean
#> 99.93255
sd(IQs)     # standard deviation
#> 10.33515
range(IQs)  # range (min to max)
#> 63.06327 131.17747
summary(IQs)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>   63.06   92.77   99.84   99.93  106.91  131.18
```
Note that the values for mean and sd are not exactly what we specified in the instruction that generated IQs, but pretty close — a typical property of random sampling.
The following histograms illustrate the distributions of values in the generated vectors:
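The figures themselves are not reproduced here, but a comparable histogram can be drawn with hist() (re-defining IQs so that the snippet is self-contained; the parameter values are assumptions):

```r
IQs <- rnorm(1000, mean = 100, sd = 10)  # assumed definition (as above)

hist(IQs, main = "Histogram of IQs", xlab = "IQ score")
```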
Here are some tasks that practice random sampling:
- Selecting and sampling: We have used the R object LETTERS in our practice of indexing/subsetting vectors above. Combine subsetting and sampling to create a random vector of 10 elements that are sampled (with replacement) from the letters "U" to "Z".
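One possible solution combines numeric indexing of LETTERS with sample():

```r
# The letters "U" to "Z" are the 21st to 26th elements of LETTERS:
x <- sample(LETTERS[21:26], size = 10, replace = TRUE)
x
```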
- Creating sequences and samples: Evaluate the following expressions and explain their results in terms of functions and their arguments (e.g., by looking up the documentation of the corresponding functions):
- Flipping coins: The ds4psy package contains a coin() function that lets you flip a coin once or repeatedly:
- Explore the coin() function and then try to achieve the same functionality by using the sample() function.
The three expressions using the coin() function can be re-created by using the sample() function with different arguments:
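A sketch of such re-creations (assuming that coin() flips a fair coin with the events "H" and "T", as in ds4psy):

```r
flip_1  <- sample(c("H", "T"), size = 1)                   # a single flip
flip_10 <- sample(c("H", "T"), size = 10, replace = TRUE)  # 10 flips
roll_1  <- sample(1:6, size = 1)                           # a different set of events
```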
Note: As the coin() and sample() functions involve random sampling, reproducing the same functionality with different functions or calling the same function repeatedly does not always yield the same results (unless we fiddle with R's random number generator).
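For instance, fixing the state of the random number generator with set.seed() makes random sampling reproducible:

```r
set.seed(101)
a <- sample(1:100, size = 5)

set.seed(101)  # same seed, same call:
b <- sample(1:100, size = 5)

identical(a, b)
#> [1] TRUE
```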
1.6.5 Flow control
As we have seen when defining our first scalars or vectors (in Sections 1.3 and 1.4), it matters in which order we define objects. For instance, the following example first defines x as a (numeric) scalar 1, but then re-assigns x to a vector (of type integer) 1:10, before re-assigning x to a scalar (of type character) "oops":
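A minimal version of this example could look as follows:

```r
x <- 1        # x is a numeric scalar
x <- 1:10     # x is now an integer vector (the scalar 1 is lost)
x <- "oops"   # x is now a character scalar (the vector 1:10 is lost)
x
#> [1] "oops"
```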
Although this example may seem trivial, it illustrates 2 important points:
R executes code sequentially: The order in which code statements and assignments are evaluated matters.
Whenever we change an object by re-assigning it, its old content (and its type) is lost.
At first, this dependence on evaluation order and the “forgetfulness” of R may seem like a nuisance. However, both features actually have their benefits when programming algorithms that require distinctions between cases or repeated executions of code. We will discuss such cases in our introduction to Programming (see Chapters 11 and 12). But as it is likely that you will encounter examples of if-then statements or loops before getting to these chapters, we briefly mention 2 major ways in which we can control the flow of information here.
A conditional statement in R begins with the keyword if (in combination with some TEST that evaluates to either TRUE or FALSE) and a THEN part that is executed if the test evaluates to TRUE. An optional keyword else introduces a corresponding ELSE part that is executed if the test evaluates to FALSE.
The basic structures of a conditional test in R are the following:
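In schematic form (with <TEST>, <THEN>, and <ELSE> as placeholders for actual code):

```r
if (<TEST>) { <THEN> }

if (<TEST>) { <THEN> } else { <ELSE> }
```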
Notice that both if statements involve two different types of parentheses: whereas the TEST is enclosed in round parentheses (), the THEN and ELSE parts are enclosed in curly brackets {}.
A caveat: Users that come from other statistical software packages (like SAS or SPSS) often recode data by using scores of conditional statements. Although this is possible in R, its vector-based nature and the powers of (logical and numeric) indexing usually provide better solutions.
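For instance, rather than looping over conditional statements, a vector can often be recoded in a single step by logical indexing (a sketch with made-up values):

```r
scores <- c(55, 72, 88, 40, 95)

# Recode all scores at once, without any if statement:
grades <- rep("fail", length(scores))
grades[scores >= 60] <- "pass"
grades
#> [1] "fail" "pass" "pass" "fail" "pass"
```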
- Predict the final outcome of evaluating y in the following code and then evaluate it to check your prediction:
Conditional statements are covered in greater detail in Chapter 11 on Functions (see Section 11.3). However, we will also see that the combination of vectorized data structures and indexing/subsetting (as introduced in Section 1.4.6) often allows us to avoid conditional statements in R (see Section 11.3.7).
Loops repeat particular lines of code as long as some criteria are met. This can be achieved in several ways, but the most common form of loop is the for loop, which increments an index or counter variable (often called i) within a pre-specified range.
The basic structure of a for loop in R is the following:
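Schematically, a for loop is written as for (i in <RANGE>) { <LOOP-BODY> }. A minimal concrete instance:

```r
total <- 0
for (i in 1:5) {
  total <- total + i  # i counts the current iteration (1, 2, ..., 5)
}
total
#> [1] 15
```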
The code in <LOOP-BODY> is executed repeatedly, as often as indicated by the specified range. The variable i serves as a counter that indicates the current iteration of the loop.
- Predict the effects of the following loops and then evaluate them to check your prediction:
We will learn more about loops in our chapter on Iteration (Chapter 12).
This concludes our sneak preview of some more advanced aspects of R. Again, do not worry if some of them remain a bit fuzzy at this point: they will re-appear and be explained in greater detail later. Let's wrap up this chapter and check what we have learned by doing some exercises.
Seligman, M. E., Steen, T. A., Park, N., & Peterson, C. (2005). Positive psychology progress: Empirical validation of interventions. American Psychologist, 60(5), 410. https://doi.org/10.1037/0003-066X.60.5.410
Wickham, H., & Grolemund, G. (2017). R for data science: Import, tidy, transform, visualize, and model data. Retrieved from http://r4ds.had.co.nz
Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2017). Web-based positive psychology interventions: A reexamination of effectiveness. Journal of Clinical Psychology, 73(3), 218–232. https://doi.org/10.1002/jclp.22328
Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35
The default value of the argument stringsAsFactors used to be TRUE for decades. As this caused much confusion, the default has now been changed: from R version 4.0.0 (released on 2020-04-24), the default is stringsAsFactors = FALSE. This shows that the R gods at https://cran.r-project.org/ are listening to user feedback, but you should not count on changes happening quickly.
The recent switch from the default of stringsAsFactors = TRUE to stringsAsFactors = FALSE actually teaches us an important lesson about R: Always be aware of defaults and try not to rely on them too much in your own code. Explicating the arguments used in our own functions will protect us from changes in implicit defaults. For background on R's stringsAsFactors default, see the post stringsAsFactors (by Kurt Hornik, on 2020-02-16) on the R developer blog.