1.5 Tables

As we have seen in Section 1.4, vectors are 1-dimensional sequences: they have a length, but no width (or only a trivial width of 1). By combining several vectors, we get a rectangular or tabular data structure, or simply a table. Tables are the most frequent way in which we will encounter data throughout this book.

Somewhat confusingly, tables (or rectangular data structures) exist in many variants and under different labels. The most common types of tables in R are data tables, data frames, tibbles, and matrices. As all these concepts denote tabular data structures, the differences between them concern details of their contents and their implementation. At this point, it is sufficient to know that they are all tables (i.e., rectangular data structures) and to distinguish between the 2 most common types: matrices and data frames.

1.5.1 Matrices

When a rectangle of data contains data of the same type in all cells (i.e., all rows and columns), we get a matrix of data:

# Creating 3 vectors: 
x <- 1:3
y <- 4:6
z <- 7:9

# Combining vectors (of the same length): ---- 
m1 <- rbind(x, y, z)  # combine as rows
m1
#>   [,1] [,2] [,3]
#> x    1    2    3
#> y    4    5    6
#> z    7    8    9

m2 <- cbind(x, y, z)  # combine as columns
m2
#>      x y z
#> [1,] 1 4 7
#> [2,] 2 5 8
#> [3,] 3 6 9

# Putting a vector into a rectangular matrix:
m3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)
m3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20

m4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)
m4
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20
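
To see what "the same type in all cells" means in practice, the following minimal sketch combines a numeric and a character vector into a matrix. The object names num, chr, and m_mixed are made up for this illustration. R converts all cells to one common type (here: character) without raising an error:

```r
# Two vectors of different types (hypothetical example data):
num <- 1:3
chr <- c("a", "b", "c")

# Combining them as rows of a matrix coerces all cells to a common type:
m_mixed <- rbind(num, chr)
m_mixed

typeof(num)      # "integer"
typeof(m_mixed)  # "character": the numbers were turned into text
```

If columns of different types need to keep their types, a matrix is the wrong data structure.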

Retrieving values from a matrix m works similarly to indexing vectors. However, due to the 2-dimensional nature of a matrix, we now need to specify 2 indices in square brackets: the number of the row, and the number of the column, separated by a comma. Thus, to obtain the value of row r and column c of a matrix m we need to evaluate m[r, c]. Just as with vectors, providing multiple numeric indices selects the corresponding rows or columns. When r or c is left unspecified, all rows or columns are selected.

# Selecting cells, rows, or columns of matrices: ---- 
m1[2, 3]  # in m1: select row 2, column 3
#> y 
#> 6
m2[3, 1]  # in m2: select row 3, column 1
#> x 
#> 3

m1[2,  ]  # in m1: select row 2, all columns
#> [1] 4 5 6
m2[ , 1]  # in m2: select column 1, all rows
#> [1] 1 2 3

m3
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    2    3    4
#> [2,]    5    6    7    8
#> [3,]    9   10   11   12
#> [4,]   13   14   15   16
#> [5,]   17   18   19   20
m3[2, 3:4] # in m3: select row 2, columns 3 to 4
#> [1] 7 8
m3[3:5, 2] # in m3: select rows 3 to 5, column 2
#> [1] 10 14 18

m4[]  # in m4: select all rows and all columns (i.e., all of m4)
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    6   11   16
#> [2,]    2    7   12   17
#> [3,]    3    8   13   18
#> [4,]    4    9   14   19
#> [5,]    5   10   15   20
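
As a further sketch of the m[r, c] notation, we can provide multiple indices for rows and columns at the same time, which yields a smaller matrix. The object name sub is made up for this illustration, and the drop argument is an extra not discussed in the text above. It controls whether a single row or column is simplified into a vector:

```r
m3 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = TRUE)

# Multiple row AND column indices yield a smaller matrix:
sub <- m3[c(1, 5), 2:3]  # rows 1 and 5, columns 2 to 3
sub
dim(sub)                 # 2 rows, 2 columns

# Selecting a single row returns a plain vector by default;
# drop = FALSE keeps the result a matrix (with 1 row):
is.matrix(m3[2, ])                # FALSE
is.matrix(m3[2, , drop = FALSE])  # TRUE
```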

Just as with vectors, we can apply functions to matrices. Typical examples include:

# Applying functions to matrices: ---- 
is.matrix(m1)
#> [1] TRUE
typeof(m2)
#> [1] "integer"

dim(m1)   # dimensions of m1: 3 rows and 3 columns
#> [1] 3 3
nrow(m2)  # number of rows of m2
#> [1] 3
ncol(m3)  # number of columns of m3
#> [1] 4

sum(m1)
#> [1] 45
max(m2)
#> [1] 9
mean(m3)
#> [1] 10.5
colSums(m3)  # column sums of m3
#> [1] 45 50 55 60
rowSums(m4)  # row sums of m4
#> [1] 34 38 42 46 50

Similarly, logical indexing (i.e., selecting elements by a test that yields TRUE or FALSE values) extends from vectors to matrices:

m4 > 10  # returns a matrix of logical values
#>       [,1]  [,2] [,3] [,4]
#> [1,] FALSE FALSE TRUE TRUE
#> [2,] FALSE FALSE TRUE TRUE
#> [3,] FALSE FALSE TRUE TRUE
#> [4,] FALSE FALSE TRUE TRUE
#> [5,] FALSE FALSE TRUE TRUE
typeof(m4 > 10)
#> [1] "logical"
m4[m4 > 10]  # indexing of matrices
#>  [1] 11 12 13 14 15 16 17 18 19 20
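
The same logical test can be combined with other functions or used on the left side of an assignment. The following is a minimal sketch, in which m5 is a made-up copy of m4 (so that m4 itself remains unchanged):

```r
m4 <- matrix(data = 1:20, nrow = 5, ncol = 4, byrow = FALSE)

sum(m4 > 10)       # number of cells above 10 (TRUE counts as 1)
mean(m4[m4 > 10])  # mean of the selected values

# Replace all values above 10 in a copy of m4 by NA:
m5 <- m4
m5[m5 > 10] <- NA
m5
```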

1.5.2 Tables: Data frames and tibbles

As matrices (and vectors) contain data of only 1 type (e.g., all cells are numeric, character, or logical data), we need another data structure for more diverse and interesting datasets. Imagine that you want to store a dataset that contains basic information on some people:

  • their names,
  • their gender,
  • their age (in years),
  • their height (in cm).

Each of these 4 variables can be stored as a vector (the first 2 of type character, the others of type numeric). To store all 4 variables in 1 data structure, we can think of the 4 vectors forming the columns of a rectangular table. The most common rectangular data structure in R is a data frame (or tibble, which is a simpler version of a data frame used in the tidyverse). Let’s create 4 vectors and combine them into a data frame:

# Create some vectors (of different types, but same length): -----  
name <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)

# Combine 4 vectors (of equal length) into a data frame: 
df <- data.frame(name, gender, age, height)
df    # Note: Vectors are the columns of the data frame!
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182

The resulting data frame df is a 2-dimensional object consisting of 7 rows (cases) and 4 columns (variables). We can apply some functions to obtain information about the data frame:

is.matrix(df)
#> [1] FALSE
is.data.frame(df)
#> [1] TRUE

# What are the dimensions of df?
dim(df)  # 7 rows (cases) x 4 columns (variables)
#> [1] 7 4

# Note that 
# sum(df)  # would yield an error
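
The reason for this error is that df contains non-numeric columns. As a minimal sketch, we can restrict such functions to the numeric columns by selecting them with the same [row, column] notation that we used for matrices (here: columns 3 and 4, which contain age and height):

```r
name   <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age    <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)
df <- data.frame(name, gender, age, height)

# Applying functions to numeric columns only:
sum(df[ , 3])        # sum of all values in column 3 (age)
colMeans(df[ , 3:4]) # means of columns 3 and 4 (age and height)
```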

We can easily turn any data frame into a tibble (by using the as_tibble command of the tidyverse package tibble):

# Turn df into a tibble tb: 
tb <- tibble::as_tibble(df)
dim(tb)  # 7 cases (rows) x 4 variables (columns), as df
#> [1] 7 4

We will learn more about tibbles later (see Chapter 5). At this point, a tibble is just a simpler and more convenient type of data frame. One advantage of tibbles is that they can be printed to the screen more easily. For instance, printing a tibble always shows its dimensions (as in dim(tb)) and the type of each of its variables (i.e., each column):

tb  # print tb
#> # A tibble: 7 x 4
#>   name   gender   age height
#>   <fct>  <fct>  <dbl>  <dbl>
#> 1 Adam   male      21    165
#> 2 Bertha female    23    170
#> 3 Cecily female    22    168
#> 4 Dora   female    19    172
#> 5 Eve    female    21    158
#> 6 Nero   male      18    185
#> 7 Zeno   male      24    182

Please remember: Both data frames and tibbles have columns — rather than rows — that consist of vectors. This is the reason for referring to the columns as the variables of the table.

Practice

  • Create a tibble that contains the names, ages, and family relations of some family (including 3 generations). What are the types of variables involved and the dimensions of the resulting tibble?

1.5.3 Working with data tables

As data frames and tibbles are tables (i.e., 2-dimensional data structures), we need 2 numeric indices to select specific cells: A 1st index for specifying rows (or cases), and a 2nd index for specifying columns (or variables). With these indices, selecting cells, rows (cases), or columns (variables) of a data frame or tibble works just like selecting the corresponding cells, rows, or columns in matrices:

# Selecting cells, rows or columns: ----- 
df[5, 3]  # cell in row 5, column 3: 21 (age of Eve)
#> [1] 21
df[6, ]   # row 6
#>   name gender age height
#> 6 Nero   male  18    185
df[ , 4]  # column 4
#> [1] 165 170 168 172 158 185 182

In addition to numeric indexing (i.e., selecting cells, rows, or columns by their numeric indices), we can also select columns by their name. In case we do not know the variable names of a table, we can use the names() function to obtain them (as a vector):

names(df)    # yields the names of all variables (columns)
#> [1] "name"   "gender" "age"    "height"
names(df)[4] # the name of the 4th variable
#> [1] "height"

For selecting a column of a table by its name, we can combine the name of the data structure (e.g., df) with the $ operator, followed by the name of the variable to be selected:

# Selecting variables (columns) by their name (with the $ operator):
df$gender  # returns gender vector
#> [1] male   female female female female male   male  
#> Levels: female male
df$age     # returns age vector
#> [1] 21 23 22 19 21 18 24
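
The $ operator is not the only way of selecting columns by name. As a sketch of two base R alternatives, we can also use a name (as a character string) inside [[ ]] or in the column position of [ , ]. The helper variable col is made up for this illustration:

```r
name   <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age    <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)
df <- data.frame(name, gender, age, height)

df[["age"]]              # same vector as df$age
df[ , "age"]             # same vector again
df[ , c("name", "age")]  # 2 columns by name: yields a smaller data frame

# A column name stored in a variable works with [[ ]], but not with $:
col <- "height"
df[[col]]  # the height vector
df$col     # NULL, as df has no column named "col"
```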

As the table was created by combining vectors (into the columns of df or tb), selecting a variable by its name yields a vector again. Thus, we can apply all kinds of functions to the variables (columns) of a table:

# Applying functions to columns of df:
df$gender == "male"
#> [1]  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE
sum(df$gender == "male")  # Note: TRUE is 1, FALSE is 0
#> [1] 3

df$age < 21
#> [1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE
df$age[df$age < 21]   # Which age values are below 21?
#> [1] 19 18
df$name[df$age < 21]  # What are the names of people below 21?
#> [1] Dora Nero
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno

mean(df$height)
#> [1] 171.4286
df$height < 170
#> [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE
df$gender[df$height < 170]
#> [1] male   female female
#> Levels: female male

Adding new variables to a table is easy: To create a new variable x of a given table t, we simply assign something (a vector of length nrow(t)) to a new variable name (using the t$x notation). The variable type of x depends on our assignment:

df
#>     name gender age height
#> 1   Adam   male  21    165
#> 2 Bertha female  23    170
#> 3 Cecily female  22    168
#> 4   Dora female  19    172
#> 5    Eve female  21    158
#> 6   Nero   male  18    185
#> 7   Zeno   male  24    182
dim(df)   # 7 cases (rows) x 4 variables (columns)
#> [1] 7 4
nrow(df)  # 7 rows
#> [1] 7

# Create a new variable:
df$may_drink <- rep(NA, nrow(df))  # initialize a new variable (column) with unknown (NA) values
df # => may_drink was added as a new column to df, all instances are NA
#>     name gender age height may_drink
#> 1   Adam   male  21    165        NA
#> 2 Bertha female  23    170        NA
#> 3 Cecily female  22    168        NA
#> 4   Dora female  19    172        NA
#> 5    Eve female  21    158        NA
#> 6   Nero   male  18    185        NA
#> 7   Zeno   male  24    182        NA

# Assign values: A person may drink (alcohol, in the US),  
df$may_drink <- (df$age >= 21)  # if s/he is 21 (or older)
df
#>     name gender age height may_drink
#> 1   Adam   male  21    165      TRUE
#> 2 Bertha female  23    170      TRUE
#> 3 Cecily female  22    168      TRUE
#> 4   Dora female  19    172     FALSE
#> 5    Eve female  21    158      TRUE
#> 6   Nero   male  18    185     FALSE
#> 7   Zeno   male  24    182      TRUE

# Note:
# - we did not use an if-then statement
# - we did not specify separate TRUE vs. FALSE cases
# - we can assign and set new variables in 1 step:

df$is_female <- (df$gender == "female")
df
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 4   Dora female  19    172     FALSE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
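
New variables need not be logical. The following sketch adds a numeric and a character variable to a freshly created df. The variable names height_m and age_group and the labels "under 21" and "21 or older" are made up for this illustration:

```r
name   <- c("Adam", "Bertha", "Cecily", "Dora", "Eve", "Nero", "Zeno")
gender <- c("male", "female", "female", "female", "female", "male", "male")
age    <- c(21, 23, 22, 19, 21, 18, 24)
height <- c(165, 170, 168, 172, 158, 185, 182)
df <- data.frame(name, gender, age, height)

# A numeric variable (height in meters, computed from height in cm):
df$height_m <- df$height / 100

# A character variable (ifelse() picks 1 of 2 labels for each case):
df$age_group <- ifelse(df$age < 21, "under 21", "21 or older")

typeof(df$height_m)   # "double"
typeof(df$age_group)  # "character"
df[ , c("name", "height_m", "age_group")]
```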

Practice

  • Add 2 more logical variables to df: A variable is_tall that is TRUE if and only if someone is taller than 170 cm and a variable short_name that is TRUE if and only if someone’s name is not more than 4 characters long.

Hint: The nchar() function yields the length of a character string.

Subsetting tables

An alternative way of selecting a subset of a table is provided by the subset() function, which typically takes the form subset(x, condition), where x is some data table and condition is a logical test that imposes criteria on one or more variables in x. For instance, if we wanted to select the cases (rows) of df that satisfy certain requirements regarding age or gender, we could specify those as follows:

# Subsetting by a condition:
subset(df, age > 20)
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, gender == "male")
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
subset(df, age > 20 | gender == "male")  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
subset(df, age > 20 & gender == "male")  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

However, note that the subset() function is only a convenient way of indexing by [...]. We can re-write the 4 commands from above as:

# Subsetting by a condition (referring to columns as df$age, not to the standalone vector age):
df[df$age > 20, ]
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$gender == "male", ]
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 6 Nero   male  18    185     FALSE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

# Multiple conditions:
df[df$age > 20 | df$gender == "male", ]  # logical OR
#>     name gender age height may_drink is_female
#> 1   Adam   male  21    165      TRUE     FALSE
#> 2 Bertha female  23    170      TRUE      TRUE
#> 3 Cecily female  22    168      TRUE      TRUE
#> 5    Eve female  21    158      TRUE      TRUE
#> 6   Nero   male  18    185     FALSE     FALSE
#> 7   Zeno   male  24    182      TRUE     FALSE
df[df$age > 20 & df$gender == "male", ]  # logical AND
#>   name gender age height may_drink is_female
#> 1 Adam   male  21    165      TRUE     FALSE
#> 7 Zeno   male  24    182      TRUE     FALSE

As using subset() can have unanticipated consequences, its help page even contains the warning “This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [ ….” [1] Thus, indexing by [] is both safer and more general.
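
One difference between both approaches concerns missing values. The following minimal sketch uses a made-up data frame d (with hypothetical variables id and score) to show that subset() drops cases for which the condition is NA, whereas indexing by [ keeps them as rows of NA values:

```r
# Made-up data with 1 missing (NA) score:
d <- data.frame(id = 1:3, score = c(10, NA, 30))

# subset() drops rows for which the condition is NA:
s1 <- subset(d, score > 15)
nrow(s1)  # 1 row

# Indexing by [ keeps such rows (as a row of NA values):
s2 <- d[d$score > 15, ]
nrow(s2)  # 2 rows

# Wrapping the condition in which() removes the NA case again:
s3 <- d[which(d$score > 15), ]
nrow(s3)  # 1 row
```

Whether NA cases should be kept or dropped depends on the analysis, but it helps to know which of the two we are getting.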

1.5.4 Changing variable types

When working with vectors or rectangles of data, we often need or want to convert the type of a variable into another one. To convert a variable, we simply assign it to itself (so that all its values will be preserved) and wrap a type conversion function (as.character, as.integer, as.logical, as.numeric, or as.factor) around it:

levels(df$gender)  # currently a so-called "factor" variable
#> [1] "female" "male"
typeof(df$gender)  # of type "integer"
#> [1] "integer"
df$gender <- as.character(df$gender)  # convert into a character variable
typeof(df$gender)  # now of type "character"
#> [1] "character"

df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> [1] male   female female female female male   male  
#> Levels: female male
typeof(df$gender)  # again of type "integer"
#> [1] "integer"

typeof(df$age)  # numeric "double"
#> [1] "double"
df$age <- as.integer(df$age)  # convert from "double" to "integer"
typeof(df$age)  # "integer"
#> [1] "integer"
df$age <- as.numeric(df$age)  # convert from "integer" to numeric "double"
typeof(df$age)  # numeric "double"
#> [1] "double"
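
Type conversions do not always preserve all information. The following sketch uses a made-up vector x to illustrate 2 cases worth knowing: as.integer() cuts off decimals (rather than rounding), and text that cannot be interpreted as a number becomes NA (with a warning, which we suppress here):

```r
x <- c(2.9, -2.9)

x_int <- as.integer(x)  # decimals are cut off (not rounded)
x_int                   # 2 -2
round(x)                # 3 -3: rounding is a separate step

as.numeric("3.14")      # text that looks like a number is converted: 3.14
as.character(x_int)     # numbers as text: "2" "-2"

# Text that is not a number becomes NA (and R issues a warning):
not_num <- suppressWarnings(as.integer("abc"))
not_num                 # NA
```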

Practice

  1. What happens when you convert a vector v <- 0:3 into logical data by using the as.logical conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.logical.)
v <- 0:3
v
as.logical(v)
  2. What happens when you convert the outcome of as.logical(v) into numeric data by using the as.numeric conversion function? (Predict the outcome, then check it, and verify your understanding by reading the documentation of ?as.numeric.)
vl <- as.logical(v)
vl
as.numeric(vl)

1.5.5 Factor variables

A factor is a categorical variable (i.e., a vector or column in a data frame) that distinguishes between different levels of a variable and sorts these levels in some order. When creating the data frame df above (by putting 4 vectors of the same length into a data.frame command), the variables consisting of character strings (here: name and gender) were automatically converted into factors (the default behavior of data.frame() in R versions before 4.0.0; since R 4.0.0, this requires setting stringsAsFactors = TRUE):

df <- data.frame(name, gender, age, height)

df$name  # Note that levels are ordered alphabetically
#> [1] Adam   Bertha Cecily Dora   Eve    Nero   Zeno  
#> Levels: Adam Bertha Cecily Dora Eve Nero Zeno
is.factor(df$name)
#> [1] TRUE
typeof(df$name)
#> [1] "integer"
as.integer(df$name)
#> [1] 1 2 3 4 5 6 7

df$gender # Note that levels are ordered alphabetically
#> [1] male   female female female female male   male  
#> Levels: female male
is.factor(df$gender)
#> [1] TRUE
typeof(df$gender)
#> [1] "integer"
as.integer(df$gender)
#> [1] 2 1 1 1 1 2 2

Factors are similar to character variables insofar as they identify cases by a text label. However, when looking closely at the output of df$gender we see that the labels male and female are not quoted (as the text elements in a character variable would be). When using as.factor on a character variable, each distinct string value is turned into a different factor level, these levels are sorted (here: alphabetically), and the levels are mapped to an underlying numeric representation (here: consecutive integer values, starting at 1). We can examine the differences between a character variable and a factor variable:

# (1) gender as a character variable:
df$gender <- as.character(df$gender)
df$gender
#> [1] "male"   "female" "female" "female" "female" "male"   "male"

is.factor(df$gender)    # not a factor
#> [1] FALSE
levels(df$gender)       # no levels
#> NULL
typeof(df$gender)       # a character variable
#> [1] "character"
# as.integer(df$gender) # would yield NA values (with a warning), as these strings cannot be interpreted as numbers.

# (2) gender as a factor variable:
df$gender <- as.factor(df$gender)  # convert from "character" into a "factor"
df$gender
#> [1] male   female female female female male   male  
#> Levels: female male

is.factor(df$gender)   # a factor
#> [1] TRUE
levels(df$gender)      # 2 levels 
#> [1] "female" "male"
typeof(df$gender)      # an integer variable
#> [1] "integer"
as.integer(df$gender)  # convert factor into numeric variable
#> [1] 2 1 1 1 1 2 2
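
The mapping of levels to integer values depends on the order of levels. As a minimal sketch, the factor() function allows setting this order explicitly by its levels argument. The object names f1, f2, and f3 are made up for this illustration. The last example shows why converting a factor with number-like labels requires care:

```r
gender <- c("male", "female", "female", "female", "female", "male", "male")

# Default: levels are sorted alphabetically ("female" before "male"):
f1 <- factor(gender)
levels(f1)
as.integer(f1)  # 2 1 1 1 1 2 2

# Explicit order of levels: "male" is now level 1:
f2 <- factor(gender, levels = c("male", "female"))
levels(f2)
as.integer(f2)  # 1 2 2 2 2 1 1

# Caution with number-like labels:
f3 <- factor(c("10", "20", "10"))
as.integer(f3)                # 1 2 1: the underlying level codes
as.integer(as.character(f3))  # 10 20 10: the labels as numbers
```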

When we want to prevent R from automatically converting character variables into factors when creating a new data frame, we can explicitly set an option stringsAsFactors = FALSE in a data.frame command:

df <- data.frame(name, gender, age, height, stringsAsFactors = FALSE)
df$gender
#> [1] "male"   "female" "female" "female" "female" "male"   "male"
typeof(df$gender)
#> [1] "character"

We currently do not need to understand the details of factors. But as factors occasionally appear accidentally (as stringsAsFactors = TRUE was the default of data.frame() before R 4.0.0) and are useful when analyzing and visualizing empirical data (e.g., for distinguishing between different experimental conditions), it is good to know that factors exist and can easily be dealt with.

1.5.6 Importing data

In most cases, we do not generate the data that we analyze, but obtain a table or dataset from somewhere else. Typical locations of data include:

  • data included in R packages
  • data stored on your local disk
  • data stored on online servers

R and RStudio provide many ways of reading in data from various sources. Which way is suited to a particular dataset depends mostly on the location of the file and the format in which the data is stored. We will examine different ways of importing datasets in Chapter 6. Here, we show the 2 most common ways of importing a dataset.

The dataset we import stems from an interesting article in the Journal of Clinical Psychology (Woodworth, O’Brien-Malone, Diamond, & Schüz, 2017). An influential paper by Seligman, Steen, Park, and Peterson (2005) found that several positive psychology interventions lastingly increased happiness and decreased depressive symptoms in a placebo-controlled internet study. Woodworth et al. (2017) re-examined this claim by measuring the long-term effectiveness of different web-based positive psychology interventions and published their data in a separate article (Woodworth, O’Brien-Malone, Diamond, & Schüz, 2018; see Appendix B.1 for details).

Data from a file

When loading data that is stored as a file, there are 2 questions to answer:

  • Location: Where is the file stored?
  • Format: In which format is the file stored?

To illustrate how data can be imported from an online source, we store a copy of the participant data from Woodworth et al. (2018) as a text file in CSV (comma-separated-value) format on a web server at http://rpository.com/ds4psy/data/posPsy_participants.csv. Given this setup, we can load the dataset into an R object p_info by evaluating the following command (from the package readr, which is part of the tidyverse):

p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")
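
The same 2 questions (location and format) apply to files on a local disk. As a self-contained sketch that requires no internet connection, the following code first writes a tiny CSV file with illustrative values to a temporary location and then reads it with the base R function read.csv(). The object names csv_file and p_demo are made up for this illustration:

```r
# Write a tiny CSV file to a temporary location (illustrative values):
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,age,income",
             "1,35,3",
             "2,59,1",
             "3,51,3"),
           con = csv_file)

# Location: csv_file; format: comma-separated values with a header row
p_demo <- utils::read.csv(file = csv_file, header = TRUE)
p_demo
dim(p_demo)  # 3 rows, 3 columns

unlink(csv_file)  # remove the temporary file
```

Note that read.csv() returns a data frame, whereas readr::read_csv() returns a tibble.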

Data from an R package

An even simpler way of obtaining data exists when datasets are provided in R packages. In this case, someone has typically saved the data in a compressed format (as an .rda file), and we can access it after installing and loading the corresponding R package. In our case, we need to load the ds4psy package, which contains the participant data as an R object posPsy_p_info. We can treat and manipulate such data objects just like any other R object. For instance, we can copy the dataset posPsy_p_info into another R object by assigning it to p_info_2:

# install.packages("ds4psy")  # installs the 'ds4psy' package
library(ds4psy)               # loads the 'ds4psy' package

p_info_2 <- posPsy_p_info

Having loaded the same data in 2 different ways, we should verify that we obtained the same result both times. We can verify that p_info and p_info_2 are equal by using the all.equal command:

all.equal(p_info, p_info_2)
#> [1] TRUE

Throughout this book, we will primarily rely on the datasets provided by the ds4psy package, but show additional options of importing files stored in different locations and formats in the chapter on Importing data (Chapter 6).

Checking a dataset

To get an initial idea about the contents of a dataset (often called a data frame, table, or tibble), we typically inspect its dimensions, print it, ask for its structure (by using the base R command str), or take a glimpse at its variables and values:

dim(p_info)              # 295 rows, 6 columns
#> [1] 295   6
p_info                   # prints a summary of the table/tibble
#> # A tibble: 295 x 6
#>       id intervention   sex   age  educ income
#>    <dbl>        <dbl> <dbl> <dbl> <dbl>  <dbl>
#>  1     1            4     2    35     5      3
#>  2     2            1     1    59     1      1
#>  3     3            4     1    51     4      3
#>  4     4            3     1    50     5      2
#>  5     5            2     2    58     5      2
#>  6     6            1     1    31     5      1
#>  7     7            3     1    44     5      2
#>  8     8            2     1    57     4      2
#>  9     9            1     1    36     4      3
#> 10    10            2     1    45     4      3
#> # … with 285 more rows
str(p_info)              # shows the structure of an R object
#> Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 295 obs. of  6 variables:
#>  $ id          : num  1 2 3 4 5 6 7 8 9 10 ...
#>  $ intervention: num  4 1 4 3 2 1 3 2 1 2 ...
#>  $ sex         : num  2 1 1 1 2 1 1 1 1 1 ...
#>  $ age         : num  35 59 51 50 58 31 44 57 36 45 ...
#>  $ educ        : num  5 1 4 5 5 5 5 4 4 4 ...
#>  $ income      : num  3 1 3 2 2 1 2 2 3 3 ...
#>  - attr(*, "spec")=
#>   .. cols(
#>   ..   id = col_double(),
#>   ..   intervention = col_double(),
#>   ..   sex = col_double(),
#>   ..   age = col_double(),
#>   ..   educ = col_double(),
#>   ..   income = col_double()
#>   .. )
tibble::glimpse(p_info)  # shows the types and initial values of all variables (columns)
#> Observations: 295
#> Variables: 6
#> $ id           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
#> $ intervention <dbl> 4, 1, 4, 3, 2, 1, 3, 2, 1, 2, 2, 2, 4, 4, 4, 4, 3, 2, 1,…
#> $ sex          <dbl> 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,…
#> $ age          <dbl> 35, 59, 51, 50, 58, 31, 44, 57, 36, 45, 56, 46, 34, 41, …
#> $ educ         <dbl> 5, 1, 4, 5, 5, 5, 5, 4, 4, 4, 5, 4, 5, 1, 2, 1, 4, 5, 3,…
#> $ income       <dbl> 3, 1, 3, 2, 2, 1, 2, 2, 3, 3, 1, 3, 3, 2, 2, 1, 2, 2, 1,…
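
Some further base R functions are useful for a first check. As a sketch that runs without loading the full dataset, the following code re-creates the first 10 rows of p_info (as printed above) in a stand-in data frame p_10 (a name made up for this illustration) and inspects it with head(), summary(), and table():

```r
# Stand-in for p_info: its first 10 rows, as printed above
p_10 <- data.frame(
  id           = 1:10,
  intervention = c(4, 1, 4, 3, 2, 1, 3, 2, 1, 2),
  sex          = c(2, 1, 1, 1, 2, 1, 1, 1, 1, 1),
  age          = c(35, 59, 51, 50, 58, 31, 44, 57, 36, 45),
  educ         = c(5, 1, 4, 5, 5, 5, 5, 4, 4, 4),
  income       = c(3, 1, 3, 2, 2, 1, 2, 2, 3, 3)
)

head(p_10, 3)             # the first 3 rows
summary(p_10$age)         # minimum, quartiles, mean, and maximum of age
table(p_10$intervention)  # number of cases per intervention
```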

Understanding a dataset

When analyzing a data file from a remote source, it is crucial to also obtain a description of the variables and values contained in the file (often called a codebook). For the dataset loaded into p_info this description looks as follows:

posPsy_participants.csv contains demographic information of 295 participants:

  • id: participant ID

  • intervention: 3 positive psychology interventions, plus 1 control condition:

    • 1 = “Using signature strengths”,
    • 2 = “Three good things”,
    • 3 = “Gratitude visit”,
    • 4 = “Recording early memories” (control condition).
  • sex:

    • 1 = female,
    • 2 = male.
  • age: participant’s age (in years).

  • educ: level of education:

    • 1 = Less than Year 12,
    • 2 = Year 12,
    • 3 = Vocational training,
    • 4 = Bachelor’s degree,
    • 5 = Postgraduate degree.
  • income:

    • 1 = below average,
    • 2 = average,
    • 3 = above average.

Beyond conveniently loading datasets, another advantage of using data provided by R packages is that the details of a dataset are easily accessible by using the standard R help system. For instance, provided that the ds4psy package is installed and loaded, we can obtain the codebook and background information of the posPsy_p_info data by evaluating ?posPsy_p_info.
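
A codebook also allows turning numeric codes into readable labels. The following is a minimal sketch, assuming the codes listed above: it uses the first 5 values of sex and educ (as printed for p_info) and the factor() function with its levels and labels arguments. The object names sex_code, educ_code, sex_f, and educ_f are made up for this illustration:

```r
# First 5 values of sex and educ (as printed for p_info):
sex_code  <- c(2, 1, 1, 1, 2)
educ_code <- c(5, 1, 4, 5, 5)

# Map numeric codes to the labels of the codebook:
sex_f <- factor(sex_code, levels = 1:2,
                labels = c("female", "male"))
educ_f <- factor(educ_code, levels = 1:5,
                 labels = c("Less than Year 12", "Year 12",
                            "Vocational training", "Bachelor's degree",
                            "Postgraduate degree"))

sex_f
table(educ_f)  # counts per level (including levels with 0 cases)
```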

Practice

  • Using the data in p_info, create a new variable uni_degree that is TRUE if and only if a person has a Bachelor’s or Postgraduate degree.

  • Use R commands to obtain (the row data of) the youngest person with a university degree.

We will examine this data file further in Exercise 6 (see Section 1.8.6).

References

Seligman, M. E., Steen, T. A., Park, N., & Peterson, C. (2005). Positive psychology progress: Empirical validation of interventions. American Psychologist, 60(5), 410. https://doi.org/10.1037/0003-066X.60.5.410

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2017). Web-based positive psychology interventions: A reexamination of effectiveness. Journal of Clinical Psychology, 73(3), 218–232. https://doi.org/10.1002/jclp.22328

Woodworth, R. J., O’Brien-Malone, A., Diamond, M. R., & Schüz, B. (2018). Data from “Web-based positive psychology interventions: A reexamination of effectiveness”. Journal of Open Psychology Data, 6(1). https://doi.org/10.5334/jopd.35


  1. For details, see the discussion of non-standard evaluation in Wickham (2014a).