4.4 Exercises
Working through the essentials of exploratory data analysis (EDA) (in Section 4.2) has provided us with a pretty clear picture of the data and results of the AHI_CESD dataset.
However, we have not yet touched the data contained in posPsy_wide. Given that this file was described as a transformed and corrected version of AHI_CESD, we should not expect to find completely different results in it.
So let’s use posPsy_wide to exercise our skills in EDA, but also to answer some interesting questions about the study by Woodworth, O’Brien-Malone, Diamond, & Schüz (2017):
# Load packages:
library(tidyverse)
library(rmarkdown)
library(knitr)
library(here)
library(ds4psy)

# Load data:
# 1. Participant data:
posPsy_p_info <- ds4psy::posPsy_p_info  # from the ds4psy package
# posPsy_p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")  # online

# 2. Original DVs in long format:
AHI_CESD <- ds4psy::posPsy_AHI_CESD  # from the ds4psy package
# AHI_CESD <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_AHI_CESD.csv")  # online

# 4. Transformed and corrected version of all data (in wide format):
posPsy_wide <- ds4psy::posPsy_wide  # from the ds4psy package
# posPsy_wide <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_data_wide.csv")  # online
See Section B.1 of Appendix B for details on the data.
4.4.1 Exercise 1
Exploring wide data
1. Load the data stored as posPsy_wide (from the ds4psy package or this CSV-file) into posPsy_wide (if you haven’t done so above) and explore it.
   What are the data’s dimensions, variables, and types of variables? How would you describe the structure of this dataset?
2. Verify that posPsy_wide contains the same participants as posPsy_p_info.
   Hint: In R, the equality of two objects x and y can be checked by all.equal(x, y) or all.equal(x, y, check.attributes = FALSE).
3. Inspect posPsy_wide for missing (NA) values. Do you see some systematic pattern to them?
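The two checks in this exercise can be sketched on a small example. The following is only an illustration on made-up toy data (the objects p_info_toy and wide_toy and all their values are hypothetical stand-ins, not the study data), assuming that participants are identified by an id column:

```r
library(dplyr)

# Toy stand-ins (hypothetical values) for the participant and wide data:
p_info_toy <- tibble(id = 1:4, intervention = c(1, 2, 3, 4))
wide_toy   <- tibble(id = 1:4, intervention = c(1, 2, 3, 4),
                     ahiTotal.0 = c(63, 73, 77, 60),
                     ahiTotal.1 = c(NA, 89, NA, 65))

# Same participants? Compare the ID columns of both objects:
same_ids <- all.equal(p_info_toy$id, wide_toy$id)
same_ids

# Count the missing (NA) values in each variable:
na_per_var <- wide_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
na_per_var
```

Note that all.equal() returns TRUE only if both objects match. Otherwise, it returns a description of the differences, so wrapping the call in isTRUE() is useful whenever a strict logical value is needed.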
4.4.2 Exercise 2
When screening the data contained in AHI_CESD (in Section 4.2.4 or available as ds4psy::posPsy_AHI_CESD), we computed and plotted both the number of measurement occasions per participant and the number of participants per measurement occasion. The fact that the number of people decreased from initial to later occasions motivated an important question:
- Are the dropouts (i.e., people present at first, but absent at one or more later measurements) random or systematic?
Selective dropouts?
Theoretically, the presence of dropouts — which we can define here as people being present initially, but missing one or more measurements at later occasions — raises an important issue: Are the dropouts random (due to chance) or systematic (due to some other factor)?
Let’s explore this issue (for the data of posPsy_wide) in a sequence of four steps:
1. Create a new filter variable dropout (e.g., so that dropout is FALSE when a person was measured on all occasions and TRUE when a person missed one or more occasions), and add this variable to the data.
2. Use the dropout variable to ask whether you see any group differences between dropouts and non-dropouts in their independent variables (sex, age, educ, and income). Such differences would imply that some of these factors may have co-determined whether a participant completed the study or dropped out.
   Hints: The best way to do this depends on the types of variables of interest:
   - For categorical variables (here: sex, educ, income), the question essentially asks you to compare the relative proportions (e.g., of females vs. males) in the dropout vs. non-dropout groups. You can do this by computing appropriate summary tables in dplyr. However, it’s much easier to visually judge the similarity of graphs (e.g., by using bar charts in which stacked bars are scaled to 100% for each bar, i.e., position = "fill").
   - For continuous variables (here: age), it’s easiest to visualize their distributions (as histograms or density plots) with group-specific aesthetics or facets for the variables to be compared.
3. Do you see any systematic differences between dropouts and non-dropouts by intervention? Such differences would imply that the type of intervention may have co-determined whether a participant completed the study or dropped out.
4. Do dropouts vs. non-dropouts show any systematic differences in the dependent variables (i.e., happiness or depression scores)? Such differences would imply that any changes in these scores (which were the main focus of this study) could have been co-determined by the fact that a participant completed the study vs. dropped out.
   Hint: As this question addresses changes in scores between different measurement occasions, it could be operationalized in many ways. In our analysis of the number of participants per occasion above, we saw that only 148 of 295 pre-test participants (i.e., 50.2%) were present at Occasion 1. Hence, compare the (mean and raw) scores at the pre-test (Occasion 0) of the dropouts vs. non-dropouts of Occasion 1 (i.e., compare participants’ initial ahiTotal.0 and cesdTotal.0 scores based on their absence/presence at Occasion 1).
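The mechanics of these steps can be sketched on toy data. The following is a minimal illustration rather than a solution: the object toy_wide and all its values are made up, and it assumes that a missing total score at a later occasion indicates that the person was absent at that occasion.

```r
library(dplyr)
library(ggplot2)

# Toy data in wide format (hypothetical values, three occasions only):
toy_wide <- tibble(
  id         = 1:6,
  sex        = c(1, 2, 1, 1, 2, 1),
  ahiTotal.0 = c(63, 73, 77, 60, 41, 85),
  ahiTotal.1 = c(NA, 89, NA, 65, NA, 93),
  ahiTotal.2 = c(NA, 89, NA, NA, 50, 90)
)

# Step 1: dropout is TRUE when any later score is missing (assumption):
toy_wide <- toy_wide %>%
  mutate(dropout = is.na(ahiTotal.1) | is.na(ahiTotal.2))

# Step 2 (categorical variable): proportions of sex within each dropout group:
sex_by_dropout <- toy_wide %>%
  count(dropout, sex) %>%
  group_by(dropout) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
sex_by_dropout

# The same comparison as a bar chart with bars scaled to 100%:
p_sex <- ggplot(toy_wide, aes(x = dropout, fill = factor(sex))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", fill = "sex")

# Step 4: mean pre-test scores by absence/presence at Occasion 1:
pre_by_occ1 <- toy_wide %>%
  mutate(absent_1 = is.na(ahiTotal.1)) %>%
  group_by(absent_1) %>%
  summarise(n = n(), mean_ahi_0 = mean(ahiTotal.0))
pre_by_occ1
```

With real data, the dropout definition would have to cover all later occasions, and the comparisons would be repeated for the remaining variables (e.g., educ, income, intervention, and cesdTotal.0).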
4.4.3 Exercise 3
In previous sessions, we already examined the age and sex distributions of participants.
But are the participants in the four intervention groups also balanced in terms of their levels of income?
And could the initial scores of the participants at occasion 0 (i.e., before any intervention took place) vary as a function of their income?
Let’s find out…
Effects of income
1. Transform the data in posPsy_wide to summarise the overall distribution of income levels (i.e., the frequency or percentage of each level) and compare your results to those of Exercise 5 of Chapter 3 and Exercise 6 of Chapter 1, which both used the p_info data (available from ds4psy::posPsy_p_info or CSV-file).
2. Use two different ways to visualise the distributions of income levels (a) overall, and (b) separately for each intervention.
   Hints: You can either directly plot the data contained in posPsy_wide (e.g., by selecting some of its variables) or first use dplyr to create a summary table that can be piped into (or used as the data of) a ggplot command. To create separate plots by intervention, use facetting on the overall plot.
3. Do the initial scores of happiness (ahiTotal.0) or depression (cesdTotal.0) vary by the levels of income?
   Examine this (a) by using dplyr to compute corresponding summary tables, and (b) by using ggplot2 to visualise all raw values, their means, and distributions per level of income. Check whether the summary scores in your table (a) correspond to those shown in your plot (b), and interpret your plots (e.g., do you see any potential outliers?).
   Hint: Use factor(income) to turn income from an integer into a categorical variable (factor).
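As a rough sketch of how such summary tables and plots could be set up, consider the following example. It uses a made-up toy data frame (toy_income, with hypothetical values) rather than posPsy_wide, and shows only one of several possible plot types:

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values):
toy_income <- tibble(
  intervention = c(1, 1, 2, 2, 3, 3, 4, 4),
  income       = c(1, 2, 2, 3, 1, 2, 3, 2),
  ahiTotal.0   = c(60, 70, 72, 80, 55, 68, 90, 74)
)

# Frequency and percentage of each income level:
income_dist <- toy_income %>%
  count(income) %>%
  mutate(pc = n / sum(n) * 100)
income_dist

# Distribution of income levels, separately for each intervention:
p_dist <- ggplot(toy_income, aes(x = factor(income))) +
  geom_bar() +
  facet_wrap(~ intervention) +
  labs(x = "income")

# (a) Summary table of initial happiness scores per income level:
ahi_by_income <- toy_income %>%
  group_by(income) %>%
  summarise(n = n(), mean_ahi_0 = mean(ahiTotal.0))
ahi_by_income

# (b) Raw values, distributions, and means per income level:
p_ahi <- ggplot(toy_income, aes(x = factor(income), y = ahiTotal.0)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, colour = "red") +
  labs(x = "income")
```

Overlaying the group means (via stat_summary()) on the raw values makes it easy to check whether the plot agrees with the summary table.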
4.4.4 Exercise 4
We have seen many plots that showed trends of happiness and depression scores for different interventions over measurement occasions. Now it’s time to take a stance: What are the main findings of the study?
Showing results
1. Summarize the main results as clearly as possible in 1 or 2 graphs.
2. Justify your choice of graph (prior to plotting it).
3. Interpret your graph. What does it show? What does it not show?
Hint: You may use the data in posPsy_wide (in wide format) or that in AHI_CESD (in long format).
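One possible starting point (among many) is a plot of mean scores per intervention and occasion. The following sketch assumes data in long format and uses a made-up toy data frame (toy_long, with hypothetical values); which graph best conveys the main findings remains a matter of judgment:

```r
library(dplyr)
library(ggplot2)

# Toy data in long format (hypothetical values):
toy_long <- tibble(
  intervention = rep(c(1, 2), each = 4),
  occasion     = rep(c(0, 1), times = 4),
  ahiTotal     = c(60, 66, 70, 72, 62, 63, 68, 71)
)

# Mean happiness score per intervention and occasion:
ahi_means <- toy_long %>%
  group_by(intervention, occasion) %>%
  summarise(mean_ahi = mean(ahiTotal), .groups = "drop")
ahi_means

# Line plot of the mean trends:
p_trend <- ggplot(ahi_means,
                  aes(x = occasion, y = mean_ahi,
                      colour = factor(intervention), group = intervention)) +
  geom_line() +
  geom_point() +
  labs(colour = "intervention")
```

A plot of means hides the variability within groups and the shrinking group sizes at later occasions, which is worth keeping in mind when interpreting it.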
4.4.5 Exercise 5
This is a bonus exercise, which you are welcome to skip. It goes beyond the information given in the data, but illustrates a typical problem that would occur if you actually wanted to re-analyze someone else’s data: The data files provided contained some aggregated scores. Can we understand how these scores were computed?
Can we verify the computation of total scores?
Use data transformation on the data in AHI_CESD (in long format) to answer the following questions:
1. Is the total happiness score (ahiTotal) the sum of all 24 corresponding scale values (from ahi01 to ahi24) at each measurement occasion?
2. Is the total depression score (cesdTotal) the sum of all 20 corresponding scale values (from cesd01 to cesd20) at each measurement occasion?
3. Assuming that your answer to the previous question (2.) is negative, try to verify how the cesdTotal scores are computed from the individual values (from cesd01 to cesd20).
To find out how the overall depression scores are computed, you need to consult background information on the CES-D scale. The original reference for this scale (with over 58,000 citations) is Radloff (1977):
- Radloff, L. S. (1977). The CES-D scale: A self report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi: https://doi.org/10.1177/014662167700100306
but even without reading this article, we can find plenty of scale-related information online.
Hint: Some scale items are reverse-coded, so they should be subtracted from the sum of the other items, rather than added to them.
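The general approach to such checks can be sketched on a toy scale. In the following example, the item names (q01 to q03), the stored total (qTotal), all values, and the choice of q03 as the reverse-coded item are hypothetical. Which items of the CES-D are actually reverse-coded has to be taken from the scale’s documentation.

```r
library(dplyr)

# Toy scale with three items and a stored total (hypothetical values):
toy_scale <- tibble(
  q01    = c(1, 3, 2),
  q02    = c(2, 4, 1),
  q03    = c(4, 1, 3),   # treated as the reverse-coded item in this toy example
  qTotal = c(-1, 6, 0)
)

# Compute two candidate totals and compare each with the stored total:
checked <- toy_scale %>%
  mutate(sum_plain = rowSums(across(q01:q03)),  # plain sum of all items
         sum_rev   = q01 + q02 - q03,           # reverse-coded item subtracted
         ok_plain  = sum_plain == qTotal,
         ok_rev    = sum_rev == qTotal)
checked

# Does a candidate reproduce the stored total in every row?
all(checked$ok_plain)
all(checked$ok_rev)
```

If no candidate matches exactly, inspecting the differences between a candidate and the stored total (e.g., whether they are constant across rows) can provide clues about the scoring rule.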
This concludes our exercises on exploratory data analysis (EDA) — but we will further improve our corresponding skills in subsequent chapters.