Working through the essentials of exploratory data analysis (EDA) (in Section 4.2) has provided us with a pretty clear picture of the data and results of the
However, we have not yet touched the data contained in
posPsy_wide. Given that this file was described as a transformed and corrected version of
AHI_CESD, we should not expect to find completely different results in it.
So let’s use
posPsy_wide to exercise our skills in EDA, but also to answer some interesting questions about the study by Woodworth et al. (2017):
# Load packages:
library(tidyverse)
library(rmarkdown)
library(knitr)
library(here)
library(ds4psy)

# Load data:
# 1. Participant data:
posPsy_p_info <- ds4psy::posPsy_p_info  # from the ds4psy package
# posPsy_p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv")  # online

# 2. Original DVs in long format:
AHI_CESD <- ds4psy::posPsy_AHI_CESD  # from the ds4psy package
# AHI_CESD <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_AHI_CESD.csv")  # online

# 4. Transformed and corrected version of all data (in wide format):
posPsy_wide <- ds4psy::posPsy_wide  # from the ds4psy package
# posPsy_wide <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_data_wide.csv")  # online
Load the data stored as
posPsy_wide (from the ds4psy package or this CSV-file) into
posPsy_wide (if you haven’t done so above) and explore it.
What are the data’s dimensions, variables, and types of variables? How would you describe the structure of this dataset?
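A first look at dimensions, variable names, and variable types could be sketched as follows. The data frame `toy_wide` is a hypothetical miniature stand-in with invented values (only `ahiTotal.0` is a column name taken from the text above); the same calls should work on the real `posPsy_wide`.

```r
library(dplyr)

# Hypothetical miniature stand-in for posPsy_wide (all values invented):
toy_wide <- tibble(
  id           = 1:4,
  intervention = c(1L, 2L, 3L, 4L),
  age          = c(35, 59, 51, 50),
  ahiTotal.0   = c(63, 73, 77, 60),
  ahiTotal.1   = c(69, NA, 85, NA)
)

dim(toy_wide)            # number of rows (participants) and columns (variables)
names(toy_wide)          # variable names
sapply(toy_wide, class)  # type of each variable
glimpse(toy_wide)        # compact overview of names, types, and first values
```

In a wide format, each participant occupies one row, and repeated measurements appear as separate columns (here marked by a numeric suffix).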
posPsy_wide contains the same participants as
Hint: In R, the equality of two objects x and
y can be checked by
all.equal(x, y, check.attributes = FALSE).
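As a minimal sketch of this hint, the two vectors below are invented and merely stand in for the participant ids of two datasets. Note that `all.equal()` returns either `TRUE` or a character description of the differences, so wrapping it in `isTRUE()` yields a clean logical value.

```r
# Hypothetical id vectors (invented values), standing in for the ids of two datasets:
ids_a <- c(1, 2, 3, 4)
ids_b <- c(4L, 2L, 1L, 3L, 2L)  # different order, integer type, one id repeated

# Compare the sorted sets of unique ids:
same_ids <- all.equal(sort(unique(ids_a)), sort(unique(ids_b)),
                      check.attributes = FALSE)
isTRUE(same_ids)

# A mismatch yields a description rather than FALSE:
diff_ids <- all.equal(c(1, 2), c(1, 3), check.attributes = FALSE)
isTRUE(diff_ids)
```

Sorting the unique ids first makes the comparison insensitive to row order and to repeated ids (as they occur in long-format data).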
Check posPsy_wide for missing (
NA) values. Do you see some systematic pattern to them?
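One possible way to screen for missing values is to count them overall, per variable, and per participant. The sketch below uses a hypothetical miniature data frame with invented values; the column names `ahiTotal.1` and `ahiTotal.2` are assumed by analogy to `ahiTotal.0`.

```r
# Hypothetical miniature stand-in for posPsy_wide (all values invented):
toy_wide <- data.frame(
  id         = 1:5,
  ahiTotal.0 = c(63, 73, 77, 60, 41),
  ahiTotal.1 = c(69, NA, 85, NA, 52),
  ahiTotal.2 = c(71, NA, 80, NA, NA)
)

na_total   <- sum(is.na(toy_wide))      # missing values overall
na_per_col <- colSums(is.na(toy_wide))  # missing values per variable
na_per_row <- rowSums(is.na(toy_wide))  # missing values per participant

na_total
na_per_col
na_per_row
```

If missing values cluster in the columns of later occasions, this hints at participants who skipped those measurements.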
When screening the data contained in
AHI_CESD (in Section 4.2.4 or available as
ds4psy::posPsy_AHI_CESD), we computed and plotted both the number of measurement occasions per participant and the number of participants per measurement occasion. The fact that the number of people decreased from initial to later occasions motivated an important question:
- Are the dropouts (i.e., people present at first, but absent at one or more later measurements) random (due to chance) or systematic (due to some other factor)?
Let’s explore this issue (for the data of
posPsy_wide) in a sequence of four steps:
Create a new filter variable
dropout (e.g., so that it is
FALSE when a person was measured on all occasions and
TRUE when a person missed one or more occasions), and add this variable to the data.
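A minimal sketch of this step, under the assumption that a missed occasion shows up as an `NA` value in the corresponding score column. The data and the column names `ahiTotal.1` and `ahiTotal.2` are hypothetical.

```r
library(dplyr)

# Hypothetical miniature stand-in for posPsy_wide (all values invented):
toy_wide <- tibble(
  id         = 1:5,
  ahiTotal.0 = c(63, 73, 77, 60, 41),
  ahiTotal.1 = c(69, NA, 85, NA, 52),
  ahiTotal.2 = c(71, NA, 80, NA, NA)
)

# Select the score columns of all occasions:
score_cols <- select(toy_wide, starts_with("ahiTotal."))

# dropout is TRUE when at least one occasion is missing, FALSE otherwise:
toy_wide <- mutate(toy_wide, dropout = !complete.cases(score_cols))

count(toy_wide, dropout)  # number of dropouts vs. non-dropouts
```

Using `complete.cases()` on the selected columns avoids spelling out one `is.na()` test per occasion.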
Use the dropout variable to ask whether you see any group differences between dropouts and non-dropouts in their independent variables (
income). This would imply that some of these factors may have co-determined whether a participant completed the study or dropped out.
Hints: The best way to do this depends on the types of variables of interest:
For categorical variables (here:
income), the question essentially asks you to compare the relative proportions (e.g., of females vs. males) in the dropout vs. non-dropout groups. You can do this by computing appropriate summary tables in dplyr. However, it’s much easier to visually judge the similarity of graphs (e.g., by using bar charts in which stacked bars are scaled to 100% for each bar, i.e.,
position = "fill").
For continuous variables (here:
age), it’s easiest to visualize their distributions (as histograms or density plots) with group-specific aesthetics or facets for the variables to be compared.
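Both kinds of comparison could be sketched as follows. The data are invented, and the labels "female" and "male" are hypothetical (the actual coding of `sex` in the data may differ).

```r
library(dplyr)
library(ggplot2)

# Hypothetical data (all values invented):
toy <- tibble(
  dropout = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  sex     = factor(c("female", "female", "female", "male",
                     "female", "female", "male", "male")),
  age     = c(35, 59, 51, 50, 24, 31, 44, 28)
)

# (a) Summary table: relative proportions of sex within each dropout group
prop_tab <- toy %>%
  count(dropout, sex) %>%
  group_by(dropout) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
prop_tab

# (b) Categorical variable: stacked bars scaled to 100% per group
p_sex <- ggplot(toy, aes(x = dropout, fill = sex)) +
  geom_bar(position = "fill") +
  labs(y = "Proportion")

# (c) Continuous variable: overlapping density plots per group
p_age <- ggplot(toy, aes(x = age, fill = dropout)) +
  geom_density(alpha = 0.5)

p_sex
p_age
```

Bars of similar composition in (b) and largely overlapping densities in (c) would suggest that dropouts and non-dropouts do not differ much on these variables.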
Do you see any systematic differences between dropouts and non-dropouts by
intervention? This would imply that the type of
intervention may have co-determined whether a participant completed the study or dropped out.
Do dropouts vs. non-dropouts show any systematic differences in the dependent variables (i.e., happiness or depression scores)? This would imply that any changes in these scores — which were the main focus of this study — could have been co-determined by the fact that a participant completed the study vs. dropped out.
Hint: As this question addresses changes in scores between different measurement occasions, it could be operationalized in many ways. In our analysis of the number of participants per occasion above, we saw that only 148 of 295 pre-test participants (i.e., 50.2%) were present at Occasion 1. Hence, compare the (mean and raw) scores at the pre-test (Occasion 0) of the dropouts vs. non-dropouts of Occasion 1 (i.e., compare participants’ initial ahiTotal.0 and
cesdTotal.0 scores based on their absence/presence at Occasion 1).
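One way to operationalize this comparison is sketched here with invented data. It assumes that absence at Occasion 1 is marked by an `NA` value in a column named `ahiTotal.1` (a name assumed by analogy to `ahiTotal.0`).

```r
library(dplyr)

# Hypothetical miniature stand-in for posPsy_wide (all values invented):
toy_wide <- tibble(
  id          = 1:6,
  ahiTotal.0  = c(60, 70, 80, 50, 64, 72),
  cesdTotal.0 = c(20, 10,  6, 30, 16,  8),
  ahiTotal.1  = c(65, NA, 82, NA, 66, NA)
)

# Mean pre-test scores by presence at Occasion 1:
pretest_by_presence <- toy_wide %>%
  mutate(present_1 = !is.na(ahiTotal.1)) %>%  # assumption: NA marks absence
  group_by(present_1) %>%
  summarise(n           = n(),
            mean_ahi_0  = mean(ahiTotal.0),
            mean_cesd_0 = mean(cesdTotal.0))

pretest_by_presence
```

The raw scores can be compared in the same way by mapping `present_1` to a grouping aesthetic or facet in a plot of `ahiTotal.0` or `cesdTotal.0`.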
In previous sessions, we already examined the
sex distributions of participants.
But are the participants in the four
intervention groups also balanced in terms of their levels of
And could the initial scores of the participants at occasion 0 (i.e., before any intervention took place) vary as a function of their
Let’s find out…
Transform the data in
posPsy_wide to summarise the overall distribution of
income levels (i.e., the frequency or percentage of each level) and compare your results to those of Exercise 5 of Chapter 3 and Exercise 6 of Chapter 1, which both used the
p_info data (available from ds4psy::posPsy_p_info).
Use two different ways to visualise the distributions of
income levels (a) overall, and (b) separately for each intervention.
Hints: You can either directly plot the data contained in
posPsy_wide (e.g., by selecting some of its variables) or first use dplyr to create a summary table that can be piped into (or used as the
data of) a
ggplot command. To create separate plots by
intervention, use faceting on the overall plot.
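The summary table and both plotting routes could look roughly like this. The data are invented, and the numeric codes for `income` and `intervention` are placeholders.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data (all values invented):
toy <- tibble(
  intervention = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4),
  income       = c(1, 2, 2, 2, 3, 3, 1, 2, 2, 3)
)

# Summary table: frequency and percentage of each income level
income_tab <- toy %>%
  count(income) %>%
  mutate(pc = n / sum(n) * 100)
income_tab

# Way 1: plot the raw data and let geom_bar() do the counting
p_raw <- ggplot(toy, aes(x = factor(income))) +
  geom_bar() +
  labs(x = "income")

# Way 2: pipe the summary table into ggplot() and plot its values
p_tab <- ggplot(income_tab, aes(x = factor(income), y = pc)) +
  geom_col() +
  labs(x = "income", y = "Percentage")

# Separate panels per intervention: add facets to the overall plot
p_by_intervention <- p_raw + facet_wrap(~ intervention)

p_raw
p_tab
p_by_intervention
```

`geom_bar()` counts cases itself, whereas `geom_col()` expects the bar heights to be present in the data, which is why the second route starts from the summary table.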
- Do the initial scores of happiness (
ahiTotal.0) or depression (
cesdTotal.0) vary by the levels of income?
Examine this (a) by using dplyr to compute corresponding summary tables, and (b) by using ggplot2 to visualise all raw values, their means, and distributions per level of
income. Check whether the summary scores in your table (a) correspond to those shown in your plot (b), and interpret your plots (e.g., do you see any potential outliers?).
Hint: Use factor(income) to turn
income from an integer into a categorical variable (factor).
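A sketch of both parts for the happiness scores, using invented data in which one value (100) is deliberately extreme. The same pattern should apply to `cesdTotal.0`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data (all values invented):
toy <- tibble(
  income     = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  ahiTotal.0 = c(60, 66, 72, 70, 74, 78, 68, 75, 100)
)

# (a) Summary table per income level
ahi_by_income <- toy %>%
  group_by(income) %>%
  summarise(n          = n(),
            mean_ahi   = mean(ahiTotal.0),
            median_ahi = median(ahiTotal.0))
ahi_by_income

# (b) Distributions, raw values, and means per income level
p_ahi <- ggplot(toy, aes(x = factor(income), y = ahiTotal.0)) +
  geom_violin() +                                         # distribution
  geom_jitter(width = 0.1, height = 0, alpha = 0.5) +     # raw values
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 4, colour = "red") +    # group means
  labs(x = "income")
p_ahi
```

In this toy example, the mean of the third group (81) lies well above its median (75), which is the kind of discrepancy that can point to a potential outlier.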
We have seen many plots that showed trends of happiness and depression scores for different interventions over measurement occasions. Now it’s time to take a stance: What are the main findings of the study?
This is a bonus exercise, which you are welcome to skip. It goes beyond the information given in the data, but illustrates a typical problem that would occur if you actually wanted to re-analyze someone else’s data: The data files provided contained some aggregated scores. Can we understand how these scores were computed?
Use data transformation on the data in
AHI_CESD (in long format) to answer the following questions:
Is the total happiness score (
ahiTotal) the sum of all 24 corresponding scale values (from ahi01 to
ahi24) at each measurement occasion?
Is the total depression score (
cesdTotal) the sum of all 20 corresponding scale values (from cesd01 to
cesd20) at each measurement occasion?
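Such a check could be sketched by computing the row-wise sum of the item columns and comparing it to the stored total. The example uses invented data with only three items, and the item names `ahi01` to `ahi03` are assumed for illustration; for the real data, the selection would span all items of the scale.

```r
library(dplyr)

# Hypothetical long-format data with only 3 items (all values invented):
toy_long <- tibble(
  id       = c(1, 1, 2),
  occasion = c(0, 1, 0),
  ahi01    = c(2, 3, 4),
  ahi02    = c(3, 3, 5),
  ahi03    = c(1, 2, 4),
  ahiTotal = c(6, 8, 13)
)

# Row-wise sum of all item columns:
item_sum <- toy_long %>%
  select(ahi01:ahi03) %>%
  rowSums()

# Compare the computed sum with the stored total on every row:
check <- toy_long %>%
  mutate(ahi_sum = item_sum,
         same    = ahi_sum == ahiTotal)

all(check$same)  # TRUE only if the total equals the item sum on all rows
```

Because each row of the long format is one participant at one occasion, a row-wise check covers every measurement occasion at once.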
Assuming that your answer to the previous question (2.) is negative, try to verify how the
cesdTotal scores are computed from the individual values (from cesd01 to cesd20).
To find out how the overall depression scores are computed, you need to consult background information on the CES-D scale. The original reference for this scale (with over 58,000 citations) is Radloff (1977):
- Radloff, L. S. (1977). The CES-D scale: A self report depression scale for research in the general population. Applied Psychological Measurement, 1, 385–401. doi: https://doi.org/10.1177/014662167700100306
but even without reading this article, we can find plenty of scale-related information online.
Hint: Some scale items are reverse-coded, so they should be subtracted from the sum of the other items, rather than added to them.
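The idea of this hint can be sketched on invented data with four items. Treating `cesd04` as the reverse-coded item is purely an assumption for illustration; which items are actually reverse-coded must be taken from the scale documentation.

```r
library(dplyr)

# Hypothetical data with only 4 items (all values invented):
toy_long <- tibble(
  cesd01    = c(2, 1, 4),
  cesd02    = c(3, 1, 3),
  cesd03    = c(2, 2, 4),
  cesd04    = c(3, 4, 1),
  cesdTotal = c(4, 0, 10)
)

regular  <- c("cesd01", "cesd02", "cesd03")  # assumed regular items
reversed <- c("cesd04")                      # assumed reverse-coded item(s)

# Candidate rule: add the regular items, subtract the reverse-coded ones
sum_regular  <- rowSums(select(toy_long, all_of(regular)))
sum_reversed <- rowSums(select(toy_long, all_of(reversed)))

check <- toy_long %>%
  mutate(cesd_calc = sum_regular - sum_reversed,
         same      = cesd_calc == cesdTotal)

all(check$same)  # TRUE if the candidate rule reproduces the stored totals
```

If the rule fails on some rows, inspecting those rows (e.g., with `filter(check, !same)`) shows whether the item assignment or the scoring rule needs to be revised.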
This concludes our exercises on exploratory data analysis (EDA) — but we will further improve our corresponding skills in subsequent chapters.