## 4.4 Exercises

Working through the essentials of exploratory data analysis (EDA) in Section 4.2 has given us a fairly clear picture of the data and results of the AHI_CESD dataset.
However, we have not yet touched the data contained in posPsy_wide. Given that this file was described as a transformed and corrected version of AHI_CESD, we should not expect to find completely different results in it. So let’s use posPsy_wide to exercise our skills in EDA, but also to answer some interesting questions about the study:

```r
# Load packages:
library(tidyverse)
library(rmarkdown)
library(knitr)
library(here)
library(ds4psy)

# 1. Participant data:
posPsy_p_info <- ds4psy::posPsy_p_info  # from the ds4psy package

# 2. Original DVs in long format:
AHI_CESD <- ds4psy::posPsy_AHI_CESD  # from the ds4psy package

# 4. Transformed and corrected version of all data (in wide format):
posPsy_wide <- ds4psy::posPsy_wide  # from the ds4psy package
# posPsy_wide <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_data_wide.csv")  # online
```

See Section B.1 of Appendix B for details on the data.

### 4.4.1 Exercise 1

#### Exploring wide data

Load the data (from ds4psy::posPsy_wide or from the CSV file at the URL shown above) into posPsy_wide (if you haven’t done so above) and explore it.

1. What are the data’s dimensions, variables, and types of variables? How would you describe the structure of this dataset?

2. Verify that posPsy_wide contains the same participants as posPsy_p_info.

Hint: In R, the equality of two objects x and y can be checked by all.equal(x, y) or all.equal(x, y, check.attributes = FALSE).

3. Inspect posPsy_wide for missing (NA) values. Do you see some systematic pattern to them?
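The mechanics behind these three steps can be sketched on a small toy example. The data frames `p_info_toy` and `wide_toy` below are hypothetical stand-ins (they are not the study data), and comparing the sorted `id` columns is only one possible way of checking for identical participants. Note that all.equal() returns either TRUE or a character description of the differences, so wrapping it in isTRUE() yields a single logical value:

```r
# Hypothetical toy data (NOT the real study data):
p_info_toy <- data.frame(id = 1:4, intervention = c(1, 2, 3, 4))
wide_toy   <- data.frame(id = c(3, 1, 4, 2),
                         intervention = c(3, 1, 4, 2),
                         ahiTotal.0 = c(55, 60, 80, 72),
                         ahiTotal.1 = c(58, 63, NA, NA))

# Dimensions and variable types:
dims  <- dim(wide_toy)             # number of rows and columns
types <- sapply(wide_toy, class)   # class of each variable

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() to obtain a single logical value:
same_ids <- isTRUE(all.equal(sort(p_info_toy$id), sort(wide_toy$id)))

# Number of missing (NA) values per variable:
na_per_var <- colSums(is.na(wide_toy))
```

On the real data, the same calls would be applied to posPsy_p_info and posPsy_wide. Comparing the NA counts across the occasion-specific variables is a good first step for spotting a systematic pattern.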

### 4.4.2 Exercise 2

When screening the data contained in AHI_CESD (in Section 4.2.4 or available as ds4psy::posPsy_AHI_CESD), we computed and plotted both the number of measurement occasions per participant and the number of participants per measurement occasion. The fact that the number of people decreased from initial to later occasions motivated an important question:

• Are the dropouts (i.e., people present at first, but absent at one or more later measurements) random or systematic?

#### Selective dropouts?

Theoretically, the presence of dropouts — which we can define here as people being present initially, but missing one or more measurements at later occasions — raises an important issue: Are the dropouts random (due to chance) or systematic (due to some other factor)? Let’s explore this issue (for the data of posPsy_wide) in a sequence of four steps:

1. Create a new filter variable dropout (e.g., so that dropout is FALSE when a person was measured on all occasions and TRUE when a person missed one or more occasions), and add this variable to the data.

2. Use the dropout variable to ask whether you see any group differences between dropouts and non-dropouts in their independent variables (sex, age, educ, and income). This would imply that some of these factors may have co-determined whether a participant completed the study or dropped out.

Hints: The best way to do this depends on the types of variables of interest:

• For categorical variables (here: sex, educ, income), the question essentially asks you to compare the relative proportions (e.g., of females vs. males) in the dropout vs. non-dropout groups. You can do this by computing appropriate summary tables in dplyr. However, it is often easier to judge the similarity of graphs visually (e.g., by using bar charts in which each stacked bar is scaled to 100%, i.e., position = "fill").

• For continuous variables (here: age), it’s easiest to visualize their distributions (as histograms or density plots) with group-specific aesthetics or facets for the variables to be compared.

3. Do you see any systematic differences between dropouts and non-dropouts by intervention? This would imply that the type of intervention may have co-determined whether a participant completed the study or dropped out.

4. Do dropouts vs. non-dropouts show any systematic differences in the dependent variables (i.e., happiness or depression scores)? This would imply that any changes in these scores (which were the main focus of this study) could have been co-determined by the fact that a participant completed the study vs. dropped out.

Hint: As this question addresses changes in scores between different measurement occasions, it could be operationalized in many ways. In our analysis of the number of participants per occasion above, we saw that only 148 of 295 pre-test participants (i.e., 50.2%) were present at Occasion 1. Hence, compare the (mean and raw) scores at the pre-test (Occasion 0) of the dropouts vs. non-dropouts of Occasion 1 (i.e., compare participants’ initial ahiTotal.0 and cesdTotal.0 scores based on their absence/presence at Occasion 1).
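The four steps above can be sketched on toy data. The data frame `toy` is hypothetical, and the sketch assumes that a missed occasion shows up as an NA value in the corresponding `ahiTotal.*` column (an assumption that should be checked against the real posPsy_wide data before relying on it):

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the real study data):
toy <- data.frame(
  id         = 1:6,
  sex        = c(1, 2, 1, 2, 2, 1),
  ahiTotal.0 = c(60, 72, 55, 80, 66, 70),
  ahiTotal.1 = c(63, NA, 58, NA, 70, 71),
  ahiTotal.2 = c(65, NA, NA, 82, 69, 73)
)

# Step 1: dropout is TRUE if any occasion is missing (assumes NA = absent):
toy <- toy %>%
  mutate(dropout = rowSums(is.na(across(starts_with("ahiTotal.")))) > 0)

# Step 2 (as a table): relative proportions of sex within each dropout group:
props <- toy %>%
  count(dropout, sex) %>%
  group_by(dropout) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# Step 2 (as a plot): stacked bars scaled to 100% per group:
p <- ggplot(toy, aes(x = dropout, fill = factor(sex))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", fill = "sex")

# Step 4: compare pre-test scores of dropouts vs. non-dropouts:
pre <- toy %>%
  group_by(dropout) %>%
  summarise(n = n(), mean_ahi_0 = mean(ahiTotal.0))
```

For the comparison suggested in the hint, the dropout flag would instead be based on a single occasion (e.g., `is.na(ahiTotal.1)`). For Step 3, replacing `sex` with the intervention variable in the table and plot gives the corresponding comparison by intervention.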

### 4.4.3 Exercise 3

In previous sessions, we already examined the age and sex distributions of participants. But are the participants in the four intervention groups also balanced in terms of their levels of income? And could the initial scores of the participants at occasion 0 (i.e., before any intervention took place) vary as a function of their income? Let’s find out…

#### Effects of income

1. Transform the data in posPsy_wide to summarise the overall distribution of income levels (i.e., the frequency or percentage of each level) and compare your results to those of Exercise 5 of Chapter 3 and Exercise 6 of Chapter 1, which both used the p_info data (available from ds4psy::posPsy_p_info or as a CSV file).

2. Use two different ways to visualise the distributions of income levels (a) overall, and (b) separately for each intervention.

Hints: You can either directly plot the data contained in posPsy_wide (e.g., by selecting some of its variables) or first use dplyr to create a summary table that can be piped into (or used as the data of) a ggplot command. To create separate plots by intervention, use faceting on the overall plot.

3. Do the initial scores of happiness (ahiTotal.0) or depression (cesdTotal.0) vary by the levels of income?

Examine this (a) by using dplyr to compute corresponding summary tables, and (b) by using ggplot2 to visualise all raw values, their means, and distributions per level of income. Check whether the summary scores in your table (a) correspond to those shown in your plot (b), and interpret your plots (e.g., do you see any potential outliers?).

Hint: Use factor(income) to turn income from an integer into a categorical variable (factor).
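A minimal sketch of the two approaches mentioned in the hints, again on hypothetical toy data (the values in `toy` are made up for illustration and say nothing about the actual results). The first part pipes a dplyr summary table into ggplot and facets by intervention. The second part uses factor(income) to compare initial scores per income level, both as a table and as a plot of raw values and means:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data (NOT the real study data):
toy <- data.frame(
  intervention = c(1, 1, 1, 2, 2, 2),
  income       = c(1, 2, 2, 1, 3, 3),
  ahiTotal.0   = c(60, 70, 74, 64, 68, 72)
)

# (a) Frequency and percentage of income levels per intervention:
income_tab <- toy %>%
  count(intervention, income) %>%
  group_by(intervention) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Summary table used as the data of a plot, faceted by intervention:
p_income <- ggplot(income_tab, aes(x = factor(income), y = pct)) +
  geom_col() +
  facet_wrap(~ intervention) +
  labs(x = "income", y = "Percentage")

# (b) Initial happiness scores by income level (as a table):
score_tab <- toy %>%
  mutate(income = factor(income)) %>%
  group_by(income) %>%
  summarise(n = n(), mean_ahi = mean(ahiTotal.0))

# Raw values, distributions, and means per income level (as a plot):
p_scores <- ggplot(toy, aes(x = factor(income), y = ahiTotal.0)) +
  geom_boxplot(alpha = 0.3) +
  geom_jitter(width = 0.1, height = 0) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, colour = "red") +
  labs(x = "income", y = "ahiTotal.0")
```

The means marked in `p_scores` should agree with the `mean_ahi` column of `score_tab`, which is the correspondence the exercise asks you to check.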

### 4.4.4 Exercise 4

We have seen many plots that showed trends of happiness and depression scores for different interventions over measurement occasions. Now it’s time to take a stance: What are the main findings of the study?

#### Showing results

• Summarize the main results as clearly as possible in 1 or 2 graphs.

• Justify your choice of graph (prior to plotting it).

• Interpret your graph. What does it show? What does it not show?

Hint: You may use the data in posPsy_wide (in wide format) or that in AHI_CESD (in long format).

### 4.4.5 Exercise 5

This is a bonus exercise, which you are welcome to skip. It goes beyond the information given in the data, but illustrates a typical problem that would occur if you actually wanted to re-analyze someone else’s data: The data files provided contained some aggregated scores. Can we understand how these scores were computed?

#### Can we verify the computation of total scores?

Use data transformation on the data in AHI_CESD (in long format) to answer the following questions:

1. Is the total happiness score (ahiTotal) the sum of all 24 corresponding scale values (from ahi01 to ahi24) at each measurement occasion?

2. Is the total depression score (cesdTotal) the sum of all 20 corresponding scale values (from cesd01 to cesd20) at each measurement occasion?

3. Assuming that your answer to the previous question (2.) is negative, try to verify how the cesdTotal scores are computed from the individual values (from cesd01 to cesd20).
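One way to approach such a check is to compute the row-wise sum of the item columns and compare it to the stored total. The sketch below uses a hypothetical toy data frame with only three items (`ahi01` to `ahi03`), in which one total has been set to a non-matching value on purpose:

```r
library(dplyr)

# Hypothetical toy data with 3 items (NOT the real study data):
toy <- data.frame(
  id       = 1:3,
  occasion = c(0, 0, 1),
  ahi01    = c(3, 4, 2),
  ahi02    = c(2, 5, 3),
  ahi03    = c(4, 3, 3),
  ahiTotal = c(9, 12, 9)  # the last total deliberately differs from 2 + 3 + 3 = 8
)

# Row-wise sum of all item columns, compared to the stored total:
check <- toy %>%
  mutate(ahi_sum = rowSums(across(ahi01:ahi03)),
         matches = ahi_sum == ahiTotal)

all_match  <- all(check$matches)   # TRUE only if every row matches
n_mismatch <- sum(!check$matches)  # number of rows that differ
```

On the real data, the column range would be `ahi01:ahi24` (or `cesd01:cesd20`). If the sums do not match the stored totals, this suggests that some items are transformed before they are added up.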

To find out how the overall depression scores are computed, you need to consult background information on the CES-D scale. The original reference for this scale (with over 58,000 citations) is Radloff (1977):