## 4.4 Exercises

Working through the essentials of EDA (in Section 4.2) should have provided you with a pretty clear picture of the data and results of the `AHI_CESD` dataset.
However, we have not yet touched the data contained in `posPsy_wide`. Given that this file was described as a transformed and corrected version of `AHI_CESD`, we should not expect to find completely different results in it.
So let’s use `posPsy_wide` to exercise our skills in EDA, but also to answer some interesting questions about this study.

```
# Load packages:
library(tidyverse)
library(rmarkdown)
library(knitr)
library(here)
library(ds4psy)
# Load data:
# 1. Participant data:
posPsy_p_info <- ds4psy::posPsy_p_info # from the ds4psy package
# posPsy_p_info <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_participants.csv") # online
# 2. Original DVs in long format:
AHI_CESD <- ds4psy::posPsy_AHI_CESD # from the ds4psy package
# AHI_CESD <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_AHI_CESD.csv") # online
# 4. Transformed and corrected version of all data (in wide format):
posPsy_wide <- ds4psy::posPsy_wide # from the ds4psy package
# posPsy_wide <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_data_wide.csv") # online
```

See Section B.1 of Appendix B for details on the data.

### 4.4.1 Exercise 1

#### Explore `posPsy_wide`

Load the data stored as `posPsy_wide` (from the **ds4psy** package or the corresponding CSV file) into `posPsy_wide` (if you haven’t done so above) and explore it.

- What are the data’s dimensions, variables, and types of variables? How would you describe the structure of this dataset?

- Verify that `posPsy_wide` contains the same participants as `posPsy_p_info`.

  **Hint:** In R, the equality of 2 objects `x` and `y` can be checked by `all.equal(x, y)`.

- Inspect `posPsy_wide` for missing (`NA`) values. Do you see some systematic pattern to them?
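As a minimal sketch of the two checks mentioned above (equality of participants and counting missing values), the following uses two small toy data frames rather than the real data. The data frames `p_info` and `wide`, their values, and the column `ahiTotal.1` are hypothetical stand-ins chosen for illustration only.

```r
# Toy stand-ins for posPsy_p_info and posPsy_wide (hypothetical values):
p_info <- data.frame(id  = 1:4,
                     age = c(35, 59, 51, 50))
wide   <- data.frame(id         = 1:4,
                     ahiTotal.0 = c(63, 73, 77, 60),
                     ahiTotal.1 = c(NA, 89, NA, 76))

# all.equal() returns TRUE if both objects are (nearly) equal,
# and a character description of the differences otherwise.
# Wrapping it in isTRUE() always yields a single logical value:
same_ids <- all.equal(p_info$id, wide$id)
isTRUE(same_ids)

# Count the missing values in each column:
na_per_column <- colSums(is.na(wide))
na_per_column
```

Applied to the real data, the same pattern would compare the participant columns of `posPsy_p_info` and `posPsy_wide`, and the per-column counts of `NA` values can hint at whether missing values cluster at particular measurement occasions.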

### 4.4.2 Exercise 2

When screening the data contained in `AHI_CESD` (in Section 4.2.4 or available as `ds4psy::posPsy_AHI_CESD`), we computed and plotted both the number of measurement occasions per participant and the number of participants per measurement occasion. The fact that the number of people decreased from initial to later occasions motivated an important question:

- Are the *dropouts* (i.e., people present at first, but absent at one or more later measurements) random or systematic?

#### Selective dropouts?

Theoretically, the presence of *dropouts* — which we can define here as people being present initially, but missing one or more measurements at later occasions — raises an important issue: Are the dropouts *random* (due to chance) or *systematic* (due to some other factor)?
Let’s explore this issue (for the data of `posPsy_wide`) in a sequence of 4 steps:

1. Create a new filter variable `dropout` (e.g., so that `dropout` is `FALSE` when a person was measured on *all* occasions and `TRUE` when a person missed one or more occasions), and add this variable to the data.

2. Use the `dropout` variable to ask whether you see any group differences between dropouts and non-dropouts in their independent variables (`sex`, `age`, `educ`, and `income`). Such differences would imply that some of these factors may have co-determined whether a participant completed the study or dropped out.

   **Hints:** The best way to do this depends on the types of variables of interest:

   - For *categorical* variables (here: `sex`, `educ`, `income`), the question essentially asks you to compare the relative proportions (e.g., of females vs. males) in the dropout vs. non-dropout groups. You *can* do this by computing appropriate summary tables in **dplyr**. However, it’s much easier to visually judge the similarity of graphs (e.g., by using bar charts in which stacked bars are scaled to 100% for each bar, i.e., `position = "fill"`).

   - For *continuous* variables (here: `age`), it’s easiest to visualize their distributions (as histograms or density plots) with group-specific aesthetics or facets for the variables to be compared.

3. Do you see any systematic differences between dropouts and non-dropouts by `intervention`? Such differences would imply that the type of `intervention` may have co-determined whether a participant completed the study or dropped out.

4. Do dropouts vs. non-dropouts show any systematic differences in the dependent variables (i.e., happiness or depression scores)? Such differences would imply that any changes in these scores (which were the main focus of this study) could have been co-determined by the fact that a participant completed the study vs. dropped out.

   **Hint:** As this question addresses changes in scores between different measurement occasions, it could be operationalized in many ways. In our analysis of the number of participants per occasion above, we saw that only 148 of 295 pre-test participants (i.e., 50.2%) were present at Occasion 1. Hence, compare the (mean and raw) scores at the pre-test (Occasion 0) of the dropouts vs. non-dropouts of Occasion 1 (i.e., compare participants’ initial `ahiTotal.0` and `cesdTotal.0` scores based on their absence/presence at Occasion 1).
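The steps above can be sketched on a small toy data frame. This is only an illustration under stated assumptions: the values are made up, the column `ahiTotal.1` is assumed by analogy to `ahiTotal.0`, an absent measurement is assumed to appear as `NA` in the wide format, and `dropout` is defined here with respect to a single later occasion only.

```r
library(dplyr)
library(ggplot2)

# Toy data in wide format (hypothetical values and sex codes):
toy <- data.frame(
  id         = 1:8,
  sex        = c(1, 2, 1, 1, 2, 1, 2, 1),
  ahiTotal.0 = c(63, 73, 77, 60, 41, 62, 67, 59),
  ahiTotal.1 = c(NA, 89, NA, 76, NA, 58, 70, NA)
)

# Step 1: Flag participants with a missing score at the later occasion:
toy <- toy %>%
  mutate(dropout = is.na(ahiTotal.1))

# Step 2 (categorical variable): Proportions of sex within each dropout group:
prop_tbl <- toy %>%
  count(dropout, sex) %>%
  group_by(dropout) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()
prop_tbl

# The same comparison as a bar chart with bars scaled to 100%:
p_sex <- ggplot(toy, aes(x = dropout, fill = factor(sex))) +
  geom_bar(position = "fill") +
  labs(y = "Proportion", fill = "sex")

# Step 4: Mean pre-test scores of dropouts vs. non-dropouts:
pre_tbl <- toy %>%
  group_by(dropout) %>%
  summarise(n = n(), mn_ahi_0 = mean(ahiTotal.0))
pre_tbl
```

With the real data, a `dropout` variable covering *all* later occasions would need to combine several such `is.na()` checks (or count each participant's occasions in the long-format data).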

### 4.4.3 Exercise 3

In previous sessions, we already examined the `age` and `sex` distributions of participants.
But are the participants in the 4 `intervention` groups also balanced in terms of their `income` levels?
And could the initial scores of the participants at occasion 0 (i.e., before any intervention took place) vary as a function of their `income`?
Let’s find out…

#### Effects of `income`

- Transform the data in `posPsy_wide` to summarise the overall distribution of `income` levels (i.e., the frequency or percentage of each level) and compare your results to those of Exercise 4 of Chapter 3 and Exercise 6 of Chapter 1, which both used the `p_info` data (available from `ds4psy::posPsy_p_info` or as a CSV file).

- Use 2 different ways to visualise the distributions of `income` levels (a) overall, and (b) separately for each `intervention`.

  **Hints:** You can either directly plot the data contained in `posPsy_wide` (e.g., by selecting some of its variables) or first use **dplyr** to create a summary table that can be piped into (or used as the `data` of) a `ggplot` command. To create separate plots by `intervention`, use faceting on the overall plot.

- Do the initial scores of happiness (`ahiTotal.0`) or depression (`cesdTotal.0`) vary by the levels of `income`?

  Examine this (a) by using **dplyr** to compute corresponding summary tables, and (b) by using **ggplot2** to visualise all raw values, their means, and distributions per level of `income`. Check whether the summary scores in your table (a) correspond to those shown in your plot (b), and interpret your plots (e.g., do you see any potential outliers?).

  **Hint:** Use `factor(income)` to turn `income` from an integer into a categorical variable (factor).
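A minimal sketch of the table-plus-plot pattern described above, again on made-up toy data (the values are hypothetical and do not reflect the study). It combines a **dplyr** summary per `income` level with a **ggplot2** plot that shows raw values, distributions, and means, using `factor(income)` as suggested in the hint.

```r
library(dplyr)
library(ggplot2)

# Toy data (hypothetical values):
toy <- data.frame(
  income     = c(1, 1, 2, 2, 2, 3, 3, 3),
  ahiTotal.0 = c(55, 61, 70, 66, 74, 80, 72, 79)
)

# (a) Summary table: frequency, percentage, and mean score per income level:
income_tbl <- toy %>%
  group_by(income) %>%
  summarise(n      = n(),
            pct    = 100 * n / nrow(toy),
            mn_ahi = mean(ahiTotal.0))
income_tbl

# (b) Plot: distributions (boxes), raw values (jittered points), and means (diamonds):
p_income <- ggplot(toy, aes(x = factor(income), y = ahiTotal.0)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.5) +
  stat_summary(fun = mean, geom = "point",
               shape = 18, size = 4, colour = "red") +
  labs(x = "income", y = "ahiTotal.0")
```

The red diamonds in the plot should match the `mn_ahi` column of the table, which is one way to check that (a) and (b) agree. Adding `facet_wrap(~ intervention)` to a plot of the real data would split it by intervention group.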

### 4.4.4 Exercise 4

We have seen many plots that showed trends of happiness and depression scores for different interventions over measurement occasions. Now it’s time to take a stance: What are the main findings of the study?

#### Showing results

Summarize the main results as clearly as possible in 1 or 2 graphs.

Justify your choice of graph (prior to plotting it).

Interpret your graph. What does it show? What does it not show?

**Hint:** You may use the data in `posPsy_wide` (in wide format) or that in `AHI_CESD` (in long format).

### 4.4.5 Exercise 5

This is a bonus exercise, which you are welcome to skip.
It goes beyond the information given in the data, but illustrates a typical problem that would occur if you actually wanted to re-analyze someone else’s data: The data files provided contained some aggregated scores. Can we understand *how* these scores were computed?

#### Can we verify the computation of total scores?

Use data transformation on the data in `AHI_CESD` (in long format) to answer the following questions:

1. Is the total happiness score (`ahiTotal`) the sum of all 24 corresponding scale values (`ahi01` to `ahi24`) at each measurement occasion?

2. Is the total depression score (`cesdTotal`) the sum of all 20 corresponding scale values (`cesd01` to `cesd20`) at each measurement occasion?

3. Assuming that your answer to the previous question (2.) is negative, try to verify *how* the `cesdTotal` scores are computed from the individual values (`cesd01` to `cesd20`).

To find out how the overall depression scores are computed, you need to consult background information on the CES-D scale. The original reference for this scale (with over 45,000 citations) is Radloff (1977):

- Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. *Applied Psychological Measurement*, *1*(3), 385–401. https://doi.org/10.1177/014662167700100306

Even without reading this article, we can find plenty of scale-related information online.

**Hint:** Some scale items are reverse-coded, so they should be *subtracted* from the sum of the other items, rather than added to them.
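Why subtracting can be equivalent to reverse-coding is shown in the following sketch on random toy responses. It rests on two assumptions that you should check against the data and the scale documentation: that item responses are coded from 1 to 4, and that the reverse-keyed items are the four positively worded CES-D items (4, 8, 12, and 16). Under these assumptions, rescoring the 16 regular items as `x - 1` and the 4 reversed items as `4 - x` adds and subtracts the same constant (16), so the total reduces to the sum of the regular items minus the sum of the reversed items.

```r
set.seed(1)

# Toy responses of 5 hypothetical people to 20 items, coded 1 to 4 (assumption):
items <- matrix(sample(1:4, size = 5 * 20, replace = TRUE), nrow = 5)
colnames(items) <- sprintf("cesd%02d", 1:20)

# Assumed reverse-keyed items (positively worded CES-D items):
rev_items <- c("cesd04", "cesd08", "cesd12", "cesd16")
reg_items <- setdiff(colnames(items), rev_items)

# Explicit rescoring to a 0-3 range, with reversed items flipped:
total_rescored <- rowSums(items[, reg_items] - 1) +
  rowSums(4 - items[, rev_items])

# Shortcut from the hint: subtract the reversed items from the others:
total_shortcut <- rowSums(items[, reg_items]) -
  rowSums(items[, rev_items])

all.equal(total_rescored, total_shortcut)
```

Whether the `cesdTotal` values in `AHI_CESD` actually follow this rule is what the exercise asks you to verify.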

This concludes our exercises on exploratory data analysis (EDA) — but we will further improve our corresponding skills in subsequent chapters.

### References

Radloff, L. S. (1977). The CES-D scale: A self-report depression scale for research in the general population. *Applied Psychological Measurement*, *1*(3), 385–401. https://doi.org/10.1177/014662167700100306