5.4 Exercises

ds4psy: Exercises 5

The following exercises require the essential tibble commands and repeat many commands from earlier chapters (involving dplyr and ggplot2).

5.4.1 Exercise 1

Flower power

Turn the iris data — contained in R datasets — into a tibble and conduct an EDA on it.

Hint: iris provides the measurements (in cm) of plant parts (length and width of sepal and petal parts) for 50 flowers from each of 3 iris species (called setosa, versicolor, and virginica). (Evaluate ?iris to obtain a description of the dataset.)

  1. Save datasets::iris as a tibble ir that contains this data and inspect it. Are there any missing values?

  2. Compute a summary table that shows the means of the 4 measurement columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) for each of the 3 Species (in rows). Save the resulting table of means as a tibble im1.

  3. Create a histogram that shows the distribution of Sepal.Width values across all species.

  4. Create a plot that shows the shape of the distribution of Sepal.Width values for each species.

  5. Create a plot that shows Petal.Width as a function of Sepal.Width separately (i.e., in 3 facets) for each species.

5.4.2 Exercise 2

Rental accounting

Anna, Brian, and Caro are sharing a flat and keep a record of the items that each of them purchased for the household. At the end of each week, they use this data to balance their account. As an aspiring data scientist, you offer your help. Here’s last week’s data:

Name Mon Tue Wed Thu Fri Sat Sun
Anna Bread: $2.50 Pasta: $4.50 Pencils: $3.25 Milk: $4.80 Cookies: $4.40 Cake: $12.50
Butter: $2.00 Cream: $3.90
Brian Chips: $3.80 Beer: $11.80 Steak: $16.20 Toilet paper: $4.50 Wine: $8.80
Caro Fruit: $6.30 Batteries: $6.10 Newspaper: $2.90 Honey: $3.20 Detergent: $9.95
  1. Which variables and which observations would you define here? Go ahead and enter the data into a tibble acc_1.

  2. Use acc_1 to answer the following questions (by using dplyr for creating tables that contain the answer):

    • How much money was spent this week?
    • Which percentage of the overall amount was spent by each person?
    • How many items did each person purchase?
    • How much did each person pay overall?
    • Who buys the cheapest/most expensive items (on average)?
    • How much is being spent on each day of the week (overall and on average)?
    • What is the order of days sorted by the overall amount spent (from most expensive to least expensive)?
  3. Interpret and re-create the following graphs (using ggplot2 and possibly dplyr):

  1. Bonus task: What do the following plots show? Try re-creating the plots from the data in acc_1.

Hint: These plots are created with geom_step and geom_area. However, rather than directly calling ggplot, consider first using dplyr to transform the data of acc_1 into summary tables that contain the values needed for the plots. You may have to combine multiple group_by and mutate commands to compute all required variables.

5.4.3 Exercise 3

Positive psychology tibbles

In this exercise, we will enter some results from the Exploring data chapter (Chapter 4) as a tibble. Next, we re-compute the results from the raw data and visualize the results.

To do this exercise, re-load the following data files (into R objects posPsy_wide and posPsy_long):

# Load data: 
posPsy_wide <- ds4psy::posPsy_wide  # from ds4psy package
# posPsy_wide <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_data_wide.csv")  # online
dim(posPsy_wide)  # 295 294
#> [1] 295 294

# 3. Corrected DVs in long format:
posPsy_long <- ds4psy::posPsy_long  # from ds4psy package
# posPsy_long <- readr::read_csv(file = "http://rpository.com/ds4psy/data/posPsy_AHI_CESD_corrected.csv")  # online
dim(posPsy_long)  # 990 x 50
#> [1] 990  50

See Section B.1 of Appendix B for details on the data.

The following table shows the mean depression scores per intervention for each of the 5 occasions (with means rounded to 1 decimal):

Table 5.2: Mean depression scores by intervention and occasion.
intervention mn_cesd_0 mn_cesd_1 mn_cesd_2 mn_cesd_3 mn_cesd_4 mn_cesd_5
1 15.1 15.3 13.6 12.0 11.2 13.5
2 16.2 14.6 11.4 12.5 13.4 14.6
3 16.1 12.3 14.8 13.9 14.9 13.0
4 12.8 9.9 9.5 9.1 7.7 10.2
  1. Enter this data directly into a tibble my_tbl (by using either the tibble or the tribble command).

  2. Re-compute an identical tibble my_tbl_2 by transforming one of the posPsy_... datasets (using dplyr) and verify that my_tbl and my_tbl_2 are equal.

  3. Visualize the information expressed by my_tbl in a transparent way (e.g., by creating a bar or line plot).

Hint: If this is difficult by using data = my_tbl in ggplot, use a data file that is better suited for this purpose. Why can’t you just directly plot my_tbl?

5.4.4 Exercise 4

False-positive psychology

Having considered the benefits of positive psychology, we can now consider the pitfalls of false-positive psychology: Presenting incidential and irrelevant results as statistically significant findings. An intriguing article on this phenomenon reports noteworthy results based on 2 datasets (J. P. Simmons, Nelson, & Simonsohn, 2011):

  • Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. doi: https://doi.org/10.1177/0956797611417632

The data was published separately (J. Simmons, Nelson, & Simonsohn, 2014) and pre-processed to facilitate working with it. (See Section B.2 of Appendix B for details on the data and corresponding articles.)

The following table was created by summarizing the data of both studies. It reports the mean, minimum, and maximum age of participants per condition cond, as well as the number of people in those conditions who reported to feel a certain age (from very young to very old):

Table 5.3: Age-related data from Simmons et al. (2011).
cond n mn_ag mi_ag mx_ag fl_vyng fl_yng fl_mid fl_old fl_vold
64 25 21.09 18.30 38.24 0 13 10 2 0
control 22 20.80 18.53 27.23 3 15 3 1 0
potato 31 20.60 18.18 27.37 1 17 11 2 0
  1. Enter this data as a tibble tbl_1 (by using the tibble or the tribble command).

  2. Import the original dataset (either from the ds4psy package or from http://rpository.com/ds4psy/data/falsePosPsy_all.csv) and re-create the data of tbl_1 from the original data as a new tibble tbl_org (by using dplyr).

# Import the dataset:
falsePosPsy_all <- ds4psy::falsePosPsy_all  # from ds4psy package
# falsePosPsy_all <- readr::read_csv("http://rpository.com/ds4psy/data/falsePosPsy_all.csv")  # online

Hints: Check the codebook (in Section B.2 of Appendix B) for details on the data variables and their possible values. For instance, participants’ age in years is stored in a variable called aged365.

  1. Visualize the following aspects of the data (by using ggplot2):

    • Use the tibble tbl_org to plot the number of participants per condition (e.g., as a bar plot).

    • Plot the mean age per condition with the minimum and the maximum age (e.g., by using geom_pointrange).

This concludes our set of exercises on tibbles.

References

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632

Simmons, J., Nelson, L., & Simonsohn, U. (2014). Data from paper “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”. Journal of Open Psychology Data, 2(1). https://doi.org/10.5334/jopd.aa