Section 5 March 16th, 2023

5.1 Welcome!

As a reminder:

  • CPR = Copy, paste, and run

  • The group sets aside dedicated time to come together and use R.

  • Everybody can work at their own pace or in groups, etc.

  • Senior members help support junior members

  • When in doubt: Google

    • Being able to identify a solution online is a unique skill.
  • Have fun

Today’s challenges involve data that was used for #TidyTuesday. From them:

“The data this week comes from Hollywood Age Gap via Data Is Plural.

An informational site showing the age gap between movie love interests.

The data follows certain rules:

The two (or more) actors play actual love interests (not just friends, coworkers, or some other non-romantic type of relationship)

The youngest of the two actors is at least 17 years old

Not animated characters”

5.2 Your Packages

You will probably benefit from downloading (if necessary) and loading the following packages:

  • tidyverse
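
If tidyverse is not installed yet, something like the following should install and load it. This is only a minimal sketch: the CRAN mirror URL is an assumption, and any mirror will do.

```r
# Install tidyverse only if it is missing, then load it.
# The mirror below is an assumption; swap in whichever CRAN mirror you prefer.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse", repos = "https://cloud.r-project.org")
}
library(tidyverse)
```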

5.3 The Data

CPR the following code to import your data into R:

age_gaps <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-02-14/age_gaps.csv')

5.4 The Challenge

Although I have tried to make these progressively harder, feel free to bounce around until you find your sweet spot.

  1. Explore the data yourself. Look around and see what’s in there.

Think about how many observations there are, what each variable is, etc. Use a function to help you out, e.g.:

glimpse(age_gaps)
## Rows: 1,155
## Columns: 13
## $ movie_name         <chr> "Harold and Maude", "Venus", "The Quiet American", …
## $ release_year       <dbl> 1971, 2006, 2002, 1998, 2010, 1992, 2009, 1999, 199…
## $ director           <chr> "Hal Ashby", "Roger Michell", "Phillip Noyce", "Joe…
## $ age_difference     <dbl> 52, 50, 49, 45, 43, 42, 40, 39, 38, 38, 36, 36, 35,…
## $ couple_number      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ actor_1_name       <chr> "Ruth Gordon", "Peter O'Toole", "Michael Caine", "D…
## $ actor_2_name       <chr> "Bud Cort", "Jodie Whittaker", "Do Thi Hai Yen", "T…
## $ character_1_gender <chr> "woman", "man", "man", "man", "man", "man", "man", …
## $ character_2_gender <chr> "man", "woman", "woman", "woman", "man", "woman", "…
## $ actor_1_birthdate  <date> 1896-10-30, 1932-08-02, 1933-03-14, 1930-09-17, 19…
## $ actor_2_birthdate  <date> 1948-03-29, 1982-06-03, 1982-10-01, 1975-11-08, 19…
## $ actor_1_age        <dbl> 75, 74, 69, 68, 81, 59, 62, 69, 57, 77, 59, 56, 65,…
## $ actor_2_age        <dbl> 23, 24, 20, 23, 38, 17, 22, 30, 19, 39, 23, 20, 30,…
  2. Generate some summary statistics for each actor’s age. Specifically, calculate the mean, mode, min, max, and standard deviation.

More questions to try:

  • Who was the oldest woman in the data? Man? The youngest woman? Youngest man?

  • Which actor has appeared the most in the data set as ‘actor 1’? What about ‘actor 2’?
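
One thing to watch for with the summary statistics: base R's mode() reports how an object is stored, not its most frequent value. Here is a minimal sketch of one workaround, using a hypothetical helper called stat_mode and made-up ages rather than the real data:

```r
# Hypothetical helper (not part of base R): most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}

# Toy ages, invented for illustration (not taken from age_gaps).
ages <- c(23, 30, 30, 45, 52)

# The other statistics all have built-in functions.
c(mean = mean(ages), min = min(ages), max = max(ages), sd = sd(ages))

# The statistical mode, via the helper above.
stat_mode(ages)
```

With the real data, the same calls could be pointed at age_gaps$actor_1_age and age_gaps$actor_2_age. If several values tie for most frequent, this helper returns all of them.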

  3. Generate the same statistics as in Q2 for the age difference between actors.

  4. How many films have multiple couples? Hint: table()

  5. How many times was a woman actor older than her male counterpart (note: actor 1 indicates the older actor)?

  • Does Leo DiCaprio do movies with older women? What is the mean age of women co-starring with Leo (remember to check both columns, in case he’s been the older or younger of the couple)?
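
For the table() hint on films with multiple couples, one possible approach is to count rows per film and then count how many films exceed one. The sketch below uses invented film titles, assuming one row per couple as in the real data:

```r
# Toy film titles, invented for illustration: one entry per couple.
films <- c("Film A", "Film A", "Film B", "Film C", "Film C", "Film C")

# Number of couples in each film.
couples_per_film <- table(films)

# Number of films with more than one couple.
sum(couples_per_film > 1)

# How many films have 1, 2, 3, ... couples.
table(couples_per_film)
```

With the real data, films would be replaced by age_gaps$movie_name. Note that two different films sharing a title would be counted together by this approach.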
  6. What proportion of couples had the same gender? Hint: table() may also be useful here.

  7. Summarize the min, max, mean, and SD for each decade (e.g., 1960s, 1970s) of movies. You can CPR the following code to create a decade variable (try on your own if you are brave).

round_any <- function(x, accuracy, f=floor){f(x/ accuracy) * accuracy}
age_gaps$decade <- round_any(age_gaps$release_year, accuracy = 10)
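
To see what round_any does before applying it to the full data, it may help to try it on a few made-up years. The sketch below repeats the helper so it runs on its own; the toy data frame and its values are invented for illustration:

```r
# Same helper as above: round x down to a multiple of accuracy.
round_any <- function(x, accuracy, f = floor) {
  f(x / accuracy) * accuracy
}

# Made-up release years and age differences, for illustration only.
toy <- data.frame(
  release_year   = c(1971, 1975, 1983, 1989),
  age_difference = c(10, 20, 5, 15)
)

# Each year is floored to the start of its decade.
toy$decade <- round_any(toy$release_year, accuracy = 10)
toy$decade
# 1970 1970 1980 1980

# One statistic per decade; swap mean for min, max, or sd as needed.
tapply(toy$age_difference, toy$decade, mean)
```

tapply() is just one base R option here; dplyr's group_by() and summarise() would be an equally reasonable route.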
  8. Plot the ages of each couple in a scatterplot (e.g., x is actor 1, y is actor 2). Get creative and make it look nice.

  9. Visualize the mean age difference by decade of movie release. Get creative and make it look nice.

  10. Run an ANOVA to determine whether the age difference in movies varies by decade. Use only the 1970s, 1980s, and 1990s. You can use the aov() function.

  11. Run an independent-samples t-test to see whether couples where the woman is older than the man have a different age gap than couples where the man is older than the woman.

  12. Visualize Qs 10 and 11.
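
If the ANOVA and t-test syntax is unfamiliar, the sketch below shows the general shape of both calls on made-up numbers. The toy data frame and the woman_older column are assumptions for illustration; with the real data you would first build an equivalent grouping variable and keep only the three decades of interest.

```r
# Made-up data for illustration only; the real analysis would use age_gaps.
toy <- data.frame(
  age_difference = c(12, 8, 15, 10, 6, 9, 4, 7, 5),
  decade         = rep(c(1970, 1980, 1990), each = 3),
  woman_older    = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE)
)

# One-way ANOVA: wrap decade in factor() so it is treated as groups,
# not as a numeric predictor.
fit <- aov(age_difference ~ factor(decade), data = toy)
summary(fit)

# Independent-samples t-test comparing the two groups
# (t.test() uses the Welch version by default).
tt <- t.test(age_difference ~ woman_older, data = toy)
tt
```

Assuming the decade variable from Q7 exists, a filter such as subset(age_gaps, decade %in% c(1970, 1980, 1990)) is one way to restrict the real data before calling aov().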