3 Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

3.1 Sources

Four datasets were used as a reference. These datasets are retrieved or scraped from MovieLens and IMDB databases. The used datasets are named as:

  • movies.csv

  • character_genders.tsv

  • links.csv

  • budget_chars.csv

3.2 Description

Movies (movies.csv)

This dataset includes the list of title of movies with released year, assigned genre, and unique MovieLens identifier. It is gathered from MovieLens web site which is a movie recommendation service. the movieID represents a unique identifier of the movie, genres are list of assigned genres seperated by “|”, title shows the original title, genres indicates the category of the movie. The dataset has movies in total and the table below shows the first 5 records.

movieId
title
genres
1
Toy Story (1995)
Adventure|Animation|Children|Comedy|Fantasy
2
Jumanji (1995)
Adventure|Children|Fantasy
3
Grumpier Old Men (1995)
Comedy|Romance
4
Waiting to Exhale (1995)
Comedy|Drama|Romance
5
Father of the Bride Part II (1995)
Comedy
6
Heat (1995)
Action|Crime|Thriller

Links (links.csv)

This file includes the identifiers of the related movie on the MovieLens, IMDB and The Movie Database datasets with movieId, imdbId and tmdbId variables, respectively. The dataset has the identifiers of 9742 movies. The table below shows the first 5 records of the related dataset.

movieId
imdbId
tmdbId
1
0114709
862
2
0113497
8844
3
0113228
15602
4
0114885
31357
5
0113041
11862
6
0113277
949

Budget and Characters (budget_chars.csv)

The research questions are related to the profit of the movies and their possible predictors such as budget, genders of stars, and the studio. As those data are not found on the given datasets, the variables were scraped from the IMDB database using imdbId variables in the links.csv file. imdbId variable indicates unique identifier of the movies on IBDM, the budget shows the estimated budget of the movie, ww_gross indicates the worldwide gross income, studio shows the company that releases the movie and characters includes the unique IMDB identifier of the 15 characters that played in that movie. The table below shows the first 5 records of the related dataset.

imdbId
budget
ww_gross
studio
characters
0114709
30000000
404992988
Walt Disney Pictures, Pixar Animation Studios
nm0000158,nm0000741,nm0725543,nm0001815,nm0001728,nm0001652,nm0001633,nm0606658,nm0902184,nm0582418,nm0000388,nm0293589,nm0422710,nm0029460,nm0039970
0113497
65000000
262821940
TriStar Pictures, Interscope Communications, Teitler Film
nm0000245,nm0404993,nm0000379,nm0682300,nm0001372,nm0001564,nm0004979,nm0165101,nm0002123,nm0120410,nm0359695,nm0053390,nm0643488,nm0857326,nm0861809
0113228
25000000
71518503
Lancaster Gate, Warner Bros.
nm0000527,nm0000493,nm0000047,nm0000268,nm0580565,nm0000435,nm0001629,nm0756218,nm0346893,nm0025908,nm0459246,nm0942642,nm0370010,nm0263449,nm0505760
0114885
16000000
81452156
Twentieth Century Fox
nm0001365,nm0000291,nm0222643,nm0005375,nm0002138,nm0371660,nm0932112,nm0004729,nm0502442,nm0682495,nm0265668,nm0760254,nm0226934,nm0358650,nm0601469
0113041
30000000
76594107
Sandollar Productions, Taylor-Made Productions, The Meyers/Shyer Company
nm0000188,nm0000473,nm0001737,nm0931090,nm0627624,nm0001085,nm0000703,nm0324533,nm0570033,nm0011038,nm0506405,nm0150427,nm0651074,nm0736851,nm0030926

Character Genders (character_genders.tsv)

The movie pages do not include the genders of the characters so this information was scraped from characters’ pages separately according to their profession tags. If a character is tagged as “actor” it is assigned as Male; and if it is “actress” it is assigned as “Female” if both tags are not present it is assigned as “Unknown”. The table below shows the first 5 records of the related dataset.

nconst
gender
nm0000001
Male
nm0000002
Female
nm0000003
Female
nm0000004
Male
nm0000005
Male

3.3 Data Wrangling and Cleaning

As the movies dataset include both year and title in the same column, these variables are seperated into two different variables. Then the links dataset was joined to movies dataset by using common movieId column. The first 5 rows of merged dataset printed below.

movieId
title
genres
year
imdbId
tmdbId
1
Toy Story
Adventure|Animation|Children|Comedy|Fantasy
1995
0114709
862
2
Jumanji
Adventure|Children|Fantasy
1995
0113497
8844
3
Grumpier Old Men
Comedy|Romance
1995
0113228
15602
4
Waiting to Exhale
Comedy|Drama|Romance
1995
0114885
31357
5
Father of the Bride Part II
Comedy
1995
0113041
11862

The research questions are based on the budget of the movies and the effect of this variable on some profit criteria. These relations are also evaluated according to the genre, the main character, and the studio. As a budget, profit, list of main characters, and studio names are not located on given datasets, those are scraped from related databases. As all of those variables are found in IMDB, this database selected as a source and joined together. The profit was calculated by extracting the budget from ww_gross. In the table below, 5 rows of a merged table are seen.

movieId
title
year
genres
budget
profit
studio
star_female
star_male
star_unknown
1
Toy Story
1995
Adventure|Animation|Children|Comedy|Fantasy
30000000
374992988
Walt Disney Pictures, Pixar Animation Studios
0
2
0
2
Jumanji
1995
Adventure|Children|Fantasy
65000000
197821940
TriStar Pictures, Interscope Communications, Teitler Film
0
2
0
3
Grumpier Old Men
1995
Comedy|Romance
25000000
46518503
Lancaster Gate, Warner Bros.
0
2
0
4
Waiting to Exhale
1995
Comedy|Drama|Romance
16000000
65452156
Twentieth Century Fox
2
0
0
5
Father of the Bride Part II
1995
Comedy
30000000
46594107
Sandollar Productions, Taylor-Made Productions, The Meyers/Shyer Company
1
1
0

3.4 Spotting Mistakes and Missing Data

The numerical summary of movies shows that all records have complete movieId, title, studio variables; while 3399 records has missing budget and 4135 missing profit data. Also, 370 of movies have missing data for star genders. As only first two characters were selected as stars, the maximum values of gender counts were 2.

#>     movieId          title                year     
#>  Min.   :     1   Length:9742        Min.   :1902  
#>  1st Qu.:  3248   Class :character   1st Qu.:1988  
#>  Median :  7300   Mode  :character   Median :1999  
#>  Mean   : 42200                      Mean   :1995  
#>  3rd Qu.: 76232                      3rd Qu.:2008  
#>  Max.   :193609                      Max.   :2018  
#>                                      NA's   :12    
#>     genres              budget             profit         
#>  Length:9742        Min.   :0.00e+00   Min.   :-3.00e+10  
#>  Class :character   1st Qu.:4.85e+06   1st Qu.:-2.57e+06  
#>  Mode  :character   Median :1.50e+07   Median : 1.14e+07  
#>                     Mean   :4.50e+07   Mean   : 4.40e+07  
#>                     3rd Qu.:3.80e+07   3rd Qu.: 6.23e+07  
#>                     Max.   :3.00e+10   Max.   : 2.55e+09  
#>                     NA's   :3399       NA's   :4135       
#>     studio           star_female    star_male    star_unknown
#>  Length:9742        Min.   :0     Min.   :0     Min.   :0    
#>  Class :character   1st Qu.:0     1st Qu.:1     1st Qu.:0    
#>  Mode  :character   Median :1     Median :1     Median :0    
#>                     Mean   :1     Mean   :1     Mean   :0    
#>                     3rd Qu.:1     3rd Qu.:2     3rd Qu.:0    
#>                     Max.   :2     Max.   :2     Max.   :2    
#>                     NA's   :370   NA's   :370   NA's   :370

Some movies have very low budgets (such as < USD 100) while have high income (> USD 10^9). Some of them is due to scraping error and currency differences (such as ESP20 was extracted as USD 20) As they are clearly outliers, they are removed from further analysis. Also movies with net loss were removed from analysis.