3 Data | Title of The Web Page

3.1 Sources

Four datasets were used as a reference. These datasets are retrieved or scraped from MovieLens and IMDB databases. The used datasets are named as:

movies.csv
character_genders.tsv
links.csv
budget_chars.csv

3.2 Description

Movies (movies.csv)

This dataset includes the list of title of movies with released year, assigned genre, and unique MovieLens identifier. It is gathered from MovieLens web site which is a movie recommendation service. the movieID represents a unique identifier of the movie, genres are list of assigned genres seperated by “|”, title shows the original title, genres indicates the category of the movie. The dataset has movies in total and the table below shows the first 5 records.

Links (links.csv)

This file includes the identifiers of the related movie on the MovieLens, IMDB and The Movie Database datasets with movieId, imdbId and tmdbId variables, respectively. The dataset has the identifiers of 9742 movies. The table below shows the first 5 records of the related dataset.

Budget and Characters (budget_chars.csv)

The research questions are related to the profit of the movies and their possible predictors such as budget, genders of stars, and the studio. As those data are not found on the given datasets, the variables were scraped from the IMDB database using imdbId variables in the links.csv file. imdbId variable indicates unique identifier of the movies on IBDM, the budget shows the estimated budget of the movie, ww_gross indicates the worldwide gross income, studio shows the company that releases the movie and characters includes the unique IMDB identifier of the 15 characters that played in that movie. The table below shows the first 5 records of the related dataset.

0114709

30000000

404992988

Walt Disney Pictures, Pixar Animation Studios

nm0000158,nm0000741,nm0725543,nm0001815,nm0001728,nm0001652,nm0001633,nm0606658,nm0902184,nm0582418,nm0000388,nm0293589,nm0422710,nm0029460,nm0039970

0113497

65000000

262821940

TriStar Pictures, Interscope Communications, Teitler Film

nm0000245,nm0404993,nm0000379,nm0682300,nm0001372,nm0001564,nm0004979,nm0165101,nm0002123,nm0120410,nm0359695,nm0053390,nm0643488,nm0857326,nm0861809

0113228

25000000

71518503

Lancaster Gate, Warner Bros.

nm0000527,nm0000493,nm0000047,nm0000268,nm0580565,nm0000435,nm0001629,nm0756218,nm0346893,nm0025908,nm0459246,nm0942642,nm0370010,nm0263449,nm0505760

0114885

16000000

81452156

Twentieth Century Fox

nm0001365,nm0000291,nm0222643,nm0005375,nm0002138,nm0371660,nm0932112,nm0004729,nm0502442,nm0682495,nm0265668,nm0760254,nm0226934,nm0358650,nm0601469

0113041

30000000

76594107

Sandollar Productions, Taylor-Made Productions, The Meyers/Shyer Company

nm0000188,nm0000473,nm0001737,nm0931090,nm0627624,nm0001085,nm0000703,nm0324533,nm0570033,nm0011038,nm0506405,nm0150427,nm0651074,nm0736851,nm0030926

Character Genders (character_genders.tsv)

The movie pages do not include the genders of the characters so this information was scraped from characters’ pages separately according to their profession tags. If a character is tagged as “actor” it is assigned as Male; and if it is “actress” it is assigned as “Female” if both tags are not present it is assigned as “Unknown”. The table below shows the first 5 records of the related dataset.

3.3 Data Wrangling and Cleaning

As the movies dataset include both year and title in the same column, these variables are seperated into two different variables. Then the links dataset was joined to movies dataset by using common movieId column. The first 5 rows of merged dataset printed below.

The research questions are based on the budget of the movies and the effect of this variable on some profit criteria. These relations are also evaluated according to the genre, the main character, and the studio. As a budget, profit, list of main characters, and studio names are not located on given datasets, those are scraped from related databases. As all of those variables are found in IMDB, this database selected as a source and joined together. The profit was calculated by extracting the budget from ww_gross. In the table below, 5 rows of a merged table are seen.

1

Toy Story

1995

Adventure|Animation|Children|Comedy|Fantasy

30000000

374992988

Walt Disney Pictures, Pixar Animation Studios

0

2

0

2

Jumanji

1995

Adventure|Children|Fantasy

65000000

197821940

TriStar Pictures, Interscope Communications, Teitler Film

0

2

0

3

Grumpier Old Men

1995

Comedy|Romance

25000000

46518503

Lancaster Gate, Warner Bros.

0

2

0

4

Waiting to Exhale

1995

Comedy|Drama|Romance

16000000

65452156

Twentieth Century Fox

2

0

5

Father of the Bride Part II

1995

Comedy

30000000

46594107

Sandollar Productions, Taylor-Made Productions, The Meyers/Shyer Company

1

0

3.4 Spotting Mistakes and Missing Data

The numerical summary of movies shows that all records have complete movieId, title, studio variables; while 3399 records has missing budget and 4135 missing profit data. Also, 370 of movies have missing data for star genders. As only first two characters were selected as stars, the maximum values of gender counts were 2.

#>     movieId          title                year     
#>  Min.   :     1   Length:9742        Min.   :1902  
#>  1st Qu.:  3248   Class :character   1st Qu.:1988  
#>  Median :  7300   Mode  :character   Median :1999  
#>  Mean   : 42200                      Mean   :1995  
#>  3rd Qu.: 76232                      3rd Qu.:2008  
#>  Max.   :193609                      Max.   :2018  
#>                                      NA's   :12    
#>     genres              budget             profit         
#>  Length:9742        Min.   :0.00e+00   Min.   :-3.00e+10  
#>  Class :character   1st Qu.:4.85e+06   1st Qu.:-2.57e+06  
#>  Mode  :character   Median :1.50e+07   Median : 1.14e+07  
#>                     Mean   :4.50e+07   Mean   : 4.40e+07  
#>                     3rd Qu.:3.80e+07   3rd Qu.: 6.23e+07  
#>                     Max.   :3.00e+10   Max.   : 2.55e+09  
#>                     NA's   :3399       NA's   :4135       
#>     studio           star_female    star_male    star_unknown
#>  Length:9742        Min.   :0     Min.   :0     Min.   :0    
#>  Class :character   1st Qu.:0     1st Qu.:1     1st Qu.:0    
#>  Mode  :character   Median :1     Median :1     Median :0    
#>                     Mean   :1     Mean   :1     Mean   :0    
#>                     3rd Qu.:1     3rd Qu.:2     3rd Qu.:0    
#>                     Max.   :2     Max.   :2     Max.   :2    
#>                     NA's   :370   NA's   :370   NA's   :370

Some movies have very low budgets (such as < USD 100) while have high income (> USD 10^9). Some of them is due to scraping error and currency differences (such as ESP20 was extracted as USD 20) As they are clearly outliers, they are removed from further analysis. Also movies with net loss were removed from analysis.