- Spotting mistakes and missing data (could be part of EDA too)
- Listing anomalies and outliers (could be part of EDA too)
Four datasets were used as a reference. They were retrieved or scraped from the MovieLens and IMDB databases and are described below.
This dataset lists movie titles together with the release year, the assigned genres, and a unique MovieLens identifier. It was gathered from the MovieLens website, which is a movie recommendation service. movieId is the unique identifier of the movie, title gives the original title, and genres lists the assigned genres (the categories of the movie) separated by “|”. The dataset has movies in total and the table below shows the first 5 records.
This file maps each movie to its identifiers in the MovieLens, IMDB, and The Movie Database datasets through the movieId, imdbId, and tmdbId variables, respectively. The dataset holds the identifiers of 9742 movies. The table below shows the first 5 records of the dataset.
Budget and Characters (budget_chars.csv)
The research questions concern the profit of the movies and its possible predictors, such as the budget, the genders of the stars, and the studio. As these data are not available in the given datasets, the variables were scraped from the IMDB database using the imdbId variable in the links.csv file. imdbId is the unique identifier of the movie on IMDB, budget is the estimated budget of the movie, ww_gross is the worldwide gross income, studio is the company that released the movie, and characters holds the unique IMDB identifiers of the 15 characters that played in that movie. The table below shows the first 5 records of the dataset.
Character Genders (character_genders.tsv)
The movie pages do not include the genders of the characters, so this information was scraped separately from the characters’ pages according to their profession tags. If a character is tagged as “actor”, it is assigned “Male”; if it is tagged as “actress”, it is assigned “Female”; if neither tag is present, it is assigned “Unknown”. The table below shows the first 5 records of the dataset.
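The tagging rule above can be sketched in a few lines. This is a minimal Python illustration, not the scraping code used for the report; the function name `assign_gender` is hypothetical, and the behavior when both tags are present (here “actor” wins) is an assumption, since the text does not cover that case.

```python
def assign_gender(profession_tags):
    """Map a character's profession tags to a gender label.

    Assumption: if both "actor" and "actress" are present, "actor" is
    checked first. The report does not say how such a case is handled.
    """
    tags = {tag.strip().lower() for tag in profession_tags}
    if "actor" in tags:
        return "Male"
    if "actress" in tags:
        return "Female"
    return "Unknown"


# Illustrative tag lists, not taken from the scraped data.
print(assign_gender(["actress", "producer"]))
print(assign_gender(["director", "writer"]))
```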
3.1 Data Wrangling and Cleaning
As the movies dataset includes both the year and the title in the same column, this column was separated into two variables. The links dataset was then joined to the movies dataset on the common movieId column. The first 5 rows of the merged dataset are printed below.
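As a rough sketch of these two steps, the snippet below splits a combined title into title and year and then attaches the link identifiers by movieId. It assumes the year appears as a four-digit number in parentheses at the end of the title; the function names and the sample rows are hypothetical and do not come from the datasets.

```python
import re

# Assumed layout: "<title> (<4-digit year>)" with the year at the very end.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title_year(raw_title):
    """Return (title, year); year is None when no trailing year is found."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("title"), int(match.group("year"))


def join_links(movies, links):
    """Left-join link rows onto movie rows using the shared movieId key."""
    links_by_id = {row["movieId"]: row for row in links}
    merged = []
    for movie in movies:
        title, year = split_title_year(movie["title"])
        link = links_by_id.get(movie["movieId"], {})
        merged.append({
            **movie,
            "title": title,
            "year": year,
            "imdbId": link.get("imdbId"),
            "tmdbId": link.get("tmdbId"),
        })
    return merged


# Illustrative rows only.
movies = [{"movieId": 1, "title": "Example Movie (1999)", "genres": "Drama|Comedy"}]
links = [{"movieId": 1, "imdbId": "0000001", "tmdbId": "11"}]
print(join_links(movies, links))
```

A title without a trailing year keeps its text and receives a missing year, which is one way missing values can end up in a year column.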
The research questions are based on the budget of the movies and the effect of this variable on some profit criteria. These relations are also evaluated according to the genre, the main characters, and the studio. As the budget, profit, list of main characters, and studio names are not in the given datasets, they were scraped from related databases. As all of those variables are found in IMDB, this database was selected as the source and the scraped data were joined to the merged table. The profit was calculated by subtracting the budget from ww_gross. The table below shows 5 rows of the merged table.
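The profit calculation described above amounts to a single subtraction. The sketch below assumes that a missing budget or gross is represented as `None` and that the profit is then left missing as well; the helper name `compute_profit` and the figures are illustrative.

```python
def compute_profit(ww_gross, budget):
    """Profit = worldwide gross minus budget; missing if either input is missing."""
    if ww_gross is None or budget is None:
        return None
    return ww_gross - budget


# Illustrative figures in USD, not taken from the scraped data.
print(compute_profit(150_000_000, 40_000_000))  # 110000000
print(compute_profit(None, 40_000_000))         # None
```

Under this assumption, the profit is missing whenever either input is missing, so a profit column can never have fewer missing values than the budget column.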
3.1.1 Spotting Mistakes and Missing Data
The numerical summary of the movies shows that all records have complete movieId, title, genres, and studio variables, while 3399 records have a missing budget and 4135 have a missing profit. In addition, 12 records have no year and 370 movies have missing data for star genders. As only the first two characters were selected as stars, the maximum value of each gender count is 2.
#>     movieId          title              genres
#>  Min.   :     1   Length:9742        Length:9742
#>  1st Qu.:  3248   Class :character   Class :character
#>  Median :  7300   Mode  :character   Mode  :character
#>  Mean   : 42200
#>  3rd Qu.: 76232
#>  Max.   :193609
#>
#>       year          budget             profit
#>  Min.   :1902   Min.   :0.00e+00   Min.   :-3.00e+10
#>  1st Qu.:1988   1st Qu.:4.85e+06   1st Qu.:-2.57e+06
#>  Median :1999   Median :1.50e+07   Median : 1.14e+07
#>  Mean   :1995   Mean   :4.50e+07   Mean   : 4.40e+07
#>  3rd Qu.:2008   3rd Qu.:3.80e+07   3rd Qu.: 6.23e+07
#>  Max.   :2018   Max.   :3.00e+10   Max.   : 2.55e+09
#>  NA's   :12     NA's   :3399       NA's   :4135
#>     studio           star_female    star_male     star_unknown
#>  Length:9742        Min.   :0      Min.   :0     Min.   :0
#>  Class :character   1st Qu.:0      1st Qu.:1     1st Qu.:0
#>  Mode  :character   Median :1      Median :1     Median :0
#>                     Mean   :1      Mean   :1     Mean   :0
#>                     3rd Qu.:1      3rd Qu.:2     3rd Qu.:0
#>                     Max.   :2      Max.   :2     Max.   :2
#>                     NA's   :370    NA's   :370   NA's   :370
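The missing-value counts and the star gender counts in this summary could be reproduced along the following lines. This is a hedged Python sketch rather than the code behind the output above: rows are assumed to be dictionaries with `None` for missing values, the names `count_missing` and `star_gender_counts` are hypothetical, and treating a character without a known gender as “Unknown” is an assumption.

```python
def count_missing(rows, columns):
    """Count, per column, how many rows hold no value (None)."""
    return {col: sum(1 for row in rows if row.get(col) is None) for col in columns}


def star_gender_counts(character_ids, gender_by_id, n_stars=2):
    """Count genders among the first n_stars characters of a movie.

    Because only the first two characters count as stars, no count
    can exceed 2. A movie without any characters gets missing counts.
    """
    if not character_ids:
        return {"star_female": None, "star_male": None, "star_unknown": None}
    counts = {"star_female": 0, "star_male": 0, "star_unknown": 0}
    for char_id in character_ids[:n_stars]:
        gender = gender_by_id.get(char_id, "Unknown")  # assumed fallback
        counts["star_" + gender.lower()] += 1
    return counts


# Illustrative data with made-up identifiers.
rows = [{"budget": 1_000_000, "profit": None}, {"budget": None, "profit": None}]
print(count_missing(rows, ["budget", "profit"]))

genders = {"id_a": "Female", "id_b": "Male", "id_c": "Male"}
print(star_gender_counts(["id_a", "id_b", "id_c"], genders))
```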
Some movies have very low budgets (such as < USD 100) while having a high income (> USD 10^9). Some of these cases are due to scraping errors and currency differences (for example, ESP20 was extracted as USD 20). As they are clearly outliers, they were removed from further analysis. Movies with a net loss were also removed from the analysis.
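A filter of this kind might look like the sketch below. The two thresholds are taken from the examples in the text and are assumptions, not the exact cut-offs used in the analysis; the function names are hypothetical, and rows with missing values are passed through unchanged because the text does not say how they were handled at this step.

```python
LOW_BUDGET_USD = 100       # assumed cut-off, from the "< USD 100" example
HIGH_GROSS_USD = 10 ** 9   # assumed cut-off, from the "> USD 10^9" example


def is_suspect(row):
    """Flag rows pairing a tiny budget with a very large worldwide gross."""
    budget, gross = row.get("budget"), row.get("ww_gross")
    if budget is None or gross is None:
        return False
    return budget < LOW_BUDGET_USD and gross > HIGH_GROSS_USD


def has_net_loss(row):
    """Flag rows whose profit is negative."""
    profit = row.get("profit")
    return profit is not None and profit < 0


def filter_movies(rows):
    """Keep rows that are neither suspect outliers nor net losses."""
    return [row for row in rows if not is_suspect(row) and not has_net_loss(row)]


# Illustrative rows only.
sample = [
    {"budget": 20, "ww_gross": 2 * 10 ** 9, "profit": 2 * 10 ** 9 - 20},
    {"budget": 5_000_000, "ww_gross": 1_000_000, "profit": -4_000_000},
    {"budget": 5_000_000, "ww_gross": 9_000_000, "profit": 4_000_000},
]
print(len(filter_movies(sample)))  # 1
```

Keeping the two checks in separate functions makes it easy to report how many rows each rule removes before they are dropped.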