3 Data

  • Sources
  • Description
  • Wrangling/cleaning
  • Spotting mistakes and missing data (could be part of EDA too)
  • Listing anomalies and outliers (could be part of EDA too)

3.1 Sources

Four datasets were used. They were retrieved or scraped from the MovieLens and IMDB databases and are named as follows:

  • movies.csv

  • character_genders.tsv

  • links.csv

  • budget_chars.csv

3.2 Description

Movies (movies.csv)

This dataset lists movie titles together with their release year, assigned genres, and unique MovieLens identifier. It was gathered from the MovieLens website, which is a movie recommendation service. The movieId variable is the unique identifier of the movie, title shows the original title, and genres lists the assigned genres (the categories of the movie), separated by “|”. The dataset has movies in total and the table below shows the first 5 records.

Links (links.csv)

This file maps each movie to its identifiers in the MovieLens, IMDB, and The Movie Database datasets through the movieId, imdbId, and tmdbId variables, respectively. The dataset has the identifiers of 9742 movies. The table below shows the first 5 records of the related dataset.

Budget and Characters (budget_chars.csv)

The research questions concern the profit of the movies and its possible predictors, such as the budget, the genders of the stars, and the studio. As these data are not available in the given datasets, the variables were scraped from the IMDB database using the imdbId variable in the links.csv file. The imdbId variable is the unique identifier of the movie on IMDB, budget shows the estimated budget of the movie, ww_gross indicates the worldwide gross income, studio shows the company that released the movie, and characters contains the unique IMDB identifiers of the 15 characters that played in the movie. The table below shows the first 5 records of the related dataset.

Character Genders (character_genders.tsv)

The movie pages do not include the genders of the characters, so this information was scraped separately from the characters’ pages according to their profession tags. A character tagged as “actor” is assigned “Male”, a character tagged as “actress” is assigned “Female”, and a character with neither tag is assigned “Unknown”. The table below shows the first 5 records of the related dataset.
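The tagging rule above can be sketched as follows. This is a minimal Python illustration, not the code used for the report. The function name, the identifiers, and the example tags are hypothetical. Checking “actor” before “actress” is an assumption, since the text does not say what happens when a page carries both tags.

```python
def assign_gender(profession_tags):
    """Map a person's profession tags to the gender label described above."""
    tags = {tag.strip().lower() for tag in profession_tags}
    # Assumption: "actor" is checked first, so a page carrying both tags
    # would be labelled "Male"; the report does not cover that case.
    if "actor" in tags:
        return "Male"
    if "actress" in tags:
        return "Female"
    return "Unknown"


# Hypothetical identifiers and tags, for illustration only.
examples = {
    "person_a": ["Actor", "Producer"],
    "person_b": ["Actress"],
    "person_c": ["Director", "Writer"],
}
genders = {person: assign_gender(tags) for person, tags in examples.items()}
```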

3.3 Data Wrangling and Cleaning

As the movies dataset includes both the year and the title in the same column, these were separated into two different variables. The links dataset was then joined to the movies dataset using the common movieId column. The first 5 rows of the merged dataset are printed below.
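The two steps described above (splitting the year out of the title and joining on movieId) might look like the following sketch. It assumes that titles end with the year in parentheses, as in the MovieLens title format, and that a title without such a suffix simply gets no year. The helper names and sample rows are hypothetical, and the left join is an assumption about how the merge was done.

```python
import re

# Matches a trailing four-digit year in parentheses, e.g. "Some Title (1995)".
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title_year(raw_title):
    """Return (title, year); year is None when no trailing "(YYYY)" is found."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("title"), int(match.group("year"))


def join_links(movies, links):
    """Left-join link identifiers onto movie rows by the shared movieId."""
    links_by_id = {row["movieId"]: row for row in links}
    merged = []
    for movie in movies:
        title, year = split_title_year(movie["title"])
        link = links_by_id.get(movie["movieId"], {})
        merged.append({
            "movieId": movie["movieId"],
            "title": title,
            "year": year,
            "genres": movie["genres"],
            "imdbId": link.get("imdbId"),
            "tmdbId": link.get("tmdbId"),
        })
    return merged


# Hypothetical rows, not taken from the actual files.
movies = [
    {"movieId": 1, "title": "Example Movie (1995)", "genres": "Adventure|Comedy"},
    {"movieId": 2, "title": "Untitled Example", "genres": "Drama"},
]
links = [{"movieId": 1, "imdbId": "0000001", "tmdbId": "11"}]
merged = join_links(movies, links)
```

Keeping every movie row even when the year cannot be parsed matches the summary in Section 3.4, where some records have a missing year.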

The research questions are based on the budget of the movies and the effect of this variable on some profit criteria. These relations are also evaluated by genre, main character, and studio. As the budget, profit, list of main characters, and studio names are not present in the given datasets, they were scraped from related databases. As all of these variables are available on IMDB, this database was selected as the source and the data were joined together. The profit was calculated by subtracting the budget from ww_gross. The table below shows 5 rows of the merged table.
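As a small illustration of the profit calculation, the sketch below computes profit as ww_gross minus budget. It assumes that a missing budget or a missing ww_gross makes the profit missing as well, which would explain why profit has more missing values than budget in the summary of Section 3.4. The function name and sample rows are hypothetical.

```python
def compute_profit(ww_gross, budget):
    """Return ww_gross - budget, or None when either value is missing."""
    if ww_gross is None or budget is None:
        return None
    return ww_gross - budget


# Hypothetical rows, for illustration only.
rows = [
    {"imdbId": "0000001", "budget": 1_000_000, "ww_gross": 4_500_000},
    {"imdbId": "0000002", "budget": 2_000_000, "ww_gross": None},
    {"imdbId": "0000003", "budget": None, "ww_gross": 750_000},
]
for row in rows:
    row["profit"] = compute_profit(row["ww_gross"], row["budget"])
```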

3.4 Spotting Mistakes and Missing Data

The numerical summary of the movies shows that all records have complete movieId, title, and studio variables, while 3399 records have a missing budget, 4135 have missing profit data, and 12 have a missing year. In addition, 370 movies have missing data for star genders. As only the first two characters were selected as stars, the maximum value of each gender count is 2.

#>     movieId          title                year     
#>  Min.   :     1   Length:9742        Min.   :1902  
#>  1st Qu.:  3248   Class :character   1st Qu.:1988  
#>  Median :  7300   Mode  :character   Median :1999  
#>  Mean   : 42200                      Mean   :1995  
#>  3rd Qu.: 76232                      3rd Qu.:2008  
#>  Max.   :193609                      Max.   :2018  
#>                                      NA's   :12    
#>     genres              budget             profit         
#>  Length:9742        Min.   :0.00e+00   Min.   :-3.00e+10  
#>  Class :character   1st Qu.:4.85e+06   1st Qu.:-2.57e+06  
#>  Mode  :character   Median :1.50e+07   Median : 1.14e+07  
#>                     Mean   :4.50e+07   Mean   : 4.40e+07  
#>                     3rd Qu.:3.80e+07   3rd Qu.: 6.23e+07  
#>                     Max.   :3.00e+10   Max.   : 2.55e+09  
#>                     NA's   :3399       NA's   :4135       
#>     studio           star_female    star_male    star_unknown
#>  Length:9742        Min.   :0     Min.   :0     Min.   :0    
#>  Class :character   1st Qu.:0     1st Qu.:1     1st Qu.:0    
#>  Mode  :character   Median :1     Median :1     Median :0    
#>                     Mean   :1     Mean   :1     Mean   :0    
#>                     3rd Qu.:1     3rd Qu.:2     3rd Qu.:0    
#>                     Max.   :2     Max.   :2     Max.   :2    
#>                     NA's   :370   NA's   :370   NA's   :370
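The star_female, star_male, and star_unknown columns in the summary above can be understood through the following sketch, which counts the genders of the first two characters of a movie. It is a hedged illustration only. The function name and sample identifiers are hypothetical. Treating a movie without a character list as missing, and a character absent from the gender table as “Unknown”, are both assumptions.

```python
from collections import Counter


def star_gender_counts(character_ids, genders, n_stars=2):
    """Count genders among the first n_stars characters of one movie."""
    if not character_ids:
        # Assumption: no scraped characters means the counts are missing.
        return None
    stars = character_ids[:n_stars]
    # Assumption: a character absent from the gender table counts as "Unknown".
    counts = Counter(genders.get(cid, "Unknown") for cid in stars)
    return {
        "star_female": counts["Female"],
        "star_male": counts["Male"],
        "star_unknown": counts["Unknown"],
    }


# Hypothetical gender table and character list.
genders = {"person_a": "Male", "person_b": "Female", "person_c": "Male"}
counts = star_gender_counts(["person_b", "person_a", "person_c"], genders)
```

Because only two stars are considered per movie, the three counts of a movie always sum to at most 2, in line with the maxima shown in the summary.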

Some movies have very low budgets (for example, below USD 100) but very high income (above USD 10^9). Some of these cases are due to scraping errors and currency differences (for example, ESP20 was extracted as USD 20). As they are clearly outliers, they were removed from further analysis. Movies with a net loss were also removed from the analysis.
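The two removal rules above could be expressed as in the sketch below. The thresholds reuse the example values from the text (USD 100 and USD 10^9) and are assumptions, since the exact cut-offs applied in the analysis are not stated. The function names and sample rows are hypothetical. Rows with a missing budget or gross are left untouched here, which is also an assumption.

```python
def is_implausible(budget, ww_gross, low_budget=100, high_gross=10**9):
    """Flag a tiny budget paired with a very large worldwide gross."""
    if budget is None or ww_gross is None:
        return False  # assumption: rows with missing values are not flagged
    return budget < low_budget and ww_gross > high_gross


def has_net_loss(budget, ww_gross):
    """Flag movies whose worldwide gross is below their budget."""
    if budget is None or ww_gross is None:
        return False
    return ww_gross - budget < 0


# Hypothetical rows, for illustration only.
rows = [
    {"id": "a", "budget": 20, "ww_gross": 2_000_000_000},    # implausible pair
    {"id": "b", "budget": 5_000_000, "ww_gross": 1_000_000},  # net loss
    {"id": "c", "budget": 1_000_000, "ww_gross": 4_000_000},  # kept
    {"id": "d", "budget": None, "ww_gross": 3_000_000},       # kept (missing)
]
kept = [
    row for row in rows
    if not is_implausible(row["budget"], row["ww_gross"])
    and not has_net_loss(row["budget"], row["ww_gross"])
]
kept_ids = [row["id"] for row in kept]
```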