4 Exploratory data analysis

  • Mapping out the underlying structure
  • Identifying the most important variables
  • Univariate visualizations
  • Multivariate visualizations
  • Summary tables

4.0.1 Mapping out the underlying structure

The final dataset has 3,704 movie records with 10 variables. movieId is the movie's identifier on MovieLens, title is the original title, genres lists the pipe-separated genre tags, year is the release year, budget is the estimated budget of the movie, profit is the worldwide gross profit, and studio names the company or companies that released the movie. The last three integer columns (star_female, star_male, star_unknown) count the genders of the first two characters (stars) of the movie.

#> tibble [3,704 x 10] (S3: tbl_df/tbl/data.frame)
#>  $ movieId     : num [1:3704] 1 2 3 4 5 6 9 10 11 16 ...
#>  $ title       : chr [1:3704] "Toy Story" "Jumanji" "Grumpier Old "..
#>  $ genres      : chr [1:3704] "Adventure|Animation|Children|Comedy"..
#>  $ year        : num [1:3704] 1995 1995 1995 1995 1995 ...
#>  $ budget      : num [1:3704] 3.0e+07 6.5e+07 2.5e+07 1.6e+07 3.0e+..
#>  $ profit      : num [1:3704] 3.75e+08 1.98e+08 4.65e+07 6.55e+07 4..
#>  $ studio      : chr [1:3704] "Walt Disney Pictures, Pixar Animati"..
#>  $ star_female : int [1:3704] 0 0 0 2 1 0 0 0 1 1 ...
#>  $ star_male   : int [1:3704] 2 2 2 0 1 2 2 2 1 1 ...
#>  $ star_unknown: int [1:3704] 0 0 0 0 0 0 0 0 0 0 ...
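
Since the three star columns describe the first two stars of each movie, their row sums should equal 2. The following is a minimal sketch of that sanity check in base R, using only the first five rows visible in the output above; the object names `movies`, `star_total` and `bad_rows` are illustrative and not taken from the original analysis.

```r
# First five rows of the star columns, copied from the str() output above.
movies <- data.frame(
  movieId      = c(1, 2, 3, 4, 5),
  star_female  = c(0L, 0L, 0L, 2L, 1L),
  star_male    = c(2L, 2L, 2L, 0L, 1L),
  star_unknown = c(0L, 0L, 0L, 0L, 0L)
)

# Assumption: every movie contributes exactly two stars across the three counts.
star_total <- with(movies, star_female + star_male + star_unknown)

# Rows that violate the assumption; ideally this data frame is empty.
bad_rows <- movies[star_total != 2L, ]
print(nrow(bad_rows))
```

On the full dataset, any row returned in `bad_rows` would point to a movie with missing or extra cast information.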

4.0.2 Identifying the most important variables

The main question concerns the relationship between budget and profit, so the most important variables are budget and profit; the latter is based on worldwide income (ww_gross).

4.0.3 Univariate and Multivariate Visualizations

The budgets and worldwide revenues of the films lie predominantly between $10 million and $100 million. Both distributions are negatively skewed (left-skewed) and only approximately normal.

Film budgets and worldwide revenue appear to be correlated on a logarithmic scale.
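
A correlation on the logarithmic scale can be quantified by correlating the log-transformed variables. The sketch below is illustrative only: it uses the four complete budget and profit values (rounded) that are visible in the structure output of Section 4.0.1, and the names `positive`, `log_cor` and `fit` are assumptions rather than objects from the original analysis.

```r
# Rounded values of the first four movies, taken from the str() output.
movies <- data.frame(
  budget = c(3.0e7, 6.5e7, 2.5e7, 1.6e7),
  profit = c(3.75e8, 1.98e8, 4.65e7, 6.55e7)
)

# log10() is undefined for zero or negative values (e.g. loss-making films),
# so only rows with positive budget and profit are kept.
positive <- movies[movies$budget > 0 & movies$profit > 0, ]

# Pearson correlation between the log-transformed variables.
log_cor <- cor(log10(positive$budget), log10(positive$profit))

# Equivalent log-log linear model; the slope is an elasticity-style estimate.
fit <- lm(log10(profit) ~ log10(budget), data = positive)

print(log_cor)
print(coef(fit))
```

Four rows are far too few to draw conclusions; the point is only to show the transformation that a log-log scatter plot implies.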

One of the entry barriers to this industry is the budget needed to produce a film. A budget of around $1 million seems to be the minimum needed to produce a movie with a realistic chance of a decent return on investment. Our data also show that this industry can generate very large profits: some of the most lucrative movies have a profit exceeding $1 billion.

As in any industry, and as might reasonably be expected, there is a positive correlation between a film's budget and its overall profit. This correlation tends to be stronger for larger budgets. However, cinema appears to be an ultra-competitive environment in which money does not crush competition as it can in other industries: many lower-budget films do extremely well financially.

When studios are ranked by number of films, Warner Bros, Universal Pictures and Columbia Pictures take the top three places.

According to the data used, the top 5 studios account for about 10% of all films. The film market therefore seems to be characterized by a few "Goliaths" facing countless "Davids", that is, a few leaders against many smaller competitors. The bigger a studio is, the better its chances of success (because of the high budgets available to it); however, small companies can also enter and stay in this market by focusing on other strengths such as creativity.
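
The share held by the leading studios can be computed by counting films per studio. The sketch below is one possible approach, not necessarily the one used for the figure above: it assumes that the studio column holds comma-separated names (as the structure output suggests), that a film counts once for each studio it lists, and that the share is measured as the fraction of films involving at least one top studio. The studio names and the variables `per_movie`, `counts`, `top_n` and `top_share` are hypothetical.

```r
# Hypothetical toy data in the same comma-separated format as `studio`.
studio <- c(
  "Studio A, Studio B",
  "Studio A",
  "Studio C",
  "Studio B, Studio D",
  "Studio E"
)

# Split each record into individual studio names and trim whitespace.
per_movie <- lapply(strsplit(studio, ",", fixed = TRUE), trimws)

# Count each studio at most once per movie, then rank by number of films.
counts <- sort(table(unlist(lapply(per_movie, unique))), decreasing = TRUE)

# Keep the top_n studios (5 in the text; 2 here because the toy data is tiny).
top_n <- 2
top_studios <- names(counts)[seq_len(min(top_n, length(counts)))]

# Share of movies that involve at least one of the top studios.
in_top <- vapply(per_movie, function(s) any(s %in% top_studios), logical(1))
top_share <- mean(in_top)

print(counts)
print(top_share)
```

Splitting on commas would break studio names that themselves contain a comma, so real data may need a more careful separator rule.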

It seems that once a miracle recipe for profit is found, it leaves little room for diversification, especially when it comes to the genre of the film. In other words, studios seem to have realized that some genres are more likely than others to appeal to audiences, which appears to reduce the variety of films produced.

Drama is the most common genre, followed by Comedy, Thriller and Action. Documentary and Film-Noir are the least frequently used tags.

Many films are tagged with more than one genre. The correlation chart below shows how often genres are used together. As expected, the Children and Animation genres often appear together, while Thriller and Comedy rarely do.
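
Co-occurrence of genres can be derived from the pipe-separated genres column by building a movie-by-genre indicator matrix. The sketch below uses hypothetical toy strings in the same format; the names `indicator`, `co_counts` and `co_cor` are illustrative, and the original chart may have been produced differently.

```r
# Hypothetical toy data in the same pipe-separated format as `genres`.
genres <- c(
  "Animation|Children|Comedy",
  "Animation|Children",
  "Comedy|Drama",
  "Drama|Thriller",
  "Drama"
)

# Split each string into its individual genre tags.
tags <- strsplit(genres, "|", fixed = TRUE)
all_tags <- sort(unique(unlist(tags)))

# Movie-by-genre indicator matrix: 1 if the movie carries the tag, else 0.
indicator <- t(vapply(
  tags,
  function(g) as.integer(all_tags %in% g),
  integer(length(all_tags))
))
colnames(indicator) <- all_tags

# Entry [i, j] is the number of movies tagged with both genre i and genre j.
co_counts <- crossprod(indicator)

# Pairwise correlation between genre indicators, as a correlation chart plots.
co_cor <- cor(indicator)

print(co_counts)
print(round(co_cor, 2))
```

The diagonal of `co_counts` gives the number of movies per genre, so the same matrix also yields the genre frequencies discussed above.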

4.0.4 Summary Tables

The summary table below includes the number of movies produced by each studio (n_movies), the mean (mean_budget) and standard deviation (sd_budget) of their budgets, and the mean (mean_income) and standard deviation (sd_income) of their income.

The table below contains counts and financial summaries of the films, grouped by the gender combination of their stars.
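
A gender-combination label can be derived from the three star count columns before grouping. The sketch below uses the first four rows visible in the structure output of Section 4.0.1; the label strings and the names `star_combo`, `n_movies` and `mean_budget` are assumptions for illustration, not the labels used in the original table.

```r
# First four rows of the relevant columns, copied from the str() output.
movies <- data.frame(
  star_female  = c(0L, 0L, 0L, 2L),
  star_male    = c(2L, 2L, 2L, 0L),
  star_unknown = c(0L, 0L, 0L, 0L),
  budget       = c(3.0e7, 6.5e7, 2.5e7, 1.6e7)
)

# Hypothetical labels: any unknown gender goes to its own group.
movies$star_combo <- with(movies, ifelse(
  star_unknown > 0, "unknown",
  ifelse(star_female == 2, "two female",
         ifelse(star_male == 2, "two male", "one female, one male"))
))

# Number of movies and mean budget per gender combination.
n_movies <- table(movies$star_combo)
mean_budget <- tapply(movies$budget, movies$star_combo, mean)

print(n_movies)
print(mean_budget)
```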

The summary table below includes the number of movies tagged with each genre (n_movies), the mean (mean_budget) and standard deviation (sd_budget) of their budgets, and the mean (mean_income) and standard deviation (sd_income) of their income.
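
Because a movie can carry several genre tags, a per-genre summary needs one row per movie and genre pair before aggregating. The sketch below shows one way to build such a table in base R. The data are hypothetical, `income` is an assumed column name, and the result only mirrors the column names described above rather than reproducing the original table.

```r
# Hypothetical toy data; `income` is an assumed column name.
movies <- data.frame(
  genres = c("Comedy|Drama", "Drama", "Comedy", "Drama|Thriller"),
  budget = c(10, 20, 30, 40) * 1e6,
  income = c(50, 30, 90, 70) * 1e6,
  stringsAsFactors = FALSE
)

# One row per (movie, genre) pair, so multi-genre movies count once per tag.
tags <- strsplit(movies$genres, "|", fixed = TRUE)
long <- data.frame(
  genre  = unlist(tags),
  budget = rep(movies$budget, lengths(tags)),
  income = rep(movies$income, lengths(tags)),
  stringsAsFactors = FALSE
)

# Summarise each genre group into a single row.
summary_table <- do.call(rbind, lapply(split(long, long$genre), function(g) {
  data.frame(
    genre       = g$genre[1],
    n_movies    = nrow(g),
    mean_budget = mean(g$budget),
    sd_budget   = sd(g$budget),  # NA when a genre has a single movie
    mean_income = mean(g$income),
    sd_income   = sd(g$income),
    stringsAsFactors = FALSE
  )
}))

# Most frequent genres first.
summary_table <- summary_table[order(-summary_table$n_movies), ]
print(summary_table)
```

The same split-and-summarise pattern would apply to the studio table, with the studio column split on its own separator instead of the pipe.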