What's the Movie Score?

Wang Siqi

Introduction

IMDb (Internet Movie Database) is a publicly available website where people can rate movies.
Commercial success, the director, the actors, or critic ratings might be factors that influence, but don't guarantee, a good IMDb score.
So the question arises: which features are important in predicting the IMDb score of a movie? And can we predict the score of a movie before its release?

Dataset Description

The dataset is downloaded from Kaggle. The original dataset is available at https://data.world/ and was updated while being uploaded to Kaggle.
It contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. "imdb_score" is the response variable, while the other 27 variables are possible predictors.

Pre-requisite Libraries:

We'll use pandas to load and work with the dataset, matplotlib and seaborn for data visualisation, and scikit-learn's estimators for model building, training and prediction.

Data Loading

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')
In [2]:
movie_df = pd.read_csv('movie_metadata.csv')
In [3]:
movie_df.head()
Out[3]:
color director_name num_critic_for_reviews duration director_facebook_likes actor_3_facebook_likes actor_2_name actor_1_facebook_likes gross genres ... num_user_for_reviews language country content_rating budget title_year actor_2_facebook_likes imdb_score aspect_ratio movie_facebook_likes
0 Color James Cameron 723.0 178.0 0.0 855.0 Joel David Moore 1000.0 760505847.0 Action|Adventure|Fantasy|Sci-Fi ... 3054.0 English USA PG-13 237000000.0 2009.0 936.0 7.9 1.78 33000
1 Color Gore Verbinski 302.0 169.0 563.0 1000.0 Orlando Bloom 40000.0 309404152.0 Action|Adventure|Fantasy ... 1238.0 English USA PG-13 300000000.0 2007.0 5000.0 7.1 2.35 0
2 Color Sam Mendes 602.0 148.0 0.0 161.0 Rory Kinnear 11000.0 200074175.0 Action|Adventure|Thriller ... 994.0 English UK PG-13 245000000.0 2015.0 393.0 6.8 2.35 85000
3 Color Christopher Nolan 813.0 164.0 22000.0 23000.0 Christian Bale 27000.0 448130642.0 Action|Thriller ... 2701.0 English USA PG-13 250000000.0 2012.0 23000.0 8.5 2.35 164000
4 NaN Doug Walker NaN NaN 131.0 NaN Rob Walker 131.0 NaN Documentary ... NaN NaN NaN NaN NaN NaN 12.0 7.1 NaN 0

5 rows × 28 columns

As the loaded pandas DataFrame shows, there are certain expected columns like:

  • director name
  • actors names
  • duration
  • genre
  • gross revenue
  • number of reviews

But there are some interesting columns too:

  • actors' facebook likes (does a social media following affect the score?)
  • language and country (do familiarity with the language and the number of reviews based on country population impact the score?)
  • aspect ratio (does it affect users' ratings?)

Duplicate Removals

Let's check if dataset contains duplicate rows.

In [4]:
sum(movie_df.duplicated())
Out[4]:
45

There are 45 duplicate instances. Let's remove them.
drop_duplicates keeps the first instance by default and drops all other instances.

In [5]:
movie_df.drop_duplicates(inplace=True)
print("Number of duplicates:", movie_df.duplicated().sum())
print("Total number of movies in dataset:", movie_df.shape[0])
Number of duplicates: 0
Total number of movies in dataset: 4998

Data Cleaning

There are a lot of missing values (NaN) in the dataset.
Let's plot a heatmap of the missing values using seaborn.

In [6]:
sns.heatmap(movie_df.isnull(), cbar=False)
Out[6]:
<AxesSubplot:>

As we can see from the heatmap above, the 3 columns with the most NaN values are:

  • gross
  • budget
  • content rating

Now, there are different strategies that can be used to fill missing values for numerical data:

  • We can drop all rows with NaN values.
  • We can replace NaN values with the mean of the non-NaN values in the column.

Further information about handling missing data can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
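The two strategies can be sketched on a toy frame (hypothetical values for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'budget': [100.0, np.nan, 50.0]})

dropped = df.dropna()                                 # strategy 1: drop rows with NaN
filled = df.fillna({'budget': df['budget'].mean()})   # strategy 2: fill with mean (75.0)
```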

Let's see how many rows we'd drop if we dropped all rows with NaN values.

In [7]:
num_rows = movie_df[movie_df.isnull().any(axis=1)].shape[0]
print("We'll drop {}% of rows".format(round(num_rows*100/movie_df.shape[0], 2)))
We'll drop 25.51% of rows
In [8]:
# We can use mean of the column as missing value
print("Mean of column budget: ", round(movie_df['budget'].mean(), 2))
sns.distplot(movie_df['budget'])
Mean of column budget:  39747870.01
Out[8]:
<AxesSubplot:xlabel='budget', ylabel='Density'>

The distribution plot shows most movies have a budget around 0-100 million, so the mean could be used for missing values.

We'll go ahead with dropping the rows instead of replacing them with the mean.

Another option worth exploring (not done here) is replacing missing values with the mean of the values for that movie's genre, as movies of a particular genre tend to have similar ranges of budget and gross. For example, superhero movies might have budgets above 100 million but will probably earn upwards of 700 million, while independent movies will have smaller budgets and smaller gross too.
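A minimal sketch of that genre-based imputation, using the first genre in the '|'-separated string as the grouping key (a simplifying assumption, with made-up numbers):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Action|Thriller', 'Documentary', 'Documentary'],
    'budget': [200e6, np.nan, 1e6, np.nan],
})

# Group by the first listed genre and fill each group's NaNs with its own mean
df['primary_genre'] = df['genres'].str.split('|').str[0]
df['budget'] = df.groupby('primary_genre')['budget'].transform(
    lambda s: s.fillna(s.mean()))
```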

In [9]:
movie_df.dropna(subset=['gross', 'budget'], inplace=True)

Let's fill the NaN values in the other numerical columns with the column means.

In [10]:
num_cols = ['num_critic_for_reviews', 'duration', 'director_facebook_likes',
            'actor_3_facebook_likes', 'actor_1_facebook_likes', 'num_voted_users',
            'cast_total_facebook_likes', 'facenumber_in_poster',
            'num_user_for_reviews', 'actor_2_facebook_likes', 'aspect_ratio',
            'movie_facebook_likes']
for col in num_cols:
    movie_df[col].fillna(movie_df[col].mean(), inplace=True)

Fix movie titles

Movie titles contain some additional whitespace; let's remove it.
The strip function removes leading and trailing characters (whitespace by default) from a string.

In [11]:
movie_df['movie_title'] = movie_df['movie_title'].str.strip()

Let's adjust content rating

Reference: https://en.wikipedia.org/wiki/Motion_picture_content_rating_system#United_States
There are 5 content ratings available:

  • G (General Audiences) – All ages admitted.
  • PG (Parental Guidance Suggested) – Some material may not be suitable for children.
  • PG-13 (Parents Strongly Cautioned) – Some material may be inappropriate for children under 13.
  • R (Restricted) – Under 17 requires accompanying parent or adult guardian.
  • NC-17 (Adults Only) – No one 17 and under admitted.

Let's replace M and GP with PG, replace X with NC-17 and replace “Approved”, “Not Rated”, “Passed”, “Unrated” with the most common rating “R”.

In [12]:
def update_rating(rating):
    if rating == 'M' or rating == 'GP':
        return 'PG'
    elif rating == 'X':
        return 'NC-17'
    elif rating in ['Approved', 'Not Rated', 'Passed', 'Unrated']:
        return 'R'
    return rating
In [13]:
movie_df['content_rating'] = movie_df['content_rating'].apply(lambda x: update_rating(x))
In [14]:
movie_df['content_rating'].fillna('R', inplace=True)
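As an aside, the same mapping plus fill can be written more compactly with pandas' replace; a sketch equivalent to the cells above, on toy data:

```python
import pandas as pd

rating_map = {'M': 'PG', 'GP': 'PG', 'X': 'NC-17',
              'Approved': 'R', 'Not Rated': 'R', 'Passed': 'R', 'Unrated': 'R'}

ratings = pd.Series(['M', 'X', 'PG-13', None])     # toy ratings
ratings = ratings.replace(rating_map).fillna('R')  # map, then fill NaN with 'R'
```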

Data Pre-Processing

A straightforward additional feature we can create is: profit = gross - budget

In [15]:
movie_df['profit'] = movie_df['gross'] - movie_df['budget']

The other column we can add is whether the movie was successful or not.
Let's keep it simple: if the profit is positive, the movie is a success, else it's a flop.

In [16]:
movie_df['success'] = movie_df['profit'].apply(lambda x: 1 if x > 0 else 0)

Does the color of a movie matter?

The data contains both black & white and color movies. Let's compare the average IMDb score of black & white movies vs. color movies and check whether color impacts the score. If the averages are similar, we'll drop the column.

In [17]:
movie_df['color'].unique()
Out[17]:
array(['Color', ' Black and White', nan], dtype=object)
In [18]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(12, 8))
sns.boxplot(x='imdb_score', data=movie_df[movie_df['color']=='Color'], ax= axes[0])
sns.boxplot(x='imdb_score', data=movie_df[movie_df['color']==' Black and White'], ax= axes[1])
Out[18]:
<AxesSubplot:xlabel='imdb_score'>

Scores for both color and black & white films cluster in the 6.5-7.25 range, so the column can be removed.

Does the language of a movie matter?

The dataset contains movies in different languages. Let's check whether the language of a movie impacts the user score. First we'll check the ratio of non-English to English movies in the dataset, then compare the mean scores of non-English and English movies.

In [19]:
movie_df['language'].unique()
Out[19]:
array(['English', 'Mandarin', 'Aboriginal', 'Spanish', 'French',
       'Filipino', 'Maya', 'Kazakh', 'Telugu', 'Cantonese', 'Japanese',
       'Aramaic', 'Italian', 'Dutch', 'Dari', 'German', 'Mongolian',
       'Thai', 'Bosnian', 'Korean', 'Hungarian', 'Hindi', nan,
       'Icelandic', 'Danish', 'Portuguese', 'Norwegian', 'Czech',
       'Russian', 'None', 'Zulu', 'Hebrew', 'Dzongkha', 'Arabic',
       'Vietnamese', 'Indonesian', 'Romanian', 'Persian', 'Swedish'],
      dtype=object)
In [20]:
english = movie_df[movie_df['language'] == 'English']
non_english = movie_df[movie_df['language'] != 'English']
In [21]:
print("Non-English to English movie ratio in dataset: ", round(non_english.shape[0]/english.shape[0], 4))
Non-English to English movie ratio in dataset:  0.0498
In [22]:
_, axes = plt.subplots(1, 2, sharey=True, figsize=(12, 8))
sns.boxplot(x='imdb_score', data=english, ax= axes[0])
sns.boxplot(x='imdb_score', data=non_english, ax= axes[1])
Out[22]:
<AxesSubplot:xlabel='imdb_score'>

The dataset contains roughly 95% English titles, and mean scores for both groups are in the 6.5-7.25 range, so the language column can be dropped.

Let's check aspect ratio outliers

In [23]:
movie_df['aspect_ratio'].unique()
Out[23]:
array([ 1.78      ,  2.35      ,  1.85      ,  2.        ,  2.2       ,
        2.39      ,  2.24      ,  1.66      ,  1.5       ,  1.77      ,
        2.4       ,  1.37      ,  2.10941316,  2.76      ,  1.33      ,
        1.18      ,  2.55      ,  1.75      , 16.        ])
In [24]:
g= sns.countplot(x='aspect_ratio', data=movie_df)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
plt.show()

Aspect ratio 16 is an outlier in this dataset. Probably the user meant a 16:9 aspect ratio, but we can't be sure, so let's set it to the most common aspect ratio, 2.35.

In [25]:
movie_df['aspect_ratio'] = movie_df['aspect_ratio'].apply(lambda x: 2.35 if x == 16 else x)

Does the genre of a movie impact the IMDb score?

One movie can belong to multiple genres, represented in the dataset as a multi-value string separated by '|'. Let's check whether genre impacts imdb_score. If it does, we'll have to split the genre values and duplicate each movie once per genre.

We'll create a new dataframe containing genre and imdb_score, and then plot the values for each genre to check the distribution of scores by genre.

In [26]:
genres = []
scores = []

for idx, row in movie_df.iterrows():
    gnrs = row['genres'].split('|')
    for gnr in gnrs:
        genres.append(gnr)
        scores.append(row['imdb_score'])
In [27]:
genre_df = pd.DataFrame(columns=['genre', 'score'])
genre_df['genre'] = genres
genre_df['score'] = scores
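The iterrows loop above works, but the same genre_df can be built more idiomatically with `str.split` plus `explode` (available since pandas 0.25); a sketch on toy rows:

```python
import pandas as pd

df = pd.DataFrame({'genres': ['Action|Sci-Fi', 'Drama'],
                   'imdb_score': [7.9, 6.5]})

# Split each genres string into a list, then emit one row per list element
genre_df = (df.assign(genre=df['genres'].str.split('|'))
              .explode('genre')
              .rename(columns={'imdb_score': 'score'})[['genre', 'score']])
```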

Genre is a categorical column. Seaborn provides a function called 'catplot' to plot categorical data. Further information can be found @ https://seaborn.pydata.org/generated/seaborn.catplot.html

In [28]:
sns.catplot(x='genre', y='score', data=genre_df)
Out[28]:
<seaborn.axisgrid.FacetGrid at 0x7f2f0ddcda90>

As we can see from the plot, scores are similarly distributed across most genres, so the score is not strongly impacted by the genre of the movie and there is no need to split genres.

Let's drop the language and color columns

In [29]:
movie_df.drop(columns=['color', 'language'], inplace=True)
In [30]:
movie_df.columns
Out[30]:
Index(['director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes', 'profit', 'success'],
      dtype='object')

There are still some columns with missing values; these are all text columns. Let's replace the missing values with the empty string ''.

In [31]:
movie_df.columns[movie_df.isnull().any()]
Out[31]:
Index(['actor_2_name', 'actor_1_name', 'actor_3_name', 'plot_keywords'], dtype='object')
In [32]:
for col in ['actor_2_name', 'actor_1_name', 'actor_3_name', 'plot_keywords']:
    movie_df[col].fillna('', inplace=True)
In [33]:
movie_df.columns[movie_df.isnull().any()]
Out[33]:
Index([], dtype='object')

Data Visualisations

Let's look at the most frequently occurring directors and actors

In [34]:
fig, ax = plt.subplots(2, 2, figsize=(10,8))
sns.countplot(y='director_name', data=movie_df,
            order=movie_df.director_name.value_counts(ascending=False).iloc[:10].index,
            ax=ax[0,0])
sns.countplot(y='actor_1_name', data=movie_df,
            order=movie_df.actor_1_name.value_counts(ascending=False).iloc[:10].index,
            ax=ax[0,1])
sns.countplot(y='actor_2_name', data=movie_df,
            order=movie_df.actor_2_name.value_counts(ascending=False).iloc[:10].index,
            ax=ax[1,0])
sns.countplot(y='actor_3_name', data=movie_df,
            order=movie_df.actor_3_name.value_counts(ascending=False).iloc[:10].index,
            ax=ax[1,1])
fig.tight_layout()
plt.show()

Let's plot the 10 most profitable movies

In [35]:
profit_df = movie_df[movie_df['profit'] > movie_df.profit.nlargest(11).values[-1]]
In [36]:
plt.plot(profit_df['movie_title'], profit_df['profit'])
plt.xticks(rotation='vertical')
plt.show()

Do Social Media Likes impact the score?

Let's plot the facebook likes against the IMDb score and see whether social media likes impact the score. We'll use scatter plots to plot all the values. Further information about scatterplot can be found @ https://seaborn.pydata.org/generated/seaborn.scatterplot.html

In [37]:
fig, ax = plt.subplots(2, 3, figsize=(10,8))
sns.scatterplot(data=movie_df, x="movie_facebook_likes", y="imdb_score", ax=ax[0, 0])
sns.scatterplot(data=movie_df, x="director_facebook_likes", y="imdb_score", ax=ax[0, 1])
sns.scatterplot(data=movie_df, x="cast_total_facebook_likes", y="imdb_score", ax=ax[0, 2])
sns.scatterplot(data=movie_df, x="actor_1_facebook_likes", y="imdb_score", ax=ax[1, 0])
sns.scatterplot(data=movie_df, x="actor_2_facebook_likes", y="imdb_score", ax=ax[1, 1])
sns.scatterplot(data=movie_df, x="actor_3_facebook_likes", y="imdb_score", ax=ax[1, 2])
fig.tight_layout()
plt.show()

Surprisingly, the score appears to rise with the number of facebook likes for the director and actor 3. This doesn't hold for actor 1 and actor 2, and consequently not for the total cast likes or the movie likes.

Does the number of reviews influence the score?

Let's again plot the score, this time against the numbers of critic and user reviews. The aim is to find out whether the score is directly proportional to the number of votes.

In [38]:
fig, ax = plt.subplots(1, 3, figsize=(10,8))
sns.scatterplot(data=movie_df, x="num_critic_for_reviews", y="imdb_score", ax=ax[0])
sns.scatterplot(data=movie_df, x="num_voted_users", y="imdb_score", ax=ax[1])
sns.scatterplot(data=movie_df, x="num_user_for_reviews", y="imdb_score", ax=ax[2])
fig.tight_layout()
plt.show()

It seems the number of votes directly affects the score.

Final Pre-processing before model creation

Let's remove all the name columns

In [39]:
movie_df.columns
Out[39]:
Index(['director_name', 'num_critic_for_reviews', 'duration',
       'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
       'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name',
       'movie_title', 'num_voted_users', 'cast_total_facebook_likes',
       'actor_3_name', 'facenumber_in_poster', 'plot_keywords',
       'movie_imdb_link', 'num_user_for_reviews', 'country', 'content_rating',
       'budget', 'title_year', 'actor_2_facebook_likes', 'imdb_score',
       'aspect_ratio', 'movie_facebook_likes', 'profit', 'success'],
      dtype='object')

For this simple experiment, we're going to drop most of the text columns, like movie title and actor names.
Further analysis could examine whether the presence of a certain director or actor impacts the score; this is partly captured by popularity metrics like the number of facebook likes.

In [40]:
columns = ['director_name', 'actor_2_name', 'actor_1_name', 'movie_title', 'actor_3_name', 'genres', 'country', 'movie_imdb_link', 'plot_keywords']
In [41]:
movie_df.drop(columns=columns, inplace=True)

Let's check correlated features

Correlated features in general don't improve models (although it depends on the specifics of the problem like the number of variables and the degree of correlation), but they affect specific models in different ways and to varying extents:

  • For linear models (e.g., linear regression or logistic regression), multicollinearity can yield solutions that are wildly varying and possibly numerically unstable.

  • Random forests can be good at detecting interactions between different features, but highly correlated features can mask these interactions.

Further information can be found @ https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features#:~:text=Correlated%20features%20will%20not%20always,Make%20the%20learning%20algorithm%20faster

We'll use pandas' corr function to find the pairwise correlation between columns. The default method is Pearson correlation. In summary, the Pearson correlation between two variables is the covariance of the two variables divided by the product of their standard deviations (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). We can use other correlation methods like Kendall, Spearman, or a custom callable. Further information can be found @ https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
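That definition can be checked directly with numpy (toy vectors for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

# Pearson r = cov(x, y) / (std_x * std_y)
r = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
```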

In [42]:
plt.figure(figsize=(12,10))
cor = movie_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

There are some features which are highly correlated and can be removed:

  • First and obvious: budget and profit, with an absolute correlation of 0.95
  • num_user_for_reviews and num_voted_users, with 0.78
  • actor_1_facebook_likes and cast_total_facebook_likes, with 0.95

Let's remove cast_total_facebook_likes and keep the individual actors' likes, combining actor 2 and actor 3 likes into one feature. We'll also remove the profit column, keep num_voted_users, and create a new feature: the ratio of the number of critic reviews to the number of user reviews.

In [43]:
movie_df.drop(columns=['profit'], inplace=True)
In [44]:
movie_df['non_prim_cast_likes'] = movie_df['actor_2_facebook_likes'] + movie_df['actor_3_facebook_likes']
movie_df.drop(columns=['cast_total_facebook_likes'], inplace=True)
In [45]:
movie_df['votes_ratio'] = movie_df['num_critic_for_reviews'] / movie_df['num_user_for_reviews']
In [46]:
movie_df.drop(columns=['actor_2_facebook_likes', 'actor_3_facebook_likes', 'num_critic_for_reviews', 'num_user_for_reviews'], inplace=True)
In [47]:
#Let's again look at correlation
plt.figure(figsize=(12,10))
cor = movie_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()

There are still some correlated features, like num_voted_users with gross and num_voted_users with movie_facebook_likes, but we'll ignore them.

In [48]:
movie_df.columns
Out[48]:
Index(['duration', 'director_facebook_likes', 'actor_1_facebook_likes',
       'gross', 'num_voted_users', 'facenumber_in_poster', 'content_rating',
       'budget', 'title_year', 'imdb_score', 'aspect_ratio',
       'movie_facebook_likes', 'success', 'non_prim_cast_likes',
       'votes_ratio'],
      dtype='object')

content_rating is a categorical (non-numerical) feature. Certain libraries like CatBoost can handle categorical features directly, but most algorithms need numerical values, so we need to convert the categorical variable into some numerical form.

The simplest form is one-hot encoding: each category value is converted into a new column that is assigned a 1 or 0 (true/false) value. https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd is a nice article explaining how to handle categorical features.

pandas provides the get_dummies function to convert a categorical variable into dummy/indicator variables (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).
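A toy illustration of what get_dummies produces (hypothetical ratings):

```python
import pandas as pd

df = pd.DataFrame({'content_rating': ['PG', 'R', 'PG']})
encoded = pd.get_dummies(df, columns=['content_rating'])
# One indicator column per category; each row has exactly one true value
```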

In [49]:
movie_df = pd.get_dummies(movie_df, columns=['content_rating'])
In [50]:
movie_df.columns
Out[50]:
Index(['duration', 'director_facebook_likes', 'actor_1_facebook_likes',
       'gross', 'num_voted_users', 'facenumber_in_poster', 'budget',
       'title_year', 'imdb_score', 'aspect_ratio', 'movie_facebook_likes',
       'success', 'non_prim_cast_likes', 'votes_ratio', 'content_rating_G',
       'content_rating_NC-17', 'content_rating_PG', 'content_rating_PG-13',
       'content_rating_R'],
      dtype='object')

Linear Regression

This is a regression task, as the target variable is continuous-valued rather than discrete. Let's start with the simplest regression estimator: the linear estimator. Scikit-learn provides ordinary least squares linear regression, which fits a linear model to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
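For intuition, OLS has a closed-form solution w = (XᵀX)⁻¹Xᵀy; a minimal numpy sketch on synthetic data (LinearRegression solves the same minimization, though via a more numerically stable least-squares routine):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)  # targets with a little noise

# Solve the normal equations X^T X w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)         # recovers roughly [1.5, -2.0, 0.5]
```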

In [51]:
y = movie_df['imdb_score']
movie_df.drop(columns=['imdb_score'], inplace=True)
In [52]:
from sklearn.model_selection import train_test_split
In [53]:
X_train, X_test, y_train, y_test = train_test_split(movie_df, y, test_size=0.33, random_state=42)

We'll use mean squared error (MSE) to measure how close the predictions are to the actual scores. Other metrics like mean absolute error (MAE) or root mean squared error (RMSE) could also be used.
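These metrics are simple enough to compute by hand; a quick sketch on toy values:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)   # average squared error
mae = np.mean(np.abs(y_true - y_pred))  # average absolute error
rmse = np.sqrt(mse)                     # back in the units of the score
```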

In [54]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
In [55]:
lreg= LinearRegression()
In [56]:
lreg.fit(X_train, y_train)
Out[56]:
LinearRegression()
In [57]:
pred = lreg.predict(X_test)
In [58]:
err = mean_squared_error(y_test, pred)
In [59]:
print("Mean Squared Error:", err)
Mean Squared Error: 0.7746858410215137
In [61]:
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(pred[i], 2), list(y_test)[i]))
Prediction: 6.54 vs Actual: 6.6
Prediction: 6.72 vs Actual: 4.9
Prediction: 9.24 vs Actual: 8.4
Prediction: 6.45 vs Actual: 4.9
Prediction: 6.33 vs Actual: 6.3
Prediction: 5.86 vs Actual: 4.6
Prediction: 6.31 vs Actual: 5.7
Prediction: 5.96 vs Actual: 4.9
Prediction: 6.05 vs Actual: 6.2
Prediction: 7.68 vs Actual: 8.0

Decision Tree

Linear regression is a very simple model. Let's try something a bit more complex, like a decision tree. A decision tree creates a model that predicts the value of a target variable by learning simple decision rules inferred from the data features (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).

In [62]:
from sklearn.tree import DecisionTreeRegressor
In [63]:
treg = DecisionTreeRegressor(random_state=42)
In [64]:
treg.fit(X_train, y_train)
Out[64]:
DecisionTreeRegressor(random_state=42)
In [65]:
tpred = treg.predict(X_test)
In [66]:
terr = mean_squared_error(y_test, tpred)
In [67]:
print("Mean squared error for decision tree: ", terr)
Mean squared error for decision tree:  1.0651688923802043
In [68]:
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(tpred[i], 2), list(y_test)[i]))
Prediction: 6.5 vs Actual: 6.6
Prediction: 6.5 vs Actual: 4.9
Prediction: 8.4 vs Actual: 8.4
Prediction: 6.9 vs Actual: 4.9
Prediction: 5.3 vs Actual: 6.3
Prediction: 4.9 vs Actual: 4.6
Prediction: 6.5 vs Actual: 5.7
Prediction: 5.6 vs Actual: 4.9
Prediction: 7.1 vs Actual: 6.2
Prediction: 7.7 vs Actual: 8.0

Random Forest

Random forest is an ensemble of decision trees: it fits multiple trees on different subsets of the data and averages their predictions to get better results and to avoid the overfitting that single decision trees are prone to (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).

In [72]:
from sklearn.ensemble import RandomForestRegressor
In [73]:
rfreg = RandomForestRegressor(random_state=42)
In [74]:
rfreg.fit(X_train, y_train)
Out[74]:
RandomForestRegressor(random_state=42)
In [75]:
rfpred = rfreg.predict(X_test)
In [76]:
rferr = mean_squared_error(y_test, rfpred)
In [77]:
print("Mean squared error for random forest: ", rferr)
Mean squared error for random forest:  0.5339812160251375

As can be seen from the error values, the random forest has the lowest MSE, even without fine-tuning.

In [79]:
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(rfpred[i], 2), list(y_test)[i]))
Prediction: 6.95 vs Actual: 6.6
Prediction: 6.52 vs Actual: 4.9
Prediction: 8.22 vs Actual: 8.4
Prediction: 5.89 vs Actual: 4.9
Prediction: 5.62 vs Actual: 6.3
Prediction: 5.33 vs Actual: 4.6
Prediction: 6.14 vs Actual: 5.7
Prediction: 5.57 vs Actual: 4.9
Prediction: 6.29 vs Actual: 6.2
Prediction: 7.99 vs Actual: 8.0

K-fold Cross Validation

In K-fold cross-validation, the data is divided into k subsets. Each time, one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to gauge the overall effectiveness of the model.

As can be seen, every data point is in the validation set exactly once and in the training set k-1 times. This reduces bias, since most of the data is used for fitting, and reduces variance, since all of the data eventually gets used for validation.
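The splitting logic described above can be sketched in plain Python (scikit-learn's KFold does the same, plus optional shuffling):

```python
def kfold_indices(n, k):
    """Yield (train, validation) index lists for k contiguous folds over range(n)."""
    # Distribute any remainder so fold sizes differ by at most one
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size
```

Each index lands in exactly one validation fold and in k-1 training folds.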

GridSearchCV

sklearn's GridSearchCV performs an exhaustive search over hyperparameter values, and can run K-fold cross-validation for each combination. At the end of the search, GridSearchCV gives us the hyperparameter values that achieve the least error / highest accuracy.

We'll use GridSearchCV to search hyperparameter values for the random forest, with 3-fold cross-validation at each step.

In [80]:
from sklearn.model_selection import GridSearchCV
In [81]:
parameters = {'n_estimators':[100, 150, 200, 250, 500],
              'max_depth': [3, 5, 6, 10],
              'max_features':[5, 8, 12, 18]
             }
In [82]:
rfreg = RandomForestRegressor(random_state=42)
In [83]:
clf = GridSearchCV(rfreg, param_grid=parameters, scoring='neg_mean_squared_error', cv=3, verbose=1)
In [84]:
clf.fit(X_train, y_train)
Fitting 3 folds for each of 80 candidates, totalling 240 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed:  4.7min finished
Out[84]:
GridSearchCV(cv=3, estimator=RandomForestRegressor(random_state=42),
             param_grid={'max_depth': [3, 5, 6, 10],
                         'max_features': [5, 8, 12, 18],
                         'n_estimators': [100, 150, 200, 250, 500]},
             scoring='neg_mean_squared_error', verbose=1)
In [85]:
clf.best_params_, clf.best_score_
Out[85]:
({'max_depth': 10, 'max_features': 12, 'n_estimators': 500},
 -0.49615315933536813)

The random forest estimator also provides an importance value for each feature. Let's plot the feature importances and see which features matter.

In [86]:
importances = clf.best_estimator_.feature_importances_
indices = np.argsort(importances)

plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [movie_df.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

Conclusion

As we can see from the importance plot above, the most important feature in this dataset was the number of votes; the number of votes is directly related to higher or lower scores. Duration also seems important (a longer movie might feel boring and hurt the user score), as does budget: big-budget movies are usually visually spectacular, with great production value, good CGI and better actors. That can also result in better earnings, as gross is another important feature here.

Another likely insight is that social media affects the score directly: the popularity of the movie, director and crew on social media seems to be directly linked to the movie's score. The critical question is: can social media influence be used to manipulate the score of a not-so-good movie, or to downgrade the score of a good one?

Content rating doesn't seem to matter much. I haven't explored the features like keywords much, but similar exploration can be done with other features.

In summary, the exploration seems to suggest a direct correlation between the popularity of the director, crew and movie, along with its budget, and the number of votes it gets and, in turn, the score/rating it receives.