IMDb (Internet Movie Database) is a publicly available movie rating website where people can rate movies. Commercial success, director, actors, or critic ratings might influence a movie's IMDb score, but none of them guarantees a good one.
So the question arises: which features can be used, or are important, for predicting the IMDb score of a movie? And can we predict the score of a movie before its release?
The dataset was downloaded from Kaggle. The original dataset is available at https://data.world/ and was updated while being uploaded to Kaggle.
It contains 28 variables for 5043 movies, spanning 100 years and 66 countries. There are 2399 unique director names and thousands of actors/actresses. "imdb_score" is the response variable, while the other 27 variables are possible predictors.
We'll use pandas to load and work with the dataset, matplotlib and seaborn for data visualisation, and scikit-learn's estimators for model building, training and prediction.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
movie_df = pd.read_csv('movie_metadata.csv')
movie_df.head()
As the loaded pandas DataFrame shows, there are certain expected columns like:
But there are certain interesting columns too:
Let's check if dataset contains duplicate rows.
sum(movie_df.duplicated())
There are 45 duplicate rows. Let's remove them.
By default, drop_duplicates keeps the first instance and drops all the others.
movie_df.drop_duplicates(inplace=True)
print("Number of duplicates:", movie_df.duplicated().sum())
print("Total number of movies in dataset:", movie_df.shape[0])
There are a lot of missing values (NaN) in the dataset.
Let's plot a heatmap of the missing values using seaborn.
sns.heatmap(movie_df.isnull(), cbar=False)
As we can see from the above heatmap, the 3 columns with the most NaN values are:
Now there are different strategies that can be used to fill missing values for numerical data.
Further information about handling missing data can be found at https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
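For instance, a few common fill strategies on a toy Series (the values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# A toy column with missing values, to illustrate common fill strategies.
s = pd.Series([10.0, np.nan, 30.0, np.nan, 50.0])

mean_filled = s.fillna(s.mean())      # replace NaN with the column mean (30.0)
median_filled = s.fillna(s.median())  # median is more robust to outliers
ffilled = s.ffill()                   # carry the previous observation forward

print(mean_filled.tolist())  # [10.0, 30.0, 30.0, 30.0, 50.0]
print(ffilled.tolist())      # [10.0, 10.0, 30.0, 30.0, 50.0]
```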
Let's see how many rows we'd end up dropping if we dropped all rows with NaN values.
num_rows = movie_df[movie_df.isnull().any(axis=1)].shape[0]
print("We'll drop {}% of rows".format(round(num_rows*100/movie_df.shape[0], 2)))
# We could use the mean of the column as the missing value
print("Mean of column budget: ", round(movie_df['budget'].mean(), 2))
sns.distplot(movie_df['budget'])
The distribution plot shows most movies have a budget of around 0-100 million, so the mean could be used for missing values.
We'll go ahead with dropping the rows here instead of replacing them with the mean.
Another option, not explored here but worth trying, is replacing missing values with the mean of the values for the same genre, as movies of a particular genre tend to have a similar range of budget and gross. Superhero movies might have budgets of more than 100 million and gross upwards of 700 million, while independent movies have smaller budgets and smaller gross too.
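A minimal sketch of that genre-wise imputation on a toy frame (the column names and numbers here are illustrative, not from the movie dataset):

```python
import numpy as np
import pandas as pd

# Toy frame: fill a missing budget with the mean budget of the same genre.
df = pd.DataFrame({
    'genre':  ['Action', 'Action', 'Drama', 'Drama', 'Drama'],
    'budget': [200.0, np.nan, 10.0, 20.0, np.nan],
})

# groupby().transform('mean') returns a per-group mean aligned to the
# original index, so it can be passed straight to fillna.
df['budget'] = df['budget'].fillna(df.groupby('genre')['budget'].transform('mean'))
print(df['budget'].tolist())  # [200.0, 200.0, 10.0, 20.0, 15.0]
```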
movie_df.dropna(subset=['gross', 'budget'], inplace=True)
Let's fill the other numerical columns that have NaN values with their means.
# Fill each remaining numerical column's NaN values with the column mean
mean_cols = ['num_critic_for_reviews', 'duration', 'director_facebook_likes',
             'actor_3_facebook_likes', 'actor_1_facebook_likes',
             'num_voted_users', 'cast_total_facebook_likes',
             'facenumber_in_poster', 'num_user_for_reviews',
             'actor_2_facebook_likes', 'aspect_ratio', 'movie_facebook_likes']
for col in mean_cols:
    movie_df[col].fillna(movie_df[col].mean(), inplace=True)
Movie titles contain some additional whitespace; let's remove it.
The strip function removes leading and trailing characters from a string (whitespace by default).
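For example (the titles in this particular dataset reportedly end with a non-breaking space, '\xa0', which the default strip() also treats as whitespace):

```python
# Leading spaces plus a trailing non-breaking space
title = '  Avatar\xa0'
print(repr(title.strip()))  # 'Avatar'
```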
movie_df['movie_title'] = movie_df['movie_title'].str.strip()
Reference: https://en.wikipedia.org/wiki/Motion_picture_content_rating_system#United_States
There are 5 content ratings available:
Let's replace M and GP with PG, replace X with NC-17 and replace “Approved”, “Not Rated”, “Passed”, “Unrated” with the most common rating “R”.
def update_rating(rating):
    if rating in ('M', 'GP'):
        return 'PG'
    elif rating == 'X':
        return 'NC-17'
    elif rating in ['Approved', 'Not Rated', 'Passed', 'Unrated']:
        return 'R'
    return rating

movie_df['content_rating'] = movie_df['content_rating'].apply(update_rating)
movie_df['content_rating'].fillna('R', inplace=True)
A straightforward additional feature we can create is: profit = gross - budget.
movie_df['profit'] = movie_df['gross'] - movie_df['budget']
The other column we can add is whether the movie was successful or not.
Let's keep it simple and say that if the profit is positive the movie is successful, else it's a flop.
movie_df['success'] = movie_df['profit'].apply(lambda x: 1 if x > 0 else 0)
Does the color of a movie matter?
The data contains both black & white and color movies. Let's check the average IMDb score of black & white vs color movies to see whether color impacts the score. If the averages are similar, we'll drop the column.
movie_df['color'].unique()
_, axes = plt.subplots(1, 2, sharey=True, figsize=(12, 8))
sns.boxplot(x='imdb_score', data=movie_df[movie_df['color']=='Color'], ax= axes[0])
sns.boxplot(x='imdb_score', data=movie_df[movie_df['color']==' Black and White'], ax= axes[1])
The mean score for both color and black & white films is in the 6.5-7.25 range, so the column can be removed.
Does the language of a movie matter?
The dataset contains movies in different languages. Let's check whether the language of the movie impacts the user score. First we'll check the ratio of non-English to English movies in the dataset, then compare the mean scores of non-English and English movies.
movie_df['language'].unique()
english = movie_df[movie_df['language'] == 'English']
non_english = movie_df[movie_df['language'] != 'English']
print("Non-English to English movie ratio in dataset: ", round(non_english.shape[0]/english.shape[0], 4))
_, axes = plt.subplots(1, 2, sharey=True, figsize=(12, 8))
sns.boxplot(x='imdb_score', data=english, ax= axes[0])
sns.boxplot(x='imdb_score', data=non_english, ax= axes[1])
The dataset consists of almost 95% English titles, and the mean scores are in the 6.5-7.25 range, so the language column can be dropped.
Let's check for aspect ratio outliers.
movie_df['aspect_ratio'].unique()
g= sns.countplot(x='aspect_ratio', data=movie_df)
g.set_xticklabels(g.get_xticklabels(),rotation=90)
plt.show()
Aspect ratio 16 is an outlier in this dataset. Probably the user meant to input a 16:9 aspect ratio, but we can't be sure, so let's set it to the most common aspect ratio, which is 2.35.
movie_df['aspect_ratio'] = movie_df['aspect_ratio'].apply(lambda x: 2.35 if x == 16 else x)
Does the genre of a movie impact the IMDb score?
One movie can belong to multiple genres. This is represented in the dataset by the genre value being a multi-value string separated by '|'. Let's check whether genre impacts imdb_score. If it does, we'll have to split the genre values and duplicate each movie once per genre.
We'll create a new dataframe containing genre and imdb_score, then plot the values for each genre to check the distribution of scores by genre.
genres = []
scores = []
for _, row in movie_df.iterrows():
    for gnr in row['genres'].split('|'):
        genres.append(gnr)
        scores.append(row['imdb_score'])
genre_df = pd.DataFrame({'genre': genres, 'score': scores})
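The loop above can also be written with pandas' split/explode idiom; a minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    'genres': ['Action|Adventure', 'Drama'],
    'imdb_score': [7.9, 6.5],
})

# str.split turns the pipe-separated string into a list;
# explode then repeats the row once per list element.
genre_df = (df.assign(genre=df['genres'].str.split('|'))
              .explode('genre')
              .rename(columns={'imdb_score': 'score'})
              [['genre', 'score']]
              .reset_index(drop=True))
print(genre_df['genre'].tolist())  # ['Action', 'Adventure', 'Drama']
```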
Genre is a categorical column. Seaborn provides a function called 'catplot' to plot categorical data. Further information can be found at https://seaborn.pydata.org/generated/seaborn.catplot.html
sns.catplot(x='genre', y='score', data=genre_df)
As we can see from the plot, scores are distributed similarly across most genres, so the score is not strongly impacted by the genre of the movie and there is no need to split genres.
Let's drop columns language and color
movie_df.drop(columns=['color', 'language'], inplace=True)
movie_df.columns
There are still some columns with missing values. These are all columns with text values, so let's just replace the missing values with the empty string ''.
movie_df.columns[movie_df.isnull().any()]
movie_df['actor_2_name'].fillna('', inplace=True)
movie_df['actor_1_name'].fillna('', inplace=True)
movie_df['actor_3_name'].fillna('', inplace=True)
movie_df['plot_keywords'].fillna('', inplace=True)
movie_df.columns[movie_df.isnull().any()]
Let's look at who the top occurring directors and actors are.
fig, ax = plt.subplots(2, 2, figsize=(10,8))
sns.countplot(y='director_name', data=movie_df,
              order=movie_df.director_name.value_counts(ascending=False).iloc[:10].index,
              ax=ax[0, 0])
sns.countplot(y='actor_1_name', data=movie_df,
              order=movie_df.actor_1_name.value_counts(ascending=False).iloc[:10].index,
              ax=ax[0, 1])
sns.countplot(y='actor_2_name', data=movie_df,
              order=movie_df.actor_2_name.value_counts(ascending=False).iloc[:10].index,
              ax=ax[1, 0])
sns.countplot(y='actor_3_name', data=movie_df,
              order=movie_df.actor_3_name.value_counts(ascending=False).iloc[:10].index,
              ax=ax[1, 1])
fig.tight_layout()
plt.show()
Let's plot the 10 most profitable movies.
profit_df = movie_df.nlargest(10, 'profit')
plt.plot(profit_df['movie_title'], profit_df['profit'])
plt.xticks(rotation='vertical')
plt.show()
Do social media likes impact the score?
Let's plot the Facebook likes against the IMDb score and see whether social media likes impact the score. We'll use scatter plots to plot all the values. Further information about scatterplot can be found at https://seaborn.pydata.org/generated/seaborn.scatterplot.html
fig, ax = plt.subplots(2, 3, figsize=(10,8))
sns.scatterplot(data=movie_df, x="movie_facebook_likes", y="imdb_score", ax=ax[0, 0])
sns.scatterplot(data=movie_df, x="director_facebook_likes", y="imdb_score", ax=ax[0, 1])
sns.scatterplot(data=movie_df, x="cast_total_facebook_likes", y="imdb_score", ax=ax[0, 2])
sns.scatterplot(data=movie_df, x="actor_1_facebook_likes", y="imdb_score", ax=ax[1, 0])
sns.scatterplot(data=movie_df, x="actor_2_facebook_likes", y="imdb_score", ax=ax[1, 1])
sns.scatterplot(data=movie_df, x="actor_3_facebook_likes", y="imdb_score", ax=ax[1, 2])
fig.tight_layout()
plt.show()
Surprisingly, the score appears roughly proportional to the number of Facebook likes for the director and actor 3. This doesn't hold for actor 1 or actor 2, and consequently not for the total cast or the movie itself.
Does the number of reviews influence the score?
Let's again plot the score, this time against the number of critic and user reviews. The aim is to find out whether the score is directly proportional to the number of votes.
fig, ax = plt.subplots(1, 3, figsize=(10,8))
sns.scatterplot(data=movie_df, x="num_critic_for_reviews", y="imdb_score", ax=ax[0])
sns.scatterplot(data=movie_df, x="num_voted_users", y="imdb_score", ax=ax[1])
sns.scatterplot(data=movie_df, x="num_user_for_reviews", y="imdb_score", ax=ax[2])
fig.tight_layout()
plt.show()
It seems the number of votes directly affects the score.
Let's remove all the name columns
movie_df.columns
For this simple experiment we are going to drop most of the text columns, like movie title, actor names, etc.
Further analysis could examine whether the presence of a certain director or actor impacts the score; this is partially captured by popularity metrics like the number of Facebook likes.
columns = ['director_name', 'actor_2_name', 'actor_1_name', 'movie_title', 'actor_3_name', 'genres', 'country', 'movie_imdb_link', 'plot_keywords']
movie_df.drop(columns=columns, inplace=True)
Let's check for correlated features.
Correlated features in general don't improve models (although it depends on specifics of the problem, like the number of variables and the degree of correlation), but they affect specific models in different ways and to varying extents:
For linear models (e.g., linear regression or logistic regression), multicollinearity can yield solutions that vary wildly and are possibly numerically unstable.
Random forests can be good at detecting interactions between different features, but highly correlated features can mask these interactions.
Further information can be found at https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features
We'll use pandas' corr function to find the pairwise correlation between columns. The default method is Pearson correlation. In summary, the Pearson correlation between two variables is the covariance of the two variables divided by the product of their standard deviations (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient). We can use other correlation methods like Kendall, Spearman, or a custom callable. Further information can be found at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html
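To make the definition concrete, pandas' result matches the covariance/standard-deviation formula directly on a toy pair of series:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = cov(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y)  # method='pearson' is the default

print(round(r_manual, 4), round(r_pandas, 4))  # 0.7746 0.7746
```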
plt.figure(figsize=(12,10))
cor = movie_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
There are some features which are highly correlated and can be removed.
Let's remove the total cast likes and keep only the actors' likes. We'll also remove the profit column. We'll keep num_voted_users, and we'll create a new feature that is the ratio of the number of critic reviews to the number of user reviews.
movie_df.drop(columns=['profit'], inplace=True)
movie_df['non_prim_cast_likes'] = movie_df['actor_2_facebook_likes'] + movie_df['actor_3_facebook_likes']
movie_df.drop(columns=['cast_total_facebook_likes'], inplace=True)
movie_df['votes_ratio'] = movie_df['num_critic_for_reviews'] / movie_df['num_user_for_reviews']
movie_df.drop(columns=['actor_2_facebook_likes', 'actor_3_facebook_likes', 'num_critic_for_reviews', 'num_user_for_reviews'], inplace=True)
#Let's again look at correlation
plt.figure(figsize=(12,10))
cor = movie_df.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds)
plt.show()
There are still some correlated features, like num_voted_users and gross, and num_voted_users and movie_facebook_likes, but we'll ignore them.
movie_df.columns
content_rating is a categorical (i.e., non-numerical) feature. Certain libraries like CatBoost can handle categorical features directly, but most algorithms need numerical values, so we need to convert the categorical variable into some numerical form.
The simplest form is one-hot encoding. In this strategy, each category value is converted into a new column that is assigned a 1 or 0 (true/false) value. https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd is a nice article explaining how to handle categorical features.
pandas provides the get_dummies function to convert a categorical variable into dummy/indicator variables (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html).
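A minimal sketch of what get_dummies produces on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'content_rating': ['PG', 'R', 'PG-13']})
encoded = pd.get_dummies(toy, columns=['content_rating'])

# Each category becomes its own indicator column.
print(list(encoded.columns))
# ['content_rating_PG', 'content_rating_PG-13', 'content_rating_R']
```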
movie_df = pd.get_dummies(movie_df, columns=['content_rating'])
movie_df.columns
This is a regression task, as the target variable is continuous rather than discrete. Let's start with the simplest regression estimator: a linear estimator. Scikit-learn provides ordinary least squares linear regression, which fits a linear model to the data to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
y = movie_df['imdb_score']
movie_df.drop(columns=['imdb_score'], inplace=True)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(movie_df, y, test_size=0.33, random_state=42)
We'll use mean squared error (MSE) to measure how close the predictions are to the actual scores. We could also use other metrics like mean absolute error (MAE) or root mean squared error (RMSE).
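On a toy pair of prediction/target arrays, the three metrics look like this:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([7.0, 6.5, 8.0])
y_pred = np.array([6.5, 6.5, 7.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
rmse = np.sqrt(mse)                        # back in the units of the score

print(round(mse, 4), round(mae, 4), round(rmse, 4))  # 0.4167 0.5 0.6455
```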
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
lreg= LinearRegression()
lreg.fit(X_train, y_train)
pred = lreg.predict(X_test)
err = mean_squared_error(y_test, pred)
print("Mean Squared Error:", err)
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(pred[i], 2), y_test.iloc[i]))
Linear regression is a very simple model that tries to fit a linear relationship. Let's try something a bit more complex, like a decision tree. A decision tree creates a model that predicts the value of the target variable by learning simple decision rules inferred from the data features (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html).
from sklearn.tree import DecisionTreeRegressor
treg = DecisionTreeRegressor(random_state=42)
treg.fit(X_train, y_train)
tpred = treg.predict(X_test)
terr = mean_squared_error(y_test, tpred)
print("Mean squared error for decision tree: ", terr)
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(tpred[i], 2), y_test.iloc[i]))
A random forest is an ensemble of decision trees. It fits multiple trees on different subsets of the data and averages their predictions to get better results and to avoid the overfitting that plagues single decision trees (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html).
from sklearn.ensemble import RandomForestRegressor
rfreg = RandomForestRegressor(random_state=42)
rfreg.fit(X_train, y_train)
rfpred = rfreg.predict(X_test)
rferr = mean_squared_error(y_test, rfpred)
print("Mean squared error for random forest: ", rferr)
As can be seen from the error values, the random forest has the lowest MSE even without fine-tuning.
# Let's see some prediction vs actual values
for i in range(10):
    print("Prediction: {} vs Actual: {}".format(round(rfpred[i], 2), y_test.iloc[i]))
In K-fold cross-validation, the data is divided into k subsets. Each time, one of the k subsets is used as the test/validation set and the other k-1 subsets are put together to form the training set. The error estimate is averaged over all k trials to gauge the overall effectiveness of the model.
Every data point gets to be in a validation set exactly once and in a training set k-1 times. This reduces bias, as we use most of the data for fitting, and reduces variance, as all of the data eventually gets used for validation.
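A minimal sketch of 5-fold cross-validation with scikit-learn (the data here is synthetic, purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data: y is a noisy linear function of X.
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.rand(100)

# 5 folds: each sample lands in the validation fold exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_squared_error')
print(len(scores))  # 5 — one (negative) MSE per fold
```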
GridsearchCV
sklearn provides GridSearchCV to do an exhaustive search over hyperparameter values, with K-fold cross-validation at each point of the search. At the end of the search, GridSearchCV gives us the best hyperparameter values, i.e., the ones with the least error / maximum accuracy.
We'll use GridSearchCV to search for hyperparameter values for the random forest and do 3-fold cross-validation at each step.
from sklearn.model_selection import GridSearchCV
parameters = {'n_estimators': [100, 150, 200, 250, 500],
              'max_depth': [3, 5, 6, 10],
              'max_features': [5, 8, 12, 18]}
rfreg = RandomForestRegressor(random_state=42)
clf = GridSearchCV(rfreg, param_grid=parameters, scoring='neg_mean_squared_error', cv=3, verbose=1)
clf.fit(X_train, y_train)
clf.best_params_, clf.best_score_
The random forest estimator also provides a feature importance for each feature. Let's plot the feature importances and see which features matter.
importances = clf.best_estimator_.feature_importances_
indices = np.argsort(importances)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), [movie_df.columns[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
As we can see from the importance plot above, the most important feature in this dataset was the number of votes; the number of votes is directly related to higher or lower scores. Other important features are duration (longer movies might seem boring and hurt the user score) and budget (huge-budget movies are usually visually spectacular, with great production value, good CGI, and better actors). That can also translate into better earnings, as gross is another important feature here.
Another likely insight is that social media affects the score directly. The popularity of the movie, director, and crew on social media seems to be directly linked to the score of the movie. The critical question is: can social media influence be used to inflate the score of a not-so-good movie, or to downgrade the score of a good one?
Content rating doesn't seem to matter much. I haven't explored features like keywords much, but a similar exploration can be done with other features.
In summary, the exploration suggests a direct correlation between the popularity of the director, crew, and movie, together with its budget, and the number of votes it gets, and in turn the score/rating it receives.