Movie production has many stages and variables that affect the success
of the resulting film. In this project, we train multiple models to
predict a film’s success based on factors that could be measured in
advance, such as the producers and budget.

Film directors and producers must consider many factors when creating
movies. These include the actors, budget, title, genre, script,
screenplay, themes, etc. The sheer complexity, along with the
unpredictability of how a person might perceive the film, makes it
difficult to predict whether the right decisions are being made.

We used two Kaggle datasets,
https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus
and
https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata,
to obtain each movie's overview, revenue, IMDb votes, IMDb rating,
budget, star cast, and related fields. We also scraped
https://www.the-numbers.com/box-office-star-records/worldwide/lifetime-acting/top-grossing-leading-stars/501
and
https://www.the-numbers.com/box-office-star-records/worldwide/lifetime-specific-technical-role/director
to obtain the lifetime revenues of various actors and directors. We
ended up with about 1300 usable data points, reading the Kaggle
datasets in with a simple CSV reader.
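The CSV-reading step can be sketched as follows. This is an illustrative stand-in, not our exact code: the column names are assumed to match the TMDB metadata file, and rows with missing or malformed numeric fields are simply skipped.

```python
import csv

def load_movies(path):
    """Read a movie-metadata CSV, keeping only rows with usable numbers.

    The column names ("title", "budget", "revenue") are assumptions
    about the TMDB file layout, used here for illustration.
    """
    movies = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                movies.append({
                    "title": row["title"],
                    "budget": float(row["budget"]),
                    "revenue": float(row["revenue"]),
                })
            except (KeyError, ValueError):
                continue  # skip rows with missing or malformed fields
    return movies
```

Filtering at load time is what brings the raw datasets down to the roughly 1300 usable points.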

We used three different methods to model our data: linear
regression, random forests, and k-means clustering. Our literature
review indicated that regression models had achieved relatively high
accuracy on this task, which was the rationale for choosing linear
regression; the model also yields a simple equation for predicting
other success metrics. Given the number of metrics in play, random
forests can produce more accurate results by averaging the
predictions of many decision trees. Lastly, movie summaries and
scripts play a large role in how anticipated a film is, and thus
indicate its potential performance. We applied NLP to the movie
overviews and clustered them with k-means, grouping similar movies
together so that a movie's success metrics can be estimated from
movies with similar overviews.

In order to simplify our models and improve computational
efficiency, we utilised Principal Component Analysis (PCA) to reduce
the three input features (number of producers, budget, and highest
star revenue) to two. We chose PCA over feature selection because it
captures more of the original variance for the same number of
features, and thus the same model complexity.

Linear regression is a simple model that fits a line of best fit
through the data points by training on our x features and y labels.
This yields a formula describing how the x features affect the y
labels, so a user can plug in feature values and immediately get a
predicted output. We split our data into training and test sets and
trained the model on the training set.

After training the model, we first determined which of the six y labels is the best indicator of success. A good indicator of a movie's success should have much of its variance explained by the x features, because we want success to be predictable from the input data. We therefore used the R^2 score to measure the linear regression models. We regressed each of the y labels on x and compared R^2 scores; revenue scored highest, so we use revenue as our indicator of a movie's success for linear regression.

We then tested our linear regression model with each x feature individually to see how well each predicts revenue, obtaining R^2 scores of 0.42, 0.49, and 0.11. Our features therefore do not predict revenue very well, especially the highest star revenue. The normalized RMSE values are 0.29, 0.29, and 0.657, meaning the prediction error is poor for the highest star revenue but acceptable for the other two features.
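The per-feature evaluation could be computed along these lines. The data is synthetic, and normalizing RMSE by the label's range is one common convention assumed here for illustration, not necessarily the exact normalization our code used.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))             # stand-in features
y = 1.5 * X[:, 0] + rng.normal(size=200)  # stand-in revenue label

# Fit one single-feature regression per column, reporting R^2 and
# RMSE normalized by the label's range.
for j in range(X.shape[1]):
    xj = X[:, [j]]
    pred = LinearRegression().fit(xj, y).predict(xj)
    r2 = r2_score(y, pred)
    nrmse = np.sqrt(mean_squared_error(y, pred)) / (y.max() - y.min())
    print(f"feature {j}: R^2 = {r2:.2f}, NRMSE = {nrmse:.2f}")
```

In this toy setup only feature 0 actually drives the label, so its R^2 is high while the others sit near zero, mirroring how the highest star revenue underperformed the other two features on the real data.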

While linear regression is a simple model that tells us how much revenue each x feature contributes, its predictive accuracy on our labels is poor, so we turned to a different model.

Random forests are an ensemble learning method for classification,
regression, and other tasks that constructs many decision trees at
training time and outputs the class or mean prediction of the
individual trees; a random forest regressor predicts continuous
values. The x features were the number of producers, budget, and
highest star revenue. The data was split 9:1 into training and test
sets, and a separate model was trained for each of IMDb rating,
Metascore, awards, revenue, and popularity as the output. Revenue
turned out to be the easiest to predict, with a score of 62%, much
higher than the score yielded by the regression model. We attribute
this to the continuous nature of both the random forest regressor
and the revenue data: IMDb scores are relatively discrete, averaging
from 0 to 10, while revenue is a genuinely continuous quantity.
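The random forest experiment follows the same split-and-fit shape as the linear model. Again the data is a synthetic stand-in, with a deliberately nonlinear relationship that a forest can capture but a line cannot; the 9:1 split matches the one described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
# Synthetic stand-ins for (producers, budget, star revenue) mapping
# nonlinearly to a continuous revenue-like label.
X = rng.normal(size=(1300, 3))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.2, size=1300)

# 9:1 train/test split, as in our experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0)

forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# score() is R^2; the forest outputs the mean prediction of its trees.
print(forest.score(X_test, y_test))
```

The trees split the feature space into regions and average within them, which is why the forest handles the squared term here, and why it outperformed plain linear regression on revenue.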

We gave this idea a final shot with an NLP-based approach, built on
the assumption that movies with similar plots will have similar
revenue. First, we parsed the movie overviews (short descriptions of
the movies) from our existing dataset and split the data into
training and test sets. We converted the overviews into vectors
using the Python library sentence-transformers, which provides an
easy-to-use API for producing embeddings, and formed 30 clusters
from these embeddings. The clusters were fairly accurate and grouped
similar movies together (for example, all the Marvel movies ended up
in one cluster). Within each cluster, we took the mean of each y
field over the training data; each test movie was then assigned to
its nearest cluster, and its fields were predicted as the means of
that cluster. The lowest NRMSE of this approach turned out to be
around 0.56, for the popularity score.
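The cluster-then-average step can be sketched as follows. In the real pipeline the vectors come from sentence-transformers embeddings of the overviews; random vectors stand in for them here so the sketch stays self-contained, and "revenue" stands in for whichever y field is being predicted.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Stand-ins: in our pipeline these were sentence-transformers
# embeddings of movie overviews, paired with each movie's y fields.
train_emb = rng.normal(size=(500, 16))
train_revenue = rng.uniform(1e6, 1e9, size=500)

kmeans = KMeans(n_clusters=30, n_init=10, random_state=0).fit(train_emb)

# Mean of the y field within each training cluster.
cluster_means = np.array([
    train_revenue[kmeans.labels_ == c].mean() for c in range(30)
])

# A test movie's field is predicted as the mean of its nearest cluster.
test_emb = rng.normal(size=(5, 16))
pred = cluster_means[kmeans.predict(test_emb)]
print(pred)  # one predicted revenue per test movie
```

Since every prediction is a cluster mean, the method can never predict outside the range seen in training, which partly explains why similar overviews did not translate into accurate success predictions.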

Overall, we found that it is extremely hard to predict a movie's
success from only the data available before the movie goes into
production. Our linear regression model had poor accuracy but
yielded a simple formula for the y labels, while our random forest
was somewhat more accurate. Our clustered overviews showed that
similar movies don't necessarily have similar success.

Our models would likely have produced far better results if the properly continuous data had not been limited to revenues and earnings. This limitation, combined with a rather small final dataset of only about 1400 movies, led to the low accuracy of the linear regression model.

In general, we could have engineered more features or imported more data for this problem, as both PCA and random forests work better with larger sample sizes. A better approach would have been to separate discrete and continuous features and train models on each individually. In the end, we did successfully create a model that, given a movie script, a budget, and a list of actors, predicts a movie's success with 62% accuracy (with random forests), which could help movie studios avoid wasting time and money on movies that are unlikely to succeed. Even though this degree of success is modest, given how hard a creative endeavour a movie is and that we used only a fraction of the data available, with more data and time we could build a more successful model.

- Muhammad Hassan Latif and Hammad Afzal. "Prediction of Movies Popularity Using Machine Learning Techniques." https://www.researchgate.net/publication/311913687_Prediction_of_Movies_popularity_Using_Machine_Learning_Techniques
- Nikhil Apte, Mats Forssell, and Anahita Sidhwa. "Predicting Movie Revenue." CS229, Stanford University, 2011. http://cs229.stanford.edu/proj2011/ApteForssellSidhwa-PredictingMovieRevenue.pdf
- Karl Persson. "Predicting Movie Ratings: A Comparative Study on Random Forests and Support Vector Machines." 2015. http://www.diva-portal.org/smash/get/diva2:821533/FULLTEXT01.pdf
- Sitaram Asur and Bernardo A. Huberman. "Predicting the Future With Social Media." HP Labs. https://arxiv.org/pdf/1003.5699.pdf
- Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke. "Predicting IMDB Movie Ratings Using Social Media." https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/oghina-predicting-2012.pdf