CS 4641 Project Team 80

Movie Success Prediction

Final Report

Link to Gantt Chart

Introduction

Movie production has many stages and variables that affect the success of the resulting film. In this project, we train multiple models to predict a film’s success based on factors that could be measured in advance, such as the producers and budget.

Problem Definition

Film directors and producers must consider many factors when creating movies, including the actors, budget, title, genre, script, screenplay, and themes. This sheer complexity, along with the unpredictability of how a person might perceive the film, makes it difficult to predict whether the right decisions are being made.

Data Collection

We used two Kaggle datasets - https://www.kaggle.com/datasets/gufukuro/movie-scripts-corpus and https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata - to get data about the overview, revenue, IMDb votes, IMDb ratings, budget, star cast, and more for various movies. We also scraped https://www.the-numbers.com/box-office-star-records/worldwide/lifetime-acting/top-grossing-leading-stars/501 and https://www.the-numbers.com/box-office-star-records/worldwide/lifetime-specific-technical-role/director to get lifetime revenue figures for various actors and directors. We ended up with about 1300 usable data points, which we read in from the Kaggle datasets with a simple CSV reader.
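The ingestion step can be sketched as follows. This is a minimal stand-in, not our exact reader: the column names (`title`, `budget`, `revenue`) and the inline sample data are illustrative assumptions, while the real input is the Kaggle CSV exports.

```python
import csv
import io

def load_movies(csv_text):
    """Read movie rows from CSV text into a list of dicts,
    coercing numeric fields and skipping unusable rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            row["budget"] = float(row["budget"])
            row["revenue"] = float(row["revenue"])
        except (KeyError, ValueError):
            continue  # drop rows with missing or garbled numeric fields
        rows.append(row)
    return rows

# Illustrative sample: one clean row, one row with unusable numerics.
sample = "title,budget,revenue\nAvatar,237000000,2787965087\nBadRow,n/a,n/a\n"
movies = load_movies(sample)  # only the well-formed row survives
```

Filtering at read time is what shrank the raw datasets down to the roughly 1300 usable data points mentioned above.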

Methods

We used three different methods to model our data: Linear Regression, Random Forest, and K-Means. Our literature review indicated that regression models achieved relatively high accuracy scores, which was the rationale for choosing linear regression; furthermore, the model yields a simple equation for predicting other success metrics. Given the number of metrics in play, random forests can yield more accurate results by averaging the predictions of many decision trees. Lastly, movie summaries and scripts play a huge role in how anticipated a film is, and thereby indicate its potential performance. We apply NLP to movie overviews and cluster them with K-Means; grouping similar movies together helps predict the success metrics of movies with similar ‘overviews’.

Discussion

PCA

In order to simplify our models and improve computational efficiency, we utilised Principal Component Analysis to convert the three input features (number of producers, budget, and highest star revenue) into two features. This method was chosen over feature selection because it captures more of the original variance in the same number of features, and thus at the same complexity.
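The reduction step looks like the following sketch. The data here is synthetic (the third column is deliberately a linear mix of the first two, standing in for correlation among producers, budget, and star revenue); the real inputs are the dataset features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the three input features; the third feature is
# a linear combination of the other two, mimicking correlated inputs.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 0.5 * X[:, 0] + 0.5 * X[:, 1]

# Project the three features onto the top two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # shape (100, 2)

# Because the third feature is a linear mix of the first two, two
# components retain essentially all of the original variance.
retained = pca.explained_variance_ratio_.sum()
```

This is why PCA beats dropping a feature outright: feature selection discards one column's variance entirely, while the projection keeps whatever variance the discarded direction shared with the others.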

Linear Regression

Linear regression is a simple model that fits a line of best fit through our data points by training on our x features and y labels. The fit yields a formula describing how each x feature affects the y labels, so a person can easily input the x features they want and quickly get a corresponding y output. We split our data into training and test sets and trained our model on the training data.
After training the model, we first determined which of the six y labels is the best indicator of success. A good indicator of a movie’s success should have much of its variance explained by the x features, because we want success to be predictable from any given input data. Thus, we used the R^2 score to measure each linear regression model’s quality. We applied linear regression on x with each of the y labels to see which label has the best R^2 score, which turned out to be revenue. Therefore, we use revenue as our indicator of a movie’s success for linear regression.
We then tested our linear regression model with each of our x features individually to see how well each predicts revenue. We got R^2 scores of 0.42, 0.49, and 0.11, which means our x features do not predict revenue very well, especially the highest star revenue. The normalized RMSE values are 0.29, 0.29, and 0.657: the gap between test labels and predicted labels is large for the highest star revenue but acceptable for the other two features.
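The evaluation loop above can be sketched as follows, assuming the standard scikit-learn API. The data is synthetic with a known linear signal so the scores come out high here; on the real movie features the same two metrics produced the weak scores reported above. Normalizing the RMSE by the label range is one common convention and is an assumption about our exact normalization.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a known linear relationship plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out a test set, fit on the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2: fraction of label variance explained by the features.
r2 = r2_score(y_test, y_pred)
# RMSE normalized by the label range, so scores are comparable
# across labels with very different scales (revenue vs. ratings).
nrmse = np.sqrt(mean_squared_error(y_test, y_pred)) / (y_test.max() - y_test.min())
```

The fitted `model.coef_` is the "simple formula" benefit mentioned above: each coefficient states how much the predicted label moves per unit change in a feature.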
While linear regression is a simple model that tells us how much revenue each x feature contributes, its accuracy in predicting our labels is not very good. Thus, we turned to a different model.

Random Forest

Random forest is an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class (for classification) or mean prediction (for regression) of the individual trees. A random forest regressor is the variant used to predict continuous values. The x features included the number of producers, budget, and the highest star revenue. The data was split into a 9:1 ratio for training and testing respectively, and individual models were trained with each of IMDb ratings, Metascore ratings, awards, revenue, and popularity as the output. Revenue turned out to be the easiest to predict, with a score of 62%, much higher than the score yielded by the regression model. This is primarily due to the continuous nature of the random forest regressor and the revenue data: IMDb scores are relatively discrete, averaging from 0 to 10, while the revenue figure is a continuous piece of information.
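A sketch of this setup with scikit-learn, assuming its standard API: a 9:1 split as in the report, a forest fit on three features, and the default `score` method, which returns R^2 on held-out data. The data is synthetic (a mildly nonlinear signal that a linear model would fit less well) standing in for the producer/budget/star-revenue features.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (num producers, budget, highest star revenue)
# with a nonlinear target, which trees can capture but a line cannot.
rng = np.random.default_rng(7)
X = rng.uniform(size=(500, 3))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

# 9:1 train/test split, matching the report's setup.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=0
)

# Each tree sees a bootstrap sample; predictions are averaged over trees.
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
score = forest.score(X_test, y_test)  # R^2 on the held-out 10%
```

In our experiments the same procedure, repeated once per candidate label, is what singled out revenue as the most predictable target at 62%.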

K-Means Clustering on Overview Vectors

We gave this idea a final shot with an NLP-based approach, based on the assumption that movies with similar plots will have similar revenue. First, we parsed movie overviews (short descriptions of movies) from our existing dataset and split the data into training and testing sets. Then, we converted those overviews into vectors using the Python library sentence-transformers, which provides an easy-to-use API for computing embeddings, and grouped the embeddings into 30 clusters. These clusters were fairly accurate and placed similar movies together (for example, all Marvel movies were grouped together). We then took the mean of each of the y fields in our dataset within each cluster to predict the fields of the testing data: each test movie was classified into one of the clusters, and its fields were predicted by the means of that cluster. The lowest NRMSE of this approach turned out to be around 0.56, for the popularity score.
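The cluster-then-average predictor can be sketched as below. In the real pipeline the embeddings come from sentence-transformers and there are 30 clusters; here, to keep the sketch self-contained, random vectors around three synthetic centers stand in for overview embeddings, and a single "popularity" field stands in for the y fields.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic overview embeddings: 150 points around 3 well-separated
# centers in 8 dimensions (stand-in for sentence-transformer vectors).
rng = np.random.default_rng(1)
centers = rng.normal(size=(3, 8))
assignment = rng.integers(0, 3, size=150)
embeddings = centers[assignment] + 0.05 * rng.normal(size=(150, 8))
# Each synthetic cluster shares one popularity value.
popularity = np.array([10.0, 20.0, 30.0])[assignment]

# Cluster the training embeddings (30 clusters in the real pipeline).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(embeddings)

# Per-cluster mean of the y field, used as that cluster's prediction.
cluster_means = np.array(
    [popularity[kmeans.labels_ == k].mean() for k in range(3)]
)

def predict(vec):
    """Predict a field as the mean of the assigned cluster."""
    return cluster_means[kmeans.predict(vec.reshape(1, -1))[0]]

pred = predict(embeddings[0])
```

A test movie's overview is embedded the same way, assigned to its nearest cluster, and given that cluster's mean y values; the NRMSE figures above come from comparing those means against the true test labels.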

Conclusion

Overall, we found that it is extremely hard to predict a movie’s success with only the data available before the movie goes into production. Our linear regression model provided poor accuracy but a simple formula for calculating y labels, while our random forest gave somewhat better accuracy. Our clustered overviews showed that similar movies don’t necessarily have similar success.
These models would have produced far better results if the properly continuous data had not been limited to revenues and earnings. That problem, compounded by a rather small final dataset of only about 1400 movies, led to low accuracy in the linear regression model.
In general, we could have created more features or imported more data for this problem, as both PCA and random forests work better with larger sample sizes. A better approach would have been to separate discrete and continuous features and train models on them individually. In the end, we did create a model that, given a movie script, a budget, and a list of actors, predicts a movie’s success with 62% accuracy (using random forests), which could help movie studios avoid wasting time and money on movies that are unlikely to succeed. Even though our degree of success is somewhat low, given how hard a creative endeavour a movie is and how little of the available data we used, with more data and time we could build a model with a higher level of success.

Notebook

References

  1. Muhammad Hassan Latif and Hammad Afzal. “Prediction of Movies Popularity Using Machine Learning Techniques.” https://www.researchgate.net/publication/311913687_Prediction_of_Movies_popularity_Using_Machine_Learning_Techniques
  2. Nikhil Apte, Mats Forssell, and Anahita Sidhwa. “Predicting Movie Revenue.” CS229, Stanford University, 2011. http://cs229.stanford.edu/proj2011/ApteForssellSidhwa-PredictingMovieRevenue.pdf
  3. Karl Persson. “Predicting Movie Ratings: A Comparative Study on Random Forests and Support Vector Machines.” 2015. http://www.diva-portal.org/smash/get/diva2:821533/FULLTEXT01.pdf
  4. Sitaram Asur and Bernardo A. Huberman. “Predicting the Future With Social Media.” HP Labs. https://arxiv.org/pdf/1003.5699.pdf
  5. Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke. “Predicting IMDB Movie Ratings Using Social Media.” https://staff.fnwi.uva.nl/m.derijke/wp-content/papercite-data/pdf/oghina-predicting-2012.pdf

Contribution Table