Forecasting Box Office success based on Google Trends data

This repository predicts the number of people who watch a movie on its opening weekend in German cinemas. It uses meta information about the movies linked with Google Trends data for search terms such as the movie title or general terms like "Kino" (German for cinema). The models used for prediction are additive regression models as well as boosting models.

Background

Inspired by a whitepaper by Google, the models fitted in this repository try to predict the number of moviegoers for specific movies on their opening weekend. In this paper Google presents several linear regression models that use the search volume of several movie-related search terms to predict the number of people going to the movies. These models reach an R² of 58% one week before the premiere and 70% one day before the premiere, which serves as the baseline for this repository.

Data gathering, preprocessing and descriptive analysis

The provided dataset contains key data for about 900 movies that premiered in Germany between 01/03/2013 and 07/07/2016. It contains 100 features of the movies, such as age rating, genre and studio, as well as the number of ordered copies, which refers to the number of cinemas in which a movie is shown, and the number of visitors on the first weekend. The number of visitors on the first weekend is the target to be forecasted.

After the raw movie data has been preprocessed in this script, the Google Trends data is collected and preprocessed. I recommend reading more on how we preprocessed the data in this article.

The scripts to gather and preprocess the Google Trends data should be run in the following order:

Collecting the Google Trends data for the anchor terms by running this script
Collecting and scaling the Google Trends data for the defined search terms by running this script
Subtracting the median of the time series by running this script
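
The scaling and median-subtraction steps listed above can be sketched as follows. This is a minimal illustration and not the repository's code. It assumes that every Google Trends request returns the search term together with an anchor term on a shared relative scale, and that series from different requests are made comparable by rescaling so the anchor's mean level matches a common reference. The function names `scale_by_anchor` and `subtract_median` and the parameter `anchor_reference` are hypothetical.

```python
from statistics import mean, median


def scale_by_anchor(term_series, anchor_series, anchor_reference):
    """Rescale a term's series so that the anchor's mean level in this
    request matches a common reference level (assumed scaling rule)."""
    anchor_level = mean(anchor_series)
    if anchor_level == 0:
        raise ValueError("anchor series has no search volume")
    factor = anchor_reference / anchor_level
    return [value * factor for value in term_series]


def subtract_median(series):
    """Remove the baseline search level by subtracting the series median."""
    baseline = median(series)
    return [value - baseline for value in series]


# Toy weekly search volumes for one movie title and its anchor term.
raw_term = [2, 4, 2, 10, 30]
raw_anchor = [50, 50, 50, 50, 50]

scaled = scale_by_anchor(raw_term, raw_anchor, anchor_reference=100)
centered = subtract_median(scaled)
```

Subtracting the median rather than the mean keeps the baseline robust against the spike in search volume around the premiere.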

For every movie we defined 2 to 3 search terms: the main title, the main title plus the suffix "film" and, if the movie has a secondary title, the complete title. After preprocessing the Google Trends data for every search term by running the scripts in the order given above, we need to combine the search terms for each movie into a single KPI, the Google Value (gv), by applying a linear transformation. To do so, we first combine the search volumes of the main title (mt) and main title + "film" (mtf) into a single value. If the movie has a secondary title, we take the search volume of the complete title (ct) into account in a second step. The resulting linear combination should maximize the correlation between the Google Value and the number of visitors:

$\underset{a_j, b_j}{\text{maximize}} \hspace{0.5cm} f(gv_{j}) = \rho(gv_{j}, \text{visitors}) \\ \text{with} \hspace{0.5cm} gv_{j} = \begin{cases} a_j \cdot mt_{ij} + (1 - a_j) \cdot mtf_{ij} & \text{for } ct = mt \\ b_j \cdot \Big(a_j \cdot mt_{ij} + (1 - a_j) \cdot mtf_{ij}\Big) + (1 - b_j) \cdot ct_{ij} & \text{for } ct \neq mt \end{cases} \\ \text{subject to} \hspace{0.5cm} 0 < a_j < 1 \hspace{0.5cm} \text{and} \hspace{0.5cm} 0 < b_j < 1$
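
For a single week, the optimization above can be illustrated with a plain grid search over the two weights. This is only a sketch under stated assumptions: the repository may use a different optimizer, the helper names `pearson`, `google_value` and `best_weights` are hypothetical, and the toy numbers are made up for illustration. A movie without a secondary title is represented by `ct = None`.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def google_value(a, b, mt, mtf, ct):
    """Linear combination from the formula; ct is None without secondary title."""
    base = a * mt + (1 - a) * mtf
    if ct is None:
        return base
    return b * base + (1 - b) * ct


def best_weights(movies, visitors, steps=19):
    """Grid search over a and b inside the open interval (0, 1)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]  # 0.05, 0.10, ..., 0.95
    best = (None, None, -2.0)
    for a in grid:
        for b in grid:
            gv = [google_value(a, b, *movie) for movie in movies]
            rho = pearson(gv, visitors)
            if rho > best[2]:
                best = (a, b, rho)
    return best


# Toy data for one week: (mt, mtf, ct) per movie and opening-weekend visitors.
movies = [(10.0, 2.0, None), (40.0, 12.0, 5.0), (25.0, 9.0, None),
          (60.0, 30.0, 20.0), (5.0, 1.0, 0.5)]
visitors = [12000, 90000, 40000, 210000, 3000]

a, b, rho = best_weights(movies, visitors)
print(f"a={a:.2f}, b={b:.2f}, correlation={rho:.3f}")
```

A grid search is easy to read and respects the open-interval constraints by construction. A bounded numerical optimizer would be the natural replacement for finer resolution.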

We do this for each week. To prevent overfitting of the weights, a nested resampling approach is applied and the resulting weights are averaged. To prevent values of 0, the Google Value is Box-Cox transformed afterwards. The optimization of the weights for the Google Value, as well as their application to calculate the final KPI, is done in this script. This script also splits the data into a train and a test set to prevent overfitting on the optimization results.
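
The Box-Cox step can be sketched as below. The one-parameter Box-Cox transform is only defined for strictly positive inputs, so this sketch adds a constant shift before transforming. Both the shift and the fixed value of `lam` are assumptions made for illustration, since the text does not state how zeros are handled or how the parameter is chosen. The function name `box_cox` is hypothetical.

```python
from math import log


def box_cox(values, lam, shift=0.0):
    """One-parameter Box-Cox transform applied after an optional shift.

    Uses (x**lam - 1) / lam for lam != 0 and log(x) for lam == 0.
    """
    shifted = [v + shift for v in values]
    if any(v <= 0 for v in shifted):
        raise ValueError("Box-Cox needs strictly positive inputs; increase shift")
    if lam == 0:
        return [log(v) for v in shifted]
    return [(v ** lam - 1) / lam for v in shifted]


# Toy Google Values, including a zero that requires the shift.
google_values = [0.0, 3.0, 8.0, 24.0]
transformed = box_cox(google_values, lam=0.5, shift=1.0)
```

In practice the parameter is usually estimated from the data, for example with `scipy.stats.boxcox`, which returns the transformed values together with the fitted lambda when no lambda is passed.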

After the data has been gathered and preprocessed, the descriptive analysis is mostly done in the notebooks in the data_analysis directory.
