Case study 1

Inference and predictions using movie box office data

About the assignment

This case study will involve hands-on analysis of some real (messy) data. The project will be broken into four parts, which will be graded separately. Altogether we will spend two weeks working on this: two class periods (at least partially devoted to it), and one homework assignment in between.

You should work in pairs on this project. Please find a partner. Let me know if you need help being matched up.

There will be four parts:

  1. (10 points - completed in class Tue Oct 14) “Data wrangling”: getting the data from a web-based source and preparing it for analysis.
  2. (20 points - in class Tue Oct 14, due by Tue Oct 21 if not completed in class) Inference: applied examples and simulation-based understanding of confidence intervals and hypothesis tests
  3. (20 points - outside class for homework, due by Tue Oct 21) Open-ended data insights: visualization and descriptive summaries
  4. (20 points - in class Tue Oct 21) Prediction challenge: using linear regression models to predict an outcome as accurately as possible based on covariates.

For each part, you will make an incremental submission. Each piece will be graded separately, and a solution to Part 1 will be provided after submission that you can use for the subsequent parts. Your total score across the four parts will be your grade for the project.

Please let me know if there are questions about this format.

Background and objectives

The Movie Database (TMDB) API provides detailed, community-maintained metadata about films, including titles, release info, countries, genres, runtimes, cast/crew, user ratings, and revenue/budget when available. In this case study, you will use R to programmatically fetch and assemble a working movie dataset, experiencing what modern data acquisition and analysis feels like outside of a textbook: large web-based data sources, missing values, inconsistent fields, and rich, high-dimensional features.

Our applied goals are:

  • Obtain, clean, and prepare the data into a tidy, analysis-ready table.
  • Run inferential analyses (confidence intervals and hypothesis tests) on random subsamples to answer practical questions. For example: What percent of these films have you seen? What is the average revenue? What is the average user score? Does the average score differ between the US and Canada? How stable are these estimates across subsamples?
  • Gain a better understanding of confidence intervals and hypothesis tests through repeated sampling.
  • Conduct realistic, open-ended exploratory data analysis: provide clear, interpretable insights about distributions of and relationships between variables.
  • Build and evaluate predictive models for box office revenue. We will run this as a friendly prediction challenge: you will train on a provided training set and submit predictions for a hidden test set of films, and submissions will be scored by predictive accuracy.

Why this is exciting: movies are a familiar domain, the data are messy in authentic ways, and the task spans the full lifecycle—from API calls and wrangling to inference and prediction. You will practice reproducible workflows, critical thinking about uncertainty and bias, and model selection under real constraints, all while competing to build the most accurate revenue forecaster.

Part 1: Data wrangling (14-Oct)

This part should be completed in class on Tuesday October 14.

Getting access to the API

An API (Application Programming Interface) is a web service that lets users (given a credential) request data, which a web server then returns. You send an HTTP request to a URL (an endpoint) with parameters and credentials, and you get back a structured response (usually JSON) that you can parse using your programming language of choice (R for us).
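
For example, here is a minimal sketch of what such a request looks like from R, using the httr and jsonlite packages. The endpoint URL and the way the key is stored are illustrative assumptions (you will create the key in the steps below, and the project starter code may organize this differently):

```r
library(httr)      # for GET()
library(jsonlite)  # for fromJSON()

api_key <- Sys.getenv("TMDB_API_KEY")  # keep your key out of the script itself

# Request details for movie_id 17473 from the movie-details endpoint
resp <- GET(
  url   = "https://api.themoviedb.org/3/movie/17473",
  query = list(api_key = api_key)
)

movie <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
movie$title         # film title
movie$release_date  # release date
movie$vote_average  # average user rating
```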

  1. Create an account at TMDB. There is no cost of any kind.
  2. Register for an API key. Take a look at the Getting Started page for the API; it links to the settings page where you can request a key. You will need to agree to the terms of use: for example, analyses of data from this service cannot be disseminated without attribution, and raw data cannot be disseminated at all.
  3. Test out the API. You could test the “Movie details” endpoint from its reference page here: https://developer.themoviedb.org/reference/movie-details.
    • Try entering 17473 for the movie_id field, then click “Try it!” on the right side.
    • You should see a response containing structured data about a classic film. What is the name of the film? When was it released? What is its average user rating?

Let me know if you have any issues at this step with accessing the API.

Project template

This was one example of an API query for one film. It returned some data, but as you can see from the reference page, there is far more data available on each movie. Next we will see how to use R to programmatically query the API to collect data on many films.
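
As a rough illustration (not the starter code itself), one way to collect several films into a single tidy table is to wrap the single-movie request in a function and map over a vector of movie IDs. The wrapper get_movie(), the field selection, and the extra example IDs below are assumptions for this sketch:

```r
library(httr)
library(jsonlite)
library(dplyr)
library(purrr)

api_key <- Sys.getenv("TMDB_API_KEY")

# Hypothetical wrapper: fetch one movie and keep a few fields as a one-row tibble
get_movie <- function(movie_id) {
  resp <- GET(
    url   = paste0("https://api.themoviedb.org/3/movie/", movie_id),
    query = list(api_key = api_key)
  )
  m <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  tibble(
    id           = m$id,
    title        = m$title,
    release_date = m$release_date,
    runtime      = m$runtime,
    budget       = m$budget,
    revenue      = m$revenue,
    vote_average = m$vote_average
  )
}

movie_ids <- c(17473, 238, 603)          # a few arbitrary example IDs
movies <- map_dfr(movie_ids, get_movie)  # one row per film
```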

I will provide a fair bit of starter code to you for this:

  1. Download: tmdb_project_v1.zip
  2. Unzip to a convenient location on your computer
  3. Navigate to the folder and open tmdb.Rproj to open the project in RStudio

Your tasks

Follow the directions in 01_data.R. You have several coding tasks to complete there for Part 1 of the case study.

Submission

Submit your completed script 01_data.R on Moodle.

Part 2: Inference (In class: 14-Oct, turn in by 21-Oct if not completed in class)

This part should be started in class on Tuesday October 14 and completed by Tuesday October 21 if not finished in class.

Objectives

In this part, you will:

  • Take random samples from the cleaned movie dataset
  • Calculate point estimates and confidence intervals for population parameters
  • Explore what proportion of movies in the database you have personally seen
  • Investigate the stability of estimates across different random samples
  • Gain intuition about confidence intervals through repeated sampling simulations

Loading the data

Your script should begin by loading the cleaned dataset you created in Part 1 (data/movies_clean.rds). If you did not complete Part 1, a solution will be provided that you can use to generate this file.
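
A minimal sketch of the first lines of the script (the file path matches the one described above):

```r
# Load the cleaned dataset produced in Part 1
movies <- readRDS("data/movies_clean.rds")
dplyr::glimpse(movies)  # quick check of columns and types
```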

Tasks

Follow the directions in 02_inference.R within the project template you downloaded for Part 1. For Part 2, the tasks will include:

  1. Random sampling: Take a random sample of 100 movies from the cleaned dataset (without replacement)

  2. Personal movie experience:

    • Review your random sample of 100 movies
    • Count how many you have personally seen
    • Calculate a 95% confidence interval for the proportion of all TMDB movies you have seen
    • Share your point estimate and confidence bounds with the class for discussion
  3. Revenue confidence interval:

    • Based on your random sample of 100 movies, calculate a 95% confidence interval for the mean revenue across all films in TMDB
    • Share your point estimate and confidence bounds with the class for discussion
  4. Repeated sampling simulation (a code sketch follows this list):

    • Write a function that draws a random sample of a specified number of movies
    • Write a function to calculate the sample mean and 95% confidence interval for revenue from a given sample
    • Use a for loop to repeat the sampling and inference process 100 times
    • Calculate the actual mean revenue in the full dataset
    • Determine what proportion of your 100 confidence intervals contain the true mean
    • Create a visualization showing all 100 confidence intervals, with color-coding to indicate which contain the true mean (black) and which do not (red)
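
Here is a hedged sketch of how tasks 1-4 might fit together. The column name revenue, the use of t-based intervals, and the assumption that missing revenues were handled in Part 1 are all assumptions; follow the directions and variable names in 02_inference.R:

```r
library(ggplot2)

movies <- readRDS("data/movies_clean.rds")  # from Part 1; assumes revenue has no missing values
set.seed(42)                                # so your sample is reproducible

# Task 1: one random sample of 100 movies (without replacement)
samp <- movies[sample(nrow(movies), 100), ]

# Task 2: if you have personally seen k of the 100 sampled films,
# prop.test(k, 100)$conf.int gives an approximate 95% CI for the proportion

# Task 3: t-based 95% CI for mean revenue, from this one sample
t.test(samp$revenue)$conf.int

# Task 4: helper functions, repeated sampling, coverage, and a plot
sample_movies <- function(data, n) {
  data[sample(nrow(data), n), ]
}

revenue_ci <- function(s) {
  tt <- t.test(s$revenue)
  c(estimate = unname(tt$estimate), lower = tt$conf.int[1], upper = tt$conf.int[2])
}

results <- data.frame(rep = 1:100, estimate = NA_real_, lower = NA_real_, upper = NA_real_)
for (i in 1:100) {
  results[i, c("estimate", "lower", "upper")] <- revenue_ci(sample_movies(movies, 100))
}

true_mean <- mean(movies$revenue)  # the "population" mean in the full dataset
results$covers <- results$lower <= true_mean & true_mean <= results$upper
mean(results$covers)               # proportion of intervals containing the true mean

ggplot(results, aes(x = rep, y = estimate, color = covers)) +
  geom_errorbar(aes(ymin = lower, ymax = upper)) +
  geom_point() +
  geom_hline(yintercept = true_mean, linetype = "dashed") +
  scale_color_manual(values = c(`TRUE` = "black", `FALSE` = "red")) +
  labs(x = "Sample number", y = "Mean revenue (95% CI)", color = "Contains true mean")
```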

Key concepts

This exercise will help you understand:

  • Sampling variability: How estimates change from sample to sample
  • Confidence interval interpretation: What it means for a confidence interval to have 95% coverage
  • The meaning of confidence level: In the long run, approximately 95% of 95% confidence intervals should contain the true parameter

Submission

Submit your completed script 02_inference.R on Moodle under Case Study 1 Box Office Analysis (Part 2).

Part 3: Visualization and Descriptive Insights (Homework: 14-Oct to 21-Oct)

This part is assigned as homework, due by Tuesday October 21.

Your task

Act as a data analyst providing insights about what film characteristics are associated with box-office revenue. Complete your analysis in the Quarto document 03_visualization.qmd and render it to PDF (approximately 3-5 pages).

Your report must include:

  1. Revenue distribution: Histogram and descriptive statistics (mean, median, quantiles, etc.)
  2. Five additional visualizations: Showing relationships between revenue and other variables (e.g., genre, budget, runtime, ratings, release year, production country, cast/director popularity); see the starter sketch after this list for one way to begin
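
As a hedged starting point (column names revenue and budget are assumptions; adapt to whatever your cleaned dataset actually contains, and note that log scales only make sense if zero-revenue films were removed during cleaning):

```r
library(ggplot2)
library(dplyr)

movies <- readRDS("data/movies_clean.rds")

# Item 1: distribution of revenue, plus descriptive statistics
ggplot(movies, aes(x = revenue)) +
  geom_histogram(bins = 40) +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "Revenue (log scale)", y = "Number of films",
       title = "Distribution of box office revenue")

movies |>
  summarize(mean = mean(revenue), median = median(revenue),
            q25 = quantile(revenue, 0.25), q75 = quantile(revenue, 0.75))

# One candidate relationship plot: revenue vs. budget
ggplot(movies, aes(x = budget, y = revenue)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  scale_y_log10() +
  geom_smooth(method = "lm") +
  labs(x = "Budget (log scale)", y = "Revenue (log scale)")
```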

Requirements

  • Audience: studio executives familiar with basic statistics but not data scientists
  • Clear, well-labeled visualizations with written interpretation
  • Set echo: false in YAML so code doesn’t appear in the PDF

Submission

Submit both the .qmd source file and the .pdf output on Moodle under Case Study 1 Box Office Analysis (Part 3).

Part 4: Regression and prediction challenge (In class: 21-Oct)

This part will be completed in class on Tuesday October 21.

Objectives

In this final part, you will:

  • Build multiple linear regression models to predict log_revenue
  • Experiment with different predictor variables and model specifications
  • Submit predictions for a held-out test set
  • Compete with your classmates to achieve the lowest prediction error (RMSE)

The Challenge

You will work in pairs to:

  1. Explore the training data: Understand relationships between predictors and revenue
  2. Build regression models: Start with simple models and iterate to improve performance
  3. Evaluate your models: Calculate training RMSE to gauge performance
  4. Make predictions: Apply your best model to a held-out test set
  5. Submit for scoring: Submit predictions in CSV format for evaluation
  6. Climb the leaderboard: You may submit multiple times to improve your score

Getting started

  1. Download: tmdb_project_v2.zip
  2. Unzip to a convenient location on your computer
  3. Navigate to the folder and open tmdb.Rproj to open the project in RStudio

Once you have the project open, you will work with the script 04_prediction.R, which includes:

  • Starter code for loading data and building baseline models
  • Helper functions for calculating predictive accuracy using RMSE (defined below)
  • Examples of feature engineering
  • Instructions for creating the submission file
  • Tips for improving model performance
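
To make the workflow concrete, here is a rough sketch of a baseline model and a submission file. The object and column names (train, test, budget, runtime, id) and the file paths are assumptions; use whatever names 04_prediction.R actually provides:

```r
library(dplyr)

train <- readRDS("data/train.rds")  # assumed path; see 04_prediction.R
test  <- readRDS("data/test.rds")   # held-out films to predict

# Baseline: a simple linear model with a couple of predictors
fit <- lm(log_revenue ~ budget + runtime, data = train)
summary(fit)

# Training RMSE (formula given under Evaluation below)
train_pred <- predict(fit, newdata = train)
sqrt(mean((train$log_revenue - train_pred)^2))

# Predictions for the held-out test set, in the required submission format
submission <- tibble(
  id                    = test$id,
  predicted_log_revenue = predict(fit, newdata = test)
)
readr::write_csv(submission, "predictions_TEAMNAME_1.csv")
```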

Evaluation

Your predictions will be scored using RMSE (Root Mean Squared Error):

\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}\]

where \(y_i\) is the actual log revenue and \(\hat{y}_i\) is your predicted log revenue.

Lower RMSE is better!
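
Written as an R helper, this is one line (the starter code includes its own version; this sketch is just the formula above translated directly):

```r
# Root mean squared error between actual and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# e.g., rmse(train$log_revenue, predict(fit, newdata = train))
```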

Submission requirements

During class:

  • Submit predictions as a CSV file with two columns: id and predicted_log_revenue
  • Name your file predictions_TEAMNAME_i.csv (with “TEAMNAME” replaced by your team name, and i indexing your team’s submissions)
  • You may submit multiple times (increasing the index i each time) to improve your score
  • Results will be posted to a class leaderboard

At the end of class:

  • Submit your final R script (04_prediction.R) on Moodle under Case Study 1 Box Office Analysis (Part 4)

Grading (20 points total)

  • Completeness of code (20 points): Did you build models and generate predictions?
  • Prediction accuracy (up to 5 bonus points): Bonus points for top 3 pairs:
    • 1st place: 5 bonus points
    • 2nd place: 3 bonus points
    • 3rd place: 1 bonus point

Tips for success

  • Start simple: Begin with basic models using a few predictors
  • Feature engineering: Try creating new variables (e.g., release_year, interaction terms, transformations); see the sketch after this list
  • Balance complexity: More complex models may overfit and perform worse on test data
  • Iterate quickly: Don’t spend too long perfecting one model - try multiple approaches
  • Have fun! This is a friendly competition to reinforce regression concepts
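
For instance, a hedged sketch of the feature-engineering tip above, assuming a release_date column stored as a Date or "YYYY-MM-DD" string and a budget column (the file path and variable names are assumptions; adapt to the actual training data):

```r
library(dplyr)

train <- readRDS("data/train.rds")  # assumed path; see 04_prediction.R

train <- train |>
  mutate(
    release_year = as.integer(substr(release_date, 1, 4)),  # works for Date or "YYYY-MM-DD"
    log_budget   = log1p(budget)                            # log(1 + budget), safe when budget is 0
  )

# Remember to apply the same transformations to the test set before predicting
fit <- lm(log_revenue ~ log_budget + runtime + release_year, data = train)
```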

Good luck! 🎬📊🏆