Predicting Movie Final Gross from Week 1 with XGBoost

This report walks through how I predict a film’s final box-office total (final_total) using only its first-week total (wk1_total). It’s a practical forecasting problem for programming, distribution, and marketing decisions.

Why it matters:

  • Early revenue signals inform marketing spend and screen allocation.
  • Simple, robust models let teams simulate scenarios quickly.

Pipeline at a glance:

  • Data: DuckDB database at data/numero.duckdb (AU releases 2023–2024)
  • Preprocessing: flatten/clean film records; handle missingness/outliers; engineer week-1/total features
  • Model: XGBoost regressor on log-transformed inputs/targets
  • Deployment: FastAPI service (api_min.py, multi_api.py) + optional MCP endpoints for chatbot and batch use

1) Dataset overview

  • Source: internal Australian box-office records (2023–2024) stored in DuckDB at data/numero.duckdb.
  • Tables typically include: films_basic_clean, films_clean, films_raw, films_strict, films_with_fallback.
  • Useful fields: wk1_total (first-week), final_total (cumulative), weekend grosses, screens, theatres, distributor_name, censor_rating, and dates.
  • Challenges: nested JSON in raw feeds, missing values, inconsistent naming, and heavy-tailed revenues (blockbusters).

The next cell lists tables and shows a small preview (if the DB is present).

Code
from pathlib import Path

try:
    import duckdb
    import pandas as pd
except Exception:
    duckdb = None
    pd = None

if duckdb and pd:
    try:
        db_rel = Path("data") / "numero.duckdb"
        if db_rel.exists():
            con = duckdb.connect(database=str(db_rel), read_only=True)
            try:
                tables = [t[0] for t in con.execute("SHOW TABLES").fetchall()]
                table = "films_clean" if "films_clean" in tables else (tables[0] if tables else None)
                if table:
                    df = con.execute(f"SELECT * FROM {table} LIMIT 5").df()
                    display(df)
                else:
                    print("No tables found in local DuckDB.")
            finally:
                con.close()
        else:
            print("Local DuckDB not found. Skipping preview.")
    except Exception:
        print("Data preview not available in this environment.")
else:
    print("DuckDB/pandas not available. Skipping preview.")
DB path: C:\Users\Nebula PC\Swinburne\Intern\boxai\data\numero.duckdb
Tables: ['features_afterN1_simple', 'features_afterN1_total', 'features_afterN1_week2', 'films_basic_clean', 'films_basic_clean_features_cleaned', 'films_basic_clean_features_cleaned_predictions', 'films_clean', 'films_raw', 'films_strict', 'films_with_fallback', 'predictions_final_afterN']
Previewing table: films_basic_clean
film_id title title_norm play_week previous_week_rank distributor_id distributor_name distributor_norm censor_rating gross_weekend ... cumulative_gross_prev_weekend screens_prev_weekend theatres_prev_weekend gross_prev_week cumulative_gross_prev_week screens_prev_week theatres_prev_week is_weekend_gross_missing gross_weekend_final week_date
0 45953 Scream VI scream vi 1 <NA> 6 Paramount paramount MA15+ 264206957.0 ... NaN <NA> <NA> 28513503.0 2.851350e+07 186 186 False 264206957.0 2023-03-09
1 38213 Creed III creed iii 2 1 3 Warner Bros. warner bros. M 203593442.0 ... 3.347837e+08 430 274 413886876.0 4.138869e+08 430 274 False 203593442.0 2023-03-09
2 38819 65 65 1 <NA> 2 Sony sony M 118723573.0 ... NaN <NA> <NA> NaN NaN <NA> <NA> False 118723573.0 2023-03-09
3 19466 Ant-Man and The Wasp: Quantumania ant-man and the wasp: quantumania 4 2 7 Walt Disney walt disney M 76090305.0 ... 1.367922e+09 511 308 175583178.0 1.399696e+09 511 309 False 76090305.0 2023-03-09
4 47449 Cocaine Bear cocaine bear 3 3 4 Universal universal MA15+ 49712489.0 ... 2.884566e+08 296 266 123413914.0 3.154096e+08 296 266 False 49712489.0 2023-03-09

5 rows × 24 columns

2) Data cleaning & preprocessing

Key steps I applied (see the exploration notebooks for full details):

  • Flatten raw JSON payloads and explode film arrays into a row-per-film table.
  • Standardize names (strip whitespace, normalize casing), cast numeric fields, and deduplicate.
  • Handle missingness: treat null-like tokens as missing; leave true NaNs intact for model-aware handling.
  • Feature selection: use wk1_total as the input and final_total as the target for this model.
  • Transformations: apply log1p to both input and target to stabilize the heavy-tailed revenue distributions.
  • Outliers: rely on the log transformation for robustness rather than hard clipping.
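The steps above can be sketched on toy rows (column names follow the table preview in section 1; the sample values and deduplication rule are illustrative, not the exact notebook logic):

```python
import numpy as np
import pandas as pd

# Illustrative raw rows, mimicking fields from the films table.
raw = pd.DataFrame({
    "title": ["  Scream VI ", "Creed III", "Creed III"],
    "wk1_total": ["28513503", None, "41388687"],
    "final_total": [52000000.0, 61000000.0, 61000000.0],
})

# Standardize names: strip whitespace, normalize casing.
raw["title_norm"] = raw["title"].str.strip().str.lower()

# Cast numeric fields; null-like tokens become true NaN for model-aware handling.
raw["wk1_total"] = pd.to_numeric(raw["wk1_total"], errors="coerce")

# Deduplicate on the normalized key, preferring the most complete row.
clean = (
    raw.sort_values("wk1_total", na_position="first")
       .drop_duplicates("title_norm", keep="last")
)

# log1p stabilizes the heavy-tailed revenue distributions.
clean["log_wk1"] = np.log1p(clean["wk1_total"])
clean["log_final"] = np.log1p(clean["final_total"])
print(clean[["title_norm", "log_wk1", "log_final"]])
```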

3) Exploratory data analysis (EDA)

For the web-friendly version, I have embedded static figures generated during exploration:

  • Correlation heatmap of key numerical features
  • Missingness by column
  • Top distributors and top titles by volume/revenue
  • A line graph showing the weekly revenue decline for Avatar

All images are served from notebooks/reports/figs/ to avoid live data loading online.

EDA figures (static for publishing)

  • Correlation heatmap

  • Missingness by column

  • Top distributors

  • Top titles

  • Avatar weekly revenue trend (decline over time)

Notes:

  • These PNGs live under notebooks/reports/figs/.
  • Replace them with updated images anytime; the report will embed whatever is present at those paths.
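The missingness figure, for instance, reduces to a one-liner per column. A minimal sketch on toy data (the real notebook runs this over films_basic_clean before plotting):

```python
import pandas as pd

# Toy frame mimicking fields with gaps, like the preview in section 1.
df = pd.DataFrame({
    "wk1_total": [28_513_503.0, None, 41_388_687.0, None],
    "screens": [186, 430, None, 296],
    "censor_rating": ["MA15+", "M", "M", None],
})

# Percentage of missing values per column, worst-first; this series
# is what the bar chart in the static figure visualizes.
missing_pct = df.isna().mean().sort_values(ascending=False) * 100
print(missing_pct.round(1))
```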

4) Modeling with XGBoost

Why XGBoost:

  • Strong baseline for tabular regression; handles non-linearities and interactions well
  • Works nicely with log-transformed revenues (heavy tails)
  • Mature tooling and fast training

Training recipe (concise):

  • Split: train/validation/test (holdout)
  • Transform: wk1_total and target final_total via log1p/expm1
  • Objective: squared error; early stopping over trees
  • Report: RMSE, MAE, MAPE_eps1, SMAPE, R²
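The reported error measures can be pinned down precisely. Here is a sketch of the definitions as I understand them — in particular, I assume the `eps1` suffix means adding 1 to the denominator so near-zero revenues cannot blow up the percentage, and I use the common symmetric form of SMAPE:

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Compute the holdout metrics reported for the final_total model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        # eps=1 in the denominator guards against division by ~zero revenues.
        "mape_eps1": float(np.mean(np.abs(err) / (np.abs(y_true) + 1)) * 100),
        "smape": float(np.mean(2 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100),
        "wape": float(np.sum(np.abs(err)) / np.sum(np.abs(y_true)) * 100),
        "r2": float(1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)),
    }

report = regression_report([100.0, 200.0, 400.0], [110.0, 190.0, 380.0])
print(report)
```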

The next cell demonstrates the log1p/expm1 round-trip used at inference time; the holdout metrics saved in the repo artifacts are shown below it.

Code
import math

# Round-trip demo: transform a week-1 total into log space and back.
wk1_total = 1_000_000.0
log_space = math.log1p(wk1_total)
y_pred = log_space  # a trained model would produce its prediction in log space here
final_total_pred = math.expm1(y_pred)

print(f"Example transform round-trip → predicted final_total: {final_total_pred:,.0f}")
Loaded metrics from c:\Users\Nebula PC\Swinburne\Intern\boxai\boxoffice\final_total\training_metrics.json
- RMSE: 199,400,512.8425
- MAE: 47,109,513.7283
- MAPE_eps1: 32.29%
- SMAPE: 31.38%
- WAPE: 31.91%
- R2: 0.8204

5) Saving & deployment

Saving artifacts:

  • I persist the XGBoost booster plus JSON sidecars: metadata.json, schema.json, training_metrics.json.
  • See src/boxai/models/final_total_predictor.py for a minimal loader with log1p/expm1 transforms.

FastAPI service:

  • Minimal: api_min.py — POST /predict with { "wk1_total": <float> }
  • Multi-model ready: multi_api.py — GET /healthz, GET /predict?wk1_total= with a FINAL_TOTAL_ARTIFACTS env override
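A client call against the minimal service might look like the following sketch. It assumes the service is running locally on port 8000 with the POST /predict contract described above (the `api_min:app` module:attribute name for uvicorn is also an assumption), and it degrades gracefully when no server is reachable:

```python
import json
import urllib.request

payload = {"wk1_total": 1_000_000.0}
req = urllib.request.Request(
    "http://localhost:8000/predict",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
try:
    # A running service would return the predicted final_total as JSON.
    with urllib.request.urlopen(req, timeout=2) as resp:
        print(json.loads(resp.read()))
except OSError:
    print("Service not reachable; start it first, e.g. `uvicorn api_min:app`.")
```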

MCP (Model Context Protocol):

  • Example endpoints in experiments/MCP_Model/boxoffice_api.py: /mcp/metadata and /mcp/predict for chatbot/batch integrations.

The next cell looks up the saved holdout metrics in the repo artifacts; a simple architecture diagram of data → model → API → chatbot follows.

Code
import json
from pathlib import Path


def load_metrics():
    bases = [Path.cwd()] + list(Path.cwd().parents)[:4]
    candidates = []
    for base in bases:
        candidates.extend([
            base / "boxoffice" / "final_total" / "training_metrics.json",
            base / "experiments" / "final_total_baseline" / "training_metrics.json",
        ])
    for p in candidates:
        if p.exists():
            try:
                with open(p, "r", encoding="utf-8") as f:
                    return json.load(f)
            except Exception:
                continue  # unreadable file: keep trying the remaining candidates
    return None


metrics = load_metrics()
if metrics:
    for k in ["rmse", "mae", "mape_eps1", "smape", "wape", "r2"]:
        v = metrics.get(k)
        if isinstance(v, (int, float)):
            print(f"{k.upper()}: {v:.4f}")
        elif v is not None:
            print(f"{k.upper()}: {v}")
else:
    print("Holdout metrics not found in repo artifacts.")

+-------------------+      +---------------------+      +------------------+      +---------------------------+
|   DuckDB (data)   | ---> | Preprocess/Features | ---> |   XGBoost Model  | ---> |  FastAPI (predict/health) |
+-------------------+      +---------------------+      +------------------+      +---------------------------+
                                                                                           |
                                                                                           v
                                                                                   +-----------------+
                                                                                   |  MCP / Chatbot  |
                                                                                   +-----------------+

6) Chatbot integration

  • The FastAPI service can be wrapped with MCP-style endpoints for LLM tools.
  • See experiments/MCP_Model/boxoffice_api.py for /mcp/metadata and /mcp/predict.
  • A chatbot can discover the model, validate schema, and send batches of instances.

Example workflow:

  1) Chatbot calls /mcp/metadata to learn inputs/outputs and health
  2) Chatbot sends a batch to /mcp/predict with validated instances
  3) Responses feed follow-up questions or “what-if” simulations
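To make this workflow concrete, here is a mocked end-to-end pass. The response shapes are assumptions based on the /mcp/metadata and /mcp/predict endpoint names, and the stub functions stand in for real HTTP calls a chatbot would make; the stub "model" is the identity in log space, as in the transform example earlier:

```python
import math

# Stub responses standing in for the MCP endpoints (hypothetical shapes).
def mcp_metadata():
    return {"inputs": ["wk1_total"], "outputs": ["final_total_pred"], "healthy": True}

def mcp_predict(instances):
    # Stand-in model: log1p → predict (identity here) → expm1.
    return [
        {"final_total_pred": math.expm1(math.log1p(x["wk1_total"]))}
        for x in instances
    ]

# 1) Discover schema and health.
meta = mcp_metadata()
assert meta["healthy"] and "wk1_total" in meta["inputs"]

# 2) Send a validated batch of instances.
batch = [{"wk1_total": 500_000.0}, {"wk1_total": 2_000_000.0}]
preds = mcp_predict(batch)

# 3) Feed results into follow-up "what-if" questions.
for row, pred in zip(batch, preds):
    print(f"wk1={row['wk1_total']:,.0f} → final≈{pred['final_total_pred']:,.0f}")
```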