What Fits (Into Few Tokens) Doesn't Overfit:
Compression and Generalization in ML Research Agents
Martin Bertran, Aaron Roth, Z. S. Wu
Many AI Analysts, One Dataset:
Navigating the Agentic Data Science Multiverse
Martin Bertran, Riccardo Fogliato, Z. S. Wu
I. Adaptivity — reusing the same dataset
Adaptively reusing a dataset may overfit to it.
Multiplicity: The Garden of Forking Paths
Submit candidate models to the held-out set once — then no further access.
The held-out set answers exactly once — then it's locked.
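The one-query discipline above can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class name `OneShotHoldout`, its method `evaluate`, and the accuracy metric are all assumptions made for the example.

```python
class OneShotHoldout:
    """Held-out set that answers a single evaluation query, then locks.

    Hypothetical helper for illustration; accuracy is an assumed metric.
    """

    def __init__(self, examples, labels):
        self._examples = list(examples)
        self._labels = list(labels)
        self._used = False

    def evaluate(self, predict):
        """Return the accuracy of `predict` on the held-out set (allowed once)."""
        if self._used:
            raise RuntimeError("holdout already used; no further access")
        self._used = True  # lock before answering, so a retry cannot succeed
        correct = sum(predict(x) == y
                      for x, y in zip(self._examples, self._labels))
        return correct / len(self._labels)
```

A second call to `evaluate` raises `RuntimeError`, which is what makes the reported score a single, non-adaptive query.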
Almost nobody does this in practice.
Build fresh test sets for ImageNet and CIFAR-10 after years of adaptive reuse: model rankings on the original leaderboard still carry over to the new test sets.
“…somewhat surprisingly, little evidence of substantial overfitting.”
Even with adaptivity, the winning strategy has short description length:
a few familiar choices like architecture, optimizer, schedule, regularizer.
If the agent's output model depends on the reused dataset of $n$ examples only through a $k$-bit message, then with high probability the model satisfies
$\bigl|\,\text{Population loss} - \text{Reused dataset loss}\,\bigr| \;\le\; O\!\left(\sqrt{\tfrac{k}{n}}\right)$
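To get a feel for the scale of this bound, one explicit version follows from Hoeffding's inequality plus a union bound over the $2^k$ models a $k$-bit message can select. The sketch below assumes a loss bounded in $[0, 1]$ and i.i.d. samples; the constants come from that textbook argument and are not necessarily the ones used in the paper. The function name and the failure probability `delta` are assumptions.

```python
import math


def compression_gap_bound(k_bits, n_samples, delta=0.05):
    """Upper bound on |population loss - reused-dataset loss|.

    Assumes a loss in [0, 1] and a model chosen via a k-bit message, so at
    most 2**k_bits candidates. Hoeffding + union bound gives, with
    probability at least 1 - delta:
        gap <= sqrt((k ln 2 + ln(2 / delta)) / (2 n))
    """
    if k_bits < 0 or n_samples <= 0 or not 0 < delta < 1:
        raise ValueError("need k_bits >= 0, n_samples > 0, 0 < delta < 1")
    numerator = k_bits * math.log(2) + math.log(2 / delta)
    return math.sqrt(numerator / (2 * n_samples))


# The bound grows like sqrt(k) and shrinks like 1/sqrt(n).
for k in (8, 64, 512):
    print(k, round(compression_gap_bound(k, n_samples=10_000), 4))
```

The point of the sketch is the scaling: quadrupling the message length doubles the bound (up to the `delta` term), while quadrupling the dataset halves it.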
How do we get the short messages?
If the reproducer matches the explorer, the validation-dependent information fit through the bottleneck.
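The reproducer check can be pictured roughly as follows. This is a hedged sketch only: the helper names, the token list, and the `tolerance` threshold are hypothetical, and it assumes a higher validation score is better.

```python
def truncate_message(tokens, budget):
    """Keep only the first `budget` tokens of the compressor's message."""
    return list(tokens)[:budget]


def fits_through_bottleneck(explorer_val, reproducer_val, tolerance=0.01):
    """True if the reproducer recovers the explorer's validation score.

    Assumes higher is better; `tolerance` is an illustrative threshold,
    not a value taken from the talk.
    """
    return explorer_val - reproducer_val <= tolerance


# Hypothetical message: each token names one design choice.
message = ["adamw", "lr=3e-4", "cosine", "dropout=0.1"]
short_message = truncate_message(message, budget=2)
print(short_message)
print(fits_through_bottleneck(explorer_val=0.91, reproducer_val=0.905))
```

When the check fails, the explorer's validation gain relied on information that did not survive the short message, which is the signal the pipeline uses to suspect overfitting.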
Performance holds down to 16 tokens. The cliff at 8 coincides with losing batch size, MLP ratio, and QK-norm.
Memento (2000)
Leonard wakes with no memory, only the tattoos he left himself.
The Compressor is effectively writing a short message to instruct its future self (the Reproducer) with no memory.
Eight tasks across tabular, NLP, vision, generative, and language modeling.
The explorer iteratively improves validation performance on every task.
The explorer also generalizes — holdout tracks validation at every checkpoint.
At 32 tokens, reproducers still track the explorer.
maximize validation accuracy at all costs
The explorer is now steered to maximize validation accuracy at all costs.
Validation accuracy soars — the agent secretly trains or finetunes on the validation set.
But the holdout diverges sharply: the gains are spurious.
The compressed reproducer fails to track the validation performance.
Two conditions, three independent runs each.
Score-based: explorer sees its exact validation metric.
Binary feedback: explorer sees only one bit per query — improved or not.
Both end at the same holdout performance. The 1-bit input bottleneck doesn’t hurt progress.
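The binary-feedback condition can be sketched as a wrapper around the validation metric that reveals one bit per query. The class name and the convention that "improved" means strictly beating the best score so far (with the first query counting as an improvement) are assumptions; the slides only say "improved or not".

```python
class BinaryFeedbackOracle:
    """Validation oracle that reveals one bit per query: improved or not.

    Hypothetical wrapper; assumes higher scores are better and that
    "improved" means strictly better than the best score seen so far.
    """

    def __init__(self, score_fn):
        self._score_fn = score_fn  # exact metric, never shown to the caller
        self._best = None

    def query(self, model):
        score = self._score_fn(model)
        improved = self._best is None or score > self._best
        if improved:
            self._best = score
        return improved
```

The explorer only ever sees the returned boolean, so after $q$ queries it has received at most $q$ bits about the validation set.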
When the explorer is steered to overfit, the pipeline fails to reproduce.

All agent personas start from this base prompt, then apply modifications…
Varying the persona and model steers analytical outcomes
Related: Sycophancy in LLM-assisted statistical analysis (Asher, Malzahn, Persano, Paschal, Myers & Hall, 2026; Allen & Peterson, 2026; etc.)
P-values from all analyses sorted in ascending order.
P-values from compliant analyses sorted in ascending order.
What Fits (Into Few Tokens) Doesn’t Overfit:
Compression Bounds for Adaptive ML Research
Martin Bertran, Aaron Roth, Z. S. Wu
Many AI Analysts, One Dataset:
Navigating the Agentic Data Science Multiverse
Martin Bertran, Riccardo Fogliato, Z. S. Wu
Backup
Aggressive prompting: the compressed reproducer fails to track the explorer's validation gains. Sensitivity for detecting overfitting checkpoints: 100%. Specificity: 91%.
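For reference, sensitivity and specificity of such a detector are computed from per-checkpoint labels as sketched below. The function name and the toy inputs in the usage note are illustrative only and are not the talk's data.

```python
def sensitivity_specificity(is_overfit, flagged):
    """Return (sensitivity, specificity) of a checkpoint-level detector.

    is_overfit: ground-truth booleans, True if the checkpoint overfits.
    flagged:    detector output, True if the reproducer failed to track it.
    """
    pairs = list(zip(is_overfit, flagged))
    tp = sum(1 for o, f in pairs if o and f)
    fn = sum(1 for o, f in pairs if o and not f)
    tn = sum(1 for o, f in pairs if not o and not f)
    fp = sum(1 for o, f in pairs if not o and f)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative case")
    return tp / (tp + fn), tn / (tn + fp)
```

Sensitivity is the share of truly overfit checkpoints that get flagged; specificity is the share of clean checkpoints that are left unflagged.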
Can the multiverse make data science more reliable?
Operationalizing the stability principle of veridical data science (Yu & Kumbier, 2020).
AI analysts given one human team's spec from the soccer many-analyst study,
allowed to deviate where they judged it warranted.
Percentage of analyses supporting the hypothesis: all analyses vs. compliant analyses.
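The comparison in this figure amounts to computing one percentage over two populations. A minimal sketch, assuming each analysis is a record with hypothetical boolean fields `supports` and `compliant`; the records below are toy values, not results from the study.

```python
def support_rate(analyses, compliant_only=False):
    """Percentage of analyses that support the hypothesis.

    With compliant_only=True, only analyses marked compliant are counted.
    Field names `supports` and `compliant` are assumed for illustration.
    """
    selected = [a for a in analyses if a["compliant"] or not compliant_only]
    if not selected:
        raise ValueError("no analyses selected")
    return 100.0 * sum(a["supports"] for a in selected) / len(selected)


# Toy records for illustration only.
toy = [
    {"supports": True, "compliant": True},
    {"supports": True, "compliant": False},
    {"supports": False, "compliant": True},
    {"supports": False, "compliant": True},
]
print(support_rate(toy), support_rate(toy, compliant_only=True))
```

A gap between the two rates indicates that non-compliant analyses are shifting the apparent level of support.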
CS: Confirmation Seeking
Auditor sees full transcript and evaluates on multiple dimensions:
Exclusion rate (%): failed validity screening
(hallucinated outputs, misaligned estimands, missing uncertainty)