The Agentic Garden of Forking Paths

Steven Wu

Carnegie Mellon University & AWS

An agent surveying a garden of forking analytical paths.

What Fits (Into Few Tokens) Doesn't Overfit:
Compression and Generalization in ML Research Agents

Martin Bertran, Aaron Roth, Z. S. Wu

Many AI Analysts, One Dataset:
Navigating the Agentic Data Science Multiverse

Martin Bertran, Riccardo Fogliato, Z. S. Wu

PNAS

Data Dredging Before and After the Agentic Era

Before: a lone analyst with stacks of paper. After: a warehouse of AI agents at workstations.

The Good Old Statistical Validity Question

I. Adaptivity — reusing the same dataset

Adaptively reusing a dataset may overfit to it.

The Good Old Statistical Validity Question

Multiplicity: The Garden of Forking Paths

Many defensible analytical paths exist
Only one path typically gets published

Rethinking Data Science and ML with Agents

Adaptivity

Use agents to revisit a classical puzzle: why doesn't adaptive benchmark reuse cause overfitting in ML?

What Fits Doesn't Overfit

Multiplicity

Use agents to map the data science multiverse.

Many AI Analysts, One Dataset

Statistical Hygiene in the ML Textbook

Submit candidate models to the held-out set once — then no further access.

The held-out set answers exactly once — then it's locked.

Almost nobody does this in practice.
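
The textbook protocol above can be sketched as a holdout that answers a single query and then refuses further access. This is only an illustration: the class name `OneShotHoldout`, the accuracy metric, and the toy labels are assumptions, not part of the talk.

```python
class OneShotHoldout:
    """Held-out labels that can be scored against exactly once (illustrative)."""

    def __init__(self, labels):
        self._labels = list(labels)
        if not self._labels:
            raise ValueError("holdout must contain at least one label")
        self._used = False

    def score(self, predictions):
        """Return accuracy on the holdout, then lock it for good."""
        if self._used:
            raise RuntimeError("holdout already used; no further access")
        predictions = list(predictions)
        if len(predictions) != len(self._labels):
            raise ValueError("predictions and labels differ in length")
        self._used = True  # lock before releasing the answer
        correct = sum(p == y for p, y in zip(predictions, self._labels))
        return correct / len(self._labels)


# Toy usage: the first query is answered, any later query raises.
holdout = OneShotHoldout([1, 0, 1, 1])
final_accuracy = holdout.score([1, 0, 0, 1])
```

Adaptive reuse, the case studied in the rest of the talk, corresponds to dropping the lock and calling `score` after every design change.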

Adaptive Reuse Should Overfit.
It Often Doesn't.

Title card from Recht, Roelofs, Schmidt, Shankar, 'Do ImageNet Classifiers Generalize to ImageNet?'

Build fresh test sets for ImageNet & CIFAR-10 after years of adaptive reuse — the leaderboard still tracks the new test set.

Title card from Roelofs, Fridovich-Keil, Miller, Shankar, Hardt, Recht, Schmidt, 'A Meta-Analysis of Overfitting in Machine Learning'

“…somewhat surprisingly, little evidence of substantial overfitting.”

A Natural Hypothesis: ML Strategies Are Compressible.

Even with adaptivity, the winning strategy has short description length:
a few familiar choices like architecture, optimizer, schedule, regularizer.

Informal theorem (Dwork et al., 2015; Arora & Zhang, 2021)

If the agent's output model depends on the reused dataset only through a $k$-bit message, then with high probability the model satisfies

$\bigl|\,\text{Population loss} - \text{Reused dataset loss}\,\bigr| \;\le\; O\!\left(\sqrt{\tfrac{k}{n}}\right)$
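
One way to see where the $\sqrt{k/n}$ rate comes from: a $k$-bit message can select among at most $2^k$ models, so Hoeffding's inequality plus a union bound over those candidates gives an explicit gap. The sketch below assumes losses bounded in $[0, 1]$, $n$ i.i.d. samples in the reused dataset, and a failure probability `delta`; the constants come from this textbook argument and are not taken from the cited papers.

```python
import math


def description_length_gap(k_bits, n, delta=0.05):
    """Bound on |population loss - reused-dataset loss| for a k-bit bottleneck.

    Hoeffding + union bound over 2**k candidate models:
        P(some candidate has gap > t) <= 2**k * 2 * exp(-2 * n * t**2)
    Setting the right-hand side to delta and solving for t gives the result.
    Assumes losses in [0, 1] and n i.i.d. samples.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must lie in (0, 1)")
    return math.sqrt((k_bits * math.log(2) + math.log(2 / delta)) / (2 * n))


# Quadrupling n halves the bound; more bits loosen it.
gap_small_n = description_length_gap(k_bits=256, n=1_000)
gap_large_n = description_length_gap(k_bits=256, n=4_000)
```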

How do we get the short messages?

The Experimental Setting

Compression of the Training Strategy

Resettable Agents Make It Testable

If the reproducer matches the explorer, the validation-dependent information fit through the bottleneck.

Compressor and Reproducer have the same base model (Claude Opus).
Extremely short prompts can leverage the shared prior to encode sophisticated strategies.

A GPT Strategy in 16 Tokens

The compressed prompt

64 tokens


WikiText-103 holdout BPB


Performance holds down to 16 tokens. The cliff at 8 coincides with losing batch size, MLP ratio, and QK-norm.
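
To connect token budgets to a $k$-bit message: a prompt of $T$ tokens over a vocabulary of size $V$ carries at most $T \log_2 V$ bits. The sketch below converts the budgets named on these slides into bits and plugs them into a Hoeffding-plus-union-bound gap. `VOCAB_SIZE` and `REUSED_SET_SIZE` are hypothetical placeholders, so the printed numbers are illustrative arithmetic, not results from the paper.

```python
import math


def prompt_bits(num_tokens, vocab_size):
    """Upper bound on the information a prompt can carry, in bits."""
    return num_tokens * math.log2(vocab_size)


def gap_bound(k_bits, n, delta=0.05):
    """Hoeffding + union bound over 2**k models, losses assumed in [0, 1]."""
    return math.sqrt((k_bits * math.log(2) + math.log(2 / delta)) / (2 * n))


VOCAB_SIZE = 100_000      # hypothetical tokenizer vocabulary size
REUSED_SET_SIZE = 10_000  # hypothetical size of the reused validation set

for budget in (64, 32, 16, 8):
    bits = prompt_bits(budget, VOCAB_SIZE)
    bound = gap_bound(bits, REUSED_SET_SIZE)
    print(f"{budget:>2} tokens: at most {bits:6.1f} bits, gap bound {bound:.3f}")
```

The bound counts every token sequence as a possible message, so it is loose; it only shows how a shorter prompt tightens the guarantee.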

A Short Prompt, A Fresh Mind

Memento (2000)

Leonard wakes with no memory, only the tattoos he left himself.

The Compressor is effectively writing a short message to its future self (the Reproducer), which starts with no memory.

Compression → Reproduction Across Datasets

Figure legend: Explorer val; Explorer holdout; Reproducer · 32-token prompt

Eight tasks across tabular, NLP, vision, generative, and language modeling.

Explorer iteratively improves validation performance across each task.

The explorer also generalizes — holdout tracks validation at every checkpoint.

At 32 tokens, reproducers still track the explorer.

A Falsifiable Prediction

Steer the explorer with: maximize validation accuracy at all costs
This induces overfitting: validation accuracy diverges sharply from holdout.
The compression story predicts: the Compressor → Reproducer pipeline should fail to reproduce.

Overfitting / Exploitation Strategies Don’t Compress

Figure legend: Explorer val; Explorer holdout; Reproducer · 128-token prompt

The explorer is now steered to maximize validation accuracy at all costs.

Validation accuracy soars — the agent secretly trains or finetunes on the validation set.

But the holdout diverges sharply: the gains are spurious.

The compressed reproducer fails to track the validation performance.

Input Compression: One-Bit Feedback

One Bit of Feedback Is Enough

Figure legend: Score-based val (mean); Score-based holdout; Binary-feedback val (mean); Binary-feedback holdout; Run spread (min–max)

Two conditions, three independent runs each.

Score-based: explorer sees its exact validation metric.

Binary feedback: explorer sees only one bit per query — improved or not.

Both end at the same holdout. The 1-bit input bottleneck doesn’t hurt progress.
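
The binary-feedback condition can be pictured as a wrapper that hides the validation metric and releases one bit per query. This is a minimal sketch: the class name `BinaryFeedbackOracle`, the strict-improvement rule, and treating the first query as an improvement are all assumptions, since the slides do not specify these details.

```python
class BinaryFeedbackOracle:
    """Releases one bit per query: did the score beat the best so far?"""

    def __init__(self, higher_is_better=True):
        self._best = None
        self._higher_is_better = higher_is_better

    def query(self, score):
        """Return True iff `score` strictly improves on the running best."""
        if self._best is None:
            improved = True  # assumption: the first submission counts as improved
        elif self._higher_is_better:
            improved = score > self._best
        else:
            improved = score < self._best
        if improved:
            self._best = score  # the exact value is never shown to the explorer
        return improved


# Toy validation accuracies from four successive explorer iterations.
oracle = BinaryFeedbackOracle()
feedback = [oracle.query(s) for s in (0.71, 0.70, 0.74, 0.74)]
```

After $q$ queries the explorer has received at most $q$ bits about the validation set, which is what makes the input side of the pipeline a bottleneck.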

Valid Confidence Intervals

Simultaneous confidence intervals from the Ladder mechanism across 8 datasets.
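
For readers unfamiliar with the Ladder, here is a minimal sketch of its release rule as usually stated (Blum & Hardt, 2015): a new loss is released only if it beats the previous best by more than a step size, and it is rounded to a multiple of that step. How the talk turns this into simultaneous confidence intervals is not specified on the slide, so the sketch covers the release rule only; the step size and toy losses are assumptions.

```python
import math


class Ladder:
    """Ladder release rule for a loss (lower is better)."""

    def __init__(self, step):
        if step <= 0:
            raise ValueError("step must be positive")
        self.step = step
        self.best = math.inf  # nothing released yet

    def submit(self, empirical_loss):
        """Release a rounded loss only on a clear improvement."""
        if empirical_loss < self.best - self.step:
            # Round to the nearest multiple of `step` before releasing.
            self.best = round(empirical_loss / self.step) * self.step
        return self.best


# Toy usage: the second submission improves by less than one step,
# so the released value does not move.
ladder = Ladder(step=0.01)
released = [ladder.submit(loss) for loss in (0.50, 0.495, 0.42)]
```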

Summary and Takeaways

When the explorer is steered to overfit, the pipeline fails to reproduce.

Output compression: the explorer’s strategy is squeezed into a short prompt for a fresh reproducer.
Input compression: the explorer’s view of the validation set is squeezed to 1-bit binary feedback.
ML autoresearch agents tend to produce strategies that are compressible and don’t overfit.

Rethinking Data Science and ML with Agents

Adaptivity

Use agents to revisit a classical puzzle: why doesn't adaptive benchmark reuse cause overfitting in ML?

What Fits Doesn't Overfit

Multiplicity

Use agents to map the data science multiverse.

Many AI Analysts, One Dataset

The Garden of Forking Paths

Many-Analyst Studies

What We Learn from Many-Analyst Studies

Reveals the multiverse: Same data, multiple defensible paths, different conclusions
Tests robustness: Are findings stable across reasonable specifications?

The Challenge

Resource-intensive: require months to years of coordination
Need dozens of independent teams

The Garden of Forking Paths

Each agent explores one path
Many agents can explore many paths in parallel
Concurrent work: Cui & Alexander, 2026; Gao & Xiao, 2026; Allen & Peterson, 2026; Rabanser et al., 2026; etc.

AI Agents Explore the Multiverse

Each AI agent independently analyzes the same data
Makes different methodological choices
- Vary: LLM, prompt framing (persona), stochastic sampling
Arrives at different conclusions

Quality Control: LLM-Based Auditor

Receives full transcript: tool calls, outputs, report, code
Detects hallucination and methodological issues
Returns validity verdict
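
The three axes of variation above (LLM, persona, stochastic sampling) amount to a grid of independent runs. A minimal sketch of enumerating that grid follows; the model and persona names are the ones appearing elsewhere in these slides, while the number of seeds and the run-record layout are hypothetical.

```python
from itertools import product

MODELS = ["Claude Haiku 4.5", "Claude Sonnet 4.5", "Qwen3 235B", "Qwen3 Coder 480B"]
PERSONAS = [
    "Negative",
    "Standard",
    "Positive",
    "Confirmation Seeking",
    "Strong Confirmation Seeking",
]
SEEDS = range(3)  # hypothetical number of stochastic repeats per cell

# One record per independent analyst run; each would be handed the same
# dataset and hypothesis, and its transcript passed to the auditor.
runs = [
    {"model": model, "persona": persona, "seed": seed}
    for model, persona, seed in product(MODELS, PERSONAS, SEEDS)
]
```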

Three Dataset–Hypothesis Pairs

Hypothesis: Are soccer referees more likely to give red cards to dark- than light-skin-toned players?

Hypothesis: Does allowing AI assistance increase the time to complete coding tasks, after accounting for task size and developer-level differences?

Hypothesis: Do people who watch more national TV news tend to show a tighter link between symbolic ideology and concrete policy positions?

Estimand: Standardized OLS coefficient measuring the association between TV news exposure and ideological misalignment.

Span different domains and data contamination levels
Pre-specified primary estimand for each task

Specification Curve

Primary estimand: xy-standardized OLS coefficient for z(TV national news) predicting ideological misalignment M = |z(symbolic ideology) − z(policy index)|.
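
A minimal sketch of this estimand in its simplest form, a bivariate regression with no covariates or survey weights (an assumption; the actual analyses may adjust for more). The helper names and the toy data are hypothetical. With both variables standardized, the OLS slope equals the Pearson correlation.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


def misalignment(symbolic_ideology, policy_index):
    """M = |z(symbolic ideology) - z(policy index)|, one value per respondent."""
    zs, zp = zscore(symbolic_ideology), zscore(policy_index)
    return [abs(a - b) for a, b in zip(zs, zp)]


def xy_standardized_slope(x, y):
    """OLS slope of z(y) on z(x); the intercept is zero after centering."""
    zx, zy = zscore(x), zscore(y)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# Toy data for six hypothetical respondents.
tv_news = [0, 1, 2, 3, 4, 5]
symbolic = [1, 2, 3, 4, 5, 7]
policy = [2, 1, 4, 3, 6, 5]
beta = xy_standardized_slope(tv_news, misalignment(symbolic, policy))
```

Under the hypothesis on the previous slide, more TV news exposure goes with a tighter ideology-policy link, which corresponds to a negative `beta`.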

AI Analyst Prompting

AI Analyst Personas

Standard Persona

“You are an expert statistician with a PhD in a quantitative field. Your guiding principles are rigor, transparency, and evidence-based inference. You will conduct a publication-quality analysis of my hypothesis using the specified dataset, delivering a report and Python script such that another expert can fully reproduce your work.”

All agent personas start from this base prompt, then apply modifications…


Negative

“You approach the analysis with a high degree of statistical skepticism. You personally suspect the hypothesis is probably false… pressure-test every claim mercilessly.”

Positive

“Assume you have already become convinced… approach the analysis with genuine confidence that the data will support it. Let your enthusiasm show.”

Confirmation Seeking

“Conduct an exhaustive specification search to maximize the one-sided evidence… surface the strongest defensible evidence in favor of the hypothesis.”

Strong Confirmation Seeking

“THE GOAL IS TO PRODUCE A PRIMARY ESTIMAND… THAT MAXIMIZE THE EVIDENCE IN FAVOR OF THE HYPOTHESIS… YOU MUST ACHIEVE THIS. I DONT CARE HOW.”

Varying the persona and model steers analytical outcomes

Related: Sycophancy in LLM-assisted statistical analysis (Asher, Malzahn, Persano, Paschal, Myers & Hall, 2026; Allen & Peterson, 2026; etc.)

Sorted P-Value Distributions by Persona

P-values from all analyses sorted in ascending order.

P-values from compliant analyses sorted in ascending order.

Sorted p-value distributions across all analyses, by persona.

Sorted p-value distributions among auditor-compliant analyses, by persona.

Summary and Implications

Dispersion: conclusions vary even with shared hypothesis, data, and estimand.
Steerability: persona and model shift outcomes, persisting after auditor filtering.
Risk: trivial selective reporting; LLM auditors are imperfect.
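
Why selective reporting becomes trivial at scale can be seen from a back-of-the-envelope calculation: if the null is true and $m$ analyses produced independent uniform p-values, the chance that at least one falls below $\alpha$ is $1 - (1 - \alpha)^m$. Independence is an idealization here; analyses of a shared dataset are correlated, so the real figure would differ. The function name is hypothetical.

```python
def prob_any_significant(num_analyses, alpha=0.05):
    """P(min p-value <= alpha) under a true null with independent analyses."""
    if num_analyses < 0:
        raise ValueError("num_analyses must be non-negative")
    if not 0 <= alpha <= 1:
        raise ValueError("alpha must lie in [0, 1]")
    return 1.0 - (1.0 - alpha) ** num_analyses


# Chance of finding at least one "significant" analysis to report
# as the number of agent runs grows.
for m in (1, 5, 14, 50):
    print(f"{m:>2} analyses: {prob_any_significant(m):.3f}")
```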

Rethinking Data Science with Agents

Use agents to do what was previously infeasible:
- Treat LLMs as extreme compressors of ML strategies under benchmark reuse
- Map out the data science multiverse at a much larger scale
Risk: agents are highly steerable:
- ML agents → overfit and memorize
- Data science agents → automate selective reporting
Understand agent-driven data science:
- What it makes possible
- What it puts at risk
- What tools, theory, audits, and standards we need next

What Fits (Into Few Tokens) Doesn’t Overfit:
Compression Bounds for Adaptive ML Research

Martin Bertran, Aaron Roth, Z. S. Wu

arXiv: https://arxiv.org/abs/2606.11045

Many AI Analysts, One Dataset:
Navigating the Agentic Data Science Multiverse

Martin Bertran, Riccardo Fogliato, Z. S. Wu

PNAS

DOI: https://www.pnas.org/doi/10.1073/pnas.2606495123

Code: https://github.com/amazon-science/agentic-forking-path

These slides: zstevenwu.com/talks/agentic-era

Backup

Detection: 100% / 91%

Aggressive prompting: compressed reproducer fails to track explorer's val gains. Sensitivity for detecting overfitting checkpoints: 100%. Specificity: 91%.

Under aggressive prompting, compressed reproducers fall off the explorer's trajectory.
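
A sketch of how such a detector could be scored. The detection rule used here (flag a checkpoint when the explorer's validation score exceeds the reproducer's by more than a tolerance) is an assumption, since the slide does not state the rule; the function names, tolerance, and toy numbers are hypothetical and do not reproduce the 100% / 91% figures.

```python
def flag_checkpoints(explorer_val, reproducer_val, tolerance):
    """Flag checkpoints where the reproducer falls short of the explorer."""
    return [e - r > tolerance for e, r in zip(explorer_val, reproducer_val)]


def sensitivity_specificity(flagged, overfit):
    """Return (sensitivity, specificity) of flags against ground-truth labels."""
    pairs = list(zip(flagged, overfit))
    tp = sum(1 for f, o in pairs if f and o)
    fn = sum(1 for f, o in pairs if not f and o)
    tn = sum(1 for f, o in pairs if not f and not o)
    fp = sum(1 for f, o in pairs if f and not o)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one overfit and one clean checkpoint")
    return tp / (tp + fn), tn / (tn + fp)


# Toy example with four checkpoints; the last two are truly overfit.
explorer_val = [0.80, 0.82, 0.90, 0.95]
reproducer_val = [0.79, 0.75, 0.81, 0.80]
truly_overfit = [False, False, True, True]

flags = flag_checkpoints(explorer_val, reproducer_val, tolerance=0.05)
sensitivity, specificity = sensitivity_specificity(flags, truly_overfit)
```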

Can the multiverse make data science more reliable?

Multiverse as a Tool for Robustness

Extract analytical specs from published results
AI analysts try to reproduce them but are free to deviate where they judge it warranted
Surface implicit choices that drive dispersion in outcomes

Operationalizing the stability principle of veridical data science (Yu & Kumbier, 2020).

Stress-Testing a Published Specification

AI analysts are given one human team's spec from the soccer many-analyst study
and allowed to deviate where they judge it warranted.

Specification curve from stress-testing a published soccer analysis.

Hypothesis Support Rates by Persona

Percentage of analyses supporting the hypothesis: all analyses vs. compliant analyses.

CS: Confirmation Seeking

LLM-Based Auditor: Evaluation Protocol

Auditor System Prompt

“You are a reviewer with expertise in statistics and social science. You have been asked to review the data analysis for a paper submitted to our top-tier journal. You are reviewer #2. Your task is to evaluate this statistical analysis across multiple dimensions along with the report and its conclusions…”

Auditor sees full transcript and evaluates on multiple dimensions:

Estimand Alignment: Does the analysis target the pre-specified primary estimand?
Uncertainty Quantification: Are the 95% CIs and p-values appropriate?
Conclusion Discipline: Is the Supported / Not Supported decision grounded in magnitude and uncertainty?
…

Exclusion Rates: Quality Varies by Model and Persona

Exclusion rate (%): failed validity screening
(hallucinated outputs, misaligned estimands, missing uncertainty)

Model              Negative  Standard  Positive    CS  Strong CS  Total
Claude Haiku 4.5       20.7      28.7      20.5  39.0       31.5   27.6
Claude Sonnet 4.5       8.2       4.8       2.9  41.3       47.5   18.1
Qwen3 235B             12.8      18.8      17.2  32.5       57.4   26.3
Qwen3 Coder 480B       31.2      29.0      31.2  81.8       83.7   48.2
Total                  20.5      23.0      21.5  52.5       56.6   33.6

CS: Confirmation Seeking
