Mind the Sim-to-Real Gap & Think Like a Scientist

Harsh Parikh, Gabriel Levin-Konigsberg, Dominique Perrault-Joncas, Alexander Volfovsky

May 20, 2026

arXiv:2605.21458v1

cs.AI (primary) · cs.LG · stat.ME

#732 of 2292 · Artificial Intelligence

Tournament Score: 1450 ± 42 (scale 1050 to 1800)

Win Rate: 63%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 6

Abstract

Suppose a planner has a pre-trained simulator of a sequential decision problem and the option to run real experiments in the field. The simulator is cheap to query but inherits confounding and drift from its calibration data. Experimentation is unbiased but consumes one real unit per trial. We study when, and how, the planner should supplement the simulator with experiments. We give three results. First, an extended simulation lemma decomposes the simulator's value error into a calibration--deployment shift that randomization can identify and a parametric residual that no further interaction can reduce. Second, the value gap between the simulator-optimal policy and the optimum splits into a local component, on states the deployed policy already visits, and a reachability component, on states it does not. The reachability component stays bounded away from zero at any horizon under purely passive learning. Third, we propose Fisher-SEP, a simulation-aided experimental policy (SEP) that minimizes the posterior predictive variance of a target policy's value, with reward-only and transition-only specializations. Two case studies illustrate the regimes. In a vending-machine supply chain, front-loaded experimentation overtakes posterior updating once the horizon is long enough to amortize the pilot. In an HIV mobile-testing example with a corridor that separates a well-surveilled region from a poorly-surveilled one, only designed exploration reaches the poorly-surveilled region.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a fundamental question at the intersection of reinforcement learning, causal inference, and experimental design: when should a planner with a pre-trained simulator supplement it with real-world experiments? The core novelty is a formal decomposition of the simulator's value gap into two structurally distinct components: a local gap (on states the deployed policy visits) that passive Bayesian updating can close, and a reachability gap (on states outside the deployed policy's support) that requires deliberate experimentation. This decomposition is formalized through an extended simulation lemma that separates calibration-deployment shift (identifiable via randomization) from parametric misspecification (irreducible).
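As a rough illustration of the reachability idea, the sketch below finds which states a deployed policy can ever visit by following positive-probability transitions. The transition matrix, the state labels, and the function name `reachable_states` are hypothetical and are not taken from the paper; the toy layout only loosely mimics a corridor separating two regions.

```python
from collections import deque

def reachable_states(P_deployed, start_states):
    """Breadth-first search over transitions with positive probability
    under the deployed policy. Returns the set of states it can visit."""
    seen = set(start_states)
    queue = deque(start_states)
    while queue:
        s = queue.popleft()
        for s_next, prob in enumerate(P_deployed[s]):
            if prob > 0 and s_next not in seen:
                seen.add(s_next)
                queue.append(s_next)
    return seen

# Hypothetical 4-state layout (an assumption for illustration):
# states 0 and 1 form the surveilled region, 2 is the corridor, 3 is the far region.
# Under the deployed policy, no transition leads from {0, 1} into the corridor.
P_deployed = [
    [0.7, 0.3, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0],
]

local = reachable_states(P_deployed, start_states=[0])
unreached = set(range(len(P_deployed))) - local
print("visited by the deployed policy:", sorted(local))
print("never visited passively:", sorted(unreached))
```

In this toy layout, passive updating would only ever see data from states 0 and 1, so any simulator error about states 2 and 3 would stay uncorrected unless an experiment deliberately crosses the corridor.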

The paper introduces Fisher-SEP, a simulation-aided experimental policy that minimizes posterior predictive variance of a target policy's value, weighting each parameter by its propagation through the Bellman resolvent. The key insight is using the simulator not as an accurate forward model but as a *connectivity prior*—the Bellman resolvent (I−γP^{π_tgt})^{−1} determines which parameters' uncertainties propagate to value, regardless of the simulator's parameter accuracy.
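To make the resolvent weighting concrete, here is a minimal sketch for a reward-only simplification. It assumes a known target-policy transition matrix and independent posterior variances on per-state rewards, so the variance of the start-weighted value is a weighted sum with weights taken from the row vector mu0^T (I − γP)^{−1}. The function names, the three-state chain, and the variance numbers are illustrative assumptions, not the paper's Fisher-SEP implementation.

```python
import numpy as np

def resolvent_weights(P_tgt, mu0, gamma):
    """Return w = mu0^T (I - gamma * P_tgt)^{-1}.

    w[s] is the sensitivity of the start-weighted value to the reward at state s.
    """
    n = P_tgt.shape[0]
    # Solve the transposed linear system rather than forming the inverse explicitly.
    return np.linalg.solve((np.eye(n) - gamma * P_tgt).T, mu0)

def value_variance(P_tgt, mu0, gamma, reward_var):
    """Variance of the start-weighted value under independent reward posteriors
    (an assumption of this sketch): sum_s w[s]^2 * reward_var[s]."""
    w = resolvent_weights(P_tgt, mu0, gamma)
    return float(np.sum(w ** 2 * reward_var))

# Hypothetical chain: state 0 moves to state 1, state 1 is absorbing,
# and state 2 is never entered from the start state.
P_tgt = np.array([
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
mu0 = np.array([1.0, 0.0, 0.0])          # start in state 0
gamma = 0.9
reward_var = np.array([1.0, 1.0, 100.0])  # state 2 is the most uncertain

w = resolvent_weights(P_tgt, mu0, gamma)
print("resolvent weights:", w)
print("value variance:", value_variance(P_tgt, mu0, gamma, reward_var))
```

Here the third state carries by far the largest reward uncertainty but receives zero weight, because the target policy never reaches it from the start state. Sampling it would not reduce the variance of this particular value estimate, which is the sense in which connectivity rather than parameter accuracy drives the design.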

Methodological Rigor

The theoretical framework is carefully constructed. The four-object chain M★ → M★_obs → M̄_calib → M̂_sim cleanly separates three error sources (history dependence, calibration-deployment shift, misspecification). The proofs are thorough—the appendices span ~70 pages with detailed derivations.

However, several caveats limit the rigor claims:

The framework is strictly tabular with finite latent state (Assumption 1), and the function-approximation extension is acknowledged as open.

Conjecture 2 (geometric reachability gap for all passive policies on the stochastic fork) remains unproven for the full class; it is established only for a "regular" subclass.

The navigation-restricted Fisher-SEP-T bound (Conjecture 1) is only proved in the "strong-bottleneck regime."

The case studies are synthetic mechanism illustrations, not calibrated deployments. The vending DGP draws loosely from real data but is otherwise constructed; the HIV grid is a stylized 5×8 abstraction.

With only 30 common-seed trials, some headline comparisons have overlapping confidence intervals (e.g., Fisher-SEP vs. KG-SEP on vending at T=1600).
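Overlapping per-method intervals do not by themselves rule out a detectable difference when trials share seeds, because a paired comparison cancels the seed-level noise. The sketch below uses purely synthetic numbers (not results from the paper) in which seed-level noise dominates; all names and magnitudes are assumptions chosen for illustration.

```python
import numpy as np

T_CRIT_29 = 2.045  # two-sided 95% t critical value for 29 degrees of freedom

def half_width(x, t_crit=T_CRIT_29):
    """Half-width of a t-based confidence interval for the mean of x."""
    return t_crit * np.std(x, ddof=1) / np.sqrt(len(x))

rng = np.random.default_rng(0)
n_trials = 30

# Synthetic data: a large noise term shared by both methods on each seed,
# plus small independent noise per method.
seed_effect = rng.normal(0.0, 10.0, n_trials)
method_a = 100.0 + seed_effect + rng.normal(0.0, 1.0, n_trials)
method_b = 101.0 + seed_effect + rng.normal(0.0, 1.0, n_trials)

hw_a = half_width(method_a)
hw_b = half_width(method_b)

# Paired analysis: difference within each seed, so the shared term cancels.
diff = method_b - method_a
hw_diff = half_width(diff)

print(f"method A mean: {method_a.mean():.2f} +/- {hw_a:.2f}")
print(f"method B mean: {method_b.mean():.2f} +/- {hw_b:.2f}")
print(f"paired difference: {diff.mean():.2f} +/- {hw_diff:.2f}")
```

Whether the paper's headline comparisons separate under a paired analysis depends on its actual trial data; the sketch only shows why the two kinds of interval can disagree.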

Potential Impact

The paper's impact operates at multiple levels:

1. Conceptual: The local/reachability decomposition provides a clean mental model for practitioners deciding whether to experiment. The analogy to positivity violations in causal inference is illuminating and could bridge communities.

2. Methodological: Fisher-SEP's use of the Bellman resolvent as a design criterion is novel and could influence how simulators are used in operations research, clinical trial design, and policy evaluation. The insight that the simulator's *connectivity pattern* matters more than its *parameter accuracy* is practically important.

3. Applied: The Exploration Priority Index (EPI) is a useful diagnostic. The vending and HIV case studies, while synthetic, illustrate two qualitatively different regimes (local vs. reachability) that practitioners encounter.

4. Limitations on impact: The tabular assumption severely limits direct applicability. The paper acknowledges this but doesn't sketch how to extend to function approximation beyond hand-waving about GP/BNN posteriors. The two-phase explore-then-commit structure yields T^{1/3} regret (slower than √T for PSRL/UCRL2), limiting competitiveness.

Timeliness & Relevance

The paper addresses a genuinely important gap. As simulators become ubiquitous (digital twins, foundation world models), the question of when to trust them versus collecting real data becomes increasingly pressing. The sim-to-real literature typically assumes experimentation will occur; this paper asks the logically prior question of *whether* it should. The causal inference framing (confounding, positivity violations) brings a perspective largely missing from the RL literature.

Strengths

1. Clean conceptual framework: The three uses of a simulator (policy source, Bayesian prior, design tool) and the corresponding policy hierarchy provide an organizing structure that clarifies relationships between existing methods.

2. Structural decomposition: The ε^h/ε^m ratio as the decision criterion for experimentation is simple, interpretable, and actionable.

3. Thoroughness: The 90-page appendix covers proofs, pseudocode, ablations, paired statistical tests, hierarchical-prior sensitivity, and ethical considerations with unusual care.

4. Bellman resolvent as design criterion: Using the resolvent to weight parameter sensitivities is elegant and practically motivated—it explains why Fisher-SEP works even when the simulator is 15× off on Region-B prevalence.

Limitations

1. Tabular restriction: The entire framework requires finite S, A, H. This is acknowledged but substantially limits applicability to real-world problems.

2. Synthetic experiments only: Neither case study involves real data or a real deployment. The HIV example uses a stylized grid, not real geographic/epidemiological data.

3. Open conjectures: Two of the paper's key claims (Conjectures 1 and 2) are only partially proved. The stochastic-fork reachability result is established for a regular subclass but not universally.

4. Computational scalability: The coordinate-descent PVV minimizer and per-pair Fisher computations scale as O(S³K), which is prohibitive beyond small tabular problems.

5. Prior sensitivity: The framework depends heavily on Assumption 4 (conjugate independent priors). The hierarchical-prior extension (Proposition 9) shows the reachability gap can shrink or vanish under correlated priors, undermining the clean separation.

6. Comparison fairness: UCRL2/UCBVI baselines use a coarser state representation than simulator-informed policies. The paper acknowledges this but the comparison remains somewhat misleading.

Overall Assessment

This is a theoretically ambitious paper that provides a useful conceptual framework for an important practical problem. The local/reachability decomposition and the "simulator as connectivity prior" insight are genuine contributions. However, the restriction to tabular settings, reliance on synthetic experiments, and multiple open conjectures limit the paper's current impact. The work reads more as a theoretical foundation paper than a practical methodology paper—which is valuable, but the gap to practice is substantial.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 6

Generated May 21, 2026

Comparison History (24)

vs. AMEL: Accumulated Message Effects on LLM Judgments

gpt-5.2 · 5/22/2026

Paper 2 is more novel and broadly impactful: it contributes theoretical decomposition results and a principled algorithm (Fisher-SEP) for combining simulators with real-world experimentation under confounding/drift, a core sim-to-real problem spanning robotics, operations, healthcare, and RL. Its results (identifiability limits, passive-learning reachability gap, and variance-minimizing experimental design) are likely to generalize and influence methodology and practice. Paper 1 is timely and rigorously measured with clear practical implications for LLM evaluation pipelines, but its scope is narrower and mainly diagnostic/mitigative within LLM-based judging.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a concrete architectural improvement (Gated DeltaNet-2) to linear attention mechanisms with strong empirical results across multiple benchmarks, including language modeling, reasoning, and retrieval at 1.3B scale. The efficient attention/linear RNN space is extremely active and high-impact, and improvements here have immediate broad applicability across NLP and beyond. Paper 2 addresses an important but more niche theoretical problem at the intersection of simulation and experimentation in sequential decision-making. While rigorous, its narrower scope and more specialized audience limit its breadth of impact compared to advances in foundational transformer-alternative architectures.

vs. Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical bottleneck in the booming field of LLM agents: the high cost and latency of orchestration frameworks. By proving workflows can be compiled into smaller model weights at a 100x cost reduction, it offers immense real-world applicability, timeliness, and potential to reshape how industry builds AI agents. While Paper 2 presents rigorous theoretical contributions to the sim-to-real gap, Paper 1's immediate economic and practical implications give it a higher potential for widespread, disruptive impact across both academia and industry.

vs. AMEL: Accumulated Message Effects on LLM Judgments

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental problem in sequential decision-making—bridging sim-to-real gaps—with novel theoretical contributions (extended simulation lemma, value gap decomposition, reachability bounds) and a principled algorithm (Fisher-SEP). It has broad applicability across operations research, reinforcement learning, healthcare, and supply chains. Paper 2 identifies an important but relatively narrow bias (AMEL) in LLM evaluators, offering useful empirical findings and practical mitigations (fresh context per item). While timely, its contributions are primarily observational and specific to LLM evaluation pipelines. Paper 1's theoretical depth, methodological rigor, and cross-domain applicability give it greater long-term scientific impact.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gemini-3.1 · 5/22/2026

Paper 2 demonstrates a breakthrough in automated theorem proving by autonomously solving previously open Erdős problems and OEIS conjectures using AI. This represents a major milestone in AI for mathematics, offering immediate, broad applicability across multiple rigorous scientific disciplines like algebraic geometry and quantum optics. While Paper 1 provides valuable theoretical insights into the sim-to-real gap in reinforcement learning, Paper 2's concrete resolution of open mathematical problems gives it a significantly higher potential for immediate and transformative scientific impact.

vs. Towards a General Intelligence and Interface for Wearable Health Data

gemini-3.1 · 5/22/2026

Paper 1 presents a massive-scale foundation model for wearable health data, utilizing an unprecedented dataset of 5 million participants. Its integration of LLM agents and validation by clinicians demonstrate immense potential for real-world healthcare applications, promising a broader and more immediate scientific and societal impact than the theoretical sim-to-real gap framework presented in Paper 2.

vs. AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a critical bottleneck in LLM alignment—token-level credit assignment in RLVR—which is a highly active and impactful research area. Its novel combination of reflection bottlenecks, Causal Information Gain, and asymmetric gating offers a practical, broadly applicable framework for improving reasoning in LLMs. Paper 2 presents theoretically interesting results on sim-to-real gaps in sequential decision-making, but its scope is narrower and its case studies are more domain-specific. Given the enormous current momentum and breadth of LLM research, Paper 1's contributions are likely to see wider adoption and citation.

vs. VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

gpt-5.2 · 5/21/2026

Paper 2 offers broadly applicable theoretical contributions to sim-to-real decision-making, including new decomposition results, limits of passive learning via reachability, and a principled experiment-design algorithm (Fisher-SEP) with clear implications across RL, causal inference, operations research, and scientific experimentation. Its methodological rigor and generality suggest wider cross-domain impact and timeliness given current reliance on simulators. Paper 1 is practically relevant for EV battery maintenance and leverages LLM-based text modeling, but appears more application-specific and potentially less foundational.

vs. Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

gemini-3.1 · 5/21/2026

While Paper 1 offers strong theoretical contributions to the classic sim-to-real reinforcement learning problem, Paper 2 addresses a critical and highly timely bottleneck in modern AI: the robust evaluation of LLM agents. By solving the 'cooperative simulator' flaw through evolutionary program search, Paper 2 provides an immediately applicable framework for the rapidly expanding field of agentic AI. Its methodology significantly improves human-likeness and agent robustness, suggesting higher immediate citation potential, broader adoption in the tech industry, and significant relevance to current AI safety and deployment challenges.

vs. PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental problem in sequential decision-making—bridging the sim-to-real gap—with rigorous theoretical contributions (extended simulation lemma, value gap decomposition, reachability bounds) and a principled algorithm (Fisher-SEP). Its results apply broadly across reinforcement learning, causal inference, and experimental design. Paper 2, while useful as a benchmark framework for LLM planning evaluation, is more incremental—contributing engineering infrastructure rather than deep theoretical insights. Paper 1's formal results on when experimentation is necessary versus when simulation suffices have lasting methodological impact across multiple scientific domains.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental and broadly applicable problem—when and how to supplement simulators with real experiments—providing theoretical results (extended simulation lemma, value gap decomposition) and a principled algorithm (Fisher-SEP) with wide applicability across sequential decision-making domains. Its contributions span reinforcement learning theory, experimental design, and causal inference, giving it broad cross-field impact. Paper 2, while rigorous and practically valuable for wastewater treatment, is more domain-specific. Paper 1's theoretical framework is likely to influence a larger research community and inspire follow-up work across multiple application areas.

vs. High Quality Embeddings for Horn Logic Reasoning

gpt-5.2 · 5/21/2026

Paper 1 has higher impact potential due to clearer novelty (new decompositions of sim-to-real value error and policy gap, plus a principled experimental design algorithm), strong timeliness for RL/simulation-based decision making, and broad applicability across domains where simulators are biased (robotics, operations, healthcare, economics). It also offers actionable guidance (when to experiment vs. trust simulation) with illustrative case studies. Paper 2 is useful for neural-guided theorem proving, but its contributions are more incremental (training/negative sampling heuristics) and likely narrower in cross-field reach.

vs. Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

gemini-3.1 · 5/21/2026

Paper 2 offers concrete theoretical contributions, including an extended simulation lemma and a novel algorithm (Fisher-SEP), supported by rigorous methodology and diverse case studies. Its approach to the sim-to-real gap has broad applicability across reinforcement learning, operations research, and applied sciences. In contrast, Paper 1 is a high-level visionary roadmap for 6G networks lacking specific methodological breakthroughs, limiting its immediate scientific impact compared to the rigorous, multi-domain solutions provided in Paper 2.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

gemini-3.1 · 5/21/2026

Paper 1 addresses a highly timely and critical bottleneck in modern AI: optimizing reinforcement learning for LLMs using rubric-based rewards. By improving the efficiency and effectiveness of GRPO—a trending algorithm in LLM post-training—its proposed POW3R framework has immediate, widespread applicability in AI alignment and reasoning tasks. While Paper 2 offers rigorous theoretical insights into the classic sim-to-real gap, Paper 1's direct relevance to the rapidly evolving field of LLM training gives it a higher potential for rapid, broad scientific and practical impact.

vs. From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a fundamental and broadly applicable problem—bridging sim-to-real gaps in sequential decision-making—with rigorous theoretical contributions (extended simulation lemma, value gap decomposition, impossibility result for passive learning) and a principled algorithm (Fisher-SEP). Its results generalize across many fields (operations research, public health, robotics, RL). Paper 1 proposes an engineering architecture for autonomous networks with promising empirical results but is more domain-specific (5G/telecom), less theoretically novel, and builds incrementally on existing multi-agent and LLM-based orchestration ideas.

vs. From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)

claude-opus-4.6 · 5/21/2026

Paper 2 presents a more theoretically rigorous contribution with novel formal results (extended simulation lemma, value gap decomposition, Fisher-SEP algorithm) addressing the fundamental sim-to-real gap problem relevant across reinforcement learning, operations research, and healthcare. Its mathematical framework provides generalizable insights for when and how to supplement simulators with real experiments—a broadly applicable problem. Paper 1, while practically useful for 5G network management, is more incremental in nature, combining existing concepts (multi-agent systems, LLM orchestration) in a domain-specific architecture with limited theoretical novelty beyond the specific application.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/21/2026

Paper 2 offers a fundamental theoretical breakthrough by distinguishing the opposing effects of volatility and stochasticity on exploration. This insight bridges artificial intelligence, cognitive science, and computational psychiatry, providing a broad multidisciplinary impact. While Paper 1 provides a rigorous and practical solution to the sim-to-real gap in reinforcement learning with strong real-world applications, Paper 2's foundational contribution to decision-making theory and its implications for understanding both algorithmic exploration and human psychiatric conditions give it a higher potential for widespread scientific influence across diverse fields.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/21/2026

Paper 2 offers profound interdisciplinary impact by distinguishing between volatility and stochasticity in exploration. By bridging artificial and biological intelligence, it not only advances theoretical reinforcement learning but also provides novel insights into cognitive science and computational psychiatry. While Paper 1 addresses a highly practical AI challenge (the sim-to-real gap), Paper 2 challenges foundational assumptions about uncertainty, promising broader theoretical and cross-disciplinary scientific influence.

vs. Governance by Construction for Generalist Agents

gemini-3.1 · 5/21/2026

Paper 2 provides fundamental theoretical contributions to the sim-to-real gap in sequential decision-making. By offering rigorous mathematical decompositions of value error and proposing a novel experimental policy algorithm (Fisher-SEP), it addresses a core challenge in RL and operations research. Its rigorous methodology and broad applicability across fields give it higher potential scientific impact than Paper 1, which primarily presents a practical architectural demo for enterprise LLM agent governance.

vs. Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

gpt-5.2 · 5/21/2026

Paper 2 offers broader, more foundational contributions: formal decompositions of sim-to-real value error, a principled characterization of when passive learning fails (reachability gap), and a concrete algorithm (Fisher-SEP) with domain case studies. These results generalize across RL, causal inference, operations research, and experimental design, with clear real-world relevance wherever simulators are used (robotics, healthcare, supply chains). Paper 1 is timely and useful for LLM alignment, but is narrower in scope and largely empirical/diagnostic, with impact likely confined to interpretability/steering methods.