The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei

Jun 5, 2026 · arXiv:2606.07017v1

cs.AI · cs.CL · cs.ET

#1331 of 3539 · Artificial Intelligence

Tournament Score: 1428±43 (axis range 1050 to 1800)

Win Rate: 60%

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 4 · Clarity 6.5

Abstract

Foundation model agents are increasingly deployed for real-world decision-making, but they suffer from a sim-to-real gap. While robotics and classical control have mature frameworks for addressing this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper proposes reframing the gap between foundation model (FM) agent performance on benchmarks versus real-world deployment as a classical sim-to-real transfer problem, organized around the four components of a Markov Decision Process (Observation, Action, Transition, Reward). The central thesis is that the FM agent community is "reinventing the wheel" by treating robustness failures as novel phenomena, when in fact they mirror well-studied sim-to-real gaps in robotics and classical RL. The paper maps each MDP component to its FM-agent analogue (e.g., multilingual input noise as observation gap, API failures as transition gap) and advocates for adopting established solutions like domain randomization and grounded action transformation.

The contribution is primarily conceptual—a unifying vocabulary and a research agenda—rather than a technical or empirical one. The paper does not introduce new algorithms, benchmarks, or extensive experiments. It serves as a position/vision paper ("Blue Sky" at KDD).

2. Methodological Rigor

The paper's methodology is largely taxonomic and analogical. It draws on a prior survey of sim-to-real methods in RL [15] and systematically maps each classical gap category to the FM agent domain. Table 1 provides a structured comparison, and Section 3 elaborates on each gap with concrete examples.

The empirical evidence is limited. The only quantitative data presented (Table 2) comes from a separate multilingual tool-calling study [37], showing error rate increases when switching from English to other languages. This is used to illustrate the observation gap but does not constitute new experimental validation of the proposed framework. The paper does not demonstrate that adopting any classical sim-to-real technique (e.g., domain randomization) actually closes an FM agent gap—this is entirely aspirational.

The MDP formalization itself is fairly standard and lightweight. The gap definition G(π) := ψ_s(π) − ψ_r(π) is straightforward and borrowed from prior work. The mapping between classical RL and FM agents, while useful, sometimes stretches analogies thin. For instance, "action shielding" in robotics involves continuous control safety constraints, while in FM agents it involves filtering invalid API calls—the underlying technical challenges are quite different despite the shared label.
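To make the gap definition concrete, here is a minimal sketch of how G(π) could be estimated from evaluation episodes. It assumes that ψ is approximated by the mean episode outcome (for example a success rate); the function names and the outcome lists are hypothetical and are not taken from the paper.

```python
from statistics import mean
from typing import Sequence


def performance(outcomes: Sequence[float]) -> float:
    """Estimate psi(pi) as the mean outcome over evaluation episodes."""
    if not outcomes:
        raise ValueError("need at least one episode outcome")
    return mean(outcomes)


def sim_to_real_gap(sim_outcomes: Sequence[float],
                    real_outcomes: Sequence[float]) -> float:
    """G(pi) := psi_s(pi) - psi_r(pi), estimated from sampled episodes."""
    return performance(sim_outcomes) - performance(real_outcomes)


# Hypothetical outcomes (1.0 = task solved, 0.0 = failed); not data from the paper.
sim_outcomes = [1.0, 1.0, 1.0, 0.0]    # benchmark ("sim") episodes
real_outcomes = [1.0, 0.0, 0.0, 0.0]   # deployment ("real") episodes
gap = sim_to_real_gap(sim_outcomes, real_outcomes)
print(f"estimated gap: {gap:.2f}")
```

A positive value means the policy looks better on the benchmark than in deployment. With few episodes the estimate is noisy, so a real evaluation would also report an uncertainty interval.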

3. Potential Impact

The paper's potential impact lies primarily in community organization and vocabulary standardization rather than in technical advancement. If adopted, the MDP-based decomposition could:

Provide a common language for researchers working on different aspects of FM agent robustness (multilingual robustness, tool-use reliability, reward misspecification) to recognize their work as facets of the same problem.

Motivate systematic benchmarking that perturbs each MDP component independently and in combination, rather than ad-hoc stress testing.

Lower barriers for robotics/RL researchers to contribute solutions to FM agent robustness by making the connection explicit.
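The benchmarking idea in the list above, perturbing each MDP component independently and in combination, can be sketched as an enumeration of test conditions. This is only an illustration of the experimental design under the four-component decomposition; the function name and the representation of a condition as a tuple of component names are assumptions, not something the paper specifies.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

# The four MDP components named in the paper.
COMPONENTS = ("observation", "action", "transition", "reward")


def perturbation_conditions(
    components: Sequence[str] = COMPONENTS,
) -> List[Tuple[str, ...]]:
    """List every subset of components to perturb, from none to all."""
    conditions: List[Tuple[str, ...]] = []
    for k in range(len(components) + 1):
        conditions.extend(combinations(components, k))
    return conditions


conditions = perturbation_conditions()
for condition in conditions:
    label = " + ".join(condition) if condition else "baseline (no perturbation)"
    print(label)
```

The empty tuple is the unperturbed baseline, the single-element tuples isolate one gap at a time, and the larger tuples probe interactions between gaps.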

However, the practical impact is uncertain. The analogies, while intellectually appealing, may not translate into actionable technical solutions. Domain randomization over visual parameters in robotics is well-defined; "randomizing" multilingual inputs or API failure patterns for FM agents involves fundamentally different challenges (e.g., the curse of natural language diversity, the complexity of real-world API ecosystems). The paper acknowledges this implicitly but does not grapple with why classical solutions might fail or need substantial modification in the FM domain.
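As a toy illustration of what "randomizing" API failure patterns might look like, the sketch below wraps a tool so that calls fail with a sampled probability, drawing a fresh failure rate for each episode. Everything here is hypothetical: the `get_weather` tool, the `ToolCallFailed` exception, and the 0 to 0.5 failure-rate range are assumptions for illustration, and the paper does not implement or evaluate this.

```python
import random
from typing import Callable, Dict, Tuple


class ToolCallFailed(Exception):
    """Raised when a simulated API call is dropped."""


def randomize_failures(tool: Callable[..., str], failure_rate: float,
                       rng: random.Random) -> Callable[..., str]:
    """Wrap a tool so each call fails with probability `failure_rate`."""
    def wrapped(*args, **kwargs) -> str:
        if rng.random() < failure_rate:
            raise ToolCallFailed("injected failure")
        return tool(*args, **kwargs)
    return wrapped


def get_weather(city: str) -> str:
    """Hypothetical tool standing in for a real API."""
    return f"sunny in {city}"


def sample_environment(
    rng: random.Random,
) -> Tuple[Dict[str, Callable[..., str]], float]:
    """Domain randomization step: draw a new failure rate per episode."""
    failure_rate = rng.uniform(0.0, 0.5)  # assumed range, not from the paper
    tools = {"get_weather": randomize_failures(get_weather, failure_rate, rng)}
    return tools, failure_rate


rng = random.Random(0)  # seeded so runs are reproducible
tools, rate = sample_environment(rng)
```

This covers only a single, independent failure mode. Real API ecosystems also show correlated outages, rate limits, and schema drift, which is part of why the review argues the classical recipe may need substantial modification.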

4. Timeliness & Relevance

The paper is timely. FM agents (tool-using LLMs, autonomous coding agents, etc.) are indeed being deployed at scale, and the gap between benchmark performance and real-world reliability is a pressing concern. Multiple concurrent works ([44, 52, 58]) are addressing related issues, which validates the problem's relevance. The KDD venue is appropriate given the emphasis on trustworthy and responsible AI.

The observation that the FM community is "reinventing the wheel" resonates with recent trends—papers like AgentNoiseBench [52] and robustness evaluations of agentic function calling [44] are indeed rediscovering perturbation-based testing without connecting to classical sim-to-real literature. This paper fills a genuine gap in the discourse.

5. Strengths & Limitations

Strengths:

Conceptual clarity: The four-component MDP decomposition provides an intuitive and comprehensive organizing principle for a fragmented research area.

Breadth of coverage: The paper touches on all major gap types with concrete FM-agent examples (multilingual inputs, distractor tools, API failures, cost-aware rewards).

Research agenda value: The structured research directions in Section 4 (with Figure 2) offer a useful roadmap that could guide future work.

Timeliness: Addresses a genuine and growing need as FM agents move to production.

Limitations:

Lack of empirical validation: The paper does not implement or test any of the proposed solutions. There is no evidence that classical techniques actually transfer effectively to FM agents.

Shallow analogies: Some mappings between classical RL and FM agents are superficial. The technical challenges of, say, domain randomization over natural language inputs are fundamentally different from randomizing physics parameters or visual textures.

Missing discussion of fundamental differences: FM agents operate in discrete, combinatorial, and semantically rich spaces—the paper does not adequately address why these differences might limit the applicability of classical solutions designed for continuous control.

Limited novelty in formalization: The MDP gap definition is borrowed from prior work, and the mapping exercise, while useful, is primarily organizational.

No benchmark contribution: Despite advocating for standardized benchmarks, the paper provides none.

Table 2 data is borrowed: The only quantitative evidence comes from another paper by overlapping authors, limiting the empirical contribution.

Additional Observations

The paper reads as a well-organized research agenda or position paper rather than a technical contribution. Its value will depend largely on whether the community adopts its vocabulary and whether follow-up work demonstrates that classical sim-to-real techniques genuinely transfer to the FM domain. The concurrent work [58] by overlapping authors (which appears to implement some of these ideas) may provide the necessary empirical grounding.

The writing is clear and well-structured, though somewhat repetitive across sections. The paper would benefit from a more critical analysis of where the analogy breaks down and what genuinely new techniques the FM domain requires beyond classical solutions.

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 4 · Clarity 6.5

Generated Jun 8, 2026

Comparison History (20)

Lost vs. TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

TouchThinker presents a substantial concrete contribution—a million-scale dataset, a new benchmark, and a novel action-aware representation mechanism for tactile reasoning—with experimental validation. It addresses a specific, growing need in embodied AI with tangible artifacts (dataset, code, benchmark). Paper 2 is a position/perspective paper proposing a conceptual framework (MDP-based sim-to-real formalization for foundation model agents) without significant empirical contributions. While Paper 2 offers useful framing, position papers typically have lower citation impact than papers introducing large-scale resources and validated methods that others can build upon.

claude-opus-4-6 · Jun 11, 2026

Won vs. Search Discipline for Long-Horizon Research Agents

Paper 1 proposes a broad, unifying framework that bridges classical sim-to-real transfer with foundation model agent deployment, addressing a widely recognized and growing problem across the entire AI agent community. Its MDP-based formalization offers a standardized vocabulary and benchmark agenda with potential to influence how the community evaluates and trains agents at scale. Paper 2 identifies a specific but important failure mode (aggregate score inversion) in autoresearch agents, but its scope is narrower and its immediate audience more limited. Paper 1's breadth of impact across robotics, NLP, and agent research gives it higher potential.

claude-opus-4-6 · Jun 11, 2026

Lost vs. FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

Paper 1 introduces a concrete, empirical benchmark (FALSIFYBENCH) targeting a critical and highly relevant capability: scientific inductive reasoning and falsification in LLMs. It provides actionable insights and immediate utility for evaluating new models. In contrast, Paper 2 is primarily a position or agenda paper; while it offers a valuable conceptual bridge between classical control and foundation models, agenda papers typically have less immediate measurable impact than widely adopted empirical benchmarks.

gemini-3.1-pro-preview · Jun 8, 2026

Won vs. AdMem: Advanced Memory for Task-solving Agents

Paper 2 proposes a foundational paradigm shift by formally bridging the gap between classical control theory (MDPs) and foundation model agents. By framing agent robustness as a sim-to-real problem, it sets a broad, unifying research agenda that can impact multiple fields (robotics, RL, and LLMs) and establish new standardized evaluation benchmarks. In contrast, Paper 1 offers a valuable but more narrowly focused architectural improvement for agent memory systems.

gemini-3.1-pro-preview · Jun 8, 2026

Lost vs. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Paper 2 introduces a novel, quantitative metric (no-CoT task-completion time horizon) for measuring a concrete AI safety concern—models reasoning internally without observable chain-of-thought. It provides empirical measurements across 43 benchmarks with 30,000+ questions, offers actionable projections, and directly addresses a critical gap in AI safety monitoring. Paper 1 proposes a conceptual framework mapping sim-to-real gaps to foundation model agents using existing MDP formalism, but is more of a position/agenda paper without substantial new empirical contributions. Paper 2's timeliness, methodological rigor, and direct policy relevance give it higher impact potential.

claude-opus-4-6 · Jun 8, 2026

Lost vs. MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction

Paper 2 is likely to have higher scientific impact due to a concrete, technically novel framework (reinforced heterogeneous distillation) with demonstrated real-time performance and safety-aligned refinement, evaluated across multiple major autonomous-driving datasets and deployment hardware—supporting methodological rigor, timeliness, and clear real-world applicability. Paper 1 offers a valuable unifying perspective and agenda for sim-to-real issues in foundation model agents, but appears more conceptual with fewer definitive methodological contributions or empirical results, which may limit near-term measurable impact compared to Paper 2’s deployable system advances.

gpt-5.2 · Jun 8, 2026

Won vs. A Study of Parallel Continuous Local Search

Paper 1 addresses the critical sim-to-real gap in foundation model agents, a rapidly growing and highly relevant field with broad real-world applications. By unifying this problem under classical MDP frameworks, it proposes a foundational research agenda that can impact multiple domains deploying AI agents. Paper 2 offers valuable empirical insights for SAT optimization on accelerator hardware, but its scope and potential audience are significantly narrower, making Paper 1's overarching paradigm shift more likely to achieve widespread scientific and practical impact.

gemini-3.1-pro-preview · Jun 8, 2026

Won vs. Off-Policy Evaluation with Strategic Agents via Local Disclosure

Paper 1 addresses the highly timely and relevant topic of foundation model agents, proposing a unified framework that bridges classical robotics (MDPs) with modern LLM agents. This paradigm shift has massive potential for broad impact and real-world applications across various domains. Paper 2, while methodologically rigorous, addresses a more niche theoretical problem in off-policy evaluation, likely resulting in a narrower scope of scientific impact.

gemini-3.1-pro-preview · Jun 8, 2026

Won vs. StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

Paper 2 has higher potential impact due to its broad, unifying conceptual framework: casting foundation-model agent sim-to-real issues into an MDP decomposition (O/A/T/R) creates a shared vocabulary, evaluation lens, and research agenda across domains (LLM agents, robotics, tool use). This is timely and widely applicable, likely to influence benchmarks and robustness practices. Paper 1 is a concrete, method-specific contribution with modest reported gains and narrower scope (GUI RL reward modeling), suggesting more limited cross-field impact despite solid novelty.

gpt-5.2 · Jun 8, 2026

Lost vs. How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Paper 1 presents novel empirical findings with large-scale production data quantifying how AI agents reshape knowledge work, offering concrete metrics on autonomy, efficiency gains (87% time reduction, 94% cost reduction), and scope expansion. Its real-world evidence on a timely topic (autonomous AI agents vs. assistants) has broad implications for economics, labor, and AI deployment. Paper 2 proposes a useful conceptual framework mapping the sim-to-real gap to foundation model agents via MDP formalization, but is primarily a position/agenda paper without substantial empirical validation, limiting its immediate impact.

claude-opus-4-6 · Jun 8, 2026
