Xiaoou Liu, Tiejin Chen, Weibo Li, Xiyang Hu, Hua Wei
Foundation model agents are increasingly deployed for real-world decision-making, but suffer from the sim-to-real gap. While robotics and classical control have mature frameworks to address this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the foundation model agent evaluation and training gap as a classical sim-to-real problem structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. In this paper, we set a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates for adopting established solutions like domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress-test benchmarks to foster a new generation of highly trustworthy agents for reliable real-world applications.
This paper proposes reframing the gap between foundation model (FM) agent performance on benchmarks and in real-world deployment as a classical sim-to-real transfer problem, organized around the four components of a Markov Decision Process (Observation, Action, Transition, Reward). The central thesis is that the FM agent community is "reinventing the wheel" by treating robustness failures as novel phenomena, when in fact they mirror well-studied sim-to-real gaps in robotics and classical RL. The paper maps each MDP component to its FM-agent analogue (e.g., multilingual input noise as an observation gap, API failures as a transition gap) and advocates for adopting established solutions like domain randomization and grounded action transformation.
The contribution is primarily conceptual—a unifying vocabulary and a research agenda—rather than a technical or empirical one. The paper does not introduce new algorithms, benchmarks, or extensive experiments. It serves as a position/vision paper ("Blue Sky" at KDD).
The paper's methodology is largely taxonomic and analogical. It draws on a prior survey of sim-to-real methods in RL [15] and systematically maps each classical gap category to the FM agent domain. Table 1 provides a structured comparison, and Section 3 elaborates on each gap with concrete examples.
The empirical evidence is limited. The only quantitative data presented (Table 2) comes from a separate multilingual tool-calling study [37], showing error rate increases when switching from English to other languages. This is used to illustrate the observation gap but does not constitute new experimental validation of the proposed framework. The paper does not demonstrate that adopting any classical sim-to-real technique (e.g., domain randomization) actually closes an FM agent gap—this is entirely aspirational.
The MDP formalization itself is fairly standard and lightweight. The gap definition G(π) := ψ_s(π) − ψ_r(π) is straightforward and borrowed from prior work. The mapping between classical RL and FM agents, while useful, sometimes stretches analogies thin. For instance, "action shielding" in robotics involves continuous control safety constraints, while in FM agents it involves filtering invalid API calls—the underlying technical challenges are quite different despite the shared label.
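To make the gap definition concrete, here is a minimal sketch of how G(π) could be estimated empirically. It assumes that ψ_s and ψ_r are success rates over a benchmark episode set and a deployment-like episode set respectively; the review does not spell out the paper's exact performance measure, so this reading is an assumption. The names `toy_policy`, `success_rate`, and `sim_to_real_gap` are hypothetical and chosen only for illustration.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

# Hypothetical types: a policy maps an observation (user text) to an action
# (tool name); an episode pairs an observation with the expected action.
Policy = Callable[[str], str]
Episode = Tuple[str, str]


def success_rate(policy: Policy, episodes: Sequence[Episode]) -> float:
    """Empirical performance psi: fraction of episodes with the expected action."""
    return mean(1.0 if policy(obs) == expected else 0.0 for obs, expected in episodes)


def sim_to_real_gap(policy: Policy,
                    sim_episodes: Sequence[Episode],
                    real_episodes: Sequence[Episode]) -> float:
    """Estimate G(pi) = psi_s(pi) - psi_r(pi) from two episode sets."""
    return success_rate(policy, sim_episodes) - success_rate(policy, real_episodes)


def toy_policy(obs: str) -> str:
    # Toy stand-in for an FM agent: it only recognizes an English keyword.
    return "get_weather" if "weather" in obs.lower() else "noop"


# "Simulation": English-only benchmark prompts.
sim_episodes = [
    ("What is the weather in Paris?", "get_weather"),
    ("Weather tomorrow?", "get_weather"),
]
# "Reality": the same intent, partly expressed in another language.
real_episodes = [
    ("What is the weather in Paris?", "get_weather"),
    ("¿Qué tiempo hace en París?", "get_weather"),
]

print(sim_to_real_gap(toy_policy, sim_episodes, real_episodes))  # 0.5
```

In this toy setup the policy is perfect on the benchmark set and fails on the non-English prompt, so the estimated gap is positive. This is the kind of observation-side discrepancy the multilingual tool-calling example is meant to illustrate.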
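The paper's potential impact lies primarily in community organization and vocabulary standardization rather than in technical advancement. If adopted, the MDP-based decomposition could give the community a unified vocabulary and standardized stress-test benchmarks, the outcomes the paper itself names as its goals.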
However, the practical impact is uncertain. The analogies, while intellectually appealing, may not translate into actionable technical solutions. Domain randomization over visual parameters in robotics is well-defined; "randomizing" multilingual inputs or API failure patterns for FM agents involves fundamentally different challenges (e.g., the curse of natural language diversity, the complexity of real-world API ecosystems). The paper acknowledges this implicitly but does not grapple with why classical solutions might fail or need substantial modification in the FM domain.
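For reference, the simplest form that "randomizing API failure patterns" could take is sketched below. It samples a failure rate once per episode, in the way domain randomization samples simulator parameters once per rollout, and injects failures into a tool call. This is an illustrative assumption, not the paper's method: `ToolTimeout`, `randomized_tool`, `agent_with_retry`, and `echo_tool` are hypothetical names, and real API ecosystems fail in far more structured ways than an independent coin flip per call.

```python
import random
from typing import Callable, Optional, Tuple


class ToolTimeout(Exception):
    """Hypothetical error type standing in for a real API failure."""


def randomized_tool(tool: Callable[[str], str],
                    rng: random.Random,
                    failure_rate_range: Tuple[float, float] = (0.0, 0.5),
                    ) -> Callable[[str], str]:
    """Wrap a tool so each call fails with a rate sampled once per episode."""
    failure_rate = rng.uniform(*failure_rate_range)

    def wrapped(arg: str) -> str:
        if rng.random() < failure_rate:
            raise ToolTimeout(f"injected failure (rate={failure_rate:.2f})")
        return tool(arg)

    return wrapped


def agent_with_retry(tool: Callable[[str], str],
                     query: str,
                     max_attempts: int = 3) -> Optional[str]:
    """Toy agent: retry the tool a few times, give up with None."""
    for _ in range(max_attempts):
        try:
            return tool(query)
        except ToolTimeout:
            continue
    return None


def echo_tool(arg: str) -> str:
    # Stand-in for a real API call.
    return f"ok:{arg}"


# Evaluate the toy agent across many randomized "transition" conditions.
rng = random.Random(0)
outcomes = [agent_with_retry(randomized_tool(echo_tool, rng), "q") for _ in range(100)]
print(sum(o is not None for o in outcomes) / len(outcomes))
```

The sketch also shows where the analogy strains: a uniform range over one scalar is easy to sample, whereas the space of realistic API behaviors (rate limits, schema drift, partial responses) has no comparably obvious parameterization.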
The paper is timely. FM agents (tool-using LLMs, autonomous coding agents, etc.) are indeed being deployed at scale, and the gap between benchmark performance and real-world reliability is a pressing concern. Multiple concurrent works ([44, 52, 58]) are addressing related issues, which validates the problem's relevance. The KDD venue is appropriate given the emphasis on trustworthy and responsible AI.
The observation that the FM community is "reinventing the wheel" resonates with recent trends—papers like AgentNoiseBench [52] and robustness evaluations of agentic function calling [44] are indeed rediscovering perturbation-based testing without connecting to classical sim-to-real literature. This paper fills a genuine gap in the discourse.
The paper reads as a well-organized research agenda or position paper rather than a technical contribution. Its value will depend largely on whether the community adopts its vocabulary and whether follow-up work demonstrates that classical sim-to-real techniques genuinely transfer to the FM domain. The concurrent work [58] by overlapping authors (which appears to implement some of these ideas) may provide the necessary empirical grounding.
The writing is clear and well-structured, though somewhat repetitive across sections. The paper would benefit from a more critical analysis of where the analogy breaks down and what genuinely new techniques the FM domain requires beyond classical solutions.
Generated Jun 8, 2026
TouchThinker presents a substantial concrete contribution—a million-scale dataset, a new benchmark, and a novel action-aware representation mechanism for tactile reasoning—with experimental validation. It addresses a specific, growing need in embodied AI with tangible artifacts (dataset, code, benchmark). Paper 2 is a position/perspective paper proposing a conceptual framework (MDP-based sim-to-real formalization for foundation model agents) without significant empirical contributions. While Paper 2 offers useful framing, position papers typically have lower citation impact than papers introducing large-scale resources and validated methods that others can build upon.
Paper 1 proposes a broad, unifying framework that bridges classical sim-to-real transfer with foundation model agent deployment, addressing a widely recognized and growing problem across the entire AI agent community. Its MDP-based formalization offers a standardized vocabulary and benchmark agenda with potential to influence how the community evaluates and trains agents at scale. Paper 2 identifies a specific but important failure mode (aggregate score inversion) in autoresearch agents, but its scope is narrower and its immediate audience more limited. Paper 1's breadth of impact across robotics, NLP, and agent research gives it higher potential.
Paper 1 introduces a concrete, empirical benchmark (FALSIFYBENCH) targeting a critical and highly relevant capability: scientific inductive reasoning and falsification in LLMs. It provides actionable insights and immediate utility for evaluating new models. In contrast, Paper 2 is primarily a position or agenda paper; while it offers a valuable conceptual bridge between classical control and foundation models, agenda papers typically have less immediate measurable impact than widely adopted empirical benchmarks.
Paper 2 proposes a foundational paradigm shift by formally bridging the gap between classical control theory (MDPs) and foundation model agents. By framing agent robustness as a sim-to-real problem, it sets a broad, unifying research agenda that can impact multiple fields (robotics, RL, and LLMs) and establish new standardized evaluation benchmarks. In contrast, Paper 1 offers a valuable but more narrowly focused architectural improvement for agent memory systems.
Paper 2 introduces a novel, quantitative metric (no-CoT task-completion time horizon) for measuring a concrete AI safety concern—models reasoning internally without observable chain-of-thought. It provides empirical measurements across 43 benchmarks with 30,000+ questions, offers actionable projections, and directly addresses a critical gap in AI safety monitoring. Paper 1 proposes a conceptual framework mapping sim-to-real gaps to foundation model agents using existing MDP formalism, but is more of a position/agenda paper without substantial new empirical contributions. Paper 2's timeliness, methodological rigor, and direct policy relevance give it higher impact potential.
Paper 2 is likely to have higher scientific impact due to a concrete, technically novel framework (reinforced heterogeneous distillation) with demonstrated real-time performance and safety-aligned refinement, evaluated across multiple major autonomous-driving datasets and deployment hardware—supporting methodological rigor, timeliness, and clear real-world applicability. Paper 1 offers a valuable unifying perspective and agenda for sim-to-real issues in foundation model agents, but appears more conceptual with fewer definitive methodological contributions or empirical results, which may limit near-term measurable impact compared to Paper 2’s deployable system advances.
Paper 1 addresses the critical sim-to-real gap in foundation model agents, a rapidly growing and highly relevant field with broad real-world applications. By unifying this problem under classical MDP frameworks, it proposes a foundational research agenda that can impact multiple domains deploying AI agents. Paper 2 offers valuable empirical insights for SAT optimization on accelerator hardware, but its scope and potential audience are significantly narrower, making Paper 1's overarching paradigm shift more likely to achieve widespread scientific and practical impact.
Paper 1 addresses the highly timely and relevant topic of foundation model agents, proposing a unified framework that bridges classical robotics (MDPs) with modern LLM agents. This paradigm shift has massive potential for broad impact and real-world applications across various domains. Paper 2, while methodologically rigorous, addresses a more niche theoretical problem in off-policy evaluation, likely resulting in a narrower scope of scientific impact.
Paper 2 has higher potential impact due to its broad, unifying conceptual framework: casting foundation-model agent sim-to-real issues into an MDP decomposition (O/A/T/R) creates a shared vocabulary, evaluation lens, and research agenda across domains (LLM agents, robotics, tool use). This is timely and widely applicable, likely to influence benchmarks and robustness practices. Paper 1 is a concrete, method-specific contribution with modest reported gains and narrower scope (GUI RL reward modeling), suggesting more limited cross-field impact despite solid novelty.
Paper 1 presents novel empirical findings with large-scale production data quantifying how AI agents reshape knowledge work, offering concrete metrics on autonomy, efficiency gains (87% time reduction, 94% cost reduction), and scope expansion. Its real-world evidence on a timely topic (autonomous AI agents vs. assistants) has broad implications for economics, labor, and AI deployment. Paper 2 proposes a useful conceptual framework mapping the sim-to-real gap to foundation model agents via MDP formalization, but is primarily a position/agenda paper without substantial empirical validation, limiting its immediate impact.