COMAP: Co-Evolving World Models and Agent Policies for LLM Agents
Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li
Abstract
Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.
AI Impact Assessments
Scientific Impact Assessment: COMAP
1. Core Contribution
COMAP addresses a genuine gap in LLM agent research: the mismatch between static, fixed-after-training textual world models and the non-stationary state-action distributions produced by evolving agent policies. The paper proposes a closed-loop co-evolution framework with two interacting mechanisms: (1) on-policy self-distillation for the world model, where a student world model is trained with real-state supervision plus soft distributional guidance from an EMA teacher; and (2) future-aware reflection for the agent policy, where imagined future states guide action refinement through a gated mechanism.
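The first mechanism, real-state supervision combined with soft guidance from an EMA teacher, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the weighting factor `alpha`, the `decay` value, and the use of a forward KL term are all assumptions made for illustration, and the "models" are reduced to plain lists of logits and parameters.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Hard loss: negative log-probability of the token observed in the real next state."""
    return -math.log(softmax(student_logits)[target_index])


def kl_divergence(teacher_logits, student_logits):
    """Soft loss: KL(teacher || student) over the two predicted distributions."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, target_index, alpha=0.5):
    """Real-state supervision plus teacher guidance; alpha is a hypothetical weight."""
    hard = cross_entropy(student_logits, target_index)
    soft = kl_divergence(teacher_logits, student_logits)
    return hard + alpha * soft


def ema_update(teacher_params, student_params, decay=0.99):
    """Move each teacher parameter a small step toward the student (exponential moving average)."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher_params, student_params)]
```

In a sketch like this, the teacher changes slowly, so its soft targets act as a stabilizer while the student fits transitions from the newest on-policy rollouts.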
The key conceptual insight — that world models and policies should co-evolve rather than be trained independently — is well-motivated. The distribution shift problem (a fixed world model becoming inaccurate as the policy visits novel states) is real and underexplored in the textual world model literature. The framework elegantly couples both learning loops: improved policies generate more diverse transitions for world-model training, while improved world models provide better lookahead signals for policy refinement.
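The coupling of the two loops can be sketched with a toy example. Everything here is hypothetical: the integer-valued environment, the callables `propose`, `imagine`, `reliability`, and `refine`, and the fixed `threshold` are stand-ins chosen for illustration, and the gate is assumed to pass the imagined future to the refinement step only when the reliability estimate is high enough.

```python
from typing import List, Tuple

Transition = Tuple[int, int, int]  # (state, action, next_state)


def gated_action(state, propose, imagine, reliability, refine, threshold=0.5):
    """Pick an action, refining it only when the imagined future is trusted."""
    candidate = propose(state)
    predicted = imagine(state, candidate)             # world-model lookahead
    score = reliability(state, candidate, predicted)  # agent's trust in the prediction
    if score >= threshold:                            # gate open: use the lookahead
        return refine(state, candidate, predicted)
    return candidate                                  # gate closed: keep the proposal


def rollout(start_state, horizon, env_step, **policy_parts) -> List[Transition]:
    """Collect on-policy transitions; these would later supervise the world model."""
    transitions = []
    state = start_state
    for _ in range(horizon):
        action = gated_action(state, **policy_parts)
        next_state = env_step(state, action)
        transitions.append((state, action, next_state))
        state = next_state
    return transitions


# Toy setup: states are integers, actions are +1 or -1, and larger states are better.
def toy_propose(state):
    return -1  # a deliberately poor initial proposal


def toy_imagine(state, action):
    return state + action  # toy world model that happens to be exact


def toy_refine(state, action, predicted):
    return 1 if predicted < state else action  # flip actions predicted to lose ground


def toy_env_step(state, action):
    return state + action


trusted = rollout(0, 3, toy_env_step, propose=toy_propose, imagine=toy_imagine,
                  reliability=lambda s, a, p: 1.0, refine=toy_refine)
untrusted = rollout(0, 3, toy_env_step, propose=toy_propose, imagine=toy_imagine,
                    reliability=lambda s, a, p: 0.0, refine=toy_refine)
print(trusted)
print(untrusted)
```

When the prediction is trusted, the poor proposal is corrected at every step; when it is not, the agent falls back to its original action. In both cases the collected transitions reflect the states the current policy actually visits, which is the data a co-evolving world model would be updated on.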
2. Methodological Rigor
The methodology is technically detailed and generally sound. Several design choices demonstrate careful engineering:
However, several methodological concerns arise:
3. Potential Impact
Breadth of evaluation: Testing across four diverse benchmarks (embodied planning, scientific reasoning, web navigation, tool use) with multiple backbone sizes (4B, 8B, 30B-A3B) provides reasonably strong evidence of generality.
Practical significance: The +16.75% relative improvement with Qwen3-4B is notable, particularly because the gains are larger for smaller models, suggesting COMAP could democratize strong agent capabilities without requiring massive backbones. The finding that Qwen3-8B with COMAP approaches DeepSeek-V4-Pro with Imagine-and-Act is compelling.
Limitations on broader impact: The framework is restricted to text-representable environments, which limits applicability to multimodal or visual settings. The additional inference-time world-model call adds latency. The co-evolving training requires environment access for rollouts, which may not always be available.
4. Timeliness & Relevance
This work is highly timely. The field is rapidly moving from static prompt-based agents toward learning-based agent frameworks. The co-evolution paradigm sits at the intersection of several active research threads: world models for planning (WKM, IWM), agent self-improvement (Reflexion, AgentGym), and on-policy distillation. The paper positions itself well against concurrent work like WebEvolver (Fang et al., 2025), distinguishing COMAP's closed-loop formulation from methods that use world models merely as auxiliary planning components.
The reference to very recent models (GPT-5.4, DeepSeek-V4, Qwen3) suggests the work is current, though the futuristic model names (2026 publication dates) are unusual and may indicate speculative baselines.
5. Strengths & Limitations
Strengths:
Limitations:
Notable observations:
Summary
COMAP presents a well-motivated and technically detailed framework for an important problem. The co-evolution concept is sound, the experimental evidence is broad, and the ablations are informative. The main impact is in demonstrating that dynamic world-model adaptation through self-distillation meaningfully improves both state prediction and downstream task performance. While the framework has non-trivial complexity and still requires initialization resources, it represents a meaningful advance in the LLM agent learning paradigm.
Generated Jun 2, 2026
Comparison History (25)
COMAP addresses a fundamental bottleneck in autonomous agents by enabling the continuous, closed-loop co-evolution of world models and policies without relying on external rewards. This self-improving framework advances the critical frontier of world-modeling in AI, offering broader theoretical implications for interactive AI, embodied planning, and long-horizon decision-making compared to the task-specific reasoning extraction in Paper 1.
Paper 1 introduces a foundational algorithmic framework (COMAP) for co-evolving world models and agent policies, applicable across diverse domains like embodied planning, web navigation, and tool use. While Paper 2 provides a highly valuable domain-specific medical benchmark, Paper 1's methodology addresses a core challenge in general AI agent design, offering broader theoretical impact and wider applicability across the machine learning landscape.
Paper 1 addresses a critical gap in evaluating deep-research agents by introducing process-level error localization rather than just outcome-based evaluation. TELBench (a benchmark with 2,790 real trajectories) and DRIFT (a claim-centric auditing framework) provide foundational infrastructure for understanding and debugging agentic AI systems. As deep-research agents become increasingly deployed, tools for diagnosing failure modes at the span level will be essential. While Paper 2's COMAP is a solid contribution to co-evolving world models and policies, it represents more incremental progress in agent training methodology. Paper 1's broader applicability to AI safety, reliability, and interpretability gives it higher potential impact.
Paper 2 proposes a comprehensive conceptual framework (ICAM) that could reshape how the community thinks about LLM-based systems architecture. Its breadth of impact spans systems engineering, AI, and computer architecture, offering design laws and a unifying six-layer model for an emerging paradigm. While Paper 1 presents solid empirical work with a co-evolution framework showing good results, it represents an incremental advance in agent-world model training. Paper 2's visionary framing of model-native computing as a new architectural paradigm has greater potential to influence multiple research directions and establish foundational abstractions for a rapidly growing field.
Paper 2 (COMAP) has higher potential impact due to its broader applicability and timeliness: co-evolving world models and policies addresses a central bottleneck in LLM agents—adaptation under on-policy distribution shift—across embodied planning, web navigation, and tool use. The closed-loop self-distillation without external rewards/verifiers increases real-world deployability. Methodologically, it introduces a general interaction-and-update paradigm that could influence agent learning, planning, and self-improvement research across fields. Paper 1 is valuable for reasoning reliability, but its scope is narrower (primarily math/symbolic anchoring) and closer to existing tool-augmented reasoning lines.
Paper 1 introduces a broadly applicable, conceptually novel interaction formalism (Engagement Process) that generalizes the action–observation interface by making time explicit, potentially influencing RL/POMDP theory, HCI, robotics, and systems with latency/persistence. Its framing as a temporal interface could standardize modeling across many domains beyond LLM agents. Paper 2 is timely and practically useful for LLM agents, but is more incremental within current world-model/self-improvement trends and likely narrower in cross-field impact. Paper 1’s abstraction suggests higher long-term scientific reach if rigorously developed.
Paper 1 introduces both a novel benchmark (Causal-Plan-Bench) and a large-scale training corpus (Causal-Plan-1M) that address a fundamental gap in embodied AI: the distinction between token prediction and causal reasoning. The discovery of a 'Causal Scaling Law' is a significant empirical contribution. While Paper 2's co-evolution framework (COMAP) is innovative and practical, Paper 1's contributions—reframing the evaluation paradigm, providing reusable community resources, and demonstrating systematic scaling behavior—have broader potential to redirect research priorities in embodied AI, giving it higher long-term impact.
Paper 2 (COMAP) likely has higher scientific impact due to a more novel methodological contribution: a closed-loop co-evolution of world models and agent policies without relying on external rewards/verifiers, applicable across multiple benchmark families (embodied planning, web navigation, tool use). This advances core agent-learning principles and can generalize broadly. Paper 1 (MCP-Persona) is timely and practically valuable as a benchmark for personalized MCP tool use, but its primary contribution is evaluative infrastructure with narrower conceptual novelty and potentially more limited cross-field methodological influence.
Paper 2 targets a broadly important and timely safety/robustness failure mode (answering under insufficient information) with clear real-world stakes (medical/high-risk domains). Its framing (detection-to-abstention gap) and trajectory-level control method (Judge-Then-Solve) are likely broadly applicable across reasoning models and deployment settings, and can influence evaluation and training practices. Paper 1 is innovative for interactive agents, but its impact is more scoped to agentic benchmarking and world-model co-training. Overall, Paper 2 has wider cross-field relevance and deployment impact.
COMAP introduces a novel co-evolutionary framework for world models and agent policies that addresses fundamental limitations in LLM agent training—the mismatch between fixed world models and evolving agent distributions. It demonstrates broad applicability across embodied planning, web navigation, and tool-use benchmarks with significant improvements. Paper 2, while valuable in combining generative models with classical search for planning, addresses a narrower problem (test-time inference efficiency for combinatorial planning). COMAP's broader scope, self-improving loop paradigm, and relevance to the rapidly growing LLM agent field give it higher potential impact.
Paper 2 directly accelerates scientific discovery by automating quantitative modeling workflows, a critical bottleneck across numerous disciplines. By enabling VLMs to dynamically create diagnostic tools and successfully applying this to complex real-world problems like astrophysics, it promises a wider, more direct impact on the scientific community than Paper 1, which primarily advances general AI agent methodology.
Paper 1 proposes a highly innovative framework (COMAP) that advances LLM agent autonomy by co-evolving world models and agent policies without relying on fixed external rewards. This addresses a major bottleneck in agent development. Its broad applicability across embodied planning, web navigation, and tool use, combined with strong empirical results, suggests immense potential for shaping future research in autonomous systems. While Paper 2 addresses an important verification problem (reward hacking in RLVR), Paper 1 offers a broader, more paradigm-shifting contribution to the rapidly expanding field of interactive AI agents.
COMAP addresses a fundamental challenge in LLM agent development (co-evolving world models and policies without external rewards) with broad applicability across embodied planning, web navigation, and tool use. The framework introduces a novel closed-loop training paradigm with strong empirical results (+16.75% relative improvement with Qwen3-4B). Its practical impact spans multiple AI application domains and addresses timely needs in autonomous agent development. Paper 2, while methodologically rigorous and novel in causal inference evaluation, addresses a narrower problem with more limited immediate applications, primarily serving as an assessment tool rather than enabling new capabilities.
Paper 2 exposes a highly timely and critical vulnerability in modern reasoning LLMs (e.g., extracting hidden 'thoughts' that companies try to protect). This has profound implications for AI safety, intellectual property, and model distillation, making it likely to drive immediate and widespread follow-up research. While Paper 1 presents a strong architectural improvement for LLM agents, Paper 2's findings on interface-level trace hiding bypasses have broader disruptive impact across both academia and the commercial AI industry.
Paper 1 (SRPO) addresses a fundamental problem in multimodal RLVR—uniform credit assignment across functionally distinct token roles—with a principled, theoretically grounded solution that decomposes advantages into perception and reasoning components without external models. This tackles a core limitation in training large vision-language models with RL, a rapidly growing area. Paper 2 (COMAP) proposes an interesting co-evolution framework for world models and agent policies, but its contribution is more incremental in the agent-planning space. SRPO's token-level credit assignment insight has broader methodological implications for RLVR research across modalities.
COMAP presents a concrete, well-evaluated framework for co-evolving world models and agent policies for LLM agents, addressing a timely problem in AI. It demonstrates strong empirical results across multiple benchmarks (+16.75% relative improvement with Qwen3-4B), provides code, and tackles practical challenges in embodied AI, web navigation, and tool use. Paper 2 presents a purely abstract mathematical framework for representing conflict in data modulation without concrete implementations, empirical validation, or clear practical applications, making its impact speculative and harder to assess. COMAP's timeliness in the LLM agents space and methodological rigor give it significantly higher impact potential.
COMAP addresses a broader and more fundamental problem (co-evolving world models with agent policies for LLM agents) with wide applicability across embodied planning, web navigation, and tool use. Its closed-loop co-evolution framework is more novel and generalizable than JNO's incremental improvement to knowledge editing. The stronger empirical gains (+16.75% relative improvement with Qwen3-4B), broader benchmark coverage across multiple domains, and the growing importance of LLM agents make COMAP more likely to attract citations and inspire follow-up work across multiple research communities.
Paper 2 likely has higher scientific impact due to a concrete, novel algorithmic framework (co-evolving world models and policies) with demonstrated benchmark gains, clear methodological evaluation, and immediate applicability to LLM agents in web, tool-use, and embodied settings. Its closed-loop self-distillation approach is timely and broadly relevant to agentic AI, planning, and interactive systems, increasing cross-field uptake. Paper 1 offers a valuable conceptual framework for interaction-centered analysis, but impact may be more diffuse and slower unless paired with standardized operationalizations and empirical validation.
Paper 2 (COMAP) has higher likely scientific impact due to stronger timeliness and broader cross-domain relevance: improving LLM-agent decision-making via co-evolving world models and policies applies to embodied agents, web navigation, and tool use. The closed-loop adaptation and future-aware reflection address a widely recognized limitation (static world models, reliance on external verifiers) with clear real-world implications for autonomous systems. Paper 1 is novel and rigorous for embedding alignment and database integration, but its impact is narrower and more specialized than agent/world-model co-training, which can influence multiple subfields (RL, planning, LLM agents, alignment).
COMAP addresses the timely and high-impact intersection of LLM agents and world models, proposing a co-evolutionary framework applicable across embodied planning, web navigation, and tool use. Its broader applicability to the rapidly growing LLM agent ecosystem, combined with practical relevance (no external reward/verifier dependency), open-source code, and strong empirical gains (+16.75% relative improvement with Qwen3-4B), gives it wider potential impact. While Paper 1 makes solid theoretical contributions to offline RL with Bayesian methods, it operates in a more mature and narrower subfield with incremental advances over existing approaches.