COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

Youwei Liu, Jian Wang, Hanlin Wang, Wenjie Li

Jun 1, 2026

arXiv:2606.02372v1

cs.AI (primary), cs.CL

#876 of 3355 · Artificial Intelligence

Tournament Score: 1454±43 (scale 1050 to 1800)

Win Rate: 68%

Rating: 7/10

Significance 7 · Rigor 7 · Novelty 7 · Clarity 7.5

Abstract

Equipping language agents with world models enables them to anticipate environment dynamics and evaluate candidate actions before execution. However, existing textual world models are typically fixed after training, preventing them from adapting to the on-policy state-action distributions induced by an evolving agent. Meanwhile, agent-improvement methods often rely on external rewards or verifiers, limiting their applicability in realistic interactive environments. In this paper, we propose COMAP, a novel framework that co-evolves textual world models and agent policies through closed-loop interaction. At each decision step, the world model predicts future state feedback for candidate actions, and the agent performs future-aware reflection by estimating the reliability of this feedback and refining its action accordingly. The resulting on-policy trajectories are then used to update the world model via self-distillation, allowing it to better match the agent's evolving interaction distribution. Across embodied task planning, Web navigation, and tool-use benchmarks, COMAP consistently outperforms competitive baselines, e.g., +16.75% relative improvement with Qwen3-4B. Further analyses show that the co-evolutionary loop improves the world model's prediction accuracy over time and leads to more effective long-horizon decision-making. Our code is available at: https://github.com/loyiv/CoMAP.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: COMAP

1. Core Contribution

COMAP addresses a genuine gap in LLM agent research: the mismatch between static, fixed-after-training textual world models and the non-stationary state-action distributions produced by evolving agent policies. The paper proposes a closed-loop co-evolution framework with two interacting mechanisms: (1) on-policy self-distillation for the world model, where a student world model is trained with real-state supervision plus soft distributional guidance from an EMA teacher; and (2) future-aware reflection for the agent policy, where imagined future states guide action refinement through a gated mechanism.
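To make the first mechanism concrete, here is a minimal Python sketch of an EMA teacher update and a loss that mixes a hard real-state target with a soft teacher distribution. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `decay` and `soft_weight` parameters, and the choice of KL(teacher || student) as the soft term are all hypothetical, and an actual world model would work on tensors rather than plain lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    `decay` is a hypothetical name; the review does not say which of the
    paper's hyperparameters plays this role.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def distill_loss(student_logits, teacher_logits, target_ids, soft_weight=0.5):
    """Average per-token loss: hard cross-entropy plus weighted KL(teacher || student).

    Assumed form only. Each *_logits entry is the logit vector for one token
    position, and target_ids holds the real next-state token ids.
    """
    total = 0.0
    for s_log, t_log, y in zip(student_logits, teacher_logits, target_ids):
        p_s = softmax(s_log)
        p_t = softmax(t_log)
        hard = -math.log(p_s[y])  # supervision from the real state
        soft = sum(               # guidance from the teacher distribution
            pt * (math.log(pt) - math.log(ps))
            for pt, ps in zip(p_t, p_s)
            if pt > 0
        )
        total += hard + soft_weight * soft
    return total / len(target_ids)
```

When student and teacher agree exactly, the soft term vanishes and only the real-state cross-entropy remains; any disagreement adds a non-negative penalty.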

The key conceptual insight — that world models and policies should co-evolve rather than be trained independently — is well-motivated. The distribution shift problem (a fixed world model becoming inaccurate as the policy visits novel states) is real and underexplored in the textual world model literature. The framework elegantly couples both learning loops: improved policies generate more diverse transitions for world-model training, while improved world models provide better lookahead signals for policy refinement.

2. Methodological Rigor

The methodology is technically detailed and generally sound. Several design choices demonstrate careful engineering:

World-state gate: Controls whether the policy trains with imagined or real future states based on student-teacher agreement, creating an effective curriculum.

Action gate: Uses three complementary conditions (refinement probability, confidence, and canonicalization check) to prevent unnecessary or harmful revisions.

Self-distillation mechanism: Combines hard real-state targets with soft token-level teacher distributions, providing denser supervision on transition-critical tokens.

Delta-F1 metric: A well-designed evaluation metric for world models that focuses on action-induced state changes rather than superficial text overlap.
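The two gates listed above can be sketched in a few lines of Python. This is a hedged illustration only: the `Proposal` structure, the threshold names and default values, the direction of the world-state gate (high agreement selects the imagined state), and the reading of the "canonicalization check" as "the refined action normalizes to a valid action" are assumptions, not details confirmed by the review.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """A candidate revision produced during reflection (hypothetical structure)."""
    original_action: str
    refined_action: str
    refine_prob: float  # estimated probability that a revision is needed
    confidence: float   # confidence in the refined action


def canonicalize(action: str) -> str:
    """Normalize case and whitespace so equivalent action strings compare equal."""
    return " ".join(action.lower().split())


def world_state_gate(agreement: float, agreement_threshold: float = 0.8) -> str:
    """Pick the future state used for policy training.

    Assumption: high student-teacher agreement means the imagined state is
    trusted; otherwise training falls back to the real state.
    """
    return "imagined" if agreement >= agreement_threshold else "real"


def action_gate(p: Proposal, valid_actions: set,
                refine_threshold: float = 0.5,
                confidence_threshold: float = 0.7) -> str:
    """Accept the refined action only if all three conditions hold."""
    refined = canonicalize(p.refined_action)
    accept = (
        p.refine_prob >= refine_threshold
        and p.confidence >= confidence_threshold
        and refined in valid_actions
    )
    return refined if accept else p.original_action
```

For example, `action_gate(Proposal("go to desk 1", "Open  Drawer 1", 0.8, 0.9), {"open drawer 1"})` returns `"open drawer 1"`, while lowering `confidence` to 0.3 keeps the original action.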

However, several methodological concerns arise:

The initialization phase requires expert demonstrations and environment rollouts for suffix-based supervision, which partially undermines the claim of reducing reliance on external supervision. The framework still needs expert trajectories for warm-up.

The number of hyperparameters is substantial (τ_wm, τ_p, τ_q, η, α, β, μ), and sensitivity analysis is limited to the world-state gate threshold only.

The ablation study is conducted only on ALFWorld, raising questions about whether component contributions generalize across benchmarks.

3. Potential Impact

Breadth of evaluation: Testing across four diverse benchmarks (embodied planning, scientific reasoning, web navigation, tool use) with multiple backbone sizes (4B, 8B, 30B-A3B) provides reasonably strong evidence of generality.

Practical significance: The +16.75% relative improvement with Qwen3-4B is notable, particularly because the gains are larger for smaller models, suggesting COMAP could democratize strong agent capabilities without requiring massive backbones. The finding that Qwen3-8B with COMAP approaches DeepSeek-V4-Pro with Imagine-and-Act is compelling.

Limitations on broader impact: The framework is restricted to text-representable environments, which limits applicability to multimodal or visual settings. The additional inference-time world-model call adds latency. The co-evolving training requires environment access for rollouts, which may not always be available.

4. Timeliness & Relevance

This work is highly timely. The field is rapidly moving from static prompt-based agents toward learning-based agent frameworks. The co-evolution paradigm sits at the intersection of several active research threads: world models for planning (WKM, IWM), agent self-improvement (Reflexion, AgentGym), and on-policy distillation. The paper positions itself well against concurrent work like WebEvolver (Fang et al., 2025), distinguishing COMAP's closed-loop formulation from methods that use world models merely as auxiliary planning components.

The reference to very recent models (GPT-5.4, DeepSeek-V4, Qwen3) suggests the work is current, though the futuristic model names (2026 publication dates) are unusual and may indicate speculative baselines.

5. Strengths & Limitations

Strengths:

Clean conceptual framing of the distribution shift problem between world models and policies

Comprehensive experimental design across four benchmarks and multiple model scales

Dual evaluation of both policy performance (success rate) and world-model quality (Delta-F1)

Detailed learning dynamics analysis (Figure 4) showing stable co-evolution

Diagnostic metrics (URR, HRR, BRP) for reflection quality assessment

Strong ablation showing that frozen world models degrade dramatically (−26.9% to −29.1% Delta-F1)

Code availability and detailed reproducibility information (training costs, parameter settings)
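Since Delta-F1 appears in several of the strengths above, a small sketch may help show why a change-focused metric differs from plain text overlap. The review does not give the paper's exact definition, so everything here is an assumption: states are modeled as sets of fact strings, the "delta" is the set of facts added or removed relative to the previous state, and the score is an F1 between predicted and real deltas.

```python
def state_delta(before: set, after: set) -> set:
    """Facts that changed: ('+', fact) if added, ('-', fact) if removed."""
    added = {("+", f) for f in after - before}
    removed = {("-", f) for f in before - after}
    return added | removed


def delta_f1(before: set, real_after: set, predicted_after: set) -> float:
    """F1 between predicted and real state changes (assumed definition)."""
    gold = state_delta(before, real_after)
    pred = state_delta(before, predicted_after)
    if not gold and not pred:
        return 1.0  # nothing changed and nothing was predicted to change
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


before = {"drawer closed", "key in drawer", "agent at desk"}
real_after = {"drawer open", "key in drawer", "agent at desk"}

# A world model that simply copies the previous state overlaps heavily with
# the real next state as text, yet captures none of the action's effect.
copy_score = delta_f1(before, real_after, set(before))
exact_score = delta_f1(before, real_after, set(real_after))
```

Under this definition the copy-the-input prediction scores 0.0 and the exact prediction scores 1.0, which is the behavior a metric targeting action-induced changes would be expected to have.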

Limitations:

The paper assumes text-representable states, excluding multimodal environments

Expert demonstrations are still required for initialization, weakening the self-improvement narrative

The self-distillation mechanism, while effective, adds training complexity (EMA teacher, privileged state access)

Cross-benchmark ablations are missing — only ALFWorld is ablated

The improvement on the larger 8B model (69.53→72.11, about 2.6 points absolute, or ~3.7% relative) is more modest than on 4B, and it is unclear whether gains persist at larger scales

The paper's framing as eliminating the need for "external rewards or verifiers" is slightly overstated given the reliance on expert demonstrations and environment reward signals for initialization
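As a quick arithmetic check on the 8B figures quoted in the limitations above, the snippet below separates the absolute gain from the relative one. It assumes both numbers are scores on the same 0 to 100 scale; the helper name `gains` is made up for this example.

```python
def gains(baseline: float, improved: float):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


abs_gain, rel_gain = gains(69.53, 72.11)
print(f"absolute: {abs_gain:.2f} points, relative: {rel_gain:.2f}%")
# prints "absolute: 2.58 points, relative: 3.71%"
```

The roughly 3.7% figure is therefore a relative improvement, which makes it directly comparable to the +16.75% relative improvement reported for Qwen3-4B.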

Notable observations:

The change-token NLL analysis (Figure 6) provides a precise mechanistic explanation of why self-distillation helps

The world-state gate adoption ratio analysis (Figure 5) elegantly demonstrates increasing world-model reliability

Training cost (~37.7 GPU-hours on 4×A100) is reasonable for the gains achieved

Summary

COMAP presents a well-motivated and technically detailed framework for an important problem. The co-evolution concept is sound, the experimental evidence is broad, and the ablations are informative. The main impact is in demonstrating that dynamic world-model adaptation through self-distillation meaningfully improves both state prediction and downstream task performance. While the framework has non-trivial complexity and still requires initialization resources, it represents a meaningful advance in the LLM agent learning paradigm.

Rating: 7/10

Significance 7 · Rigor 7 · Novelty 7 · Clarity 7.5

Generated Jun 2, 2026

Comparison History (25)

vs. Inducing Reasoning Primitives from Agent Traces

gemini-3.1 · 6/3/2026

COMAP addresses a fundamental bottleneck in autonomous agents by enabling the continuous, closed-loop co-evolution of world models and policies without relying on external rewards. This self-improving framework advances the critical frontier of world-modeling in AI, offering broader theoretical implications for interactive AI, embodied planning, and long-horizon decision-making compared to the task-specific reasoning extraction in Paper 1.

vs. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

gemini-3.1 · 6/3/2026

Paper 1 introduces a foundational algorithmic framework (COMAP) for co-evolving world models and agent policies, applicable across diverse domains like embodied planning, web navigation, and tool use. While Paper 2 provides a highly valuable domain-specific medical benchmark, Paper 1's methodology addresses a core challenge in general AI agent design, offering broader theoretical impact and wider applicability across the machine learning landscape.

vs. Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical gap in evaluating deep-research agents by introducing process-level error localization rather than just outcome-based evaluation. TELBench (a benchmark with 2,790 real trajectories) and DRIFT (a claim-centric auditing framework) provide foundational infrastructure for understanding and debugging agentic AI systems. As deep-research agents become increasingly deployed, tools for diagnosing failure modes at the span level will be essential. While Paper 2's COMAP is a solid contribution to co-evolving world models and policies, it represents more incremental progress in agent training methodology. Paper 1's broader applicability to AI safety, reliability, and interpretability gives it higher potential impact.

vs. Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

claude-opus-4.6 · 6/2/2026

Paper 2 proposes a comprehensive conceptual framework (ICAM) that could reshape how the community thinks about LLM-based systems architecture. Its breadth of impact spans systems engineering, AI, and computer architecture, offering design laws and a unifying six-layer model for an emerging paradigm. While Paper 1 presents solid empirical work with a co-evolution framework showing good results, it represents an incremental advance in agent-world model training. Paper 2's visionary framing of model-native computing as a new architectural paradigm has greater potential to influence multiple research directions and establish foundational abstractions for a rapidly growing field.

vs. eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion

gpt-5.2 · 6/2/2026

Paper 2 (COMAP) has higher potential impact due to its broader applicability and timeliness: co-evolving world models and policies addresses a central bottleneck in LLM agents—adaptation under on-policy distribution shift—across embodied planning, web navigation, and tool use. The closed-loop self-distillation without external rewards/verifiers increases real-world deployability. Methodologically, it introduces a general interaction-and-update paradigm that could influence agent learning, planning, and self-improvement research across fields. Paper 1 is valuable for reasoning reliability, but its scope is narrower (primarily math/symbolic anchoring) and closer to existing tool-augmented reasoning lines.

vs. Engagement Process: Rethinking the Temporal Interface of Action and Observation

gpt-5.2 · 6/2/2026

Paper 1 introduces a broadly applicable, conceptually novel interaction formalism (Engagement Process) that generalizes the action–observation interface by making time explicit, potentially influencing RL/POMDP theory, HCI, robotics, and systems with latency/persistence. Its framing as a temporal interface could standardize modeling across many domains beyond LLM agents. Paper 2 is timely and practically useful for LLM agents, but is more incremental within current world-model/self-improvement trends and likely narrower in cross-field impact. Paper 1’s abstraction suggests higher long-term scientific reach if rigorously developed.

vs. Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

claude-opus-4.6 · 6/2/2026

Paper 1 introduces both a novel benchmark (Causal-Plan-Bench) and a large-scale training corpus (Causal-Plan-1M) that address a fundamental gap in embodied AI: the distinction between token prediction and causal reasoning. The discovery of a 'Causal Scaling Law' is a significant empirical contribution. While Paper 2's co-evolution framework (COMAP) is innovative and practical, Paper 1's contributions—reframing the evaluation paradigm, providing reusable community resources, and demonstrating systematic scaling behavior—have broader potential to redirect research priorities in embodied AI, giving it higher long-term impact.

vs. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation

gpt-5.2 · 6/2/2026

Paper 2 (COMAP) likely has higher scientific impact due to a more novel methodological contribution: a closed-loop co-evolution of world models and agent policies without relying on external rewards/verifiers, applicable across multiple benchmark families (embodied planning, web navigation, tool use). This advances core agent-learning principles and can generalize broadly. Paper 1 (MCP-Persona) is timely and practically valuable as a benchmark for personalized MCP tool use, but its primary contribution is evaluative infrastructure with narrower conceptual novelty and potentially more limited cross-field methodological influence.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

gpt-5.2 · 6/2/2026

Paper 2 targets a broadly important and timely safety/robustness failure mode (answering under insufficient information) with clear real-world stakes (medical/high-risk domains). Its framing (detection-to-abstention gap) and trajectory-level control method (Judge-Then-Solve) are likely broadly applicable across reasoning models and deployment settings, and can influence evaluation and training practices. Paper 1 is innovative for interactive agents, but its impact is more scoped to agentic benchmarking and world-model co-training. Overall, Paper 2 has wider cross-field relevance and deployment impact.

vs. Efficient Test-time Inference for Generative Planning Models

claude-opus-4.6 · 6/2/2026

COMAP introduces a novel co-evolutionary framework for world models and agent policies that addresses fundamental limitations in LLM agent training—the mismatch between fixed world models and evolving agent distributions. It demonstrates broad applicability across embodied planning, web navigation, and tool-use benchmarks with significant improvements. Paper 2, while valuable in combining generative models with classical search for planning, addresses a narrower problem (test-time inference efficiency for combinatorial planning). COMAP's broader scope, self-improving loop paradigm, and relevance to the rapidly growing LLM agent field give it higher potential impact.

vs. VESTA: Visual Exploration with Statistical Tool Agents

gemini-3.1 · 6/2/2026

Paper 2 directly accelerates scientific discovery by automating quantitative modeling workflows, a critical bottleneck across numerous disciplines. By enabling VLMs to dynamically create diagnostic tools and successfully applying this to complex real-world problems like astrophysics, it promises a more direct impact on the broader scientific community than Paper 1, which primarily advances general AI agent methodology.

vs. Before the Model Learns the Bug: Fuzzing RLVR Verifiers

gemini-3.1 · 6/2/2026

Paper 1 proposes a highly innovative framework (COMAP) that advances LLM agent autonomy by co-evolving world models and agent policies without relying on fixed external rewards. This addresses a major bottleneck in agent development. Its broad applicability across embodied planning, web navigation, and tool use, combined with strong empirical results, suggests immense potential for shaping future research in autonomous systems. While Paper 2 addresses an important verification problem (reward hacking in RLVR), Paper 1 offers a broader, more paradigm-shifting contribution to the rapidly expanding field of interactive AI agents.

vs. Evaluating Bivariate Causal Statements Based on Mutual Compatibility

claude-opus-4.6 · 6/2/2026

COMAP addresses a fundamental challenge in LLM agent development—co-evolving world models and policies without external rewards—with broad applicability across embodied planning, web navigation, and tool use. The framework introduces a novel closed-loop training paradigm with strong empirical results (+16.75% improvement). Its practical impact spans multiple AI application domains and addresses timely needs in autonomous agent development. Paper 2, while methodologically rigorous and novel in causal inference evaluation, addresses a narrower problem with more limited immediate applications, primarily serving as an assessment tool rather than enabling new capabilities.

vs. Hidden Thoughts Are Not Secret: Reasoning Trace Exposure in LLMs

gemini-3.1 · 6/2/2026

Paper 2 exposes a highly timely and critical vulnerability in modern reasoning LLMs (e.g., extracting hidden 'thoughts' that companies try to protect). This has profound implications for AI safety, intellectual property, and model distillation, making it likely to drive immediate and widespread follow-up research. While Paper 1 presents a strong architectural improvement for LLM agents, Paper 2's findings on interface-level trace hiding bypasses have broader disruptive impact across both academia and the commercial AI industry.

vs. Structured Role-Aware Policy Optimization for Multimodal Reasoning

claude-opus-4.6 · 6/2/2026

Paper 1 (SRPO) addresses a fundamental problem in multimodal RLVR—uniform credit assignment across functionally distinct token roles—with a principled, theoretically grounded solution that decomposes advantages into perception and reasoning components without external models. This tackles a core limitation in training large vision-language models with RL, a rapidly growing area. Paper 2 (COMAP) proposes an interesting co-evolution framework for world models and agent policies, but its contribution is more incremental in the agent-planning space. SRPO's token-level credit assignment insight has broader methodological implications for RLVR research across modalities.

vs. A Mathematical Conflict Framework for Contextual Data Modulation

claude-opus-4.6 · 6/2/2026

COMAP presents a concrete, well-evaluated framework for co-evolving world models and agent policies for LLM agents, addressing a timely problem in AI. It demonstrates strong empirical results across multiple benchmarks (+16.75% improvement), provides code, and tackles practical challenges in embodied AI, web navigation, and tool use. Paper 2 presents a purely abstract mathematical framework for representing conflict in data modulation without concrete implementations, empirical validation, or clear practical applications, making its impact speculative and harder to assess. COMAP's timeliness in the LLM agents space and methodological rigor give it significantly higher impact potential.

vs. Revisiting Ripple Effects in Knowledge Editing through Pressure-Aware Joint Neighborhood Optimization

claude-opus-4.6 · 6/2/2026

COMAP addresses a broader and more fundamental problem—co-evolving world models with agent policies for LLM agents—with wide applicability across embodied planning, web navigation, and tool use. Its closed-loop co-evolution framework is more novel and generalizable than JNO's incremental improvement to knowledge editing. The stronger empirical gains (+16.75%), broader benchmark coverage across multiple domains, and the growing importance of LLM agents make COMAP more likely to attract citations and inspire follow-up work across multiple research communities.

vs. Interaction-Centered Intelligence: Toward Interaction as the Primary Unit of Analysis in Co-Creative AI and Human-AI Systems

gpt-5.2 · 6/2/2026

Paper 2 likely has higher scientific impact due to a concrete, novel algorithmic framework (co-evolving world models and policies) with demonstrated benchmark gains, clear methodological evaluation, and immediate applicability to LLM agents in web, tool-use, and embodied settings. Its closed-loop self-distillation approach is timely and broadly relevant to agentic AI, planning, and interactive systems, increasing cross-field uptake. Paper 1 offers a valuable conceptual framework for interaction-centered analysis, but impact may be more diffuse and slower unless paired with standardized operationalizations and empirical validation.

vs. Vector Linking via Cross-Model Local Isometric Consistency

gpt-5.2 · 6/2/2026

Paper 2 (COMAP) has higher likely scientific impact due to stronger timeliness and broader cross-domain relevance: improving LLM-agent decision-making via co-evolving world models and policies applies to embodied agents, web navigation, and tool use. The closed-loop adaptation and future-aware reflection address a widely recognized limitation (static world models, reliance on external verifiers) with clear real-world implications for autonomous systems. Paper 1 is novel and rigorous for embedding alignment and database integration, but its impact is narrower and more specialized than agent/world-model co-training, which can influence multiple subfields (RL, planning, LLM agents, alignment).

vs. Regularized Offline Policy Optimization with Posterior Hybrid Bayesian Belief

claude-opus-4.6 · 6/2/2026

COMAP addresses the timely and high-impact intersection of LLM agents and world models, proposing a co-evolutionary framework applicable across embodied planning, web navigation, and tool use. Its broader applicability to the rapidly growing LLM agent ecosystem, combined with practical relevance (no external reward/verifier dependency), open-source code, and strong empirical gains (+16.75%), gives it wider potential impact. While Paper 1 makes solid theoretical contributions to offline RL with Bayesian methods, it operates in a more mature and narrower subfield with incremental advances over existing approaches.