Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han

May 26, 2026

arXiv:2605.26789v1

cs.AI (primary)

#454 of 2682 · Artificial Intelligence

Tournament Score: 1485 ± 44

Win Rate: 67%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes "composition collapse"—the phenomenon where LLMs that demonstrably possess individual atomic facts still fail systematically at composing them into multi-hop reasoning chains. The key methodological contribution is a double-gate protocol that conditions composition evaluation on verified stable atomic knowledge, changing the estimand from an aggregate compositionality gap to a *residual* composition failure rate. This yields a three-channel decomposition of post-training effects: atomic stability (∆atom), residual composition at matched atoms (∆comp), and critical depth (∆depth).

The central finding is striking: post-training recipes with statistically indistinguishable atomic knowledge (87–90% stability) diverge by over 40 percentage points in residual composition failure at depth 2. This cleanly demonstrates that aggregate benchmark scores conflate knowledge acquisition with reasoning capability—a conflation with real consequences for model development decisions.

Methodological Rigor

Strengths in design: The double-gate protocol is well-motivated. Gate 1 (paraphrase-stable atomic knowledge) and Gate 2 (sub-question correctness in isolation) together ensure that measured failures genuinely reflect composition inability rather than knowledge gaps. The paper quantifies the inflation from using only a single gate (2.5–9.8 pp aggregate, up to 11.4 pp per cell), justifying the added complexity.
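The gating logic described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the ChainResult record, its field names, and the toy data are hypothetical, and "single gate" is assumed here to mean Gate 1 alone.

```python
from dataclasses import dataclass


@dataclass
class ChainResult:
    # Hypothetical record for one multi-hop question; field names are assumptions.
    depth: int
    atoms_paraphrase_stable: bool  # Gate 1: atomic facts correct under paraphrase
    subquestions_correct: bool     # Gate 2: sub-questions correct in isolation
    chain_correct: bool            # composed multi-hop answer is correct


def failure_rate(results, passes):
    """Share of gate-passing chains whose composed answer is wrong (None if none pass)."""
    kept = [r for r in results if passes(r)]
    if not kept:
        return None
    return sum(not r.chain_correct for r in kept) / len(kept)


def single_gate(r):
    # Assumed single-gate baseline: condition on Gate 1 only.
    return r.atoms_paraphrase_stable


def double_gate(r):
    # Double gate: condition on Gate 1 and Gate 2 together.
    return r.atoms_paraphrase_stable and r.subquestions_correct


# Toy data, invented for illustration.
results = [
    ChainResult(2, True, True, True),
    ChainResult(2, True, True, False),
    ChainResult(2, True, False, False),  # passes Gate 1 but fails Gate 2
    ChainResult(2, False, False, False),
    ChainResult(2, True, True, True),
]

single = failure_rate(results, single_gate)   # 2 failures among 4 passing chains
double = failure_rate(results, double_gate)   # 1 failure among 3 passing chains
inflation_pp = 100 * (single - double)        # single-gate inflation in percentage points
```

In this toy case the chain that passes Gate 1 but fails Gate 2 is counted as a composition failure under the single gate, although the problem is a knowledge gap. That is the kind of inflation the review says the paper quantifies.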

Causal intervention: The controlled LoRA-GRPO experiment (§5) is the paper's strongest methodological element. By holding base model, data, and compute budget fixed while varying only the training objective (SFT-answer, SFT-trace, GRPO), the authors isolate a clean causal signal: GRPO reduces depth-4 residual failure by 52 pp relative to SFT-trace. Cross-model replication on Llama3-8B-Instruct strengthens this claim.

Weaknesses: Sample sizes are a persistent concern. At deeper depths, gate-passing n drops to single digits (e.g., n=4 at depth 8 for Mistral), making point estimates unreliable. The authors acknowledge this with bootstrap CIs and cautionary notes, but the fundamental statistical power at deep chains is limited. The benchmark is restricted to 390 temporal factual chains—a narrow domain. The cross-domain pilot (Appendix L) and in-context synthetic evaluation (Table 2) provide preliminary evidence of generality, but the paper's primary claims rest heavily on temporal reasoning. The human validation study (Appendix N) reveals that Gemini adjudication has only 57% recall, meaning the consistency tier is conservative—residual failure rates are upper bounds. While this directional bias is acknowledged, it complicates interpretation of absolute numbers.
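To make the sample-size concern concrete, the sketch below computes a percentile bootstrap interval for a failure rate at n=4 and at n=100. It is a generic illustration under stated assumptions: the bootstrap_ci helper, the 5000 resamples, and the toy outcomes are hypothetical, and the paper's exact bootstrap procedure is not described in this review.

```python
import random


def bootstrap_ci(outcomes, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 failure outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(outcomes)
    # Resample with replacement and record the failure rate of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy outcomes (1 = composition failure), both with a 50% observed failure rate.
small = [1, 1, 0, 0]   # n = 4, comparable to the deep-depth cells cited above
large = [1, 0] * 50    # n = 100

lo_small, hi_small = bootstrap_ci(small)
lo_large, hi_large = bootstrap_ci(large)
width_small = hi_small - lo_small
width_large = hi_large - lo_large
```

With only four gate-passing chains the interval spans most of the unit range, so a point estimate at that depth says little by itself; at n=100 the same observed rate yields a much narrower interval.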

Potential Impact

Evaluation methodology. The double-gate protocol is lightweight, generalizable, and addresses a genuine blind spot in current evaluation practices. If adopted, it could change how the community reports multi-hop reasoning improvements—requiring atomic-gate-controlled metrics alongside aggregate scores. The three-channel decomposition provides a shared vocabulary (∆atom, ∆comp, ∆depth) for comparing post-training recipes on a like-for-like basis.

Post-training research. The finding that SFT reasoning-trace distillation can *worsen* composition relative to an untrained baseline (Table 7) while outcome-verified RL dramatically improves it is directly actionable for practitioners choosing training recipes. The limited depth transfer finding—training on a depth closes that depth's gap but transfers weakly to adjacent depths—has implications for curriculum design.

Inference-time computation. The diagnostic finding that 70–75% of gate-passing failures are recovered by chain-of-thought reasoning (§6) locates the bottleneck in generation-time computation rather than static representation. This connects to the growing literature on inference-time scaling and test-time compute.

Timeliness & Relevance

This paper arrives at a moment when the field is heavily invested in post-training recipes (RLHF, SFT distillation, RLVR) and evaluates them primarily through aggregate benchmark scores. The demonstration that two recipes indistinguishable on knowledge benchmarks can diverge by 40+ pp on composition directly challenges current evaluation norms. With the proliferation of "reasoning models" (DeepSeek-R1, Qwen3, etc.), a measurement protocol that separates knowing from composing is timely.

Strengths

1. Clean conceptual contribution: The composition collapse phenomenon is well-defined and the double-gate protocol is a principled response to a real measurement problem.

2. Controlled causal evidence: The same-base GRPO intervention with three training objectives provides unusually clean evidence for a post-training evaluation paper.

3. Multiple convergent diagnostics: CoT recovery, in-context evaluation, prompt-end patching, and failure taxonomy all triangulate the same conclusion from different angles.

4. Actionable decomposition: The three-channel framework is immediately useful for comparing training recipes.

5. Honest limitations: The paper is transparent about small sample sizes, domain restrictions, and the upper-bound nature of residual failure.

Limitations

1. Domain narrowness: Temporal factual chains are a specific and arguably easy domain for atomic verification. Whether the protocol scales to more complex reasoning types (causal, counterfactual, mathematical) is unaddressed.

2. Scale limitations: All models are 7–13B parameters. Whether composition collapse manifests similarly at 70B+ or frontier scale is unknown.

3. Benchmark size: 390 questions is small; gate-passing subsets at deep depths are tiny, limiting statistical power.

4. Ecological validity: The XML-structured prompt format and greedy decoding are specific choices; the prompt-robustness analysis (Appendix Q) shows that answer agreement across prompt variants is only 19.7%, raising questions about how stable the measured phenomenon is across evaluation conditions.

5. No theoretical account: The paper is purely empirical—it identifies the phenomenon but offers no mechanistic explanation for *why* certain training objectives preserve composition while others destroy it.

Overall Assessment

This is a well-executed empirical paper that identifies a real and under-appreciated problem in LLM evaluation. The double-gate protocol is a genuine methodological contribution with potential for broad adoption. The causal intervention is clean and replicable. The main limitations are domain narrowness and small sample sizes at deep chain depths, which constrain the generality of the specific numerical findings. The conceptual contribution—that stable knowledge does not imply compositional reasoning, and that this gap varies dramatically across training recipes—is robust and timely.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 27, 2026

Comparison History (27)

vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

claude-opus-4.6 · 5/28/2026

Both papers expose critical flaws in how AI benchmarks conflate distinct capabilities. Paper 2 has higher impact because: (1) it addresses the rapidly growing and commercially important area of LLM search agents, with broader relevance; (2) it provides both a diagnostic framework AND a concrete, publicly available benchmark (LiveBrowseComp) that can be immediately adopted; (3) the finding that search agents rely on intrinsic knowledge rather than genuine retrieval has immediate practical implications for product development and safety; (4) the timeliness is stronger given the explosion of agentic AI systems. Paper 1's composition collapse finding is valuable but more niche in scope.

vs. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

gemini-3.1 · 5/28/2026

Bootstrapping LLM reasoning without relying on stronger teacher models addresses a critical bottleneck in AI scaling. While Paper 2 offers valuable evaluation insights, Paper 1 provides a scalable, self-improving framework that directly advances model capabilities, making its potential real-world applications and impact on future training paradigms highly significant.

vs. OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental and broadly relevant problem in LLM evaluation methodology—revealing that aggregate benchmarks can mask critical failures in compositional reasoning. The 'composition collapse' concept and double-gate protocol have wide applicability across the entire AI/NLP community, affecting how all post-training methods are evaluated. Paper 1, while technically strong as the first open implementation of Huawei's UB protocol with impressive performance gains, serves a narrower hardware/systems audience and is primarily an engineering contribution implementing an existing specification rather than introducing a new conceptual framework.

vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

claude-opus-4.6 · 5/28/2026

Paper 2 identifies a fundamental and previously overlooked phenomenon ('composition collapse') in LLM evaluation methodology, with broad implications for how the entire field assesses post-training improvements. Its contribution—showing that aggregate benchmarks mask critical failures in compositional reasoning despite stable atomic knowledge—challenges widespread evaluation practices and introduces a principled diagnostic framework. This methodological insight affects virtually all LLM research involving multi-hop reasoning. Paper 1, while technically strong with impressive empirical results on context compression, addresses a narrower optimization problem with more incremental impact.

vs. Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

gemini-3.1 · 5/27/2026

Paper 2 identifies a fundamental flaw in how multi-hop reasoning is evaluated across LLMs and introduces a novel protocol to address it. Its findings impact the entire field of LLM evaluation and post-training, offering significantly broader scientific impact than Paper 1, which, while highly rigorous and practically valuable, is largely confined to the specific domain of legal AI.

vs. Credit Assignment with Resets in Language Model Reasoning

gemini-3.1 · 5/27/2026

Paper 1 addresses the critical bottleneck of credit assignment in multi-step reasoning for LLMs. By proposing a practical, self-supervised reset mechanism to isolate and correct erroneous steps, it offers a direct path to significantly advancing LLM reasoning capabilities—a major focus of current AI research. While Paper 2 provides valuable insights into evaluation flaws, Paper 1's method for actively improving reasoning without external supervision has broader immediate applicability and higher potential to drive next-generation model performance.

vs. Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

claude-opus-4.6 · 5/27/2026

Paper 2 (SCENE) addresses a fundamental challenge in biomedical discovery—bridging broad knowledge with specific experimental data—with a novel multi-agent framework that has immediate practical applications in clinical trials and drug discovery. It demonstrates concrete utility across multiple biomedical settings. While Paper 1 provides valuable methodological insights about evaluating LLM compositional reasoning (composition collapse), its impact is more narrowly focused on AI evaluation methodology. Paper 2's breadth of real-world biomedical applications, combined with its novel framework for knowledge contextualization, gives it higher potential for cross-disciplinary impact.

vs. Fundamental Limitation in Explaining AI

claude-opus-4.6 · 5/27/2026

Paper 1 establishes a fundamental theoretical impossibility result (quadrilemma) for AI explainability that has broad implications across all of AI governance, regulation, and XAI research. Its impact spans policy, theory, and practice, providing a mathematical foundation that constrains what can be expected from explainable AI systems. Paper 2, while methodologically rigorous and practically useful for LLM evaluation, addresses a narrower issue in post-training evaluation methodology. The fundamental nature of Paper 1's impossibility result gives it greater potential to reshape discourse across multiple fields including AI policy, law, and computer science.

vs. PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

claude-opus-4.6 · 5/27/2026

PolyFusionAgent addresses a significant practical problem in polymer discovery with a comprehensive multimodal framework combining foundation models, inverse design, and literature-grounded reasoning. It has broader real-world applications across energy storage, biomedicine, and materials science, and introduces a novel end-to-end system integrating multiple modalities. Paper 1, while methodologically rigorous and insightful about LLM evaluation limitations, is more narrowly focused on diagnosing compositional reasoning failures in language models—important but incremental to the AI evaluation community rather than enabling new scientific discoveries.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental problem in LLM evaluation methodology—revealing that aggregate benchmarks can mask critical failures in compositional reasoning. The 'composition collapse' phenomenon and the proposed double-gate protocol offer broadly applicable diagnostic tools that could reshape how the community evaluates post-training methods. This has wider scientific impact across NLP/AI evaluation, touching every model that claims multi-hop reasoning ability. Paper 1, while practically useful, is a more incremental engineering contribution focused on mobile GUI agent latency optimization with narrower applicability.

vs. Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

gpt-5.2 · 5/27/2026

Paper 1 has higher potential impact due to a clearer novel contribution ("composition collapse") and a new evaluation protocol (double-gate) that changes the estimand and decomposes gains into interpretable channels, addressing a core measurement flaw in LLM post-training claims. It is broadly relevant across alignment/evaluation, reasoning, and training methodology, and timely given heavy reliance on aggregate benchmarks. Paper 2 is useful but narrower (GSM-Symbolic, one model), with modest, statistically non-significant findings and limited methodological novelty beyond comparative evaluation.

vs. From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

gemini-3.1 · 5/27/2026

Paper 2 addresses a fundamental flaw in LLM evaluation, introducing 'composition collapse' and a new measurement protocol. Its findings impact the broader AI community's understanding of multi-hop reasoning and post-training. In contrast, Paper 1 applies an existing methodology (agentic RAG) to a specialized domain (legal NLP), making its impact much narrower.

vs. Learning to Search and Searching to Learn for Generalization in Planning

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, broadly applicable self-improving framework combining weighted A* search with learned relational heuristics, demonstrates strong empirical results and striking zero-shot scaling (e.g., Blocksworld 488 blocks), and targets a longstanding, timely challenge (combinatorial generalization in planning/RL). The methodology is action-oriented and transferable across planning, RL, and GNN-based reasoning, with clear real-world relevance to automated planning and decision-making. Paper 1 is insightful diagnostically for LLM evaluation, but is narrower and primarily reframes measurement rather than enabling new capabilities.

vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

gemini-3.1 · 5/27/2026

Paper 2 provides fundamental insights into LLM multi-hop reasoning, identifying 'composition collapse' and proposing a novel evaluation protocol that challenges existing aggregate metrics. This methodological advancement has broad implications for future LLM assessment and development. In contrast, Paper 1 is primarily an engineering contribution—packaging an existing method into a Python library—which, while practically useful, offers less theoretical novelty and fundamental scientific impact.

vs. Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

gpt-5.2 · 5/27/2026

Paper 2 introduces a broadly applicable diagnostic concept (composition collapse) and an evaluation protocol (double-gate) that reframes how multi-hop reasoning improvements are measured, with implications for benchmarking, post-training research, and interpretability across many LLM domains. Its methodological contribution—conditioning compositional evaluation on stable atomic access and decomposing gains into independent channels—directly addresses a widespread evaluation blind spot and is timely for current claims about reasoning. Paper 1 is a solid training-method improvement for capability recovery, but is narrower in scope and likely less cross-field impactful.

vs. Position: AI Safety Requires Effective Controllability

gemini-3.1 · 5/27/2026

Paper 1 exposes a fundamental flaw in LLM evaluation metrics regarding compositional reasoning. By introducing the double-gate protocol to isolate atomic knowledge from reasoning ability, it provides a rigorous, actionable methodology that directly impacts how the field assesses multi-hop reasoning. While Paper 2 offers a timely conceptual shift for AI safety, Paper 1's concrete empirical findings and novel diagnostic tools will likely drive more immediate, widespread methodological changes across ML research.

vs. Credit Assignment with Resets in Language Model Reasoning

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical bottleneck in RL post-training for LLM reasoning: credit assignment in long trajectories. By introducing reset-based policy optimization (SRPO), it offers a scalable, unsupervised method to significantly improve multi-step reasoning models. While Paper 2 provides highly valuable analytical insights into evaluation flaws regarding compositional reasoning, Paper 1 presents a concrete algorithmic advancement with broader, more immediate applicability to the development of next-generation reasoning AI systems.

vs. Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and broadly relevant problem in LLM evaluation—revealing that aggregate benchmark scores can mask critical failures in compositional reasoning. The 'composition collapse' phenomenon and the double-gate protocol offer novel diagnostic tools applicable across the entire post-training evaluation landscape. Paper 2 tackles a more niche problem (multi-stakeholder alignment) with a useful but narrower contribution. Paper 1's findings have broader implications for how the field evaluates and develops LLMs, likely influencing evaluation practices, benchmark design, and post-training methodology more widely.

vs. A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a novel conceptual framework ('composition collapse') and a rigorous diagnostic protocol that fundamentally challenges how the field evaluates LLM reasoning capabilities. It reveals that aggregate benchmarks can be deeply misleading about multi-hop reasoning, with implications for how all post-training methods are evaluated. This methodological contribution has broad impact across AI evaluation, reasoning research, and model development. Paper 1 contributes a useful but relatively niche medical speech dataset with straightforward LLM benchmarking, offering more incremental impact within a specific application domain.

vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

gpt-5.2 · 5/27/2026

Paper 2 introduces a broadly applicable conceptual and methodological contribution: it identifies “composition collapse,” shows aggregate multi-hop scores can be misleading, and proposes a double-gate evaluation protocol that decomposes gains into atomic stability, residual composition, and critical depth. This reframes how post-training and reasoning improvements should be measured across many LLM domains, with immediate relevance to current evaluation practices. Paper 1 is impactful for clinical NLP and guideline-based supervision, but its scope is narrower and more application-specific, whereas Paper 2’s insights and metrics can influence evaluation and training claims across fields.