Inducing Reasoning Primitives from Agent Traces
Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen
Abstract
ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.
AI Impact Assessments (1 model)
Scientific Impact Assessment: "Inducing Reasoning Primitives from Agent Traces"
1. Core Contribution
The paper introduces Reasoning Primitive Induction (RPI), a single-pass pipeline that extracts recurring reasoning patterns from successful ReAct agent traces, clusters them into canonical categories, and synthesizes them into typed pseudo-tools with natural-language docstrings. The key insight is that LLM agents repeatedly reinvent the same reasoning subroutines across problem instances, and that by crystallizing these into a stable, named library, an agent using the library can outperform the source agent that produced the traces, sometimes dramatically.
The most striking claim is "induction exceeds its source": the induced library outperforms the very agent that generated the training traces by up to +44 percentage points (RuleArena NBA). The proposed mechanism is implicit aggregation—synthesis sees representative thoughts from many successful rollouts and writes a corpus-level specification, replacing the high-variance per-instance reconstruction that the source agent performs. This is a genuine conceptual advance over prior trace-induction work (AWM, ASI, SkillWeaver), which primarily repackages complete workflows or executable skills rather than extracting individual reasoning moves.
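The mine, cluster, and synthesize steps described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes each trace step already carries a short move label, it substitutes grouping by normalized label for the paper's clustering step, and it uses a template function where the paper uses an LLM synthesis prompt. The names `PseudoTool`, `induce_library`, `template_synthesize`, `top_k`, and `examples_per_move` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class PseudoTool:
    name: str
    docstring: str
    support: int  # number of trace steps that mapped to this move


def induce_library(
    traces: list[list[tuple[str, str]]],
    synthesize: Callable[[str, list[str]], str],
    top_k: int = 3,
    examples_per_move: int = 2,
) -> list[PseudoTool]:
    """Build a library from successful traces of (move_label, thought) steps."""
    # Stand-in for clustering: group thoughts by a normalized move label.
    thoughts_by_move: dict[str, list[str]] = defaultdict(list)
    for trace in traces:
        for label, thought in trace:
            key = label.strip().lower().replace(" ", "_")
            thoughts_by_move[key].append(thought)

    # Keep the most frequent moves (the sort is stable, so ties keep first-seen order).
    ranked = sorted(thoughts_by_move.items(), key=lambda kv: len(kv[1]), reverse=True)

    library = []
    for name, thoughts in ranked[:top_k]:
        # Synthesis sees representative thoughts from several rollouts at once.
        doc = synthesize(name, thoughts[:examples_per_move])
        library.append(PseudoTool(name=name, docstring=doc, support=len(thoughts)))
    return library


def template_synthesize(name: str, examples: list[str]) -> str:
    # Hypothetical stand-in for an LLM call that writes a corpus-level docstring.
    return f"{name}: generalizes {len(examples)} example thought(s), e.g. {examples[0]!r}"


traces = [
    [("verify alibi consistency", "Check the suspect's alibi against the timeline."),
     ("rank suspects", "Order suspects by motive and opportunity.")],
    [("Verify alibi consistency", "Compare the stated alibi with witness statements."),
     ("summarize clues", "List the clues found so far.")],
    [("verify alibi consistency", "Does the alibi cover the time of the crime?")],
]
library = induce_library(traces, template_synthesize, top_k=2)
for tool in library:
    print(tool.name, tool.support)
```

The point of the sketch is the aggregation step: the docstring for a move is written from thoughts drawn across rollouts rather than from any single trace.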
2. Methodological Rigor
Strengths in experimental design:
Concerns:
3. Potential Impact
Immediate practical impact: The method offers a lightweight, training-free approach to improving LLM agent performance. With only two free parameters, three prompts, and no per-task tuning, it provides a low-barrier entry point for practitioners seeking to improve reasoning agents without expert decomposition engineering.
Conceptual contribution: The idea that agent traces contain a latent "reasoning vocabulary" that can be extracted and reified is powerful. The distinction between reasoning-space primitives (verify_alibi_consistency) and environment-grounded actions is well-articulated and opens a new design space between pure prompting and full tool creation.
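To make the notion of a reasoning-space primitive concrete, the sketch below shows one way a typed pseudo-tool could be represented and invoked: the function has a signature and a docstring but no body, and a language model interprets the docstring at call time. Only the name `verify_alibi_consistency` comes from the paper; its parameters, the docstring wording, the prompt layout, and the helpers `invoke_pseudo_tool` and `stub_llm` are assumptions made for illustration.

```python
import inspect
from typing import Callable


def verify_alibi_consistency(alibi: str, timeline: str) -> str:
    """Decide whether the alibi is consistent with the timeline.

    Return 'consistent' or 'inconsistent' followed by a one-sentence reason.
    """
    # Hypothetical pseudo-tool: there is no executable body by design.
    raise NotImplementedError("pseudo-tool: the docstring is the specification")


def invoke_pseudo_tool(tool: Callable, llm: Callable[[str], str], **kwargs: str) -> str:
    """Build a prompt from the tool's signature and docstring, then ask the LLM."""
    sig = inspect.signature(tool)
    bound = sig.bind(**kwargs)  # raises TypeError on missing or unknown arguments
    spec = inspect.getdoc(tool)
    args = "\n".join(f"{k} = {v!r}" for k, v in bound.arguments.items())
    prompt = (
        f"Tool: {tool.__name__}{sig}\n"
        f"Specification:\n{spec}\n"
        f"Arguments:\n{args}\n"
        f"Result:"
    )
    return llm(prompt)


seen_prompts: list[str] = []


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call; records the prompt and returns a canned answer.
    seen_prompts.append(prompt)
    return "inconsistent: the alibi places the suspect elsewhere at 9pm."


result = invoke_pseudo_tool(
    verify_alibi_consistency,
    stub_llm,
    alibi="I was at the cinema until 10pm.",
    timeline="A witness saw the suspect at the docks at 9pm.",
)
print(result)
```

Because the result comes from a model reading the docstring rather than from executed code, a primitive of this kind suits judgment-style moves better than exact computation.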
Broader implications: The work connects to library learning in program synthesis (DreamCoder, LILO) and could catalyze a research direction on automated discovery of reasoning modules. The observation that induced libraries can exceed their source is particularly thought-provoking—it suggests that LLM agents have latent capabilities that are inconsistently expressed, and that trace analysis can surface and stabilize these capabilities.
Limitations on impact: The method currently requires re-induction per task family, limiting its generality. Cross-family transfer is explicitly left as future work. The failure on arithmetic-heavy tasks (RuleArena airline/tax) reveals a fundamental limitation of LLM-interpreted docstrings for deterministic operations.
4. Timeliness & Relevance
This paper addresses a current bottleneck in LLM agent design: the gap between the reasoning patterns agents implicitly discover and the structured tools they can reliably deploy. As the community moves toward more capable agentic systems, methods for automatically improving agent behavior without weight updates are highly relevant. The work arrives at an opportune time when ReAct-style agents are widely deployed but their efficiency and reliability remain challenging.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional observations: The paper is well-written with clear figures and comprehensive appendices. The gallery of induced primitives (Appendix A) provides excellent transparency into what the method actually discovers. The honest treatment of failure modes (arithmetic tasks, NatPlan trip) strengthens credibility.
Generated Jun 3, 2026
Comparison History (19)
Paper 1 demonstrates substantial empirical improvements (+22-44pp) on multiple benchmarks through a novel, practical method of extracting reusable reasoning primitives from agent traces. It addresses a concrete limitation of current LLM agents with immediate applicability. Paper 2 provides rigorous formal foundations for tool protocols (SGD/MCP bisimilarity), which is theoretically valuable but targets a narrower audience and has less immediate practical impact. Paper 1's approach of self-improving agents through primitive induction is more likely to be widely adopted and extended, driving broader impact across the LLM agent community.
Paper 2 addresses the highly consequential problem of automating data curation for AI development, introducing a novel benchmark (Curation-Bench) and identifying the important 'execution-research gap' in agent behavior. Its findings—that scaffolded agents can compose data-selection policies outperforming published baselines at 1/10th the data budget—have broad practical implications for AI development pipelines. While Paper 1 presents a clever technique for inducing reasoning primitives with strong empirical gains, Paper 2 tackles a more foundational bottleneck in AI development, has broader cross-field applicability, and provides an open-sourced benchmark likely to catalyze further research.
Paper 1 challenges the fundamental assumption that all errors carry equal weight, shifting test-time compute allocation from difficulty-based to risk-aware routing. This conceptual pivot has profound implications for AI safety, reliability, and real-world deployment economics, aligning perfectly with current trends in scalable test-time compute. While Paper 2 offers a strong methodological improvement for agents, Paper 1 addresses a broader systemic gap in how AI models are deployed in high-stakes environments.
Paper 1 introduces a more novel concept—automatically extracting reusable reasoning primitives from agent traces and forming a pseudo-tool library that outperforms the generating agent itself. This 'self-improvement' paradigm is highly innovative and has broad implications for agent architecture, meta-learning, and program synthesis. The dramatic performance gains (+44pp, +30pp, +22pp) demonstrate strong empirical results. Paper 2 addresses the important but narrower problem of tool abuse with an optimization framework, offering incremental improvements over existing RL methods. While solid, it tackles a more specific efficiency concern rather than opening a fundamentally new research direction.
Paper 2 is more novel and broadly impactful: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, applicable across many LLM agent tasks. The reported gains are large across diverse benchmarks and suggest practical improvements in capability and inference efficiency, making it timely for current agent research. Paper 1 addresses an important, real-world problem (demographic bias in dermatology AI) with solid experimental structure, but its methodological contribution is more incremental and its impact is narrower to medical imaging fairness, with limited evidence of a new widely transferable technique.
Paper 2 likely has higher impact: it reframes agent evaluation around abstention competence, identifying a systemic benchmark blind spot (compliance bias) with clear safety implications and broad relevance to RLHF, alignment, evaluation, and deployment governance. It contributes a general taxonomy plus concrete, portable metrics and protocols, and backs them with multi-model results over 144 enterprise scenarios—supporting methodological rigor and real-world applicability. Paper 1 is technically strong and yields large gains, but it is more incremental within agent prompting/tooling and narrower in cross-field implications than a new evaluation paradigm for safe autonomy.
Paper 1 has higher estimated impact due to a more novel and broadly applicable method: automatically inducing reusable “reasoning primitives” from agent traces, yielding large, quantitative gains across multiple reasoning/planning benchmarks and lowering inference cost. The approach is timely for LLM agent efficiency and tool-use, and could generalize across many downstream tasks and agent frameworks. Paper 2 is innovative for social simulation interpretability, but its evaluation appears narrower (one domain scenario) and its impact may be more specialized to computational social science rather than general LLM capability improvement.
Paper 1 presents a novel, concrete method (Reasoning Primitive Induction) with strong empirical results showing significant performance improvements (+22-44pp) across multiple benchmarks. It addresses a practical problem with a replicable methodology and clear quantitative gains. Paper 2, while intellectually ambitious in proposing ICAM as a unified framework for model-native computing, is explicitly a conceptual/survey contribution without new experiments. Its proposed 'laws' and six-layer architecture, while thought-provoking, lack empirical validation from original work. Paper 1's actionable, experimentally-validated contribution is more likely to drive follow-on research and adoption.
COMAP addresses a fundamental bottleneck in autonomous agents by enabling the continuous, closed-loop co-evolution of world models and policies without relying on external rewards. This self-improving framework advances the critical frontier of world-modeling in AI, offering broader theoretical implications for interactive AI, embodied planning, and long-horizon decision-making compared to the task-specific reasoning extraction in Paper 1.
Paper 1 presents a foundational methodological advancement in LLM agent reasoning by automating the discovery of reusable reasoning primitives. Its generalizable approach offers broad impact across numerous AI applications, outperforming existing baselines across diverse tasks. While Paper 2 provides high value in a specific domain (drug discovery), Paper 1's domain-agnostic framework for self-improving agents is likely to drive wider adoption, stimulate more subsequent research, and have a more profound, cross-disciplinary scientific impact in the rapidly evolving field of AI.
Paper 2 is more novel and broadly applicable: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, yielding a compact pseudo-tool library that improves performance and reduces inference cost across diverse reasoning/planning tasks. This has wide impact potential for agentic LLM design, efficiency, and automation beyond any single domain. Paper 1 is timely and valuable for healthcare evaluation, but as a benchmark its impact is narrower, more dataset-dependent, and primarily methodological/assessment-focused rather than introducing a generalizable new learning/control mechanism.
Paper 2 is more novel and broadly impactful: it introduces a general, single-pass mechanism to distill reusable “reasoning primitives” from agent traces into typed pseudo-tools, enabling compositional reuse at test time. The approach is widely applicable across agentic reasoning tasks (deduction, rule use, planning) and shows large, consistent gains (up to +44pp) while reducing inference cost, suggesting strong real-world utility for scalable agent deployment. Paper 1 is methodologically solid and timely for RLVR training, but is narrower (verifier-specific training trick) and shows modest gains (~3.8pp) on QA benchmarks.
Paper 2 addresses a fundamental challenge in LLM agents—autonomous skill acquisition—by automatically inducing reusable reasoning primitives from successful traces. This method not only improves reasoning capabilities but also generalizes across multiple tasks, offering a potentially transformative approach to agent self-improvement. Paper 1 offers a highly practical optimization for agent memory costs and determinism, but Paper 2's contribution to automated reasoning decomposition and dynamic tool creation is likely to have a broader and more profound impact on the development of advanced autonomous agents.
Paper 1 addresses a highly timely and widely applicable problem in the booming field of LLM agents. By automating the induction of reasoning primitives, it demonstrates massive empirical improvements across diverse tasks. Its immediate practical utility and relevance to AI problem-solving give it a broader and more immediate potential scientific impact compared to the fundamental, but more niche, theoretical contributions to causal inference in Paper 2.
While Paper 1 offers a strong methodological advancement in LLM reasoning, Paper 2 addresses a fundamental bottleneck in Brain-Computer Interfaces. By pioneering cross-task continual learning for EEG foundation models, it paves the way for a unified 'one-for-all' brain decoding system. This promises profound real-world applications in healthcare and accessibility, representing a major paradigm shift rather than an algorithmic optimization.
Paper 2 proposes a generalizable method for improving LLM agent reasoning by inducing reusable pseudo-tools from agent traces, demonstrating significant performance gains across diverse tasks. Its broad applicability across various domains of AI research gives it higher potential impact compared to Paper 1, which focuses on a domain-specific (chemistry) evaluation benchmark, albeit a rigorous and necessary one.
Paper 2 is likely to have higher scientific impact due to broader applicability and timeliness: process-level reliability and auditing for deep-research agents is a cross-cutting need for deployment, evaluation, and safety. It contributes a sizable real-trajectory dataset, a benchmark (TELBench), and an auditing method (DRIFT), enabling standardized comparisons and follow-on work across agent frameworks and model families. Paper 1 is novel and shows strong performance gains, but its impact is more narrowly centered on ReAct-style prompting and tool induction, with fewer direct implications for reliability/oversight.
Paper 1 introduces a highly novel, broadly applicable paradigm for LLM agents by converting transient reasoning traces into reusable pseudo-tools. It achieves massive performance gains (up to +44 percentage points) over strong baselines in a rapidly evolving, high-impact field. In contrast, Paper 2 offers incremental architectural improvements (e.g., column masking, TF-IDF encoding) to an existing graph transformer for a specific database benchmark. Paper 1's fundamental contribution to autonomous AI reasoning dictates a significantly higher potential for widespread scientific impact.
Paper 2 presents a broadly applicable method for improving general LLM agent reasoning by extracting reusable primitives. Its substantial performance gains across diverse tasks (planning, rule application, narrative deduction) highlight high potential for widespread impact in AI. Paper 1, while innovative, addresses the much narrower domain of formal mathematical proof refactoring, limiting its immediate breadth of impact compared to the general-purpose reasoning advancements in Paper 2.