Inducing Reasoning Primitives from Agent Traces

Zhihan Lei, Jiarui Yan, Joshua Momo, William W. Cohen

Jun 2, 2026

arXiv:2606.02994v1

cs.AI (primary), cs.CL

#1160 of 3355 · Artificial Intelligence

Tournament Score: 1437±46

Win Rate: 68%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Abstract

ReAct-style LLM agents often rediscover the same reasoning routines across problems, yet leave those routines trapped in transient scratchpads. We introduce Reasoning Primitive Induction, a single-pass method that mines successful ReAct traces, clusters recurrent reasoning moves, and converts the most frequent moves into a compact library of typed pseudo-tools. Each pseudo-tool is specified by a natural-language docstring interpreted by an LLM at invocation time, and a standard ReAct loop composes these primitives at test time. The central result is that induced libraries outperform the very agent that generated their traces: by +44pp on RuleArena NBA (30 -> 74), +30pp on MuSR team allocation (38 -> 68), and +22pp on NatPlan meeting planning (7 -> 29). Across five comparable subtasks spanning narrative deduction, rule application, and constraint-satisfaction planning, a single fixed configuration improves over zero-shot Chain-of-Thought on every subtask, matches or surpasses expert-authored decompositions, and outperforms AWM at lower average inference cost.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Inducing Reasoning Primitives from Agent Traces"

1. Core Contribution

The paper introduces Reasoning Primitive Induction (RPI), a single-pass pipeline that extracts recurring reasoning patterns from successful ReAct agent traces, clusters them into canonical categories, and synthesizes them into typed pseudo-tools with natural-language docstrings. The key insight is that LLM agents repeatedly reinvent the same reasoning subroutines across problem instances, and by crystallizing these into a stable, named library, the agent can outperform its own source traces—sometimes dramatically.

The most striking claim is "induction exceeds its source": the induced library outperforms the very agent that generated the training traces by up to +44 percentage points (RuleArena NBA). The proposed mechanism is implicit aggregation—synthesis sees representative thoughts from many successful rollouts and writes a corpus-level specification, replacing the high-variance per-instance reconstruction that the source agent performs. This is a genuine conceptual advance over prior trace-induction work (AWM, ASI, SkillWeaver), which primarily repackages complete workflows or executable skills rather than extracting individual reasoning moves.
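The extract, cluster, and synthesize steps described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the keyword-based `label_move` function is a hypothetical stand-in for the clustering step, the docstring template is a placeholder for synthesis, and the names `InducedPrimitive` and `induce_library` are assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class InducedPrimitive:
    name: str       # canonical label of the reasoning move
    docstring: str  # natural-language specification (placeholder text here)
    support: int    # number of thoughts from correct traces with this label


def label_move(thought):
    """Hypothetical stand-in for clustering: map a thought to a move label."""
    text = thought.lower()
    if "alibi" in text:
        return "verify_alibi_consistency"
    if "rule" in text:
        return "apply_rule"
    return None  # thoughts that match no known move are ignored


def induce_library(traces, k):
    """Build a library of the k most frequent moves found in correct traces.

    traces: list of (thoughts, is_correct) pairs, where thoughts is a list of str.
    """
    counts = Counter()
    examples = {}
    for thoughts, is_correct in traces:
        if not is_correct:
            continue  # only successful rollouts contribute
        for thought in thoughts:
            label = label_move(thought)
            if label is None:
                continue
            counts[label] += 1
            examples.setdefault(label, []).append(thought)
    library = []
    for label, n in counts.most_common(k):
        doc = "Placeholder spec built from %d thoughts, e.g. %r" % (n, examples[label][0])
        library.append(InducedPrimitive(label, doc, n))
    return library


if __name__ == "__main__":
    demo = [
        (["Check the alibi of Ann", "Apply rule 3 to the trade"], True),
        (["Compare alibi timelines"], True),
        (["Apply rule 9"], False),
    ]
    for primitive in induce_library(demo, k=2):
        print(primitive.name, primitive.support)
```

The point of the sketch is the aggregation step: each library entry is written from thoughts pooled across many successful rollouts rather than from a single trace.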

2. Methodological Rigor

Strengths in experimental design:

The paper evaluates across three task families (6 subtasks) spanning narrative deduction, rule application, and constraint-satisfaction planning—a reasonable diversity of reasoning challenges.

Five baselines isolate distinct alternative explanations: CoT (is an agent loop needed?), ReAct (does any structuring help?), hand-designed primitives (does discovery match expert design?), AWM (closest method-level cousin), and Program-of-Thoughts (is code-as-reasoning better?).

Bootstrap confidence intervals with paired comparisons (Table 11) provide statistical grounding. The "induction exceeds source" claim is supported by strictly positive CIs on 5/6 subtasks.

A compute-matched control (Self-Consistency @N=20) rules out the "more tokens" explanation for gains over CoT.

Cross-model generalization to Gemini Flash Lite (Table 2) addresses backbone-specificity concerns.

Cross-seed robustness (Table 6) shows consistent rank ordering across three seeds.
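For readers unfamiliar with the paired bootstrap mentioned among the strengths above, a minimal sketch follows. It assumes per-instance 0/1 correctness scores for two methods evaluated on the same instances; the function name, the resample count, and the percentile construction are illustrative choices, not details taken from the paper.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a minus b).

    scores_a and scores_b are per-instance scores on the same instances.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample instances with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


if __name__ == "__main__":
    induced = [1, 1, 0, 1, 1, 0, 1, 1]
    source = [0, 1, 0, 0, 1, 0, 1, 0]
    print(paired_bootstrap_ci(induced, source))
```

A "strictly positive CI" in the sense used above corresponds to a lower bound greater than zero.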

Concerns:

The evaluation uses greedy decoding (temperature 0), which is standard but means results reflect single-point estimates rather than distributional behavior. The cross-seed analysis partially addresses this.

The LLM-judge evaluation for NatPlan is validated by cross-judge agreement (96.6%), which is reassuring, but relying on a judge still adds a layer of evaluation uncertainty. The strict parser yields near-zero scores for all methods, suggesting the tasks may not be well suited to automated evaluation.

NatPlan trip is effectively excluded from meaningful comparison since all methods score below 10%, reducing the effective evaluation to 5 subtasks.

The cross-model experiment (Table 2) tests only 2 subtasks on one additional model—broader validation would strengthen claims.

The method uses only correct rollouts for induction, which presupposes access to ground-truth labels—a form of supervision that should be more prominently discussed.
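The cross-judge agreement figure cited in the concerns above is presumably a simple match rate between two judges' verdicts on the same outputs. A minimal sketch under that assumption (the function name and the boolean verdict format are hypothetical):

```python
def agreement_rate(judge_a, judge_b):
    """Fraction of items on which two judges return the same verdict."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty verdict lists of equal length")
    matches = sum(1 for a, b in zip(judge_a, judge_b) if a == b)
    return matches / len(judge_a)


if __name__ == "__main__":
    # Toy verdicts only; these are not data from the paper.
    print(agreement_rate([True, True, False, True], [True, False, False, True]))
```

Raw agreement does not correct for chance agreement, so a chance-adjusted statistic may be worth reporting alongside it when verdicts are heavily skewed toward one class.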

3. Potential Impact

Immediate practical impact: The method offers a lightweight, training-free approach to improving LLM agent performance. With only two free parameters, three prompts, and no per-task tuning, it provides a low-barrier entry point for practitioners seeking to improve reasoning agents without expert decomposition engineering.

Conceptual contribution: The idea that agent traces contain a latent "reasoning vocabulary" that can be extracted and reified is powerful. The distinction between reasoning-space primitives (verify_alibi_consistency) and environment-grounded actions is well-articulated and opens a new design space between pure prompting and full tool creation.

Broader implications: The work connects to library learning in program synthesis (DreamCoder, LILO) and could catalyze a research direction on automated discovery of reasoning modules. The observation that induced libraries can exceed their source is particularly thought-provoking—it suggests that LLM agents have latent capabilities that are inconsistently expressed, and that trace analysis can surface and stabilize these capabilities.

Limitations on impact: The method currently requires re-induction per task family, limiting its generality. Cross-family transfer is explicitly left as future work. The failure on arithmetic-heavy tasks (RuleArena airline/tax) reveals a fundamental limitation of LLM-interpreted docstrings for deterministic operations.
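To make the pseudo-tool idea concrete, here is a minimal sketch of a primitive whose behavior is defined only by a docstring that a model interprets at call time. The prompt wording, the `PseudoTool` class, and the `stub_llm` function are assumptions for illustration; a real deployment would pass a client for an actual model where the stub is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PseudoTool:
    name: str
    docstring: str  # the only "implementation" the tool has


def build_prompt(tool, arguments):
    """Render the docstring and call arguments into a prompt for the model."""
    lines = [
        "You are executing the reasoning primitive `%s`." % tool.name,
        "Specification: %s" % tool.docstring,
        "Arguments:",
    ]
    for key in sorted(arguments):
        lines.append("- %s: %s" % (key, arguments[key]))
    lines.append("Return only the result.")
    return "\n".join(lines)


def invoke(tool, arguments, llm):
    """Run the tool by asking the supplied model callable to follow its docstring."""
    return llm(build_prompt(tool, arguments))


def stub_llm(prompt):
    """Stand-in for a real model call; it only reports the prompt size."""
    return "[stub model saw %d prompt lines]" % len(prompt.splitlines())


if __name__ == "__main__":
    tool = PseudoTool(
        "verify_alibi_consistency",
        "Check whether the suspect's stated alibi is consistent with the timeline.",
    )
    print(invoke(tool, {"suspect": "Ann", "timeline": "seen at 9pm"}, stub_llm))
```

Because the result comes from a model reading the specification rather than from deterministic code, this design is consistent with the weakness on arithmetic-heavy tasks noted above.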

4. Timeliness & Relevance

This paper addresses a current bottleneck in LLM agent design: the gap between the reasoning patterns agents implicitly discover and the structured tools they can reliably deploy. As the community moves toward more capable agentic systems, methods for automatically improving agent behavior without weight updates are highly relevant. The work arrives at an opportune time when ReAct-style agents are widely deployed but their efficiency and reliability remain challenging.

5. Strengths & Limitations

Key Strengths:

The "exceeds its source" finding is genuinely surprising and well-supported empirically

Minimal supervision requirements (no expert tool design, no domain-specific actions)

Clean algorithmic presentation with only two free parameters

Thorough ablation studies (K sensitivity, deployment factorial, dispatch comparison)

The pseudo-tool abstraction elegantly handles reasoning moves that lack clean deterministic implementations

Code and libraries are publicly released

Notable Weaknesses:

Evaluation is limited to 5 effectively comparable subtasks—a modest evaluation surface

Dependence on ground-truth labels for filtering successful traces

No cross-family transfer experiments

The method adds significant compute overhead (~117K tokens/instance vs ~5-6K for CoT), though it's cheaper than AWM

The gains over Program-of-Thoughts are modest and mostly non-significant (only meeting planning is significant)

The paper is from June 2026 and uses DeepSeek-V3; results may shift with newer models that have better baseline reasoning

Additional observations: The paper is well-written with clear figures and comprehensive appendices. The gallery of induced primitives (Appendix A) provides excellent transparency into what the method actually discovers. The honest treatment of failure modes (arithmetic tasks, NatPlan trip) strengthens credibility.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Generated Jun 3, 2026

Comparison History (19)

vs. Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

claude-opus-4.6 · 6/5/2026

Paper 1 demonstrates substantial empirical improvements (+22-44pp) on multiple benchmarks through a novel, practical method of extracting reusable reasoning primitives from agent traces. It addresses a concrete limitation of current LLM agents with immediate applicability. Paper 2 provides rigorous formal foundations for tool protocols (SGD/MCP bisimilarity), which is theoretically valuable but targets a narrower audience and has less immediate practical impact. Paper 1's approach of self-improving agents through primitive induction is more likely to be widely adopted and extended, driving broader impact across the LLM agent community.

vs. Can Generalist Agents Automate Data Curation?

claude-opus-4.6 · 6/5/2026

Paper 2 addresses the highly consequential problem of automating data curation for AI development, introducing a novel benchmark (Curation-Bench) and identifying the important 'execution-research gap' in agent behavior. Its findings—that scaffolded agents can compose data-selection policies outperforming published baselines at 1/10th the data budget—have broad practical implications for AI development pipelines. While Paper 1 presents a clever technique for inducing reasoning primitives with strong empirical gains, Paper 2 tackles a more foundational bottleneck in AI development, has broader cross-field applicability, and provides an open-sourced benchmark likely to catalyze further research.

vs. Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

gemini-3.1 · 6/5/2026

Paper 1 challenges the fundamental assumption that all errors carry equal weight, shifting test-time compute allocation from difficulty-based to risk-aware routing. This conceptual pivot has profound implications for AI safety, reliability, and real-world deployment economics, aligning perfectly with current trends in scalable test-time compute. While Paper 2 offers a strong methodological improvement for agents, Paper 1 addresses a broader systemic gap in how AI models are deployed in high-stakes environments.

vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

Paper 1 introduces a more novel concept—automatically extracting reusable reasoning primitives from agent traces and forming a pseudo-tool library that outperforms the generating agent itself. This 'self-improvement' paradigm is highly innovative and has broad implications for agent architecture, meta-learning, and program synthesis. The dramatic performance gains (+44pp, +30pp, +22pp) demonstrate strong empirical results. Paper 2 addresses the important but narrower problem of tool abuse with an optimization framework, offering incremental improvements over existing RL methods. While solid, it tackles a more specific efficiency concern rather than opening a fundamentally new research direction.

vs. Effect of Demographic Bias on Skin Lesion Classification

gpt-5.2 · 6/3/2026

Paper 2 is more novel and broadly impactful: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, applicable across many LLM agent tasks. The reported gains are large across diverse benchmarks and suggest practical improvements in capability and inference efficiency, making it timely for current agent research. Paper 1 addresses an important, real-world problem (demographic bias in dermatology AI) with solid experimental structure, but its methodological contribution is more incremental and its impact is narrower to medical imaging fairness, with limited evidence of a new widely transferable technique.

vs. What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it reframes agent evaluation around abstention competence, identifying a systemic benchmark blind spot (compliance bias) with clear safety implications and broad relevance to RLHF, alignment, evaluation, and deployment governance. It contributes a general taxonomy plus concrete, portable metrics and protocols, and backs them with multi-model results over 144 enterprise scenarios—supporting methodological rigor and real-world applicability. Paper 1 is technically strong and yields large gains, but it is more incremental within agent prompting/tooling and narrower in cross-field implications than a new evaluation paradigm for safe autonomy.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 1 has higher estimated impact due to a more novel and broadly applicable method: automatically inducing reusable “reasoning primitives” from agent traces, yielding large, quantitative gains across multiple reasoning/planning benchmarks and lowering inference cost. The approach is timely for LLM agent efficiency and tool-use, and could generalize across many downstream tasks and agent frameworks. Paper 2 is innovative for social simulation interpretability, but its evaluation appears narrower (one domain scenario) and its impact may be more specialized to computational social science rather than general LLM capability improvement.

vs. Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture

claude-opus-4.6 · 6/3/2026

Paper 1 presents a novel, concrete method (Reasoning Primitive Induction) with strong empirical results showing significant performance improvements (+22-44pp) across multiple benchmarks. It addresses a practical problem with a replicable methodology and clear quantitative gains. Paper 2, while intellectually ambitious in proposing ICAM as a unified framework for model-native computing, is explicitly a conceptual/survey contribution without new experiments. Its proposed 'laws' and six-layer architecture, while thought-provoking, lack empirical validation from original work. Paper 1's actionable, experimentally-validated contribution is more likely to drive follow-on research and adoption.

vs. COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

gemini-3.1 · 6/3/2026

COMAP addresses a fundamental bottleneck in autonomous agents by enabling the continuous, closed-loop co-evolution of world models and policies without relying on external rewards. This self-improving framework advances the critical frontier of world-modeling in AI, offering broader theoretical implications for interactive AI, embodied planning, and long-horizon decision-making compared to the task-specific reasoning extraction in Paper 1.

vs. CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations

gemini-3.1 · 6/3/2026

Paper 1 presents a foundational methodological advancement in LLM agent reasoning by automating the discovery of reusable reasoning primitives. Its generalizable approach offers broad impact across numerous AI applications, outperforming existing baselines across diverse tasks. While Paper 2 provides high value in a specific domain (drug discovery), Paper 1's domain-agnostic framework for self-improving agents is likely to drive wider adoption, stimulate more subsequent research, and have a more profound, cross-disciplinary scientific impact in the rapidly evolving field of AI.

vs. ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

gpt-5.2 · 6/3/2026

Paper 2 is more novel and broadly applicable: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, yielding a compact pseudo-tool library that improves performance and reduces inference cost across diverse reasoning/planning tasks. This has wide impact potential for agentic LLM design, efficiency, and automation beyond any single domain. Paper 1 is timely and valuable for healthcare evaluation, but as a benchmark its impact is narrower, more dataset-dependent, and primarily methodological/assessment-focused rather than introducing a generalizable new learning/control mechanism.

vs. CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

gpt-5.2 · 6/3/2026

Paper 2 is more novel and broadly impactful: it introduces a general, single-pass mechanism to distill reusable “reasoning primitives” from agent traces into typed pseudo-tools, enabling compositional reuse at test time. The approach is widely applicable across agentic reasoning tasks (deduction, rule use, planning) and shows large, consistent gains (up to +44pp) while reducing inference cost, suggesting strong real-world utility for scalable agent deployment. Paper 1 is methodologically solid and timely for RLVR training, but is narrower (verifier-specific training trick) and shows modest gains (~3.8pp) on QA benchmarks.

vs. DMF: A Deterministic Memory Framework for Conversational AI Agents

gemini-3.1 · 6/3/2026

Paper 2 addresses a fundamental challenge in LLM agents—autonomous skill acquisition—by automatically inducing reusable reasoning primitives from successful traces. This method not only improves reasoning capabilities but also generalizes across multiple tasks, offering a potentially transformative approach to agent self-improvement. Paper 1 offers a highly practical optimization for agent memory costs and determinism, but Paper 2's contribution to automated reasoning decomposition and dynamic tool creation is likely to have a broader and more profound impact on the development of advanced autonomous agents.

vs. Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

gemini-3.1 · 6/3/2026

Paper 1 addresses a highly timely and widely applicable problem in the booming field of LLM agents. By automating the induction of reasoning primitives, it demonstrates massive empirical improvements across diverse tasks. Its immediate practical utility and relevance to AI problem-solving give it a broader and more immediate potential scientific impact compared to the fundamental, but more niche, theoretical contributions to causal inference in Paper 2.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

gemini-3.1 · 6/3/2026

While Paper 1 offers a strong methodological advancement in LLM reasoning, Paper 2 addresses a fundamental bottleneck in Brain-Computer Interfaces. By pioneering cross-task continual learning for EEG foundation models, it paves the way for a unified 'one-for-all' brain decoding system. This promises profound real-world applications in healthcare and accessibility, representing a major paradigm shift rather than an algorithmic optimization.

vs. From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

gemini-3.1 · 6/3/2026

Paper 2 proposes a generalizable method for improving LLM agent reasoning by inducing reusable pseudo-tools from agent traces, demonstrating significant performance gains across diverse tasks. Its broad applicability across various domains of AI research gives it higher potential impact compared to Paper 1, which focuses on a domain-specific (chemistry) evaluation benchmark, albeit a rigorous and necessary one.

vs. Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories

gpt-5.2 · 6/3/2026

Paper 2 is likely to have higher scientific impact due to broader applicability and timeliness: process-level reliability and auditing for deep-research agents is a cross-cutting need for deployment, evaluation, and safety. It contributes a sizable real-trajectory dataset, a benchmark (TELBench), and an auditing method (DRIFT), enabling standardized comparisons and follow-on work across agent frameworks and model families. Paper 1 is novel and shows strong performance gains, but its impact is more narrowly centered on ReAct-style prompting and tool induction, with fewer direct implications for reliability/oversight.

vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

gemini-3.1 · 6/3/2026

Paper 1 introduces a highly novel, broadly applicable paradigm for LLM agents by converting transient reasoning traces into reusable pseudo-tools. It achieves massive performance gains (up to +44 percentage points) over strong baselines in a rapidly evolving, high-impact field. In contrast, Paper 2 offers incremental architectural improvements (e.g., column masking, TF-IDF encoding) to an existing graph transformer for a specific database benchmark. Paper 1's fundamental contribution to autonomous AI reasoning dictates a significantly higher potential for widespread scientific impact.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

gemini-3.1 · 6/3/2026

Paper 2 presents a broadly applicable method for improving general LLM agent reasoning by extracting reusable primitives. Its substantial performance gains across diverse tasks (planning, rule application, narrative deduction) highlight high potential for widespread impact in AI. Paper 1, while innovative, addresses the much narrower domain of formal mathematical proof refactoring, limiting its immediate breadth of impact compared to the general-purpose reasoning advancements in Paper 2.