Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Junli Zha, Jinbo Wang, Chao Zhou, Xiang Song

Jun 9, 2026 · arXiv:2606.10457v1

cs.AI

#1803 of 3622 · Artificial Intelligence

Tournament Score: 1399 ± 43

Win Rate: 60%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6 · Clarity 7

Abstract

Decision rules that enterprise experts apply tacitly -- in auditing, compliance, and contract review -- can be systematically recovered and improved through iterative error analysis. We present Trace2Policy, whose core mechanism -- EISR (Error-driven Iterative Skill Refinement) -- maintains a human-readable rule document as its optimization target: each round executes the rules on a validation set, clusters errors by root cause into MISSING, WRONG, or CONFLICT types, applies targeted patches, and commits only those that pass a regression gate. For this class of compliance-sensitive, skewed-base-rate decision tasks, we identify rule quality -- not model capability -- as the dominant performance lever: across five LLMs, one-shot distillation plateaus near ~70% on the deployed pool, while eight EISR rounds lift the same rules to 79.6% when compiled into deterministic Python -- zero LLM calls at inference. Execution form compounds the gain: in production, the same EISR-refined content runs 9.8 pp higher as compiled Python than as an LLM prompt, a form-and-engineering bundle the 22-day deployment matured together. Deployed for 22 days at a major logistics carrier (3,349 audit cases), the compiled pipeline outperforms the pure-LLM baseline it replaced (72.7%); on these calibrated, skewed-base-rate workloads, re-enabling LLM fallback monotonically degrades accuracy. An LLM-driven variant, Auto-EISR, reproduces this refinement at $5-$10 per cycle versus ~70 expert-hours, and transfers to four public benchmarks spanning legal reasoning (LegalBench) and process-mining decisions (BPIC 2012) without re-engineering.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Trace2Policy

1. Core Contribution

Trace2Policy addresses a genuine gap in enterprise AI: extracting, refining, and deploying *interpretable decision rules* from expert behavior traces rather than training opaque models. The central algorithmic contribution, EISR (Error-driven Iterative Skill Refinement), is an iterative diagnose-and-patch loop that maintains a human-readable rule document as the optimization target. Each round classifies validation errors into MISSING, WRONG, or CONFLICT categories, clusters them by root cause, proposes targeted patches, and commits only those surviving a regression gate.
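The diagnose-and-patch round described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Rule` class, the first-match-wins `predict`, the `DEFAULT` label, the toy audit cases, and the gate criterion (validation accuracy must rise and accuracy on a separate regression set must not fall) are all hypothetical, and root-cause clustering is reduced to counting errors by type.

```python
from dataclasses import dataclass
from typing import Callable

Case = dict
Labeled = tuple[Case, str]  # (case features, gold label)


@dataclass(frozen=True)
class Rule:
    name: str
    applies: Callable[[Case], bool]
    label: str


DEFAULT = "approve"  # hypothetical label used when no rule fires


def predict(rules: list[Rule], case: Case) -> str:
    """First matching rule wins; otherwise fall back to DEFAULT."""
    for rule in rules:
        if rule.applies(case):
            return rule.label
    return DEFAULT


def accuracy(rules: list[Rule], data: list[Labeled]) -> float:
    return sum(predict(rules, c) == y for c, y in data) / len(data)


def error_type(rules: list[Rule], case: Case) -> str:
    """Map a misjudged case to an error type (assumed mapping)."""
    fired = [r for r in rules if r.applies(case)]
    if not fired:
        return "MISSING"   # no rule covers the case
    if len({r.label for r in fired}) > 1:
        return "CONFLICT"  # matching rules disagree with each other
    return "WRONG"         # matching rules agree on a wrong label


def diagnose(rules: list[Rule], data: list[Labeled]) -> dict[str, int]:
    counts = {"MISSING": 0, "WRONG": 0, "CONFLICT": 0}
    for case, gold in data:
        if predict(rules, case) != gold:
            counts[error_type(rules, case)] += 1
    return counts


Patch = Callable[[list[Rule]], list[Rule]]


def eisr_round(rules, patches: list[Patch], validation, regression):
    """Try each patch in order; commit only those that pass the gate."""
    for patch in patches:
        candidate = patch(rules)
        improves = accuracy(candidate, validation) > accuracy(rules, validation)
        safe = accuracy(candidate, regression) >= accuracy(rules, regression)
        if improves and safe:
            rules = candidate
    return rules


# Toy data, invented for illustration only.
def high_value(threshold: int) -> Rule:
    return Rule("high_value", lambda c: c["amount"] > threshold, "reject")


no_photo = Rule("no_photo", lambda c: not c["photo"], "reject")

initial = [high_value(1000)]
validation = [
    ({"amount": 1500, "photo": True}, "approve"),  # WRONG under initial rules
    ({"amount": 200, "photo": False}, "reject"),   # MISSING under initial rules
    ({"amount": 300, "photo": True}, "approve"),
    ({"amount": 2000, "photo": False}, "reject"),
]
regression = [
    ({"amount": 50, "photo": True}, "approve"),
    ({"amount": 5000, "photo": True}, "reject"),
]
patches = [
    lambda rs: [no_photo] + rs,                            # adds a missing rule
    lambda rs: [r for r in rs if r.name != "high_value"],  # fails the gate
    lambda rs: [high_value(1800) if r.name == "high_value" else r for r in rs],
]

before = diagnose(initial, validation)
refined = eisr_round(initial, patches, validation, regression)
print(before, [r.name for r in refined], accuracy(refined, validation))
```

In this toy run the second patch raises validation accuracy but breaks a regression case, so the gate discards it. The first and third patches are committed.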

The key insight is that for compliance-sensitive, skewed-base-rate decision tasks, rule quality dominates model capability as a performance lever. This is demonstrated across five LLMs where one-shot distillation plateaus near ~70%, while eight EISR rounds lift performance to 79.6%. Critically, the refined rules can be compiled into deterministic Python requiring zero LLM calls at inference. The Auto-EISR variant automates this at $5-$10 per cycle versus ~70 expert-hours.

2. Methodological Rigor

Strengths in evaluation design: The paper includes a real 22-day production deployment at a major logistics carrier (3,349 cases), which is rare in the agent/policy literature. The evaluation spans three production benchmarks (training, held-out, drift), multiple LLMs, and four public benchmarks (LegalBench, BPIC 2012). The authors are notably transparent about limitations—explicitly flagging the 9.8pp compilation gap as a "form-bundle observation" rather than a clean causal claim, acknowledging that auditor anchoring may inflate production accuracies, and noting that cross-regime behavior differs from within-regime behavior.

Weaknesses: Several methodological gaps limit confidence: (1) The held-out sets are small (40 and 139 cases), leaving many McNemar comparisons underpowered—the headline Auto-EISR result doesn't survive Bonferroni correction. (2) There is no ablation of individual EISR components (clustering mechanism, gate threshold, error taxonomy). (3) The 9.8pp gap between compiled Python and LLM prompt execution confounds execution form with 22 days of engineering—the authors acknowledge this but don't resolve it. (4) The "natural ground truth" from auditors who see the agent's recommendations is contaminated by automation bias, which the authors flag but cannot quantify. (5) The LegalBench and BPIC probes are shallow (N=64 and N=297 respectively) and use different configurations than the primary study.
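To make the power concern in point (1) concrete, here is a small sketch of an exact two-sided McNemar test followed by a Bonferroni check. The discordant counts (10 vs. 2) and the number of comparisons (5) are invented for illustration and are not taken from the paper. The function name `mcnemar_exact` is also hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = cases only system A gets right, c = cases only system B gets right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical counts on a small held-out set.
p_value = mcnemar_exact(10, 2)
num_comparisons = 5  # assumed size of the comparison family
alpha = 0.05

print(f"p = {p_value:.4f}")
print("significant uncorrected:", p_value < alpha)
print("significant after Bonferroni:", p_value < alpha / num_comparisons)
```

With only a dozen discordant pairs, a 10-to-2 split gives p of about 0.039. That clears 0.05 on its own but not the corrected threshold of 0.01, which is the kind of result a held-out set of 40 cases tends to produce.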

The theoretical analysis (Appendix C) casting EISR as version-space narrowing is informal and provides limited formal guarantees. The PAC-style bounds are loose and primarily illustrative.

3. Potential Impact

Practical impact could be significant for a specific but important class of problems: compliance-sensitive enterprise decisions with systematic but tacit rules. The framework's value proposition—auditable, version-controlled rule artifacts that persist across model upgrades and require zero LLM inference calls—addresses real enterprise concerns about LLM reliability, cost, and auditability.

The "trap rules" discovery (encoding state ambiguity, implicit action prefixes, claim mismatches) is compelling qualitative evidence that deep domain knowledge cannot be extracted through one-shot methods. This observation alone has design implications for any knowledge extraction system.

Adjacent field influence: The authority displacement finding—that providing unrefined rules to strong models *degrades* performance by 7-9pp—has broader implications for human-AI collaboration and prompt engineering. The observation that LLM fallback monotonically degrades accuracy on skewed-base-rate tasks challenges the common "rules + LLM safety net" design pattern.
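One simple way to see how an LLM fallback can lower accuracy monotonically is a two-component mixture. The sketch below is a toy model, not the paper's analysis: it assumes a fraction `f` of cases is routed to the LLM, and that each path keeps a fixed accuracy regardless of which cases it receives. It borrows the reported 79.6% (compiled rules) and 72.7% (pure-LLM baseline) as stand-in values. In practice the routed cases are likely harder than average, so real curves need not be linear.

```python
def blended_accuracy(f: float, acc_rules: float, acc_llm: float) -> float:
    """Overall accuracy when a fraction f of cases goes to the LLM path."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    return (1.0 - f) * acc_rules + f * acc_llm


RULES_ACC = 0.796  # reported accuracy of the compiled rules
LLM_ACC = 0.727    # reported accuracy of the pure-LLM baseline

# Accuracy at fallback fractions 0.0, 0.1, ..., 1.0.
curve = [blended_accuracy(i / 10, RULES_ACC, LLM_ACC) for i in range(11)]
for i, acc in enumerate(curve):
    print(f"fallback {i * 10:3d}% -> accuracy {acc:.3f}")
```

Under these assumptions every additional case sent to the weaker path costs accuracy, so the curve falls at every step. A fallback only helps if the LLM beats the rules on the specific cases routed to it.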

However, the narrow domain validation (primarily logistics damage audit, with probe-level tests on legal reasoning) limits confidence in generalizability claims.

4. Timeliness & Relevance

The paper is timely on multiple fronts: (1) Enterprise AI deployment is rapidly expanding but struggling with reliability and auditability requirements. (2) The tension between LLM capabilities and deterministic, auditable decision-making is a current industry bottleneck. (3) Self-evolving agent systems are an active research frontier, and the "natural data flywheel" concept leverages existing human review workflows elegantly. (4) The cost comparison ($5-$10 vs. ~70 expert-hours) addresses the practical economics of knowledge maintenance.

The positioning against GUI agents, process mining, prompt optimization (DSPy/MIPRO), and self-refine methods is well-articulated, identifying a genuine niche: externalized, interpretable decision rules as the refinement target.

5. Strengths & Limitations

Key Strengths:

Real deployment at scale: 22 days, 3,349 cases—rare in academic literature and provides genuine operational evidence.

Intellectual honesty: The authors are unusually forthcoming about confounds, scope limitations, and measurement gaps (form-bundle, anchoring bias, cross-regime behavior).

Practical architecture: The pipeline produces auditable artifacts usable without LLM inference, addressing real enterprise constraints.

Multi-model evaluation: Testing across 5-6 LLMs strengthens the "rule quality > model choice" claim.

Discovery of "trap rules": Compelling evidence that iterative error analysis surfaces knowledge invisible to one-shot extraction.

Notable Weaknesses:

Single primary domain: Despite probe-level cross-domain testing, the evidence base is overwhelmingly from one logistics audit task.

Small held-out sets and underpowered statistics: Many comparisons lack statistical significance after correction.

Missing ablations: No systematic decomposition of EISR's components limits understanding of which mechanisms drive improvement.

Confounded compilation claim: The headline 9.8pp gap mixes execution form with engineering effort.

Ground truth contamination: Production accuracy numbers may be inflated by automation bias in auditor labels.

Reproducibility constraints: Proprietary data prevents full reproduction; the promised code release covers only the orchestrator.

Overall Assessment

Trace2Policy presents a practically valuable framework for a well-defined problem class, backed by genuine production evidence that is rare in the literature. The core insight—that iteratively refined, externalized rules outperform both one-shot extraction and direct LLM prompting for compliance-sensitive decisions—is empirically supported within its stated scope. However, the paper's scientific contribution is limited by narrow domain validation, small sample sizes, missing ablations, and several confounded comparisons. It is best characterized as a strong systems/deployment paper with preliminary but promising algorithmic contributions that require broader validation.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6 · Clarity 7

Generated Jun 10, 2026

Comparison History (20)

Won vs. Abstracting Cross-Domain Action Sequences into Interpretable Workflows

Paper 2 presents a highly innovative approach by extracting deterministic Python rules from expert traces, which eliminates LLM inference costs and improves accuracy. Its successful real-world deployment in a major enterprise and strong performance on public benchmarks demonstrate broader practical utility, economic benefits, and methodological rigor compared to Paper 1's focus on log abstraction.

gemini-3.1-pro-preview · Jun 15, 2026

Lost vs. Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Paper 2 addresses a fundamental bottleneck in AI research—benchmark saturation and the high cost of manual evaluation creation. By proposing an autonomous, multi-agent system to dynamically generate embodied AI benchmarks, it provides a foundational tool that can accelerate research across robotics, spatial reasoning, and UAVs. While Paper 1 offers a strong, deployed real-world application for enterprise decision-making, Paper 2's contribution has broader implications for how the scientific community evaluates and progresses in embodied intelligence.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Trace2Policy presents a novel, rigorous framework (EISR) for systematically extracting and refining expert decision rules, demonstrated with real-world deployment (3,349 cases over 22 days) and validated across multiple benchmarks. It addresses a significant practical problem in enterprise AI — compliance-sensitive decision-making — with strong methodological contributions (error-driven iterative refinement, deterministic compilation). Paper 2, while interesting, is a smaller-scale exploratory study (74 participants) with a narrower, more niche contribution to understanding human-AI creative interaction, offering a framework but limited generalizability and immediate real-world impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Paper 1 presents a highly novel and empirically validated approach to distilling expert tacit knowledge into deterministic, self-evolving rules. Its real-world deployment at a major carrier, combined with the counterintuitive finding that compiled rules outperform LLM prompts, offers immediate and substantial practical impact. While Paper 2 addresses a critical and timely issue (agent governance), Paper 1 provides a more rigorous, end-to-end methodology with concrete, deployed results and significant performance improvements.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Towards Responsibly Non-Compliant Machines

Paper 1 has higher impact potential: it introduces a concrete, novel iterative refinement framework (EISR) with measurable gains, strong methodological elements (error clustering, regression gating), and a real-world deployment demonstrating improved accuracy and reduced inference cost (compiled rules, zero LLM calls). It also shows transfer to public benchmarks, suggesting broader applicability to compliance-sensitive decision systems. Paper 2 raises timely and important conceptual questions about responsible non-compliance, but appears more agenda/position-oriented with less technical novelty, validation, or demonstrated application.

gpt-5.2 · Jun 11, 2026

Won vs. Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Paper 2 introduces a highly innovative methodology for transforming expert traces into deterministic code, bridging neuro-symbolic AI and agentic systems. Its successful real-world enterprise deployment, cost-efficiency analysis, and generalization to multiple benchmarks demonstrate exceptional practical utility and methodological rigor. While Paper 1 offers a valuable benchmark for VLMs, Paper 2's broader implications for reliable, low-cost AI deployment and its novel iterative refinement approach afford it a significantly higher potential impact across both academia and industry.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Trace2Policy introduces a novel, generalizable framework (EISR) for extracting and iteratively refining expert decision rules into deterministic, interpretable policies—validated through real-world deployment (3,349 cases over 22 days) and across multiple benchmarks. It addresses a fundamental question about rule quality vs. model capability, offers practical cost savings, and has broad applicability across compliance-sensitive domains. Paper 2 applies existing techniques (LoRA, NEFTune) to a single financial NER task with a small dataset, representing incremental engineering rather than conceptual innovation.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Traj-Evolve: A Self-Evolving Multi-Agent System for Patient Trajectory Modeling in Lung Cancer Early Detection

Traj-Evolve addresses a high-impact healthcare problem (lung cancer early detection) with broader scientific contributions: a novel combination of non-parametric memory (ExPool) and multi-agent reinforcement learning, with complementary mechanisms that improve both sensitivity and specificity. Its methodological innovations (cross-retrieval strategy, MARL for multi-agent collaboration, experience-driven evolution) are more generalizable across medical AI and multi-agent systems. Paper 1, while practically valuable for enterprise compliance, is narrower in scope—primarily an engineering contribution around rule refinement for decision tasks with limited methodological novelty beyond iterative error patching.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Paper 1 addresses a foundational and widely debated question in current AI research: whether multi-agent workflows genuinely outperform single-agent systems. By providing a rigorously controlled evaluation framework that challenges prevailing hype, it has broad methodological implications across the entire LLM agent community. While Paper 2 offers excellent real-world deployment insights, Paper 1's findings fundamentally impact how future agentic systems will be designed and evaluated.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

Paper 2 introduces a novel, scalable framework (Trace2Policy) with demonstrated real-world enterprise deployment and generalization to public benchmarks. Its approach of using LLMs to iteratively refine human-readable rules into deterministic code offers significant practical value for efficiency and cost reduction. While Paper 1 provides an important ethical and rhetorical analysis of a specific incident, Paper 2's methodological rigor, algorithmic contribution, and direct applicability to enterprise AI suggest a broader and more enduring scientific and industrial impact.

gemini-3.1-pro-preview · Jun 10, 2026
