Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents

Zijian Du, Nathaniel Pinckney

May 20, 2026

arXiv:2605.21810v1

cs.AI (primary), cs.MA

#1225 of 2292 · Artificial Intelligence

Tournament Score: 1404 ± 48 (scale 1050 to 1800)

Win Rate: 50%

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5 · Clarity 6.5

Abstract

Complex Verilog Design Problems (CVDP) challenge hardware LLM agents because solving them requires localizing verifier-relevant RTL, testbenches, include paths, and build dependencies inside large repository snapshots, making precise edits, and recovering from sparse hidden-verifier failures. We present Trace2Skill, a test-time scaling framework that improves a hardware agent without RTL-specialized model fine-tuning. Rather than training a new model or only sampling more candidate solutions, Trace2Skill treats the agent's natural-language skill as an evolvable policy. It mines repeated rollout traces for success and failure modes, converts them into dense diagnostics and oracle lessons, and uses an oracle, mutator, and selector loop to produce task-specific skills that guide later search, editing, validation, and recovery. Because final pass/fail labels are often too coarse for hard failures, Trace2Skill also supports bounded runtime dense verifier feedback that returns sanitized functional observations while keeping hidden harnesses and reference solutions inaccessible to the agent. This feedback helps guide skill evolution and agent execution by connecting skill text, verifier evidence, and downstream behavior. Across hard CVDP tasks that defeat the seed CVDP agent, including tasks that also defeat frontier coding agents, Trace2Skill with dense verifier feedback substantially improves task pass rates and produces breakthrough passes on previously unsolved tasks, without requiring high-quality fine-tuning data, specialized RTL model training, or model weight updates. The same framework provides a general test-time scaling strategy that can extend beyond digital design to other verifiable EDA tasks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Trace2Skill

Core Contribution

Trace2Skill proposes a training-free, test-time scaling framework that evolves natural-language "skills" (prompt-level policy documents) injected into a fixed hardware design agent. The key idea is to treat the agent's instruction prompt as an evolvable artifact: after each batch of rollouts on a Complex Verilog Design Problem (CVDP), the system mines execution traces for success/failure patterns, uses an oracle LLM to distill lessons, and applies evolutionary mutation and selection to produce improved task-specific skills for subsequent generations. An optional "dense verifier feedback" mechanism provides bounded, sanitized functional observations from a hidden test harness, offering richer signal than binary pass/fail.
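The loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's implementation: `run_rollouts`, `oracle`, `mutate`, `evolve`, the `LESSONS` table, and the pass-rate selector are all hypothetical stand-ins, and the simulated agent replaces what would be LLM calls and a real verifier.

```python
import random
from dataclasses import dataclass

# Hypothetical failure symptoms and the skill text that addresses each one.
LESSONS = {"missing include": "check include paths", "stale sim": "rerun testbench"}


@dataclass
class Rollout:
    skill: str
    passed: bool
    trace: str


def run_rollouts(skill, n, rng):
    """Toy stand-in for running the fixed agent n times under one skill."""
    symptoms = [sym for sym, fix in LESSONS.items() if fix not in skill]
    p = 0.9 - 0.4 * len(symptoms)  # fewer unaddressed symptoms, higher pass chance
    rollouts = []
    for _ in range(n):
        passed = rng.random() < p
        trace = "ok" if passed else ("; ".join(symptoms) or "flaky failure")
        rollouts.append(Rollout(skill, passed, trace))
    return rollouts


def oracle(rollouts):
    """Toy oracle: distill failed traces into a sorted list of lessons."""
    lessons = set()
    for r in rollouts:
        if not r.passed:
            lessons.update(fix for sym, fix in LESSONS.items() if sym in r.trace)
    return sorted(lessons)


def mutate(skill, lessons, rng):
    """Toy mutator: append one lesson the skill does not yet contain."""
    missing = [x for x in lessons if x not in skill]
    if not missing:
        return skill
    return skill + "; " + rng.choice(missing)


def pass_rate(rollouts):
    return sum(r.passed for r in rollouts) / len(rollouts)


def evolve(seed_skill, generations=4, children=3, n_rollouts=8, keep=2, seed=0):
    rng = random.Random(seed)
    survivors = [seed_skill]
    for _ in range(generations):
        scored = []
        for skill in survivors:
            rollouts = run_rollouts(skill, n_rollouts, rng)
            scored.append((pass_rate(rollouts), skill))
            lessons = oracle(rollouts)  # mine this skill's failures
            for _ in range(children):
                child = mutate(skill, lessons, rng)
                scored.append((pass_rate(run_rollouts(child, n_rollouts, rng)), child))
        # Selector: keep the highest-scoring skills for the next generation.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        survivors = [skill for _, skill in scored[:keep]]
    return survivors[0]


if __name__ == "__main__":
    print(evolve("seed skill"))
```

The agent itself never changes in this sketch; only the skill string does, which is the sense in which the approach is training-free.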

The framework solves 6 of the 8 hard CVDP tasks that defeated the baseline agent (many of which also defeated Claude Code and Codex), achieving a 33.6% pass rate versus 0% for the seed agent, without any model weight updates.

Methodological Rigor

Strengths in experimental design: The paper presents a clean four-configuration ablation (C1–C4) that isolates the contributions of dense feedback alone (C2), sparse skill evolution (C3), and their combination (C4). The progression C1→C2→C3→C4 clearly demonstrates that skill evolution and dense feedback are complementary, not redundant.

Metric design is thoughtful: The decomposition into SkillQ (content quality), AgentProgressQ (execution quality with LCB aggregation), AgentVarianceQ (stability), and SelectQ (combined survivor score) addresses a real problem—binary pass/fail is too coarse for hard tasks where most rollouts fail. The point-biserial correlation of 0.90 between AgentQ and verifier outcomes provides reasonable validation of the proxy signal.
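Two of the quantities named above can be made concrete. The sketch below assumes a common lower-confidence-bound form (mean minus k standard errors); the exact LCB used in the paper is not stated in this assessment, so `lower_confidence_bound` and its `k` parameter are illustrative. The point-biserial correlation follows its standard definition, which is the Pearson correlation between a continuous score and a 0/1 outcome.

```python
import math


def lower_confidence_bound(scores, k=1.0):
    """Mean minus k standard errors (an assumed LCB form, not the paper's).

    Penalizes candidates whose scores are noisy or based on few rollouts.
    """
    n = len(scores)
    mean = sum(scores) / n
    if n < 2:
        return mean
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)  # sample variance
    return mean - k * math.sqrt(var / n)


def point_biserial(scores, passed):
    """Point-biserial correlation between continuous scores and 0/1 outcomes."""
    n = len(scores)
    ones = [s for s, p in zip(scores, passed) if p]
    zeros = [s for s, p in zip(scores, passed) if not p]
    if not ones or not zeros:
        raise ValueError("need at least one pass and one fail")
    mean_all = sum(scores) / n
    std_all = math.sqrt(sum((s - mean_all) ** 2 for s in scores) / n)  # population std
    if std_all == 0:
        raise ValueError("scores have zero variance")
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    return (m1 - m0) / std_all * math.sqrt(len(ones) * len(zeros) / n ** 2)


if __name__ == "__main__":
    # Made-up proxy scores and verifier outcomes, for illustration only.
    proxy = [0.2, 0.3, 0.4, 0.8, 0.9]
    outcome = [0, 0, 0, 1, 1]
    print(lower_confidence_bound(proxy), point_biserial(proxy, outcome))
```

A high correlation on such data says the proxy separates passing from failing rollouts; it does not by itself show that the proxy stays calibrated once skills have evolved.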

Concerns about rigor:

The evaluation is on only 8 hard tasks, which is a very small sample. Statistical significance is difficult to establish, and task-level heterogeneity is high (some tasks reach 100% pass rate, others remain at 0%).

C3/C4 use 640 rollouts versus 32 for C1/C2, a 20× compute difference. While the paper acknowledges this is not a train/test generalization experiment, the comparison conflates the effect of more compute (more rollouts) with the effect of skill evolution. A fairer comparison would include a "best-of-640" random sampling baseline without evolution.

The metric weights (e.g., 0.35, 0.30, 0.10 in SkillQ) appear hand-tuned without justification or sensitivity analysis.

The framework uses three different frontier LLMs (Claude Opus 4.6 for rollouts, GPT-5 for oracle, Claude Sonnet 4.5 for mutation), making it expensive and difficult to reproduce or ablate individual model contributions.
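The compute-matching concern above can be quantified with a short calculation. The sketch assumes independent rollouts with a fixed per-rollout pass probability `p`, which is a simplification; the value of `p` used in the example is hypothetical and not taken from the paper. `pass_at_k` is the standard unbiased estimator of the chance that at least one of k samples passes, given c passes observed in n samples.

```python
import math


def best_of_n_success(p, n):
    """P(at least one pass in n independent rollouts with pass probability p)."""
    return 1.0 - (1.0 - p) ** n


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled rollouts of which c passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


if __name__ == "__main__":
    p = 0.01  # hypothetical per-rollout pass probability
    for budget in (32, 640):
        print(budget, round(best_of_n_success(p, budget), 3))
```

Under these assumptions, an agent with a true 1% per-rollout pass rate fails all 32 rollouts about 72% of the time, yet almost always produces at least one pass within 640. This is why a best-of-640 baseline without evolution is needed to separate the effect of the skill loop from the effect of the larger budget.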

Potential Impact

Domain-specific value: For EDA/hardware design automation, this work addresses a genuine pain point—LLM agents struggle with long-context repository navigation, RTL editing, and recovery from sparse verification failures. The idea of evolving prompt-level guidance without retraining is practically valuable for organizations that cannot easily fine-tune frontier models on proprietary RTL data.

Broader applicability: The general framework—mining execution traces, evolutionary prompt optimization, dense proxy metrics for sparse-reward tasks—could transfer to other verifiable engineering domains (formal verification, synthesis, timing closure). However, the paper provides no evidence of such transfer.

Limitations on impact: The approach is computationally expensive (640 rollouts per task configuration), uses multiple frontier models, and requires access to verification infrastructure (NeMo Gym). This limits immediate adoption. Because evolved skills are task-specific and do not transfer, scalability is also limited: each new task requires its own evolutionary run.

Timeliness & Relevance

The paper is timely. Test-time compute scaling is an active research frontier, and applying it to hardware design agents addresses a real gap. The CVDP benchmark and the emergence of coding agents (Claude Code, Codex) as baselines make this relevant to both the EDA and LLM agent communities. The framing as complementary to ACE-RTL (which focuses on model fine-tuning) is appropriate.

Strengths

1. Clean ablation structure that isolates feedback and evolution contributions.

2. Rich traceability artifacts—the paper provides unusually detailed trace examples (Tables 5-7) showing exactly how skills evolve and how dense feedback changes agent behavior.

3. Practical architecture that requires no model fine-tuning, making it applicable to black-box API-served models.

4. Honest limitations section acknowledging compute costs, rollout variance, and benchmark narrowness.

5. Dense metric design that addresses the real challenge of learning from sparse binary rewards in hard verification tasks.

Weaknesses

1. Very small evaluation set (8 tasks for the main ablation) limits statistical confidence in the findings.

2. Missing compute-matched baselines: No comparison against simple best-of-N sampling with 640 rollouts but no evolution, which would isolate the value of the evolutionary skill optimization loop.

3. No generalization evaluation: Skills are task-specific, and the paper explicitly notes weak cross-task transfer but provides no formal analysis of what makes skills transferable or not.

4. Expensive and complex pipeline: Three frontier LLMs, evolutionary loop, custom metric computation, and NeMo Gym infrastructure create significant barriers to reproduction.

5. Metric design opacity: Many hand-tuned weights and thresholds without ablation or sensitivity analysis. The correlation validation (r=0.90) is on seed-skill rollouts only, not on evolved-skill rollouts where the distribution may shift.

6. Limited novelty in individual components: Evolutionary prompt optimization, trace mining, and dense reward shaping are each known techniques; the contribution is primarily their combination and application to EDA.
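To illustrate the first weakness, a Wilson score interval shows how wide the uncertainty is on a task-level solve rate measured over 8 tasks. Treating each task as an independent Bernoulli trial is an assumption made for this sketch, and `wilson_interval` is an illustrative helper, not something the paper reports.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(6, 8)  # 6 of 8 hard tasks solved
    print(round(low, 3), round(high, 3))
```

For 6 of 8 tasks the interval runs from roughly 0.41 to 0.93, so the data are consistent with anything from a minority to nearly all comparable tasks being solvable.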

Overall Assessment

Trace2Skill presents a well-engineered system that demonstrates meaningful improvements on hard hardware design tasks through test-time skill evolution. The ablation is clean and the traceability artifacts are exemplary. However, the evaluation is narrow (8 tasks), lacks compute-matched baselines, and the framework's complexity and cost raise questions about practical scalability. The work is a solid applied contribution to EDA agent research but would benefit from broader evaluation and stronger baselines to establish the specific value of evolutionary skill optimization over simpler scaling strategies.

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5 · Clarity 6.5

Generated May 22, 2026

Comparison History (16)

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

gemini-3.1 · 5/22/2026

Paper 1 tackles a highly complex, high-value problem (automated hardware design) by introducing an innovative test-time scaling and skill evolution framework. Overcoming the limits of long-context EDA tasks without model fine-tuning offers immense real-world applications in chip design. While Paper 2 provides a valuable benchmarking tool for text-to-image prompting, Paper 1's methodological novelty and its potential to unlock breakthroughs in specialized, high-stakes engineering domains give it a higher potential for broad scientific and industrial impact.

vs. Evaluation of Pipelines for Data Integration into Knowledge Graphs

gemini-3.1 · 5/22/2026

Paper 1 explores the timely and impactful intersection of LLM agents, test-time scaling, and hardware design automation. Its novel approach of evolving natural-language skills via verifier feedback without fine-tuning addresses critical bottlenecks in complex reasoning tasks. While Paper 2 provides a useful benchmark for knowledge graphs, Paper 1's methodological innovation in agentic workflows and potential to significantly accelerate chip design offer broader impact and greater technological relevance.

vs. Self-supervised Hierarchical Visual Reasoning with World Model

gpt-5.2 · 5/22/2026

Paper 1 is more likely to have higher scientific impact due to its novel test-time “skill evolution” framework that leverages verifier traces and optional dense, bounded feedback to systematically improve agent behavior without fine-tuning or weight updates. This is timely for LLM-agent reliability and scales to high-value, real-world EDA workflows where verification is the ground truth. Its methodology directly targets a hard industrial bottleneck (long-context repo localization + sparse verifier signals) and proposes a generalizable verifier-guided scaling paradigm beyond hardware. Paper 2 is solid and relevant, but hierarchical residual world models are a more crowded space.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

claude-opus-4.6 · 5/22/2026

Paper 1 documents a fundamental and counterintuitive finding—inverse scaling in LLM forecasting on critical tasks involving tail risk and superlinear growth—with broad implications across AI safety, finance, epidemiology, and evaluation methodology. It challenges prevailing assumptions that more capable models are universally better, proposes actionable evaluation recommendations, and spans multiple domains. Paper 2, while technically solid, addresses a narrower problem (Verilog design agents) with more limited cross-field impact. Paper 1's findings are more likely to reshape how the community evaluates and deploys LLMs in high-stakes forecasting settings.

vs. ST-SimDiff: Balancing Spatiotemporal Similarity and Difference for Efficient Video Understanding with MLLMs

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a broadly impactful problem—efficient video understanding with MLLMs—relevant across computer vision, NLP, and multimedia. Its training-free framework with a novel dual perspective (similarity for redundancy, difference for key events) and spatio-temporal graph modeling offers wide applicability. Paper 2, while technically interesting, targets a narrow domain (EDA/Verilog agents) with limited cross-field impact. Paper 1's approach is more generalizable, timely given the rapid growth of MLLMs, and likely to influence a larger research community.

vs. AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

claude-opus-4.6 · 5/22/2026

Paper 2 presents a novel, concrete technical framework (Trace2Skill) with empirical results demonstrating measurable improvements on previously unsolved tasks. It introduces an innovative test-time scaling approach for LLM agents that avoids fine-tuning, with clear generalizability beyond EDA. Paper 1 is a review/chapter that synthesizes existing work on AI in serious games without introducing new methods or empirical contributions. While useful as a survey, its impact is inherently limited compared to Paper 2's actionable technical contribution with demonstrated breakthrough results on hard benchmarks.

vs. Neurosymbolic Learning for Inference-Time Argumentation

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a novel neurosymbolic framework (ITA) that integrates formal argumentation semantics with LLM training for claim verification, addressing the broadly important problem of faithful explainability in AI decision-making. Its contributions span multiple fields (NLP, formal reasoning, explainable AI) and tackle fundamental challenges around trustworthy AI in high-stakes domains. Paper 2, while technically solid, addresses a narrower domain (EDA/Verilog debugging) with a test-time scaling approach that, though effective, is more incremental and domain-specific in its impact potential.

vs. Neurosymbolic Learning for Inference-Time Argumentation

gpt-5.2 · 5/22/2026

Paper 1 is likely higher impact due to strong novelty in test-time scaling via “skill evolution” without fine-tuning, leveraging verifier-guided dense feedback—highly timely for agentic LLM reliability. It targets a difficult, industrially relevant domain (hardware/EDA) with clear real-world applicability and a pathway to broader “verifiable tasks” beyond RTL. Methodologically, the oracle–mutator–selector loop plus bounded verifier observations suggests a rigorous, deployable framework. Paper 2 offers valuable neurosymbolic faithfulness for claim verification, but the scope and demonstrated gains appear narrower and more incremental relative to existing argumentation-based approaches.

vs. Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to its broad societal relevance and cross-domain applicability: it introduces a first-of-its-kind evaluation framework for LLM alignment in armed-conflict contexts, tests multiple major providers with clear, policy-relevant failure modes, and yields actionable findings (large variance across models; systematic failures under “balance” prompting). This is timely given widespread deployment and can influence AI safety benchmarks, governance, journalism, and humanitarian practice. Paper 1 is technically novel for EDA agents but its impact is narrower to hardware/EDA workflows.

vs. SGR-Bench: Benchmarking Search Agents on State-Gated Retrieval

claude-opus-4.6 · 5/22/2026

Trace2Skill introduces a novel test-time scaling framework with broader methodological contributions—evolvable skill policies from rollout traces, dense verifier feedback, and an oracle-mutator-selector loop—applicable beyond EDA to other verifiable domains. It demonstrates concrete breakthroughs on previously unsolved tasks without fine-tuning. SGR-Bench makes a solid benchmarking contribution identifying state-gated retrieval as an underexplored problem, but benchmarks typically have narrower methodological impact compared to frameworks that introduce new algorithmic paradigms with demonstrated generalizability.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 addresses a universal bottleneck in LLM agent development—diagnosing failures at scale across large execution traces. Its corpus-level diagnostic approach is highly generalizable and has broad impact across any field utilizing LLM agents. While Paper 2 presents an innovative test-time scaling method, its primary focus on Electronic Design Automation (EDA) and Verilog makes its immediate impact more niche compared to the foundational debugging framework proposed in Paper 1.

vs. Echo: Learning from Experience Data via User-Driven Refinement

gpt-5.2 · 5/22/2026

Paper 1 has broader, more timely impact: it proposes a general framework for turning ubiquitous real-world agent interaction logs plus user refinements into scalable training signals, demonstrated at production scale with a clear quantitative gain. This targets a central bottleneck (data and continual alignment) across many agent domains, making applications wide and immediate. Paper 2 is innovative and rigorous for a hard, important niche (EDA/Verilog) and introduces verifier-guided test-time skill evolution, but its scope and real-world deployment footprint are narrower and results appear more domain-specific, reducing cross-field impact relative to Paper 1.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to broader methodological relevance: it addresses a fundamental, widely encountered failure mode in diffusion/flow guidance under compositional constraints, offers a principled analysis (gradient misalignment → off-manifold drift), and proposes a lightweight, learnable correction applicable across multiple domains (images, synthetic, planning/control). This gives wide cross-field applicability and timeliness for controllable generative modeling. Paper 2 is impactful for EDA agents, but its scope is more domain-specific and depends on verifier setups, potentially limiting breadth despite strong practical relevance.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a broadly applicable evaluation paradigm ('open-world evaluations') for frontier AI systems, addressing a fundamental gap in how the AI community measures capabilities. Its conceptual contribution—complementing benchmarks with long-horizon, real-world tasks—has wide relevance across all of AI safety, policy, and capability research. The CRUX project provides an institutional framework for ongoing impact. Paper 2, while technically strong, addresses a narrower domain (EDA/Verilog agents) with a specialized test-time scaling method. Paper 1's breadth of influence across AI evaluation, governance, and multiple application domains gives it higher potential impact.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

gpt-5.2 · 5/22/2026

Paper 1 likely has higher impact: it introduces a broadly applicable test-time scaling framework (skill evolution from rollout traces + verifier-guided dense feedback) that addresses a key bottleneck in long-context, verifiable coding/EDA agents without fine-tuning or weight updates. The method is tightly tied to rigorous pass/fail verification, shows breakthroughs on previously unsolved tasks, and could generalize to many “verifier-in-the-loop” domains beyond hardware. Paper 2 is timely and useful for collaborative driving, but its impact is narrower and may depend more on simulation-to-reality transfer.

vs. Claw AI Lab: An Autonomous Multi-Agent Research Team

claude-opus-4.6 · 5/22/2026

Trace2Skill presents a more novel and technically rigorous contribution—a concrete test-time scaling framework for hardware design agents that demonstrates measurable improvements on hard, previously unsolved tasks without model fine-tuning. It introduces specific algorithmic innovations (skill evolution via oracle-mutator-selector loops, dense verifier feedback) with clear evaluation on challenging benchmarks. Paper 1 describes an engineering platform for autonomous research that, while useful, is more incremental (building on existing multi-agent pipelines) with only internal evaluations on five case studies, limiting its scientific rigor and reproducibility.