SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Matthew Ho, Brian Liu, Jixuan Chen, Audrey Wang, Lianhui Qin

Jun 8, 2026 · arXiv:2606.09774v1

cs.AI · cs.CL

#458 of 3489 · Artificial Intelligence

Tournament Score: 1489 ± 43

Win Rate: 73%

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Abstract

Advanced scientific simulators expose specialized input languages that turn simulation goals into executable configurations, but learning them can cost domain scientists hours to days. We study simulator setup as a problem of agent-tool interface grounding: what minimal simulator-specific adaptations are needed for an off-the-shelf coding agent to operate real scientific software? Our intuition is that coding agents already know how to navigate files, edit code, run commands, and repair outputs, but they lack the simulator's executable contract: its vocabulary, structural constraints, validation rules, and termination conditions. We introduce SIGA, a Simulator-Interface Grounding Adapter that supplies this contract through retrieval, procedural memory, in-trajectory validation, and validation-enforced termination. We primarily evaluate SIGA on GEOS, an open-source multiphysics simulator used in subsurface science. SIGA produces a complete GEOS deck in about five minutes with TreeSim above 0.90, matching an extended-budget human expert who took about three hours, a roughly 36x wall-clock speedup. On a harder held-out set, grounding raises TreeSim from 0.720 to 0.789, a roughly 10% relative gain over the bare agent, and can reduce the across-seed standard deviation by 16x. Self-evolution further improves SIGA by rewriting adapter contents from prior trajectories, yielding the highest held-out GEOS mean and matching or outperforming the strongest hand-designed configuration. Transfers to OpenFOAM and LAMMPS show that the dominant mechanism shifts by interface: validation matters most when structural completeness is the bottleneck, while memory and retrieval matter most when domain correctness is the bottleneck. These results suggest that lightweight, self-improvable grounding layers can turn general coding agents into practical operators of scientific software.
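The headline figures in the abstract can be checked with a few lines of arithmetic. The snippet below is only a sanity check on the rounded numbers quoted above ("about three hours", "about five minutes", 0.720 and 0.789); it is not taken from the paper.

```python
# Rounded figures as quoted in the abstract.
human_minutes = 3 * 60   # "about three hours"
agent_minutes = 5        # "about five minutes"
speedup = human_minutes / agent_minutes

bare_treesim = 0.720
grounded_treesim = 0.789
absolute_gain = grounded_treesim - bare_treesim
relative_gain = absolute_gain / bare_treesim

print(f"speedup: {speedup:.0f}x")              # speedup: 36x
print(f"absolute gain: {absolute_gain:.3f}")   # absolute gain: 0.069
print(f"relative gain: {relative_gain:.1%}")   # relative gain: 9.6%
```

The relative gain comes out at about 9.6%, which the abstract rounds to "roughly 10%".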

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SIGA — Self-Evolving Coding-Agent Adapters for Scientific Simulation

1. Core Contribution

SIGA proposes that general-purpose coding agents (e.g., Claude Code) already possess the mechanical capabilities for scientific simulator configuration — file navigation, editing, shell execution, error repair — but lack *simulator-specific interface grounding*: vocabulary, structural constraints, validation rules, and termination conditions. The paper frames simulator setup as an agent-tool interface grounding problem and introduces a thin adapter layer with four components: semantic retrieval (R), procedural memory (M), agent-callable validation (X), and validation-enforced termination (S). These plug into three harness interfaces (context, tools, termination) without modifying the underlying agent loop.
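To make the four-component description concrete, here is a minimal sketch of how such an adapter could wrap an unchanged agent loop. Everything in it is hypothetical: the `Adapter` class, `run_agent`, the toy validator, and the two "required blocks" are illustrative stand-ins, not SIGA's code or the GEOS schema.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Adapter:
    """Hypothetical container for the four components named in the review."""
    retrieve: Callable[[str], List[str]]   # R: semantic retrieval over simulator docs
    memory: List[str]                      # M: procedural memory notes
    validate: Callable[[str], List[str]]   # X: returns a list of error messages
    enforce_termination: bool = True       # S: refuse to stop while errors remain

def run_agent(task: str,
              propose: Callable[[str, List[str]], str],
              adapter: Adapter,
              max_steps: int = 5) -> Tuple[str, bool]:
    """Run the (stand-in) agent; return the final draft and whether it validated."""
    context = adapter.memory + adapter.retrieve(task)   # context interface
    draft = ""
    for _ in range(max_steps):
        draft = propose(task, context)
        errors = adapter.validate(draft)                # tool interface
        if not errors or not adapter.enforce_termination:
            return draft, not errors                    # termination interface
        context = context + [f"validator: {e}" for e in errors]
    return draft, False

# Toy stand-ins so the sketch runs end to end.
REQUIRED_BLOCKS = ("Mesh", "Solvers")  # illustrative only

def toy_validate(deck: str) -> List[str]:
    return [f"missing <{name}> block" for name in REQUIRED_BLOCKS if f"<{name}" not in deck]

def toy_propose(task: str, context: List[str]) -> str:
    deck = "<Problem><Mesh/>"
    if any("Solvers" in note for note in context):  # reacts to validator feedback
        deck += "<Solvers/>"
    return deck + "</Problem>"

adapter = Adapter(retrieve=lambda task: [], memory=[], validate=toy_validate)
deck, ok = run_agent("toy task", toy_propose, adapter)
print(ok, deck)  # True <Problem><Mesh/><Solvers/></Problem>
```

With `enforce_termination=False` the same toy agent stops after its first draft, which is missing the `Solvers` block. That is the kind of structurally incomplete output validation-enforced termination is meant to prevent.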

The key conceptual insight — *adapt the harness, don't rebuild the agent* — is practically important. Rather than constructing bespoke agent architectures per simulator (as most prior scientific-agent work does), SIGA treats the simulator's "executable contract" as a small, bounded design space amenable to factorial analysis and self-evolution.

2. Methodological Rigor

Strengths in experimental design:

The Resolution-IV fractional factorial design (2^{4-1}) over the four binary components is unusually principled for this area, enabling separation of main effects from two-factor interactions at half the compute of a full factorial.

The failures-as-zero convention correctly penalizes structurally unusable outputs, which is critical for practical simulator use.

TreeSim — a recursive tree-similarity metric with unordered bipartite matching — is well-defined and appropriate for structured XML comparison.

Leakage control (hygiene gate regex-scanning distilled artifacts for ground-truth filenames) demonstrates awareness of contamination risks.
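A hygiene gate of the kind described in the last item can be sketched in a few lines. This is an assumption about the general shape of such a check, not the paper's implementation; the function name `hygiene_gate`, the artifact names, and the filename are all made up for illustration.

```python
import re
from typing import Dict, Iterable, List

def hygiene_gate(artifacts: Dict[str, str],
                 ground_truth_filenames: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per artifact, the ground-truth filenames it mentions (case-insensitive)."""
    names = list(ground_truth_filenames)
    if not names:
        return {}
    # Escape each filename so characters such as '.' are matched literally.
    pattern = re.compile("|".join(re.escape(n) for n in names), re.IGNORECASE)
    hits: Dict[str, List[str]] = {}
    for artifact_name, text in artifacts.items():
        found = sorted({m.group(0).lower() for m in pattern.finditer(text)})
        if found:
            hits[artifact_name] = found
    return hits

# Hypothetical distilled artifacts and a hypothetical held-out filename.
artifacts = {
    "memory.md": "Start from Case_A_base.xml and change the mesh.",
    "notes.md": "Always declare a Solvers block before Events.",
}
hits = hygiene_gate(artifacts, ["case_a_base.xml"])
print(hits)  # {'memory.md': ['case_a_base.xml']}
```

A filename scan like this only catches literal mentions. Paraphrased or partially copied ground-truth content would pass it, so it is best read as a minimum safeguard.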

Limitations in rigor:

Sample sizes are small: n=3 runs per cell for GEOS, single runs for OpenFOAM and LAMMPS transfers. The held-out improvement of +0.069 TreeSim (Vanilla→SE) is concentrated in just two task rescues, making the aggregate gain fragile.

The human baseline (n=2 on a single task) is more of an existence proof than a statistically meaningful comparison. The 36× speedup claim, while striking, rests on one task and one extended-budget expert.

The LAMMPS evaluation uses a non-deterministic LLM judge (Claude Sonnet 4.6), introducing unquantified measurement noise.

The OpenFOAM baselines (Foam-Agent, MetaOpenFOAM) are run in lint-only mode, which the authors acknowledge is not their intended execution mode — weakening the comparison.

TreeSim is purely structural; a deck scoring 0.9 may still fail physically at runtime. No execution-based validation is performed.
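For intuition about what a structural score of this kind measures, the sketch below implements a recursive tree similarity with unordered child matching, combined with the failures-as-zero convention. It is not the paper's TreeSim definition: the equal weighting of attributes and children, the exact attribute comparison, and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from functools import lru_cache

def best_assignment(weights):
    """Maximum total weight matching each row to a distinct column.

    Exhaustive search with memoisation, suitable only for a handful of children.
    Assumes len(weights) <= len(weights[0]) and non-negative weights.
    """
    n_rows, n_cols = len(weights), len(weights[0])

    @lru_cache(maxsize=None)
    def solve(row, used_mask):
        if row == n_rows:
            return 0.0
        best = 0.0
        for col in range(n_cols):
            if not used_mask & (1 << col):
                best = max(best, weights[row][col] + solve(row + 1, used_mask | (1 << col)))
        return best

    return solve(0, 0)

def tree_similarity(a, b):
    """Similarity in [0, 1] between two XML elements, ignoring child order."""
    if a.tag != b.tag:
        return 0.0
    keys = set(a.attrib) | set(b.attrib)
    attr_score = 1.0 if not keys else sum(a.attrib.get(k) == b.attrib.get(k) for k in keys) / len(keys)
    kids_a, kids_b = list(a), list(b)
    if not kids_a and not kids_b:
        return attr_score
    if not kids_a or not kids_b:
        child_score = 0.0
    else:
        small, large = sorted((kids_a, kids_b), key=len)
        weights = [[tree_similarity(s, l) for l in large] for s in small]
        # Unmatched children on the larger side contribute zero.
        child_score = best_assignment(weights) / len(large)
    return 0.5 * attr_score + 0.5 * child_score  # assumed equal weighting

def score_deck(candidate_xml, reference_xml):
    """Failures-as-zero: a candidate that does not parse scores 0."""
    try:
        candidate = ET.fromstring(candidate_xml)
    except ET.ParseError:
        return 0.0
    return tree_similarity(candidate, ET.fromstring(reference_xml))

reference = '<Problem><Mesh name="m"/><Solvers tol="1e-6"/></Problem>'
reordered = '<Problem><Solvers tol="1e-6"/><Mesh name="m"/></Problem>'
incomplete = '<Problem><Mesh name="m"/></Problem>'
malformed = '<Problem><Mesh>'

print(score_deck(reordered, reference))   # 1.0
print(score_deck(incomplete, reference))  # 0.75
print(score_deck(malformed, reference))   # 0.0
```

Nothing in this computation runs the simulator, which is the point of the limitation above: a deck can match the reference tree closely and still fail at runtime. For larger trees, the exhaustive matcher would normally be replaced by a polynomial-time assignment solver such as `scipy.optimize.linear_sum_assignment`.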

3. Potential Impact

Practical impact: The immediate value proposition — reducing GEOS deck authoring from hours to minutes — addresses a genuine bottleneck in subsurface science workflows (CO₂ sequestration, reservoir simulation, geothermal energy). If the adapter generalizes reliably, it could substantially accelerate simulation-driven research across domains.

Methodological impact: The "adapt-the-harness" paradigm is a sensible architectural pattern. As frontier models are increasingly post-trained within specific agent harnesses, preserving that alignment while adding domain grounding is more pragmatic than rebuilding from scratch. The self-evolution mechanism (rewriting adapter contents from trajectories) is modest but principled, and the finding that it matches or outperforms hand-designed configurations suggests a scalable adaptation pathway.

Cross-domain insight: The finding that the dominant grounding mechanism shifts by interface — validation for structural completeness (GEOS, OpenFOAM) vs. memory/retrieval for domain correctness (LAMMPS) — provides actionable design guidance. This is a genuinely useful empirical observation for the growing community building scientific agents.

Limitations in impact scope: The paper explicitly scopes itself to *tool operation*, not scientific discovery. While this is intellectually honest, it also bounds the ceiling: the agent is translating specifications, not reasoning scientifically. The residual errors (bad attribute values requiring domain knowledge) point to where this approach fundamentally plateaus.

4. Timeliness & Relevance

The paper arrives at a moment when coding agents are maturing rapidly and the AI-for-science community is actively searching for practical applications beyond benchmarks. The framing — that reliable tool operation is a prerequisite for, and distinct from, autonomous scientific reasoning — is timely and strategically sound. The work fills an application gap (no prior GEOS agent) and a methodological gap (building on existing coding harnesses rather than from scratch).

The self-evolution angle connects to the active meta-harness/harness-optimization literature (Lee et al. 2026, Ning et al. 2026) but applies it to a domain-knowledge-heavy task rather than general coding benchmarks, which is a meaningful extension.

5. Strengths & Limitations

Key strengths:

Clean conceptual decomposition: four components, three interfaces, bounded design space

Principled factorial ablation design with proper confounding analysis

Strong negative results honestly reported (memory-as-retrieval-tool never invoked; retrieval slightly hurts on strong backbones)

The autonomy study (§6.4) reveals an important finding: agents substitute on-disk examples for human consultation, which has implications for human-in-the-loop benchmark design

Transfer studies across three simulators with qualitatively different interfaces
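The confounding structure behind the factorial ablation can be reproduced with a short script. The sketch assumes the textbook generator S = R·M·X for the 2^{4-1} half fraction, with +1 meaning a component is enabled and -1 disabled; the review does not state which generator the paper uses, so the specific alias pairs below are illustrative.

```python
from itertools import product

FACTORS = ("R", "M", "X", "S")

def half_fraction():
    """Eight runs of a 2^(4-1) design with the assumed generator S = R*M*X."""
    return [{"R": r, "M": m, "X": x, "S": r * m * x}
            for r, m, x in product((-1, 1), repeat=3)]

def column(runs, effect):
    """Contrast column for an effect such as 'R' or 'RM' (product of factor levels)."""
    col = []
    for run in runs:
        value = 1
        for factor in effect:
            value *= run[factor]
        col.append(value)
    return col

runs = half_fraction()
two_factor = ["RM", "RX", "RS", "MX", "MS", "XS"]

# Two-factor interactions whose contrast columns coincide are aliased with each other.
aliased_pairs = [(a, b)
                 for i, a in enumerate(two_factor)
                 for b in two_factor[i + 1:]
                 if column(runs, a) == column(runs, b)]

# Main effects whose columns coincide (up to sign) with a two-factor interaction.
main_confounded = [(f, t)
                   for f in FACTORS
                   for t in two_factor
                   if column(runs, f) in (column(runs, t), [-v for v in column(runs, t)])]

print(len(runs))        # 8
print(aliased_pairs)    # [('RM', 'XS'), ('RX', 'MS'), ('RS', 'MX')]
print(main_confounded)  # []
```

The empty `main_confounded` list is the Resolution-IV property: main effects are clear of two-factor interactions, while the two-factor interactions are aliased in pairs and cannot be separated from each other without additional runs.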

Notable weaknesses:

Statistical power is thin throughout; many claims rest on single runs or small n

The 36× speedup headline, while attention-grabbing, is anchored to an unfairly constrained human comparison (experts new to GEOS, one task)

No runtime execution of generated decks — the gap between structural correctness and physical validity remains unaddressed

The self-evolution improvement over hand-designed configurations is marginal (+0.006 TreeSim on held-out), within noise given n=3

Heavy reliance on a single backbone (deepseek-v4-flash); cross-model results are preliminary

Overall assessment: This is a well-conceived systems paper that introduces a clean, portable framework for adapting coding agents to scientific simulators. The empirical evidence, while sometimes thin, consistently supports the core thesis that lightweight grounding layers are more effective than rebuilding agent loops. The cross-simulator transfer analysis, with its finding of shifting dominant mechanisms, is the most novel and broadly useful contribution. The work is positioned sensibly as a near-term, bounded contribution rather than overclaiming about scientific autonomy.

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (22)

Lost vs. Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Paper 1 addresses a fundamental problem in cross-modal knowledge distillation with broad theoretical contributions and practical impact. It establishes theoretical foundations for CMKD without paired data—a setting with wide applicability across many multimodal AI tasks. The theoretical guarantees and principled framework make it generalizable across domains. Paper 2, while practical and impressive for scientific simulation automation, addresses a narrower problem (adapting coding agents to specific simulators) with more limited generalizability. Paper 1's contributions to transfer learning theory and multimodal AI have broader potential to influence multiple research communities.

claude-opus-4-6 · Jun 10, 2026

Won vs. Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Paper 2 presents a highly timely and transformative application of LLM agents that dramatically accelerates scientific simulation setup across multiple disciplines, offering profound cross-field impact. While Paper 1 provides rigorous methodological advancements in reinforcement learning, Paper 2's potential to reduce workflow times from hours to minutes for domain scientists gives it a broader and more immediate real-world scientific impact.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

Paper 1 has higher impact potential: it introduces a concrete, deployable grounding adapter that turns general coding agents into reliable operators of complex scientific simulators (GEOS, plus transfer to OpenFOAM/LAMMPS), with large measured productivity gains and mechanisms (validation, memory, retrieval, self-evolution) that generalize across interfaces. This has immediate real-world applications across many simulation-driven fields and addresses a timely bottleneck in scientific computing. Paper 2 is conceptually novel and broad, but its primary contribution is explanatory/diagnostic for ML benchmarking rather than a directly enabling tool, limiting near-term downstream impact.

gpt-5.2 · Jun 10, 2026

Lost vs. Optical Reasoning: Rethinking Images as an Expressive Reasoning Medium Beyond Text

Paper 1 proposes a fundamental paradigm shift in AI by using images as a standalone reasoning medium, potentially revolutionizing how multimodal models process information. Its ability to improve token efficiency by up to 28% while matching text reasoning performance gives it broad applicability across the entire AI field. In contrast, while Paper 2 offers significant practical value and time savings for scientific simulations, its impact is more domain-specific and applied.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 2 has higher likely impact due to clear, near-term real-world utility (automating configuration of major scientific simulators), demonstrated large productivity gains, and broader applicability across domains (GEOS, OpenFOAM, LAMMPS). The adapter concept is a practical, modular interface-grounding layer that many labs could adopt, making it timely for scientific computing workflows. Paper 1 is novel and relevant for AI alignment, but its impact may be narrower and depends more on interpretability/probing assumptions and the extent to which PRIME generalizes beyond specific proxy-reward coding RL setups.

gpt-5.2 · Jun 9, 2026

Won vs. From Coarse to Fine: Managing Temporal Granularity in Spatio-Temporal Data for Fine-Grained Traffic Prediction

SIGA addresses a broader and more transformative problem—enabling general coding agents to operate complex scientific simulators with minimal adaptation. This has wide cross-disciplinary impact (subsurface science, CFD, molecular dynamics) and taps into the rapidly growing field of AI agents for scientific discovery. The self-evolution capability and transferability across simulators (GEOS, OpenFOAM, LAMMPS) demonstrate generalizability. The 36x speedup over human experts is practically significant. Paper 1, while solid, addresses a more incremental improvement in traffic prediction with narrower domain applicability.

claude-opus-4-6 · Jun 9, 2026

Lost vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves

Paper 2 addresses a fundamental and highly debated topic in foundational AI (LLM self-correction), revealing that apparent reasoning deficits are actually chat-template artifacts. Its findings apply broadly across all LLM reasoning and agentic research, offering a simple yet profound intervention. While Paper 1 presents a highly useful tool for domain scientists, its impact is more narrowly confined to specific scientific simulation applications.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Paper 2 presents a multi-modal foundation model for physiological signals with broad, high-stakes healthcare applications. Its demonstration that next-token prediction effectively models stochastic biological data, combined with strong zero-shot or few-shot generalization to tasks like atrial fibrillation detection, offers massive potential impact across medicine and bio-machine learning. While Paper 1 provides a highly practical tool for computational scientists, Paper 2's methodological innovation and potential to transform clinical diagnostics give it a wider and more profound scientific impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

SIGA addresses a broader and more transformative problem—enabling general-purpose coding agents to operate complex scientific simulators with minimal adaptation. It demonstrates practical speedups (36x over human experts), generalizes across multiple simulators (GEOS, OpenFOAM, LAMMPS), and introduces a self-evolution mechanism. The concept of simulator-interface grounding adapters has wide applicability across computational sciences. Paper 1, while solving a real problem in LLM-assisted manuscript preparation, addresses a narrower application domain with incremental engineering contributions (deterministic verification gates) rather than a fundamentally new paradigm.

claude-opus-4-6 · Jun 9, 2026

Lost vs. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

Paper 2 has higher likely impact due to broader applicability and timeliness: robustness of tool-use agents affects many real deployments beyond a single scientific simulator ecosystem. It contributes a verified, publicly released benchmark/leaderboard grounded in real failure modes plus an RL domain-randomization recipe with measurable cross-perturbation generalization, enabling reproducible progress across models and communities (LLM agents, RL, software reliability). Paper 1 is novel and valuable for scientific simulation workflows, but its evaluations are more domain-specific and may generalize less widely than a robustness benchmark and training paradigm.

gpt-5.2 · Jun 9, 2026
