Huanshuo Dong, Keyao Zhang, Hong Wang, Zhezheng Hao, Zhiwei Zhuang, Ziyan Liu, Jiacong Wang, Gengyuan Liu
Numerical solvers for partial differential equations (PDEs) are core computational tools in science and engineering. Building reliable PDE solvers requires not only executable code, but also a numerical solver strategy, a set of decisions about discretization, stabilization, solver configuration, and resolution control, that matches the PDE structure. Recent LLM-based coding agents have begun to reduce the programming burden by generating and debugging solver implementations. However, they typically move directly from a PDE problem to solver code, leaving the solver strategy implicit in implementation details. Feedback from a failed solve is therefore routed back to code edits rather than to the underlying strategy, so numerical decisions remain hard to check before code is generated and hard to revise using numerical evidence when it fails. To address this limitation, we propose AutoPDE, a code agent that maintains the solver strategy as an explicitly represented object throughout the solving process: an independent, inspectable object that is built before any code is written and can be revised, using numerical evidence, whenever a solve fails. AutoPDE builds and maintains this object in three stages, all drawing from a library of reusable PDE-solving skills: PDE analysis identifies the equation type and algebraic structure; numerical method selection chooses a numerical method that matches the analysis result and commits to a discretization, stabilization, and linear solver accordingly; and adaptive tuning runs low-cost pilot solves to calibrate resolution and tolerances under the prescribed accuracy and runtime budget. We evaluate AutoPDE on the PDE Agent Bench, where experimental results show that AutoPDE achieves a pass rate of 54.5%, improving over the strongest baseline by 14.2 percentage points.
AutoPDE introduces a "strategy-first" design for LLM-based PDE solver generation. The central insight is that current LLM coding agents go directly from PDE specification to code, leaving numerical decisions (discretization, stabilization, solver selection, resolution) implicit. When solves fail, feedback loops can only patch code rather than revise the underlying numerical strategy. AutoPDE addresses this by maintaining an explicit, inspectable "solver strategy record" (composed of DIAGNOSIS and METHOD cards) that is constructed before code generation and revised using numerical evidence from pilot solves.
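To make the idea of an explicit, revisable strategy record concrete, here is a minimal sketch in Python. It is not the paper's implementation: the class names, field names, and the `revise_method` helper are all hypothetical, chosen only to illustrate how a DIAGNOSIS card and a METHOD card could live outside the generated code and be revised with numerical evidence attached.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class DiagnosisCard:
    # Hypothetical fields: what the PDE analysis stage concluded.
    equation_type: str   # e.g. "advection-dominated transport"
    structure: str       # e.g. "nonsymmetric linear system"


@dataclass(frozen=True)
class MethodCard:
    # Hypothetical fields: the numerical decisions committed to before coding.
    discretization: str
    stabilization: str
    linear_solver: str
    resolution: int


@dataclass
class StrategyRecord:
    diagnosis: DiagnosisCard
    method: MethodCard
    # Each entry pairs a superseded METHOD card with the evidence that replaced it.
    history: list = field(default_factory=list)

    def revise_method(self, evidence: str, **changes) -> None:
        """Replace the METHOD card, keeping the old card and the reason for the change."""
        self.history.append((self.method, evidence))
        self.method = replace(self.method, **changes)


record = StrategyRecord(
    diagnosis=DiagnosisCard("advection-dominated transport", "nonsymmetric linear system"),
    method=MethodCard("continuous P1 elements", "none", "GMRES", resolution=32),
)
# A failed pilot solve is routed to the strategy, not to a code patch.
record.revise_method("spurious oscillations near the front", stabilization="SUPG")
```

The point of the sketch is the separation: the revision is recorded against the strategy together with its evidence, so the decision can be inspected independently of whatever code is later generated from it.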
The system operates in three stages: (1) PDE analysis to classify the equation and identify its structure, (2) numerical method selection using a library of reusable text-based PDE-solving skills covering 10 equation families, and (3) adaptive tuning via pilot solves to calibrate resolution and tolerances. This mirrors how human numerical analysts work — first understanding the problem structure, then selecting appropriate methods, then tuning parameters.
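The adaptive tuning stage can be illustrated with a toy resolution-calibration loop. The sketch below is an assumption-laden stand-in, not AutoPDE's procedure: it uses a 1D finite-difference Poisson solve in pure Python instead of a FEniCSx finite element solve, and the function names (`solve_poisson_1d`, `midpoint_value`, `tune_resolution`) and the doubling rule are hypothetical. It shows only the general pattern of running cheap pilot solves and refining until successive solutions agree within a tolerance or the budget is exhausted.

```python
import math


def solve_poisson_1d(n):
    """Pilot solve of -u'' = pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0.

    Uses second-order finite differences on n intervals and the Thomas
    algorithm for the tridiagonal system (diagonal 2, off-diagonals -1).
    Returns the values at the n - 1 interior nodes.
    """
    h = 1.0 / n
    m = n - 1
    rhs = [h * h * math.pi ** 2 * math.sin(math.pi * (i + 1) * h) for i in range(m)]
    c_prime = [0.0] * m
    d_prime = [0.0] * m
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs[0] / 2.0
    for i in range(1, m):
        denom = 2.0 + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom
    u = [0.0] * m
    u[-1] = d_prime[-1]
    for i in range(m - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return u


def midpoint_value(n):
    """Value of the discrete solution at x = 0.5 (n must be even)."""
    return solve_poisson_1d(n)[n // 2 - 1]


def tune_resolution(tol, n_start=4, n_max=1024):
    """Double the resolution until two successive pilot solves agree within tol."""
    n = n_start
    coarse = midpoint_value(n)
    while 2 * n <= n_max:
        fine = midpoint_value(2 * n)
        estimate = abs(fine - coarse)  # change between successive resolutions
        n, coarse = 2 * n, fine
        if estimate < tol:
            return n, coarse, estimate
    raise RuntimeError("resolution budget exhausted before reaching tolerance")


n_chosen, value, estimate = tune_resolution(tol=1e-3)
```

For this problem the exact solution is sin(pi x), so the midpoint value should approach 1 as the loop refines. A real tuning stage would also have to weigh runtime against accuracy, which this sketch reduces to a cap on the resolution.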
The experimental evaluation on PDE Agent Bench (191 cases, 8 equation families) is reasonably comprehensive. Several design choices strengthen the evaluation:
However, there are methodological concerns. The benchmark uses a single backend (FEniCSx/dolfinx), which limits generalizability claims. The pass rate of 54.5% is substantially better than the baselines, but 45.5% of cases still fail. The CodePDE baseline is run in a pared-down configuration (only 3 samples, no debugging/refinement enabled), which may not represent its full capability. Additionally, the evaluation uses a single run per case and reports no confidence intervals or statistical significance tests.
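To give a sense of how much uncertainty a single run leaves, the sketch below computes a Wilson score interval for the reported pass rate. The count of 104 passing cases is an assumption inferred from 54.5% of 191 (104/191 is about 54.45%); the paper's exact count is not stated here. Treating cases as independent Bernoulli trials is also an assumption, and the interval reflects only sampling over cases, not run-to-run variation of the agent.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    center = (p_hat + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials ** 2)) / denom
    return center - half, center + half


# Assumed count: 104 of 191 cases passing, consistent with the reported 54.5%.
low, high = wilson_interval(104, 191)
print(f"95% Wilson interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 47% to 61%, about seven percentage points on either side of the point estimate, which is why repeated runs or reported intervals would help when comparing against baselines.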
Near-term impact: The paper addresses a genuine pain point — making PDE solver construction more accessible to non-specialists while maintaining numerical reliability. The strategy-card protocol is simple, interpretable, and could be adopted by other agent frameworks.
Broader implications: The "explicit strategy representation" paradigm extends beyond PDEs. Any domain where AI code generation must make coupled technical decisions (compiler optimization, database query planning, signal processing pipeline design) could benefit from separating strategy from implementation.
Practical limitations on impact: The system currently handles only FEM-based solvers through FEniCSx. Many real-world PDE problems require finite differences, spectral methods, or meshless approaches. The 10-family skill library, while covering common cases, is fixed and would need expansion for production use.
This work arrives at a critical juncture. LLM-based code generation is rapidly maturing, and scientific computing is an obvious but challenging application domain. The paper correctly identifies that the gap between "code that compiles" and "numerically sound solver" is fundamentally about strategy, not syntax. This distinction becomes more important as LLMs become better at writing syntactically correct code — the remaining failures will increasingly be numerical in nature.
The benchmark (PDE Agent Bench) also appears to be relatively new, and this paper helps establish performance standards on it.
The paper's motivating example (Zalesak disk, Figure 3) effectively demonstrates that algorithm choice dominates mesh refinement, but this is well known in numerical methods; the contribution lies in automating this insight, not discovering it. The Poisson results (Appendix D), which show AutoPDE underperforming simpler agents on easy problems (84% vs 98%), suggest that strategy formulation carries an overhead that may be unnecessary for straightforward problems.
Generated Jun 10, 2026
Paper 1 presents a fundamental conceptual shift in AI interpretability and safety by bypassing traditional explanations to directly forecast model behavior via a learned task. As Large Reasoning Models become ubiquitous, establishing trust and predicting their behavior on novel inputs is a critical bottleneck. This approach has broad implications across the entire AI ecosystem. Paper 2 is highly valuable for AI-driven scientific computing (AI4Science), but its impact is more narrowly focused on computational physics and engineering, making Paper 1 more broadly impactful across diverse domains relying on foundation models.
Paper 2 (AutoPDE) likely has higher impact: it targets a central, widely used scientific computing primitive (PDE solvers) with clear real-world engineering applications and measurable gains on a benchmark. Explicitly representing solver strategies is a novel, checkable abstraction that can improve reliability and interpretability of agentic coding, with potential uptake across computational physics, engineering, and applied math. Paper 1 is timely for AI-for-science, but its contribution is more conceptual and evaluation-heavy in a narrower “discovery agent” setting, with less immediate downstream deployment compared to robust PDE solving.
Paper 1 introduces a fundamentally novel self-supervised RL framework (OT-GRPO) that improves spatial reasoning in LRMs without ground-truth labels, challenging the dominant SFT paradigm. Its consistency-based verification approach is broadly applicable across reasoning tasks and model architectures. The optimal transport-based RL strategy is methodologically innovative. Paper 2, while practically useful, is more incremental—adding explicit strategy representation to LLM-based PDE solvers. Paper 1's insights about latent capabilities in pre-trained models and label-free alignment have broader implications for the AI/ML community, whereas Paper 2 targets a narrower computational science audience.
AutoPDE addresses a fundamental gap in LLM-based scientific computing by explicitly representing solver strategies for PDEs, enabling systematic debugging and revision. This bridges AI agents with core computational science infrastructure, potentially transforming how PDEs are solved across physics, engineering, and applied math. Paper 2 (HyperLoRA) makes solid but more incremental contributions to federated LoRA fine-tuning, addressing known aggregation biases. While useful, it operates in a more crowded research space with many competing federated learning methods, limiting its relative novelty and breadth of impact.
Paper 2 (SIFT) addresses a critical bottleneck in modern AI: RAG latency and KV cache memory limits. By reducing storage by 24,000x and accelerating time-to-first-token by 1.71x with minimal accuracy loss, it offers immediate, widespread infrastructural impact for all LLM deployments. While Paper 1 presents an innovative LLM agent for PDEs, its impact is largely constrained to computational sciences, and its moderate pass rate (54.5%) indicates the method is still in its early stages. Paper 2's fundamental optimization of attention mechanisms provides broader and more immediate real-world utility.
Paper 2 likely has higher scientific impact: it introduces a timely, broadly relevant benchmark for control/intervention awareness, which is central to AI safety, governance, and deployment of frontier LLMs across many applications. Benchmarks often become standard evaluation tools, shaping research agendas and operational practices. Its multi-domain design and empirical evaluation across 11 models support methodological rigor and immediate real-world applicability for monitoring and control protocols. Paper 1 is innovative and useful for scientific computing, but its impact is narrower (PDE-solver agent tooling) and more dependent on adoption within a specific technical community.
AutoPDE addresses a fundamental challenge in scientific computing—automating PDE solving—with a novel architecture that explicitly separates solver strategy from code generation. This has broad applications across science and engineering, introduces a methodologically rigorous framework with reusable skills and adaptive tuning, and demonstrates significant improvement over baselines. Paper 2, while introducing a useful benchmark for office automation, addresses a narrower application domain with less scientific novelty, primarily documenting LLM limitations rather than proposing a transformative solution. AutoPDE's impact spans computational science, AI-for-science, and numerical methods research.
Paper 2 (AutoPDE) is more novel and broadly impactful: it introduces an explicit, inspectable “solver strategy” representation for agentic PDE solving, enabling principled revisions based on numerical evidence rather than ad‑hoc code edits. PDE solvers are foundational across science/engineering, so improvements generalize widely and have clear real-world applications. It also reports a substantial benchmark gain (+14.2 pp) with an articulated methodology (analysis, method selection, adaptive tuning). Paper 1 is mainly a replication/diagnostic study of PlanGPT with limited innovation and narrower downstream impact.
Paper 2 presents a fundamental theoretical framework for causal inference and world models, addressing a structural limitation in current predictive models regarding counterfactual couplings. Its mathematical formulation as a coupling kernel offers profound implications for AGI and causal ML. While Paper 1 provides a highly practical and useful LLM agent for PDE solving, Paper 2's theoretical breakthroughs in unidentifiable quantities and counterfactual bounds promise a broader, longer-lasting impact across the foundational AI and causal inference communities.
Paper 1 is likely higher impact: it introduces a concrete, novel agent architecture (explicit, revisable solver-strategy object) that directly targets reliability in scientific computing, with clear real-world applications across engineering and physics workflows. The methodology includes a staged pipeline and benchmarked gains (+14.2 pp pass rate) on a dedicated PDE benchmark, supporting rigor and measurable progress. Its impact spans ML agents, numerical analysis, and computational science, and it is timely given growing interest in trustworthy LLM-based coding/automation. Paper 2 is insightful for interpretability, but is more diagnostic and narrower in immediate application.