Iteris: Agentic Research Loops for Computational Mathematics

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

Jun 1, 2026

arXiv:2606.02484v1

cs.AI (primary), cs.LG

#891 of 3355 · Artificial Intelligence

Tournament Score: 1453 ± 43 (scale 1050–1800)

Win Rate: 68%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Abstract

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Iteris: Agentic Research Loops for Computational Mathematics

Core Contribution

Iteris is an explore–plan–execute agentic loop system tailored to computational mathematics research, where progress requires coordinating numerical experimentation, adversarial constructions, algorithm design, and proof development. The paper applies the system to two open problems from a curated Simons Workshop collection and reports two verified mathematical results:

1. Theorem 1: A fixed-parameter phase diagram characterizing when conjugate gradient (CG) or randomized coordinate descent (RCD) has asymptotic advantage on power-law spectra, parameterized by spectral decay rate *p* and error threshold *ε*.

2. Theorem 2: A counterexample showing QRCP can fail to select well-conditioned submatrices even under bounded coherence, with a Lean 4 formalization.
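
For readers less familiar with QRCP, the column-selection rule that Theorem 2 concerns can be sketched as follows. This is a minimal standard-library illustration of the classical greedy pivot rule (choose the column with the largest residual norm, then project it out of the remaining columns). The function name `qrcp_pivots` and the toy matrix are assumptions introduced here for illustration; the sketch does not reproduce the paper's counterexample or its coherence condition.

```python
import math

def qrcp_pivots(A, k):
    """Return the indices of k columns chosen by greedy column pivoting.

    A is a dense matrix given as a list of rows. At each step the column
    with the largest residual 2-norm is selected and its direction is
    removed from all columns that have not been selected yet.
    """
    m, n = len(A), len(A[0])
    # Work on a column-wise copy so the caller's matrix is left unchanged.
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    chosen = []
    for _ in range(k):
        norms = [sum(x * x for x in c) for c in cols]
        for j in chosen:
            norms[j] = -1.0  # mask columns that were already selected
        j = max(range(n), key=lambda t: norms[t])
        if norms[j] <= 0.0:
            break  # nothing left to select (all columns used or zero)
        chosen.append(j)
        nrm = math.sqrt(norms[j])
        q = [x / nrm for x in cols[j]]
        for t in range(n):
            if t in chosen:
                continue
            dot = sum(a * b for a, b in zip(q, cols[t]))
            cols[t] = [a - dot * b for a, b in zip(cols[t], q)]
    return chosen

# Toy 2 x 3 matrix (hypothetical): squared column norms are 9, 1, 5.
A = [[3.0, 0.0, 1.0],
     [0.0, 1.0, 2.0]]
print(qrcp_pivots(A, 2))  # [0, 2]
```

The question settled by Theorem 2 is whether the submatrix formed by such greedily chosen columns is always well conditioned under bounded coherence; according to the paper, it is not.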

The key novelty is twofold: (a) the system architecture separating exploration, planning, and execution agents with file-based memory, and (b) the demonstration that such a system can contribute meaningfully to resolving genuine open problems rather than benchmarks.

Methodological Rigor

System design: The explore–plan–execute protocol is cleanly specified. The separation of the exploration agent from the plan agent to avoid "route inertia" is a thoughtful architectural choice. The four execution agent types (foundation, experiment, proof, review) map naturally to computational mathematics workflows. However, the system description remains largely conceptual—there is limited quantitative evaluation of the system itself (no ablations, no comparison of agent configurations, no metrics on iteration counts vs. progress).
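
To make the control flow concrete, the protocol described above might be sketched roughly as below. This is a hypothetical illustration only: the function names (`explore`, `plan`, `execute`, `run_loop`), the `memory.json` file, and the round-robin choice of execution agent are all assumptions made here, and the stubs stand in for what would be model calls in a real system. It is not the authors' implementation.

```python
import json
import tempfile
from pathlib import Path

# The four execution agent types named in the review.
AGENT_TYPES = ("foundation", "experiment", "proof", "review")

def explore(memory):
    """Stub exploration agent: proposes candidate directions (hypothetical)."""
    return [f"direction-{len(memory['log'])}"]

def plan(memory, ideas):
    """Stub plan agent, kept separate from exploration (hypothetical).

    Picks one idea and assigns it to an execution agent type; the
    round-robin assignment is a placeholder, not the paper's policy.
    """
    agent = AGENT_TYPES[len(memory["log"]) % len(AGENT_TYPES)]
    return {"idea": ideas[0], "agent": agent}

def execute(task):
    """Stub execution agent: returns a record of the finished task."""
    return {"task": task, "status": "done"}

def run_loop(workdir, rounds):
    """Run explore -> plan -> execute, persisting state to a file each round."""
    mem_path = Path(workdir) / "memory.json"
    if mem_path.exists():
        memory = json.loads(mem_path.read_text())
    else:
        memory = {"log": []}
    for _ in range(rounds):
        ideas = explore(memory)
        task = plan(memory, ideas)
        memory["log"].append(execute(task))
        # File-based memory: later rounds (or later runs) read this back.
        mem_path.write_text(json.dumps(memory, indent=2))
    return memory

with tempfile.TemporaryDirectory() as workdir:
    state = run_loop(workdir, rounds=4)
    print([entry["task"]["agent"] for entry in state["log"]])
```

Keeping the state in a file rather than in a single conversation is what lets a separate planning step re-read the full history each round, which is one plausible reading of how the design avoids "route inertia".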

Mathematical results: The mathematical content is substantial and rigorous. Theorem 1 involves delicate asymptotic analysis combining random matrix theory, moment problems, Hankel matrices, and Schur complements across multiple spectral regimes. The proof spans ~25 pages with carefully stated lemmas. Theorem 2 provides an explicit construction with five named lemmas culminating in the obstruction. The Lean 4 formalization of Theorem 2 adds significant verification confidence.

Transparency about human involvement: The paper is notably honest about the human role. The CG problem had an unjustified assumption that required human-detected repair. The QRCP proof required substantial reorganization. The rate bounds were initially overstated and corrected through human–AI interaction. This transparency is commendable and scientifically important.

Potential Impact

Mathematical contributions: Both theorems address problems from a well-known open problem collection, giving them immediate relevance to the numerical linear algebra community. The CG/RCD phase diagram is a nuanced result that goes beyond simple condition-number comparisons. The QRCP counterexample settles a natural question about a classical algorithm.
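
As a reference point for what a "simple condition-number comparison" looks like, the sketch below evaluates two textbook worst-case bounds on a power-law spectrum: the Chebyshev-type bound for CG, and the expected-error bound for RCD with sampling proportional to the diagonal. The spectrum normalization `lambda_i = i**(-p)` and all function names are assumptions made here for illustration. These classical bounds are not the paper's Theorem 1, which the review describes as going beyond this kind of comparison.

```python
import math

def power_law_spectrum(n, p):
    """Eigenvalues lambda_i = i**(-p), i = 1..n (assumed normalization)."""
    return [i ** (-p) for i in range(1, n + 1)]

def cg_iterations_bound(spectrum, eps):
    """Iterations k so that 2 * rho**k <= eps, with
    rho = (sqrt(kappa) - 1) / (sqrt(kappa) + 1) (classical A-norm bound)."""
    kappa = max(spectrum) / min(spectrum)
    rho = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)
    if rho == 0.0:
        return 1  # perfectly conditioned: one step suffices
    return math.ceil(math.log(2 / eps) / -math.log(rho))

def rcd_steps_bound(spectrum, eps):
    """Coordinate steps k so that (1 - lambda_min / trace)**k <= eps**2,
    i.e. the expected squared A-norm error drops below eps**2."""
    rate = 1 - min(spectrum) / sum(spectrum)
    if rate == 0.0:
        return 1
    return math.ceil(2 * math.log(1 / eps) / -math.log(rate))

n, eps = 100, 1e-6
for p in (1, 2):
    spec = power_law_spectrum(n, p)
    cg = cg_iterations_bound(spec, eps)
    rcd = rcd_steps_bound(spec, eps)
    # One CG iteration needs a full matrix-vector product, while one RCD
    # step touches a single row, so n RCD steps are counted as one "epoch".
    print(f"p={p}: CG iterations <= {cg}, RCD epochs <= {rcd / n:.1f}")
```

Both bounds depend on the spectrum only through its extremes and its trace, which is why they cannot capture finer regime boundaries in the decay rate *p* and threshold *ε*.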

AI-for-mathematics: This paper provides one of the strongest demonstrations to date that agentic AI systems can contribute to *computational* mathematics research (as opposed to pure mathematics or competition problems). The contrast with direct GPT-Pro queries is informative—showing that the structured agentic loop adds value beyond a single model call—though the comparison is informal.

Broader research automation: The file-based memory and structured message-passing design could influence future agentic research systems. The idea that failed proof attempts can be systematically converted into counterexample constructions (as in the QRCP case) is a valuable methodological insight.

Timeliness & Relevance

This work is highly timely. The AI-for-mathematics space is rapidly evolving (FunSearch, AlphaEvolve, AI Co-Mathematician, Aletheia), but most systems target either competition problems, combinatorial optimization, or pure mathematics. Computational mathematics—requiring tight integration of numerical experiments and proofs—has been underserved. The paper fills this gap at a moment when the community is actively debating what AI systems can and cannot contribute to mathematical research.

Strengths

1. Real open problems solved: Unlike many AI-for-science papers evaluated on benchmarks, this paper addresses genuine open problems and produces verified theorems.

2. Mathematical depth: The appendices contain serious mathematics—this is not a superficial demonstration but produces results that would be publishable on mathematical merit alone.

3. Honest human-AI delineation: The paper carefully specifies what the system contributed versus what required human intervention, including explicit discussion of errors the system made.

4. Lean formalization: Formal verification of Theorem 2 adds a layer of rigor unusual in this genre.

5. Trajectory analysis: The detailed trajectory analyses (Figures 2-3) provide valuable insight into how the system actually works in practice.

Limitations

1. N=2 case studies: Two problems is a very small sample. It is unclear how well Iteris generalizes to other computational mathematics problems, or what its failure rate is. Were other problems attempted unsuccessfully?

2. System evaluation gaps: No ablation studies, no quantitative metrics on agent contributions, no comparison with alternative architectures. The system claims are supported only by existence proofs (two successes).

3. Reproducibility concerns: The system uses GPT-5.5 via OpenAI Codex, making exact reproduction impossible. The agent skills are described at a high level without full specification.

4. Human effort quantification: While human involvement is acknowledged, it is not quantified. How many hours of expert time were required? How does this compare to solving the problems without AI assistance?

5. Selection bias: The paper presents only successes. Understanding the failure modes and limitations would be equally valuable.

6. Limited novelty in system design: The explore–plan–execute pattern, while well-adapted, draws heavily on existing agentic workflow ideas (ReAct, AI Scientist). The architectural innovation is incremental.

Overall Assessment

This paper makes a credible and well-presented case that agentic AI systems can contribute to computational mathematics research. The mathematical results are genuine and non-trivial. The main limitation is the small sample size and lack of systematic evaluation of the system itself. The paper is more convincing as a proof of concept than as a rigorous systems evaluation, but the proof of concept is compelling. The combination of real mathematical contributions, honest reporting, and formal verification sets a good standard for the emerging field of AI-assisted mathematical research.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Generated Jun 2, 2026

Comparison History (22)

vs. Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

claude-opus-4.6 · 6/3/2026

Paper 1 demonstrates AI systems contributing to solving genuine open mathematical research problems, producing verified novel results (a phase diagram and a counterexample) on problems from a Simons Workshop collection. This represents a significant milestone in AI-assisted scientific discovery with broad implications across computational mathematics. Paper 2, while methodologically sound, addresses the more incremental question of how LLMs handle novel APIs—an important but narrower engineering contribution within the well-explored space of LLM code generation benchmarks. Paper 1's novelty in agentic research workflows for open problems has greater potential to reshape scientific practice.

vs. AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) and diagnostics (MemProbe) for continual learning in language agents, a timely and broadly relevant area as agents become widely deployed. Benchmarks and metrics can become community standards, influencing many subsequent methods across NLP, agent systems, and ML evaluation, with clear real-world implications for long-lived assistants. Paper 1 is novel and exciting, showing agentic AI aiding computational mathematics on two open problems, but its impact may be narrower (computational math) and currently depends on expert correction/validation, limiting immediate scalability.

vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact: it introduces a broadly applicable RL optimization framework (EAPO) targeting a timely, general problem in agentic systems—tool abuse—and demonstrates gains across nine benchmarks and multiple widely used open LLMs, suggesting strong reproducibility and immediate practical relevance. Its contributions can transfer across domains wherever tool-augmented agents are deployed. Paper 1 is novel and exciting but is narrower (computational mathematics case studies) and relies on expert correction for verified results, which may limit near-term generalizability despite high interest.

vs. EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

claude-opus-4.6 · 6/3/2026

Iteris demonstrates higher scientific impact by producing verified new mathematical results on open research problems—a phase diagram for CG vs. randomized coordinate descent and a counterexample for QR factorization with column pivoting. These are concrete contributions to computational mathematics that advance fundamental knowledge. While EvoDrive is a solid engineering contribution to autonomous driving testing, it primarily improves an existing pipeline (scenario generation) with incremental methodology. Iteris opens a new paradigm of AI-assisted mathematical discovery on genuinely open problems, with broader cross-disciplinary implications and higher novelty.

vs. SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

gemini-3.1 · 6/2/2026

Paper 2 demonstrates direct, verifiable scientific discovery by using an AI agent to solve open problems in computational mathematics. While Paper 1 addresses a critical AI safety challenge, Paper 2's empirical success in generating novel mathematical results and counterexamples showcases a paradigm shift in how AI can actively participate in and accelerate human research workflows, yielding a more profound and immediate scientific impact.

vs. From Capability Models to Automated Planning: An AAS-Native Approach for Automatic PDDL Generation

gemini-3.1 · 6/2/2026

Paper 1 presents a novel AI agent system that successfully contributes to solving open problems in computational mathematics. Demonstrating AI-assisted scientific discovery in fundamental mathematics offers a transformative and highly visible impact across AI and math communities. In contrast, Paper 2 focuses on a narrower, application-specific translation tool for industrial engineering standards, which, while practically useful, has less potential for broad scientific paradigm shifts.

vs. CAPF: Guiding Search-Agent Rollouts with Credit-Attenuated Privileged Feedback

gpt-5.2 · 6/2/2026

Paper 2 likely has higher scientific impact: CAPF is a broadly applicable training mechanism for RLVR search agents that addresses a common bottleneck (sparse successful rollouts) and shows quantitative gains across multiple QA benchmarks, suggesting better generality and reproducibility. Its methodological contribution (privileged feedback with credit attenuation to enable deployment without feedback) can transfer across tasks and agent architectures, making it timely for current LLM-agent training. Paper 1 is innovative and compelling but demonstrated on two domain-specific computational math case studies with heavy expert validation, limiting immediate breadth and rigor of generalization.

vs. Coordination Graphs for Constrained Multi-Agent Reinforcement Learning

claude-opus-4.6 · 6/2/2026

Paper 2 demonstrates a novel paradigm where agentic AI systems contribute to solving genuinely open mathematical research problems, producing verified new results (phase diagrams, counterexamples). This has broader impact across mathematics and AI research methodology, representing a qualitative shift in how computational mathematics research can be conducted. Paper 1, while technically solid, is an incremental contribution combining known techniques (coordination graphs, Lagrangian duality, Max-Sum) for constrained MARL, with impact limited primarily to the multi-agent RL community. Paper 2's timeliness and cross-disciplinary relevance give it higher impact potential.

vs. SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes

gemini-3.1 · 6/2/2026

Paper 1 demonstrates a highly novel application of AI in scientific discovery, directly contributing to solving open research-level problems in computational mathematics. While Paper 2 presents a valuable benchmark for smart home agents, Paper 1 represents a more significant leap in AI capabilities, showing how agentic loops can generate novel proofs and mathematical constructions, which has profound implications for the future of AI-assisted scientific and mathematical research.

vs. Absorbing Complexity: An Interaction-Native Knowledge Harness for Financial LLM Agents

claude-opus-4.6 · 6/2/2026

Paper 1 demonstrates higher scientific impact by applying AI agents to solve genuine open research problems in computational mathematics, producing verified novel mathematical results (a phase diagram and a counterexample) that advance the field. It addresses a less-explored but fundamental area—AI-assisted mathematical discovery—with concrete contributions to open problems from a recognized workshop. Paper 2, while technically solid, addresses a more incremental engineering contribution in financial AI agent architecture, evaluated only on synthetic benchmarks, limiting its broader scientific significance and real-world validation.

vs. Beyond One-shot: AI Agents for Learning in Field Experiments

gemini-3.1 · 6/2/2026

Paper 1 demonstrates higher potential scientific impact due to its massive scale (nearly 700,000 patient visits) and direct real-world application in healthcare. While Paper 2 presents a valuable tool for computational mathematics, Paper 1 introduces a broadly applicable framework for cumulative AI-driven experimental design. By transforming A/B testing from a one-shot evaluation into an automated, continuous learning system, Paper 1 offers immense cross-disciplinary utility for behavioral science, healthcare, and tech industries, backed by exceptional methodological rigor and large-scale empirical validation.

vs. Efficient Test-time Inference for Generative Planning Models

gemini-3.1 · 6/2/2026

Paper 2 demonstrates an agentic AI system that successfully contributes to solving open research problems in computational mathematics, yielding novel, verified mathematical results. This direct application of AI to scientific discovery represents a paradigm shift with broader implications across STEM fields. Paper 1 offers valuable algorithmic improvements for generative planning models, but its impact is more narrowly focused on computational efficiency rather than producing novel scientific knowledge.

vs. Emergent Ordinal Geometry in Transformers Trained on Local Comparisons

gemini-3.1 · 6/2/2026

Paper 1 demonstrates a paradigm shift in scientific discovery by showing AI agents successfully contributing to open, research-level problems in computational mathematics. The practical generation of novel, verified mathematical results indicates a high potential to broadly transform research workflows, offering more immediate and tangible real-world applications than the theoretical, synthetic-task focus of Paper 2.

vs. Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

gpt-5.2 · 6/2/2026

Paper 1 likely has higher scientific impact: it introduces substantial community infrastructure (a high-fidelity benchmark plus a million-scale dataset), directly targets a known failure mode in embodied VLM planning (token prediction vs. causal/next-state reasoning), and reports systematic evaluations, generalization, and a scaling law. This combination is timely for robotics/embodied AI and can influence model training, evaluation standards, and downstream autonomy across multiple labs. Paper 2 is compelling but rests on limited case studies and narrower domain scope; impact may be significant within computational mathematics workflows but less broad and less benchmark-driven.

vs. MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

gemini-3.1 · 6/2/2026

Paper 2 demonstrates an AI system making tangible progress on open research problems in computational mathematics, yielding actual new mathematical results. While Paper 1 provides a valuable benchmark, the ability of an agentic system to directly contribute to open scientific discovery represents a more profound shift in research methodology and has broader implications for the future of scientific workflows.

vs. "Skill issues": data-centric optimization of lakehouse agents

gpt-5.2 · 6/2/2026

Paper 2 has higher potential impact: it demonstrates an agentic system contributing to verified progress on two open problems in computational mathematics, yielding concrete, field-relevant results (a phase diagram and a counterexample). This is novel in targeting research loops (experimentation + construction + proof drafting) rather than task automation, with clear real-world applicability to numerical linear algebra and optimization communities and broader implications for AI-assisted scientific discovery. While Paper 1 is valuable for data/agent infrastructure, its evidence is preliminary (25 tasks) and more domain-specific to a particular lakehouse setting.

vs. Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

gemini-3.1 · 6/2/2026

Paper 1 demonstrates an AI system actively contributing to solving real, open problems in computational mathematics, yielding novel, verified mathematical results (a new phase diagram and a counterexample). This represents a direct and significant breakthrough in AI-driven scientific discovery. While Paper 2 offers a valuable methodological improvement for LLM reasoning via MaxSAT solvers, Paper 1's achievement of generating new scientific knowledge gives it a higher potential for broad scientific impact.

vs. Generating Graph-like Rules for Knowledge Graph Reasoning via Diffusion Models

claude-opus-4.6 · 6/2/2026

Iteris demonstrates a novel paradigm of agentic AI systems contributing to open research problems in computational mathematics, producing verified new results on two open problems from a Simons Workshop. This represents a significant advance in AI-assisted mathematical discovery with broad implications for how research is conducted. Paper 2, while solid, makes an incremental contribution to knowledge graph reasoning by extending rule mining to graph-like structures using diffusion models—a more narrowly scoped contribution in a well-explored area. The paradigm-shifting potential of AI agents meaningfully participating in mathematical research gives Paper 1 higher impact.

vs. Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

gemini-3.1 · 6/2/2026

Paper 2 demonstrates a profound real-world scientific impact by using an AI agent to tackle and contribute to open problems in computational mathematics, resulting in verified, novel discoveries. While Paper 1 offers a useful methodological improvement for multi-agent LLM systems, Paper 2 represents a significant leap in AI-driven scientific research and discovery, which carries broader and more groundbreaking implications for the scientific community.

vs. POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

gemini-3.1 · 6/2/2026

Paper 1 addresses a critical, universal bottleneck in deploying Multi-Agent Systems: safety and failure detection. By decentralizing oversight and utilizing the agents themselves, it offers a scalable solution for AI reliability that directly addresses emerging regulatory requirements. Its broad applicability across any safety-critical domain gives it a wider potential scientific and real-world impact compared to Paper 2, which, despite impressive concrete results, is relatively narrowly focused on computational mathematics.