Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li

May 22, 2026

arXiv:2605.23109v1 PDF

cs.AI(primary)cs.DCcs.LOcs.PL

#77of 2682·Artificial Intelligence

#77 of 2682 · Artificial Intelligence

Tournament Score

1560±44

10501800

57%

Win Rate

Wins

Losses

Matches

Rating

8.2/ 10

Significance8.5

Rigor8

Novelty8

Clarity8.5

Tournament Score

1560±44

10501800

57%

Win Rate

Wins

Losses

Matches

Rating

8.2/ 10

Significance

Rigor

Novelty

Clarity

Abstract

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Inductive Deductive Synthesis (IDS)

1. Core Contribution

IDS introduces a fundamentally different approach to AI-driven formal verification by jointly and incrementally synthesizing both implementation and proof, rather than treating verification as a post-hoc check on completed code. The key insight is that partial proofs can be mechanically checked by Rocq's type-checker at every intermediate step, providing a formally grounded "chain-of-thought" where each state is verified. The system comprises two interacting components: a Deductive Synthesis Agent (DSA) that incrementally extends code and proof, and an Inductive Synthesis Agent (ISA) that learns from failures to propose new strategies (tactical via a "proposer" role, or strategic via a "reloader" role). This two-level architecture effectively transforms verified system construction into a guided search problem with an exact oracle.

The paper demonstrates IDS on 7 distributed key-value-store consistency specifications (causal consistency, read-your-writes, monotonic reads/writes, and compositions), achieving 7/7 completions where state-of-the-art coding agents (Codex/GPT-5.4 and Claude Code/Opus 4.6) manage only 2/7 each. The system also achieves SOTA on seven cross-language verification benchmarks.

2. Methodological Rigor

The evaluation is thorough and multi-dimensional:

Correctness: All outputs are kernel-checked by Rocq with no `Admitted`, `Axiom`, or `Parameter` remaining — this is an unimpeachable correctness standard.

Baselines: Comparisons against two leading commercial coding agents under identical prompts and budgets isolate the architectural contribution.

Ablations: Systematic removal of each component (joint discovery, proposer, reloader, audit, verification feedback, performance feedback) demonstrates that each contributes meaningfully, with verification feedback being the most consequential single component (-35% on VERINA when removed).

Performance evaluation: Extracted implementations are benchmarked on a real 5-VM cluster with multiple metrics (throughput, p99 latency, memory, scaling), not just correctness.

Cross-language generalization: SOTA results across Dafny, Verus, Lean, and Rocq benchmarks demonstrate the technique is not overfitted to one proof assistant.

One concern is that only 3 independent runs per specification are reported, though the authors note ≤7% relative std. The "best-of-N=100" experiment (Appendix C) for implementation-only synthesis is a valuable control, showing that even 100 unverified candidates cannot match verified synthesis.

3. Potential Impact

Immediate impact: This work could dramatically lower the barrier to formal verification of distributed systems, a domain where correctness bugs have severe real-world consequences (data loss, security breaches). The ~200× speedup over expert effort (hours vs. months) for Chapar CC is striking.

Broader implications:

The technique generalizes to any domain with a machine-checkable partial-proof oracle, as demonstrated by cross-language results. The authors explicitly mention OS kernels, compilers, cryptographic protocols, and hardware as future targets.

The joint code+proof synthesis paradigm could influence how the formal methods community thinks about automation — shifting from "prove existing code" to "co-discover code and proof."

The performance feedback loop is novel: IDS doesn't just find *a* correct implementation but searches for *efficient* ones, achieving up to 3× throughput over hand-written expert references.

Benchmark contribution: The IDS suite of 6 new Rocq specifications for distributed KV-store consistency models fills a gap in verification benchmarks.

4. Timeliness & Relevance

This paper sits at the intersection of two major trends: (1) the rapid improvement of LLM coding agents and (2) growing concern about the reliability of AI-generated code. The observation that SOTA agents fail on 5/7 formal verification tasks despite excelling at standard coding benchmarks is timely and important — it concretely demonstrates the limits of test-based approaches for safety-critical systems.

The "specification bottleneck" the authors identify (writing formal specs remains hard) is an honest assessment that positions this as step one in a broader research program. The community is actively seeking ways to make formal methods practical; IDS provides a concrete, working solution for the synthesis half of the problem.

5. Strengths & Limitations

Key Strengths:

The partial-proof oracle idea is elegant and well-exploited — it provides exact feedback at every step, fundamentally different from test-based or score-based signals.

The ISA's two-role design (tactical proposer vs. strategic reloader) maps naturally to how human experts work, and ablations confirm both roles matter.

The audit step catches non-trivial failure modes (vacuous proofs, always-false guards) that would otherwise silently pass the type-checker.

Artifact sizes are substantial: Chapar CC produces 3,037 lines / 79 lemmas; IDS suite CC produces 3,807 lines / 121 lemmas — these are real proofs, not toy examples.

Open-source release with benchmark specifications enables reproducibility.

Limitations:

The evaluation covers only distributed KV stores. While cross-language benchmarks show generality for simpler tasks, the claim that IDS extends to "OS kernels, compilers, cryptographic protocols" is unsubstantiated.

The specifications evaluated, while non-trivial, do not cover fault recovery, reconfiguration, or scaling — aspects critical to production systems.

The reliance on frontier commercial LLMs (GPT-5.4, Opus 4.6) raises reproducibility concerns as these models evolve.

Cost ($106/spec average) and time (6.8h average) are practical but not negligible; the approach may not scale to specifications orders of magnitude larger.

The comparison is primarily against general-purpose coding agents rather than specialized formal verification tools (though the cross-language benchmarks partially address this).

Additional Observations

The paper's framing as "shifting vibe coding to verified coding" is compelling but slightly oversold — the specification must still be written by an expert, and this is acknowledged as the main bottleneck. The analogy to chain-of-thought with formally verified intermediate states is insightful and could inspire related work in other domains requiring certified reasoning.

Rating:8.2/ 10

Significance 8.5Rigor 8Novelty 8Clarity 8.5

Generated May 25, 2026

Comparison History (21)

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gemini-3.15/25/2026

Paper 2 proposes a fundamental theoretical unification of Bayesian inference, game theory, and thermodynamics, offering profound implications across physics, biology, neuroscience, and AI. While Paper 1 presents a highly practical and timely advancement in AI-driven formal verification, Paper 2's potential to reshape our theoretical understanding of collective intelligence and multi-agent systems across diverse scientific disciplines gives it a broader and deeper long-term scientific impact.

vs. AI scientists produce results without reasoning scientifically

gemini-3.15/25/2026

Paper 2 provides a crucial, large-scale critique of AI agents in scientific research, exposing fundamental flaws in their reasoning processes. While Paper 1 offers a significant technical breakthrough in a specific domain (formal verification), Paper 2's findings have broad, cross-disciplinary implications for the entire burgeoning field of AI-driven science. By demonstrating that current AI models fail at basic epistemic norms, it challenges widespread assumptions and identifies a critical new direction for AI research, giving it higher potential impact across the broader scientific community.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

claude-opus-4.65/25/2026

Paper 1 addresses a fundamental challenge in materials science and chemistry—efficient structure prediction across high-dimensional energy landscapes—by unifying generative models with physics-based search. Its >10x efficiency gains, ability to discover metastable structures, and generalization beyond training data have broad impact across molecular discovery, materials design, and drug development. Paper 2 is impressive for formal verification of distributed systems but targets a narrower domain. Paper 1's framework bridges machine learning and physical sciences with broader interdisciplinary impact and addresses a more universally impactful scientific problem.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

claude-opus-4.65/25/2026

Paper 1 presents a broadly applicable paradigm (machine collective intelligence) for autonomous scientific equation discovery across diverse domains, combining symbolic AI and metaheuristics. Its breadth of impact spans virtually all empirical sciences, offering orders-of-magnitude improvements in extrapolation and interpretability over deep learning. While Paper 2 makes an impressive contribution to formal verification of distributed systems—a significant but narrower domain—Paper 1's potential to transform how scientific laws are discovered gives it greater cross-disciplinary impact and foundational significance for AI-driven science.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

gemini-3.15/25/2026

While Paper 1 significantly advances automated formal verification, Paper 2 presents a multimodal foundation model for biology that bridges sequence, structure, and evolutionary contexts. By enabling both SOTA predictions and constrained biomolecular design (e.g., clinically relevant RNA editing and protein binder generation), Paper 2 has a broader potential for cross-disciplinary impact, particularly in accelerating drug discovery, genomics, and structural biology.

vs. End-to-end autonomous scientific discovery on a real optical platform

claude-opus-4.65/25/2026

Paper 1 demonstrates the first end-to-end autonomous scientific discovery system that identifies and experimentally validates a previously unreported physical mechanism on real hardware. This represents a paradigm shift in how scientific research can be conducted—moving from AI-assisted to fully autonomous discovery. The breadth of impact is enormous: it spans AI, physics, and potentially optical computing hardware. While Paper 2 makes a strong contribution to formal verification with practical speedups, its impact is more narrowly scoped to software engineering. Paper 1's demonstration of AI-driven experimental science has transformative implications across all empirical sciences.

vs. Towards a General Intelligence and Interface for Wearable Health Data

gpt-5.25/25/2026

Paper 1 combines massive-scale self-supervised pretraining (trillion-minute wearable dataset, 5M participants) with broad evaluation (35 tasks) and a clinically rated Personal Health Agent, positioning it for wide real-world deployment and cross-domain impact (healthcare, ML, HCI, digital medicine). Its methodological scale and demonstrated label efficiency/generation address a major bottleneck in wearable health. Paper 2 is highly novel and rigorous for formal methods, but its immediate breadth of application is narrower (formally verified distributed systems) and depends on adoption of specific verification toolchains.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gpt-5.25/25/2026

Paper 1 likely has higher scientific impact due to its scale (43.8B events, 200M lives), strong empirical rigor (1,000+ tasks, prospective + external validation), and immediate, broad real-world applications in healthcare (disease surveillance, expenditure forecasting, RWE/target-trial bias reduction). It advances foundation modeling on an underexplored but widely available data modality (claims), with clear translational relevance for regulators, payers, and providers. Paper 2 is highly novel for verified systems, but its near-term breadth is narrower and depends on adoption within formal methods and distributed systems tooling ecosystems.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gpt-5.25/25/2026

Paper 2 likely has higher impact due to direct clinical relevance and broad applicability: a multimodal generative “health world model” enabling forecasting, risk stratification, and in silico intervention simulation across multiple physiological domains. It is trained on a large longitudinal cohort, demonstrates transfer to independent cohorts, and benchmarks against clinical risk scores and randomized trial effects—supporting methodological rigor and real-world utility. Its breadth spans medicine, epidemiology, and ML, and it is timely for digital-twin healthcare. Paper 1 is highly novel for formal verification but targets a narrower domain.

vs. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

gemini-3.15/25/2026

Paper 2 presents a breakthrough in automating mechanized formal verification, a notoriously difficult and time-consuming task. By introducing a framework that successfully synthesizes both implementation and proof for distributed systems (achieving 100% success where current SOTA fails), it bridges a critical gap in software reliability. While Paper 1 provides highly relevant insights into LLM interpretability and CoT faithfulness, Paper 2's potential to dramatically reduce the cost and time of verifying mission-critical infrastructure offers broader, more immediate real-world applications and highly tangible transformative impact across computer science.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

claude-opus-4.65/25/2026

Paper 2 (IDS) addresses the fundamental challenge of automated formal verification of distributed systems—a problem with enormous practical significance. It demonstrates a concrete, measurable breakthrough: achieving 7/7 on specifications where SOTA agents manage only 2/7, at ~200x speedup over expert effort. This bridges the long-standing gap between AI code generation and formal correctness guarantees, with broad implications for software reliability. Paper 1 (EVE-Agent) makes a meaningful but more incremental contribution to self-evolving agents with evidence verification. While valuable, its impact is narrower compared to enabling AI-driven formally verified system synthesis.

vs. Foundation Protocol: A Coordination Layer for Agentic Society

gemini-3.15/25/2026

Paper 1 presents a concrete, measurable breakthrough in a highly rigorous field (formal verification of distributed systems) with impressive quantitative results demonstrating significant improvements over both SOTA AI and human experts. Paper 2 proposes a broad, conceptual framework for multi-agent coordination but lacks the empirical evidence and methodological rigor demonstrated in Paper 1.

vs. GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

gpt-5.25/25/2026

Paper 2 likely has higher scientific impact: it introduces a novel agentic loop (IDS) that jointly synthesizes code and mechanized proofs, demonstrably closing a major capability gap (2/7 to 7/7) on formally verified distributed systems with strong efficiency gains. The real-world applications (verified infrastructure, safety-critical software) are immediate and broad across PL, formal methods, distributed systems, and AI. If the evaluation is rigorous and reproducible, the results suggest a step-change. Paper 1 is timely and useful for LLM evaluation, but its impact is more incremental and narrower to benchmarking.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

claude-opus-4.65/25/2026

Paper 1 presents a fundamentally novel approach (Inductive Deductive Synthesis) that enables AI to generate formally verified distributed systems—a problem of immense practical importance. It demonstrates dramatic improvements (200x faster than expert effort, 7/7 success vs 2/7 for SOTA agents) on a challenging benchmark. The work bridges AI code generation and formal verification, two major fields, with broad implications for software reliability. Paper 2, while a solid contribution combining DP and CP, acknowledges it is not competitive with state-of-the-art solvers and primarily demonstrates viability of a hybrid integration, limiting its impact.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.25/25/2026

Paper 2 likely has higher impact: it introduces a novel IDS paradigm that jointly synthesizes implementations and proofs with learning from failures, and demonstrates strong, complete benchmark gains (7/7) with low cost/time plus performance optimization. The real-world applications (formally verified distributed systems) are immediate and broadly relevant to safety-critical software engineering and industry, spanning PL, formal methods, systems, and AI. Paper 1 is timely and exciting for mathematics, but its demonstrated scope (solving a small fraction of open problems) suggests narrower near-term practical impact than verified-systems automation.

vs. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

gpt-5.25/25/2026

Paper 2 has higher potential impact: it tackles a major, high-stakes bottleneck—automating formally verified system construction—directly enabling more reliable distributed systems. The approach is novel in jointly synthesizing implementations and mechanized proofs with iterative learning from failures and integrating performance optimization, showing strong end-to-end results (7/7 specs) and large efficiency gains. Applications are immediate across verification, programming languages, systems, and AI-assisted software engineering. Paper 1 is solid and timely for MLLM knowledge editing, but its impact is narrower and more incremental within model editing methodology.

vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

gpt-5.25/25/2026

Paper 1 likely has higher scientific impact: it tackles a hard, high-stakes gap (LLM-generated code with formal correctness guarantees) and demonstrates a large, concrete step-change on an objective benchmark (7/7 verified distributed specs, joint proof+implementation synthesis, performance-aware loop). This combines strong novelty with clear real-world applicability (reliable distributed systems) and broad relevance to PL, formal methods, and AI. Paper 2 is timely and useful for efficient reasoning/planning, but its contributions are more incremental within agent architecture and depend on evaluation sensitivity to task suites and prompting/training choices.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gpt-5.25/25/2026

Paper 1 likely has higher scientific impact due to its step-change in automating formally verified system construction: jointly synthesizing code and machine-checked proofs for distributed systems, achieving full success where strong baselines fail, with large speed/cost gains and performance-optimized verified outputs. This targets a major bottleneck in trustworthy software and can influence programming languages, verification, systems, and AI agent design. Paper 2 is a strong, timely incremental advance in linear attention with solid applications, but it is closer to ongoing architectural refinement in a crowded area and likely has narrower cross-field impact than scalable verified-system synthesis.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

claude-opus-4.65/25/2026

Paper 2 (IDS) demonstrates a transformative capability—enabling AI to automatically generate formally verified distributed systems, a task previously requiring months of expert effort. It achieves 7/7 success where SOTA agents manage only 2/7, representing a qualitative breakthrough. The 200x speedup over expert effort has enormous practical implications for software reliability. Paper 1 (SMDD-Bench) is a valuable benchmark contribution, but benchmarks have incremental impact compared to novel methodological breakthroughs. IDS opens a new paradigm for verified software synthesis with broader cross-field implications.

vs. Implicit Safety Alignment from Crowd Preferences

gpt-5.25/25/2026

Paper 2 likely has higher impact due to stronger novelty and broader applicability: it targets a major bottleneck (end-to-end synthesis of formally verified distributed systems) with a concrete, measurable leap (2/7 to 7/7) and large efficiency gains, plus performance optimization within verification. This spans PL/verification, distributed systems, and LLM agents, with clear real-world deployment value (correctness guarantees). Paper 1 is timely and useful for safety in RLHF, but its contributions are more incremental and domain-scoped, and safety transfer from preferences may be harder to generalize and rigorously guarantee.