Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li
Abstract
AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.
AI Impact Assessments
(1 models)Scientific Impact Assessment: Inductive Deductive Synthesis (IDS)
1. Core Contribution
IDS introduces a fundamentally different approach to AI-driven formal verification by jointly and incrementally synthesizing both implementation and proof, rather than treating verification as a post-hoc check on completed code. The key insight is that partial proofs can be mechanically checked by Rocq's type-checker at every intermediate step, providing a formally grounded "chain-of-thought" where each state is verified. The system comprises two interacting components: a Deductive Synthesis Agent (DSA) that incrementally extends code and proof, and an Inductive Synthesis Agent (ISA) that learns from failures to propose new strategies (tactical via a "proposer" role, or strategic via a "reloader" role). This two-level architecture effectively transforms verified system construction into a guided search problem with an exact oracle.
The paper demonstrates IDS on 7 distributed key-value-store consistency specifications (causal consistency, read-your-writes, monotonic reads/writes, and compositions), achieving 7/7 completions where state-of-the-art coding agents (Codex/GPT-5.4 and Claude Code/Opus 4.6) manage only 2/7 each. The system also achieves SOTA on seven cross-language verification benchmarks.
2. Methodological Rigor
The evaluation is thorough and multi-dimensional:
One concern is that only 3 independent runs per specification are reported, though the authors note ≤7% relative std. The "best-of-N=100" experiment (Appendix C) for implementation-only synthesis is a valuable control, showing that even 100 unverified candidates cannot match verified synthesis.
3. Potential Impact
Immediate impact: This work could dramatically lower the barrier to formal verification of distributed systems, a domain where correctness bugs have severe real-world consequences (data loss, security breaches). The ~200× speedup over expert effort (hours vs. months) for Chapar CC is striking.
Broader implications:
Benchmark contribution: The IDS suite of 6 new Rocq specifications for distributed KV-store consistency models fills a gap in verification benchmarks.
4. Timeliness & Relevance
This paper sits at the intersection of two major trends: (1) the rapid improvement of LLM coding agents and (2) growing concern about the reliability of AI-generated code. The observation that SOTA agents fail on 5/7 formal verification tasks despite excelling at standard coding benchmarks is timely and important — it concretely demonstrates the limits of test-based approaches for safety-critical systems.
The "specification bottleneck" the authors identify (writing formal specs remains hard) is an honest assessment that positions this as step one in a broader research program. The community is actively seeking ways to make formal methods practical; IDS provides a concrete, working solution for the synthesis half of the problem.
5. Strengths & Limitations
Key Strengths:
Limitations:
Additional Observations
The paper's framing as "shifting vibe coding to verified coding" is compelling but slightly oversold — the specification must still be written by an expert, and this is acknowledged as the main bottleneck. The analogy to chain-of-thought with formally verified intermediate states is insightful and could inspire related work in other domains requiring certified reasoning.
Generated May 25, 2026
Comparison History (21)
Paper 2 proposes a fundamental theoretical unification of Bayesian inference, game theory, and thermodynamics, offering profound implications across physics, biology, neuroscience, and AI. While Paper 1 presents a highly practical and timely advancement in AI-driven formal verification, Paper 2's potential to reshape our theoretical understanding of collective intelligence and multi-agent systems across diverse scientific disciplines gives it a broader and deeper long-term scientific impact.
Paper 2 provides a crucial, large-scale critique of AI agents in scientific research, exposing fundamental flaws in their reasoning processes. While Paper 1 offers a significant technical breakthrough in a specific domain (formal verification), Paper 2's findings have broad, cross-disciplinary implications for the entire burgeoning field of AI-driven science. By demonstrating that current AI models fail at basic epistemic norms, it challenges widespread assumptions and identifies a critical new direction for AI research, giving it higher potential impact across the broader scientific community.
Paper 1 addresses a fundamental challenge in materials science and chemistry—efficient structure prediction across high-dimensional energy landscapes—by unifying generative models with physics-based search. Its >10x efficiency gains, ability to discover metastable structures, and generalization beyond training data have broad impact across molecular discovery, materials design, and drug development. Paper 2 is impressive for formal verification of distributed systems but targets a narrower domain. Paper 1's framework bridges machine learning and physical sciences with broader interdisciplinary impact and addresses a more universally impactful scientific problem.
Paper 1 presents a broadly applicable paradigm (machine collective intelligence) for autonomous scientific equation discovery across diverse domains, combining symbolic AI and metaheuristics. Its breadth of impact spans virtually all empirical sciences, offering orders-of-magnitude improvements in extrapolation and interpretability over deep learning. While Paper 2 makes an impressive contribution to formal verification of distributed systems—a significant but narrower domain—Paper 1's potential to transform how scientific laws are discovered gives it greater cross-disciplinary impact and foundational significance for AI-driven science.
While Paper 1 significantly advances automated formal verification, Paper 2 presents a multimodal foundation model for biology that bridges sequence, structure, and evolutionary contexts. By enabling both SOTA predictions and constrained biomolecular design (e.g., clinically relevant RNA editing and protein binder generation), Paper 2 has a broader potential for cross-disciplinary impact, particularly in accelerating drug discovery, genomics, and structural biology.
Paper 1 demonstrates the first end-to-end autonomous scientific discovery system that identifies and experimentally validates a previously unreported physical mechanism on real hardware. This represents a paradigm shift in how scientific research can be conducted—moving from AI-assisted to fully autonomous discovery. The breadth of impact is enormous: it spans AI, physics, and potentially optical computing hardware. While Paper 2 makes a strong contribution to formal verification with practical speedups, its impact is more narrowly scoped to software engineering. Paper 1's demonstration of AI-driven experimental science has transformative implications across all empirical sciences.
Paper 1 combines massive-scale self-supervised pretraining (trillion-minute wearable dataset, 5M participants) with broad evaluation (35 tasks) and a clinically rated Personal Health Agent, positioning it for wide real-world deployment and cross-domain impact (healthcare, ML, HCI, digital medicine). Its methodological scale and demonstrated label efficiency/generation address a major bottleneck in wearable health. Paper 2 is highly novel and rigorous for formal methods, but its immediate breadth of application is narrower (formally verified distributed systems) and depends on adoption of specific verification toolchains.
Paper 1 likely has higher scientific impact due to its scale (43.8B events, 200M lives), strong empirical rigor (1,000+ tasks, prospective + external validation), and immediate, broad real-world applications in healthcare (disease surveillance, expenditure forecasting, RWE/target-trial bias reduction). It advances foundation modeling on an underexplored but widely available data modality (claims), with clear translational relevance for regulators, payers, and providers. Paper 2 is highly novel for verified systems, but its near-term breadth is narrower and depends on adoption within formal methods and distributed systems tooling ecosystems.
Paper 2 likely has higher impact due to direct clinical relevance and broad applicability: a multimodal generative “health world model” enabling forecasting, risk stratification, and in silico intervention simulation across multiple physiological domains. It is trained on a large longitudinal cohort, demonstrates transfer to independent cohorts, and benchmarks against clinical risk scores and randomized trial effects—supporting methodological rigor and real-world utility. Its breadth spans medicine, epidemiology, and ML, and it is timely for digital-twin healthcare. Paper 1 is highly novel for formal verification but targets a narrower domain.
Paper 2 presents a breakthrough in automating mechanized formal verification, a notoriously difficult and time-consuming task. By introducing a framework that successfully synthesizes both implementation and proof for distributed systems (achieving 100% success where current SOTA fails), it bridges a critical gap in software reliability. While Paper 1 provides highly relevant insights into LLM interpretability and CoT faithfulness, Paper 2's potential to dramatically reduce the cost and time of verifying mission-critical infrastructure offers broader, more immediate real-world applications and highly tangible transformative impact across computer science.
Paper 2 (IDS) addresses the fundamental challenge of automated formal verification of distributed systems—a problem with enormous practical significance. It demonstrates a concrete, measurable breakthrough: achieving 7/7 on specifications where SOTA agents manage only 2/7, at ~200x speedup over expert effort. This bridges the long-standing gap between AI code generation and formal correctness guarantees, with broad implications for software reliability. Paper 1 (EVE-Agent) makes a meaningful but more incremental contribution to self-evolving agents with evidence verification. While valuable, its impact is narrower compared to enabling AI-driven formally verified system synthesis.
Paper 1 presents a concrete, measurable breakthrough in a highly rigorous field (formal verification of distributed systems) with impressive quantitative results demonstrating significant improvements over both SOTA AI and human experts. Paper 2 proposes a broad, conceptual framework for multi-agent coordination but lacks the empirical evidence and methodological rigor demonstrated in Paper 1.
Paper 2 likely has higher scientific impact: it introduces a novel agentic loop (IDS) that jointly synthesizes code and mechanized proofs, demonstrably closing a major capability gap (2/7 to 7/7) on formally verified distributed systems with strong efficiency gains. The real-world applications (verified infrastructure, safety-critical software) are immediate and broad across PL, formal methods, distributed systems, and AI. If the evaluation is rigorous and reproducible, the results suggest a step-change. Paper 1 is timely and useful for LLM evaluation, but its impact is more incremental and narrower to benchmarking.
Paper 1 presents a fundamentally novel approach (Inductive Deductive Synthesis) that enables AI to generate formally verified distributed systems—a problem of immense practical importance. It demonstrates dramatic improvements (200x faster than expert effort, 7/7 success vs 2/7 for SOTA agents) on a challenging benchmark. The work bridges AI code generation and formal verification, two major fields, with broad implications for software reliability. Paper 2, while a solid contribution combining DP and CP, acknowledges it is not competitive with state-of-the-art solvers and primarily demonstrates viability of a hybrid integration, limiting its impact.
Paper 2 likely has higher impact: it introduces a novel IDS paradigm that jointly synthesizes implementations and proofs with learning from failures, and demonstrates strong, complete benchmark gains (7/7) with low cost/time plus performance optimization. The real-world applications (formally verified distributed systems) are immediate and broadly relevant to safety-critical software engineering and industry, spanning PL, formal methods, systems, and AI. Paper 1 is timely and exciting for mathematics, but its demonstrated scope (solving a small fraction of open problems) suggests narrower near-term practical impact than verified-systems automation.
Paper 2 has higher potential impact: it tackles a major, high-stakes bottleneck—automating formally verified system construction—directly enabling more reliable distributed systems. The approach is novel in jointly synthesizing implementations and mechanized proofs with iterative learning from failures and integrating performance optimization, showing strong end-to-end results (7/7 specs) and large efficiency gains. Applications are immediate across verification, programming languages, systems, and AI-assisted software engineering. Paper 1 is solid and timely for MLLM knowledge editing, but its impact is narrower and more incremental within model editing methodology.
Paper 1 likely has higher scientific impact: it tackles a hard, high-stakes gap (LLM-generated code with formal correctness guarantees) and demonstrates a large, concrete step-change on an objective benchmark (7/7 verified distributed specs, joint proof+implementation synthesis, performance-aware loop). This combines strong novelty with clear real-world applicability (reliable distributed systems) and broad relevance to PL, formal methods, and AI. Paper 2 is timely and useful for efficient reasoning/planning, but its contributions are more incremental within agent architecture and depend on evaluation sensitivity to task suites and prompting/training choices.
Paper 1 likely has higher scientific impact due to its step-change in automating formally verified system construction: jointly synthesizing code and machine-checked proofs for distributed systems, achieving full success where strong baselines fail, with large speed/cost gains and performance-optimized verified outputs. This targets a major bottleneck in trustworthy software and can influence programming languages, verification, systems, and AI agent design. Paper 2 is a strong, timely incremental advance in linear attention with solid applications, but it is closer to ongoing architectural refinement in a crowded area and likely has narrower cross-field impact than scalable verified-system synthesis.
Paper 2 (IDS) demonstrates a transformative capability—enabling AI to automatically generate formally verified distributed systems, a task previously requiring months of expert effort. It achieves 7/7 success where SOTA agents manage only 2/7, representing a qualitative breakthrough. The 200x speedup over expert effort has enormous practical implications for software reliability. Paper 1 (SMDD-Bench) is a valuable benchmark contribution, but benchmarks have incremental impact compared to novel methodological breakthroughs. IDS opens a new paradigm for verified software synthesis with broader cross-field implications.
Paper 2 likely has higher impact due to stronger novelty and broader applicability: it targets a major bottleneck (end-to-end synthesis of formally verified distributed systems) with a concrete, measurable leap (2/7 to 7/7) and large efficiency gains, plus performance optimization within verification. This spans PL/verification, distributed systems, and LLM agents, with clear real-world deployment value (correctness guarantees). Paper 1 is timely and useful for safety in RLHF, but its contributions are more incremental and domain-scoped, and safety transfer from preferences may be harder to generalize and rigorously guarantee.