FormalScience: Scalable Human-in-the-Loop Autoformalisation of Science with Agentic Code Generation in Lean

Jordan Meadows, Lan Zhang, Andre Freitas

Apr 24, 2026

arXiv:2604.23002v1 PDF

cs.AI(primary)cs.CL

#136of 2292·Artificial Intelligence

#136 of 2292 · Artificial Intelligence

Tournament Score

1534±43

10501800

76%

Win Rate

Wins

Losses

Matches

Rating

5.8/ 10

Significance6.5

Rigor5.5

Novelty6.5

Clarity7

Tournament Score

1534±43

10501800

76%

Win Rate

Wins

Losses

Matches

Rating

5.8/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Formalising informal mathematical reasoning into formally verifiable code is a significant challenge for large language models. In scientific fields such as physics, domain-specific machinery (\textit{e.g.} Dirac notation, vector calculus) imposes additional formalisation challenges that modern LLMs and agentic approaches have yet to tackle. To aid autoformalisation in scientific domains, we present FormalScience; a domain-agnostic human-in-the-loop agentic pipeline that enables a single domain expert (without deep formal language experience) to produce \textit{syntactically correct} and \textit{semantically aligned} formal proofs of informal reasoning for low economic cost. Applying FormalScience to physics, we construct FormalPhysics, a dataset of 200 university-level (LaTeX) physics problems and solutions (primarily quantum mechanics and electromagnetism), along with their Lean4 formal representations. Compared to existing formal math benchmarks, FormalPhysics achieves perfect formal validity and exhibits greater statement complexity. We evaluate open-source models and proprietary systems on a statement autoformalisation task on our dataset via zero-shot prompting, self-refinement with error feedback, and a novel multi-stage agentic approach, and explore autoformalisation limitations in modern LLM-based approaches. We provide the first systematic characterisation of semantic drift in physics autoformalisation in terms of concepts such as notational collapse and abstraction elevation which reveals what formal language verifies when full semantic preservation is unattainable. We release the codebase together with an interactive UI-based FormalScience system which facilitates autoformalisation and theorem proving in scientific domains beyond physics.https://github.com/jmeadows17/formal-science

AI Impact Assessments

(1 models)

Scientific Impact Assessment: FormalScience

1. Core Contribution

FormalScience introduces a human-in-the-loop pipeline for converting informal scientific reasoning (LaTeX derivations) into formally verified Lean4 code, targeting domains where automated approaches consistently fail. The pipeline combines few-shot LLM generation of informal statement-proof pairs, iterative compilation with error correction, and human-guided semantic alignment checking. Applied to physics, it produces FormalPhysics: 200 university-level problems (quantum mechanics, electromagnetism) with complete Lean4 formalizations achieving 100% compilation success.

The most intellectually substantive contribution is the systematic taxonomy of semantic drift in physics autoformalisation—notational collapse, abstraction elevation, proof strategy substitution, and implicit premise selection. This taxonomy addresses a previously uncharacterized phenomenon: what exactly does a type-checker verify when the formal representation necessarily diverges from the physics? This reframes the verification question from binary (correct/incorrect) to a nuanced spectrum of partial guarantees.

2. Methodological Rigor

Pipeline formalization. The FormalScience pipeline is described with mathematical precision (Equations 1-5, Algorithm 1), which aids reproducibility. The iterative correction loop with compilation feedback and human alignment classification is well-structured, though the human "binary classifier" step (Equation 4) introduces subjectivity that is not rigorously controlled—one physics expert performed all evaluations.

Benchmark comparison. The comparison against 7 existing datasets (Tables 1-2) is informative but has methodological concerns. LLM-as-a-judge evaluation (GPT-4.1-mini) for FQ, LP, and MC metrics introduces noise, though the authors partially address this with an inter-judge agreement analysis using Qwen2.5-Coder-7B-Instruct (phi coefficients 0.28–0.37, significant but moderate agreement). The judges agree on rankings for top models but diverge for middle/bottom tiers, and some secondary claims (e.g., score invariance under self-refinement) are judge-dependent.

Experimental evaluation. The three-tier evaluation (zero-shot, self-refinement, agentic) across multiple models is systematic. However, the primary finding—an alignment-validity trade-off with near-zero Spearman/Pearson correlation—is computed over only 5 model points per setting, making statistical claims fragile. The 100+ GPU hours per agentic baseline is transparently reported but limits the exploration space.

Semantic drift analysis. The qualitative analysis (Section 5, Figure 3, Table 8) is the paper's strongest analytical contribution. The examples convincingly demonstrate how quantum operators collapse to complex scalars, vector calculus reduces to abstract linear algebra, and integrals become limit statements. The quantitative prevalence analysis by subdomain (>75% notational collapse in QM) provides actionable information about Lean4/Mathlib's physics coverage gaps.

3. Potential Impact

Short-term: FormalPhysics serves as a challenging benchmark exposing fundamental limitations of current autoformalisation approaches in science. The 31% best-case compilation rate for fully automated approaches (vs. 100% with human-in-the-loop) quantifies the gap that remains.

Medium-term: The semantic drift taxonomy could influence how the formal verification community thinks about partial verification guarantees. Rather than treating compilation failure/success as binary, researchers could develop drift-aware metrics. The identification of specific Mathlib gaps (vector calculus, Dirac notation, non-commutative operators) provides a roadmap for library developers.

Broader applications: The domain-agnostic pipeline design suggests extensibility to chemistry, biology, and engineering, though this remains entirely unvalidated. The released UI-based system and codebase lower barriers for domain experts without formal methods training.

Limitations on impact: The dataset size (200 examples) is small for training purposes. The fundamental finding—that Lean4 cannot natively express core physics formalisms—is more a characterization of current tool limitations than a solution. The practical value of formal proofs that verify `x - 0 = x` when the physics requires evaluating line integrals is questionable.

4. Timeliness & Relevance

The paper addresses a genuine gap: autoformalisation research has been overwhelmingly focused on competition mathematics and undergraduate math, with almost no systematic work on physics or other sciences. The timing is appropriate given rapid advances in LLM-based theorem proving (DeepSeek-Prover, Kimina) and growing interest in AI for scientific discovery. The honest characterization of what formal verification actually guarantees in physics is a timely corrective to overly optimistic narratives about automated scientific reasoning.

5. Strengths & Limitations

Key Strengths:

First systematic study of autoformalisation in physics with a complete, compilable dataset

Rigorous semantic drift taxonomy with domain-specific quantitative analysis

Transparent cost analysis (~$50 USD, one month, one expert)

The alignment-validity trade-off finding is practically important for the field

Well-documented pipeline with released code and UI

Notable Weaknesses:

Single annotator introduces potential bias; no inter-annotator agreement for human alignment classification

200 examples is adequate for benchmarking but insufficient for training; scalability claims ("thousands of formalisations") are aspirational

The semantic drift analysis, while valuable, fundamentally reveals that Lean4+Mathlib is inadequate for physics—the paper characterizes the problem more than it solves it

LLM-as-a-judge metrics show only moderate inter-judge agreement (phi ~0.3), undermining confidence in fine-grained comparisons

The "domain-agnostic" claim is entirely unsupported by evidence beyond physics

Some evaluated models (GPT-5.1, Claude-Opus-4.5, GPT-OSS-20B) are very recent, raising reproducibility concerns as APIs evolve

Missing comparisons: The paper acknowledges that comparison with DRIFT and retrieval-based approaches is "methodologically inappropriate" due to domain differences, but this leaves the agentic pipeline inadequately contextualized against state-of-the-art.

Summary

This paper makes a meaningful contribution by bringing autoformalisation research into physics, providing both a benchmark and an honest characterization of current limitations. The semantic drift taxonomy is the most novel intellectual contribution and could influence how the community thinks about partial formal verification. However, the work is more diagnostic than prescriptive—it characterizes problems rather than solving them—and the dataset scale and single-annotator design limit its immediate practical impact.

Rating:5.8/ 10

Significance 6.5Rigor 5.5Novelty 6.5Clarity 7

Generated May 5, 2026

Comparison History (29)

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

claude-opus-4.65/5/2026

ReClaim presents a foundation model trained on 43.8 billion medical events from 200M+ patients, demonstrating strong results across 1,000+ prediction tasks, expenditure forecasting, and causal inference (target trial emulation). Its breadth of impact spans healthcare AI, epidemiology, health economics, and regulatory science. The scale of data, rigorous external validation, and direct applicability to real-world healthcare decision-making give it enormous practical impact. While FormalScience is novel in bridging formal verification and physics, it addresses a narrower community and its 200-problem dataset is relatively small. ReClaim's potential to transform population-level health research is substantially broader.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

claude-opus-4.65/5/2026

ReClaim represents a major advance in healthcare AI by training the first large-scale foundation model on administrative claims data (43.8B events, 200M+ patients). Its demonstrated improvements across 1,000+ disease prediction tasks, expenditure forecasting, and bias reduction in trial emulation have immediate, broad real-world impact on healthcare decision-making and regulatory evaluation. The scaling laws and external validation strengthen its methodological rigor. Paper 2, while novel in formalizing physics proofs, addresses a narrower problem with a smaller dataset (200 problems) and more limited near-term practical applications.

vs. AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction

gemini-35/5/2026

Paper 1 addresses a fundamental frontier in AI: autoformalizing scientific reasoning into verifiable code (Lean). Its novel dataset (FormalPhysics) and characterization of semantic drift offer foundational impact for AI-driven scientific discovery and mathematical reasoning. In contrast, Paper 2 focuses on a narrower, applied domain (e-commerce product knowledge graphs), which holds significant commercial value but narrower broad scientific impact.

vs. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

gpt-5.25/5/2026

Paper 2 has higher potential impact: it introduces a scalable human-in-the-loop pipeline for autoformalising scientific reasoning, releases a new physics-focused Lean dataset (FormalPhysics) with verified validity, provides a UI/tooling system, and offers a systematic analysis of semantic drift—advancing both methodology and infrastructure. Its applications span formal methods, AI for science, theorem proving, and LLM evaluation, with strong timeliness given interest in verifiable AI and scientific reproducibility. Paper 1 is rigorous and useful diagnostically, but is narrower (prompt optimization in compound systems) and less broadly enabling.

vs. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

gemini-35/5/2026

Paper 2 addresses a grand challenge in AI and science: the autoformalisation of scientific reasoning (physics) into verifiable code (Lean). By extending formal verification beyond mathematics and introducing a novel dataset and agentic pipeline, it lays foundational work for reliable AI scientists. Paper 1, while highly practical and rigorous, focuses on diagnosing and optimizing current prompt engineering tools, which represents a more transient engineering problem compared to the broad, interdisciplinary scientific impact of verifiable scientific reasoning.

vs. AutoPKG: An Automated Framework for Dynamic E-commerce Product-Attribute Knowledge Graph Construction

gpt-5.25/5/2026

Paper 1 is more novel and broadly impactful: it advances human-in-the-loop autoformalisation beyond math into scientific domains, contributes a high-validity Lean4 physics dataset, a reusable UI/pipeline, and a systematic analysis of semantic drift—useful across AI, formal methods, and scientific computing. Its methodological rigor is strengthened by formal verification targets and error characterization. Paper 2 shows strong real-world business impact and solid engineering, but is more application-specific to e-commerce PKGs and incremental relative to existing LLM-based extraction/graph construction lines.

vs. The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

gpt-5.25/5/2026

Paper 2 is more likely to have higher scientific impact due to immediate real-world applicability and methodological tangibility: it delivers a reusable human-in-the-loop system, a released codebase/UI, and a high-quality dataset enabling benchmarking and downstream research in LLMs, formal methods, and scientific computing. Its contributions (pipeline, dataset, evaluation, semantic-drift taxonomy) are actionable and broadly useful across domains. Paper 1 is conceptually novel, but its impact depends heavily on acceptance of new formal definitions/thresholds and synthetic validation, with less clear near-term uptake beyond governance theory.

vs. The Accountability Horizon: An Impossibility Theorem for Governing Human-Agent Collectives

gemini-35/5/2026

Paper 2 establishes a foundational impossibility theorem in AI governance, mathematically proving the limits of current accountability frameworks for highly autonomous systems. This cross-disciplinary result has profound implications for AI ethics, law, and multi-agent systems. While Paper 1 offers a valuable tool and dataset for autoformalization in Lean, Paper 2's theoretical breakthrough provides broader, paradigm-shifting impact across multiple fields dealing with AI regulation, safety, and deployment.

vs. Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

claude-opus-4.65/5/2026

FormalScience addresses a more novel and underexplored problem—autoformalisation of scientific reasoning (physics) in formal proof assistants—opening new research directions at the intersection of AI, formal verification, and science. It introduces a reusable pipeline, a new benchmark (FormalPhysics), and provides systematic characterization of semantic drift, all of which have broad cross-disciplinary impact. While Paper 1 offers a solid incremental contribution to activation function design (BerLU), the space is crowded with alternatives, limiting its novelty ceiling. Paper 2's open-source tools and dataset further amplify its potential impact.

vs. ANO: A Principled Approach to Robust Policy Optimization

gpt-5.25/5/2026

Paper 1 likely has higher scientific impact due to its novelty and breadth: it introduces a scalable human-in-the-loop autoformalization pipeline, releases a new high-quality Lean4 physics dataset, and provides a systematic analysis of semantic drift—useful across formal methods, AI, and multiple scientific domains. Its real-world applications (verifiable scientific reasoning, tooling for formalization beyond math) extend beyond a single subfield. Paper 2 is a strong, timely RL optimization contribution with theoretical framing and solid benchmarks, but its impact is narrower and more incremental within deep RL algorithms.

vs. CyberAId: AI-Driven Cybersecurity for Financial Service Providers

gpt-5.25/5/2026

Paper 2 offers higher scientific impact due to a concrete, reproducible contribution: a scalable human-in-the-loop pipeline, an interactive tool, and a released benchmark dataset (FormalPhysics) with systematic evaluation and new analysis concepts (semantic drift taxonomy). This advances methodology for trustworthy AI and formal verification across many scientific domains, with clear rigor and broad cross-field relevance. Paper 1 is largely a position/platform proposal in a narrow application domain (financial cybersecurity) with validation plans but less demonstrated technical novelty or evidence, making near-term scientific impact less certain.

vs. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

claude-opus-4.65/5/2026

Paper 1 (FormalScience) addresses a fundamental and well-defined challenge—autoformalisation of scientific reasoning—with rigorous methodology, producing a concrete dataset (FormalPhysics) with verified formal validity, systematic characterization of semantic drift, and a reusable open-source pipeline. Its contributions are methodologically sound and immediately useful for formal verification communities. Paper 2 (ResearchEVO), while ambitious in scope (end-to-end automated discovery + paper writing), risks being perceived as incremental in both algorithm evolution and automated writing, with validation on only two problems and claims that are harder to rigorously verify. Paper 1's focused, rigorous contributions have more durable impact.

vs. A Language for Describing Agentic LLM Contexts

gemini-35/5/2026

Paper 2 tackles the highly challenging and impactful area of autoformalization in scientific domains (physics) using Lean4. It provides a novel human-in-the-loop pipeline, a rigorously verified dataset (FormalPhysics), and a deep empirical analysis of semantic drift. Paper 1 proposes a standardized notation language for LLM contexts, which, while practically useful, traditionally has a lower threshold for scientific impact unless universally adopted, making Paper 2's empirical and methodological contributions more immediately significant.

vs. Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries

gemini-35/5/2026

Paper 2 demonstrates broader scientific impact due to its high timeliness and cross-disciplinary applications. By bridging large language models with Lean 4 formalization for scientific domains like physics, it directly addresses a major bottleneck in AI-driven mathematical reasoning (autoformalisation). While Paper 1 exhibits exceptional methodological rigor in programming language theory, Paper 2 provides scalable, open-source tools and datasets (FormalPhysics) accessible to domain experts without deep formal methods experience. This significantly lowers the barrier to entry for formalizing science, giving it wider real-world application and broader impact across STEM fields.

vs. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

gemini-35/5/2026

Paper 2 presents a foundation model for Multi-Agent Reinforcement Learning (MARL), demonstrating success across highly diverse environments without task-specific tuning. Establishing a generalized foundation model paradigm for MARL offers immense transformative potential across AI, robotics, and autonomous systems, mirroring the revolutionary impact of LLMs in NLP. While Paper 1 is innovative in autoformalizing physics proofs into Lean4, its immediate impact is largely restricted to the specialized formal verification and mathematics communities. Paper 2's broader applicability, massive scale, and potential to unify multi-agent learning tasks give it a significantly higher expected scientific impact.

vs. An Empirical Study of Agent Skills for Healthcare: Practice, Gaps, and Governance

gpt-5.25/5/2026

Paper 2 likely has higher impact: it introduces a general, scalable human-in-the-loop pipeline for autoformalising scientific reasoning, releases a usable system/UI and code, and provides a new high-quality dataset (FormalPhysics) with formal validity plus systematic analysis of semantic drift. This advances formal methods + LLM tooling with potential cross-domain benefits (math, physics, verification, scientific computing, AI safety). Paper 1 is valuable and timely for healthcare governance, but is primarily a descriptive empirical survey of public “skills” with narrower methodological/technical novelty and more domain-specific reach.

vs. Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks

claude-opus-4.65/5/2026

FormalScience tackles the novel and significant challenge of autoformalising scientific reasoning (beyond pure mathematics) into formally verifiable code. It contributes a concrete dataset (FormalPhysics), a reusable pipeline with released code and UI, systematic characterization of semantic drift, and bridges formal verification with physics—opening new interdisciplinary research directions. Paper 1 addresses important but more incremental advances in skill induction for LLM agents. Paper 2's broader cross-field impact (formal methods, physics, AI), concrete artifacts, and pioneering positioning in scientific autoformalisation give it higher potential impact.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

claude-opus-4.65/5/2026

The AAAI-26 AI-assisted peer review paper represents a landmark large-scale deployment (22,977 papers) addressing a critical bottleneck in scientific publishing. Its findings—that AI reviews were preferred over human reviews on key dimensions—have immediate, broad implications for the entire scientific enterprise across all fields. The scale, real-world validation, and timeliness make it exceptionally impactful. While FormalScience is novel and rigorous in formalizing scientific reasoning, its scope (200 problems, niche Lean4 community) and immediate applicability are narrower, limiting its near-term breadth of impact.

vs. End-to-end autonomous scientific discovery on a real optical platform

gemini-35/5/2026

Paper 1 represents a groundbreaking milestone in AI-driven scientific research by demonstrating the first end-to-end autonomous discovery of a novel, previously unreported physical mechanism on a real experimental platform. While Paper 2 provides a valuable tool and dataset for formalizing physics in Lean4, Paper 1's integration of an LLM agent with physical experimentation to generate new scientific knowledge has much broader and more transformative implications for accelerating scientific discovery across various experimental disciplines.

vs. AI scientists produce results without reasoning scientifically

gpt-5.25/5/2026

Paper 2 has higher likely scientific impact due to broad, timely relevance: it rigorously evaluates LLM-based scientific agents across eight domains with 25,000+ runs, quantifies how little scaffolds contribute vs base models, and provides behavioral evidence that agents violate key epistemic norms. These results have immediate implications for AI-for-science deployment, evaluation methodology, and training objectives across many fields. Paper 1 is novel and useful for formal methods/autoformalization with a valuable dataset and pipeline, but its direct impact is narrower and more tool/dataset-centric.