(Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

Wesley Pegden

Jun 8, 2026arXiv:2606.09674v1

cs.AIcs.LOmath.CO

#1657of 3489·Artificial Intelligence

#1657 of 3489 · Artificial Intelligence

Tournament Score

1405±44

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance7.5

Rigor5.5

Novelty7.5

Clarity7

Abstract

We present Trellis: an autoformalization system that leverages LLM agents in a deterministically constrained workflow to enforce incremental progress in Lean autoformalization tasks through iterative refinement of natural language proofs. Our approach is motivated by the common mathematician's notion of what it means to have a rigorous proof in the first place: namely, that it would be routine to elaborate any part of the proof in further detail. The result is a system which aims to achieve reliable autoformalization on a modest budget and with generalist agents, with specialization to autoformalization coming not from any task-specific agent training but instead from a meaning-of-rigor inspired workflow enforced by process semantics. We link to an end-to-end Lean formalization of a recent Ramsey theory breakthrough produced by the process.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Trellis Process Semantics for Autoformalization

1. Core Contribution

Trellis introduces a process semantics layer for autoformalization — the task of translating rigorous natural language mathematical proofs into formal proofs in Lean. The key conceptual insight is philosophically grounded: if a proof is truly rigorous, then elaborating any step should be "routine," and therefore autoformalization should be achievable through recursive refinement of proof steps until each becomes trivially formalizable. The system operationalizes this through a deterministic kernel (written in Rust, specified in TLA+) that enforces incremental progress via three verification gates (substantiveness, correspondence, soundness) applied to a DAG of proof nodes, each carrying paired natural language and Lean sides.

The main technical novelty is not a new LLM architecture, training procedure, or theorem-proving algorithm, but rather a workflow architecture — a carefully designed state machine that constrains off-the-shelf LLM agents to produce verified, incremental progress. This is a meaningful distinction: specialization comes from process design rather than model training.

2. Methodological Rigor

The paper demonstrates considerable engineering rigor:

Adversarial checking: Lean builds occur in an isolated environment inaccessible to worker agents, with axiom audits preventing `sorry` or unauthorized axioms.

Semantic closure and human review: Only a small set of meaning-bearing definitions/statements require human approval, and the kernel computes exactly which nodes affect paper targets' meaning.

Fingerprint-based verification tracking: Content-hash fingerprints automatically reopen verification lanes when upstream changes invalidate downstream judgments, providing a principled dependency-tracking mechanism.

Formal specification: A 10k-line TLA+ specification exists alongside the 35k-line Rust implementation.

However, the evaluation is notably limited in scope and lacks systematic benchmarking. The paper demonstrates Trellis on exactly two Ramsey theory papers. There are no comparisons against baseline autoformalization systems, no ablation studies of individual components, and no measurement of how much human intervention was actually required. The first formalization (Hefty et al.) was admittedly finished manually, and while the second (Bradač) is described as end-to-end, the paper doesn't rigorously quantify the boundaries of "no guidance." The pass rates in Table 1 (~93.8%) are informative but tell us little about how the system would perform on papers from different mathematical domains, at different difficulty levels, or with different LLMs.

The claim that Trellis achieves formalization "on a modest budget" (35% of a weekly ChatGPT Pro subscription over ~2 days) is interesting but hard to evaluate without knowing the absolute cost or comparing to alternatives.

3. Potential Impact

High potential, if the approach generalizes. The formalization of cutting-edge mathematical results (a 50-year-old barrier broken in Ramsey theory) within days of their arXiv appearance is genuinely impressive and unprecedented. If Trellis can reliably formalize diverse research mathematics, this could:

Dramatically accelerate the verification of new mathematical results

Change the culture around mathematical rigor and publication

Create a feedback loop where formalization catches errors in new papers quickly

Provide a reusable architectural pattern for other complex agentic workflows requiring verified incremental progress

The trust model is particularly well-articulated: faithfulness depends only on the human-reviewed semantic closure and machine-checked Lean builds, never on agent reliability. This is a strong property for adoption in high-stakes verification settings.

The process semantics concept — using deterministic kernels to enforce progress invariants on stochastic agents — could influence agentic system design beyond mathematics.

4. Timeliness & Relevance

This is extremely timely. The confluence of increasingly capable LLMs (GPT-5.5 is used here), mature proof assistants (Lean 4 + Mathlib), and growing interest in formal verification of mathematics creates an ideal moment for workflow-level innovations. The paper addresses a genuine bottleneck: raw LLM capability exists for individual formalization steps, but reliable end-to-end formalization of substantial mathematics has remained elusive. The identified failure modes (false completion claims, unfaithful formalization, empty wrapping) are well-documented problems in agentic systems.

5. Strengths & Limitations

Strengths:

Philosophically principled design grounded in mathematicians' actual understanding of rigor

Clear separation of concerns: correctness (machine-checked) vs. faithfulness (human-reviewed semantic closure) vs. progress (agent-dependent)

Impressive demonstration on cutting-edge, non-trivial mathematics

Full transparency: complete formalization repos with every checkpoint visible

The "contract enforcement rather than prompt engineering" principle is well-motivated and broadly applicable

Open-source implementation with formal specification

Limitations:

N=2 evaluation: Two papers from the same mathematical area (Ramsey theory) is insufficient to claim generality

No baselines or ablations: How much does each component contribute? Would simpler systems work on these papers?

LLM dependency: Uses GPT-5.5 (a frontier model); unclear how performance degrades with weaker models

Domain breadth: Ramsey theory's combinatorial character may be particularly amenable; analysis, geometry, or algebra might pose different challenges

Reproducibility concerns: Dependence on specific commercial LLM versions that may change

The paper reads more as a system description than as a scientific study: metrics are descriptive rather than comparative

Human involvement is underspecified: The human approval step and the manual completion of the first formalization blur the "fully automatic" narrative

Summary

Trellis represents a genuinely novel architectural contribution to autoformalization, distinguished by its philosophically motivated process design and its demonstration on serious mathematics. The trust model is elegant, and the engineering is substantial. However, the paper's scientific impact would be significantly strengthened by systematic evaluation across domains, comparison with alternatives, and ablation studies. As it stands, it is an impressive proof of concept with a compelling vision, but the evidence for generality remains thin.

Rating:7/ 10

Significance 7.5Rigor 5.5Novelty 7.5Clarity 7

Generated Jun 9, 2026

Comparison History (17)

Lostvs. Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Paper 1 addresses a critical and timely gap in evaluating deep research agents—multi-turn improvement under feedback—revealing fundamental limitations (regression during revision) that have broad implications for AI agent design. Its systematic methodology, quantitative findings, and publicly available code make it highly actionable for the rapidly growing field of AI agents. Paper 2, while innovative in combining LLMs with formal verification via a novel workflow, targets a narrower audience (autoformalization/theorem proving) and its impact depends on scalability beyond the demonstrated case study. Paper 1's findings are more broadly applicable across AI research.

claude-opus-4-6·Jun 9, 2026

Wonvs. FF-JEPA: Long-Horizon Planning in World Models with Latent Planners

Paper 2 likely has higher impact: it proposes a broadly applicable, timely methodology for reliable LLM-driven autoformalization with deterministic process semantics, demonstrated by an end-to-end Lean formalization of a recent major Ramsey theory result—strong evidence of rigor and real-world utility. Its approach can influence multiple fields (formal methods, theorem proving, ML/LLMs, mathematics) and addresses a central bottleneck in scalable verification. Paper 1 is innovative for long-horizon latent planning, but is more specialized, has preliminary results on a limited benchmark, and impact depends on further validation.

gpt-5.2·Jun 9, 2026

Wonvs. Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

Paper 1 is more novel: it proposes a deterministically constrained, process-semantics workflow for LLM-based autoformalization grounded in a principled notion of proof rigor, and demonstrates an end-to-end Lean formalization of a recent research-level result—high potential to shift how formalization is done and to impact mathematics, PL, and AI verification. Paper 2 is timely and useful for evaluation practice, but its core contribution is an empirical validation of pairwise/Elo rankings and an identified bias factor; impact is likely narrower and more incremental.

gpt-5.2·Jun 9, 2026

Wonvs. Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

Paper 2 tackles the highly impactful challenge of autoformalization in mathematics, demonstrating significant real-world utility by formalizing a recent Ramsey theory breakthrough. Its process-driven approach offers a reliable method for rigorous proof generation without requiring specialized training, potentially accelerating AI-assisted mathematics and formal verification. While Paper 1 introduces a valuable benchmark for physical assembly, its current success rates remain relatively low, making its immediate impact more exploratory compared to Paper 2's concrete achievement.

gemini-3.1-pro-preview·Jun 9, 2026

Lostvs. Off-Policy Evaluation with Strategic Agents via Local Disclosure

Paper 2 is likely higher impact: it tackles a timely, broadly relevant problem (off-policy evaluation under strategic covariate manipulation) with clear applications in high-stakes decision systems (lending, hiring, healthcare) and connections to causal inference, econometrics, and mechanism design. It proposes a concrete methodological contribution (local disclosure + doubly robust estimator) with stated consistency under explicit assumptions and empirical validation, suggesting rigor and adoptability. Paper 1 is novel and exciting for formal methods, but its impact may be narrower and more dependent on engineering maturity and generalization beyond showcased examples.

gpt-5.2·Jun 9, 2026

Lostvs. Structure Enables Effective Self-Localization of Errors in LLMs

Paper 1 addresses the fundamental and widely-studied problem of LLM self-correction with a concrete, well-motivated framework (Thought-ICS) that demonstrates measurable improvements (20-40% self-correction lift). It has broad applicability across all LLM reasoning tasks, strong methodological rigor with quantitative results, and connects to neuroscience insights about error monitoring. Paper 2 presents a niche autoformalization system with an interesting workflow but targets a narrower community (formal verification/theorem proving). Paper 1's broader relevance to the rapidly growing LLM reasoning and self-improvement research area gives it higher potential impact.

claude-opus-4-6·Jun 9, 2026

Wonvs. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

Paper 1 demonstrates a breakthrough in autoformalization, a highly sought-after goal in AI for mathematics. By successfully formalizing a recent Ramsey theory breakthrough using generalist agents in a constrained workflow, it offers a highly novel approach that could revolutionize how mathematical proofs are verified. While Paper 2 presents a valuable contribution to AI safety and agent guardrails, Paper 1's impact on advancing AI-driven mathematical reasoning represents a more fundamental and transformative scientific milestone.

gemini-3.1-pro-preview·Jun 9, 2026

Lostvs. Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Paper 1 introduces a foundational theoretical framework for causal learning in AI agents, addressing the critical issue of persistent errors. Its rigorous mathematical bounds and general applicability across reinforcement learning and LLM systems offer broader, longer-term foundational impact compared to Paper 2's practical, though valuable, workflow for autoformalization.

gemini-3.1-pro-preview·Jun 9, 2026

Wonvs. Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Paper 1 presents a novel autoformalization system (Trellis) that bridges natural language mathematics and formal proofs in Lean, demonstrated on a recent Ramsey theory breakthrough. It combines LLM agents with deterministic process semantics in a practical, end-to-end system with concrete results. This addresses a fundamental bottleneck in mathematical formalization with broad implications for AI-assisted mathematics. Paper 2, while intellectually rigorous, is primarily a theoretical/diagnostic framework for legal RAG limitations without implementation or empirical validation, limiting its immediate scientific impact.

claude-opus-4-6·Jun 9, 2026

Lostvs. TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

TRL-Bench addresses a significant gap in tabular representation learning by providing a standardized cross-paradigm evaluation framework with extensive benchmarks (20 models, 16 tasks, multiple granularities). Its breadth of impact is high given the ubiquity of tabular data in ML/AI. It provides reusable infrastructure, curated datasets, and actionable findings about encoder selection. Paper 2, while creative in combining LLMs with formal verification, presents a more niche contribution to autoformalization with limited demonstrated generality beyond a single case study. TRL-Bench's systematic methodology and broad applicability give it higher potential impact.

claude-opus-4-6·Jun 9, 2026

#1657of 3489·Artificial Intelligence

#1657 of 3489 · Artificial Intelligence

Tournament Score

1405±44

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

7/ 10

Significance7.5

Rigor5.5

Novelty7.5

Clarity7