Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

Yiming Fu, Peixuan Liu, Zichen Wang, Kun yuan

Jun 2, 2026

arXiv:2606.03743v1 PDF

cs.AI(primary)

#2586of 3404·Artificial Intelligence

#2586 of 3404 · Artificial Intelligence

Tournament Score

1332±43

10501800

32%

Win Rate

Wins

Losses

Matches

Rating

5.2/ 10

Significance5.5

Rigor4

Novelty5.5

Clarity7

Tournament Score

1332±43

10501800

32%

Win Rate

Wins

Losses

Matches

Rating

5.2/ 10

Significance

Rigor

Novelty

Clarity

Abstract

While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework $\textbf{Proof-Refactor}$ that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components. On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Proof-Refactor

1. Core Contribution

Proof-Refactor addresses a genuine gap in automated theorem proving: LLM-generated formal proofs compile but are typically monolithic, unreadable, and unsuitable for integration into mature formal libraries like Mathlib. The paper reframes proof improvement as a *process-guided refactoring problem* rather than an optimization against proxy metrics (primarily proof length). The framework decomposes refactoring into four phases—extraction, helper design, proving, and proof repair—mirroring how human mathematicians refactor proofs. A key technical contribution is the custom `extract` tactic for Lean 4, which turns tactic blocks into standalone "scaffold" theorems, enabling structured decomposition of proofs.

The distinction from prior work (ProofOptimizer, REFACTOR) is meaningful: those systems optimize computable objectives (usually length) or extract fragments preserving their original specificity. Proof-Refactor explicitly targets *generalization* and *library-style design* of helper declarations, which is a qualitatively different goal.

2. Methodological Rigor

Strengths in design: The four-phase pipeline is well-motivated and cleanly separates concerns. The decision to decouple high-level reasoning (via external model) from low-level Lean interaction (via Claude Code + lean-lsp-mcp) is validated by the ablation study, which shows substantial degradation when external assistance is removed.

Weaknesses in evaluation: The evaluation methodology is the paper's most significant limitation. The primary metric is a rubric-based LLM-as-a-judge protocol, which is inherently subjective and difficult to reproduce. The reported human agreement rate of 75% on pairwise rankings (28 problems) is modest—essentially one in four judgments disagrees. The rubric itself (Appendix A.2) is detailed but complex, with 15 sub-metrics across 5 dimensions, creating substantial room for evaluator inconsistency.

The dataset is small: 96 PutnamBench proofs and 12 Putnam2025 proofs. The improvements, while consistent across dimensions, are modest in absolute terms (e.g., +0.45 on a 1-5 scale for PutnamBench). Without statistical significance testing or confidence intervals, it's difficult to assess whether these differences are reliable. The paper also acknowledges that the judge uses LeanSearch for retrieval, introducing another potential source of variation.

The baseline comparison is reasonable (Claude Code with lean4-skills refactoring), but only one baseline is compared. No comparison is made against ProofOptimizer or REFACTOR-style approaches, even on overlapping dimensions.

3. Potential Impact

Near-term: The framework could be practically useful for teams building formal proof libraries, where proof quality matters beyond mere compilation. The `extract` tactic itself is a useful tool contribution that could be adopted independently.

Medium-term: As LLM-generated proofs become more common in formal mathematics, the gap between "compiling" and "library-quality" will become increasingly important. Proof-Refactor identifies this gap clearly and offers a structured approach. The process-guided decomposition could inspire similar pipelines for other code refactoring tasks beyond theorem proving.

Limitations on impact: The framework is expensive (four sequential Claude Code sessions plus external model calls), limiting scalability. The restriction to single-file refactoring without modifying existing library declarations significantly narrows applicability. Real library integration requires understanding naming conventions, import structures, and cross-file dependencies—none of which are addressed.

4. Timeliness & Relevance

The paper is well-timed. With systems like DeepSeek, Axiom Math, and others generating increasing volumes of formal proofs, the quality gap between generated and library-quality proofs is becoming a practical bottleneck. The community needs tools that go beyond proof generation to proof curation. The observation that binary compile/no-compile feedback is insufficient for proof quality is important and timely.

5. Strengths & Limitations

Key Strengths:

Well-articulated problem: the paper clearly identifies why compile-first objectives produce low-quality proofs and why length-based optimization is insufficient.

The case study (Putnam 1968 B6) is compelling and concrete, demonstrating how Proof-Refactor produces genuinely more general helpers (e.g., `tendsto_of_mem_ball_one_div_succ` generalized to arbitrary `PseudoMetricSpace`) versus the baseline's problem-specific extraction.

The ablation study provides meaningful evidence for the external-assistance design decision.

The `extract` tactic is a concrete, reusable tool contribution.

Notable Weaknesses:

Evaluation fragility: LLM-as-a-judge with 75% human agreement is a weak validation signal. The rubric, while detailed, is ultimately subjective, and scores on a 5-point scale with differences of 0.2-0.7 may not be robust.

Scale: 108 total problems is modest; no cross-domain evaluation (e.g., different proof styles, different mathematical areas beyond competition problems).

Cost analysis absent: No reporting of computational cost, API calls, or wall-clock time per refactoring, making practical adoption hard to assess.

No downstream validation: The paper doesn't test whether refactored proofs are actually easier for humans to maintain, modify, or integrate—only that an LLM judge scores them higher on a rubric.

Reproducibility concerns: The framework depends on Claude Opus 4.6, Gemini 3.1 Pro, and lean-lsp-mcp—all versioned, proprietary, or rapidly evolving components.

Limited theoretical grounding: The quality preference Q is instantiated purely empirically; no formal properties of the refactoring transformation are established (e.g., semantic preservation guarantees beyond recompilation).

Overall Assessment

Proof-Refactor identifies an important and under-explored problem—bridging the gap between compiling proofs and library-quality formal mathematics—and proposes a reasonable process-guided solution. The framework design is thoughtful and the case studies are illustrative. However, the evaluation methodology lacks the rigor needed to strongly support the claims: small datasets, subjective LLM-based scoring with moderate human agreement, no statistical testing, and no comparison with prior proof-improvement systems. The paper is best viewed as a proof-of-concept that opens a promising direction rather than as a definitive contribution.

Rating:5.2/ 10

Significance 5.5Rigor 4Novelty 5.5Clarity 7

Generated Jun 3, 2026

Comparison History (25)

vs. Multi-ResNets for Subspace Preconditioning in Constrained Optimization

claude-opus-4.66/6/2026

Paper 1 introduces a novel neural network architecture (MResOpt) for constrained optimization with theoretical grounding (infinite-width GP analysis) and demonstrates practical impact on important engineering problems like AC optimal power flow. It combines methodological rigor with broad applicability across optimization domains. Paper 2 addresses an interesting but narrower problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's contributions to ML-for-optimization are more foundational, with wider potential applications in energy systems, operations research, and engineering.

vs. What Type of Inference is Active Inference?

gemini-3.16/5/2026

Paper 2 addresses a highly timely and critical bottleneck in AI-assisted mathematics: the maintainability and modularity of LLM-generated formal proofs. Given the rapid adoption of tools like Lean in formal verification, a framework to refactor AI proofs into human-readable, library-quality artifacts has immense practical utility and immediate relevance. Paper 1 offers a solid theoretical clarification in the niche subfield of active inference, but Paper 2's potential to bridge automated theorem proving with collaborative human mathematics suggests a broader and more immediate scientific impact.

vs. A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

gpt-5.26/5/2026

Paper 1 introduces a novel, process-guided, agentic framework for improving the structure and reusability of LLM-generated formal proofs, with concrete evaluation on Lean benchmarks and demonstrated gains over a strong baseline—suggesting clear methodological contribution and near-term applicability to formal methods, automated theorem proving, and AI-assisted software verification. Its impact could extend across ML-for-formalization, proof engineering, and trustworthy AI tooling. Paper 2 is timely and broadly relevant, but as a scoping review it is primarily synthesizing existing work; its scientific novelty and methodological contribution are typically lower than a new technical system with validated improvements.

vs. Characterizing initial human-AI proof formalization workflows

claude-opus-4.66/5/2026

Paper 2 introduces a novel agentic framework (Proof-Refactor) addressing a concrete, underexplored problem—improving the quality of LLM-generated formal proofs beyond mere correctness. Its process-guided approach to modularity, readability, and maintainability fills an important gap in the proof generation pipeline and has direct practical applications for building formal mathematics libraries. Paper 1, while valuable as an HCI study of early AI-assisted formalization workflows, is primarily observational and descriptive, with findings that may quickly become outdated as AI tools evolve. Paper 2's methodological contribution is more durable and actionable.

vs. Perceive Before Reasoning: A Pre-Reasoning Perception Framework for Efficient and Reliable Proactive Mobile Agents

gemini-3.16/3/2026

Paper 1 addresses a critical bottleneck in proactive mobile agents (efficiency and false triggers) with a novel two-stage framework. Its focus on mobile assistance gives it significantly broader real-world applicability and immediate commercial relevance compared to Paper 2, which targets the narrower, albeit important, niche of formal mathematical proof refactoring.

vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing

claude-opus-4.66/3/2026

Paper 1 addresses a fundamental gap in AI-assisted formal verification—improving the quality and modularity of LLM-generated proofs to match library standards. This has broader scientific impact: it advances formal mathematics, proof engineering, and AI reasoning capabilities. The process-guided agentic framework is more novel and transferable. Paper 2 solves a practical cost-optimization problem (token reduction for multilingual prompts) but is more incremental and narrowly scoped—essentially an engineering optimization with limited scientific depth. Paper 1's contributions to formal verification methodology have deeper implications for trustworthy AI and mathematical reasoning research.

vs. Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

gpt-5.26/3/2026

Paper 2 has higher impact potential: it targets a timely, fast-growing area (LLM-generated formal proofs) and addresses a key bottleneck—turning compilable proofs into reusable library-quality artifacts—enabling practical downstream adoption in formalization workflows. Its process-guided, multi-phase agentic refactoring is novel relative to proxy-metric optimization and has clear real-world applicability in proof assistants and automated theorem proving. The evaluation on established Lean benchmarks and comparison to a strong baseline suggests reasonable rigor. Paper 1 is interesting but relies on indirect evidence for “natural experiments” and may have narrower, more method-sensitive impact.

vs. ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

gemini-3.16/3/2026

Paper 2 addresses a critical and highly timely issue in state-of-the-art large reasoning models: over-thinking and token inefficiency. By reducing token usage by 56% without sacrificing accuracy, it offers massive practical benefits for deployment cost and latency across a wide range of LLM applications. Paper 1 offers a valuable methodological contribution to automated theorem proving, but its impact is confined to a much narrower domain compared to the broad applicability of the general reasoning efficiency improvements proposed in Paper 2.

vs. Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

claude-opus-4.66/3/2026

Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning, proving a tight bound of four rule applications, and demonstrating practical benefits through more efficient estimators. This advances a core problem in causal inference with broad implications across statistics, epidemiology, economics, and AI. Paper 2 addresses an important but more niche problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's theoretical depth and breadth of applicability give it higher potential impact.

vs. When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning

gemini-3.16/3/2026

Paper 1 offers a broader scientific impact by addressing a fundamental question in LLM multi-agent systems (when debate helps vs. hurts). It provides a rigorous theoretical framework, extensive empirical validation across thousands of conditions, and generalizes to multiple domains. Paper 2, while valuable for formal mathematics, targets a much narrower niche (refactoring Lean proofs) and relies on subjective rubric-based evaluations, limiting its broader applicability compared to Paper 1's generalizable findings.

vs. CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

gemini-3.16/3/2026

Paper 1 addresses a critical bottleneck in healthcare—translating clinical decisions into executable orders—offering high potential for real-world application and societal impact. While Paper 2 presents an innovative approach to formal proof refactoring, its impact is largely confined to theoretical computer science and mathematics. The direct medical utility and comprehensive training pipeline (RL and verifiable tool usage) of Paper 1 give it a broader and more significant overall scientific and practical impact.

vs. Forget Attention: Importance-Aware Attention Is All You Need

gemini-3.16/3/2026

Paper 2 proposes a fundamental architectural innovation for hybrid language models by integrating SSM-derived importance directly into the attention mechanism. Improvements in foundation model architectures have massive breadth of impact across all of AI. Paper 1 addresses a valuable but much narrower niche (refactoring formal proofs in Lean), limiting its overall scientific impact compared to core LLM advancements.

vs. StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

gpt-5.26/3/2026

Paper 2 has higher potential impact due to greater novelty and broader relevance: it targets a key bottleneck in LLM-for-formalization—turning generated proofs into library-quality, reusable artifacts—via a process-guided, multi-phase agentic workflow that aligns with human refactoring practices. This can directly affect formal methods, theorem proving, software verification, and AI-assisted mathematics, with strong real-world applications in maintaining large proof/codebases. Paper 1 is useful and efficient but is more incremental (feature+temporal model for attribution) and narrower in cross-field reach.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

gemini-3.16/3/2026

Paper 2 presents a breakthrough in automated theorem proving, solving highly complex benchmark problems (Putnam 2025, IMO-style problems) and addressing open combinatorial challenges. Its ability to significantly boost the formal solve rate of LLMs and contribute to actual mathematical research gives it a substantially higher scientific impact than Paper 1, which focuses primarily on the post-generation refactoring and readability of proofs.

vs. AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

claude-opus-4.66/3/2026

AgentCL addresses a fundamental and broadly applicable challenge—rigorous evaluation of continual learning in language agents—proposing a comprehensive framework with controlled task streams, transfer metrics, and diagnostic tools (MemProbe). Its breadth of impact spans coding, research, and reasoning tasks, establishing evaluation methodology for a rapidly growing field. Paper 2, while novel in applying process-guided refactoring to formal proofs, addresses a narrower problem (proof readability/modularity in Lean) with more limited applicability. AgentCL's contributions to benchmarking and memory design evaluation have broader implications for the entire language agent community.

vs. Evaluating Transformer and LSTM Frameworks for Prediction in Ungauged Basins

gpt-5.26/3/2026

Paper 1 is more novel and timely, addressing a key bottleneck in LLM-driven formal mathematics: converting generated proofs into maintainable, library-quality artifacts without relying on simplistic proxy metrics. Its process-guided, agentic refactoring pipeline could generalize across proof assistants and improve reliability/reusability of machine-assisted theorem proving, with broad implications for formal verification and automated reasoning. Paper 2 is a solid, rigorous applied comparison study, but its primary conclusion (LSTM outperforming an encoder-only Transformer in this setting) is incremental and narrower in cross-field impact.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

gpt-5.26/3/2026

Paper 1 has higher impact potential due to its broader, more general contribution: co-evolving both agent policies and the training harness addresses a core bottleneck in autonomous RL for LLM agents and is applicable across many domains (math, coding, long-horizon SWE, beyond). It targets timelier, widely relevant needs in scalable agent training and evaluation robustness, with clear real-world leverage for improving autonomous systems. Paper 2 is novel and valuable but is narrower in scope (formal proof refactoring in Lean) and likely to impact a smaller community, despite strong methodological motivation.

vs. What Makes Interaction Trajectories Effective for Training Terminal Agents?

gpt-5.26/3/2026

Paper 1 likely has higher impact due to broader applicability and timeliness: it targets general agent post-training for real-world terminal tasks, proposes a scalable pipeline (Terminal-Lego), and identifies a novel “pedagogical paradox” explained by environment-grounded supervision and harness-visible interaction structure. The reported data-efficiency gains and emphasis on “harness engineering” could influence many agent-training setups beyond coding (tool use, robotics-like environments, web agents). Paper 2 is methodologically solid and valuable for formal methods, but its impact is narrower to Lean proof engineering and less cross-domain.

vs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

gemini-3.16/3/2026

Paper 1 proposes a significant paradigm shift in UGC evaluation by modeling collective human cognition via a novel 'Social-CoT' mechanism. Its application to social media engagement offers massive real-world utility and broad impact across AI, multimedia, and computational sociology. While Paper 2 presents a valuable framework for formal mathematics, its impact is relatively confined to the niche field of automated theorem proving, making Paper 1 more broadly influential.

vs. FlowTime: Towards Continuous Generative Watch Time Prediction via Flow-based Personalized Priors

gpt-5.26/3/2026

Paper 2 likely has higher scientific impact: it introduces a new paradigm (continuous generative regression) for a widely deployed, high-stakes problem (watch-time prediction), proposes a practical low-latency generative model with personalized flow-based priors, and validates via extensive offline studies plus online A/B tests. It also contributes community infrastructure (an open-source WTP library and benchmarking metric), which can accelerate follow-on work across recommender systems and generative modeling. Paper 1 is novel and valuable for formal methods, but its immediate real-world reach and cross-field breadth are narrower.