Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts
Yiming Fu, Peixuan Liu, Zichen Wang, Kun yuan
Abstract
While Large Language Models (LLMs) have shown strong performance in generating formal proofs, their outputs often remain less readable, modular, maintainable, and reusable than proofs in mature formal mathematics libraries. We argue that this gap stems in part from the compile-first objective implicit in most proof-generation pipelines, which encourages monolithic or ad hoc proof scripts rather than library-quality artifacts. Existing approaches to proof-quality improvement often rely on explicit, computable optimization objectives. In practice, however, the most tractable and experimentally validated objectives are largely length-based, while higher-level qualities such as readability, modularity, maintainability, and reusability are difficult to reduce to reliable automatic metrics. Instead of optimizing proof improvement against a single proxy metric, we take a process-guided approach inspired by human proof-refactoring workflows. We propose an agentic framework that decomposes proof refactoring into four phases: extracting candidate proof fragments, designing helper declarations, formally proving the extracted and designed components, and repairing the original proof using the verified components. On generated Lean proofs from PutnamBench and Putnam2025, Proof-Refactor improves rubric-based refactoring scores over a strong Claude Code refactoring baseline, with the largest gains in signature quality and human readability. These results suggest that process-guided refactoring can improve proof structure without treating proof length as the primary objective.
AI Impact Assessments
(1 models)Scientific Impact Assessment: Proof-Refactor
1. Core Contribution
Proof-Refactor addresses a genuine gap in automated theorem proving: LLM-generated formal proofs compile but are typically monolithic, unreadable, and unsuitable for integration into mature formal libraries like Mathlib. The paper reframes proof improvement as a *process-guided refactoring problem* rather than an optimization against proxy metrics (primarily proof length). The framework decomposes refactoring into four phases—extraction, helper design, proving, and proof repair—mirroring how human mathematicians refactor proofs. A key technical contribution is the custom `extract` tactic for Lean 4, which turns tactic blocks into standalone "scaffold" theorems, enabling structured decomposition of proofs.
The distinction from prior work (ProofOptimizer, REFACTOR) is meaningful: those systems optimize computable objectives (usually length) or extract fragments preserving their original specificity. Proof-Refactor explicitly targets *generalization* and *library-style design* of helper declarations, which is a qualitatively different goal.
2. Methodological Rigor
Strengths in design: The four-phase pipeline is well-motivated and cleanly separates concerns. The decision to decouple high-level reasoning (via external model) from low-level Lean interaction (via Claude Code + lean-lsp-mcp) is validated by the ablation study, which shows substantial degradation when external assistance is removed.
Weaknesses in evaluation: The evaluation methodology is the paper's most significant limitation. The primary metric is a rubric-based LLM-as-a-judge protocol, which is inherently subjective and difficult to reproduce. The reported human agreement rate of 75% on pairwise rankings (28 problems) is modest—essentially one in four judgments disagrees. The rubric itself (Appendix A.2) is detailed but complex, with 15 sub-metrics across 5 dimensions, creating substantial room for evaluator inconsistency.
The dataset is small: 96 PutnamBench proofs and 12 Putnam2025 proofs. The improvements, while consistent across dimensions, are modest in absolute terms (e.g., +0.45 on a 1-5 scale for PutnamBench). Without statistical significance testing or confidence intervals, it's difficult to assess whether these differences are reliable. The paper also acknowledges that the judge uses LeanSearch for retrieval, introducing another potential source of variation.
The baseline comparison is reasonable (Claude Code with lean4-skills refactoring), but only one baseline is compared. No comparison is made against ProofOptimizer or REFACTOR-style approaches, even on overlapping dimensions.
3. Potential Impact
Near-term: The framework could be practically useful for teams building formal proof libraries, where proof quality matters beyond mere compilation. The `extract` tactic itself is a useful tool contribution that could be adopted independently.
Medium-term: As LLM-generated proofs become more common in formal mathematics, the gap between "compiling" and "library-quality" will become increasingly important. Proof-Refactor identifies this gap clearly and offers a structured approach. The process-guided decomposition could inspire similar pipelines for other code refactoring tasks beyond theorem proving.
Limitations on impact: The framework is expensive (four sequential Claude Code sessions plus external model calls), limiting scalability. The restriction to single-file refactoring without modifying existing library declarations significantly narrows applicability. Real library integration requires understanding naming conventions, import structures, and cross-file dependencies—none of which are addressed.
4. Timeliness & Relevance
The paper is well-timed. With systems like DeepSeek, Axiom Math, and others generating increasing volumes of formal proofs, the quality gap between generated and library-quality proofs is becoming a practical bottleneck. The community needs tools that go beyond proof generation to proof curation. The observation that binary compile/no-compile feedback is insufficient for proof quality is important and timely.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
Proof-Refactor identifies an important and under-explored problem—bridging the gap between compiling proofs and library-quality formal mathematics—and proposes a reasonable process-guided solution. The framework design is thoughtful and the case studies are illustrative. However, the evaluation methodology lacks the rigor needed to strongly support the claims: small datasets, subjective LLM-based scoring with moderate human agreement, no statistical testing, and no comparison with prior proof-improvement systems. The paper is best viewed as a proof-of-concept that opens a promising direction rather than as a definitive contribution.
Generated Jun 3, 2026
Comparison History (25)
Paper 1 introduces a novel neural network architecture (MResOpt) for constrained optimization with theoretical grounding (infinite-width GP analysis) and demonstrates practical impact on important engineering problems like AC optimal power flow. It combines methodological rigor with broad applicability across optimization domains. Paper 2 addresses an interesting but narrower problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's contributions to ML-for-optimization are more foundational, with wider potential applications in energy systems, operations research, and engineering.
Paper 2 addresses a highly timely and critical bottleneck in AI-assisted mathematics: the maintainability and modularity of LLM-generated formal proofs. Given the rapid adoption of tools like Lean in formal verification, a framework to refactor AI proofs into human-readable, library-quality artifacts has immense practical utility and immediate relevance. Paper 1 offers a solid theoretical clarification in the niche subfield of active inference, but Paper 2's potential to bridge automated theorem proving with collaborative human mathematics suggests a broader and more immediate scientific impact.
Paper 1 introduces a novel, process-guided, agentic framework for improving the structure and reusability of LLM-generated formal proofs, with concrete evaluation on Lean benchmarks and demonstrated gains over a strong baseline—suggesting clear methodological contribution and near-term applicability to formal methods, automated theorem proving, and AI-assisted software verification. Its impact could extend across ML-for-formalization, proof engineering, and trustworthy AI tooling. Paper 2 is timely and broadly relevant, but as a scoping review it is primarily synthesizing existing work; its scientific novelty and methodological contribution are typically lower than a new technical system with validated improvements.
Paper 2 introduces a novel agentic framework (Proof-Refactor) addressing a concrete, underexplored problem—improving the quality of LLM-generated formal proofs beyond mere correctness. Its process-guided approach to modularity, readability, and maintainability fills an important gap in the proof generation pipeline and has direct practical applications for building formal mathematics libraries. Paper 1, while valuable as an HCI study of early AI-assisted formalization workflows, is primarily observational and descriptive, with findings that may quickly become outdated as AI tools evolve. Paper 2's methodological contribution is more durable and actionable.
Paper 1 addresses a critical bottleneck in proactive mobile agents (efficiency and false triggers) with a novel two-stage framework. Its focus on mobile assistance gives it significantly broader real-world applicability and immediate commercial relevance compared to Paper 2, which targets the narrower, albeit important, niche of formal mathematical proof refactoring.
Paper 1 addresses a fundamental gap in AI-assisted formal verification—improving the quality and modularity of LLM-generated proofs to match library standards. This has broader scientific impact: it advances formal mathematics, proof engineering, and AI reasoning capabilities. The process-guided agentic framework is more novel and transferable. Paper 2 solves a practical cost-optimization problem (token reduction for multilingual prompts) but is more incremental and narrowly scoped—essentially an engineering optimization with limited scientific depth. Paper 1's contributions to formal verification methodology have deeper implications for trustworthy AI and mathematical reasoning research.
Paper 2 has higher impact potential: it targets a timely, fast-growing area (LLM-generated formal proofs) and addresses a key bottleneck—turning compilable proofs into reusable library-quality artifacts—enabling practical downstream adoption in formalization workflows. Its process-guided, multi-phase agentic refactoring is novel relative to proxy-metric optimization and has clear real-world applicability in proof assistants and automated theorem proving. The evaluation on established Lean benchmarks and comparison to a strong baseline suggests reasonable rigor. Paper 1 is interesting but relies on indirect evidence for “natural experiments” and may have narrower, more method-sensitive impact.
Paper 2 addresses a critical and highly timely issue in state-of-the-art large reasoning models: over-thinking and token inefficiency. By reducing token usage by 56% without sacrificing accuracy, it offers massive practical benefits for deployment cost and latency across a wide range of LLM applications. Paper 1 offers a valuable methodological contribution to automated theorem proving, but its impact is confined to a much narrower domain compared to the broad applicability of the general reasoning efficiency improvements proposed in Paper 2.
Paper 1 makes a fundamental theoretical contribution to causal inference by characterizing the complete structure of do-calculus reasoning, proving a tight bound of four rule applications, and demonstrating practical benefits through more efficient estimators. This advances a core problem in causal inference with broad implications across statistics, epidemiology, economics, and AI. Paper 2 addresses an important but more niche problem of refactoring LLM-generated formal proofs, with impact largely limited to the formal verification community. Paper 1's theoretical depth and breadth of applicability give it higher potential impact.
Paper 1 offers a broader scientific impact by addressing a fundamental question in LLM multi-agent systems (when debate helps vs. hurts). It provides a rigorous theoretical framework, extensive empirical validation across thousands of conditions, and generalizes to multiple domains. Paper 2, while valuable for formal mathematics, targets a much narrower niche (refactoring Lean proofs) and relies on subjective rubric-based evaluations, limiting its broader applicability compared to Paper 1's generalizable findings.
Paper 1 addresses a critical bottleneck in healthcare—translating clinical decisions into executable orders—offering high potential for real-world application and societal impact. While Paper 2 presents an innovative approach to formal proof refactoring, its impact is largely confined to theoretical computer science and mathematics. The direct medical utility and comprehensive training pipeline (RL and verifiable tool usage) of Paper 1 give it a broader and more significant overall scientific and practical impact.
Paper 2 proposes a fundamental architectural innovation for hybrid language models by integrating SSM-derived importance directly into the attention mechanism. Improvements in foundation model architectures have massive breadth of impact across all of AI. Paper 1 addresses a valuable but much narrower niche (refactoring formal proofs in Lean), limiting its overall scientific impact compared to core LLM advancements.
Paper 2 has higher potential impact due to greater novelty and broader relevance: it targets a key bottleneck in LLM-for-formalization—turning generated proofs into library-quality, reusable artifacts—via a process-guided, multi-phase agentic workflow that aligns with human refactoring practices. This can directly affect formal methods, theorem proving, software verification, and AI-assisted mathematics, with strong real-world applications in maintaining large proof/codebases. Paper 1 is useful and efficient but is more incremental (feature+temporal model for attribution) and narrower in cross-field reach.
Paper 2 presents a breakthrough in automated theorem proving, solving highly complex benchmark problems (Putnam 2025, IMO-style problems) and addressing open combinatorial challenges. Its ability to significantly boost the formal solve rate of LLMs and contribute to actual mathematical research gives it a substantially higher scientific impact than Paper 1, which focuses primarily on the post-generation refactoring and readability of proofs.
AgentCL addresses a fundamental and broadly applicable challenge—rigorous evaluation of continual learning in language agents—proposing a comprehensive framework with controlled task streams, transfer metrics, and diagnostic tools (MemProbe). Its breadth of impact spans coding, research, and reasoning tasks, establishing evaluation methodology for a rapidly growing field. Paper 2, while novel in applying process-guided refactoring to formal proofs, addresses a narrower problem (proof readability/modularity in Lean) with more limited applicability. AgentCL's contributions to benchmarking and memory design evaluation have broader implications for the entire language agent community.
Paper 1 is more novel and timely, addressing a key bottleneck in LLM-driven formal mathematics: converting generated proofs into maintainable, library-quality artifacts without relying on simplistic proxy metrics. Its process-guided, agentic refactoring pipeline could generalize across proof assistants and improve reliability/reusability of machine-assisted theorem proving, with broad implications for formal verification and automated reasoning. Paper 2 is a solid, rigorous applied comparison study, but its primary conclusion (LSTM outperforming an encoder-only Transformer in this setting) is incremental and narrower in cross-field impact.
Paper 1 has higher impact potential due to its broader, more general contribution: co-evolving both agent policies and the training harness addresses a core bottleneck in autonomous RL for LLM agents and is applicable across many domains (math, coding, long-horizon SWE, beyond). It targets timelier, widely relevant needs in scalable agent training and evaluation robustness, with clear real-world leverage for improving autonomous systems. Paper 2 is novel and valuable but is narrower in scope (formal proof refactoring in Lean) and likely to impact a smaller community, despite strong methodological motivation.
Paper 1 likely has higher impact due to broader applicability and timeliness: it targets general agent post-training for real-world terminal tasks, proposes a scalable pipeline (Terminal-Lego), and identifies a novel “pedagogical paradox” explained by environment-grounded supervision and harness-visible interaction structure. The reported data-efficiency gains and emphasis on “harness engineering” could influence many agent-training setups beyond coding (tool use, robotics-like environments, web agents). Paper 2 is methodologically solid and valuable for formal methods, but its impact is narrower to Lean proof engineering and less cross-domain.
Paper 1 proposes a significant paradigm shift in UGC evaluation by modeling collective human cognition via a novel 'Social-CoT' mechanism. Its application to social media engagement offers massive real-world utility and broad impact across AI, multimedia, and computational sociology. While Paper 2 presents a valuable framework for formal mathematics, its impact is relatively confined to the niche field of automated theorem proving, making Paper 1 more broadly influential.
Paper 2 likely has higher scientific impact: it introduces a new paradigm (continuous generative regression) for a widely deployed, high-stakes problem (watch-time prediction), proposes a practical low-latency generative model with personalized flow-based priors, and validates via extensive offline studies plus online A/B tests. It also contributes community infrastructure (an open-source WTP library and benchmarking metric), which can accelerate follow-on work across recommender systems and generative modeling. Paper 1 is novel and valuable for formal methods, but its immediate real-world reach and cross-field breadth are narrower.