Characterizing initial human-AI proof formalization workflows

Katherine M. Collins, Simon Frieder, Jonas Bayer, Jacob Loader, Jeck Lim, Peiyang Song, Fabian Zaiser, Lexin Zhou

Jun 2, 2026arXiv:2606.04273v1

cs.AI

#2548of 3572·Artificial Intelligence

#2548 of 3572 · Artificial Intelligence

Tournament Score

1342±44

10501800

44%

Win Rate

Wins

Losses

Matches

Rating

4.8/ 10

Significance5.5

Rigor4

Novelty5

Clarity6.5

Abstract

For centuries, human mathematicians have written proofs to substantiate their mathematical arguments; yet, the ability to automatically verify the validity of proofs has long been a challenge. Advances in AI systems' ability to generate code and engage in increasingly high-level mathematical reasoning promise to transform people's ability to formalize and thereby verify proofs. While many works focus on benchmarking the current frontier, we instead study how people use these tools. We conduct a mixed-methods analysis into the initial impact of AI on people's formalization workflows: what people claim they want, what they see as the barriers to those visions, and how they actually use and adapt AI in practice. A qualitative survey shows that people's preferences are diverse, but with a general desire for AI assistance in formalization that preserves high-level human control over the proof discovery process. To assess how people actually engage with AI for formalization under such limitations, we conduct a controlled user study in which participants formalize informal math problems and their proofs, with and without AI, across a range of mathematical problems at varying levels of difficulty and domains. Despite limitations of the tools at the time for autoformalization, participants tend to attain higher formalization accuracy when allowed access to AI tools than when formalizing on their own, with most participants flexibly choosing to use multiple different AI tools. Taken together, our work sheds light on the early stages of AI integration into formalization workflows, involving an intimate interplay of human and AI engagement.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper addresses a largely overlooked dimension of AI-assisted formal theorem proving: how humans actually use and want to use AI tools in their formalization workflows. While the field has been dominated by capability benchmarks (e.g., AlphaProof, DeepSeek-Prover), this work turns the lens onto the human side, combining a qualitative survey (N=31) with a controlled user study (N=7) to characterize preferences, perceived barriers, and actual behavior when formalizing proofs in Lean with and without AI assistance. The core finding is that users generally prefer to maintain high-level strategic control while delegating mechanical/tedious subtasks to AI, and that participants flexibly construct multi-tool workflows to compensate for individual tool limitations.

Methodological Rigor

The mixed-methods design is appropriate for an exploratory study of this nature, but there are significant methodological limitations:

Survey (N=31): The sample is small and recruited through convenience sampling (X/Twitter, mailing lists, personal contacts), raising concerns about representativeness. The quality control filtering (removing 9 of 40 respondents) is reasonable but the remaining sample is still modest. The multi-phase coding process involving multiple authors is commendable and adds credibility to the qualitative analysis, though inherent subjectivity is acknowledged.

User Study (N=7): This is the more concerning aspect. Seven participants (reduced from 8) is extremely small for drawing generalizable conclusions. The within-subjects design (3 problems with tools, 3 without) with counterbalancing is sound in principle, but the statistical analysis — a linear mixed-effects model on 42 trials — operates at the margins of what such models can reliably estimate. The reported effect (0.30 increase in probability of correct response, p=0.003) should be interpreted cautiously given the small sample. The random intercept standard deviations (0.20 for participant, 0.28 for problem) suggest meaningful variability that 7 participants cannot adequately capture.

The accuracy grading relies on a single grader per problem (from the author team), with binary correct/incorrect coding that the authors themselves acknowledge is "inherently subjective." The 80+ hours of video data represent a rich resource, but only preliminary qualitative analysis has been conducted on a subset.

Potential Impact

The paper's primary contribution is as an early empirical record of human-AI interaction patterns in a specialized domain (formal theorem proving) at a specific moment in time (late 2025, pre-agentic coding surge). This "moment-in-time snapshot" framing is both a strength and limitation.

For the formal verification community: The finding that users want AI for mechanical tasks but not high-level proof strategy mirrors patterns seen in general software development (Barke et al., 2023; Mozannar et al., 2024), but documenting this specifically for formalization is valuable. The detailed self-reports (Tables 1-3) of multi-tool workflows provide actionable insights for tool builders.

For HCI/human-AI interaction: The observation that users flexibly construct multi-tool workflows, deliberately accepting/rejecting AI suggestions, contributes to the growing literature on human adaptation to AI tools. However, the findings are not dramatically different from what has been documented in general programming contexts.

For AI capabilities research: The paper implicitly argues that benchmarks alone are insufficient — understanding deployment contexts matters. This is a well-established argument in HCI but bears repeating for the AI capabilities community.

Timeliness & Relevance

The timing is both advantageous and challenging. The study captures a pre-agentic-coding moment (before Claude Code, Codex became widespread), which provides a baseline. However, the rapid pace of tool development means the specific tool limitations participants encountered may already be partially resolved, potentially limiting the shelf life of some findings. The paper is timely in the sense that formal verification is receiving unprecedented attention (sphere packing formalization, IMO gold medal), making understanding human workflows increasingly important.

Strengths

1. Unique focus: Studying human users rather than model capabilities fills a genuine gap in the literature. The combination of stated preferences (survey) with revealed preferences (user study) is methodologically sound in principle.

2. Rich qualitative data: The self-reported tool usage descriptions (Tables 1-3) and video analysis excerpts provide genuinely illuminating windows into real formalization practice. The taxonomy of workflow patterns (human with AI assistance, AI with human help, hybrid) is useful.

3. Ecological validity: Allowing participants to use any tools they wanted (rather than constraining to one system) captures more realistic behavior.

4. Transparency: Making survey responses, codings, and eventually formalizations available strengthens reproducibility.

Limitations

1. Small sample sizes: N=31 for the survey and N=7 for the user study severely limit generalizability. The statistical analyses, while technically appropriate, are underpowered.

2. Selection bias: Participants were recruited through networks and specific channels, likely over-representing engaged, technically sophisticated users.

3. Limited video analysis: The 80+ hours of video data — potentially the most valuable contribution — have only been superficially analyzed. The paper promises future work but currently offers only anecdotal examples.

4. Narrow scope: Focusing exclusively on formalizing *given* proofs (not proof discovery) limits the applicability of findings to the broader mathematical workflow.

5. Confound in study design: The instruction to use AI in the tools-allowed week may have biased participants toward AI use they wouldn't otherwise choose, as the authors acknowledge.

6. Incremental novelty: Many findings (preference for human control, multi-tool use, frustration with reliability) echo well-documented patterns from the general AI-assisted programming literature.

Overall Assessment

This is a well-motivated exploratory study that fills a genuine gap by focusing on human users of AI formalization tools rather than model capabilities. However, the small sample sizes, preliminary nature of the video analysis, and overlap with existing findings from AI-assisted programming limit the immediate scientific impact. The paper's greatest value may be as a historical record and as motivation for larger-scale, more systematic studies. The rich qualitative data — particularly the video recordings — represent a significant resource whose potential has not yet been fully realized in this paper.

Rating:4.8/ 10

Significance 5.5Rigor 4Novelty 5Clarity 6.5

Generated Jun 5, 2026

Comparison History (16)

Lostvs. Residual Modeling for High-Fidelity Learned Compression of Scientific Data

Paper 2 demonstrates higher potential scientific impact because it addresses a critical, universal bottleneck in large-scale computational science: massive data compression. By improving high-fidelity learned compression ratios by 10-40% over state-of-the-art methods for large simulation datasets (e.g., climate, fluid dynamics), it offers immediate, foundational utility across numerous physical sciences. While Paper 1 provides valuable HCI insights into AI-assisted math formalization, its scope is relatively niche. Paper 2's methodological rigor and direct applicability to exascale scientific computing give it broader, more immediate cross-disciplinary impact.

gemini-3.1-pro-preview·Jun 6, 2026

Wonvs. Answer Presence Drives RAG Rewriting Gains

Paper 2 addresses the broader and more timely topic of human-AI collaboration in mathematical proof formalization, combining qualitative and quantitative methods in a user study with wide interdisciplinary relevance (AI, HCI, mathematics, formal verification). It has potential to influence how AI tools are designed for mathematicians. Paper 1, while methodologically rigorous, addresses a narrower technical point about RAG pipeline evaluation—specifically whether rewriter gains are driven by answer leakage—which, though useful, has a more limited audience and incremental contribution to the field.

claude-opus-4-6·Jun 6, 2026

Lostvs. An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)

Paper 1 likely has higher scientific impact due to a novel, scalable methodological integration (deep learning MOAKS + conformal uncertainty filtering + longitudinal latent class mixed modeling) applied to a major clinical dataset, yielding clear performance gains and clinically actionable structure–pain risk estimates. It has immediate translational relevance for osteoarthritis research, imaging-derived phenotyping, and trial stratification, with potential adoption across medical imaging and epidemiology. Paper 2 is timely and valuable for HCI/AI-for-math, but its impact is more exploratory and narrower, focusing on early workflow characterization rather than broadly generalizable methods or direct real-world outcomes.

gpt-5.2·Jun 6, 2026

Wonvs. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Paper 1 addresses a novel and timely intersection of AI and mathematical proof formalization through rigorous mixed-methods research (survey + controlled user study), offering foundational insights into human-AI collaboration workflows. Its contributions span HCI, AI, and mathematics, with broad implications for how AI tools are designed and integrated. Paper 2, while useful, primarily benchmarks multi-agent LLM systems and finds mostly negative results (more agents don't help much), which is incrementally informative but narrower in scope and more likely to be superseded by rapid LLM developments.

claude-opus-4-6·Jun 6, 2026

Wonvs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

Paper 1 addresses a more novel and broadly impactful research question—how humans integrate AI into mathematical proof formalization workflows—combining qualitative and controlled user study methodologies at the intersection of HCI, AI, and formal mathematics. This topic is timely given the rapid advancement of LLMs and has implications across multiple fields. Paper 2 makes incremental engineering contributions (column masking, unified task head, TF-IDF encoding) to an existing architecture on a specific benchmark, representing a narrower, more incremental advance with limited broader impact beyond the relational deep learning community.

claude-opus-4-6·Jun 6, 2026

Wonvs. Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

Paper 1 has higher likely scientific impact: it addresses a timely, foundational shift (human–AI collaboration in proof formalization) with broad implications for formal methods, programming languages, AI-assisted mathematics, and HCI. Its mixed-methods + controlled user study adds methodological rigor and produces actionable insights about real workflows, barriers, and tool use—knowledge that can guide system design and evaluation across many proof assistants. Paper 2 is relevant to safety engineering, but the multi-agent dialogue framing is comparatively incremental and its impact is narrower and more dependent on dataset/task specifics.

gpt-5.2·Jun 6, 2026

Lostvs. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Paper 2 likely has higher impact: it introduces a timely, reusable benchmark targeting a clear and under-tested failure mode of multimodal/tool-using models—benefiting from self-generated visualizations—relevant to real scientific/engineering workflows. Benchmarks often catalyze broad, cross-field progress (multimodal reasoning, tool use, evaluation, curriculum/data generation) and enable standardized comparison and diagnosis. Paper 1 provides valuable mixed-methods insight into early human-AI formalization workflows, but its impact may be narrower (HCI + theorem proving community) and less likely to become a widely adopted artifact than a benchmark.

gpt-5.2·Jun 5, 2026

Lostvs. On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

Paper 1 identifies a novel and fundamental failure mode (information self-locking) in RL-trained LLM agents, provides both theoretical and empirical analysis, and proposes a practical solution (AREW) with strong experimental results across 9 tasks. This addresses a core challenge in the rapidly growing field of LLM-based agents, with broad implications for agent training methodology. Paper 2, while valuable, is primarily an observational user study of AI-assisted proof formalization—a narrower domain with more limited methodological contribution and less generalizable impact.

claude-opus-4-6·Jun 5, 2026

Wonvs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Paper 2 addresses the fundamental and timely intersection of AI and mathematical proof formalization through rigorous mixed-methods research (qualitative survey + controlled user study). It studies human-AI collaboration in a broadly impactful domain with clear methodological rigor and relevance across mathematics, CS, and HCI. Paper 1, while novel in reframing UGC quality assessment with social dynamics, introduces a narrower, more application-specific framework with several ad-hoc components (Social-CoT, persona simulation) whose generalizability and scientific grounding are less established. Paper 2's findings on human-AI workflows have broader implications for the future of formal verification and AI-assisted reasoning.

claude-opus-4-6·Jun 5, 2026

Lostvs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

AI safety is a critical, high-impact field with immediate real-world consequences given the explosive growth of AI companions. Paper 2 introduces the first public benchmark dataset for this domain, providing a foundational resource for NLP, HCI, and AI ethics researchers. While Paper 1 is rigorous and novel, its impact is largely confined to the niche domain of mathematical formalization, whereas Paper 2 addresses a widespread societal concern affecting millions of users.

gemini-3.1-pro-preview·Jun 5, 2026

#2548of 3572·Artificial Intelligence

#2548 of 3572 · Artificial Intelligence

Tournament Score

1342±44

10501800

44%

Win Rate

Wins

Losses

Matches

Rating

4.8/ 10

Significance5.5

Rigor4

Novelty5

Clarity6.5