Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad
Abstract
TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.
AI Impact Assessments
(1 models)Scientific Impact Assessment
1. Core Contribution
This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language descriptions. The core contribution is an empirical benchmark: 30 LLMs across eight families, evaluated on 205 curated TLA+ specifications using four prompting strategies, with outputs validated through both the SANY parser (syntactic correctness) and TLC model checker (semantic correctness). The key finding is a stark syntax-semantics gap: up to 26.6% syntactic correctness but only 8.6% semantic correctness, with semantic success occurring exclusively under progressive prompting. The paper also contributes a taxonomy of five hallucination categories specific to formal specification generation.
The problem addressed—automating formal specification from natural language—is genuinely important. TLA+ adoption is bottlenecked by expertise requirements, and understanding where LLMs fail is a prerequisite for building better tools. However, the contribution is primarily diagnostic rather than constructive; the paper identifies problems without proposing solutions beyond directional suggestions.
2. Methodological Rigor
The experimental design is reasonably thorough in breadth (30 models, 4 strategies, 2,730+ runs) but has notable limitations in depth:
Strengths: The two-stage validation pipeline (SANY → TLC) provides a principled distinction between syntactic and semantic correctness. Using an established community repository (TLA+ Examples) as the data source adds credibility. The stratified train/val/test split and consistent hyperparameter settings across models support fair comparison.
Weaknesses: Each model-specification pair is run only once per strategy, meaning stochastic variation is uncontrolled. The authors acknowledge this but argue scale compensates—a debatable claim since per-model conclusions (e.g., "DeepSeek r1:8b outperforms r1:70b") rest on single runs per specification. The test set of 26 specifications is small, reducing statistical power for per-specification conclusions. The proprietary models are tested under only one prompting strategy (few-shot), making cross-strategy comparisons impossible for the strongest models. The paper does not report confidence intervals or significance tests for any of its headline claims.
The textual similarity metrics (BLEU, ROUGE-L) are applied only to completion strategies and are acknowledged to be poor proxies for specification quality—their inclusion adds bulk but limited insight.
3. Potential Impact
Practical relevance: The finding that current LLMs cannot reliably generate TLA+ specifications is useful for practitioners who might otherwise trust LLM output without verification. The hallucination taxonomy provides actionable insights: Unicode normalization, semicolon stripping, and reasoning-token removal are concrete, implementable post-processing steps.
Research impact: The benchmark dataset and evaluation framework could serve as a foundation for future work, though the dataset is modest (205 specifications, only 26 in the test set). The observation that smaller reasoning-aligned models outperform larger ones and code-specialized models on TLA+ is interesting but not deeply surprising given TLA+'s tiny training corpus.
Broader influence: The paper's implications extend modestly to other low-resource formal languages (Alloy, Event-B, Promela), though the authors do not explore this generalization. The hallucination categories (Unicode substitution, cross-language injection) likely apply to any formal language underrepresented in training data.
4. Timeliness & Relevance
The paper is timely. The TLA+ Foundation launched its GenAI Challenge in 2025, and industrial adoption of formal methods is growing. The intersection of LLMs and formal verification is an active research frontier. Prior work (Specula, SysMoBench) focuses on code-to-specification or specification-to-code translation; this paper fills the natural-language-to-specification gap. The inclusion of very recent models (GPT-5, Claude Opus 4.1, Qwen3) ensures the evaluation reflects the current state of the art.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Missing comparisons: The paper does not compare against Specula or any retrieval-augmented approach, nor does it attempt even basic post-processing corrections to establish how much the gap could be closed with minimal effort.
Overall Assessment
This is a competent empirical study that establishes a useful baseline for an underexplored problem. Its primary value is diagnostic: it quantifies how badly LLMs fail at TLA+ generation and characterizes the failure modes. However, the small test set, single-run design, lack of statistical rigor in claims, and absence of constructive solutions limit its impact. The paper would be substantially stronger if it had implemented even one mitigation strategy (e.g., post-processing + re-validation) to demonstrate how much of the gap is easily closable. As it stands, it is a solid first step that opens a research direction rather than advancing it significantly.
Generated Jun 5, 2026
Comparison History (19)
Paper 1 introduces a novel, efficiency-aware reinforcement learning approach that significantly improves automated software engineering agents. Its methodology directly tackles a major bottleneck, achieving SOTA performance with massive efficiency gains. In contrast, Paper 2 is primarily an observational benchmarking study in a relatively niche domain (TLA+), and while valuable, it does not propose a new method to solve the identified limitations. Paper 1 offers broader applicability and immediate practical impact in the rapidly growing field of LLM agents.
Paper 1 introduces a novel algorithmic improvement (TAPO) addressing a fundamental flaw (credit misassignment) in highly relevant RL algorithms (like GRPO) for multimodal agents. Its plug-and-play nature and negligible overhead suggest strong adoption potential and broad impact in the rapidly growing field of agentic AI. While Paper 2 provides a valuable and rigorous evaluation of LLMs for formal verification, it acts primarily as a benchmark study for a specialized domain (TLA+), which generally garners less widespread scientific impact and follow-up methodology than foundational algorithmic advances.
Paper 2 addresses a critical bottleneck in industrial software verification by systematically evaluating LLMs on formal TLA+ specifications. Its methodological rigor is exceptionally high, testing 30 models across 2,700+ runs with formal model checker validation. While Paper 1 provides a useful benchmark for multi-agent LLMs, Paper 2 offers profound real-world implications for system reliability and software engineering, alongside deeper insights into model reasoning capabilities and negative transfer. The intersection of formal methods and LLMs gives Paper 2 a broader and more significant potential scientific impact.
Paper 1 demonstrates higher potential scientific impact due to its broad relevance and critical methodological insights into Retrieval-Augmented Generation (RAG). By exposing that RAG rewriting gains are largely driven by answer leakage rather than genuine reasoning improvements, it fundamentally challenges how the field evaluates QA systems. Paper 2, while rigorous and valuable for formal verification, focuses on a much narrower domain (TLA+) and yields the relatively expected result that LLMs struggle with niche formal languages. Paper 1's findings will likely force a widespread correction in mainstream NLP evaluation practices.
AICompanionBench addresses a rapidly growing societal concern (AI companion safety) with broad relevance across AI safety, policy, and HCI communities. It introduces the first public benchmark for a timely problem affecting millions of users, evaluates 20 LLMs comprehensively, and has immediate real-world applications for platform safety monitoring. Paper 2, while rigorous, addresses a narrower domain (TLA+ specification generation) with a smaller potential audience. The AI safety topic has broader interdisciplinary impact and higher urgency given the rapid deployment of AI companion systems.
Paper 1 proposes a novel, foundational framework for federated learning using typed artifacts, enabling collaboration across heterogeneous, frozen LLMs without sharing weights or data. This methodological breakthrough addresses significant privacy and architectural constraints in decentralized AI. Paper 2, while providing a valuable empirical benchmark for LLMs on TLA+ generation, is primarily an evaluation study. The theoretical and practical innovations in Paper 1 offer broader, longer-lasting impact across the fields of federated learning, privacy, and distributed AI.
HyperLoRA addresses fundamental limitations in federated learning of foundation models—a rapidly growing field with broad applications. It introduces novel architectural innovations (hypernetwork-driven LoRA generation, product space aggregation) that could influence future federated learning research broadly. Paper 2, while valuable as an empirical evaluation of LLMs for TLA+ generation, is more niche in scope—primarily a benchmarking study for a specific formal language with relatively narrow impact. Paper 1's methodological contributions are more generalizable and timely given the explosion of foundation model adaptation research.
Paper 1 addresses the critical, broadly applicable challenge of LLM agent memory systems, offering a taxonomy, profiling harness, and system-level recommendations. This work has wide-reaching implications for scaling AI agents across numerous domains. Paper 2, while methodologically rigorous, focuses on the highly niche area of TLA+ specification generation, inherently limiting its broader scientific impact compared to the foundational systems research presented in Paper 1.
Paper 2 introduces a novel and broadly applicable method (ReLAT) that addresses a fundamental limitation of latent reasoning in LLMs—the lack of fidelity checks on intermediate representations. The self-supervised test-time reconstruction loop is a creative, principled contribution with strong empirical gains (16.6-point improvement on AIME 2024) across multiple benchmarks. Its impact spans mathematical reasoning, QA, and code generation. Paper 1, while valuable as a systematic evaluation of LLM-based TLA+ generation, is more narrowly scoped as a benchmarking study for a niche formal language, offering diagnostic insights but no new methodology to improve performance.
Paper 1 provides a highly rigorous, first-of-its-kind evaluation bridging formal verification and LLMs. Its use of formal model checkers guarantees objective evaluation, and its findings on negative transfer and reasoning alignment offer fundamental insights for AI development, likely impacting software engineering and AI safety more broadly than Paper 2's simulated application.
Paper 2 has higher potential impact due to its novel, timely proposal for zero-knowledge verification of frontier AI training—an enabling primitive for governance, compliance, and security with broad cross-field relevance (cryptography, distributed systems, ML infrastructure, policy). If realized, it could materially change how AI training is audited and regulated, with major real-world applications. Paper 1 is methodologically rigorous and valuable for the formal methods/LLM community, but its impact is narrower (evaluation/benchmarking) and largely diagnostic rather than enabling a new capability.
Paper 1 offers high methodological rigor, presenting the first systematic, empirical benchmark of LLMs for formal verification (TLA+). Its release of an evaluation framework, dataset, and surprising findings on model scaling provide immediate, actionable value to both AI and software engineering communities. In contrast, Paper 2 proposes a theoretical architecture for conversational AGI but lacks empirical validation. Paper 1's concrete contributions, high real-world applicability in industrial software verification, and robust experimental design give it a much higher potential for near-term scientific impact and reproducibility.
Paper 2 introduces the first systematic evaluation and dataset for LLM-based TLA+ generation, addressing a critical bottleneck in formal verification. Its rigorous methodology, utilizing formal checkers to evaluate semantic correctness, uncovers profound insights about LLM reasoning and negative transfer. By releasing its framework, it establishes a novel benchmark likely to spur significant future research, whereas Paper 1 offers a more incremental methodological improvement using standard techniques like curriculum learning and ensembling.
Paper 1 offers the first systematic evaluation of LLMs for formal specification (TLA+), a critical area for software verification. It provides rigorous methodology (30 LLMs, extensive runs, model checker validation) and yields surprising insights regarding model size, reasoning alignment, and negative transfer from code specialization. These fundamental insights into LLM capabilities and failure modes will likely impact both AI research and formal methods more broadly than Paper 2's domain-specific scientific visualization skills.
Paper 1 presents a highly rigorous, large-scale empirical study with 30 models and over 2,700 runs, introducing the first systematic evaluation benchmark for TLA+ generation. It offers non-obvious, actionable insights (e.g., model size inverse scaling, negative transfer in code models) and releases an open dataset. While Paper 2 offers a broad conceptual framework for evaluating AI agents, its abstract lacks the robust empirical validation demonstrated in Paper 1, making Paper 1's immediate scientific contribution and reproducibility more impactful.
Paper 2 has higher likely impact: it is the first systematic, tool-validated evaluation of NL-to-TLA+ synthesis across 30 LLMs with a released dataset/framework, enabling reproducible benchmarking and follow-on research in formal methods, program synthesis, and LLM reliability. Its negative results and identified hallucination modes are broadly actionable for both academia and industry verification workflows. Paper 1 is a strong, practical decoding innovation for RAG faithfulness, but it is more incremental within an active subarea and is less likely to reshape evaluation standards or cross-field practice than a foundational benchmark study in formal specification generation.
Paper 2 likely has higher scientific impact: it introduces a new RL post-training framework for a timely, high-demand capability (multi-turn image editing), plus a new benchmark (MICE-Bench) that can catalyze broader progress and standardized evaluation. The methodological contribution (joint optimization over discrete reasoning and continuous generation with trajectory filtering) is substantive and broadly relevant to multimodal agents and long-horizon instruction following. Paper 1 is valuable and rigorous but primarily diagnostic/evaluative within a narrower formal-methods niche, with more limited immediate cross-field adoption.
PLAN-S addresses a critical challenge in autonomous driving—bridging latent world models with controllable planning—demonstrating significant quantitative improvements (42% collision rate reduction) on established benchmarks. Its contributions span safety-critical real-world applications with broad industry relevance. Paper 2, while providing a useful empirical evaluation of LLMs for TLA+ specification generation, is more of a benchmarking/evaluation study with narrower scope and limited novelty beyond documenting current LLM limitations in a specific formal language domain.
Paper 2 presents a large-scale, systematic evaluation of LLMs for formal verification (TLA+), testing 30 models on a curated dataset and yielding deep insights into negative transfer, reasoning alignment, and hallucinations. Its release of the dataset and framework ensures high reproducibility and future utility. Paper 1, while relevant for clinical AI, relies on a small sample size (10 questions) in a niche domain and provides more predictable 'LLM vs human' evaluation results, limiting its broader scientific impact compared to Paper 2.