Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

#2322 of 3355 · Artificial Intelligence
Share
Tournament Score
1355±47
10501800
42%
Win Rate
8
Wins
11
Losses
19
Matches
Rating
5.5/ 10
Significance
Rigor
Novelty
Clarity

Abstract

TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

1. Core Contribution

This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language descriptions. The core contribution is an empirical benchmark: 30 LLMs across eight families, evaluated on 205 curated TLA+ specifications using four prompting strategies, with outputs validated through both the SANY parser (syntactic correctness) and TLC model checker (semantic correctness). The key finding is a stark syntax-semantics gap: up to 26.6% syntactic correctness but only 8.6% semantic correctness, with semantic success occurring exclusively under progressive prompting. The paper also contributes a taxonomy of five hallucination categories specific to formal specification generation.

The problem addressed—automating formal specification from natural language—is genuinely important. TLA+ adoption is bottlenecked by expertise requirements, and understanding where LLMs fail is a prerequisite for building better tools. However, the contribution is primarily diagnostic rather than constructive; the paper identifies problems without proposing solutions beyond directional suggestions.

2. Methodological Rigor

The experimental design is reasonably thorough in breadth (30 models, 4 strategies, 2,730+ runs) but has notable limitations in depth:

Strengths: The two-stage validation pipeline (SANY → TLC) provides a principled distinction between syntactic and semantic correctness. Using an established community repository (TLA+ Examples) as the data source adds credibility. The stratified train/val/test split and consistent hyperparameter settings across models support fair comparison.

Weaknesses: Each model-specification pair is run only once per strategy, meaning stochastic variation is uncontrolled. The authors acknowledge this but argue scale compensates—a debatable claim since per-model conclusions (e.g., "DeepSeek r1:8b outperforms r1:70b") rest on single runs per specification. The test set of 26 specifications is small, reducing statistical power for per-specification conclusions. The proprietary models are tested under only one prompting strategy (few-shot), making cross-strategy comparisons impossible for the strongest models. The paper does not report confidence intervals or significance tests for any of its headline claims.

The textual similarity metrics (BLEU, ROUGE-L) are applied only to completion strategies and are acknowledged to be poor proxies for specification quality—their inclusion adds bulk but limited insight.

3. Potential Impact

Practical relevance: The finding that current LLMs cannot reliably generate TLA+ specifications is useful for practitioners who might otherwise trust LLM output without verification. The hallucination taxonomy provides actionable insights: Unicode normalization, semicolon stripping, and reasoning-token removal are concrete, implementable post-processing steps.

Research impact: The benchmark dataset and evaluation framework could serve as a foundation for future work, though the dataset is modest (205 specifications, only 26 in the test set). The observation that smaller reasoning-aligned models outperform larger ones and code-specialized models on TLA+ is interesting but not deeply surprising given TLA+'s tiny training corpus.

Broader influence: The paper's implications extend modestly to other low-resource formal languages (Alloy, Event-B, Promela), though the authors do not explore this generalization. The hallucination categories (Unicode substitution, cross-language injection) likely apply to any formal language underrepresented in training data.

4. Timeliness & Relevance

The paper is timely. The TLA+ Foundation launched its GenAI Challenge in 2025, and industrial adoption of formal methods is growing. The intersection of LLMs and formal verification is an active research frontier. Prior work (Specula, SysMoBench) focuses on code-to-specification or specification-to-code translation; this paper fills the natural-language-to-specification gap. The inclusion of very recent models (GPT-5, Claude Opus 4.1, Qwen3) ensures the evaluation reflects the current state of the art.

5. Strengths & Limitations

Key Strengths:

  • First systematic study of its specific problem formulation (NL → TLA+ with formal validation)
  • Comprehensive model coverage across families, scales, and specializations
  • The syntax-semantics gap finding is well-documented and important
  • Hallucination taxonomy is concrete, reproducible, and traceable to training data biases
  • Open release of code, dataset, and framework supports reproducibility
  • Notable Weaknesses:

  • The 26-specification test set is quite small; results could shift substantially with different test samples
  • Single runs per configuration mean all quantitative claims carry unquantified uncertainty
  • No constructive contribution: the paper identifies problems but does not implement any of the suggested mitigations (grammar-constrained decoding, iterative TLC feedback, post-processing)
  • The proprietary model evaluation is incomplete (one strategy only), yet GPT-5's 26.9% TLC result is arguably the most practically relevant finding
  • The claim that "model size does not predict quality" may be confounded by model family, training data, and distillation procedures—the paper does not control for these factors
  • Some findings (code models underperform on formal specs, larger models aren't always better) echo well-known observations in the LLM literature without sufficient novelty
  • Missing comparisons: The paper does not compare against Specula or any retrieval-augmented approach, nor does it attempt even basic post-processing corrections to establish how much the gap could be closed with minimal effort.

    Overall Assessment

    This is a competent empirical study that establishes a useful baseline for an underexplored problem. Its primary value is diagnostic: it quantifies how badly LLMs fail at TLA+ generation and characterizes the failure modes. However, the small test set, single-run design, lack of statistical rigor in claims, and absence of constructive solutions limit its impact. The paper would be substantially stronger if it had implemented even one mitigation strategy (e.g., post-processing + re-validation) to demonstrate how much of the gap is easily closable. As it stands, it is a solid first step that opens a research direction rather than advancing it significantly.

    Rating:5.5/ 10
    Significance 6Rigor 5Novelty 5.5Clarity 7

    Generated Jun 5, 2026

    Comparison History (19)

    vs. Learning Adaptive Parallel Execution for Efficient Code Localization
    gemini-3.16/6/2026

    Paper 1 introduces a novel, efficiency-aware reinforcement learning approach that significantly improves automated software engineering agents. Its methodology directly tackles a major bottleneck, achieving SOTA performance with massive efficiency gains. In contrast, Paper 2 is primarily an observational benchmarking study in a relatively niche domain (TLA+), and while valuable, it does not propose a new method to solve the identified limitations. Paper 1 offers broader applicability and immediate practical impact in the rapidly growing field of LLM agents.

    vs. TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents
    gemini-3.16/6/2026

    Paper 1 introduces a novel algorithmic improvement (TAPO) addressing a fundamental flaw (credit misassignment) in highly relevant RL algorithms (like GRPO) for multimodal agents. Its plug-and-play nature and negligible overhead suggest strong adoption potential and broad impact in the rapidly growing field of agentic AI. While Paper 2 provides a valuable and rigorous evaluation of LLMs for formal verification, it acts primarily as a benchmark study for a specialized domain (TLA+), which generally garners less widespread scientific impact and follow-up methodology than foundational algorithmic advances.

    vs. SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
    gemini-3.16/6/2026

    Paper 2 addresses a critical bottleneck in industrial software verification by systematically evaluating LLMs on formal TLA+ specifications. Its methodological rigor is exceptionally high, testing 30 models across 2,700+ runs with formal model checker validation. While Paper 1 provides a useful benchmark for multi-agent LLMs, Paper 2 offers profound real-world implications for system reliability and software engineering, alongside deeper insights into model reasoning capabilities and negative transfer. The intersection of formal methods and LLMs gives Paper 2 a broader and more significant potential scientific impact.

    vs. Answer Presence Drives RAG Rewriting Gains
    gemini-3.16/5/2026

    Paper 1 demonstrates higher potential scientific impact due to its broad relevance and critical methodological insights into Retrieval-Augmented Generation (RAG). By exposing that RAG rewriting gains are largely driven by answer leakage rather than genuine reasoning improvements, it fundamentally challenges how the field evaluates QA systems. Paper 2, while rigorous and valuable for formal verification, focuses on a much narrower domain (TLA+) and yields the relatively expected result that LLMs struggle with niche formal languages. Paper 1's findings will likely force a widespread correction in mainstream NLP evaluation practices.

    vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
    claude-opus-4.66/5/2026

    AICompanionBench addresses a rapidly growing societal concern (AI companion safety) with broad relevance across AI safety, policy, and HCI communities. It introduces the first public benchmark for a timely problem affecting millions of users, evaluates 20 LLMs comprehensively, and has immediate real-world applications for platform safety monitoring. Paper 2, while rigorous, addresses a narrower domain (TLA+ specification generation) with a smaller potential audience. The AI safety topic has broader interdisciplinary impact and higher urgency given the rapid deployment of AI companion systems.

    vs. Synapse: Federated Tool Routing via Typed Compendium Artifacts
    gemini-3.16/5/2026

    Paper 1 proposes a novel, foundational framework for federated learning using typed artifacts, enabling collaboration across heterogeneous, frozen LLMs without sharing weights or data. This methodological breakthrough addresses significant privacy and architectural constraints in decentralized AI. Paper 2, while providing a valuable empirical benchmark for LLMs on TLA+ generation, is primarily an evaluation study. The theoretical and practical innovations in Paper 1 offer broader, longer-lasting impact across the fields of federated learning, privacy, and distributed AI.

    vs. Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models
    claude-opus-4.66/5/2026

    HyperLoRA addresses fundamental limitations in federated learning of foundation models—a rapidly growing field with broad applications. It introduces novel architectural innovations (hypernetwork-driven LoRA generation, product space aggregation) that could influence future federated learning research broadly. Paper 2, while valuable as an empirical evaluation of LLMs for TLA+ generation, is more niche in scope—primarily a benchmarking study for a specific formal language with relatively narrow impact. Paper 1's methodological contributions are more generalizable and timely given the explosion of foundation model adaptation research.

    vs. Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
    gemini-3.16/5/2026

    Paper 1 addresses the critical, broadly applicable challenge of LLM agent memory systems, offering a taxonomy, profiling harness, and system-level recommendations. This work has wide-reaching implications for scaling AI agents across numerous domains. Paper 2, while methodologically rigorous, focuses on the highly niche area of TLA+ specification generation, inherently limiting its broader scientific impact compared to the foundational systems research presented in Paper 1.

    vs. Closing the Loop on Latent Reasoning via Test-Time Reconstruction
    claude-opus-4.66/5/2026

    Paper 2 introduces a novel and broadly applicable method (ReLAT) that addresses a fundamental limitation of latent reasoning in LLMs—the lack of fidelity checks on intermediate representations. The self-supervised test-time reconstruction loop is a creative, principled contribution with strong empirical gains (16.6-point improvement on AIME 2024) across multiple benchmarks. Its impact spans mathematical reasoning, QA, and code generation. Paper 1, while valuable as a systematic evaluation of LLM-based TLA+ generation, is more narrowly scoped as a benchmarking study for a niche formal language, offering diagnostic insights but no new methodology to improve performance.

    vs. An Infectious Disease Spread Simulation Based on Large Language Model Decision Making
    gemini-3.16/5/2026

    Paper 1 provides a highly rigorous, first-of-its-kind evaluation bridging formal verification and LLMs. Its use of formal model checkers guarantees objective evaluation, and its findings on negative transfer and reasoning alignment offer fundamental insights for AI development, likely impacting software engineering and AI safety more broadly than Paper 2's simulated application.

    vs. Zero knowledge verification for frontier AI training is possible
    gpt-5.26/5/2026

    Paper 2 has higher potential impact due to its novel, timely proposal for zero-knowledge verification of frontier AI training—an enabling primitive for governance, compliance, and security with broad cross-field relevance (cryptography, distributed systems, ML infrastructure, policy). If realized, it could materially change how AI training is audited and regulated, with major real-world applications. Paper 1 is methodologically rigorous and valuable for the formal methods/LLM community, but its impact is narrower (evaluation/benchmarking) and largely diagnostic rather than enabling a new capability.

    vs. A Motivational Architecture for Conversational AGI
    gemini-3.16/5/2026

    Paper 1 offers high methodological rigor, presenting the first systematic, empirical benchmark of LLMs for formal verification (TLA+). Its release of an evaluation framework, dataset, and surprising findings on model scaling provide immediate, actionable value to both AI and software engineering communities. In contrast, Paper 2 proposes a theoretical architecture for conversational AGI but lacks empirical validation. Paper 1's concrete contributions, high real-world applicability in industrial software verification, and robust experimental design give it a much higher potential for near-term scientific impact and reproducibility.

    vs. Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation
    gemini-3.16/5/2026

    Paper 2 introduces the first systematic evaluation and dataset for LLM-based TLA+ generation, addressing a critical bottleneck in formal verification. Its rigorous methodology, utilizing formal checkers to evaluate semantic correctness, uncovers profound insights about LLM reasoning and negative transfer. By releasing its framework, it establishes a novel benchmark likely to spur significant future research, whereas Paper 1 offers a more incremental methodological improvement using standard techniques like curriculum learning and ensembling.

    vs. SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
    gemini-3.16/5/2026

    Paper 1 offers the first systematic evaluation of LLMs for formal specification (TLA+), a critical area for software verification. It provides rigorous methodology (30 LLMs, extensive runs, model checker validation) and yields surprising insights regarding model size, reasoning alignment, and negative transfer from code specialization. These fundamental insights into LLM capabilities and failure modes will likely impact both AI research and formal methods more broadly than Paper 2's domain-specific scientific visualization skills.

    vs. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns
    gemini-3.16/5/2026

    Paper 1 presents a highly rigorous, large-scale empirical study with 30 models and over 2,700 runs, introducing the first systematic evaluation benchmark for TLA+ generation. It offers non-obvious, actionable insights (e.g., model size inverse scaling, negative transfer in code models) and releases an open dataset. While Paper 2 offers a broad conceptual framework for evaluating AI agents, its abstract lacks the robust empirical validation demonstrated in Paper 1, making Paper 1's immediate scientific contribution and reproducibility more impactful.

    vs. FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG
    gpt-5.26/5/2026

    Paper 2 has higher likely impact: it is the first systematic, tool-validated evaluation of NL-to-TLA+ synthesis across 30 LLMs with a released dataset/framework, enabling reproducible benchmarking and follow-on research in formal methods, program synthesis, and LLM reliability. Its negative results and identified hallucination modes are broadly actionable for both academia and industry verification workflows. Paper 1 is a strong, practical decoding innovation for RAG faithfulness, but it is more incremental within an active subarea and is less likely to reshape evaluation standards or cross-field practice than a foundational benchmark study in formal specification generation.

    vs. Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing
    gpt-5.26/5/2026

    Paper 2 likely has higher scientific impact: it introduces a new RL post-training framework for a timely, high-demand capability (multi-turn image editing), plus a new benchmark (MICE-Bench) that can catalyze broader progress and standardized evaluation. The methodological contribution (joint optimization over discrete reasoning and continuous generation with trajectory filtering) is substantive and broadly relevant to multimodal agents and long-horizon instruction following. Paper 1 is valuable and rigorous but primarily diagnostic/evaluative within a narrower formal-methods niche, with more limited immediate cross-field adoption.

    vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
    claude-opus-4.66/5/2026

    PLAN-S addresses a critical challenge in autonomous driving—bridging latent world models with controllable planning—demonstrating significant quantitative improvements (42% collision rate reduction) on established benchmarks. Its contributions span safety-critical real-world applications with broad industry relevance. Paper 2, while providing a useful empirical evaluation of LLMs for TLA+ specification generation, is more of a benchmarking/evaluation study with narrower scope and limited novelty beyond documenting current LLM limitations in a specific formal language domain.

    vs. Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison
    gemini-3.16/5/2026

    Paper 2 presents a large-scale, systematic evaluation of LLMs for formal verification (TLA+), testing 30 models on a curated dataset and yielding deep insights into negative transfer, reasoning alignment, and hallucinations. Its release of the dataset and framework ensures high reproducibility and future utility. Paper 1, while relevant for clinical AI, relies on a small sample size (10 questions) in a niche domain and provides more predictable 'LLM vs human' evaluation results, limiting its broader scientific impact compared to Paper 2.