Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

Leizhen Zhang, Shuhan Chen, Sheng Chen

cs.AI(primary)cs.CLcs.LO
#924 of 2682 · Artificial Intelligence
Share
Tournament Score
1443±48
10501800
64%
Win Rate
9
Wins
5
Losses
14
Matches
Rating
5.8/ 10
Significance
Rigor
Novelty
Clarity

Abstract

Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

1. Core Contribution

This paper makes two primary contributions: (1) a systematic empirical evaluation of 12 LLMs on Boolean satisfiability (2-SAT and 3-SAT), and (2) a new evaluation methodology based on paired formulas and a metric called Accurate Differentiation Rate (ADR). The paired-formula protocol generates minimally different SAT/UNSAT instance pairs (differing by a single literal flip or clause edit), and ADR requires correct classification of *both* members of each pair. The paper also tests cross-representation consistency by reducing SAT instances to Vertex Cover and 3D packing problems.

The key finding is that conventional metrics (accuracy, precision, recall, F1, MCC) are misleading for evaluating LLM reasoning on SAT due to class imbalance across clause-density regimes and systematic SAT-prediction bias. ADR exposes these deficiencies by requiring joint correctness on paired instances, revealing that most LLMs' reasoning collapses beyond N≥25 variables for 3-SAT.

2. Methodological Rigor

Strengths in methodology:

  • The experimental design is thoughtful. Using the SAT phase transition as ground truth for reasoning behavior is well-motivated, and the observation that LLMs fail to reproduce the easy-hard-easy pattern is telling.
  • The paired-formula construction is principled: single-edit perturbations that flip satisfiability while keeping structure nearly identical eliminate superficial cues effectively.
  • Cross-representation experiments (CNF→Vertex Cover, CNF→3D Packing) provide a novel angle on representation invariance.
  • Mathematical properties of ADR are formalized with clear bounds relating it to per-class recalls and accuracy.
  • Weaknesses:

  • Sample sizes are modest (70 pairs per N setting, 320 instances per α). While the authors acknowledge budget constraints, statistical power for fine-grained model comparisons is limited.
  • The paired-formula generation procedure (flip → replace → delete) introduces a subtle bias: the SAT partner is always derived from an UNSAT formula, and the structural edits may not uniformly sample the space of minimally different pairs. The edit statistics are deferred to supplemental material.
  • GPT-5 results stop at N=25 due to timeouts, with only a single supplementary data point at N=50. This limits conclusions about the best-performing model.
  • The 3D packing and Vertex Cover experiments are described only summarily in the main text, with details deferred to supplements, making independent assessment difficult.
  • The paper does not control for prompt sensitivity—a single prompt template is used throughout, and variations in prompting strategy (chain-of-thought variants, few-shot examples) could substantially affect results.
  • 3. Potential Impact

    Evaluation methodology: ADR's requirement for joint correctness on paired instances is a genuinely useful idea that extends beyond SAT. The paper correctly identifies that paired evaluation eliminates class-prior artifacts, a pervasive problem in LLM benchmarking. This approach could influence evaluation practices in program analysis, formal verification, and constraint satisfaction communities.

    Practical insights: The finding that LLMs exhibit strong SAT bias and cannot reliably detect UNSAT formulas has direct implications for practitioners using LLMs for constraint-heavy tasks (scheduling, configuration, verification). The paper's argument that SAT performance upper-bounds performance on reducible tasks is intuitive, though the empirical evidence (>80% agreement across representations) demonstrates consistency rather than strict upper-bounding.

    Limited algorithmic impact: The paper does not propose methods to *improve* LLM reasoning on SAT, limiting its constructive contribution. The discussion of LLM-solver integration is speculative.

    4. Timeliness & Relevance

    The paper is highly timely. LLMs are increasingly deployed for reasoning-intensive tasks, and rigorous evaluation of their logical reasoning capabilities is a current bottleneck. The community needs principled benchmarks and metrics that go beyond standard classification scores. The paper directly addresses this need in a domain (SAT) with well-understood computational complexity and mature solver baselines, providing clean ground truth. The inclusion of recent models (GPT-5, Claude Opus 4, DeepSeek-reasoner) ensures currency.

    5. Strengths & Limitations

    Key Strengths:

  • ADR is simple, well-motivated, and mathematically characterized. Its relationship to accuracy and per-class recalls is clearly derived.
  • The phase-transition analysis provides a compelling computational-theory lens for evaluating LLM behavior, grounding the evaluation in established theory.
  • Cross-representation experiments are novel—testing whether LLM decisions are stable under polynomial reductions is an original contribution.
  • The paper covers 12 models spanning multiple families (OpenAI, Anthropic, DeepSeek), enabling meaningful cross-model comparisons.
  • Artifacts are publicly available.
  • Notable Weaknesses:

  • The "conservative probe" hypothesis—that SAT performance upper-bounds performance on SAT-reducible tasks—is stated as a central thesis but only weakly supported. The cross-representation experiments show consistency (>80% agreement), not that SAT performance strictly dominates. When predictions disagree, SAT is more often correct for VC but equally likely for 3D packing, partially undermining the upper-bound claim.
  • The paper is lengthy and somewhat repetitive. The same narrative (traditional metrics mislead, ADR is better) is restated across 3-SAT, 2-SAT, VC, and packing sections without substantially new insights each time.
  • ADR's limitation is acknowledged but underexplored: it requires paired instances with verified labels, which limits applicability to domains without efficient verifiers or natural perturbation operators.
  • The paper does not compare against existing LLM reasoning benchmarks (e.g., GSM8K, MATH, ARC) to contextualize SAT-solving difficulty relative to other reasoning tasks.
  • No analysis of *why* LLMs fail—attention patterns, token-level analysis, or error categorization—is provided. The paper documents failure without diagnosing mechanisms.
  • Additional Observations

    The ADR metric, while useful, is conceptually straightforward—it is essentially pair-level accuracy. Its main value lies in the *paired instance construction* rather than the metric formula itself. The paper could have been stronger by exploring whether ADR correlates with downstream task performance in a more controlled pipeline setting, rather than relying primarily on the theoretical argument.

    The paper's venue (FSE 2026) is appropriate given the software engineering applications discussed, though the contribution is primarily empirical evaluation methodology rather than a new tool or technique.

    Rating:5.8/ 10
    Significance 6Rigor 5.5Novelty 6Clarity 6

    Generated May 28, 2026

    Comparison History (14)

    vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
    gpt-5.25/28/2026

    Paper 2 offers a broadly reusable evaluation methodology (matched-pair protocol + ADR) that directly addresses a known pitfall in LLM reasoning assessment (metric gaming via satisfiable bias) and connects to foundational complexity phenomena (phase transition, scaling). Its representation-invariant tests via reductions (to Vertex Cover and 3D packing) increase rigor and generality, making it applicable across many reasoning benchmarks beyond SAT. Paper 1 is a strong, timely multilingual diagnostic for spatial world modeling, but its impact is more domain-specific and less methodologically generalizable than Paper 2’s evaluation framework.

    vs. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
    gpt-5.25/28/2026

    Paper 2 likely has higher scientific impact due to a clearer, broadly applicable contribution (cost–accuracy routing for LLM-as-a-judge) with direct real-world deployment relevance and a principled optimization framework (distributionally robust formulation, efficient algorithm, theoretical guarantees). Its focus on budgeted evaluation under distribution shift is timely for widespread LLM evaluation pipelines across domains. Paper 1 offers a useful diagnostic metric (paired protocol/ADR) for reasoning assessment on SAT, but its primary impact is narrower (benchmarking/evaluation methodology for SAT-like reasoning) and less directly actionable for production systems.

    vs. CubePart: An Open-Vocabulary Part-Controllable 3D Generator
    gemini-3.15/28/2026

    Paper 1 addresses a foundational question in AI—whether LLMs genuinely reason—by rigorously exposing flaws in standard benchmarks and proposing a robust paired-evaluation protocol using SAT. Its insights into LLM reasoning limitations and evaluation methodologies have broader, more fundamental scientific implications across the AI community than Paper 2, which offers a highly practical but domain-specific advance in 3D asset generation.

    vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
    claude-opus-4.65/28/2026

    Harness-Bench addresses a critical and timely gap in LLM agent evaluation—isolating the effect of the execution harness from the model itself—which has broad practical implications for the rapidly growing agent deployment ecosystem. It introduces a comprehensive benchmark with 5,194 trajectories and identifies a novel concept (execution-alignment failures) relevant across many agent applications. Paper 2 makes a solid methodological contribution (ADR metric, paired-formula protocol) for evaluating LLM reasoning on SAT, but targets a narrower, more theoretical question with less immediate breadth of impact across the field.

    vs. MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
    claude-opus-4.65/28/2026

    Paper 1 introduces a novel evaluation methodology (paired-formula protocol and ADR metric) for assessing LLM reasoning on a fundamental computational problem (SAT), with broad implications across AI evaluation. Its findings about misleading conventional metrics and cross-representation consistency are relevant to the entire LLM evaluation community. Paper 2, while solving a useful domain-specific problem (chemical reaction diagram parsing) with solid results, has narrower impact limited to cheminformatics. Paper 1's methodological contribution is more generalizable and addresses a timelier question about LLM reasoning capabilities.

    vs. Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
    gemini-3.15/28/2026

    Paper 1 addresses a critical bottleneck (rollout diversity) in Reinforcement Learning with Verifiable Rewards (RLVR), a highly impactful and rapidly growing area for training reasoning LLMs. By introducing a simple, low-cost intervention (first-token diversification) that yields consistent performance gains over state-of-the-art baselines like GRPO, it offers immediate and broad practical utility for AI development. While Paper 2 provides valuable insights into LLM evaluation, Paper 1's algorithmic contribution directly advances the capability to train stronger reasoning models, likely resulting in higher immediate scientific and applied impact.

    vs. PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management
    gpt-5.25/28/2026

    Paper 2 has higher likely scientific impact due to broader, more foundational relevance: it proposes a generally applicable matched-pair evaluation protocol (ADR) for probing LLM reasoning, validated across SAT and representation-changing reductions (Vertex Cover, 3D packing). This methodological contribution can transfer to many reasoning benchmarks and evaluation settings beyond a single domain. Paper 1 is timely and valuable for finance, but its impact is narrower (portfolio management benchmarking) and more application-specific, whereas Paper 2 targets core evaluation methodology for LLM reasoning with potential cross-field uptake.

    vs. Retrying vs Resampling in AI Control
    gemini-3.15/28/2026

    Paper 2 offers broader scientific impact by addressing a fundamental challenge in AI: accurately evaluating LLM reasoning. By introducing the paired-formula protocol and ADR metric, it exposes severe flaws in standard evaluation metrics that falsely reward heuristic guessing. Because Boolean satisfiability is foundational, benchmarking LLMs on representation-invariant SAT problems provides a highly rigorous, universally applicable tool for the AI community. In contrast, while Paper 1 tackles an important alignment issue, its scope is narrower, focusing on specific engineering protocols (retry vs. resample) within specific empirical safety settings.

    vs. A Fixed-Budget, Cluster-Aware Standard for LLM-as-a-Judge Evaluation: A Multi-Hop RAG Stress Test
    claude-opus-4.65/28/2026

    Paper 1 introduces a novel evaluation methodology (paired-formula protocol and ADR metric) for probing LLM reasoning on a fundamental computational problem (SAT), with cross-representation consistency analysis. It addresses a broadly relevant question—whether LLMs truly reason or rely on heuristics—applicable across AI/ML. Paper 2 makes a valuable but narrower methodological contribution to RAG evaluation standards. While rigorous, its impact is confined to the RAG evaluation niche. Paper 1's connection to computational complexity theory and its generalizable evaluation framework give it broader and deeper potential impact.

    vs. When Mean CE Fails: Median CE Can Better Track Language Model Quality
    gpt-5.25/28/2026

    Paper 2 likely has higher impact due to broader cross-field relevance (logic, complexity theory, automated reasoning, evaluation of LLM reasoning), clearer real-world applicability (reliable assessment of models used in planning/verification-like tasks), and a methodological contribution (matched-pair protocol + ADR) that addresses known pitfalls in SAT evaluation and extends across representations (CNF, Vertex Cover, 3D packing). Paper 1 offers a useful diagnostic for LM training/selection, but its scope is narrower and its metric refinement (median/percentile CE) is less broadly transformative.

    vs. Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs
    gemini-3.15/28/2026

    Paper 1 addresses a fundamental question regarding the true reasoning capabilities of LLMs versus heuristic pattern matching. By introducing a matched-pair evaluation protocol and the ADR metric, it exposes flaws in conventional evaluation methods and offers a rigorous standard for future foundational AI research. While Paper 2 presents a valuable and highly practical multi-agent framework for supply chains, Paper 1's theoretical contributions to understanding and evaluating core LLM reasoning mechanisms give it broader potential scientific impact across the entire artificial intelligence community.

    vs. Verifiable Benchmarking of Long-Horizon Spatial Biology
    gemini-3.15/28/2026

    Paper 1 addresses a major frontier in AI—autonomous scientific discovery and reasoning in a highly complex, real-world domain (spatial biology). Its end-to-end evaluation paradigm has massive potential to accelerate biomedical research. While Paper 2 offers valuable methodological insights into LLM reasoning on SAT problems, Paper 1's benchmark directly bridges AI and translational medicine, offering broader interdisciplinary impact and real-world applicability.

    vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
    gpt-5.25/28/2026

    Paper 1 has higher impact potential due to a novel, actionable method (adaptive multi-granularity compression + hindsight replay for RLVR) that directly improves long-horizon agent performance and token efficiency—highly timely for scalable LLM agents. It includes theoretical justification, integration into existing RL pipelines (GRPO), broad validation across models/tasks, and strong reported gains plus extrapolation robustness, enabling immediate real-world deployment. Paper 2 offers a rigorous, valuable evaluation protocol (paired instances/ADR) with cross-representation checks, but is primarily diagnostic rather than enabling new capabilities, so its downstream practical impact is likely narrower.

    vs. BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting
    gemini-3.15/28/2026

    While Paper 2 presents a valuable application of Transformers to battery degradation with clear environmental benefits, Paper 1 addresses a fundamental and pervasive issue in modern AI: accurately evaluating the reasoning capabilities of LLMs. By exposing flaws in standard metrics and introducing a robust paired-evaluation protocol for SAT problems, Paper 1 has the potential to broadly influence how researchers benchmark and understand reasoning across the entire field of artificial intelligence.