Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability
Leizhen Zhang, Shuhan Chen, Sheng Chen
Abstract
Large language models (LLMs) are increasingly used for tasks that implicitly reduce to Boolean satisfiability (SAT), yet their reasoning ability on SAT remains unclear. We present a systematic study of LLMs on 2-SAT and 3-SAT, together with two canonical reductions, Vertex Cover and discrete 3D packing, to probe representation-invariant reasoning. We first evaluate models using conventional metrics, including accuracy, precision, recall, and F1, as well as the SAT phase-transition setting. We find that these metrics can be misleading: many models obtain high scores by over-predicting satisfiable formulas, fail to reproduce the classical easy-hard-easy signature around the 3-SAT threshold, and degrade sharply as the number of variables grows. To address this problem, we introduce a paired-formula protocol based on minimally different satisfiable and unsatisfiable instances, together with Accurate Differentiation Rate (ADR), which requires both members of each pair to be classified correctly. ADR separates reasoning-oriented models from heuristic ones and correlates with witness validity. Beyond CNF, we test cross-representation consistency by converting CNF to Vertex Cover and 3-SAT to discrete 3D packing. Model decisions on CNF and on the corresponding graph or packing instances agree for most models on more than 80 percent of instances, suggesting stable decision rules across representations. Overall, our results show that SAT is a conservative probe for LLM reasoning, and that paired evaluation with ADR provides a more faithful and representation-robust assessment than conventional metrics.
AI Impact Assessments
(1 models)Scientific Impact Assessment
1. Core Contribution
This paper makes two primary contributions: (1) a systematic empirical evaluation of 12 LLMs on Boolean satisfiability (2-SAT and 3-SAT), and (2) a new evaluation methodology based on paired formulas and a metric called Accurate Differentiation Rate (ADR). The paired-formula protocol generates minimally different SAT/UNSAT instance pairs (differing by a single literal flip or clause edit), and ADR requires correct classification of *both* members of each pair. The paper also tests cross-representation consistency by reducing SAT instances to Vertex Cover and 3D packing problems.
The key finding is that conventional metrics (accuracy, precision, recall, F1, MCC) are misleading for evaluating LLM reasoning on SAT due to class imbalance across clause-density regimes and systematic SAT-prediction bias. ADR exposes these deficiencies by requiring joint correctness on paired instances, revealing that most LLMs' reasoning collapses beyond N≥25 variables for 3-SAT.
2. Methodological Rigor
Strengths in methodology:
Weaknesses:
3. Potential Impact
Evaluation methodology: ADR's requirement for joint correctness on paired instances is a genuinely useful idea that extends beyond SAT. The paper correctly identifies that paired evaluation eliminates class-prior artifacts, a pervasive problem in LLM benchmarking. This approach could influence evaluation practices in program analysis, formal verification, and constraint satisfaction communities.
Practical insights: The finding that LLMs exhibit strong SAT bias and cannot reliably detect UNSAT formulas has direct implications for practitioners using LLMs for constraint-heavy tasks (scheduling, configuration, verification). The paper's argument that SAT performance upper-bounds performance on reducible tasks is intuitive, though the empirical evidence (>80% agreement across representations) demonstrates consistency rather than strict upper-bounding.
Limited algorithmic impact: The paper does not propose methods to *improve* LLM reasoning on SAT, limiting its constructive contribution. The discussion of LLM-solver integration is speculative.
4. Timeliness & Relevance
The paper is highly timely. LLMs are increasingly deployed for reasoning-intensive tasks, and rigorous evaluation of their logical reasoning capabilities is a current bottleneck. The community needs principled benchmarks and metrics that go beyond standard classification scores. The paper directly addresses this need in a domain (SAT) with well-understood computational complexity and mature solver baselines, providing clean ground truth. The inclusion of recent models (GPT-5, Claude Opus 4, DeepSeek-reasoner) ensures currency.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The ADR metric, while useful, is conceptually straightforward—it is essentially pair-level accuracy. Its main value lies in the *paired instance construction* rather than the metric formula itself. The paper could have been stronger by exploring whether ADR correlates with downstream task performance in a more controlled pipeline setting, rather than relying primarily on the theoretical argument.
The paper's venue (FSE 2026) is appropriate given the software engineering applications discussed, though the contribution is primarily empirical evaluation methodology rather than a new tool or technique.
Generated May 28, 2026
Comparison History (14)
Paper 2 offers a broadly reusable evaluation methodology (matched-pair protocol + ADR) that directly addresses a known pitfall in LLM reasoning assessment (metric gaming via satisfiable bias) and connects to foundational complexity phenomena (phase transition, scaling). Its representation-invariant tests via reductions (to Vertex Cover and 3D packing) increase rigor and generality, making it applicable across many reasoning benchmarks beyond SAT. Paper 1 is a strong, timely multilingual diagnostic for spatial world modeling, but its impact is more domain-specific and less methodologically generalizable than Paper 2’s evaluation framework.
Paper 2 likely has higher scientific impact due to a clearer, broadly applicable contribution (cost–accuracy routing for LLM-as-a-judge) with direct real-world deployment relevance and a principled optimization framework (distributionally robust formulation, efficient algorithm, theoretical guarantees). Its focus on budgeted evaluation under distribution shift is timely for widespread LLM evaluation pipelines across domains. Paper 1 offers a useful diagnostic metric (paired protocol/ADR) for reasoning assessment on SAT, but its primary impact is narrower (benchmarking/evaluation methodology for SAT-like reasoning) and less directly actionable for production systems.
Paper 1 addresses a foundational question in AI—whether LLMs genuinely reason—by rigorously exposing flaws in standard benchmarks and proposing a robust paired-evaluation protocol using SAT. Its insights into LLM reasoning limitations and evaluation methodologies have broader, more fundamental scientific implications across the AI community than Paper 2, which offers a highly practical but domain-specific advance in 3D asset generation.
Harness-Bench addresses a critical and timely gap in LLM agent evaluation—isolating the effect of the execution harness from the model itself—which has broad practical implications for the rapidly growing agent deployment ecosystem. It introduces a comprehensive benchmark with 5,194 trajectories and identifies a novel concept (execution-alignment failures) relevant across many agent applications. Paper 2 makes a solid methodological contribution (ADR metric, paired-formula protocol) for evaluating LLM reasoning on SAT, but targets a narrower, more theoretical question with less immediate breadth of impact across the field.
Paper 1 introduces a novel evaluation methodology (paired-formula protocol and ADR metric) for assessing LLM reasoning on a fundamental computational problem (SAT), with broad implications across AI evaluation. Its findings about misleading conventional metrics and cross-representation consistency are relevant to the entire LLM evaluation community. Paper 2, while solving a useful domain-specific problem (chemical reaction diagram parsing) with solid results, has narrower impact limited to cheminformatics. Paper 1's methodological contribution is more generalizable and addresses a timelier question about LLM reasoning capabilities.
Paper 1 addresses a critical bottleneck (rollout diversity) in Reinforcement Learning with Verifiable Rewards (RLVR), a highly impactful and rapidly growing area for training reasoning LLMs. By introducing a simple, low-cost intervention (first-token diversification) that yields consistent performance gains over state-of-the-art baselines like GRPO, it offers immediate and broad practical utility for AI development. While Paper 2 provides valuable insights into LLM evaluation, Paper 1's algorithmic contribution directly advances the capability to train stronger reasoning models, likely resulting in higher immediate scientific and applied impact.
Paper 2 has higher likely scientific impact due to broader, more foundational relevance: it proposes a generally applicable matched-pair evaluation protocol (ADR) for probing LLM reasoning, validated across SAT and representation-changing reductions (Vertex Cover, 3D packing). This methodological contribution can transfer to many reasoning benchmarks and evaluation settings beyond a single domain. Paper 1 is timely and valuable for finance, but its impact is narrower (portfolio management benchmarking) and more application-specific, whereas Paper 2 targets core evaluation methodology for LLM reasoning with potential cross-field uptake.
Paper 2 offers broader scientific impact by addressing a fundamental challenge in AI: accurately evaluating LLM reasoning. By introducing the paired-formula protocol and ADR metric, it exposes severe flaws in standard evaluation metrics that falsely reward heuristic guessing. Because Boolean satisfiability is foundational, benchmarking LLMs on representation-invariant SAT problems provides a highly rigorous, universally applicable tool for the AI community. In contrast, while Paper 1 tackles an important alignment issue, its scope is narrower, focusing on specific engineering protocols (retry vs. resample) within specific empirical safety settings.
Paper 1 introduces a novel evaluation methodology (paired-formula protocol and ADR metric) for probing LLM reasoning on a fundamental computational problem (SAT), with cross-representation consistency analysis. It addresses a broadly relevant question—whether LLMs truly reason or rely on heuristics—applicable across AI/ML. Paper 2 makes a valuable but narrower methodological contribution to RAG evaluation standards. While rigorous, its impact is confined to the RAG evaluation niche. Paper 1's connection to computational complexity theory and its generalizable evaluation framework give it broader and deeper potential impact.
Paper 2 likely has higher impact due to broader cross-field relevance (logic, complexity theory, automated reasoning, evaluation of LLM reasoning), clearer real-world applicability (reliable assessment of models used in planning/verification-like tasks), and a methodological contribution (matched-pair protocol + ADR) that addresses known pitfalls in SAT evaluation and extends across representations (CNF, Vertex Cover, 3D packing). Paper 1 offers a useful diagnostic for LM training/selection, but its scope is narrower and its metric refinement (median/percentile CE) is less broadly transformative.
Paper 1 addresses a fundamental question regarding the true reasoning capabilities of LLMs versus heuristic pattern matching. By introducing a matched-pair evaluation protocol and the ADR metric, it exposes flaws in conventional evaluation methods and offers a rigorous standard for future foundational AI research. While Paper 2 presents a valuable and highly practical multi-agent framework for supply chains, Paper 1's theoretical contributions to understanding and evaluating core LLM reasoning mechanisms give it broader potential scientific impact across the entire artificial intelligence community.
Paper 1 addresses a major frontier in AI—autonomous scientific discovery and reasoning in a highly complex, real-world domain (spatial biology). Its end-to-end evaluation paradigm has massive potential to accelerate biomedical research. While Paper 2 offers valuable methodological insights into LLM reasoning on SAT problems, Paper 1's benchmark directly bridges AI and translational medicine, offering broader interdisciplinary impact and real-world applicability.
Paper 1 has higher impact potential due to a novel, actionable method (adaptive multi-granularity compression + hindsight replay for RLVR) that directly improves long-horizon agent performance and token efficiency—highly timely for scalable LLM agents. It includes theoretical justification, integration into existing RL pipelines (GRPO), broad validation across models/tasks, and strong reported gains plus extrapolation robustness, enabling immediate real-world deployment. Paper 2 offers a rigorous, valuable evaluation protocol (paired instances/ADR) with cross-representation checks, but is primarily diagnostic rather than enabling new capabilities, so its downstream practical impact is likely narrower.
While Paper 2 presents a valuable application of Transformers to battery degradation with clear environmental benefits, Paper 1 addresses a fundamental and pervasive issue in modern AI: accurately evaluating the reasoning capabilities of LLMs. By exposing flaws in standard metrics and introducing a robust paired-evaluation protocol for SAT problems, Paper 1 has the potential to broadly influence how researchers benchmark and understand reasoning across the entire field of artificial intelligence.