Laura Wynter, Nirvik Sahoo, Paul Griffin
Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which generates multiple CoT traces with a small local model, parses fragments into per-hypothesis evidence pools, solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
The paper introduces EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), a pipeline that reframes CoT trace aggregation as a combinatorial optimization problem. Rather than majority voting over sampled reasoning traces, the system: (1) generates multiple CoT traces with a small local model, (2) parses fragments into per-hypothesis evidence pools, (3) solves a HUBO problem per pool using quality-derived weights (relevance, specificity, distinctiveness), and (4) delegates a single adjudication call to a frontier model. The key conceptual insight is that majority vote conflates popularity with quality: a minority-but-correct hypothesis backed by strong evidence should be recoverable. The per-hypothesis decomposition is a clean design choice that enables independent optimization per answer candidate, and the quality-based (rather than frequency-based) weighting explicitly decouples evidence selection from the majority-vote signal.
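To make the contrast with majority vote concrete, the following is a minimal sketch of per-pool selection, not the authors' implementation. Everything in it is an assumption made for illustration: the quality triples, the `build_hubo` weighting (mean quality as a linear reward, a flat pairwise redundancy penalty, a cubic synergy term), the toy simulated-annealing loop, and the use of the lowest pool energy as a stand-in for the frontier-model adjudication call.

```python
import math
import random
from collections import Counter
from itertools import combinations

def majority_vote(answers):
    """Return the most frequent answer among the sampled traces."""
    return Counter(answers).most_common(1)[0][0]

def build_hubo(qualities, redundancy=0.1, synergy=0.0):
    """Hypothetical weighting: reward mean quality, penalise pairs, reward triples."""
    n = len(qualities)
    linear = [-sum(q) / len(q) for q in qualities]
    pair = {p: redundancy for p in combinations(range(n), 2)}
    triple = {t: -synergy for t in combinations(range(n), 3)}
    return linear, pair, triple

def energy(x, linear, pair, triple):
    """Energy of a binary selection vector x (lower is better)."""
    e = sum(w * x[i] for i, w in enumerate(linear))
    e += sum(w * x[i] * x[j] for (i, j), w in pair.items())
    e += sum(w * x[i] * x[j] * x[k] for (i, j, k), w in triple.items())
    return e

def anneal(linear, pair, triple, steps=2000, t0=1.0, seed=0):
    """Single-bit-flip simulated annealing with a linear cooling schedule."""
    rng = random.Random(seed)
    n = len(linear)
    x = [rng.randint(0, 1) for _ in range(n)]
    e = energy(x, linear, pair, triple)
    best_x, best_e = x[:], e
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-3
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one fragment in or out
        e_new = energy(x, linear, pair, triple)
        if e_new <= e or rng.random() < math.exp((e - e_new) / temp):
            e = e_new
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e

# Toy data (invented): answers of five traces and per-hypothesis fragment
# qualities given as (relevance, specificity, distinctiveness).
answers = ["A", "A", "A", "B", "B"]
pools = {
    "A": build_hubo([(0.4, 0.3, 0.2), (0.4, 0.3, 0.2), (0.5, 0.2, 0.1)],
                    redundancy=0.5),
    "B": build_hubo([(0.9, 0.8, 0.9), (0.8, 0.9, 0.7), (0.7, 0.7, 0.8)],
                    redundancy=0.1, synergy=0.2),
}
results = {h: anneal(*weights) for h, weights in pools.items()}
# Stand-in for adjudication: pick the hypothesis whose pool has the lowest energy.
winner = min(results, key=lambda h: results[h][1])
print("majority vote:", majority_vote(answers), "| evidence-based pick:", winner)
```

In this toy setting the majority answer rests on weak, near-duplicate fragments, so its best selection keeps a single fragment, while the minority answer keeps all three of its strong, complementary fragments and ends up with the lower energy. In the pipeline as the review describes it, the selected fragments would be passed to a frontier model for adjudication rather than compared by energy.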
Strengths in experimental design: The paper evaluates on two legal benchmarks (MMLU-Pro law, LEXam) with two trace generators and two frontier adjudicators, providing a reasonably thorough evaluation matrix. The ablation study isolates five method components and adjudicator strength. The HUBO precision metric (Definition 3.7) is a useful diagnostic that conditions on disagreement events.
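Definition 3.7 is not reproduced in this review, so the following is only one plausible reading of a precision metric that conditions on disagreement events: among questions where the EP-HUBO answer differs from the majority-vote answer, the fraction on which EP-HUBO is correct. The function name, argument names, and the toy labels are all hypothetical.

```python
def precision_on_disagreement(hubo_answers, mv_answers, gold):
    """Fraction of HUBO/MV disagreements on which the HUBO answer is correct.

    Assumed reading of a disagreement-conditioned precision; returns None
    when the two methods never disagree.
    """
    disagreements = [
        (h, g)
        for h, m, g in zip(hubo_answers, mv_answers, gold)
        if h != m
    ]
    if not disagreements:
        return None
    return sum(h == g for h, g in disagreements) / len(disagreements)

# Invented labels for four questions; the methods disagree on the first and last.
hubo = ["B", "A", "C", "D"]
mv = ["A", "A", "C", "B"]
gold = ["B", "A", "C", "A"]
score = precision_on_disagreement(hubo, mv, gold)
print(score)
```

Conditioning on disagreement isolates the cases where the optimisation step actually changes the outcome, which overall accuracy alone cannot show.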
The paper addresses a genuine limitation of majority-vote aggregation: it cannot recover minority-correct answers. The per-hypothesis evidence pooling is an intuitive and potentially generalizable idea. The application to the legal domain is well motivated, since legal reasoning genuinely requires compiling independent pieces of evidence.
However, the practical impact is tempered by several factors:
The paper is timely in addressing test-time compute scaling and reasoning trace aggregation, which are active research areas. The focus on evidence-intensive legal reasoning is relevant as LLMs are increasingly deployed in regulated domains. The benchmark contamination angle (LEXam as low-contamination) is a valid concern. However, the quantum computing framing feels somewhat forced—the results don't demonstrate quantum advantage, and the "quantum-inspired" label in the title overpromises relative to what the quantum experiments deliver.
EP-HUBO presents a reasonable idea—treating evidence selection as combinatorial optimization rather than majority voting—applied to legal reasoning. The per-hypothesis pooling design is clean, and the large gains over MV are convincing. However, the gains over zero-shot frontier models are modest on the cleaner benchmark, the theoretical analysis is superficial, the scoring mechanism is unvalidated, and the quantum computing angle doesn't deliver on its promise. The paper makes a useful incremental contribution to test-time reasoning aggregation but overstates its significance through the quantum framing and the LEXam position-bias results.
Generated Jun 8, 2026
Paper 2 has higher estimated impact due to strong timeliness and broad relevance: it targets current weaknesses of LLM reasoning (evidence selection/aggregation) with a novel combinatorial-optimisation framing (HUBO) that can generalize beyond law to other evidence-intensive domains. It also has clearer near-term applications in AI evaluation and reliable decision support. Paper 1 is methodologically rigorous and useful for pattern mining, but its niche scope (interval pattern sampling under syntactic constraints) likely limits cross-field adoption and visibility compared to LLM-centric methods.
Paper 2 has higher likely scientific impact: it addresses a broadly relevant, timely problem in model-based RL (long-horizon planning in latent world models) with a clear, generally applicable hierarchical idea (action-free latent planner + action-conditioned dynamics) and potential downstream use in robotics/control where goal images are unavailable. If validated beyond preliminary PushT results, it could influence many planning/world-model systems. Paper 1 is novel but more niche (legal LLM evidence aggregation) and relies on complex, harder-to-reproduce optimization/quantum-inspired hardware, limiting breadth and adoption despite practical value in specialized domains.
Paper 2 presents a highly novel, interdisciplinary approach by integrating quantum-inspired combinatorial optimization with LLM reasoning. This methodological innovation addresses the known flaws of majority-vote in Chain-of-Thought prompting, offering a rigorous framework for evidence-intensive domains like law. Its potential to improve reasoning accuracy by preserving minority-but-correct hypotheses gives it a broader and more transformative impact across fields compared to Paper 1's empirical investigation of LLM alignment.
Paper 1 has higher likely impact: it addresses a widely observed, timely problem (LLM/LRM overthinking) with a training-free, model-agnostic method validated across multiple model scales and 12 benchmarks, suggesting broad applicability and easier adoption. Its core claim—difficulty evolution encoded in step embeddings—could influence future inference-time control and efficiency work across tasks (math, QA, coding). Paper 2 is innovative but narrower (legal evidence selection), relies on complex parsing/optimization pipelines and specialized hardware evaluation, and its benefits may depend strongly on domain “contamination” assumptions, limiting generalizability.
Paper 2 likely has higher impact due to broader relevance and timeliness: a benchmark suite for evaluating “researcher-like” agent behavior targets a rapidly growing area (agentic LLMs) and can become a community standard, enabling systematic comparison across models, harnesses, safety/ethics, and long-horizon research workflows. Released data further boosts adoption and downstream citations. Paper 1 is novel in optimization-based evidence aggregation for legal reasoning, but its impact is narrower (domain-specific, depends on CoT parsing/availability, and quantum hardware aspects may be seen as peripheral), limiting breadth and uptake.
Paper 2 introduces a novel framework (EP-HUBO) that bridges multiple high-impact fields—LLM reasoning, combinatorial optimization, and quantum computing—addressing a significant limitation of majority-vote aggregation in evidence-intensive domains. Its interdisciplinary nature, practical applicability to legal reasoning, and exploration of quantum-inspired optimization for LLM pipelines give it broader potential impact. Paper 1, while technically solid, is primarily an engineering optimization of an existing proof-of-concept SAT solver, offering incremental improvements in GPU acceleration rather than a fundamentally new paradigm.
Paper 2 is likely to have higher impact: it targets a broadly relevant, timely problem (post-deployment adaptation of LLM agents without supervision) and proposes a general framework with wide applicability across domains (software, web tasks, tool use). Its method—bootstrapping skills plus verifiers from open-world anchors and virtual tasks—could influence agent training, evaluation, and continual learning research. Paper 1 is novel in evidence aggregation for legal reasoning, but its scope is narrower (evidence-intensive QA) and depends on complex optimization/quantum-inspired hardware with less clear generality and adoption path.
Paper 2 addresses a critical and urgent issue in AI safety: the ability of frontier models to perform complex reasoning without observable Chain-of-Thought. Its massive empirical scale (30,000 questions across 43 benchmarks) and introduction of standardized capability metrics (Time Horizon and reasoning token horizon) provide foundational tools for future AI capability tracking and policy-making. While Paper 1 introduces a highly novel quantum-inspired optimization approach for LLM reasoning, Paper 2's direct relevance to AI alignment, oversight, and scaling laws gives it broader and more immediate impact across the AI research community.
Paper 2 likely has higher impact: adaptive, unified memory for LLM agents is a timely, broadly applicable problem with clear real-world utility (tool use, automation, long-horizon workflows) across many domains. The integrated semantic/episodic/procedural design with evaluation, pruning, and multi-agent roles suggests a general framework that can be adopted and extended by others, increasing breadth and follow-on work. Paper 1 is novel but more niche (legal benchmarks, CoT fragment parsing, HUBO/quantum-inspired optimization), with higher methodological and deployment friction and narrower applicability.
Paper 1 bridges LLM reasoning, combinatorial optimization, and quantum computing, introducing a highly novel approach to evidence selection. Its use of quantum-inspired hardware and higher-order binary optimization to solve CoT aggregation issues offers a paradigm shift with broad applicability across complex, evidence-intensive domains. While Paper 2 presents a valuable bidirectional neuro-symbolic method, it is more narrowly focused on geometry problems, making Paper 1's cross-disciplinary innovation and methodological novelty likely to have a broader scientific impact.