Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
Xiaozhe Li, Xinyu Fang, Shengyuan Ding, Yang Li, Linyang Li, Haodong Duan, Qingwen Liu, Kai Chen
Abstract
Large Language Models (LLMs) have achieved remarkable success on reasoning benchmarks through Reinforcement Learning with Verifiable Rewards (RLVR), excelling at tasks such as math, coding, logic, and puzzles. However, existing benchmarks evaluate only correctness, while overlooking optimality, namely the ability to find the best solutions under constraints. We propose OPT-BENCH, the first comprehensive framework for training and evaluating LLMs on NP-hard optimization problems through quality-aware RLVR. OPT-BENCH provides three key components: a scalable training infrastructure with instance generators, quality verifiers, and optimal baselines across 10 tasks; a rigorous benchmark with 1,000 instances evaluating both feasibility, measured by Success Rate, and quality, measured by Quality Ratio; and quality-aware rewards that enable continuous improvement beyond binary correctness. Training on Qwen2.5-7B-Instruct-1M with 15K examples achieves 93.1% SR and 46.6% QR, significantly outperforming GPT-4o, which achieves 29.6% SR and 14.6% QR. Beyond optimization, training on OPT-BENCH transfers to diverse tasks, including mathematics (+2.2%), logic (+1.2%), knowledge (+4.1%), and instruction following (+6.1%). Our analysis reveals that quality-aware rewards improve solutions by 28.8% over binary rewards, and that task diversity drives generalization more than data quantity, offering insights into RLVR scaling for complex reasoning.
AI Impact Assessments
Scientific Impact Assessment: FORGE (Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs)
1. Core Contribution
FORGE introduces a comprehensive framework (FORGE-ENGINE) for training and evaluating LLMs on NP-hard combinatorial optimization problems using quality-aware RLVR. The key insight is that existing RLVR approaches use binary (correct/incorrect) reward signals, which are insufficient for optimization problems where solution quality exists on a continuous spectrum. The framework has three main components: (1) a scalable data generation pipeline with controllable difficulty across 10 NP-hard tasks, (2) FORGE-BENCH, a 1,000-instance evaluation benchmark measuring both feasibility (Success Rate) and optimality (Quality Ratio), and (3) a quality-aware reward function that provides continuous feedback by comparing model solutions against heuristic baselines.
The conceptual shift from binary correctness to continuous quality optimization for RLVR is the paper's most important idea. By training a 7B model to achieve 93.1% SR and 46.6% QR—substantially outperforming GPT-4o at 62.1% SR and 36.2% QR—the work demonstrates that this approach is practically effective.
2. Methodological Rigor
The experimental design is generally sound but has notable gaps:
Strengths: The hierarchical reward function (format → feasibility → optimality) is well-motivated and cleanly designed. The curriculum replay strategy addresses catastrophic forgetting in a principled way. Ablation studies on reward design (binary vs. quality-aware, +28.8% improvement), curriculum strategies, and task diversity are informative. The comparison against SFT baselines convincingly demonstrates RLVR's superiority, with SFT showing catastrophic OOD degradation (-12.0 points).
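To make the format → feasibility → optimality hierarchy concrete, the following is a minimal sketch of how such a reward could be staged. It is not the paper's implementation: the function names, the penalty and bonus constants, and the clipping of the quality term to [0, 1] are all assumptions chosen for illustration.

```python
from typing import Optional


def quality_aware_reward(
    well_formatted: bool,
    feasible: bool,
    model_value: Optional[float],
    baseline_value: float,
    maximize: bool = True,
    format_penalty: float = -1.0,   # assumed constant, not from the paper
    feasible_bonus: float = 0.1,    # assumed constant, not from the paper
) -> float:
    """Hypothetical staged reward: format, then feasibility, then quality."""
    # Stage 1: output that cannot be parsed gets the lowest reward.
    if not well_formatted:
        return format_penalty
    # Stage 2: parseable output that violates constraints earns nothing.
    if not feasible or model_value is None:
        return 0.0
    # Stage 3: feasible output earns a continuous term relative to the
    # heuristic baseline (higher is better in both objective directions).
    if maximize:
        ratio = model_value / baseline_value if baseline_value > 0 else 0.0
    else:
        ratio = baseline_value / model_value if model_value > 0 else 0.0
    quality = max(0.0, min(1.0, ratio))
    return feasible_bonus + (1.0 - feasible_bonus) * quality


def binary_reward(well_formatted: bool, feasible: bool) -> float:
    """Binary counterpart: any well-formed, feasible answer scores 1."""
    return 1.0 if (well_formatted and feasible) else 0.0


if __name__ == "__main__":
    # Two feasible answers of different quality on a maximization task.
    weak = quality_aware_reward(True, True, 5.0, 10.0)
    strong = quality_aware_reward(True, True, 9.0, 10.0)
    print(weak, strong, binary_reward(True, True))
```

Under the binary reward both feasible answers score the same, so the policy receives no signal to prefer the better one. The staged version separates them, which is the property the ablation credits for the improvement.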
Concerns: The Quality Ratio metric is hard to interpret at face value. A QR of 46.6% sounds low, but the metric is measured against heuristic solvers and counts infeasible solutions as zeros, so it mixes feasibility with solution quality. The heuristic baselines themselves are not thoroughly validated (how close to optimal are they?). The paper uses Qwen2.5-7B-Instruct-1M rather than the standard 7B variant for better instruction adherence, which muddies the base comparison. The SFT baseline uses distillation from Qwen3-235B-Thinking, meaning the RLVR pipeline also benefits indirectly from a much larger model's knowledge. The OOD generalization gains (+2.2% math, +1.2% logic) are modest and could partly be attributed to continued training effects rather than genuine transfer from optimization reasoning.
The joint training with DAPO-17K math data complicates attribution of improvements. The claim that "optimization training transfers to general reasoning" would be stronger with more controlled experiments isolating the transfer mechanism.
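The interpretation issue with Quality Ratio can be illustrated with a small sketch. It assumes that SR is the fraction of feasible answers and that QR averages a per-instance ratio against the heuristic baseline, with infeasible answers contributing zero, as the assessment describes. The exact formula, the clipping at 1, and every name below are assumptions rather than the benchmark's definitions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Result:
    """One evaluated answer (hypothetical record layout)."""
    feasible: bool
    model_value: Optional[float]  # objective value, None if infeasible
    baseline_value: float         # value reached by the heuristic solver
    maximize: bool = True


def _ratio(r: Result) -> float:
    """Per-instance quality in [0, 1]; infeasible answers score 0."""
    if not r.feasible or r.model_value is None:
        return 0.0
    if r.maximize:
        raw = r.model_value / r.baseline_value if r.baseline_value > 0 else 0.0
    else:
        raw = r.baseline_value / r.model_value if r.model_value > 0 else 0.0
    return max(0.0, min(1.0, raw))


def success_rate(results: List[Result]) -> float:
    """Fraction of answers that satisfy all constraints."""
    return sum(r.feasible for r in results) / len(results) if results else 0.0


def quality_ratio(results: List[Result]) -> float:
    """Mean quality over ALL instances, zeros included."""
    return sum(_ratio(r) for r in results) / len(results) if results else 0.0


def quality_ratio_feasible_only(results: List[Result]) -> float:
    """Mean quality over feasible instances only, for comparison."""
    ok = [r for r in results if r.feasible]
    return sum(_ratio(r) for r in ok) / len(ok) if ok else 0.0


if __name__ == "__main__":
    demo = [
        Result(True, 8.0, 10.0),
        Result(True, 10.0, 10.0),
        Result(False, None, 10.0),
        Result(True, 20.0, 15.0, maximize=False),
    ]
    print(success_rate(demo), quality_ratio(demo), quality_ratio_feasible_only(demo))
```

In this toy example the headline QR is noticeably lower than the feasible-only average, because a single infeasible answer pulls it down. Reporting both numbers would separate "how often does the model produce a valid answer" from "how good are its valid answers".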
3. Potential Impact
Near-term: The framework provides immediate value as infrastructure for the community. The generator-verifier-solver pipeline is modular and extensible. Researchers working on LLM reasoning can use FORGE-BENCH as a complementary evaluation to existing math/logic benchmarks that captures a different reasoning dimension.
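As a rough picture of what a modular generator-verifier-solver triple might look like, here is a sketch for 0/1 knapsack. The paper's actual task list and interfaces are not given in this assessment, so the choice of knapsack, the greedy baseline, and all function names are assumptions made for illustration only.

```python
import random
from typing import List, Optional, Tuple

Item = Tuple[int, int]  # (weight, value)


def generate_instance(n_items: int, seed: int) -> Tuple[List[Item], int]:
    """Generator: a reproducible instance whose size controls difficulty."""
    rng = random.Random(seed)
    items = [(rng.randint(1, 20), rng.randint(1, 50)) for _ in range(n_items)]
    # Capacity is half the total weight, so not every item can be packed.
    capacity = max(1, sum(w for w, _ in items) // 2)
    return items, capacity


def verify(items: List[Item], capacity: int, chosen: List[int]) -> Optional[int]:
    """Verifier: total value if the selection is feasible, otherwise None."""
    if len(set(chosen)) != len(chosen):
        return None  # an item was selected twice
    if any(i < 0 or i >= len(items) for i in chosen):
        return None  # index out of range
    if sum(items[i][0] for i in chosen) > capacity:
        return None  # capacity constraint violated
    return sum(items[i][1] for i in chosen)


def greedy_baseline(items: List[Item], capacity: int) -> List[int]:
    """Heuristic solver: pack items by descending value-to-weight ratio."""
    order = sorted(
        range(len(items)),
        key=lambda i: items[i][1] / items[i][0],
        reverse=True,
    )
    chosen: List[int] = []
    load = 0
    for i in order:
        if load + items[i][0] <= capacity:
            chosen.append(i)
            load += items[i][0]
    return chosen


if __name__ == "__main__":
    items, capacity = generate_instance(n_items=8, seed=0)
    baseline = greedy_baseline(items, capacity)
    print("baseline value:", verify(items, capacity, baseline))
```

Keeping the three pieces behind small, independent functions is what would make such a pipeline extensible: adding a task means supplying a new generator, verifier, and baseline solver without touching the training loop.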
Broader implications: The paper opens a relatively unexplored direction—using optimization problems with continuous quality signals for RLVR training. This could influence reward design in other domains where solution quality varies continuously (e.g., code optimization, essay writing, planning). The finding that task diversity matters more than data quantity for generalization is practically useful for RLVR practitioners, though this finding is not entirely novel.
Practical applications: LLMs capable of solving optimization problems could assist in logistics, scheduling, and resource allocation. However, the 46.6% QR suggests these models are far from replacing specialized solvers for real-world optimization.
4. Timeliness & Relevance
The paper is highly timely. RLVR has become the dominant paradigm for improving LLM reasoning (DeepSeek-R1, Kimi K1.5), and the community is actively searching for new domains and reward signals beyond math and coding. The observation that binary rewards are limiting for optimization tasks addresses a genuine bottleneck. The work sits at the intersection of two rapidly growing areas: LLM reasoning and combinatorial optimization, making it relevant to multiple communities.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
FORGE makes a solid contribution by identifying an important gap (optimality vs. correctness) in RLVR training and providing comprehensive infrastructure to address it. The quality-aware reward design is the most impactful methodological contribution. The work is timely and practically useful, though some claims—particularly around OOD generalization—are stronger than the evidence supports. The framework's value as infrastructure may ultimately exceed its specific empirical contributions.
Generated May 12, 2026
Comparison History (22)
SciCore-Mol addresses a fundamental challenge in AI-driven scientific discovery by creating pluggable cognitive modules that bridge the gap between LLMs and molecular data. Its direct applications in drug design, chemical synthesis, and scientific discovery give it broad real-world impact across chemistry and biology. While Paper 2 (Forge/OPT-BENCH) makes valuable contributions to LLM optimization reasoning with quality-aware rewards and demonstrates interesting transfer learning, its scope is more incremental within the existing RLVR paradigm. Paper 1's modular framework for integrating heterogeneous scientific data into LLMs represents a more transformative architectural contribution with wider cross-disciplinary implications.
Paper 1 introduces a novel framework (OPT-BENCH) addressing a significant gap in LLM evaluation—optimality beyond correctness for NP-hard problems. Its quality-aware RLVR approach is broadly applicable across optimization domains, demonstrates strong transfer learning to diverse tasks, and provides fundamental insights into RLVR scaling. The breadth of impact across fields (operations research, combinatorics, general reasoning) and the generalizable methodology give it higher potential impact than Paper 2, which, while valuable for chemistry, addresses a more domain-specific problem with modular architecture that builds on established paradigms.
Paper 2 has higher potential impact: it introduces a principled, general framework (SMC-based sampling view) for LLM-driven program evolution with convergence-control mechanisms and finite-sample complexity guarantees, improving methodological rigor and broader applicability across domains (math, algorithms, symbolic regression, ML research). This theoretical grounding plus demonstrated efficiency gains can influence automated discovery, evolutionary computation, and probabilistic inference communities. Paper 1 is timely and strong for optimization-centric RLVR benchmarking, but is more benchmark/task-specific and offers fewer formal guarantees, making its cross-field impact likely narrower.
Paper 2 introduces a novel benchmark and training framework (OPT-BENCH) addressing a significant gap: LLM evaluation/training for optimization quality beyond binary correctness. It demonstrates strong empirical results with transfer learning benefits across diverse tasks and provides actionable insights on quality-aware rewards and task diversity. Paper 1 offers useful diagnostic theory for self-correction but is more incremental—formalizing known observations with a Markov model. Paper 2 opens a new research direction (quality-aware RLVR for NP-hard problems) with broader impact across optimization, reasoning, and RL communities.
Paper 2 likely has higher impact: it introduces a new benchmark/training framework (OPT-BENCH) targeting NP-hard optimization with quality-aware RLVR, expanding evaluation beyond correctness to optimality—highly timely and broadly useful for reasoning and planning applications. It provides infrastructure, metrics, and demonstrated gains plus transfer to other tasks, suggesting wide applicability across fields. Paper 1 is valuable and methodologically careful, but is primarily diagnostic/ablation-focused on OPD/OPSD failure modes and mitigations, with a narrower immediate application surface than a scalable benchmark + training paradigm.
Paper 1 likely has higher scientific impact: it introduces a new benchmark/training framework (OPT-BENCH) for quality-aware RLVR on NP-hard optimization with scalable infrastructure, metrics, and demonstrated gains plus cross-task transfer—broadly useful for advancing LLM reasoning and optimization, with clear downstream applications (planning, scheduling, routing, resource allocation). Paper 2 provides important evidence of label bias in LLM-as-a-judge and parallels with human heuristics, but its contributions are more diagnostic and narrower in application scope, with less direct methodological/tooling leverage for many domains.
Paper 1 introduces OPT-BENCH, a comprehensive framework addressing a clear gap in LLM evaluation—optimality beyond correctness for NP-hard problems. It demonstrates strong empirical results with quality-aware RLVR, showing significant improvements over GPT-4o and positive transfer to diverse tasks. The practical impact is broad: optimization problems are ubiquitous in industry. Paper 2 presents an intellectually interesting but more speculative contribution (self-programmed execution with Spell), tested only on existing models not trained for the paradigm. Its impact depends on future training advances that remain undemonstrated, making Paper 1's contributions more immediately impactful and actionable.
Paper 2 introduces a novel benchmark and training framework (OPT-BENCH) that addresses a fundamental gap in LLM evaluation—optimality beyond correctness—for NP-hard problems. Its contributions span multiple dimensions: a new benchmark, quality-aware rewards for RLVR, strong empirical results surpassing GPT-4o, and demonstrated transfer learning to diverse reasoning tasks. The breadth of impact across LLM research, combinatorial optimization, and reasoning is substantial. Paper 1, while technically solid, addresses a narrower domain (power systems dynamics) with a more incremental foundation-model application. Paper 2's insights on RLVR scaling and task diversity are likely to influence broader AI research.
While Paper 1 presents a highly effective training framework for optimization tasks, Paper 2 provides a profound breakthrough in mechanistic interpretability. Uncovering the exact causal circuit for how LLMs are persuaded to abandon facts has immense implications for AI safety, alignment, and defending against data poisoning. Its rigorous intervention-based validation and discovery of a monitorable, generalizable mechanism across models offer deeper foundational scientific insights compared to the benchmarking and training improvements in Paper 1.
Paper 2 introduces a novel framework (OPT-BENCH) addressing a significant gap in LLM evaluation—optimality beyond correctness—for NP-hard problems. It demonstrates strong empirical results, transfer learning benefits across diverse tasks, and provides scalable infrastructure. The insights about quality-aware rewards vs binary rewards and task diversity vs data quantity have broad implications for RLVR research. Paper 1, while addressing an important problem in scientific ML, represents a more incremental contribution combining LLMs with symbolic regression in a narrower domain. Paper 2's broader applicability and methodological contributions give it higher potential impact.
Paper 2 likely has higher impact: it introduces a new benchmark (OPT-BENCH) and a quality-aware RLVR methodology for NP-hard optimization, expanding LLM evaluation/training from correctness to solution optimality—broadly relevant to operations research, planning, and decision-making. It reports strong empirical gains and cross-task transfer, suggesting wider applicability and timely relevance as RL-based reasoning matures. Paper 1 is innovative for runtime alignment via activation steering, but its impact may be narrower (safety/alignment defense) and more sensitive to threat-model proxies and deployment constraints.
While Paper 1 offers a profound theoretical critique of AI evaluation, Paper 2 presents a highly timely, empirical framework (OPT-BENCH) for improving LLM reasoning via Reinforcement Learning with Verifiable Rewards. Its introduction of scalable infrastructure, quality-aware rewards for NP-hard problems, and demonstrated transfer learning to general reasoning tasks provides immediate, practical utility to the rapidly growing field of LLM reasoning. This tangibility and strong empirical performance make it more likely to achieve rapid, widespread citation and adoption.
Paper 1 pioneers the application of LLMs to NP-hard optimization using quality-aware rewards, moving beyond binary correctness. Its introduction of OPT-BENCH, demonstrating a 7B model vastly outperforming GPT-4o on constrained optimization while improving general reasoning, offers massive real-world utility in operations research. Paper 2 provides a valuable but more incremental methodological improvement to RL exploration.
Paper 2 introduces a novel and broadly impactful contribution: quality-aware RLVR for NP-hard optimization, addressing the overlooked gap between correctness and optimality in LLM reasoning. OPT-BENCH provides a reusable benchmark infrastructure across 10 tasks, and the finding that quality-aware rewards transfer to diverse downstream tasks (math, logic, knowledge) has wide implications for RLVR training methodology. Paper 1, while solid, is more incremental—improving multi-agent workflow coordination for specific benchmarks. Paper 2's insights on task diversity vs. data quantity and continuous reward design are more foundational and likely to influence broader LLM training research.
Paper 1 addresses a highly timely and critical challenge in LLM reasoning by targeting NP-hard optimization problems with quality-aware reinforcement learning. Its development of a novel benchmark, significant performance gains over state-of-the-art models like GPT-4o, and demonstrated transferability to diverse tasks (math, logic) suggest broad, immediate applications and high impact across AI and operations research. Paper 2, while theoretically interesting for biologically plausible learning, focuses on analyzing existing algorithms on standard vision tasks, which has a narrower and less immediate practical impact.
Paper 1 likely has higher scientific impact due to broader novelty and cross-domain relevance: it introduces OPT-BENCH, a general framework and benchmark for quality-aware RLVR on NP-hard optimization with continuous (non-binary) rewards, plus scalable instance generation and optimal baselines across many tasks. This addresses a timely gap (optimality vs correctness) and shows transfer gains beyond optimization, suggesting wider implications for LLM reasoning and RL scaling laws. Paper 2 is strong and practical for program repair, but its scope is narrower and more application-specific.
Paper 1 offers a concrete, scalable benchmark/training framework (OPT-BENCH) for quality-aware RL on NP-hard optimization with clear metrics, baselines, and strong empirical results plus transfer effects, making it methodologically rigorous and broadly useful across reasoning, RL, and optimization communities. Its focus on moving beyond binary correctness to optimality is timely and likely to drive follow-on work and standardized evaluation. Paper 2 raises important safety/design considerations, but appears more conceptual/derivative from prior findings with less evidence and narrower immediate methodological contribution, reducing near-term scientific impact.
Paper 1 introduces a novel benchmark and training framework (OPT-BENCH) addressing a significant gap in LLM evaluation—optimality beyond correctness for NP-hard problems. It demonstrates strong empirical results, shows transfer learning benefits across diverse tasks, and provides actionable insights about quality-aware rewards and task diversity for RLVR scaling. The breadth of impact spans optimization, reasoning, and LLM training methodology. Paper 2, while interesting, presents a narrower proof-of-concept for adaptive prompt engineering in task planning explanations, with more limited applicability and less rigorous large-scale evaluation.
Paper 2 likely has higher impact due to timeliness and broad relevance to LLM post-training, introducing a new benchmark (OPT-BENCH) and quality-aware RLVR for NP-hard optimization—an important, under-evaluated capability with clear real-world applications. It provides scalable infrastructure, quantitative gains over strong baselines, and demonstrated transfer to multiple domains, suggesting wide cross-field influence. Paper 1 is methodologically solid and novel within ASP/SMT integration, but its impact is more specialized to logic programming and formal methods communities, with narrower immediate adoption.
Paper 1 introduces a novel and comprehensive framework (OPT-BENCH) for training and evaluating LLMs on NP-hard optimization problems using quality-aware reinforcement learning. It addresses a significant gap in current LLM evaluation (optimality vs. mere correctness), demonstrates strong empirical results surpassing GPT-4o, shows transfer learning benefits across diverse tasks, and provides actionable insights on RLVR scaling. Paper 2 makes a valid but relatively narrow observation about clock skew affecting observability in distributed systems—an important engineering concern but with limited novelty and narrower scientific impact compared to Paper 1's contributions to LLM reasoning and optimization.