SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel

#168 of 2292 · Artificial Intelligence
Tournament Score: 1526±24
Win Rate: 70% (47 wins, 20 losses, 67 matches)
Rating: 6.2/10

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has significantly improved large language model (LLM) reasoning in formal domains such as mathematics and code. Despite these advancements, LLMs still struggle with general reasoning tasks requiring capabilities such as causal inference and temporal understanding. Extending RLVR to general reasoning is fundamentally constrained by the lack of high-quality, verifiable training data that spans diverse reasoning skills. To address this challenge, we propose SUPERNOVA, a data curation framework for RLVR aimed at enhancing general reasoning. Our key insight is that instruction-tuning datasets containing expert-annotated ground-truth encode rich reasoning patterns that can be systematically adapted for RLVR. To study this, we conduct 100+ controlled RL experiments to analyze how data design choices impact downstream reasoning performance. In particular, we investigate three key factors: (i) source task selection, (ii) task mixing strategies, and (iii) synthetic interventions for improving data quality. Our analysis reveals that source task selection is non-trivial and has a significant impact on downstream reasoning performance. Moreover, selecting tasks based on their performance for individual target tasks outperforms strategies based on overall average performance. Finally, models trained on SUPERNOVA outperform strong baselines (e.g., Qwen3.5) on challenging reasoning benchmarks including BBEH, Zebralogic, and MMLU-Pro. In particular, training on SUPERNOVA yields relative improvements of up to 52.8% on BBEH across model sizes, demonstrating the effectiveness of principled data curation for RLVR. Our findings provide practical insights for curating human-annotated resources to extend RLVR to general reasoning. The code and data are available at https://github.com/asuvarna31/supernova.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: SUPERNOVA

1. Core Contribution

SUPERNOVA addresses a genuine gap in the RLVR literature: while RLVR has been highly successful for math and code reasoning, extending it to general reasoning (causal inference, temporal understanding, logical deduction, etc.) has been limited by the scarcity of verifiable training data in these domains. The paper's key insight is that existing instruction-tuning datasets (specifically SuperNI) already contain expert-annotated ground-truth labels that can be systematically repurposed for RLVR training.
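
To make the idea of repurposing annotated labels concrete, here is a minimal sketch of a verifiable reward for a multiple-choice item. It is an illustration only: the `Answer: <letter>` response format, the option range A to J, and the function names are assumptions, not details taken from the paper.

```python
import re
from typing import Optional

# Assumed (hypothetical) response format: the model states "Answer: <letter>".
# The negative lookahead rejects words such as "Answer: Because".
ANSWER_PATTERN = re.compile(r"answer:\s*\(?([A-J])\)?(?![A-Za-z])", re.IGNORECASE)


def extract_choice(response: str) -> Optional[str]:
    """Return the last option letter the response commits to, or None."""
    matches = ANSWER_PATTERN.findall(response)
    return matches[-1].upper() if matches else None


def verifiable_reward(response: str, gold_label: str) -> float:
    """Binary reward: 1.0 if the extracted choice equals the annotated label."""
    choice = extract_choice(response)
    if choice is None:
        return 0.0
    return 1.0 if choice == gold_label.strip().upper() else 0.0
```

Because the reward is an exact match against a human-annotated label, it can be checked automatically, which is the property RLVR relies on. Taking the last stated answer is one possible design choice; a stricter checker could reject responses that state more than one answer.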

The contribution is primarily a data curation framework rather than a novel algorithm. The framework involves three stages: (1) task selection from instruction-tuning datasets, (2) task mixing strategies (macro vs. micro), and (3) synthetic data interventions. The paper conducts 100+ controlled RL experiments to understand how each design choice impacts downstream general reasoning performance, ultimately producing a 25K-sample dataset that yields strong improvements on BBEH and other benchmarks.
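
The difference between the two mixing strategies can be sketched as follows. This is one reading of the abstract's description (ranking by overall average versus ranking per individual target task), not the paper's implementation; the score table, task names, and function names are hypothetical.

```python
from statistics import mean
from typing import Dict, List

# source task -> target subtask -> validation score after training on that source
Scores = Dict[str, Dict[str, float]]


def macro_select(scores: Scores, k: int) -> List[str]:
    """Top-k source tasks by average score over all target subtasks."""
    ranked = sorted(scores, key=lambda s: mean(scores[s].values()), reverse=True)
    return ranked[:k]


def micro_select(scores: Scores, k: int) -> List[str]:
    """Union of the top-k source tasks for each individual target subtask."""
    targets = sorted({t for per_target in scores.values() for t in per_target})
    chosen: List[str] = []
    for t in targets:
        ranked = sorted(scores, key=lambda s: scores[s].get(t, 0.0), reverse=True)
        for s in ranked[:k]:
            if s not in chosen:
                chosen.append(s)
    return chosen


# Hypothetical scores, for illustration only (not from the paper).
toy_scores: Scores = {
    "source_a": {"causal": 0.30, "temporal": 0.10},
    "source_b": {"causal": 0.22, "temporal": 0.22},
    "source_c": {"causal": 0.10, "temporal": 0.32},
}
macro_choice = macro_select(toy_scores, k=1)  # best on average
micro_choice = micro_select(toy_scores, k=1)  # best per target subtask
```

In this toy table the macro strategy picks the source task with the best average, while the micro strategy picks the two specialists that each lead on one target subtask.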

2. Methodological Rigor

Strengths in experimental design:

  • The controlled, compute-matched experimental setup across 100+ experiments is commendable and adds credibility to the findings.
  • The ablation structure is well-organized: each stage (selection → mixing → interventions) is investigated independently.
  • Using pass@8 over pass@1 is well-motivated with variance analysis (Figure 6), showing 2.5× greater discriminability.
  • Cross-model generalization experiments (Qwen3, Qwen3.5, LLaMA) strengthen the claims.
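
For readers unfamiliar with the metric, pass@k credits a problem if at least one of k sampled responses is correct. The assessment does not say how the paper computes pass@8, so the sketch below assumes the widely used unbiased estimator computed from n samples with c correct; the function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples of which c are correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = k = 8 the estimate reduces to a simple check of whether any of the eight samples is correct.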

Weaknesses and concerns:

  • The initial candidate pool of 83 tasks (from 1,600) is selected using Claude-Opus with a simple yes/no prompt. This introduces an uncontrolled dependency on the LLM's judgment, and the criteria are vague ("suitability for RL training reasoning models"). This bottleneck could significantly affect all downstream conclusions.
  • Task ranking via approach (c)—training on each task individually and evaluating—is extremely compute-intensive and doesn't scale well. The paper acknowledges that cheaper proxies (similarity, difficulty) don't correlate well, but this makes the framework less practical for broader adoption.
  • The validation benchmark (BBEH-mini, 460 examples) is used both for task selection/ranking and for reporting intermediate results. While they hold out BBEH-test separately, the entire curation pipeline is optimized toward BBEH performance, raising concerns about overfitting to this benchmark family.
  • Data curation experiments use only Qwen3-0.6B, and while larger models are trained for final evaluation, findings may not transfer across scales for all design decisions.
  • The negative result on synthetic interventions (§5.3) is interesting but underexplored—the paper doesn't deeply analyze why all interventions fail.

3. Potential Impact

The paper addresses a real and timely need: extending RLVR beyond math/code to general reasoning. The practical insights are valuable:

  • Task selection matters dramatically (7.6pp spread between best and worst tasks): this is an actionable finding for practitioners.
  • Micro mixing outperforms macro mixing: the insight that different target reasoning skills benefit from different source tasks is intuitive but empirically validated here.
  • Synthetic interventions don't help: a useful negative result that could save compute for others.

The dataset and code are released, which aids reproducibility. However, the framework's reliance on extensive per-task RL training for ranking limits scalability to much larger task pools.

The results are notable: SUPERNOVA-4B outperforms Qwen3-8B, a model twice its size, on BBEH, and shows strong gains on Zebralogic (+21pp for 4B). Generalization to BBH, MMLU-Pro, and Zebralogic suggests the improvements are not benchmark-specific.

4. Timeliness & Relevance

This work is highly timely. The RLVR community has been focused almost exclusively on math/code, and recent papers (General Reasoner, Nemotron-CrossThink) have just begun exploring broader domains. SUPERNOVA differentiates itself by leveraging high-quality human-annotated instruction data rather than noisy web-scraped data, which is a principled design choice. The finding that math RLVR doesn't transfer to general reasoning (OpenReasoner/OpenThinker degrade BBEH performance by 8%) motivates the work well.

5. Strengths & Limitations

Key Strengths:

  • Well-structured ablation study with practical, actionable insights
  • Strong empirical results across multiple model sizes and families
  • Effective use of existing high-quality human annotations (no expensive new data collection)
  • Open-source dataset and code
  • Important negative results (interventions, similarity-based selection) that inform the community

Notable Limitations:

  • The task selection pipeline is computationally expensive and somewhat circular (requires RL training to rank tasks)
  • Limited model scale (max 4B parameters); unclear if findings hold for 7B+ models
  • Heavy reliance on BBEH as the primary optimization target during curation; generalization claims would be stronger if curation were benchmark-agnostic
  • The reformatting step (converting open-ended tasks to MCQ) using GPT-5-mini is a significant transformation whose quality is only checked on 100 samples
  • 7 of 23 BBEH subtasks show "negligible gains" (Table 5), but this is underexplored
  • No analysis of training dynamics, reward curves, or what the models actually learn differently
  • The paper's framing as a general framework is somewhat overstated—it's specifically a pipeline for adapting SuperNI data, and the principles may not transfer to arbitrary task pools

Overall Assessment

SUPERNOVA makes a solid empirical contribution to an important problem: extending RLVR to general reasoning through principled data curation. The controlled experimental methodology and practical insights are its greatest strengths. However, the scalability of the curation pipeline, reliance on a single benchmark family for optimization, and limited theoretical understanding of why certain tasks help constrain its broader impact. This is a useful applied contribution with clear practical value, but it falls short of a fundamental advance.

Rating: 6.2/10
Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Generated Apr 10, 2026

Comparison History (67)

    vs. cotomi Act: Learning to Automate Work by Watching You
    gemini-3 · 5/6/2026

    Paper 2 addresses a fundamental bottleneck in AI: improving general reasoning in LLMs beyond formal domains like math and code. By providing a rigorous data curation framework for Reinforcement Learning with Verifiable Rewards (RLVR) and conducting over 100 controlled experiments, it offers broad methodological insights applicable across the field. While Paper 1 presents an impressive and practical web agent system, Paper 2's focus on enhancing core reasoning capabilities via RL has a higher potential for widespread theoretical and foundational impact in shaping next-generation language models.

    vs. OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
    claude-opus-4.6 · 5/6/2026

    SUPERNOVA addresses a more fundamental and broadly impactful problem: extending RLVR beyond formal domains (math/code) to general reasoning in LLMs. Its systematic study of data curation for RL training, with 100+ controlled experiments and strong empirical results (up to 52.8% improvement on BBEH), provides actionable insights for the rapidly growing RLVR community. The findings about task selection and mixing strategies have broad applicability. OracleProto addresses the important but narrower niche of LLM forecasting evaluation, with impact primarily limited to the forecasting benchmark community. SUPERNOVA's contributions are more foundational to LLM training methodology.

    vs. cotomi Act: Learning to Automate Work by Watching You
    gemini-3 · 5/6/2026

    Paper 1 addresses a fundamental limitation in LLMs—general reasoning beyond math and code—by extending Reinforcement Learning with Verifiable Rewards (RLVR). Its rigorous methodology, involving over 100 controlled experiments, provides critical insights into data curation for RL. Improving core reasoning capabilities of foundation models has profound and widespread implications across all of AI. While Paper 2 offers a strong applied system for browser agents, Paper 1's focus on foundational LLM reasoning methodologies gives it higher potential for broad scientific impact.

    vs. OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking
    claude-opus-4.6 · 5/6/2026

    SUPERNOVA addresses the fundamental challenge of extending RLVR beyond formal domains (math/code) to general reasoning, which is a highly active and impactful research direction. Its systematic study of data curation for RLVR with 100+ controlled experiments provides actionable insights for the community, and the demonstrated improvements (up to 52.8% on BBEH) across model sizes are substantial. The breadth of impact is wider—improving general LLM reasoning capabilities affects virtually all downstream applications. OracleProto addresses an important but narrower benchmarking problem for LLM forecasting, with impact primarily limited to the forecasting evaluation community.

    vs. Compiling Deterministic Structure into SLM Harnesses
    gemini-3 · 5/5/2026

    Paper 1 addresses a critical bottleneck in foundational model research: extending RL-driven reasoning beyond math and code into general domains. By providing a scalable data curation framework (SUPERNOVA) and extensive empirical analysis on RLVR, it offers highly timely insights that will directly influence how next-generation LLMs are post-trained, likely resulting in broader foundational impact and citations compared to the workflow-specific SLM optimizations in Paper 2.

    vs. Compiling Deterministic Structure into SLM Harnesses
    gemini-3 · 5/5/2026

    Paper 2 introduces a highly novel conceptual framework (Semantic Gradient Descent) for optimizing SLM workflows, supported by rigorous theoretical grounding (PAC learning bounds). Its approach to compiling deterministic structures and addressing the epistemic asymmetry in enterprise deployment offers deeper methodological innovation compared to Paper 1's purely empirical study on data curation. The substantial empirical gains (+34.3%) and rigorous evaluation on adversarial datasets further solidify its potential for higher scientific impact.

    vs. Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks
    claude-opus-4.6 · 5/5/2026

    SUPERNOVA addresses a fundamental and timely challenge—extending RLVR beyond formal domains to general reasoning—with rigorous empirical methodology (100+ controlled experiments), practical insights on data curation, and strong results across multiple benchmarks. Its findings on task selection and mixing strategies provide broadly applicable principles for the rapidly growing RLVR community. Paper 1, while innovative in neuro-symbolic skill induction, targets a narrower problem (long-horizon agentic tasks). Paper 2's open-source release, breadth of impact across reasoning tasks, and relevance to the mainstream LLM training paradigm give it higher potential impact.

    vs. Position: How can Graphs Help Large Language Models?
    gpt-5.2 · 5/5/2026

    Paper 2 has higher likely scientific impact because it proposes a concrete, testable framework (SUPERNOVA) with extensive controlled RL experiments, clear methodological contributions (data curation factors for RLVR), and strong benchmark improvements, plus released code/data enabling adoption. It directly advances a timely bottleneck—scaling verifiable-reward RL beyond math/code into general reasoning—with broad applicability to LLM training. Paper 1 is a position/survey-style piece: useful for framing and synthesizing directions, but typically less novel empirically and less immediately actionable for measurable progress.

    vs. Lifting Traces to Logic: Programmatic Skill Induction with Neuro-Symbolic Learning for Long-Horizon Agentic Tasks
    gemini-3 · 5/5/2026

    Paper 2 addresses a critical bottleneck in LLM development: extending Reinforcement Learning with Verifiable Rewards (RLVR) to general reasoning. By conducting over 100 controlled experiments and releasing an open-source data curation framework, it provides highly timely and actionable insights for the broader AI community. While Paper 1 presents an innovative neuro-symbolic approach for agentic tasks, Paper 2's focus on foundational LLM reasoning improvements positions it to have a broader and more immediate scientific impact across the field.

    vs. Position: How can Graphs Help Large Language Models?
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher impact: it proposes a concrete, reproducible framework (SUPERNOVA) with extensive controlled RL experiments, clear methodological contributions on data curation for RLVR, and strong benchmark gains, plus released code/data—supporting adoption and follow-on work. Its focus on extending RLVR from formal to general reasoning is timely and broadly relevant to LLM training. Paper 1 is a position/survey-style piece with valuable synthesis and direction-setting, but it offers fewer directly testable innovations and less empirical rigor, which typically limits near-term scientific impact.

    vs. Trustworthy AI Suffers from Invariance Conflicts and Causality is The Solution
    gemini-3 · 5/5/2026

    Paper 1 offers a highly timely, empirically validated framework for improving LLM general reasoning using RL, backed by over 100 controlled experiments and open-source assets. While Paper 2 presents a strong conceptual argument for trustworthy AI using causality, Paper 1's concrete methodology, demonstrable benchmark improvements, and direct applicability to current foundation model training paradigms are likely to yield higher near-term scientific impact and citation volume.

    vs. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
    claude-opus-4.6 · 5/5/2026

    SUPERNOVA addresses a more impactful and timely problem—extending RLVR beyond math/code to general reasoning—with strong empirical results (52.8% improvement on BBEH) and practical, reproducible contributions (open-source code/data). Its data curation framework has broad applicability across LLM training pipelines. Paper 1 provides useful insights on horizon length effects but is more narrowly focused on a specific training dynamic. Paper 2's findings on task selection and mixing strategies offer immediately actionable guidance for the rapidly growing RLVR community, giving it broader potential impact.

    vs. First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint
    claude-opus-4.6 · 5/5/2026

    SUPERNOVA addresses a timely and high-impact problem—extending RLVR beyond math/code to general reasoning in LLMs—which is a central challenge in the current AI landscape. Its systematic data curation framework with 100+ experiments provides practical, actionable insights for the rapidly growing LLM training community. The demonstrated improvements (up to 52.8% on BBEH) over strong baselines like Qwen3.5 are substantial. Paper 2, while theoretically rigorous and elegant in unifying probabilistic value estimators, addresses a narrower problem in data valuation/explainability with more incremental improvements over existing estimators, limiting its breadth of impact.

    vs. Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement
    gemini-3 · 5/5/2026

    Paper 1 addresses a fundamental and highly active area in AI: improving general reasoning in LLMs through Reinforcement Learning. Its framework for data curation and verifiable rewards tackles a universal bottleneck in AI development, ensuring broad applicability and high citation potential across numerous domains. While Paper 2 presents an excellent, commercially viable solution for hardware design (EDA), its impact is largely confined to a specific subfield, making Paper 1's scientific breadth and general AI relevance significantly higher.

    vs. Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
    gemini-3 · 5/5/2026

    Paper 2 proposes a highly novel closed-loop, self-referential optimization framework for autonomous agents, addressing the critical challenge of open-ended self-improvement. While Paper 1 provides a strong, practical data curation method for LLM reinforcement learning, Paper 2's focus on mutual evolution and self-improving optimizers offers broader theoretical implications for AGI and autonomous systems, potentially sparking an entirely new direction in agentic optimization.

    vs. Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization
    gemini-3 · 5/5/2026

    Paper 2 introduces a self-referential optimization framework enabling mutual evolution between task and optimizer agents. This represents a significant paradigm shift toward open-ended AI self-improvement, offering higher theoretical novelty and broader long-term impact than Paper 1's data curation framework, which, while highly timely and practical, is an incremental optimization of existing RLVR methods.

    vs. Understanding and Enforcing Weight Disentanglement in Task Arithmetic
    gpt-5.2 · 5/5/2026

    Paper 2 has higher likely scientific impact due to stronger novelty and broader conceptual reach: it provides a unifying theoretical principle (Task-Feature Specialization) explaining task arithmetic, derives testable geometric consequences (orthogonality), and proposes a simple regularizer (OrthoReg) with theoretical guarantees and broad applicability to model editing/composition. This can influence multiple areas (fine-tuning, modularity, continual learning, representation learning). Paper 1 is valuable and timely for RLVR data curation, but is more empirical/engineering-focused and may generalize less beyond the specific RLVR setup and datasets.

    vs. Are Tools All We Need? Unveiling the Tool-Use Tax in LLM Agents
    claude-opus-4.6 · 5/5/2026

    SUPERNOVA addresses a more fundamental and broadly impactful challenge—extending RLVR to general reasoning beyond math/code—with a systematic data curation framework backed by 100+ controlled experiments. It demonstrates substantial improvements (up to 52.8% on BBEH) across model sizes and provides actionable insights for the community. Paper 1 identifies an interesting phenomenon (tool-use tax) and proposes a useful framework, but its scope is narrower, its proposed solution (G-STEP) yields only partial recovery, and the findings are more diagnostic than constructive. Paper 2's open-source contributions and broader applicability give it higher potential impact.

    vs. GR-Ben: A General Reasoning Benchmark for Evaluating Process Reward Models
    gemini-3 · 5/5/2026

    Paper 1 presents a novel data curation framework (SUPERNOVA) that actively improves the general reasoning capabilities of LLMs using Reinforcement Learning with Verifiable Rewards (RLVR). While Paper 2 offers a valuable benchmark for evaluation, Paper 1 provides a concrete methodological solution that yields significant downstream performance gains (up to 52.8%). Given the intense current focus on scaling RL for LLM reasoning, Paper 1's actionable insights and proven improvements offer higher potential for broad real-world application and immediate scientific impact.

    vs. Distill-Belief: Closed-Loop Inverse Source Localization and Characterization in Physical Fields
    gpt-5.2 · 4/30/2026

    Paper 2 (SUPERNOVA) has higher likely scientific impact due to broader applicability and timeliness: improving general reasoning in LLMs via RLVR data curation targets a central, fast-moving area with immediate downstream use across many domains. Its contributions (systematic framework + 100+ controlled experiments + strong benchmark gains) are widely reusable by the community and can influence dataset design, RL training practice, and evaluation. Paper 1 is methodologically solid and novel for active sensing/ISLC, but its impact is narrower (robotic/source localization in physical fields) and less cross-field than general LLM reasoning improvements.