CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan

May 27, 2026

arXiv:2605.28742v1

cs.AI (primary)

#430 of 2682 · Artificial Intelligence

Tournament Score: 1488±48 (scale 1050 to 1800)

Win Rate: 76%

Rating: 6.8/10

Significance 7 · Rigor 6 · Novelty 7.5 · Clarity 8

Abstract

Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

1. Core Contribution

CORE introduces a non-parametric learning algorithm that enables frozen language models to improve at reasoning tasks by building a memory of natural-language "insights" — compact descriptions of reasoning strategies derived from contrasting successful and unsuccessful reasoning traces. The key innovation is the combination of three mechanisms: (1) contrastive reflection that compares failed attempts against semantically similar successful ones to generate candidate insights, (2) admission testing that gates insights through verifier feedback before storage, and (3) utility-aware retrieval that tracks and exploits empirical performance estimates for each insight across problems.
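The second and third mechanisms above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class names `Insight` and `InsightMemory`, the neutral prior of 0.5 for unused insights, and the UCB-style form of the exploration bonus weighted by `beta` are all assumptions made for illustration. Semantic-similarity retrieval and the contrastive reflection prompt itself are omitted.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Insight:
    text: str
    successes: int = 0  # verified successes while this insight was in the prompt
    uses: int = 0       # rollouts in which this insight was retrieved

    def utility(self) -> float:
        # Empirical success rate; 0.5 is an assumed neutral prior for unused insights.
        return self.successes / self.uses if self.uses else 0.5


class InsightMemory:
    def __init__(self, beta: float = 0.5):
        self.beta = beta  # weight of the (assumed) exploration bonus
        self.insights: list[Insight] = []

    def admit(self, text: str, score_with: float, score_without: float) -> bool:
        """Admission test: store the candidate only if the verifier score
        with the insight in the prompt beats the score without it."""
        if score_with > score_without:
            self.insights.append(Insight(text))
            return True
        return False

    def retrieve(self, k: int) -> list[Insight]:
        """Return the top-k insights by utility plus an exploration bonus
        that shrinks as an insight accumulates uses."""
        total = sum(i.uses for i in self.insights) + 1

        def score(i: Insight) -> float:
            bonus = self.beta * math.sqrt(math.log(total + 1) / (i.uses + 1))
            return i.utility() + bonus

        return sorted(self.insights, key=score, reverse=True)[:k]

    def update(self, retrieved: list[Insight], solved: bool) -> None:
        # Group-level credit: every retrieved insight shares the rollout outcome.
        for i in retrieved:
            i.uses += 1
            i.successes += int(solved)
```

In use, a candidate insight would pass through `admit` once, and each subsequent rollout would call `retrieve`, run the model with the returned insights in the prompt, and then call `update` with the verifier's verdict. Crediting all retrieved insights equally is the simplest choice and mirrors the group-level credit assignment that the assessment lists as a limitation.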

The problem being addressed is clear and important: existing methods for learning from verifiable rewards (both parametric like GRPO and non-parametric like GEPA) require substantial data and compute. CORE claims to achieve comparable or better performance with as few as 5 training examples and significantly fewer rollouts.

2. Methodological Rigor

The experimental design is generally sound, with four diverse reasoning tasks, three training-set sizes (5, 10, 100), three independent runs per condition, and multiple baselines spanning both parametric and non-parametric approaches. The inclusion of rollout-efficiency curves (Figure 2) and context-efficiency analysis (Figure 3) provides multi-dimensional evaluation.

However, several aspects weaken the rigor:

Single model evaluation: All experiments use GPT-OSS-120B exclusively. The authors note that CORE "generates more useful insights with larger model sizes (>30B parameters)" but provide no systematic analysis of this dependency. This raises questions about generalizability — does CORE work primarily because very large models are good at meta-reasoning?

Statistical reporting: Results are reported as mean ± SEM across only 3 runs, which provides limited statistical power. Several head-to-head comparisons have overlapping error bars (e.g., Tower of Hanoi with 5 training items: CORE 0.400±0.049 vs. MemRL 0.517±0.069, where the intervals overlap only marginally and MemRL has the higher mean).

Ablation scope: Ablations are conducted only on one task (Matchstick Arithmetic, 10 examples), making it unclear whether the same component contributions hold across tasks. Notably, CORE underperforms MemRL on Tower of Hanoi in 2 of 3 data regimes — understanding why would strengthen the analysis.

Cost accounting: While rollout counts include admission tests, the compute cost of the contrastive reflection step itself (which involves prompting the model to generate and filter insights) is not clearly accounted for. The total inference cost may be substantially higher than rollout counts suggest.

Hyperparameter sensitivity: The paper uses K=25 retrieved insights, Z nearest neighbors, β exploration bonuses, etc., but provides no sensitivity analysis for these choices.
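The error-bar concern raised under statistical reporting above can be checked mechanically. The sketch below assumes SEM is the sample standard deviation divided by the square root of the number of runs, and treats "overlapping error bars" as intersection of the mean ± SEM intervals; the helper names `mean_sem` and `intervals_overlap` are hypothetical. Only the Tower of Hanoi figures quoted in the assessment are used.

```python
import math
import statistics


def mean_sem(values):
    """Mean and standard error of the mean (sample stdev / sqrt(n))."""
    m = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return m, sem


def intervals_overlap(m1, s1, m2, s2):
    """True if [m1 - s1, m1 + s1] and [m2 - s2, m2 + s2] intersect."""
    return (m1 - s1) <= (m2 + s2) and (m2 - s2) <= (m1 + s1)


# Figures quoted above: CORE 0.400±0.049 vs. MemRL 0.517±0.069.
print(intervals_overlap(0.400, 0.049, 0.517, 0.069))  # True (0.449 vs. 0.448)
```

Overlap of ±1 SEM bars is only a rough heuristic. With three runs per condition, a formal test would have very little power either way, which is the substance of the concern.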

3. Potential Impact

CORE has several promising practical implications:

Low-data reasoning improvement: The ability to improve with 5-10 training examples addresses a genuine bottleneck for deploying reasoning models in specialized domains where training data is scarce.

Interpretability: Storing learning artifacts as natural-language insights with utility estimates is a meaningful advantage over weight updates or opaque prompt optimization. Table 2 demonstrates that insights are human-readable and categorizable, which has value for safety and debugging.

Context efficiency: Using ~36× fewer tokens than episodic RAG while achieving better performance is practically significant for deployment costs.

Modular and combinable: The authors correctly identify that CORE could complement RLVR-style training, and the non-parametric nature means insights could potentially transfer across model versions.

The broader impact on the field could be moderate. The idea of learning reusable abstractions from contrastive experience connects to well-established cognitive science principles (case comparison, complementary learning systems), and the paper makes these connections explicit. However, the approach is constrained to tasks with verifiable rewards, limiting its applicability.

4. Timeliness & Relevance

This work is highly timely. The community is actively exploring alternatives and complements to RLVR for reasoning improvement, and efficiency concerns (both data and compute) are increasingly prominent as models scale. The paper directly addresses the observation that current methods require thousands of rollouts and hundreds of examples — a real bottleneck. The cognitive science framing (contrastive learning, insight discovery, complementary memory systems) also connects to growing interest in cognitively-inspired AI architectures.

5. Strengths & Limitations

Key Strengths:

Clear, well-motivated algorithm with principled components (contrastive generation, admission gating, utility-aware retrieval)

Strong empirical results on rollout efficiency — CORE at 350 rollouts exceeds all baselines at 4000 rollouts (Figure 2)

Interpretable learning artifacts that could enable debugging and trust

Good selection of diverse reasoning tasks, including the novel matchstick arithmetic domain

Ablations confirming that both contrastive reflection and utility-aware retrieval are necessary

Notable Limitations:

Single-model evaluation (GPT-OSS-120B only) limits generalizability claims

The approach inherently depends on the model's meta-cognitive ability to generate useful insights — this may not work for smaller or less capable models

Restricted to verifiable-reward settings

Tower of Hanoi results are mixed (MemRL outperforms in 2/3 regimes), suggesting CORE may struggle with certain task structures

No analysis of insight quality degradation or memory management as the number of insights grows over longer training horizons

The matchstick arithmetic task/verifier is novel and contributed by the authors, making external validation difficult

Group-level credit assignment for retrieved insights is acknowledged as a limitation but not addressed

Additional Observations

The paper's framing around cognitive science (complementary learning systems, contrastive case comparison) is a strength for motivation, but the actual implementation diverges significantly from these theories; the connection is more inspirational than mechanistic. The paper would benefit from an analysis of failure modes: when do generated insights mislead the model? How does insight quality vary across tasks?

The contribution of the matchstick arithmetic benchmark and verifier is a useful secondary contribution, though its difficulty calibration relative to other tasks is not well characterized.

Rating: 6.8/10

Significance 7 · Rigor 6 · Novelty 7.5 · Clarity 8

Generated May 28, 2026

Comparison History (17)

vs. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

gemini-3.1 · 5/28/2026

Paper 2 exposes a critical flaw in current LLM evaluation paradigms by revealing that improved final-answer accuracy in CoT distillation can mask degrading reasoning quality, especially in high-stakes domains like medicine. This finding challenges standard metrics and has broad implications for AI safety, evaluation methodology, and clinical applications. While Paper 1 presents a useful efficiency improvement for reasoning tasks, Paper 2's fundamental critique of how reasoning traces are assessed is likely to drive more significant shifts in how the field evaluates and deploys reasoning models.

vs. Calibrating Conservatism for Scalable Oversight

gemini-3.1 · 5/28/2026

While Paper 1 offers valuable efficiency improvements for LLM reasoning, Paper 2 addresses the critical and highly pressing challenge of scalable oversight in autonomous AI systems. By introducing a method with rigorous statistical guarantees using Conformal Decision Theory, Paper 2 provides a foundational contribution to AI safety and alignment, a field where theoretical guarantees are rare but urgently needed as models scale.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

claude-opus-4.6 · 5/28/2026

CORE introduces a fundamentally novel and elegant approach—contrastive reflection using natural language insights—that is more sample-efficient, interpretable, and broadly applicable across reasoning tasks. Its non-parametric nature makes it accessible and practical, addressing a key bottleneck (sample/rollout efficiency) that limits current RLVR approaches. TRACER, while technically sophisticated in combining game theory with multi-agent RL, addresses a narrower problem (multi-LLM collaboration) with higher complexity and more limited generalizability. CORE's simplicity, interpretability, and strong empirical results with minimal data suggest broader adoption potential and wider cross-field impact.

vs. Multi-Adapter Representation Interventions via Energy Calibration

claude-opus-4.6 · 5/28/2026

CORE introduces a novel, interpretable non-parametric learning paradigm that addresses a fundamental efficiency bottleneck in LLM reasoning improvement. Its ability to achieve strong performance with as few as 5 training samples and fewer rollouts than both parametric and non-parametric baselines represents a significant practical advance. The method's interpretability through natural-language insights adds unique value. Paper 2 (MARI) makes a solid contribution to representation intervention for alignment, but it is more incremental—refining existing intervention methods with adaptive mechanisms. CORE's broader applicability across reasoning tasks and its efficiency advantages give it higher potential impact.

vs. A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

gemini-3.1 · 5/28/2026

Paper 2 addresses the highly critical and timely challenge of improving reasoning capabilities in Large Language Models. Its introduction of an efficient, non-parametric learning algorithm (CORE) offers broad implications across the rapidly expanding field of LLMs, promising significant reductions in compute costs while enhancing interpretability. In contrast, Paper 1, while methodologically sound, focuses on Multimodal Sentiment Analysis, which is a narrower subfield and thus likely to have a more constrained overall scientific and real-world impact compared to foundational improvements in LLM reasoning.

vs. Plan Before Search: Search Agents Need Plan

gemini-3.1 · 5/28/2026

Paper 2 addresses a major bottleneck in LLM reasoning improvement—the high computational cost of parametric RL and prompt optimization. By proposing a non-parametric method that achieves significant gains with as few as five samples, it offers a highly accessible, sample-efficient, and interpretable solution. This rapid adaptation capability is likely to have a broader and more immediate impact across various applications compared to Paper 1's focus on multi-hop retrieval and self-bootstrapping, which, while valuable, is more specialized.

vs. Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

gpt-5.2 · 5/28/2026

Paper 1 presents a more novel, broadly applicable non-parametric self-improvement method (contrastive reflection distilled into interpretable “insights”) that claims strong sample/rollout efficiency and context efficiency across multiple reasoning tasks, with comparisons to several parametric and non-parametric baselines. Its idea generalizes beyond a single domain (code) and could influence both training-time and inference-time adaptation, improving interpretability and efficiency—highly timely given costs of RLVR/RL. Paper 2 is valuable and practical, but offline RL for LLM post-training is a more incremental direction with narrower impact centered on code generation.

vs. Imperfect World Models are Exploitable

claude-opus-4.6 · 5/28/2026

Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and formally connecting it to reward hacking—a central concern in AI safety. These results have broad, lasting implications for anyone using learned world models for planning. Paper 2 presents a useful practical method (CORE) for efficient reasoning improvement, but it is more incremental—a new prompting/non-parametric strategy among many—and its impact is more narrow and potentially shorter-lived as LLM techniques evolve rapidly. Paper 1's theoretical contributions are more foundational and likely to be widely cited across RL and AI safety.

vs. Utility-Aware Multimodal Contrastive Learning for Product Image Generation

claude-opus-4.6 · 5/28/2026

CORE introduces a novel, broadly applicable approach to LLM self-improvement through contrastive reflection that is both sample-efficient and interpretable. It addresses a fundamental challenge in reasoning improvement across diverse tasks, with clear advantages over established methods (GRPO, prompt optimization). Paper 1, while methodologically sound, addresses a narrower application domain (product image generation for e-commerce). Paper 2's contributions to efficient non-parametric learning, interpretability, and reasoning improvement have broader impact potential across the rapidly growing LLM research community.

vs. Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

gpt-5.2 · 5/28/2026

Paper 1 introduces a broadly applicable, non-parametric self-improvement algorithm (CORE) that achieves rapid reasoning gains with extremely low sample/rollout budgets while improving context efficiency and interpretability. This is methodologically focused, directly comparable against multiple baselines, and highly timely for scalable LLM alignment and reasoning—likely impacting many tasks and subfields (reasoning, optimization, interpretability, agentic learning). Paper 2 is strong and rigorous with clear real-world relevance in supply chains, but its primary impact is more domain-specific and leverages an existing post-training paradigm (GRPO) rather than a more general new learning mechanism.

vs. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical bottleneck in LLM development: the high computational cost of improving reasoning capabilities. By introducing a highly efficient, non-parametric method (CORE) that requires significantly fewer samples and rollouts, it offers a scalable and interpretable solution for model self-improvement. This has widespread applicability across AI development, potentially fundamentally shifting how models are optimized for complex tasks, giving it a broader scientific and practical impact than the targeted uncertainty quantification approach in Paper 2.

vs. PrimeKG-CL: A Continual Graph Learning Benchmark on Evolving Biomedical Knowledge Graphs

gemini-3.1 · 5/28/2026

Paper 1 addresses a highly prominent and timely challenge in AI: improving LLM reasoning efficiently. By introducing a non-parametric, highly context-efficient algorithm (CORE) that outperforms existing parametric and non-parametric baselines, it has broad applicability across all domains utilizing LLMs. While Paper 2 provides a valuable biomedical benchmark, Paper 1's methodological innovation in model self-improvement promises much wider scientific and practical impact across the rapidly growing field of natural language processing.

vs. Diagnosing Live Within-Policy Instruction Conflicts in LLM Agents with Witnessed Resolution Profiles

claude-opus-4.6 · 5/28/2026

CORE addresses a fundamental and broadly applicable challenge in LLM reasoning improvement—sample and compute efficiency—with a novel contrastive reflection approach that outperforms multiple strong baselines across four tasks. Its practical applicability is immediate and broad, offering interpretable, compact insights with fewer rollouts. Paper 2, while addressing an interesting niche problem of intra-policy rule conflicts, has a narrower scope, more limited practical applicability, and introduces a diagnostic pipeline rather than a broadly transformative method. CORE's potential to influence the large and active RLVR/self-improvement research community gives it substantially higher impact.

vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

claude-opus-4.6 · 5/28/2026

CORE introduces a broadly applicable, novel learning paradigm (contrastive reflection for non-parametric self-improvement) that generalizes across reasoning tasks with strong efficiency gains over established methods like GRPO. Its contributions to sample efficiency, interpretability, and the general framework of LLM self-improvement have broader impact across multiple fields. MedGuideX, while valuable for clinical AI, addresses a narrower domain-specific problem (medical guideline internalization) with a more incremental contribution—transforming structured guidelines into training data. CORE's methodological novelty and cross-domain applicability give it higher potential impact.

vs. Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

gemini-3.1 · 5/28/2026

Paper 2 addresses a highly timely and critical bottleneck in LLM development: the high cost and inefficiency of improving reasoning capabilities. By introducing a sample-efficient, non-parametric learning algorithm that outperforms standard RL and optimization baselines, it offers broad applicability across AI domains. While Paper 1 presents a strong edge-cloud speech translation system, Paper 2's fundamental methodological advancement in LLM reasoning self-improvement is likely to have a wider and more immediate impact across the rapidly moving field of generative AI.

vs. Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

gemini-3.1 · 5/28/2026

Paper 1 offers a foundational advance in LLM self-improvement, a critical bottleneck in AI research. By introducing CORE, it provides an efficient, non-parametric method for improving reasoning that outperforms expensive RL and parametric baselines. This has broad, cross-disciplinary impact for any domain relying on LLM reasoning. In contrast, Paper 2 presents a specialized, albeit valuable, application of multi-agent LLMs for supply chain knowledge graphs. Because Paper 1 tackles fundamental AI reasoning capabilities with a highly generalizable and compute-efficient approach, its potential scientific breadth and impact are significantly higher.

vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

gpt-5.2 · 5/28/2026

Paper 1 has higher estimated scientific impact due to broader, more directly actionable scope: a decentralized multi-agent framework for long-running scientific experimentation with demonstrated gains across diverse domains (biomedical ML benchmark, LLM training optimization, and protein fitness prediction) and large task coverage (24 BioML-Bench tasks; 217 ProteinGym assays). This suggests strong real-world applicability and cross-field impact. Paper 2 is novel and timely for efficient reasoning improvement, but its evidence is narrower (four tasks) and primarily within LM reasoning, implying a more limited immediate breadth and application range.