CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
Linas Nasvytis, Simon Jerome Han, Ben Prystawski, Satchel Grant, Noah D. Goodman, Judith E. Fan
Abstract
Language models can use verifiable rewards to improve at a wide variety of reasoning tasks. However, both parametric (e.g. RLVR) and non-parametric (e.g. prompt optimization) approaches to doing so typically require hundreds of training samples and thousands of model rollouts, making them expensive in the best case and intractable in the worst. To address this challenge, we introduce Contrastive Reflection (CORE), a non-parametric learning algorithm that compares past reasoning traces to generate insights: short natural-language descriptions of reasoning strategies and constraints that capture differences between successful and unsuccessful problem attempts. Across four reasoning tasks, we demonstrate that CORE enables more rapid improvement than both parametric (GRPO) and non-parametric (GEPA, episodic RAG, and MemRL) methods, while using fewer rollouts. Under fixed rollout budgets with as few as five training samples, we then show that CORE also achieves comparable or greater performance gains than each baseline. Finally, we highlight how CORE is also substantially more context-efficient than non-parametric baselines, requiring fewer prompt tokens while storing learned knowledge as compact, interpretable natural-language insights. Our results therefore suggest that distilling contrasts between successful and unsuccessful reasoning traces into abstract and useful insights can provide a more efficient and interpretable route to model self-improvement than weight updates, prompt optimization, or direct reuse of stored reasoning traces.
AI Impact Assessments
Scientific Impact Assessment: CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning
1. Core Contribution
CORE introduces a non-parametric learning algorithm that enables frozen language models to improve at reasoning tasks by building a memory of natural-language "insights" — compact descriptions of reasoning strategies derived from contrasting successful and unsuccessful reasoning traces. The key innovation is the combination of three mechanisms: (1) contrastive reflection that compares failed attempts against semantically similar successful ones to generate candidate insights, (2) admission testing that gates insights through verifier feedback before storage, and (3) utility-aware retrieval that tracks and exploits empirical performance estimates for each insight across problems.
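To make the three mechanisms concrete, the following is a minimal Python sketch, not the authors' implementation. Every name in it (`Insight`, `InsightMemory`, `contrast`, `passes_with_insight`) is hypothetical, the Laplace-smoothed utility estimate is an assumption of this sketch, and the semantic-similarity matching of failed and successful traces is omitted. The language-model call that would produce an insight is replaced by a string template.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Insight:
    """A stored natural-language insight with running outcome counts."""
    text: str
    successes: int = 0
    trials: int = 0

    @property
    def utility(self) -> float:
        # Laplace-smoothed success rate (an assumption of this sketch,
        # not a formula taken from the paper).
        return (self.successes + 1) / (self.trials + 2)


def contrast(failed_trace: str, successful_trace: str) -> str:
    """Stand-in for contrastive reflection.

    A real system would prompt a language model with both traces;
    here a template string plays the role of the generated insight.
    """
    return f"Prefer: {successful_trace!r}; avoid: {failed_trace!r}"


class InsightMemory:
    def __init__(self) -> None:
        self.insights: List[Insight] = []

    def admit(self, candidate: str,
              passes_with_insight: Callable[[str], bool]) -> bool:
        """Admission test: store the candidate only if the verifier
        accepts an attempt that uses it."""
        if passes_with_insight(candidate):
            self.insights.append(Insight(candidate))
            return True
        return False

    def update(self, insight: Insight, solved: bool) -> None:
        """Record one verified outcome for an insight that was used."""
        insight.trials += 1
        insight.successes += int(solved)

    def retrieve(self, k: int) -> List[Insight]:
        """Utility-aware retrieval: the k insights with the highest
        estimated utility."""
        ranked = sorted(self.insights, key=lambda i: i.utility, reverse=True)
        return ranked[:k]
```

In use, a candidate from `contrast(...)` is passed to `admit` together with a callback that reruns the problem with the insight in the prompt and returns the verifier's verdict; `update` is then called after each later attempt so that `retrieve` gradually favors insights that have helped.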
The problem being addressed is clear and important: existing methods for learning from verifiable rewards, both parametric (e.g. GRPO) and non-parametric (e.g. GEPA), require substantial data and compute. CORE claims to achieve comparable or better performance with as few as 5 training examples and significantly fewer rollouts.
2. Methodological Rigor
The experimental design is generally sound, with four diverse reasoning tasks, three training-set sizes (5, 10, 100), three independent runs per condition, and multiple baselines spanning both parametric and non-parametric approaches. The inclusion of rollout-efficiency curves (Figure 2) and context-efficiency analysis (Figure 3) provides multi-dimensional evaluation.
However, several aspects weaken the rigor:
3. Potential Impact
CORE has several promising practical implications:
The broader impact on the field could be moderate. The idea of learning reusable abstractions from contrastive experience connects to well-established cognitive science principles (case comparison, complementary learning systems), and the paper makes these connections explicit. However, the approach is constrained to tasks with verifiable rewards, limiting its applicability.
4. Timeliness & Relevance
This work is highly timely. The community is actively exploring alternatives and complements to RLVR for reasoning improvement, and efficiency concerns (both data and compute) are increasingly prominent as models scale. The paper directly addresses the observation that current methods require thousands of rollouts and hundreds of examples — a real bottleneck. The cognitive science framing (contrastive learning, insight discovery, complementary memory systems) also connects to growing interest in cognitively-inspired AI architectures.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing around cognitive science (complementary learning systems, contrastive case comparison) is a strength for motivation, but the actual implementation diverges significantly from these theories; the connection is more inspirational than mechanistic. The paper would benefit from an analysis of failure modes: when do generated insights mislead the model, and how does insight quality vary across tasks?
The matchstick arithmetic benchmark and verifier are a useful secondary contribution, though the benchmark's difficulty calibration relative to the other tasks is not well characterized.
Generated May 28, 2026
Comparison History (17)
Paper 2 exposes a critical flaw in current LLM evaluation paradigms by revealing that improved final-answer accuracy in CoT distillation can mask degrading reasoning quality, especially in high-stakes domains like medicine. This finding challenges standard metrics and has broad implications for AI safety, evaluation methodology, and clinical applications. While Paper 1 presents a useful efficiency improvement for reasoning tasks, Paper 2's fundamental critique of how reasoning traces are assessed is likely to drive more significant shifts in how the field evaluates and deploys reasoning models.
While Paper 1 offers valuable efficiency improvements for LLM reasoning, Paper 2 addresses the critical and highly pressing challenge of scalable oversight in autonomous AI systems. By introducing a method with rigorous statistical guarantees using Conformal Decision Theory, Paper 2 provides a foundational contribution to AI safety and alignment, a field where theoretical guarantees are rare but urgently needed as models scale.
CORE introduces a fundamentally novel and elegant approach—contrastive reflection using natural language insights—that is more sample-efficient, interpretable, and broadly applicable across reasoning tasks. Its non-parametric nature makes it accessible and practical, addressing a key bottleneck (sample/rollout efficiency) that limits current RLVR approaches. TRACER, while technically sophisticated in combining game theory with multi-agent RL, addresses a narrower problem (multi-LLM collaboration) with higher complexity and more limited generalizability. CORE's simplicity, interpretability, and strong empirical results with minimal data suggest broader adoption potential and wider cross-field impact.
CORE introduces a novel, interpretable non-parametric learning paradigm that addresses a fundamental efficiency bottleneck in LLM reasoning improvement. Its ability to achieve strong performance with as few as 5 training samples and fewer rollouts than both parametric and non-parametric baselines represents a significant practical advance. The method's interpretability through natural-language insights adds unique value. Paper 2 (MARI) makes a solid contribution to representation intervention for alignment, but it is more incremental—refining existing intervention methods with adaptive mechanisms. CORE's broader applicability across reasoning tasks and its efficiency advantages give it higher potential impact.
Paper 2 addresses the highly critical and timely challenge of improving reasoning capabilities in Large Language Models. Its introduction of an efficient, non-parametric learning algorithm (CORE) offers broad implications across the rapidly expanding field of LLMs, promising significant reductions in compute costs while enhancing interpretability. In contrast, Paper 1, while methodologically sound, focuses on Multimodal Sentiment Analysis, which is a narrower subfield and thus likely to have a more constrained overall scientific and real-world impact compared to foundational improvements in LLM reasoning.
Paper 2 addresses a major bottleneck in LLM reasoning improvement—the high computational cost of parametric RL and prompt optimization. By proposing a non-parametric method that achieves significant gains with as few as five samples, it offers a highly accessible, sample-efficient, and interpretable solution. This rapid adaptation capability is likely to have a broader and more immediate impact across various applications compared to Paper 1's focus on multi-hop retrieval and self-bootstrapping, which, while valuable, is more specialized.
Paper 1 presents a more novel, broadly applicable non-parametric self-improvement method (contrastive reflection distilled into interpretable “insights”) that claims strong sample/rollout efficiency and context efficiency across multiple reasoning tasks, with comparisons to several parametric and non-parametric baselines. Its idea generalizes beyond a single domain (code) and could influence both training-time and inference-time adaptation, improving interpretability and efficiency—highly timely given costs of RLVR/RL. Paper 2 is valuable and practical, but offline RL for LLM post-training is a more incremental direction with narrower impact centered on code generation.
Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and formally connecting it to reward hacking—a central concern in AI safety. These results have broad, lasting implications for anyone using learned world models for planning. Paper 2 presents a useful practical method (CORE) for efficient reasoning improvement, but it is more incremental—a new prompting/non-parametric strategy among many—and its impact is more narrow and potentially shorter-lived as LLM techniques evolve rapidly. Paper 1's theoretical contributions are more foundational and likely to be widely cited across RL and AI safety.
CORE introduces a novel, broadly applicable approach to LLM self-improvement through contrastive reflection that is both sample-efficient and interpretable. It addresses a fundamental challenge in reasoning improvement across diverse tasks, with clear advantages over established methods (GRPO, prompt optimization). Paper 1, while methodologically sound, addresses a narrower application domain (product image generation for e-commerce). Paper 2's contributions to efficient non-parametric learning, interpretability, and reasoning improvement have broader impact potential across the rapidly growing LLM research community.
Paper 1 introduces a broadly applicable, non-parametric self-improvement algorithm (CORE) that achieves rapid reasoning gains with extremely low sample/rollout budgets while improving context efficiency and interpretability. This is methodologically focused, directly comparable against multiple baselines, and highly timely for scalable LLM alignment and reasoning—likely impacting many tasks and subfields (reasoning, optimization, interpretability, agentic learning). Paper 2 is strong and rigorous with clear real-world relevance in supply chains, but its primary impact is more domain-specific and leverages an existing post-training paradigm (GRPO) rather than a more general new learning mechanism.
Paper 1 addresses a critical bottleneck in LLM development: the high computational cost of improving reasoning capabilities. By introducing a highly efficient, non-parametric method (CORE) that requires significantly fewer samples and rollouts, it offers a scalable and interpretable solution for model self-improvement. This has widespread applicability across AI development and could shift how models are optimized for complex tasks, giving it a broader scientific and practical impact than the targeted uncertainty quantification approach in Paper 2.
Paper 1 addresses a highly prominent and timely challenge in AI: improving LLM reasoning efficiently. By introducing a non-parametric, highly context-efficient algorithm (CORE) that outperforms existing parametric and non-parametric baselines, it has broad applicability across all domains utilizing LLMs. While Paper 2 provides a valuable biomedical benchmark, Paper 1's methodological innovation in model self-improvement promises much wider scientific and practical impact across the rapidly growing field of natural language processing.
CORE addresses a fundamental and broadly applicable challenge in LLM reasoning improvement—sample and compute efficiency—with a novel contrastive reflection approach that outperforms multiple strong baselines across four tasks. Its practical applicability is immediate and broad, offering interpretable, compact insights with fewer rollouts. Paper 2, while addressing an interesting niche problem of intra-policy rule conflicts, has a narrower scope, more limited practical applicability, and introduces a diagnostic pipeline rather than a broadly transformative method. CORE's potential to influence the large and active RLVR/self-improvement research community gives it substantially higher impact.
CORE introduces a broadly applicable, novel learning paradigm (contrastive reflection for non-parametric self-improvement) that generalizes across reasoning tasks with strong efficiency gains over established methods like GRPO. Its contributions to sample efficiency, interpretability, and the general framework of LLM self-improvement have broader impact across multiple fields. MedGuideX, while valuable for clinical AI, addresses a narrower domain-specific problem (medical guideline internalization) with a more incremental contribution—transforming structured guidelines into training data. CORE's methodological novelty and cross-domain applicability give it higher potential impact.
Paper 2 addresses a highly timely and critical bottleneck in LLM development: the high cost and inefficiency of improving reasoning capabilities. By introducing a sample-efficient, non-parametric learning algorithm that outperforms standard RL and optimization baselines, it offers broad applicability across AI domains. While Paper 1 presents a strong edge-cloud speech translation system, Paper 2's fundamental methodological advancement in LLM reasoning self-improvement is likely to have a wider and more immediate impact across the rapidly moving field of generative AI.
Paper 1 offers a foundational advance in LLM self-improvement, a critical bottleneck in AI research. By introducing CORE, it provides an efficient, non-parametric method for improving reasoning that outperforms both parametric (RL) and non-parametric baselines. This has broad, cross-disciplinary impact for any domain relying on LLM reasoning. In contrast, Paper 2 presents a specialized, albeit valuable, application of multi-agent LLMs for supply chain knowledge graphs. Because Paper 1 tackles fundamental AI reasoning capabilities with a highly generalizable and compute-efficient approach, its potential scientific breadth and impact are significantly higher.
Paper 1 has higher estimated scientific impact due to broader, more directly actionable scope: a decentralized multi-agent framework for long-running scientific experimentation with demonstrated gains across diverse domains (biomedical ML benchmark, LLM training optimization, and protein fitness prediction) and large task coverage (24 BioML-Bench tasks; 217 ProteinGym assays). This suggests strong real-world applicability and cross-field impact. Paper 2 is novel and timely for efficient reasoning improvement, but its evidence is narrower (four tasks) and primarily within LM reasoning, implying a more limited immediate breadth and application range.