Can Large Language Models Reinvent Foundational Algorithms?

Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He

Apr 7, 2026

arXiv:2604.05716v1

cs.AI (primary)

#170 of 2292 · Artificial Intelligence

Tournament Score

1525 ± 21 (scale 1050 to 1800)

Win Rate: 66%

Matches: 111

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 7.5

Abstract

LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra's or Euclid's algorithm, from an LLM's pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model Qwen3-4B-Thinking-2507 successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for those complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention for the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that the generative verifier in the reinvention phase plays a critical role in sustaining models' reasoning strength, helping to avoid the "thought collapse" phenomenon. These findings offer insights into both the potential and current limits of LLMs' innovative thinking.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: "Can Large Language Models Reinvent Foundational Algorithms?"

1. Core Contribution

This paper introduces the Unlearn-and-Reinvent pipeline — a two-phase methodology that first removes a specific foundational algorithm from an LLM's pretrained knowledge via unlearning, then tests whether the model can independently reinvent it. The key innovation is using LLM unlearning as a proxy for "clean-slate" evaluation of inventive capacity, avoiding the prohibitive cost of retraining models from scratch without target algorithm data. The pipeline includes a GRPO-based on-policy unlearning method with a cold-start stage, a multi-round reinvention framework with a generative verifier, and hierarchical hint levels to probe the boundary between retrieval and genuine reasoning.
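
The reinvention phase described above can be pictured as a loop in which a solver proposes a solution, a verifier critiques it, and the critique is fed into the next round. The following is a minimal sketch under stated assumptions, not the authors' implementation: `reinvent`, `Solver`, `Verifier`, the round limit, and the rule that hint level k exposes the first k hints are all hypothetical, and the toy solver and verifier stand in for the unlearned model so the example runs without any LLM.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# All names are hypothetical; none are taken from the paper's released code.
Solver = Callable[[str, List[str]], str]     # (prompt, feedback so far) -> candidate
Verifier = Callable[[str], Optional[str]]    # candidate -> None if accepted, else critique


@dataclass
class ReinventionResult:
    solved: bool
    rounds_used: int
    feedback: List[str]


def reinvent(task: str, hints: List[str], hint_level: int,
             solver: Solver, verifier: Verifier,
             max_rounds: int = 4) -> ReinventionResult:
    """Run up to max_rounds attempts, feeding verifier critiques back to the solver."""
    # Assumption: hint level k exposes the first k hints (level 0 means no hint).
    prompt = "\n".join([task] + hints[:hint_level])
    feedback: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        candidate = solver(prompt, feedback)
        critique = verifier(candidate)
        if critique is None:
            return ReinventionResult(True, round_idx, feedback)
        feedback.append(critique)
    return ReinventionResult(False, max_rounds, feedback)


# Toy stand-ins so the sketch runs without a model.
def toy_solver(prompt: str, feedback: List[str]) -> str:
    # Pretends to improve only after receiving one critique.
    return "gcd(a, b) = gcd(b, a % b)" if feedback else "try every divisor"


def toy_verifier(candidate: str) -> Optional[str]:
    return None if "%" in candidate else "Too slow: avoid enumerating divisors."


result = reinvent("Compute gcd(a, b) efficiently.", ["Consider remainders."], 0,
                  toy_solver, toy_verifier)
print(result.solved, result.rounds_used)  # True 2
```

Keeping the solver and verifier as plain callables makes it easy to swap the verifier out, which is the kind of ablation the assessment later credits with exposing thought collapse.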

The paper addresses a genuinely important question: can LLMs produce foundational innovations rather than merely recombine memorized knowledge? By operationalizing this through algorithm reinvention, the authors create a tractable experimental framework for a question that is otherwise philosophically elusive.

2. Methodological Rigor

Strengths in methodology:

The experimental design is systematic: 10 algorithms spanning graph theory, string processing, number theory, and linear algebra; 3 models of varying capability; 3 hint levels; 128 trials per condition with 8 problem variants per algorithm.

The GRPO-based unlearning method is well-motivated, with a carefully designed three-attribute reward function (knowledge disclosure, name corruption, readability) that addresses specific reward-hacking failure modes documented with concrete examples.

The cold-start stage addresses the practical problem of zero initial reward signals.

Robustness checks include distillation experiments (Table 7) and comparison of GRPO against NPO, DPO, and GradAscent baselines (Table 8).
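
To give a sense of scale for the experimental design listed above, the short calculation below multiplies out the reported factors. It assumes that a "condition" means one (algorithm, model, hint level) combination and that the 128 trials are split evenly across the 8 problem variants; neither assumption is confirmed by the assessment.

```python
# Figures taken from the assessment text.
algorithms = 10
models = 3
hint_levels = 3
trials_per_condition = 128
variants_per_algorithm = 8

# Assumption: one condition = one (algorithm, model, hint level) combination.
conditions = algorithms * models * hint_levels
total_trials = conditions * trials_per_condition

# Assumption: trials are divided evenly among the problem variants.
trials_per_variant = trials_per_condition // variants_per_algorithm

print(conditions, total_trials, trials_per_variant)  # 90 11520 16
```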

Weaknesses in methodology:

The central assumption — that unlearning fully removes target knowledge — is acknowledged but not adequately resolved. The authors cannot rule out that residual knowledge in internal representations subtly guides reinvention. This is a fundamental threat to validity. The forgetting rate evaluation relies on an LLM-as-judge approach, which may miss implicit knowledge leakage below the surface of generated text.

The choice of relatively small models (4B-14B) raises questions about scalability. Would frontier models (70B+) show qualitatively different behavior?

The generative verifier is instantiated from the unlearned model itself, creating a potential confound: if the verifier retains implicit knowledge about the target algorithm, its feedback could indirectly guide reinvention.

The test-time RL result for Strassen depends on the level 2 hint, which supplies nearly complete algorithmic pseudocode. Calling this "reinvention" is a stretch: the task is closer to debugging a provided solution than to genuine invention.

3. Potential Impact

The paper contributes to a timely and high-profile debate about AI's capacity for scientific discovery. The Unlearn-and-Reinvent paradigm could become a standard evaluation methodology for probing LLM creativity and reasoning depth beyond memorization. Specific contributions with lasting impact include:

The "thought collapse" phenomenon: The finding that without verifier feedback, models progressively reduce exploration effort across rounds, eventually blaming the environment rather than improving their solutions. This has implications for any multi-turn LLM reasoning system.

The difficulty gradient across algorithms: The clear separation between reinventable algorithms (Gray, Euclidean, Floyd-Warshall) and resistant ones (KMP, Strassen, Manacher) provides empirical evidence for a taxonomy of algorithmic difficulty from an LLM reasoning perspective.

Practical insights for AI-driven research: The effectiveness of high-level hints versus step-by-step guidance informs how human-AI collaboration should be structured for algorithmic discovery.

However, the narrow scope (10 CS algorithms) and the fundamental uncertainty about residual knowledge limit the strength of conclusions about LLM "innovation" capacity more broadly.
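
The thought-collapse pattern noted above, in which exploration effort shrinks from round to round, could be flagged in trajectory logs with a simple heuristic. The sketch below is one hypothetical way to do so and is not the paper's analysis: `shows_thought_collapse`, the `drop_ratio` threshold, and the token counts are all invented for illustration.

```python
from typing import Sequence


def shows_thought_collapse(tokens_per_round: Sequence[int],
                           drop_ratio: float = 0.5) -> bool:
    """Flag a trajectory whose reasoning length never grows and
    ends below drop_ratio times its first-round length."""
    if len(tokens_per_round) < 2 or tokens_per_round[0] <= 0:
        return False
    non_increasing = all(
        later <= earlier
        for earlier, later in zip(tokens_per_round, tokens_per_round[1:])
    )
    return non_increasing and tokens_per_round[-1] < drop_ratio * tokens_per_round[0]


# Made-up numbers for illustration only; not data from the paper.
print(shows_thought_collapse([4000, 2500, 900, 300]))    # True
print(shows_thought_collapse([4000, 4200, 3900, 4100]))  # False
```

A length-based proxy like this only captures the shrinking-effort part of the description; detecting the shift toward blaming the environment would need inspection of the generated text itself.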

4. Timeliness & Relevance

The paper is highly timely. With systems like FunSearch, AlphaEvolve, and various AI-for-science initiatives claiming algorithmic breakthroughs, the community needs rigorous frameworks to distinguish genuine reasoning from sophisticated retrieval. The concurrent work by Yang (2025) proposing a similar conceptual framework validates the timeliness of this research direction.

The paper also connects to active research in LLM unlearning, test-time computation, and generative verification, making it relevant across multiple subcommunities.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated experimental paradigm that operationalizes a philosophical question

Comprehensive experimental coverage with systematic variation of models, algorithms, and hint levels

Identification of thought collapse as a concrete, analyzable failure mode

Thorough ablation studies (verifier settings, unlearning methods, distillation robustness)

Good reproducibility: code released, detailed prompts and configurations provided

Notable Limitations:

The validity gap between unlearning and true absence of knowledge is the elephant in the room. Without retraining experiments (even on smaller models), the proxy nature of unlearning limits interpretive confidence.

The 10-algorithm benchmark, while carefully selected, is narrow. Many are well-known textbook algorithms with extensive online discussion, making them likely over-represented in training data in ways that unlearning may not fully address.

The paper conflates "reinvention" with "solving a constrained programming problem." True algorithmic invention involves identifying that a problem exists and formulating it, not just solving a well-specified task.

The hint system, particularly level 2, provides so much information that success becomes more about code implementation than algorithmic invention.

Model scale is limited to 4B-14B; the most interesting results would come from frontier-scale models.

6. Additional Observations

The paper's writing is clear and the figures are effective, particularly Figure 2's trajectory comparison. The appendix is exceptionally thorough, with detailed case studies that substantiate the claims. The connection between verifier feedback and sustained reasoning depth is the paper's most actionable finding for the broader LLM research community.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 7 · Clarity 7.5

Generated Apr 8, 2026

Comparison History (111)

vs. Orchard: An Open-Source Agentic Modeling Framework

gemini-3.1 · 5/16/2026

Paper 2 presents a highly practical, open-source framework for training autonomous agents, achieving state-of-the-art results across multiple domains. Its provision of reusable infrastructure, datasets, and training recipes ensures immediate and widespread adoption by researchers and developers. While Paper 1 asks a profound theoretical question about LLM innovation, Paper 2 provides tools and methodologies that directly advance the rapidly growing field of agentic AI, resulting in higher potential for broad, real-world impact and citations.

vs. To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

gemini-3.1 · 5/16/2026

Paper 1 addresses an urgent, high-stakes issue in AI safety and alignment with immediate real-world consequences in critical domains like medicine and law. Its discovery that reasoning models deliberately suppress known safety risks under authority pressure exposes a severe vulnerability in current alignment methods. While Paper 2 presents a highly novel methodology for testing AI innovation capabilities, Paper 1 offers broader, more actionable societal and policy impact regarding the safe deployment of frontier models.

vs. Knows: Agent-Native Structured Research Representations

gpt-5.2 · 5/5/2026

Paper 1 has higher impact potential: it introduces a practical, lightweight standard (YAML sidecar + schema linter) that can be broadly adopted across disciplines to make papers agent-consumable, with strong empirical gains (accuracy and token reductions) and clear real-world utility (search, review, meta-analysis, automation). Its breadth and timeliness align with accelerating agentic workflows, and adoption evidence (10k+ indexed publications) suggests scalability. Paper 2 is novel and timely for AI science, but its impact is narrower (LLM capability probing), more dependent on specific models/unlearning setups, and less directly transferable to cross-field workflows.

vs. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

gemini-3 · 5/5/2026

Paper 2 addresses a fundamental question about the capacity of LLMs for true foundational innovation, rather than mere memorization. By utilizing a novel 'Unlearn-and-Reinvent' pipeline, it provides profound insights into the limits and capabilities of LLM reasoning and scientific discovery. This has broader, more transformative implications for AGI and AI methodology compared to Paper 1, which, while methodologically sound, focuses on more specific architectural improvements within multi-agent systems.

vs. Knows: Agent-Native Structured Research Representations

gpt-5.2 · 5/5/2026

Paper 1 is more likely to have higher scientific impact due to broader, immediate real-world applicability: a lightweight, backward-compatible structured sidecar for papers that directly improves LLM/agent comprehension while reducing token cost. It spans many disciplines, has a deterministic schema/linter (strong rigor for a spec), and includes multi-model evaluation plus evidence of large-scale adoption (10k+ indexed). Paper 2 is novel and timely for AI cognition, but its impact is narrower (CS algorithms/LLM capability probing) and depends on specialized unlearning setups that may be less broadly deployable.

vs. SkillGraph: Self-Evolving Multi-Agent Collaboration with Multimodal Graph Topology

gpt-5.2 · 5/5/2026

Paper 1 targets a foundational, broadly relevant question—whether LLMs can (re)derive core algorithms under controlled “unlearning,” offering a novel experimental paradigm with clear methodological knobs (unlearning, hint levels, test-time RL, ablations) and insights like thought-collapse mitigation via a generative verifier. Its implications span AI evaluation, mechanistic understanding, and scientific discovery claims, making impact potentially cross-field and timely. Paper 2 is a solid systems contribution for multi-agent VLMs, but is more incremental (dynamic topology + skill refinement) and likely narrower in long-term scientific reach.

vs. The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

gemini-3 · 5/5/2026

Paper 1 addresses a critical bottleneck in LLM reasoning by introducing a novel, practical algorithm for inference-time compute scaling. Its approach to improving accuracy-runtime trade-offs without retraining aligns perfectly with current industry trends toward test-time compute. While Paper 2 offers an intriguing empirical study on LLM capabilities, Paper 1 provides a concrete methodological advancement with immediate real-world applications and direct impact on deploying advanced reasoning systems.

vs. The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

gemini-3 · 5/5/2026

Paper 1 introduces a novel, principled decoding algorithm (APPS) that directly improves the accuracy-runtime trade-off for LLM reasoning at inference time. This methodological advancement has immediate, widespread practical applications across any domain using LLMs. While Paper 2 offers interesting empirical insights into LLM capabilities, its reliance on unlearning makes it harder to generalize, whereas Paper 1 provides a concrete tool to unlock existing model capabilities, likely leading to broader adoption and higher impact.

vs. SAVE: A Generalizable Framework for Multi-Condition Single-Cell Generation with Gene Block Attention

gemini-3 · 5/5/2026

Paper 2 presents a generalizable framework with direct, high-impact applications in computational biology, disease modeling, and drug discovery. While Paper 1 offers a novel and intriguing evaluation of LLM capabilities, Paper 2 provides a tangible solution to complex single-cell data modeling challenges, likely leading to broader real-world utility and immediate scientific advancements in bioinformatics and virtual cell synthesis.

vs. Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

claude-opus-4.6 · 5/5/2026

Paper 1 addresses a more fundamental and timely question—whether LLMs can achieve foundational algorithmic innovation—introducing a novel 'Unlearn-and-Reinvent' pipeline with broad implications for AI-driven scientific discovery. Its methodology (GRPO-based unlearning, systematic evaluation across 10 algorithms and multiple models) is rigorous and opens new research directions. Paper 2, while practically useful, offers a more incremental diagnostic contribution to prompt optimization, a narrower subfield. Paper 1's findings about LLM creative capacity have broader impact across AI, cognitive science, and scientific discovery.

vs. Mistake gating leads to energy and memory efficient continual learning

gemini-3 · 5/5/2026

Paper 1 addresses a fundamental and highly timely question regarding the true reasoning and innovative capabilities of LLMs versus memorization. By introducing an unlearn-and-reinvent pipeline, it provides a novel methodological framework to rigorously test AI capabilities. This has profound implications for AI's role in scientific discovery. While Paper 2 offers a highly practical continual learning algorithm, Paper 1's exploration of foundational AI reasoning limits is likely to spur broader theoretical and empirical research across the AI community.

vs. RankGuide: Tensor-Rank-Guided Routing and Steering for Efficient Reasoning

gemini-3 · 5/5/2026

Paper 1 addresses a fundamental scientific question: whether LLMs are capable of true innovation rather than mere memorization. Its novel methodology—unlearning foundational algorithms to test reinvention capabilities—provides critical insights into the reasoning limits and potential for AGI. While Paper 2 offers a highly valuable technical solution for inference efficiency and model routing, Paper 1 has a broader scientific impact, significantly advancing our theoretical understanding of machine intelligence, cognitive capacities, and the future role of LLMs in autonomous scientific discovery.

vs. Response-Aware User Memory Selection for LLM Personalization

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact: it introduces a broadly applicable, information-theoretic objective (mutual-information-based selection) for personalization that can improve quality while drastically reducing compute, making it immediately deployable across many LLM systems. Its methodological framing (response-aware utility vs. similarity heuristics) generalizes to retrieval/memory selection beyond personalization, with clear real-world benefits (cost, latency, accuracy). Paper 1 is novel and insightful for understanding LLM innovation limits, but is more diagnostic/experimental with narrower direct application and potentially higher sensitivity to unlearning setup details.

vs. How Adversarial Environments Mislead Agentic AI?

gpt-5.2 · 4/22/2026

Paper 2 likely has higher impact: it introduces a timely, security-critical threat model (AEI) for tool-using agents, formalizes the “Trust Gap,” and provides an MCP-compatible evaluation harness (POTEMKIN) that can be widely adopted in real deployments. Its findings across large-scale experiments (11,000+ runs, multiple frontier agents) suggest broadly relevant failure modes (epistemic vs navigational robustness) with clear practical implications for agent design, evaluation, and safety. Paper 1 is novel and insightful but more specialized to LLM cognition and unlearning methodology, with less immediate real-world risk/mitigation leverage.

vs. How Adversarial Environments Mislead Agentic AI?

claude-opus-4.6 · 4/22/2026

Paper 2 addresses a more fundamental question about LLM capabilities—whether they can achieve foundational innovation—with a novel 'Unlearn-and-Reinvent' methodology that combines machine unlearning with algorithm reinvention. This has broader implications for understanding AI creativity and scientific discovery. Paper 1, while practically important for AI security, addresses a more specific vulnerability in tool-integrated agents. Paper 2's methodology is more innovative, its findings about thought collapse and the role of generative verifiers offer deeper scientific insights, and its relevance spans AI reasoning, creativity, and scientific discovery—giving it wider interdisciplinary impact.

vs. MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval

gemini-3 · 4/21/2026

Paper 1 offers higher potential scientific impact by introducing a comprehensive, large-scale benchmark for mathematical reasoning and retrieval. High-quality benchmarks historically act as standard evaluation metrics, driving widespread adoption and guaranteeing high citation counts across the AI community. While Paper 2 presents a highly novel and fascinating theoretical probe into LLM creativity, Paper 1 provides essential infrastructure that addresses an immediate bottleneck in AI, supporting practical applications in retrieval-augmented generation and next-generation multimodal model evaluation.

vs. Training LLM Agents for Spontaneous, Reward-Free Self-Evolution via World Knowledge Exploration

claude-opus-4.6 · 4/21/2026

Paper 1 addresses a more fundamental scientific question—whether LLMs can achieve foundational innovation—with a novel 'Unlearn-and-Reinvent' methodology combining LLM unlearning with algorithm reinvention. This probes the nature of LLM reasoning and creativity, with broad implications for AI-driven scientific discovery. The rigorous experimental design across 10 algorithms, multiple models, and hint levels, plus insights like 'thought collapse,' contribute lasting knowledge. Paper 2, while practically impactful with strong benchmark results, addresses a more incremental advance in agent self-evolution with narrower conceptual novelty. Paper 1's fundamental insights have broader cross-field impact.

vs. LLM Reasoning Is Latent, Not the Chain of Thought

gemini-3 · 4/20/2026

Paper 1 proposes a fundamental paradigm shift in understanding LLM reasoning, moving the focus from surface Chain-of-Thought to latent-state dynamics. This theoretical reframing has broad implications across interpretability, benchmarking, and model intervention, potentially redirecting the methodology of future LLM research. While Paper 2 offers an innovative empirical pipeline for testing capabilities, Paper 1's conceptual contribution addresses the foundational mechanics of how LLMs operate, giving it a higher potential for deep, field-wide scientific impact.

vs. MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition

gemini-3 · 4/20/2026

Paper 2 addresses a fundamental grand challenge in AI: whether LLMs possess true capacity for foundational innovation rather than mere memorization. Its highly novel 'Unlearn-and-Reinvent' methodology rigorously isolates reasoning from pretraining data. This provides profound insights into AGI capabilities, test-time compute, and algorithmic reasoning, giving it broader, field-defining implications compared to the metacognitive benchmarking in Paper 1.

vs. DeepER-Med: Advancing Deep Evidence-Based Research in Medicine Through Agentic AI

claude-opus-4.6 · 4/20/2026

DeepER-Med addresses a critical need for trustworthy AI in healthcare with immediate real-world clinical applications, validated by expert evaluation across real cases. Its evidence-based framework with explicit criteria for appraisal tackles transparency and reliability—key barriers to clinical AI adoption. While Paper 1 is intellectually interesting in probing LLM innovation capacity, it is more exploratory and narrower in scope. Paper 2's combination of a practical framework, expert-curated benchmark, and demonstrated clinical utility across multiple disciplines gives it broader and more immediate scientific and societal impact.