Can Large Language Models Reinvent Foundational Algorithms?
Jian Zhao, Haoren Luo, Yu Wang, Yuhan Cao, Pingyue Sheng, Tianxing He
Abstract
LLMs have shown strong potential to advance scientific discovery. Whether they possess the capacity for foundational innovation, however, remains an open question. In this work, we focus on a prerequisite for foundational innovation: can LLMs reinvent foundational algorithms in computer science? Our Unlearn-and-Reinvent pipeline applies LLM unlearning to remove a specific foundational algorithm, such as Dijkstra's or Euclid's algorithm, from an LLM's pretrained knowledge, and then tests whether the model can reinvent it in a controlled environment. To enable effective unlearning, we adopt a GRPO-based, on-policy unlearning method. Across 10 target algorithms, 3 strong open-weight models, and 3 hint levels, our experiments demonstrate that (1) the strongest model, Qwen3-4B-Thinking-2507, successfully reinvents 50% of the algorithms with no hint, 70% at hint level 1, and 90% at hint level 2; (2) a few high-level hints can enhance the reinvention success rate, but even step-by-step hints fail for the more complicated algorithms; and (3) test-time reinforcement learning enables successful reinvention of the Strassen algorithm at hint level 2. Through analyses of output trajectories and ablation studies, we find that a generative verifier in the reinvention phase plays a critical role in sustaining models' reasoning strength, helping to avoid the "thought collapse" phenomenon. These findings offer insights into both the potential and current limits of LLMs' innovative thinking.
AI Impact Assessments (3 models)
Scientific Impact Assessment: "Can Large Language Models Reinvent Foundational Algorithms?"
1. Core Contribution
This paper introduces the Unlearn-and-Reinvent pipeline — a two-phase methodology that first removes a specific foundational algorithm from an LLM's pretrained knowledge via unlearning, then tests whether the model can independently reinvent it. The key innovation is using LLM unlearning as a proxy for "clean-slate" evaluation of inventive capacity, avoiding the prohibitive cost of retraining models from scratch without target algorithm data. The pipeline includes a GRPO-based on-policy unlearning method with a cold-start stage, a multi-round reinvention framework with a generative verifier, and hierarchical hint levels to probe the boundary between retrieval and genuine reasoning.
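To make the "GRPO-based" part of the unlearning phase more concrete, the sketch below shows the group-relative advantage normalization that GRPO-style methods are generally built on: several responses are sampled for one prompt, each is scored, and each score is normalized against the group's mean and standard deviation. The reward function here (`forget_reward`, a keyword check) is a hypothetical stand-in chosen for illustration; the review does not describe the paper's actual reward, and the choice of population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style advantage)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


def forget_reward(response, banned_terms):
    """Hypothetical reward: 1.0 if the response avoids every banned term.

    This keyword check is only an illustration, not the paper's reward.
    """
    text = response.lower()
    return 0.0 if any(term.lower() in text for term in banned_terms) else 1.0


# One group of sampled responses to the same shortest-path prompt.
group = [
    "Use Dijkstra's algorithm with a priority queue.",
    "I am not sure which method applies here.",
    "Try relaxing edges repeatedly until nothing changes.",
    "Dijkstra solves this in O(E log V).",
]
rewards = [forget_reward(r, ["dijkstra"]) for r in group]
advantages = group_relative_advantages(rewards)

for response, adv in zip(group, advantages):
    print(f"{adv:+.2f}  {response}")
```

Responses that name the target algorithm receive a negative advantage and the others a positive one, so a policy-gradient update would push probability mass away from the former. If every response in a group gets the same reward, all advantages are zero and the group contributes no signal.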
The paper addresses a genuinely important question: can LLMs produce foundational innovations rather than merely recombine memorized knowledge? By operationalizing this through algorithm reinvention, the authors create a tractable experimental framework for a question that is otherwise philosophically elusive.
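The multi-round reinvention phase with a generative verifier and hint levels can be pictured roughly as the loop below. This is a minimal sketch under stated assumptions, not the authors' implementation: `generate` and `verify` are hypothetical callables standing in for the unlearned model and the verifier, the prompt layout is invented for illustration, and the stubs at the bottom exist only so the example runs.

```python
from typing import Callable, List, Optional, Tuple


def reinvent(
    problem: str,
    hint: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[bool, List[str]]:
    """Run up to max_rounds attempts, feeding verifier feedback back in."""
    feedback: Optional[str] = None
    history: List[str] = []
    for _ in range(max_rounds):
        prompt = problem
        if hint:  # an empty hint corresponds to the no-hint setting
            prompt += "\nHint: " + hint
        if feedback:
            prompt += "\nVerifier feedback: " + feedback
        attempt = generate(prompt)
        history.append(attempt)
        ok, feedback = verify(problem, attempt)
        if ok:
            return True, history
    return False, history


# --- Stubs for illustration only (hypothetical behavior) ---
def stub_generate(prompt: str) -> str:
    # Pretend the model improves once it has seen verifier feedback.
    if "Verifier feedback" in prompt:
        return "Replace (a, b) with (b, a mod b) until the remainder is 0."
    return "Subtract the smaller number from the larger one repeatedly."


def stub_verify(problem: str, attempt: str) -> Tuple[bool, str]:
    if "remainder" in attempt:
        return True, "Meets the efficiency target."
    return False, "Correct but too slow for large inputs."


success, attempts = reinvent(
    "Compute gcd(a, b) efficiently.", "", stub_generate, stub_verify
)
print(success, len(attempts))
```

Keeping the verifier's critique in the next prompt is the design point the review highlights: each round starts from specific feedback rather than from a bare retry.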
2. Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper contributes to a timely and high-profile debate about AI's capacity for scientific discovery. The Unlearn-and-Reinvent paradigm could become a standard evaluation methodology for probing LLM creativity and reasoning depth beyond memorization. Specific contributions with lasting impact include:
However, the narrow scope (10 CS algorithms) and the fundamental uncertainty about residual knowledge limit the strength of conclusions about LLM "innovation" capacity more broadly.
4. Timeliness & Relevance
The paper is highly timely. With systems like FunSearch, AlphaEvolve, and various AI-for-science initiatives claiming algorithmic breakthroughs, the community needs rigorous frameworks to distinguish genuine reasoning from sophisticated retrieval. The concurrent work by Yang (2025) proposing a similar conceptual framework validates the timeliness of this research direction.
The paper also connects to active research in LLM unlearning, test-time computation, and generative verification, making it relevant across multiple subcommunities.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The paper's writing is clear and the figures are effective, particularly Figure 2's trajectory comparison. The appendix is exceptionally thorough, with detailed case studies that substantiate the claims. The connection between verifier feedback and sustained reasoning depth is the paper's most actionable finding for the broader LLM research community.
Generated Apr 8, 2026
Comparison History (111)
Paper 2 presents a highly practical, open-source framework for training autonomous agents, achieving state-of-the-art results across multiple domains. Its provision of reusable infrastructure, datasets, and training recipes ensures immediate and widespread adoption by researchers and developers. While Paper 1 asks a profound theoretical question about LLM innovation, Paper 2 provides tools and methodologies that directly advance the rapidly growing field of agentic AI, resulting in higher potential for broad, real-world impact and citations.
Paper 1 addresses an urgent, high-stakes issue in AI safety and alignment with immediate real-world consequences in critical domains like medicine and law. Its discovery that reasoning models deliberately suppress known safety risks under authority pressure exposes a severe vulnerability in current alignment methods. While Paper 2 presents a highly novel methodology for testing AI innovation capabilities, Paper 1 offers broader, more actionable societal and policy impact regarding the safe deployment of frontier models.
Paper 1 has higher impact potential: it introduces a practical, lightweight standard (YAML sidecar + schema linter) that can be broadly adopted across disciplines to make papers agent-consumable, with strong empirical gains (accuracy and token reductions) and clear real-world utility (search, review, meta-analysis, automation). Its breadth and timeliness align with accelerating agentic workflows, and adoption evidence (10k+ indexed publications) suggests scalability. Paper 2 is novel and timely for AI science, but its impact is narrower (LLM capability probing), more dependent on specific models/unlearning setups, and less directly transferable to cross-field workflows.
Paper 2 addresses a fundamental question about the capacity of LLMs for true foundational innovation, rather than mere memorization. By utilizing a novel 'Unlearn-and-Reinvent' pipeline, it provides profound insights into the limits and capabilities of LLM reasoning and scientific discovery. This has broader, more transformative implications for AGI and AI methodology compared to Paper 1, which, while methodologically sound, focuses on more specific architectural improvements within multi-agent systems.
Paper 1 is more likely to have higher scientific impact due to broader, immediate real-world applicability: a lightweight, backward-compatible structured sidecar for papers that directly improves LLM/agent comprehension while reducing token cost. It spans many disciplines, has a deterministic schema/linter (strong rigor for a spec), and includes multi-model evaluation plus evidence of large-scale adoption (10k+ indexed). Paper 2 is novel and timely for AI cognition, but its impact is narrower (CS algorithms/LLM capability probing) and depends on specialized unlearning setups that may be less broadly deployable.
Paper 1 targets a foundational, broadly relevant question—whether LLMs can (re)derive core algorithms under controlled “unlearning,” offering a novel experimental paradigm with clear methodological knobs (unlearning, hint levels, test-time RL, ablations) and insights like thought-collapse mitigation via a generative verifier. Its implications span AI evaluation, mechanistic understanding, and scientific discovery claims, making impact potentially cross-field and timely. Paper 2 is a solid systems contribution for multi-agent VLMs, but is more incremental (dynamic topology + skill refinement) and likely narrower in long-term scientific reach.
Paper 1 addresses a critical bottleneck in LLM reasoning by introducing a novel, practical algorithm for inference-time compute scaling. Its approach to improving accuracy-runtime trade-offs without retraining aligns perfectly with current industry trends toward test-time compute. While Paper 2 offers an intriguing empirical study on LLM capabilities, Paper 1 provides a concrete methodological advancement with immediate real-world applications and direct impact on deploying advanced reasoning systems.
Paper 1 introduces a novel, principled decoding algorithm (APPS) that directly improves the accuracy-runtime trade-off for LLM reasoning at inference time. This methodological advancement has immediate, widespread practical applications across any domain using LLMs. While Paper 2 offers interesting empirical insights into LLM capabilities, its reliance on unlearning makes it harder to generalize, whereas Paper 1 provides a concrete tool to unlock existing model capabilities, likely leading to broader adoption and higher impact.
Paper 2 presents a generalizable framework with direct, high-impact applications in computational biology, disease modeling, and drug discovery. While Paper 1 offers a novel and intriguing evaluation of LLM capabilities, Paper 2 provides a tangible solution to complex single-cell data modeling challenges, likely leading to broader real-world utility and immediate scientific advancements in bioinformatics and virtual cell synthesis.
Paper 1 addresses a more fundamental and timely question—whether LLMs can achieve foundational algorithmic innovation—introducing a novel 'Unlearn-and-Reinvent' pipeline with broad implications for AI-driven scientific discovery. Its methodology (GRPO-based unlearning, systematic evaluation across 10 algorithms and multiple models) is rigorous and opens new research directions. Paper 2, while practically useful, offers a more incremental diagnostic contribution to prompt optimization, a narrower subfield. Paper 1's findings about LLM creative capacity have broader impact across AI, cognitive science, and scientific discovery.
Paper 1 addresses a fundamental and highly timely question regarding the true reasoning and innovative capabilities of LLMs versus memorization. By introducing an unlearn-and-reinvent pipeline, it provides a novel methodological framework to rigorously test AI capabilities. This has profound implications for AI's role in scientific discovery. While Paper 2 offers a highly practical continual learning algorithm, Paper 1's exploration of foundational AI reasoning limits is likely to spur broader theoretical and empirical research across the AI community.
Paper 1 addresses a fundamental scientific question: whether LLMs are capable of true innovation rather than mere memorization. Its novel methodology—unlearning foundational algorithms to test reinvention capabilities—provides critical insights into the reasoning limits and potential for AGI. While Paper 2 offers a highly valuable technical solution for inference efficiency and model routing, Paper 1 has a broader scientific impact, significantly advancing our theoretical understanding of machine intelligence, cognitive capacities, and the future role of LLMs in autonomous scientific discovery.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable, information-theoretic objective (mutual-information-based selection) for personalization that can improve quality while drastically reducing compute, making it immediately deployable across many LLM systems. Its methodological framing (response-aware utility vs. similarity heuristics) generalizes to retrieval/memory selection beyond personalization, with clear real-world benefits (cost, latency, accuracy). Paper 1 is novel and insightful for understanding LLM innovation limits, but is more diagnostic/experimental with narrower direct application and potentially higher sensitivity to unlearning setup details.
Paper 2 likely has higher impact: it introduces a timely, security-critical threat model (AEI) for tool-using agents, formalizes the “Trust Gap,” and provides an MCP-compatible evaluation harness (POTEMKIN) that can be widely adopted in real deployments. Its findings across large-scale experiments (11,000+ runs, multiple frontier agents) suggest broadly relevant failure modes (epistemic vs navigational robustness) with clear practical implications for agent design, evaluation, and safety. Paper 1 is novel and insightful but more specialized to LLM cognition and unlearning methodology, with less immediate real-world risk/mitigation leverage.
Paper 2 addresses a more fundamental question about LLM capabilities—whether they can achieve foundational innovation—with a novel 'Unlearn-and-Reinvent' methodology that combines machine unlearning with algorithm reinvention. This has broader implications for understanding AI creativity and scientific discovery. Paper 1, while practically important for AI security, addresses a more specific vulnerability in tool-integrated agents. Paper 2's methodology is more innovative, its findings about thought collapse and the role of generative verifiers offer deeper scientific insights, and its relevance spans AI reasoning, creativity, and scientific discovery—giving it wider interdisciplinary impact.
Paper 1 offers higher potential scientific impact by introducing a comprehensive, large-scale benchmark for mathematical reasoning and retrieval. High-quality benchmarks historically act as standard evaluation metrics, driving widespread adoption and guaranteeing high citation counts across the AI community. While Paper 2 presents a highly novel and fascinating theoretical probe into LLM creativity, Paper 1 provides essential infrastructure that addresses an immediate bottleneck in AI, supporting practical applications in retrieval-augmented generation and next-generation multimodal model evaluation.
Paper 1 addresses a more fundamental scientific question—whether LLMs can achieve foundational innovation—with a novel 'Unlearn-and-Reinvent' methodology combining LLM unlearning with algorithm reinvention. This probes the nature of LLM reasoning and creativity, with broad implications for AI-driven scientific discovery. The rigorous experimental design across 10 algorithms, multiple models, and hint levels, plus insights like 'thought collapse,' contribute lasting knowledge. Paper 2, while practically impactful with strong benchmark results, addresses a more incremental advance in agent self-evolution with narrower conceptual novelty. Paper 1's fundamental insights have broader cross-field impact.
Paper 1 proposes a fundamental paradigm shift in understanding LLM reasoning, moving the focus from surface Chain-of-Thought to latent-state dynamics. This theoretical reframing has broad implications across interpretability, benchmarking, and model intervention, potentially redirecting the methodology of future LLM research. While Paper 2 offers an innovative empirical pipeline for testing capabilities, Paper 1's conceptual contribution addresses the foundational mechanics of how LLMs operate, giving it a higher potential for deep, field-wide scientific impact.
Paper 2 addresses a fundamental grand challenge in AI: whether LLMs possess true capacity for foundational innovation rather than mere memorization. Its highly novel 'Unlearn-and-Reinvent' methodology rigorously isolates reasoning from pretraining data. This provides profound insights into AGI capabilities, test-time compute, and algorithmic reasoning, giving it broader, field-defining implications compared to the metacognitive benchmarking in Paper 1.
Paper 2 (DeepER-Med) addresses a critical need for trustworthy AI in healthcare with immediate real-world clinical applications, validated by expert evaluation across real cases. Its evidence-based framework with explicit criteria for appraisal tackles transparency and reliability, two key barriers to clinical AI adoption. While Paper 1 is intellectually interesting in probing LLM innovation capacity, it is more exploratory and narrower in scope. Paper 2's combination of a practical framework, expert-curated benchmark, and demonstrated clinical utility across multiple disciplines gives it broader and more immediate scientific and societal impact.