EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

Zherui Yang, Fan Liu, Yansong Ning, Hao Liu

#594 of 3355 · Artificial Intelligence
Tournament Score: 1474±45
Win Rate: 75% · Wins: 15 · Losses: 5 · Matches: 20
Rating: 7.2 / 10

Abstract

Recent progress in Large Language Model (LLM) agents has enabled promising advances in automated data science. However, existing approaches remain fundamentally limited by their static action sets and lack of principled long-horizon context management, hindering their ability to accumulate reusable experience across tasks and operate reliably in multi-stage, iterative data science pipelines. To address these challenges, we introduce EvoDS, a self-evolving autonomous data science agent that learns to expand its skills and adaptively manage long-term context through agentic reinforcement learning. Specifically, EvoDS introduces two key strategies: (1) an Autonomous Skill Acquisition (ASA) mechanism, which enables agents to synthesize, validate, and reuse executable skills; and (2) an Adaptive Context Compression (ACC) strategy, which treats context management as a learned control problem rather than passive truncation. These strategies are orchestrated within a two-stage multi-agent training scheme, enabling EvoDS to autonomously improve over time. Theoretically, we prove that EvoDS's hierarchical design reduces tool-selection error, and its optimization objective aligns with an information bottleneck principle, ensuring efficient context use. Empirically, EvoDS outperforms state-of-the-art open-source data science agents by an average of 28.9% across four diverse benchmarks while eliminating out-of-token failures. Our code and data are available at https://github.com/usail-hkust/EvoDS.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EvoDS

1. Core Contribution

EvoDS addresses two coupled challenges in LLM-based data science agents: (1) the inability to accumulate reusable skills across tasks, and (2) the lack of principled long-horizon context management. The main novelty lies in integrating Autonomous Skill Acquisition (ASA), a four-stage mechanism (synthesis, verification, caching, usage-driven expansion) for building a growing skill library, with Adaptive Context Compression (ACC), which treats context management as a learned control problem rather than passive truncation. These are unified within a hierarchical multi-agent architecture trained via a two-stage pipeline (SFT + joint multi-agent RL). The framing of context compression as a learnable action (the agent decides *when* to compress rather than merely reacting to token limits) is a genuinely interesting design choice that distinguishes this work from prior passive truncation approaches.
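The four ASA stages named above (synthesis, verification, caching, usage-driven expansion) can be pictured with a toy skill library. This is only an illustrative sketch and not the authors' implementation: the `SkillLibrary` class, its method names, verification by input/output test cases, and the rule that usage counts trigger expansion are all assumptions. Only the threshold value τ=3 is taken from this review.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class SkillLibrary:
    """Toy skill store: verify before caching, flag heavily used skills."""
    tau: int = 3  # expansion threshold (value from the review; semantics assumed)
    skills: Dict[str, Callable] = field(default_factory=dict)
    usage: Dict[str, int] = field(default_factory=dict)

    def add(self, name: str, source: str, tests: List[Tuple[tuple, object]]) -> bool:
        """Take synthesized source code and cache it only if every test passes."""
        namespace: dict = {}
        try:
            exec(source, namespace)  # hypothetical: a real system would sandbox this
            fn = namespace[name]
            ok = all(fn(*args) == expected for args, expected in tests)
        except Exception:
            ok = False
        if ok:
            self.skills[name] = fn
            self.usage[name] = 0
        return ok

    def call(self, name: str, *args):
        """Reuse a cached skill and record the usage."""
        self.usage[name] += 1
        return self.skills[name](*args)

    def expansion_candidates(self) -> List[str]:
        """Skills used at least tau times are candidates for expansion."""
        return [n for n, c in self.usage.items() if c >= self.tau]
```

In this sketch a skill that fails verification never enters the cache, so later tasks can only reuse code that passed its checks at least once.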

2. Methodological Rigor

Architecture design: The hierarchical multi-agent decomposition (Manager → Cleaner, Featurizer, Modeler, Visualizer, Debugger) is well-motivated for data science workflows but is somewhat hand-designed. The disjoint skill partitioning across sub-agents is clean but may limit flexibility for tasks requiring cross-domain tool composition.

Training pipeline: The two-stage SFT→RL approach is sound. Using DeepSeek-V3.1 as a teacher for trajectory collection, followed by GRPO-based joint RL optimization of both manager and sub-agent trajectories, is a reasonable strategy. The reward design (Eq. 8) balances outcome quality, subtask completion, context efficiency, and turn efficiency, though the hyperparameter choices (α=0.2, β=γ=0.1) appear manually tuned without sensitivity analysis.
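To make the reward discussion concrete, the sketch below combines the four named reward terms and then applies a GRPO-style group normalization. The additive form and the assignment of α, β, γ to the subtask, context, and turn terms are assumptions made for illustration; the paper's Eq. 8 may differ. The function names are hypothetical, and the use of the population standard deviation is an arbitrary choice.

```python
from statistics import mean, pstdev
from typing import List

ALPHA, BETA, GAMMA = 0.2, 0.1, 0.1  # values quoted in the review


def composite_reward(outcome: float, subtask: float,
                     context_eff: float, turn_eff: float) -> float:
    """Assumed additive form; the paper's Eq. 8 may combine the terms differently."""
    return outcome + ALPHA * subtask + BETA * context_eff + GAMMA * turn_eff


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style normalization: center and scale rewards within one rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantages are normalized within each group, the absolute scale of the weights matters less than their ratios, which is one reason a sensitivity analysis over α, β, γ would be informative.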

Theoretical analysis: Theorem 5.1 (hierarchical tool selection reduces error bounds) relies on standard Gaussian noise assumptions and union bounds — the result is intuitive and expected rather than surprising. Theorem 5.2 (optimization aligns with information bottleneck) is interesting but requires strong assumptions (monotonic performance-MI relationship, entropy-proportional token cost) that may not hold precisely in practice. These theoretical contributions provide useful framing but are not deep results.
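The flavor of both results can be sketched with textbook arguments. The following is not the paper's statement or proof; it assumes N tools split evenly across K sub-agents, i.i.d. Gaussian score noise with variance σ², and the same score gap Δ between the correct choice and every alternative at both levels. The information bottleneck objective is given in its standard generic form, with Z read as the compressed context, X as the full interaction history, and Y as the task outcome (an assumed mapping), and with λ as the trade-off weight.

```latex
% Illustrative union-bound comparison under the assumptions stated above.
\begin{align*}
P_{\text{flat}}(\text{error})
  &\le (N-1)\,\Phi\!\left(-\frac{\Delta}{\sqrt{2}\,\sigma}\right), \\
P_{\text{hier}}(\text{error})
  &\le \left[(K-1) + \left(\tfrac{N}{K}-1\right)\right]
       \Phi\!\left(-\frac{\Delta}{\sqrt{2}\,\sigma}\right), \\
(N-1) - \left[(K-1) + \left(\tfrac{N}{K}-1\right)\right]
  &= (K-1)\left(\tfrac{N}{K}-1\right) > 0 \quad \text{for } 1 < K < N .
\end{align*}
% Generic information bottleneck objective (standard form).
\begin{equation*}
\max_{p(z \mid x)} \; I(Z;Y) - \lambda\, I(Z;X)
\end{equation*}
```

Under these simplifying assumptions the hierarchical bound is smaller whenever there is more than one sub-agent and each holds more than one tool, which matches the review's remark that the result is intuitive rather than surprising.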

Experimental evaluation: The evaluation spans four benchmarks (DABench, DA-Code, ScienceAgentBench, MLE-Dojo) covering diverse data science tasks. However, the MLE-Dojo evaluation uses only 10 samples, which is statistically weak for drawing reliable conclusions. The comparison against baselines is comprehensive, including both proprietary and open-source models. The 28.9% relative improvement over the best open-source baseline (DataMind-14B) is substantial. The ablation study is thorough, isolating contributions of each component.
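The concern about 10 samples can be quantified with a Wilson score interval. The sketch below assumes a binary pass/fail metric per task, which may not match how MLE-Dojo is actually scored; the function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

For example, 5 successes out of 10 gives an interval of roughly 0.24 to 0.76, so two systems whose observed rates differ by a couple of tasks cannot be reliably separated at this sample size.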

3. Potential Impact

Practical significance: EvoDS addresses a real and growing need — automating end-to-end data science workflows. The elimination of out-of-token failures (Table 3) is practically important for deployment reliability. The 69% cross-task skill reuse rate suggests genuine capability accumulation rather than task-specific memorization.

Broader influence: The skill acquisition mechanism could influence broader agent design beyond data science. The idea of treating context compression as a learned action rather than a heuristic is transferable to other long-horizon agent settings (web agents, coding agents, research agents). The joint multi-agent RL training strategy adds to the growing literature on multi-agent LLM optimization.
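The distinction between a heuristic and a learned compression action can be illustrated with a toy agent step. Everything here is a hypothetical sketch rather than the EvoDS design: the `Policy` signature, the `summarize` callback, and the whitespace word count used as a token proxy are assumptions.

```python
from typing import Callable, List

Policy = Callable[[List[str], int], str]  # (context, token_limit) -> action name


def passive_truncate(context: List[str], token_limit: int) -> List[str]:
    """Baseline heuristic: drop oldest entries only once the limit is exceeded."""
    while sum(len(c.split()) for c in context) > token_limit and len(context) > 1:
        context = context[1:]
    return context


def step(context: List[str], observation: str, policy: Policy,
         summarize: Callable[[List[str]], str], token_limit: int) -> List[str]:
    """One agent step where 'compress' is an action the policy may choose at any time."""
    action = policy(context, token_limit)
    if action == "compress":
        context = [summarize(context)]  # replace the history with a summary
    return context + [observation]
```

In the heuristic version information is lost only when the limit forces it and without regard to content, whereas in the action-based version the policy can compress early, for instance right after a pipeline stage finishes.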

Limitations on impact: The approach is tightly coupled to data science workflows — the sub-agent taxonomy is domain-specific. Generalization to other agent domains would require redesign. The reliance on a strong teacher model (DeepSeek-V3.1) for SFT initialization raises questions about bootstrapping in domains without such teachers.

4. Timeliness & Relevance

This paper is highly timely. The intersection of LLM agents, reinforcement learning for agents, and automated data science is experiencing rapid growth. The paper addresses two genuine bottlenecks: (1) most data science agents cannot learn from experience, and (2) context management remains ad hoc. The KDD '26 venue is appropriate. The work builds on and extends recent systems like DataMind, DeepAnalyze, and AutoKaggle in meaningful ways.

5. Strengths & Limitations

Key Strengths:

  • Well-identified problem: the coupling of skill acquisition and context management is a genuine challenge
  • Complete system design with both theoretical grounding and empirical validation
  • Strong empirical results: 28.9% average improvement over best open-source baseline; competitive with proprietary models using a much smaller (8B) backbone
  • Elimination of out-of-token failures is a concrete, practically valuable result
  • Cross-task skill reuse experiments (Table 4) demonstrate genuine transferability, including cross-benchmark generalization
  • Open-sourced code and data enhance reproducibility

Notable Weaknesses:

  • MLE-Dojo evaluation on only 10 samples is insufficient for statistical reliability — confidence intervals would be very wide
  • The skill expansion threshold τ=3 and other hyperparameters lack sensitivity analysis
  • The theoretical results, while correct, are relatively straightforward and rely on strong assumptions
  • The sub-agent taxonomy is manually designed; an adaptive agent topology would strengthen the self-evolving claim
  • The failure analysis (Section 6.5) reveals that 52% of failures are instruction-following errors, suggesting the fundamental LLM capability is a bottleneck that the framework cannot address
  • Training cost analysis is limited — the paper mentions 4×A800 GPUs but doesn't report total training time or cost
  • The comparison with LATM (the most directly comparable skill-creation baseline) shows EvoDS's advantage, but the gap narrows when LATM uses stronger backbones (o4-mini)

6. Additional Observations

    The paper's writing is generally clear, though dense. The extensive appendix (prompts, tool descriptions) aids reproducibility. The case studies (Figure 2) effectively illustrate skill synthesis and reuse. The curriculum learning strategy during RL (gradually increasing turn budget) is a practical detail that likely contributes significantly to training stability but receives little analysis.

    The work represents solid systems-level research with meaningful empirical contributions, though the theoretical and conceptual novelty is incremental rather than foundational.

    Rating: 7.2 / 10
    Significance 7.5 · Rigor 6.8 · Novelty 6.5 · Clarity 7.5

    Generated Jun 3, 2026

    Comparison History (20)

    vs. MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution
    gpt-5.2 · 6/5/2026

    Paper 2 has higher likely scientific impact due to broader real-world applicability (automated data science pipelines), clearer general-purpose contributions (self-evolving skill acquisition plus learned context compression), and stronger methodological framing (theoretical guarantees and open-source release enabling replication and follow-on work). Its ideas could transfer across many agentic LLM settings beyond data science. Paper 1 is impactful within multi-agent RL/game benchmarks and appears highly competitive empirically, but it is more benchmark/task-specific and its main contribution (delayed reward attribution pipeline) is narrower in cross-domain reach.

    vs. SMAC-Talk: A Natural Language Extension of the StarCraft Multi-Agent Challenge for Large Language Models
    claude-opus-4.6 · 6/5/2026

    EvoDS addresses fundamental limitations in LLM-based data science agents with novel contributions in skill learning and context management, backed by theoretical guarantees and strong empirical results (28.9% improvement across four benchmarks). It introduces broadly applicable techniques (autonomous skill acquisition, adaptive context compression) with clear real-world utility in automating data science workflows. SMAC-Talk, while valuable as a benchmark for multi-agent LLM coordination, is primarily an evaluation framework extending existing work (SMAC) rather than proposing new methods, limiting its direct scientific contribution compared to EvoDS's methodological innovations.

    vs. Can Generalist Agents Automate Data Curation?
    gemini-3.1 · 6/5/2026

    Paper 2 offers a more comprehensive and methodologically rigorous contribution by introducing a self-evolving agent with autonomous skill acquisition and adaptive context compression. Its inclusion of theoretical proofs regarding tool-selection error and information bottleneck, combined with multi-agent reinforcement learning and strong empirical gains across diverse benchmarks, suggests a broader and more foundational impact on the field of autonomous agents than Paper 1's focus on benchmarking and scaffolding data curation.

    vs. Interfaze: The Future of AI is built on Task-Specific Small Models
    claude-opus-4.6 · 6/5/2026

    EvoDS presents a more scientifically rigorous contribution with novel mechanisms (Autonomous Skill Acquisition and Adaptive Context Compression) grounded in theoretical analysis (information bottleneck principle, provable error reduction). It addresses fundamental limitations in LLM agents with principled solutions, offers reproducible open-source code, and demonstrates strong empirical gains (28.9% improvement). Paper 1 (Interfaze) reads more as a product/system announcement with impressive but hard-to-verify benchmarks against future model versions (GPT-5.4-Mini, Grok-4.3), raising credibility concerns. EvoDS's contributions to skill learning and context management have broader applicability across AI agent research.

    vs. Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition
    claude-opus-4.6 · 6/3/2026

    EvoDS presents a more comprehensive and novel framework with broader impact. It introduces two innovative mechanisms (ASA and ACC) for autonomous data science agents, provides theoretical guarantees, and demonstrates strong empirical results (28.9% improvement across four benchmarks). The self-evolving agent paradigm with skill learning and adaptive context management addresses fundamental limitations in LLM-based automation with wide applicability. Paper 2, while valuable for understanding API knowledge gaps, is more narrowly focused on benchmarking novel API acquisition and provides primarily diagnostic insights rather than a transformative new capability.

    vs. SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
    gpt-5.2 · 6/3/2026

    Paper 1 (EvoDS) has higher likely impact due to stronger timeliness and clearer real-world applicability: autonomous LLM-based data science agents are an active, fast-moving area with immediate adoption potential. EvoDS contributes novelty via self-evolving skill acquisition plus learned context compression, includes theoretical guarantees, and reports sizable benchmark gains while addressing practical failure modes (token limits). Paper 2 (SHARP) is conceptually interesting for streaming non-stationary sequences, but appears validated mainly on simulations and limited benchmarks, with a narrower near-term deployment path versus agent automation.

    vs. GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
    gemini-3.1 · 6/3/2026

    Paper 2 presents a novel, self-evolving agent architecture with broad applications across automated data science. It combines strong methodological innovation (Autonomous Skill Acquisition and Adaptive Context Compression) with rigorous theoretical proofs and significant empirical improvements over state-of-the-art models. In contrast, Paper 1, while highly relevant to AI evaluation, is limited to a domain-specific benchmark (graph theory). Paper 2's capacity to autonomously learn skills and manage context addresses fundamental bottlenecks in agentic AI, offering wider cross-disciplinary utility and higher potential for long-term scientific impact.

    vs. Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation
    gpt-5.2 · 6/3/2026

    Paper 1 is likely to have higher scientific impact due to broader applicability and timeliness: a self-evolving LLM data-science agent with learned skill acquisition and principled long-horizon context management addresses a core bottleneck for autonomous agents across many domains. It offers methodological rigor (agentic RL framework, theoretical guarantees, benchmarked gains, open code) and clear real-world utility in automating iterative analytics workflows. Paper 2 is novel and valuable for human-centric UGC evaluation, but its scope is narrower (social resonance assessment) and impact is more confined to multimedia/UGC communities and platform moderation.

    vs. Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers
    gpt-5.2 · 6/3/2026

    Paper 1 likely has higher scientific impact due to stronger novelty and timeliness in a high-value domain: fully automated evidence-enriched dataset construction from real biomedical papers (preserving figures/tables/captions/referring text) plus a cost-efficient post-training pipeline, culminating in an open BioVLM with strong benchmark gains and broad usability for biomedical QA and literature mining. Its real-world applicability (biomed research reliability) and released model/tooling can catalyze downstream work. Paper 2 is impactful for agent design, but autonomous skill learning/context compression is a crowded area and its applications are less domain-critical.

    vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment
    gpt-5.2 · 6/3/2026

    Paper 1 likely has higher scientific impact due to greater methodological novelty (self-evolving agent with learned skill acquisition and learned context compression via agentic RL), stronger technical contributions (theoretical guarantees plus substantial benchmark gains), and broader applicability across automated data science workflows and agent research. Its innovations generalize to other long-horizon LLM agent settings beyond data science. Paper 2 is timely and practically useful, but is more of a systems/pipeline integration contribution with narrower scientific novelty and potentially faster commoditization as evaluation tooling evolves.

    vs. Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design
    claude-opus-4.6 · 6/3/2026

    Paper 2 (PROBE) addresses a high-impact problem at the intersection of AI and drug discovery, introducing novel diagnostic metrics and a principled framework inspired by medicinal chemistry practice. Its domain-specific innovation—probing pocket-ligand interactions before optimization—is more scientifically novel and has clearer real-world applications in pharmaceutical development. While Paper 1 (EvoDS) makes solid contributions to autonomous data science agents with strong empirical gains, its advances are more incremental in the crowded LLM-agent space. Paper 2's cross-disciplinary impact (AI + structural biology + medicinal chemistry) and direct translational potential give it higher estimated scientific impact.

    vs. Reasoning Structure of Large Language Models
    gemini-3.1 · 6/3/2026

    Paper 1 offers a foundational methodological innovation by converting unstructured LLM reasoning into measurable, topological graphs. While Paper 2 presents a highly effective applied agent system, Paper 1 addresses a fundamental scientific gap in evaluating and interpreting the 'black box' of Large Reasoning Models. As the field shifts toward complex reasoning models (like OpenAI's o1), verifiable evaluation frameworks and efficiency metrics will have a broader, longer-lasting impact across AI research than specific agent architectures, which tend to be superseded rapidly.

    vs. Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation
    claude-opus-4.6 · 6/3/2026

    EvoDS presents a more novel and impactful contribution: a self-evolving agent framework with autonomous skill acquisition and adaptive context compression, backed by both theoretical guarantees and strong empirical results (28.9% improvement over SOTA). It addresses fundamental limitations in automated data science with a principled approach combining reinforcement learning, skill learning, and information-theoretic context management. Paper 2 provides a useful trace dataset and simulator for characterizing agentic systems, but is primarily an empirical characterization/benchmarking contribution with narrower methodological innovation and less potential to drive new research directions.

    vs. AURA: Action-Gated Memory for Robot Policies at Constant VRAM
    gpt-5.2 · 6/3/2026

    Paper 2 (AURA) has higher impact potential due to a clearer, timely systems+algorithm contribution for embodied/edge robotics: constant-size memory with action-gated writes directly targets a major deployment bottleneck (VRAM/bandwidth/write endurance) and is broadly applicable to long-horizon agents beyond a specific task suite. It evaluates in closed-loop robot benchmarks and reports concrete hardware-relevant savings. Paper 1 is strong but more incremental within LLM-agent data-science automation and may face faster commoditization; its real-world constraints are less fundamental than edge-memory limits in robotics.

    vs. SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale
    gemini-3.1 · 6/3/2026

    Paper 2 (EvoDS) presents higher potential impact due to its combination of theoretical rigor (mathematical proofs on tool-selection error and information bottlenecks) and substantial empirical gains (28.9% improvement across four benchmarks). While Paper 1 offers a highly novel structural approach to skill retrieval, Paper 2 tackles two fundamental bottlenecks in long-horizon agentic systems simultaneously: dynamic skill acquisition and active context management. Additionally, EvoDS provides open-source code and data, which significantly accelerates community adoption, reproducibility, and follow-up research.

    vs. Forget Attention: Importance-Aware Attention Is All You Need
    gemini-3.1 · 6/3/2026

    Paper 2 proposes a fundamental architectural innovation by fusing State Space Models directly into the attention mechanism (score-level fusion). This foundational improvement to language modeling architecture has the potential to influence a vast array of downstream applications and models across AI. While Paper 1 presents a strong, highly applicable system for automated data science, Paper 2's contribution tackles a core mechanism in foundation models, offering a broader and deeper potential impact across the entire field of deep learning.

    vs. SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment
    gemini-3.1 · 6/3/2026

    Paper 1 presents a self-evolving autonomous data science agent, which has immense potential for cross-disciplinary applications across all sciences. Its combination of skill acquisition, adaptive context compression, and theoretical grounding provides a robust framework for long-horizon tasks. While Paper 2 offers a highly efficient solution to the LLM alignment tax, Paper 1's potential to automate and accelerate broader scientific discovery gives it a wider and more transformative real-world impact.

    vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing
    gpt-5.2 · 6/3/2026

    Paper 1 has higher potential impact due to greater methodological and conceptual novelty: a self-evolving data science agent that learns new executable skills and treats long-horizon context management as a learned control problem, with theoretical guarantees and broad applicability to autonomous agent reliability. Its contributions generalize across many multi-step agent settings beyond data science. Paper 2 is timely and practically valuable for cost reduction via prompt rewriting and translation, but is more incremental (middleware optimization) and narrower in scientific scope, with less fundamental methodological innovation.

    vs. TSQAgent: Rating Time Series Data Quality via Dedicated Agentic Reasoning
    claude-opus-4.6 · 6/3/2026

    EvoDS addresses broader challenges in autonomous AI agents (skill learning, context management, reinforcement learning) applicable across data science tasks, with strong empirical gains (28.9% improvement), theoretical guarantees, and open-source code. Its self-evolving framework with reusable skill acquisition has wider applicability beyond a single domain. While TSQAgent tackles an important niche (time series quality assessment) with a solid benchmark, its scope is narrower. EvoDS's contributions to agentic RL, context compression, and autonomous skill learning are more broadly impactful across the AI and data science communities.

    vs. Diagnosing LLM Arbitration Behavior over Pre-evidence Epistemic States in RAG-based Fact-Checking
    gpt-5.2 · 6/3/2026

    Paper 1 offers a more novel, system-level contribution: a self-evolving autonomous data science agent with learned skill acquisition and learned long-horizon context management, plus theoretical analysis and sizable benchmark gains with practical open-source release—likely to influence both agent design and applied AutoDS workflows. Paper 2 is timely and valuable but primarily provides a diagnostic benchmark and a lightweight mitigation for RAG fact-checking; its impact is narrower (evaluation-focused) and less likely to reshape broader agentic architectures compared to Paper 1’s reusable methods.