Echo: Learning from Experience Data via User-Driven Refinement

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei

#225 of 2292 · Artificial Intelligence
Tournament Score: 1513±48
Win Rate: 75% (12 wins, 4 losses, 16 matches)
Rating: 5.5/10

Abstract

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Echo: Learning from Experience Data via User-Driven Refinement

1. Core Contribution

Echo proposes a framework for learning from "experience data" — specifically, the discrepancy between an AI agent's initial proposals and user-refined final outcomes in production settings. The key insight is that users, as "accountable stakeholders," naturally correct flawed agent outputs into verified solutions, creating high-quality supervision signals. The framework is formalized as a three-stage pipeline: Experience Acquisition (capturing raw interaction streams), Knowledge Extraction (mining the gap between agent proposals C₁ and user-committed final states C_N), and Model Optimization (aligning the model to predict C_N directly from context C₀).
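Read this way, the Model Optimization stage can be written as an ordinary supervised fine-tuning loss over the extracted pairs. The following is only a sketch of that reading, not the paper's stated objective: the dataset symbol $\mathcal{D}$, the parameters $\theta$, and the token index $t$ are notation introduced here for illustration.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(C_0,\, C_N) \sim \mathcal{D}}
    \left[ \sum_{t=1}^{|C_N|}
      \log p_\theta\!\left( C_N^{(t)} \,\middle|\, C_0,\; C_N^{(<t)} \right)
    \right]
```

Under this assumption the agent's proposal $C_1$ does not enter the loss directly; it only marks which interactions contain a gap worth mining.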

The concrete instantiation is in code auto-completion within Tencent's CodeBuddy product, where the system tracks code gaps between prefix and suffix anchors, monitors user edits until a contextual break, and extracts the final committed code as ground truth. A multi-stage data refinery pipeline handles truncation, quality filtering, distribution balancing, and PPL-based denoising.
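The anchor-based extraction and PPL-based denoising described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CodeBuddy implementation: the function names, the exact-match anchor lookup, the rule that a missing anchor counts as a contextual break, and the `max_ppl` threshold (including its direction, dropping high-perplexity samples) are all hypothetical.

```python
import math
from typing import Optional, Sequence


def extract_gap(final_text: str, prefix_anchor: str, suffix_anchor: str) -> Optional[str]:
    """Return the text between the two anchors, or None if an anchor is missing."""
    start = final_text.find(prefix_anchor)
    if start == -1:
        return None  # assumption: a lost prefix anchor means a contextual break
    gap_start = start + len(prefix_anchor)
    end = final_text.find(suffix_anchor, gap_start)
    if end == -1:
        return None  # assumption: a lost suffix anchor means a contextual break
    return final_text[gap_start:end]


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity is exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def keep_sample(token_logprobs: Sequence[float], max_ppl: float = 20.0) -> bool:
    """Hypothetical denoising rule: keep samples whose perplexity is at most max_ppl."""
    return perplexity(token_logprobs) <= max_ppl


# Usage: the user's final buffer still contains both anchors around the gap.
prefix = "def area(r):\n    "
suffix = "\n\nprint(area(2))"
final_buffer = prefix + "return 3.14159 * r * r" + suffix
target = extract_gap(final_buffer, prefix, suffix)
```

Here `target` plays the role of the committed final state C_N, and the anchors stand in for the context C₀. A production system would presumably need fuzzier anchor matching than `str.find`, since users also edit the surrounding code.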

2. Methodological Rigor

Strengths in evaluation design: The paper deliberately anchors evaluation in production metrics (Acceptance Rate, Generation Rate) rather than static benchmarks, which is appropriate for the industrial context and provides more convincing evidence of real-world utility. The five-month longitudinal evaluation with 10,000+ DAU is compelling.

Weaknesses in experimental rigor:

  • The primary result (25.7% → 35.7% AR) is presented as a cumulative improvement over five months of iterative pipeline refinement, making it impossible to attribute gains to specific components. There is no proper ablation study decomposing the contribution of gap-based extraction, intent-aligned rewriting, quality filtering, and distribution proportioning individually.
  • The scaling analysis (Figure 5) uses only BLEU scores on a proprietary benchmark rather than the production metrics advocated by the paper's own evaluation philosophy, creating an internal inconsistency.
  • The base model (DeepSeek-Coder-6.7B with 500B continued pre-training) itself represents significant additional training beyond the original model. The paper does not clearly separate gains from continued pre-training versus Echo-specific data.
  • The "50k samples" primary training set is relatively small; justification for this choice vs. the available data volume is thin.
  • The comparison is only against a single SFT baseline. No comparison against RLHF, DPO, or other online/offline learning methods is provided, weakening claims about Echo's relative advantage.
  • The generalization experiment (Table 1) lacks baseline details — the internal vs. external "statistical caliber" differences are acknowledged but not quantified, making cross-environment comparison imprecise.

3. Potential Impact

The paper addresses a genuinely important problem: how to continuously improve AI agents from deployment-time interactions rather than static training data. The conceptual framing, in which users act as accountable stakeholders who naturally produce ground-truth corrections, is intuitive and practically grounded.

Real-world applications: The framework is directly applicable to any AI copilot system where users refine agent outputs (code completion, writing assistants, design tools). The production deployment at Tencent scale provides credibility. The 10-percentage-point absolute improvement in acceptance rate (25.7% to 35.7%) is commercially significant.
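Because the gain is quoted both in absolute and in relative terms across this page, it helps to separate the two. The snippet below is a small worked calculation using the acceptance rates reported in the abstract; the function name is illustrative only.

```python
from typing import Tuple


def acceptance_gain(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after_pct - before_pct
    relative_pct = 100.0 * absolute_pp / before_pct
    return absolute_pp, relative_pct


absolute_pp, relative_pct = acceptance_gain(25.7, 35.7)
print(f"{absolute_pp:.1f} pp absolute, {relative_pct:.1f}% relative")
# prints "10.0 pp absolute, 38.9% relative"
```

The relative figure is the one rounded to 39% in one of the comparisons further down.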

Broader influence: The "experience data" framing connects to important ongoing discussions about data scarcity for LLM training. However, the actual technical contribution is relatively straightforward: it essentially amounts to mining edit histories for SFT targets with careful data curation. The conceptual framework, while articulated with ambitious language, does not introduce fundamentally new technical machinery.

4. Timeliness & Relevance

The paper is highly timely. The question of how to learn from deployment interactions is central to the current AI agent ecosystem. Products like Cursor, GitHub Copilot, and Claude Code generate enormous interaction logs, and the industry desperately needs principled approaches to leverage this data. The paper's framing aligns well with the "Era of Experience" discussion (Silver & Sutton, 2025) and concerns about static data exhaustion.

5. Strengths & Limitations

Key Strengths:

  • Production validation at scale: Real deployment with real users over five months is far more convincing than synthetic benchmarks.
  • Practical pipeline design: The gap-based extraction, anchor monitoring, and lifecycle tracking are well-engineered contributions useful to practitioners.
  • Generalization evidence: Improvements transferring to external users suggests the approach captures genuine capability improvements rather than user-specific overfitting.
  • Scaling behavior: The absence of saturation in the scaling curve (though measured only via BLEU) is an encouraging signal.

Notable Limitations:

  • Limited technical novelty: The core idea — using user corrections as training signal — is well-established in interactive machine learning and learning from demonstrations. The novelty is primarily in the engineering pipeline and scale, not in conceptual or algorithmic innovation.
  • Overclaimed generality: The paper positions Echo as "environment-agnostic" and a "universal paradigm," but provides evidence only for code completion — a relatively constrained, single-turn generation task where the optimization naturally reduces to SFT. The leap to multi-step agents, creative tasks, or complex reasoning remains entirely speculative.
  • Missing ablations and baselines: No component-wise ablation; no comparison to DPO, RLHF, or other experience learning methods cited in related work.
  • Confounding factors: The five-month improvement trajectory conflates data pipeline improvements with data volume scaling, making causal claims difficult.
  • Privacy and ethics: While acknowledged, the data governance discussion is superficial for a system mining production interaction logs at scale.
  • Reproducibility: The proprietary nature of the data, product, and evaluation infrastructure makes independent verification impossible.

Additional Observations

The writing is clear but at times excessively promotional, using terms like "inexhaustible data engine," "strategic moat," and "paradigm shift" for what is essentially a well-executed data curation pipeline with SFT. The discussion section (6.3) speculates that experience data might surpass pre-training, which is unsupported by the evidence presented. The paper would benefit from more measured claims proportional to its actual experimental evidence.

Rating: 5.5/10
Significance 6 · Rigor 4.5 · Novelty 4 · Clarity 6.5

Generated May 22, 2026

Comparison History (16)

    vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
    gemini-3.1 · 5/22/2026

    Paper 1 identifies a novel, fundamental failure mode in agentic AI (temporal memory contamination) and introduces a rigorous evaluation protocol for longitudinal safety. As long-term memory becomes standard in LLMs, establishing how accumulated context degrades safety will have profound implications across AI alignment, safety evaluations, and architecture design, offering broader scientific impact than Paper 2's application-focused data pipeline.

    vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
    gemini-3.1 · 5/22/2026

    Paper 1 offers higher scientific impact by identifying a novel, fundamental vulnerability in AI agents: temporal memory contamination. While Paper 2 provides a highly practical, production-validated framework for continuous learning from user feedback, it builds on established paradigms of interaction-based alignment. Paper 1 pioneers a new longitudinal evaluation paradigm for AI safety, demonstrating that risks compound over time across unrelated tasks. Its rigorous trigger-probe protocol and early detection mechanism provide foundational tools for future research in secure, long-horizon autonomous agents, making its conceptual contributions more broadly impactful.

    vs. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
    gemini-3.1 · 5/22/2026

    Paper 1 addresses the critical AI bottleneck of data scaling by presenting a scalable, continuous learning framework from user interactions. Its validation in a large-scale production environment demonstrates immediate, widespread real-world utility across the rapidly growing domain of LLM agents, offering broader impact than Paper 2's robotics-specific optimizations.

    vs. LoopVLA: Learning Sufficiency in Recurrent Refinement for Vision-Language-Action Models
    gemini-3.1 · 5/22/2026

    Paper 2 addresses a fundamental bottleneck in AI training—the reliance on expensive, static human data—by introducing a scalable framework for continuous learning from user-driven refinements. This approach has massive breadth of impact across virtually all interactive AI agents. Furthermore, its validation in a large-scale production environment demonstrating a 10% absolute increase in acceptance rates highlights exceptional real-world applicability and timeliness. While Paper 1 offers a rigorous and novel architectural improvement for robotic control, Paper 2's paradigm shift toward continuous experience-based learning has a wider potential footprint across the broader AI ecosystem.

    vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
    claude-opus-4.6 · 5/22/2026

    Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) Novel theoretical framework integrating LLMs into Bayesian Optimization with formal convergence guarantees, (2) Broad applicability across multiple scientific domains (physics, chemistry, biology, materials science), (3) Validated in both dry benchmarks and wet-lab experiments showing significant efficiency gains, (4) Addresses fundamental challenges in scientific discovery (costly experiments, cold-start, high dimensionality). Paper 2 (Echo) addresses an important but narrower problem in code completion agent training. While practically valuable with strong production results, its impact is more domain-specific compared to LGBO's cross-disciplinary scientific optimization framework.

    vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
    claude-opus-4.6 · 5/22/2026

    Paper 2 introduces a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates impact across multiple scientific domains including wet-lab validation. Its breadth of impact spans physics, chemistry, biology, and materials science, addressing a fundamental challenge in scientific discovery. Paper 1, while practically valuable with strong production results in code completion, is more narrowly focused on a single application domain. Paper 2's theoretical contributions, cross-disciplinary applicability, and real experimental validation suggest broader and deeper scientific impact.

    vs. Claw AI Lab: An Autonomous Multi-Agent Research Team
    gpt-5.2 · 5/22/2026

    Paper 1 (Echo) is more scientifically impactful: it proposes a broadly applicable learning framework for converting noisy real-world agent interaction logs into high-quality training signals via user-driven refinement, and demonstrates a substantial, quantitative production gain (acceptance 25.7%→35.7%). This is methodologically closer to a generalizable learning paradigm with clear downstream applications across deployed agents and continual alignment. Paper 2 is valuable infrastructure/UI for autonomous research, but appears more product/system-integration oriented with limited, internal case-study evaluation and less clear general scientific novelty.

    vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks
    claude-opus-4.6 · 5/22/2026

    Echo addresses a fundamental challenge in continuous learning from real-world deployment data, with validated production results showing a 39% relative improvement in code completion acceptance rates. Its framework for converting noisy interaction logs into training signals has broad applicability across all deployed AI agent systems. While SkillWeave presents useful modular specialization techniques, Echo's contribution is more transformative—it establishes a scalable paradigm for post-deployment improvement that could reshape how AI systems learn continuously, with concrete production validation rather than just benchmark results.

    vs. Claw AI Lab: An Autonomous Multi-Agent Research Team
    gemini-3.1 · 5/22/2026

    While Paper 1 presents an ambitious framework for autonomous AI research, its evaluation is limited to a small internal study. Paper 2 tackles a fundamental bottleneck in AI scaling—continuous learning from noisy real-world experience—and demonstrates substantial methodological rigor through large-scale validation in a production environment. The proven 10% absolute increase in acceptance rates highlights immediate, highly scalable real-world applicability and a robust data-flywheel paradigm with broader impact across all deployed agentic systems.

    vs. Skill Weaving: Efficient LLM Improvement via Modular Skillpacks
    gpt-5.2 · 5/22/2026

    Paper 1 is likely higher impact due to timeliness and real-world applicability: it leverages ubiquitous post-deployment interaction/refinement data to enable continuous learning, directly addressing a key bottleneck (scalable high-quality supervision) and showing production-scale gains. The framework generalizes beyond coding to any agent with user edits, potentially affecting alignment, RLHF alternatives, and agent deployment practices broadly. Paper 2 is strong for efficient specialization and modularity, but resembles an incremental advance on parameter-efficient fine-tuning/modular adapters; its impact may be narrower to deployment/efficiency compared with a paradigm for learning from live experience.

    vs. Data-driven Circuit Discovery for Interpretability of Language Models
    gemini-3.1 · 5/22/2026

    Paper 2 addresses the critical bottleneck of static training data by introducing a scalable framework for continuous learning from user interactions. Its successful validation in a large-scale production environment demonstrates high real-world applicability and offers a practical path to improving deployed AI agents. While Paper 1 provides valuable insights into mechanistic interpretability, Paper 2's approach to harnessing continuous, real-world experience promises broader and more immediate impacts across the AI industry and applied machine learning research.

    vs. Data-driven Circuit Discovery for Interpretability of Language Models
    gpt-5.2 · 5/22/2026

    Paper 2 (Echo) has higher likely scientific impact due to strong real-world applicability and demonstrated large-scale production gains (code completion acceptance +10 pp), addressing a timely problem: leveraging abundant agent interaction data via user refinement. The framework is broadly relevant to continual learning, RLHF/RLAIF-style alignment, and deployed agents across domains. While Paper 1 is novel and important for mechanistic interpretability, its impact is more specialized and research-facing, with less immediate practical adoption. Echo’s deployment evidence and general “experience-to-training-signal” pipeline suggest wider near-term influence.

    vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
    gpt-5.2 · 5/22/2026

    Paper 1 has broader, more timely impact: it proposes a general framework for turning ubiquitous real-world agent interaction logs plus user refinements into scalable training signals, demonstrated at production scale with a clear quantitative gain. This targets a central bottleneck (data and continual alignment) across many agent domains, making applications wide and immediate. Paper 2 is innovative and rigorous for a hard, important niche (EDA/Verilog) and introduces verifier-guided test-time skill evolution, but its scope and real-world deployment footprint are narrower and results appear more domain-specific, reducing cross-field impact relative to Paper 1.

    vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
    claude-opus-4.6 · 5/22/2026

    Echo addresses a fundamental challenge in continuous AI learning from real-world deployment data, with validated production results showing significant improvement (25.7% to 35.7% acceptance rate) in code completion. Its framework for harvesting user refinement signals from deployed agents is broadly applicable across AI agent ecosystems and addresses the critical bottleneck of training data scalability. Paper 2, while technically solid in improving rubric-based RL training efficiency, addresses a narrower optimization problem within RLVR. Echo's production-scale validation and generalizable framework for experience-driven learning have broader potential impact across the rapidly growing AI agent deployment landscape.

    vs. Interference-Aware Multi-Task Unlearning
    gemini-3.1 · 5/22/2026

    Paper 2 addresses a critical bottleneck in AI scaling—reliance on expensive static human data—by proposing a scalable framework for continuous learning from real-world user interactions. Its validation in a large-scale production environment demonstrates significant and immediate real-world utility. While Paper 1 provides a strong methodological advance in the important niche of multi-task unlearning, Paper 2 offers a broader impact by providing a blueprint for the continuous, automated improvement of broadly deployed AI agents.

    vs. Generative Auto-Bidding with Unified Modeling and Exploration
    gpt-5.2 · 5/22/2026

    Paper 1 has higher estimated scientific impact due to broader novelty and applicability: it proposes a general framework for turning noisy, real-world agent interaction logs into high-quality training signals via user-driven refinement, a paradigm relevant across many deployed AI agents (coding, assistants, workflow tools) and timely for continual learning at scale. Its demonstrated improvement in production suggests strong real-world leverage. Paper 2 is methodologically solid and impactful in ad bidding, but is more domain-specific and combines established components (DT, Q-guidance, IDM) into a tailored system, limiting breadth across fields.