Echo: Learning from Experience Data via User-Driven Refinement
Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei
Abstract
Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.
AI Impact Assessments
Scientific Impact Assessment: Echo: Learning from Experience Data via User-Driven Refinement
1. Core Contribution
Echo proposes a framework for learning from "experience data" — specifically, the discrepancy between an AI agent's initial proposals and user-refined final outcomes in production settings. The key insight is that users, as "accountable stakeholders," naturally correct flawed agent outputs into verified solutions, creating high-quality supervision signals. The framework is formalized as a three-stage pipeline: Experience Acquisition (capturing raw interaction streams), Knowledge Extraction (mining the gap between agent proposals C₁ and user-committed final states C_N), and Model Optimization (aligning the model to predict C_N directly from context C₀).
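The Knowledge Extraction and Model Optimization stages described above can be illustrated with a minimal sketch. It assumes each logged interaction is stored as a context C₀ plus an ordered list of states C₁ … C_N; the `Session` type, its field names, and the `committed` flag are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Session:
    """One logged interaction (hypothetical schema, not the paper's)."""
    context: str        # C_0: what the agent saw before proposing
    states: List[str]   # [C_1, ..., C_N]; C_1 is the agent's proposal
    committed: bool     # True if the user kept the final state C_N


def extract_pair(session: Session) -> Optional[Tuple[str, str]]:
    """Return a (C_0, C_N) training pair, or None if the session is unusable."""
    if not session.committed or not session.states:
        return None
    final_state = session.states[-1]  # C_N, the user-verified outcome
    if not final_state.strip():
        return None  # the user deleted everything: no target to learn
    return (session.context, final_state)


def build_sft_pairs(sessions: List[Session]) -> List[Tuple[str, str]]:
    """Collect (input, target) pairs for supervised fine-tuning."""
    pairs = []
    for session in sessions:
        pair = extract_pair(session)
        if pair is not None:
            pairs.append(pair)
    return pairs
```

The intermediate states C₂ … C_(N-1) are deliberately discarded here: only the endpoint the user committed is used as the target, which matches the review's description of predicting C_N directly from C₀.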
The concrete instantiation is in code auto-completion within Tencent's CodeBuddy product, where the system tracks code gaps between prefix and suffix anchors, monitors user edits until a contextual break, and extracts the final committed code as ground truth. A multi-stage data refinery pipeline handles truncation, quality filtering, distribution balancing, and PPL-based denoising.
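As a rough illustration of the anchor-based tracking described above, the sketch below recovers the text between a prefix anchor and a suffix anchor in the user's final document. It assumes anchors are plain substrings and treats a missing anchor as a contextual break; the function name `extract_gap` and this string-matching strategy are assumptions, since the review does not describe how CodeBuddy implements the tracking.

```python
from typing import Optional


def extract_gap(document: str, prefix_anchor: str, suffix_anchor: str) -> Optional[str]:
    """Return the text between the two anchors, or None if either is missing.

    A missing anchor is treated as a contextual break: the user has edited
    the surrounding code so much that the gap can no longer be located.
    """
    start = document.find(prefix_anchor)
    if start == -1:
        return None
    gap_start = start + len(prefix_anchor)
    # Search for the suffix only after the prefix so the gap cannot be negative.
    gap_end = document.find(suffix_anchor, gap_start)
    if gap_end == -1:
        return None
    return document[gap_start:gap_end]
```

Calling `extract_gap` on the final saved document yields the candidate ground-truth completion; a `None` result would mark the sample for exclusion before the later filtering stages.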
2. Methodological Rigor
Strengths in evaluation design: The paper deliberately anchors evaluation in production metrics (Acceptance Rate, Generation Rate) rather than static benchmarks, which is appropriate for the industrial context and provides more convincing evidence of real-world utility. The five-month longitudinal evaluation with 10,000+ DAU is compelling.
Weaknesses in experimental rigor:
3. Potential Impact
The paper addresses a genuinely important problem: how to continuously improve AI agents from deployment-time interactions rather than static training data. The conceptual framing — users as accountable stakeholders who naturally produce ground-truth corrections — is intuitive and practically grounded.
Real-world applications: The framework is directly applicable to any AI copilot system where users refine agent outputs (code completion, writing assistants, design tools). The production deployment at Tencent scale provides credibility. The 10-percentage-point absolute improvement in acceptance rate (25.7% to 35.7%) is commercially significant.
Broader influence: The "experience data" framing connects to important ongoing discussions about data scarcity for LLM training. However, the actual technical contribution is relatively straightforward — it essentially amounts to mining edit histories for SFT targets with careful data curation. The conceptual framework, while articulated with ambitious language, does not introduce fundamentally new technical machinery.
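Because the acceptance-rate gain reported in the abstract (25.7% to 35.7%) can be read either as an absolute or as a relative change, a short worked calculation may help keep the two apart. The helper name `improvement` is illustrative only.

```python
from typing import Tuple


def improvement(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_pp = after_pct - before_pct
    relative_pct = absolute_pp / before_pct * 100.0
    return absolute_pp, relative_pct


# Acceptance rates taken from the abstract.
absolute_pp, relative_pct = improvement(25.7, 35.7)
print(f"absolute: {absolute_pp:.1f} pp, relative: {relative_pct:.1f}%")
```

The absolute gain is 10.0 percentage points, while the relative gain is roughly 38.9% of the original rate.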
4. Timeliness & Relevance
The paper is highly timely. The question of how to learn from deployment interactions is central to the current AI agent ecosystem. Products like Cursor, GitHub Copilot, and Claude Code generate enormous interaction logs, and the industry desperately needs principled approaches to leverage this data. The paper's framing aligns well with the "Era of Experience" discussion (Silver & Sutton, 2025) and concerns about static data exhaustion.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The writing is clear but at times excessively promotional, using terms like "inexhaustible data engine," "strategic moat," and "paradigm shift" for what is essentially a well-executed data curation pipeline with SFT. The discussion section (6.3) speculates that experience data might surpass pre-training, which is unsupported by the evidence presented. The paper would benefit from more measured claims proportional to its actual experimental evidence.
Generated May 22, 2026
Comparison History (16)
Paper 1 identifies a novel, fundamental failure mode in agentic AI (temporal memory contamination) and introduces a rigorous evaluation protocol for longitudinal safety. As long-term memory becomes standard in LLMs, establishing how accumulated context degrades safety will have profound implications across AI alignment, safety evaluations, and architecture design, offering broader scientific impact than Paper 2's application-focused data pipeline.
Paper 1 offers higher scientific impact by identifying a novel, fundamental vulnerability in AI agents: temporal memory contamination. While Paper 2 provides a highly practical, production-validated framework for continuous learning from user feedback, it builds on established paradigms of interaction-based alignment. Paper 1 pioneers a new longitudinal evaluation paradigm for AI safety, demonstrating that risks compound over time across unrelated tasks. Its rigorous trigger-probe protocol and early detection mechanism provide foundational tools for future research in secure, long-horizon autonomous agents, making its conceptual contributions more broadly impactful.
Paper 1 addresses the critical AI bottleneck of data scaling by presenting a scalable, continuous learning framework from user interactions. Its validation in a large-scale production environment demonstrates immediate, widespread real-world utility across the rapidly growing domain of LLM agents, offering broader impact than Paper 2's robotics-specific optimizations.
Paper 2 addresses a fundamental bottleneck in AI training—the reliance on expensive, static human data—by introducing a scalable framework for continuous learning from user-driven refinements. This approach has massive breadth of impact across virtually all interactive AI agents. Furthermore, its validation in a large-scale production environment demonstrating a 10% absolute increase in acceptance rates highlights exceptional real-world applicability and timeliness. While Paper 1 offers a rigorous and novel architectural improvement for robotic control, Paper 2's paradigm shift toward continuous experience-based learning has a wider potential footprint across the broader AI ecosystem.
Paper 1 (LGBO) demonstrates higher scientific impact potential due to: (1) Novel theoretical framework integrating LLMs into Bayesian Optimization with formal convergence guarantees, (2) Broad applicability across multiple scientific domains (physics, chemistry, biology, materials science), (3) Validated in both dry benchmarks and wet-lab experiments showing significant efficiency gains, (4) Addresses fundamental challenges in scientific discovery (costly experiments, cold-start, high dimensionality). Paper 2 (Echo) addresses an important but narrower problem in code completion agent training. While practically valuable with strong production results, its impact is more domain-specific compared to LGBO's cross-disciplinary scientific optimization framework.
Paper 2 introduces a novel framework (LGBO) that integrates LLM reasoning into Bayesian Optimization with theoretical guarantees and demonstrates impact across multiple scientific domains including wet-lab validation. Its breadth of impact spans physics, chemistry, biology, and materials science, addressing a fundamental challenge in scientific discovery. Paper 1, while practically valuable with strong production results in code completion, is more narrowly focused on a single application domain. Paper 2's theoretical contributions, cross-disciplinary applicability, and real experimental validation suggest broader and deeper scientific impact.
Paper 1 (Echo) is more scientifically impactful: it proposes a broadly applicable learning framework for converting noisy real-world agent interaction logs into high-quality training signals via user-driven refinement, and demonstrates a substantial, quantitative production gain (acceptance 25.7%→35.7%). This is methodologically closer to a generalizable learning paradigm with clear downstream applications across deployed agents and continual alignment. Paper 2 is valuable infrastructure/UI for autonomous research, but appears more product/system-integration oriented with limited, internal case-study evaluation and less clear general scientific novelty.
Echo addresses a fundamental challenge in continuous learning from real-world deployment data, with validated production results showing a 39% relative improvement in code completion acceptance rates. Its framework for converting noisy interaction logs into training signals has broad applicability across all deployed AI agent systems. While SkillWeave presents useful modular specialization techniques, Echo's contribution is more transformative—it establishes a scalable paradigm for post-deployment improvement that could reshape how AI systems learn continuously, with concrete production validation rather than just benchmark results.
While Paper 1 presents an ambitious framework for autonomous AI research, its evaluation is limited to a small internal study. Paper 2 tackles a fundamental bottleneck in AI scaling—continuous learning from noisy real-world experience—and demonstrates substantial methodological rigor through large-scale validation in a production environment. The proven 10% absolute increase in acceptance rates highlights immediate, highly scalable real-world applicability and a robust data-flywheel paradigm with broader impact across all deployed agentic systems.
Paper 1 is likely higher impact due to timeliness and real-world applicability: it leverages ubiquitous post-deployment interaction/refinement data to enable continuous learning, directly addressing a key bottleneck (scalable high-quality supervision) and showing production-scale gains. The framework generalizes beyond coding to any agent with user edits, potentially affecting alignment, RLHF alternatives, and agent deployment practices broadly. Paper 2 is strong for efficient specialization and modularity, but resembles an incremental advance on parameter-efficient fine-tuning/modular adapters; its impact may be narrower to deployment/efficiency compared with a paradigm for learning from live experience.
Paper 2 addresses the critical bottleneck of static training data by introducing a scalable framework for continuous learning from user interactions. Its successful validation in a large-scale production environment demonstrates high real-world applicability and offers a practical path to improving deployed AI agents. While Paper 1 provides valuable insights into mechanistic interpretability, Paper 2's approach to harnessing continuous, real-world experience promises broader and more immediate impacts across the AI industry and applied machine learning research.
Paper 2 (Echo) has higher likely scientific impact due to strong real-world applicability and demonstrated large-scale production gains (code completion acceptance +10 pp), addressing a timely problem: leveraging abundant agent interaction data via user refinement. The framework is broadly relevant to continual learning, RLHF/RLAIF-style alignment, and deployed agents across domains. While Paper 1 is novel and important for mechanistic interpretability, its impact is more specialized and research-facing, with less immediate practical adoption. Echo’s deployment evidence and general “experience-to-training-signal” pipeline suggest wider near-term influence.
Paper 1 has broader, more timely impact: it proposes a general framework for turning ubiquitous real-world agent interaction logs plus user refinements into scalable training signals, demonstrated at production scale with a clear quantitative gain. This targets a central bottleneck (data and continual alignment) across many agent domains, making applications wide and immediate. Paper 2 is innovative and rigorous for a hard, important niche (EDA/Verilog) and introduces verifier-guided test-time skill evolution, but its scope and real-world deployment footprint are narrower and results appear more domain-specific, reducing cross-field impact relative to Paper 1.
Echo addresses a fundamental challenge in continuous AI learning from real-world deployment data, with validated production results showing significant improvement (25.7% to 35.7% acceptance rate) in code completion. Its framework for harvesting user refinement signals from deployed agents is broadly applicable across AI agent ecosystems and addresses the critical bottleneck of training data scalability. Paper 2, while technically solid in improving rubric-based RL training efficiency, addresses a narrower optimization problem within RLVR. Echo's production-scale validation and generalizable framework for experience-driven learning have broader potential impact across the rapidly growing AI agent deployment landscape.
Paper 2 addresses a critical bottleneck in AI scaling—reliance on expensive static human data—by proposing a scalable framework for continuous learning from real-world user interactions. Its validation in a large-scale production environment demonstrates significant and immediate real-world utility. While Paper 1 provides a strong methodological advance in the important niche of multi-task unlearning, Paper 2 offers a broader impact by providing a blueprint for the continuous, automated improvement of broadly deployed AI agents.
Paper 1 has higher estimated scientific impact due to broader novelty and applicability: it proposes a general framework for turning noisy, real-world agent interaction logs into high-quality training signals via user-driven refinement, a paradigm relevant across many deployed AI agents (coding, assistants, workflow tools) and timely for continual learning at scale. Its demonstrated improvement in production suggests strong real-world leverage. Paper 2 is methodologically solid and impactful in ad bidding, but is more domain-specific and combines established components (DT, Q-guidance, IDM) into a tailored system, limiting breadth across fields.