Unsupervised Skill Discovery for Agentic Data Analysis
Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng
Abstract
Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks, respectively.
AI Impact Assessments
Scientific Impact Assessment: "Unsupervised Skill Discovery for Agentic Data Analysis"
1. Core Contribution
DataCOPE introduces an unsupervised framework for discovering reusable procedural "skills" that improve LLM-based data-analysis agents at inference time, without requiring ground-truth labels, human annotations, or model parameter updates. The key insight is that verifier signals can be derived from the agent's own exploration trajectories rather than from external supervision. The framework coordinates three components: a Data-Analytic Agent for trajectory sampling, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation.
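The coordination of the three components can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `discover_skills`, the callables `run_agent`, `verify`, and `distill`, and the rule of contrasting the highest- and lowest-scored trajectory per task are all assumptions made here for illustration.

```python
from typing import Callable, List

Trajectory = str  # placeholder; a real trajectory would hold code, tool calls, and outputs


def discover_skills(
    tasks: List[str],
    run_agent: Callable[[str, List[str]], Trajectory],      # hypothetical Data-Analytic Agent
    verify: Callable[[str, List[Trajectory]], List[float]],  # hypothetical Unsupervised Verifier
    distill: Callable[[str, Trajectory, Trajectory], str],   # hypothetical Skill Manager
    samples_per_task: int = 4,
    iterations: int = 2,
) -> List[str]:
    """Sample trajectories, score them without labels, and distill skills from contrasts."""
    skills: List[str] = []
    for _ in range(iterations):
        for task in tasks:
            # 1) Sample several trajectories with the current skill set injected.
            trajs = [run_agent(task, skills) for _ in range(samples_per_task)]
            # 2) Derive label-free scores from the trajectories themselves.
            scores = verify(task, trajs)
            ranked = sorted(zip(scores, trajs), key=lambda pair: pair[0])
            (low_score, worst), (high_score, best) = ranked[0], ranked[-1]
            # 3) Distill a skill only when the verifier separates the trajectories
            #    (an assumed rule; without a contrast there is nothing to learn from).
            if high_score > low_score:
                skill = distill(task, best, worst)
                if skill not in skills:
                    skills.append(skill)
    return skills
```

In this sketch the verifier only needs to rank trajectories relative to each other, which is what lets the loop run without ground-truth labels.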
The paper makes a meaningful distinction between two data-analysis modalities—report-style and reasoning-style—and designs tailored verifiers for each: an Adaptive Checklist Verifier (which generates task-specific criteria and iteratively refines them) and an Answer Agreement Verifier (which clusters trajectories by answer agreement and uses self-consistency). This dual instantiation addresses a genuine heterogeneity problem in data analysis evaluation.
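To make the two verifier signals concrete, here is a rough Python sketch under stated assumptions. The helper names `checklist_coverage`, `normalize`, and `answer_agreement` are hypothetical, the per-item checklist judgments are taken as given (the paper's checklist generation and refinement steps are not modeled), and exact string matching after whitespace and case normalization stands in for whatever answer-matching rule the paper actually uses.

```python
from typing import Dict, List, Tuple


def checklist_coverage(satisfied: List[bool]) -> float:
    """Fraction of checklist items judged as verifiably covered by a report."""
    return sum(satisfied) / len(satisfied) if satisfied else 0.0


def normalize(answer: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def answer_agreement(answers: List[str]) -> Tuple[Dict[str, List[int]], str, float]:
    """Group trajectory indices by normalized final answer.

    Returns the groups, the majority answer, and a self-consistency score
    (the share of trajectories that agree with the majority answer).
    """
    if not answers:
        raise ValueError("need at least one answer")
    groups: Dict[str, List[int]] = {}
    for index, answer in enumerate(answers):
        groups.setdefault(normalize(answer), []).append(index)
    majority = max(groups, key=lambda key: len(groups[key]))
    consistency = len(groups[majority]) / len(answers)
    return groups, majority, consistency


# Usage: three of four sampled trajectories agree on the same answer.
groups, majority, consistency = answer_agreement(["42", " 42 ", "41", "42"])
report_score = checklist_coverage([True, False, True, True])
```

Both signals are relative rather than absolute: a high coverage or agreement score says that a trajectory looks better than its siblings, not that it is correct.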
2. Methodological Rigor
The experimental design is generally sound. The paper evaluates on two established benchmarks (Deep Data Research and DABStep) with a proper train/test split (1:3 ratio), tests across five different base models spanning the Claude, GPT, DeepSeek, and Qwen families, and compares against a relevant baseline (Anthropic's Skill Creator). The ablation studies systematically remove individual components (task-specific checklists, checklist refinement, answer clustering, self-consistency) and demonstrate their contributions.
However, several methodological concerns arise:
The iterative refinement analysis (Figure 3) is informative, showing that improvements are not monotonic and that later iterations can be harmful—this is an honest and useful finding. The alternating optimization between the Data-Analytic Agent and Checklist Agent is well-motivated as addressing verifier overfitting.
3. Potential Impact
The practical impact could be significant in several ways:
The framework could influence adjacent fields including automated report generation, scientific data exploration, and business intelligence automation.
4. Timeliness & Relevance
This paper is highly timely. It addresses the emerging paradigm of LLM agent skills (citing Anthropic's recent skill framework from 2026), and the broader trend toward inference-time adaptation without parameter updates. The explosion of data-analysis agent benchmarks and frameworks in 2025-2026 creates clear demand for methods to improve these agents cheaply. The unsupervised angle is particularly relevant given the difficulty and cost of creating ground-truth annotations for complex analytical tasks.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations
The paper's framing as "unsupervised" is slightly overstated—the framework still requires substantial architectural decisions, prompt engineering, and model selection that encode human knowledge. The comparison with supervised discovery (Figure 4c) is informative but limited to small label counts; the crossover point where supervision becomes clearly superior is not well characterized.
The writing is generally clear, though the mathematical formalization (Section II) adds little beyond what the system description already conveys.
Generated Jun 5, 2026
Comparison History (15)
Paper 1 investigates the fundamental mechanisms of activation steering and hidden state representations in LLMs. By disentangling angular and radial components, it provides foundational insights into LLM interpretability and control. These fundamental insights are likely to have a broader and longer-lasting scientific impact across AI safety and alignment compared to Paper 2, which focuses on a more applied, domain-specific framework for data analysis agents.
Paper 1 addresses a fundamental bottleneck in brain decoding—scarce labeled neural data—with a novel data augmentation approach using large-scale pretrained encoding models. The demonstration of zero-shot brain-to-image decoding and up to 68% improvement in retrieval accuracy is striking. It bridges neuroscience and AI with broad implications for brain-computer interfaces and clinical applications. Paper 2, while solid, represents an incremental advance in LLM-based agent skill discovery with narrower applicability. Paper 1's cross-disciplinary impact, novel paradigm of synthetic fMRI augmentation, and potential for real-world neurotechnology applications give it higher estimated impact.
The Universal Quantum Transformer proposes a fundamentally new architecture bridging quantum computing and AI, addressing core limitations of classical transformers for exact reasoning. If validated, it would have transformative cross-disciplinary impact spanning quantum computing, machine learning, and mathematics. Its claims of exponential advantages and demonstration on real quantum hardware are highly ambitious. While Paper 2 offers solid incremental improvements in agentic AI skill discovery, its scope is narrower and more incremental. Paper 1's novelty and potential paradigm-shifting nature give it higher estimated impact, despite the risk that some claims may be overstated.
Paper 2 (DataCOPE) proposes a novel framework for unsupervised skill discovery in data-analytic agents, offering a constructive contribution with strong empirical gains (9.71% and 32.30% improvements) across multiple settings. It addresses a timely problem in LLM-based agentic systems with broad applicability. Paper 1, while methodologically rigorous in its causal analysis of RAG rewriting gains, is primarily a diagnostic/audit contribution that does not propose new methods or mitigations. Its impact is more narrow—clarifying an existing phenomenon rather than enabling new capabilities. Paper 2's constructive framework with practical applications gives it broader potential impact.
Paper 2 demonstrates higher potential impact by addressing a critical bottleneck in life-critical systems: the misalignment between XAI outputs and autonomous driving safety standards. While Paper 1 offers a valuable framework for LLM agents, Paper 2 bridges machine learning, systems engineering, and regulatory compliance. By formally deriving a rubric to evaluate XAI admissibility for safety assurance, it provides a foundational framework with immediate, high-stakes real-world applications. Its structural approach to the 'evidence-type gap' will likely heavily influence both future XAI research directions and the practical deployment and regulation of autonomous vehicles.
Paper 2 has higher estimated impact due to broader applicability and timeliness: unsupervised skill discovery for agentic data analysis generalizes across domains, datasets, and model backends, aligning with current trends in LLM agents and inference-time augmentation. Its verifier-guided framework (multiple verifier instantiations) is a reusable methodological contribution with clear real-world use in analytics automation. Paper 1 is insightful for graph-augmented RAG and tool/operator framing, but the evaluation is relatively small and domain-specific (46-node KG, 23 queries), likely limiting breadth despite good novelty.
Paper 2 (DataCOPE) is likely higher impact due to greater novelty and broader applicability: it proposes an unsupervised, verifier-guided framework for discovering reusable skills from unlabeled exploration—addressing a key bottleneck for agentic data analysis where supervision is costly and criteria vary. Its modular verifier/skill-manager design can transfer across analytical formats and domains, with clear real-world relevance to automated analytics. Paper 1 is strong and timely but is a more incremental advance on inference-time reasoning/memory frameworks with narrower scope and reliance on benchmark gains.
Paper 1 (DataCOPE) presents a more general and broadly applicable framework for unsupervised skill discovery applicable to any data-analytic agent, with clear quantitative improvements (9.71% and 32.30%) across multiple settings. Its unsupervised approach to skill discovery without labeled data addresses a fundamental challenge in agentic AI. Paper 2 (Parthenon) is valuable but more domain-specific (legal), limiting its breadth of impact. DataCOPE's methodological contributions—contrastive skill distillation, adaptive checklist verification, answer agreement verification—are transferable across many domains, giving it higher potential for broad scientific influence.
Paper 1 likely has higher impact due to clearer novelty with an empirically validated, generally applicable unsupervised skill-discovery framework for data-analysis agents. It addresses a timely, widely relevant problem (agentic data analysis) and demonstrates substantial gains across two task families with concrete instantiations and evaluations, suggesting methodological rigor and reproducibility. Paper 2 is ambitious and potentially useful for systems engineering, but relies heavily on architectural claims and formal properties that may be narrower in applicability (typed KG pipelines) and less validated empirically in the abstract, making near-term scientific uptake less certain.
Paper 2 (DataCOPE) introduces a novel unsupervised framework for skill discovery in data-analytic agents with strong empirical improvements (9.71% and 32.30%), addressing the practical and timely challenge of improving LLM agents without parameter updates. Its methodological contributions—contrastive skill distillation, adaptive checklist verification, and answer agreement verification—are broadly applicable across analytical tasks. Paper 1, while addressing an interesting gap in chronological reasoning benchmarks, is primarily a diagnostic benchmark contribution with findings (shortcut biases) that, while useful, have narrower methodological novelty and more limited downstream impact.
Paper 2 addresses a fundamental question about LLM-driven program evolution that has broad implications across multiple fields (evolutionary computation, program synthesis, AI-driven search). Its finding that LLMs exhibit systematic convergence bias is a novel, rigorous empirical contribution that challenges assumptions underlying many LLM-based optimization systems (e.g., FunSearch, EvoPrompting). This insight is widely applicable and timely. Paper 1, while showing solid improvements on data analysis benchmarks, is more incremental and narrowly focused on a specific application domain with a complex multi-component framework.
Paper 1 likely has higher impact due to greater novelty and broader applicability: it proposes an unsupervised, verifier-guided framework for skill discovery that can generalize across multiple data-analysis formats (report-style and reasoning-style), addressing a key bottleneck (expensive supervision) in agent improvement. Its methodological contribution (trajectory-derived verifier signals + iterative skill distillation) is more foundational and transferable to many agentic workflows beyond math. Paper 2 is timely and useful but builds on a more common critic/multi-agent pattern and is evaluated primarily on GSM8K, limiting breadth.
Paper 1 likely has higher scientific impact: it introduces a more novel, general framework (unsupervised verifier-guided skill discovery) applicable across multiple agentic data-analysis formats, with clear methodological components and broad relevance to LLM agents, self-improvement, and unsupervised learning. Its potential applications span many domains where analytical agents are used, and it is timely given rapid interest in agentic AI and inference-time augmentation. Paper 2 is valuable and applied, but is a narrower empirical adaptation of existing memory-augmented trajectory models to AIS vessel data, with more limited cross-field breadth and novelty.
QCFuse addresses a fundamental efficiency bottleneck in RAG serving—a critical infrastructure problem affecting widespread LLM deployment. It offers a principled solution (compressed-view query-aware selection) with strong empirical results (1.7x speedup with no quality loss) across multiple models and datasets. The work has immediate practical impact on LLM serving systems. Paper 2, while solid, addresses a narrower problem (skill discovery for data-analytic agents) with less generalizable methodology. QCFuse's contribution to the RAG/LLM serving infrastructure has broader applicability and timeliness given the explosive growth of RAG systems.
Paper 1 likely has higher scientific impact: it introduces a broadly applicable unsupervised skill discovery framework for data-analytic agents, addressing a core, general problem (learning reusable skills without labels) with a modular verifier-guided method adaptable to multiple analysis formats. This is novel and timely for agentic LLM research, and its concepts (trajectory-derived verifier signals, contrastive skill distillation) can transfer across domains beyond data analysis. Paper 2 is strong and application-relevant for industrial anomaly detection, but is more domain-specific and system/engineering-focused, limiting breadth despite large gains.