Unsupervised Skill Discovery for Agentic Data Analysis

Zhisong Qiu, Kangqi Song, Shengwei Tang, Shuofei Qiao, Lei Liang, Huajun Chen, Shumin Deng

Jun 4, 2026

arXiv:2606.06416v1

cs.AI (primary), cs.CL, cs.LG, cs.MA

#1847 of 3355 · Artificial Intelligence

Tournament Score

1393 ± 47

Win Rate: 60%

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7

Abstract

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-analysis skills from unlabeled exploration alone. We propose DataCOPE, an unsupervised verifier-guided skill discovery framework for data-analytic agents. DataCOPE derives verifier signals from the exploration trajectories and uses them to characterize relative quality or agreement among trajectories. It iteratively coordinates a Data-Analytic Agent for trajectory generation, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation. For report-style analysis, we instantiate the verifier as an Adaptive Checklist Verifier that derives task-specific criteria, scores reports by verifiable coverage, and iteratively refines the checklist. For reasoning-style analysis, we instantiate it as an Answer Agreement Verifier that groups trajectories by answer agreement and uses self-consistency as an auxiliary signal. We evaluate DataCOPE on report-style analysis from Deep Data Research and reasoning-style analysis from DABStep. Across both settings, DataCOPE consistently improves held-out performance over baselines. Averaged across four model settings, DataCOPE improves the mean score by 9.71% and 32.30% on report-style and reasoning-style tasks, respectively.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Unsupervised Skill Discovery for Agentic Data Analysis"

1. Core Contribution

DataCOPE introduces an unsupervised framework for discovering reusable procedural "skills" that improve LLM-based data-analysis agents at inference time, without requiring ground-truth labels, human annotations, or model parameter updates. The key insight is that verifier signals can be derived from the agent's own exploration trajectories rather than from external supervision. The framework coordinates three components: a Data-Analytic Agent for trajectory sampling, an Unsupervised Verifier for signal extraction, and a Skill Manager for contrastive skill distillation.
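The coordination described above can be pictured as a sample, verify, distill loop. The sketch below is an illustration only, not the authors' implementation: the names `Trajectory`, `discover_skills`, and the toy agent, verifier, and manager are all hypothetical, and the real components are LLM-driven rather than simple functions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    output: str


# Hypothetical interfaces (assumptions, not taken from the paper):
#   agent(task, skills)          -> Trajectory
#   verifier(trajectories)       -> one score per trajectory
#   manager(best, worst, skills) -> updated skill list
Agent = Callable[[str, List[str]], Trajectory]
Verifier = Callable[[List[Trajectory]], List[float]]
Manager = Callable[[Trajectory, Trajectory, List[str]], List[str]]


def discover_skills(tasks: List[str], agent: Agent, verifier: Verifier,
                    manager: Manager, samples_per_task: int = 4,
                    iterations: int = 2) -> List[str]:
    """Sample trajectories, score them without labels, distill contrasts."""
    skills: List[str] = []
    for _ in range(iterations):
        for task in tasks:
            trajs = [agent(task, skills) for _ in range(samples_per_task)]
            scores = verifier(trajs)
            order = sorted(range(len(trajs)), key=lambda i: (scores[i], i))
            worst, best = order[0], order[-1]
            # Distill only when the verifier actually separates the samples.
            if scores[best] > scores[worst]:
                skills = manager(trajs[best], trajs[worst], skills)
    return skills


def demo() -> List[str]:
    """Toy stand-ins that create an artificial quality contrast."""
    calls = {"n": 0}

    def toy_agent(task: str, skills: List[str]) -> Trajectory:
        calls["n"] += 1
        detail = " with checks" if calls["n"] % 2 == 0 else ""
        return Trajectory(task, f"analysis of {task}{detail}")

    def toy_verifier(trajs: List[Trajectory]) -> List[float]:
        # Output length is a placeholder signal, not a real quality measure.
        return [float(len(t.output)) for t in trajs]

    def toy_manager(best: Trajectory, worst: Trajectory,
                    skills: List[str]) -> List[str]:
        note = f"prefer {best.output!r} over {worst.output!r}"
        return skills if note in skills else skills + [note]

    return discover_skills(["sales.csv"], toy_agent, toy_verifier, toy_manager)
```

Calling `demo()` runs two iterations over a single made-up task and returns the distilled skill notes. The point of the design is that the skill list is the only state that changes between iterations; no model parameters are touched.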

The paper makes a meaningful distinction between two data-analysis modalities—report-style and reasoning-style—and designs tailored verifiers for each: an Adaptive Checklist Verifier (which generates task-specific criteria and iteratively refines them) and an Answer Agreement Verifier (which clusters trajectories by answer agreement and uses self-consistency). This dual instantiation addresses a genuine heterogeneity problem in data analysis evaluation.
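To make the two verifier signals concrete, here is a minimal sketch under strong simplifying assumptions. The function names are hypothetical, checklist coverage is approximated by keyword matching (the paper describes derived, iteratively refined criteria, which this does not reproduce), and answers are grouped by simple string normalization.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def checklist_coverage(report: str, checklist: List[str]) -> float:
    """Fraction of checklist keywords found in the report.

    Keyword matching is an assumption made for illustration; it stands in
    for whatever coverage check the real verifier performs.
    """
    if not checklist:
        return 0.0
    text = report.lower()
    hits = sum(1 for item in checklist if item.lower() in text)
    return hits / len(checklist)


def normalize(answer: str) -> str:
    """Collapse case and whitespace so equivalent answers group together."""
    return " ".join(answer.strip().lower().split())


def answer_agreement(answers: List[str]) -> Tuple[Dict[str, List[int]], str, float]:
    """Group trajectory indices by final answer.

    Returns (groups, majority_answer, self_consistency), where
    self_consistency is the share of trajectories in the largest group.
    """
    if not answers:
        raise ValueError("need at least one answer")
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, ans in enumerate(answers):
        groups[normalize(ans)].append(idx)
    majority = max(groups, key=lambda k: len(groups[k]))
    consistency = len(groups[majority]) / len(answers)
    return dict(groups), majority, consistency
```

For example, `answer_agreement(["42", " 42 ", "41", "42"])` groups three trajectories under `"42"` and reports a self-consistency of 0.75. A high value only shows that the trajectories agree with each other, not that the shared answer is correct.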

2. Methodological Rigor

The experimental design is generally sound. The paper evaluates on two established benchmarks (Deep Data Research and DABStep) with an explicit exploration/test split (1:3 ratio), tests across base models from four families (Claude, GPT, DeepSeek, Qwen), and compares against a relevant baseline (Anthropic's Skill Creator). The ablation studies systematically remove individual components (task-specific checklists, checklist refinement, answer clustering, self-consistency) and demonstrate their contributions.

However, several methodological concerns arise:

Single baseline comparison: The paper only compares against Anthropic's Skill Creator. Given the rich related work on skill discovery (cited extensively), additional baselines—such as simple self-refinement, experience replay, or other unsupervised skill methods—would strengthen claims of novelty.

Verifier implementation details: The Checklist Agent and Skill Manager both use powerful models (Qwen3.5-397B and Claude Sonnet 4.6 respectively). The reliance on frontier models for the "unsupervised" components raises questions about whether the improvements stem from the framework design or from additional compute/model capacity.

Statistical significance: No error bars or confidence intervals are reported. With single-run evaluations, it's difficult to assess reliability of the reported improvements.

Exploration/test split: Using only 25% of data for exploration and 75% for testing is somewhat unusual; the sensitivity to this ratio is not analyzed.
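One way to address the statistical-significance concern above would be a percentile bootstrap over per-task score differences (skill-augmented minus baseline). The sketch below is a generic illustration, not something the paper reports; the function name and the sample differences are made up.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(diffs: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-task differences."""
    if not diffs:
        raise ValueError("need at least one difference")
    rng = random.Random(seed)
    n = len(diffs)
    # Resample tasks with replacement and record the mean each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Made-up per-task improvements, for illustration only (not from the paper).
example_diffs = [0.10, 0.20, 0.05, 0.30, 0.15, 0.10]
low, high = bootstrap_ci(example_diffs)
```

An interval that excludes zero would suggest the improvement is not an artifact of which tasks happened to land in the test split, although it would not capture run-to-run variance from sampling the model itself.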

The iterative refinement analysis (Figure 3) is informative, showing that improvements are not monotonic and that later iterations can be harmful—this is an honest and useful finding. The alternating optimization between the Data-Analytic Agent and Checklist Agent is well-motivated as addressing verifier overfitting.

3. Potential Impact

The practical impact could be significant in several ways:

Cost reduction: The unsupervised nature eliminates the need for expensive analytical annotations, which is a genuine bottleneck in deploying data-analysis agents at scale.

Model-agnostic transferability: Skills discovered using one model can benefit others (demonstrated in the cross-model analysis), suggesting a reusable knowledge infrastructure.

Efficiency gains: Table V shows that discovered skills reduce token consumption by 40-73% while improving accuracy, making deployment more cost-effective.

Industry relevance: Data analysis automation is a major commercial application of LLMs, and lightweight skill augmentation without fine-tuning is attractive for enterprise deployment.

The framework could influence adjacent fields including automated report generation, scientific data exploration, and business intelligence automation.

4. Timeliness & Relevance

This paper is highly timely. It addresses the emerging paradigm of LLM agent skills (citing Anthropic's recent skill framework from 2026), and the broader trend toward inference-time adaptation without parameter updates. The explosion of data-analysis agent benchmarks and frameworks in 2025-2026 creates clear demand for methods to improve these agents cheaply. The unsupervised angle is particularly relevant given the difficulty and cost of creating ground-truth annotations for complex analytical tasks.

5. Strengths & Limitations

Strengths:

Well-motivated problem: The two challenges (expensive annotation, heterogeneous success criteria) are clearly articulated and genuine.

Principled dual-verifier design: The distinction between report-style and reasoning-style tasks with tailored verification approaches is well-conceived.

Comprehensive analysis: The paper includes granularity analysis, cross-model transfer, supervised comparison, and cost analysis—providing a thorough empirical picture.

Contrastive checklist refinement: The alternating optimization between report generation and checklist generation is an elegant solution to verifier overfitting.

Strong empirical results: 9.71% and 32.30% mean improvements across four models are substantial, particularly the reasoning-task gains.

Limitations:

Narrow baseline comparison: Only one baseline (Skill Creator) is compared. The paper would benefit from comparing against prompt optimization, self-reflection, or other unsupervised learning approaches.

Scalability questions: The framework requires multiple LLM calls (trajectory sampling, checklist generation, scoring, skill distillation) across iterations. The total compute cost of the discovery phase is not clearly reported.

Generalizability of verifier design: The two verifier types are hand-designed for specific task formats. Extending to other analytical modalities (e.g., visualization quality, data cleaning) would require new verifier designs.

Potential circularity: Using LLM-generated checklists to evaluate LLM-generated reports introduces potential biases, especially when both share similar knowledge distributions.

Limited theoretical grounding: The paper lacks formal analysis of when/why the unsupervised signals approximate ground-truth quality signals.

Reasoning task ceiling: Figure 3c shows that accuracy on reasoning tasks plateaus quickly while self-consistency increases, suggesting the method may amplify confident-but-wrong answers.

Additional Observations

The paper's framing as "unsupervised" is slightly overstated—the framework still requires substantial architectural decisions, prompt engineering, and model selection that encode human knowledge. The comparison with supervised discovery (Figure 4c) is informative but limited to small label counts; the crossover point where supervision becomes clearly superior is not well characterized.

The writing is generally clear, though the mathematical formalization (Section II) adds little beyond what the system description already conveys.

Rating: 6.5 / 10

Significance 7 · Rigor 6 · Novelty 6.5 · Clarity 7

Generated Jun 5, 2026

Comparison History (15)

vs. A Geometric Account of Activation Steering through Angle-Norm Decomposition

gemini-3.1 · 6/8/2026

Paper 1 investigates the fundamental mechanisms of activation steering and hidden state representations in LLMs. By disentangling angular and radial components, it provides foundational insights into LLM interpretability and control. These fundamental insights are likely to have a broader and longer-lasting scientific impact across AI safety and alignment compared to Paper 2, which focuses on a more applied, domain-specific framework for data analysis agents.

vs. Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

claude-opus-4.6 · 6/6/2026

Paper 1 addresses a fundamental bottleneck in brain decoding—scarce labeled neural data—with a novel data augmentation approach using large-scale pretrained encoding models. The demonstration of zero-shot brain-to-image decoding and up to 68% improvement in retrieval accuracy is striking. It bridges neuroscience and AI with broad implications for brain-computer interfaces and clinical applications. Paper 2, while solid, represents an incremental advance in LLM-based agent skill discovery with narrower applicability. Paper 1's cross-disciplinary impact, novel paradigm of synthetic fMRI augmentation, and potential for real-world neurotechnology applications give it higher estimated impact.

vs. Universal Quantum Transformer

claude-opus-4.6 · 6/6/2026

The Universal Quantum Transformer proposes a fundamentally new architecture bridging quantum computing and AI, addressing core limitations of classical transformers for exact reasoning. If validated, it would have transformative cross-disciplinary impact spanning quantum computing, machine learning, and mathematics. Its claims of exponential advantages and demonstration on real quantum hardware are highly ambitious. While Paper 2 offers solid incremental improvements in agentic AI skill discovery, its scope is narrower and more incremental. Paper 1's novelty and potential paradigm-shifting nature give it higher estimated impact, despite the risk that some claims may be overstated.

vs. Answer Presence Drives RAG Rewriting Gains

claude-opus-4.6 · 6/6/2026

Paper 2 (DataCOPE) proposes a novel framework for unsupervised skill discovery in data-analytic agents, offering a constructive contribution with strong empirical gains (9.71% and 32.30% improvements) across multiple settings. It addresses a timely problem in LLM-based agentic systems with broad applicability. Paper 1, while methodologically rigorous in its causal analysis of RAG rewriting gains, is primarily a diagnostic/audit contribution that does not propose new methods or mitigations. Its impact is more narrow—clarifying an existing phenomenon rather than enabling new capabilities. Paper 2's constructive framework with practical applications gives it broader potential impact.

vs. Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

gemini-3.1 · 6/6/2026

Paper 2 demonstrates higher potential impact by addressing a critical bottleneck in life-critical systems: the misalignment between XAI outputs and autonomous driving safety standards. While Paper 1 offers a valuable framework for LLM agents, Paper 2 bridges machine learning, systems engineering, and regulatory compliance. By formally deriving a rubric to evaluate XAI admissibility for safety assurance, it provides a foundational framework with immediate, high-stakes real-world applications. Its structural approach to the 'evidence-type gap' will likely heavily influence both future XAI research directions and the practical deployment and regulation of autonomous vehicles.

vs. Beyond Vector Similarity: A Structural Analysis of Graph-Augmented Retrieval for Industrial Knowledge Graphs

gpt-5.2 · 6/6/2026

Paper 2 has higher estimated impact due to broader applicability and timeliness: unsupervised skill discovery for agentic data analysis generalizes across domains, datasets, and model backends, aligning with current trends in LLM agents and inference-time augmentation. Its verifier-guided framework (multiple verifier instantiations) is a reusable methodological contribution with clear real-world use in analytics automation. Paper 1 is insightful for graph-augmented RAG and tool/operator framing, but the evaluation is relatively small and domain-specific (46-node KG, 23 queries), likely limiting breadth despite good novelty.

vs. ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

gpt-5.2 · 6/6/2026

Paper 2 (DataCOPE) is likely higher impact due to greater novelty and broader applicability: it proposes an unsupervised, verifier-guided framework for discovering reusable skills from unlabeled exploration—addressing a key bottleneck for agentic data analysis where supervision is costly and criteria vary. Its modular verifier/skill-manager design can transfer across analytical formats and domains, with clear real-world relevance to automated analytics. Paper 1 is strong and timely but is a more incremental advance on inference-time reasoning/memory frameworks with narrower scope and reliance on benchmark gains.

vs. Parthenon Law: A Self-Evolving Legal-Agent Framework

claude-opus-4.6 · 6/5/2026

Paper 1 (DataCOPE) presents a more general and broadly applicable framework for unsupervised skill discovery applicable to any data-analytic agent, with clear quantitative improvements (9.71% and 32.30%) across multiple settings. Its unsupervised approach to skill discovery without labeled data addresses a fundamental challenge in agentic AI. Paper 2 (Parthenon) is valuable but more domain-specific (legal), limiting its breadth of impact. DataCOPE's methodological contributions—contrastive skill distillation, adaptive checklist verification, answer agreement verification—are transferable across many domains, giving it higher potential for broad scientific influence.

vs. Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

gpt-5.2 · 6/5/2026

Paper 1 likely has higher impact due to clearer novelty with an empirically validated, generally applicable unsupervised skill-discovery framework for data-analysis agents. It addresses a timely, widely relevant problem (agentic data analysis) and demonstrates substantial gains across two task families with concrete instantiations and evaluations, suggesting methodological rigor and reproducibility. Paper 2 is ambitious and potentially useful for systems engineering, but relies heavily on architectural claims and formal properties that may be narrower in applicability (typed KG pipelines) and less validated empirically in the abstract, making near-term scientific uptake less certain.

vs. Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

claude-opus-4.6 · 6/5/2026

Paper 2 (DataCOPE) introduces a novel unsupervised framework for skill discovery in data-analytic agents with strong empirical improvements (9.71% and 32.30%), addressing the practical and timely challenge of improving LLM agents without parameter updates. Its methodological contributions—contrastive skill distillation, adaptive checklist verification, and answer agreement verification—are broadly applicable across analytical tasks. Paper 1, while addressing an interesting gap in chronological reasoning benchmarks, is primarily a diagnostic benchmark contribution with findings (shortcut biases) that, while useful, have narrower methodological novelty and more limited downstream impact.

vs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a fundamental question about LLM-driven program evolution that has broad implications across multiple fields (evolutionary computation, program synthesis, AI-driven search). Its finding that LLMs exhibit systematic convergence bias is a novel, rigorous empirical contribution that challenges assumptions underlying many LLM-based optimization systems (e.g., FunSearch, EvoPrompting). This insight is widely applicable and timely. Paper 1, while showing solid improvements on data analysis benchmarks, is more incremental and narrowly focused on a specific application domain with a complex multi-component framework.

vs. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

gpt-5.2 · 6/5/2026

Paper 1 likely has higher impact due to greater novelty and broader applicability: it proposes an unsupervised, verifier-guided framework for skill discovery that can generalize across multiple data-analysis formats (report-style and reasoning-style), addressing a key bottleneck (expensive supervision) in agent improvement. Its methodological contribution (trajectory-derived verifier signals + iterative skill distillation) is more foundational and transferable to many agentic workflows beyond math. Paper 2 is timely and useful but builds on a more common critic/multi-agent pattern and is evaluated primarily on GSM8K, limiting breadth.

vs. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact: it introduces a more novel, general framework (unsupervised verifier-guided skill discovery) applicable across multiple agentic data-analysis formats, with clear methodological components and broad relevance to LLM agents, self-improvement, and unsupervised learning. Its potential applications span many domains where analytical agents are used, and it is timely given rapid interest in agentic AI and inference-time augmentation. Paper 2 is valuable and applied, but is a narrower empirical adaptation of existing memory-augmented trajectory models to AIS vessel data, with more limited cross-field breadth and novelty.

vs. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

claude-opus-4.6 · 6/5/2026

QCFuse addresses a fundamental efficiency bottleneck in RAG serving—a critical infrastructure problem affecting widespread LLM deployment. It offers a principled solution (compressed-view query-aware selection) with strong empirical results (1.7x speedup with no quality loss) across multiple models and datasets. The work has immediate practical impact on LLM serving systems. Paper 2, while solid, addresses a narrower problem (skill discovery for data-analytic agents) with less generalizable methodology. QCFuse's contribution to the RAG/LLM serving infrastructure has broader applicability and timeliness given the explosive growth of RAG systems.

vs. Plan First, Judge Later, Run Better: A DMAIC-Inspired Agentic System for Industrial Anomaly Detection

gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact: it introduces a broadly applicable unsupervised skill discovery framework for data-analytic agents, addressing a core, general problem (learning reusable skills without labels) with a modular verifier-guided method adaptable to multiple analysis formats. This is novel and timely for agentic LLM research, and its concepts (trajectory-derived verifier signals, contrastive skill distillation) can transfer across domains beyond data analysis. Paper 2 is strong and application-relevant for industrial anomaly detection, but is more domain-specific and system/engineering-focused, limiting breadth despite large gains.