Toward AI VIS Co-Scientists: A General and End-to-End Agent Harness for Solving Complex Data Visualization Tasks

Haichao Miao, Zhimin Li, Kuangshi Ai, Kaiyuan Tang, Chaoli Wang, Peer-Timo Bremer, Shusen Liu

May 20, 2026

arXiv:2605.21825v1

cs.AI (primary), cs.HC

#1500 of 2292 · Artificial Intelligence

Tournament Score: 1373 ± 48 (displayed range 1050 to 1800)

Win Rate: 56%

Rating: 5/10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 6.5

Abstract

The ability to inspect, interpret, and communicate complex data is crucial for virtually any scientific endeavor, but often requires significant expertise outside the core domain, ranging from data management and analysis to visualization design and implementation. We present an end-to-end agentic harness that, based on only the data and a high-level description of the tasks, independently designs custom visual analysis applications (VIS Apps). This represents an important step towards a general AI co-scientist, envisioned by many as an autonomous system that can execute long-horizon tasks based on high-level directions. Our proposed VIS co-scientist is an essential component of this broader AI co-scientist vision: a harness that can autonomously analyze data and design visualization solutions using a collection of agents and specialized skills that coordinate exploratory analysis, planning, environment configuration, implementation, interface validation, and, most importantly, evaluation of overall task completion. Each stage produces document and instruction artifacts that guide downstream work and enable iterative refinement. We validate this approach on IEEE SciVis Contests spanning multiple science and engineering fields. These contests serve as ideal proving grounds because they encode real-world complexity: ambiguous requirements, diverse data modalities, design trade-offs, and task-driven validation. Given only the data and target tasks, our system autonomously produces functional single-page VIS Apps with verified linked-view behavior, highly customized to domain experts' specified tasks and needs.

AI Impact Assessments (1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces the concept of a "VIS co-scientist" — a multi-agent AI system that autonomously generates complete, interactive visualization applications (VIS Apps) from raw data and high-level task descriptions. The system orchestrates multiple specialized subagents (Exploratory Data Analyzer, Planner, Environment Builder, VIS Designer, Evaluator) through a structured harness that produces intermediate artifacts, enabling iterative refinement and traceability. The key novelty is the end-to-end pipeline: from data profiling through environment setup, visualization design, browser-based validation, to task-completion evaluation — all without human intervention beyond the initial prompt.
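The artifact-passing structure described above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the stage names follow the subagents listed in the assessment, but the `Artifact` type, the `make_stage` helper, and the specific dependencies between stages are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Artifact:
    """A document produced by one stage and read by later stages."""
    stage: str
    content: str


Stage = Callable[[Dict[str, Artifact]], Artifact]


def make_stage(name: str, needs: List[str]) -> Stage:
    """Build a placeholder stage that records which upstream artifacts it read.

    A real stage would call an LLM-backed subagent; this stub only
    demonstrates the handoff and traceability pattern.
    """
    def run(artifacts: Dict[str, Artifact]) -> Artifact:
        missing = [n for n in needs if n not in artifacts]
        if missing:
            raise KeyError(f"{name} is missing upstream artifacts: {missing}")
        used = ", ".join(needs) or "raw data and task prompt"
        return Artifact(stage=name, content=f"{name} output derived from {used}")
    return run


# Hypothetical dependency structure, chosen for illustration only.
PIPELINE: List[Tuple[str, Stage]] = [
    ("eda", make_stage("eda", [])),
    ("plan", make_stage("plan", ["eda"])),
    ("environment", make_stage("environment", ["plan"])),
    ("design", make_stage("design", ["eda", "plan", "environment"])),
    ("evaluation", make_stage("evaluation", ["plan", "design"])),
]


def run_pipeline(pipeline: List[Tuple[str, Stage]]) -> Dict[str, Artifact]:
    """Run stages in order, storing each artifact for downstream stages."""
    artifacts: Dict[str, Artifact] = {}
    for name, stage in pipeline:
        artifacts[name] = stage(artifacts)
    return artifacts


if __name__ == "__main__":
    for name, artifact in run_pipeline(PIPELINE).items():
        print(f"{name}: {artifact.content}")
```

Because every stage writes a named artifact, a failed evaluation can in principle be traced back to the upstream document that caused it, which is the traceability benefit the assessment highlights.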

The system is validated on IEEE SciVis Contests (2021–2026), which serve as realistic benchmarks involving ambiguous requirements, heterogeneous data formats, and domain-specific analytical tasks.

Methodological Rigor

The methodology has both strengths and notable gaps. The multi-agent architecture is well-motivated, with clear role separation and artifact-based communication that improves transparency. The use of Playwright-MCP for browser-based validation is a practical choice that enables mechanical verification of linked-view behavior, console errors, and rendering correctness.
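To make the idea of mechanical browser validation concrete, here is a minimal sketch. It uses the Playwright Python library directly rather than Playwright-MCP, which is what the paper reportedly uses, so it should be read as a stand-in. The function names, the selectors, and the "text changes after a click" criterion for linked-view behavior are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class BrowserObservation:
    """What a scripted browser session saw (hypothetical structure)."""
    console_errors: List[str]
    linked_text_before: str
    linked_text_after: str


def collect_observation(url: str, trigger_selector: str,
                        linked_selector: str) -> BrowserObservation:
    """Load the app, click in one view, and record a second view's text.

    Requires `pip install playwright` and `playwright install chromium`.
    Apps that update asynchronously may need an explicit wait after the click.
    """
    from playwright.sync_api import sync_playwright

    errors: List[str] = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Record console messages of type "error" and uncaught page exceptions.
        page.on("console",
                lambda msg: errors.append(msg.text) if msg.type == "error" else None)
        page.on("pageerror", lambda exc: errors.append(str(exc)))
        page.goto(url)
        before = page.locator(linked_selector).text_content() or ""
        page.locator(trigger_selector).click()
        after = page.locator(linked_selector).text_content() or ""
        browser.close()
    return BrowserObservation(errors, before, after)


def validate(obs: BrowserObservation) -> Dict[str, bool]:
    """Turn raw observations into pass/fail checks."""
    return {
        "no_console_errors": not obs.console_errors,
        "linked_view_updated": obs.linked_text_before != obs.linked_text_after,
    }
```

A call such as `validate(collect_observation("http://localhost:8000", "#overview .bar", "#detail"))` would return a dictionary of booleans; the URL and selectors here are placeholders. Checks of this kind establish that an app is mechanically valid, not that its visual encodings are well chosen, which is why the assessment treats expert review as a separate question.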

However, the evaluation is thin for a systems paper claiming significant advances:

1. Limited quantitative evaluation: The structured expert audit involves only 5 author-team members (introducing potential bias), using Likert-scale assessments on a single case study (2025 contest). The other contest years receive only qualitative descriptions ("mechanically valid").

2. No ablation study: The contribution of individual components (EDA agent, Planner, Evaluator feedback loops) is never isolated. The memory system is described but explicitly not evaluated — the authors state results "do not yet depend on prior memory retrieval."

3. Baseline comparison is weak: The comparison is against a "baseline coding agent" (same model without the harness), which predictably produces standalone plots rather than coordinated apps. There is no comparison against other agentic frameworks, human-produced solutions at equivalent effort levels, or even systematic comparison against contest winners.

4. Reproducibility concerns: The code is described as "currently waiting upon code review completion" for open-source release, and the system depends on proprietary models (GPT-5.4 via OpenAI Codex), making independent verification difficult.

Potential Impact

The paper addresses a genuine need: visualization application development is a bottleneck in scientific workflows, requiring expertise in data management, visual encoding design, frontend development, and domain understanding. An autonomous system that produces functional, interactive VIS Apps could democratize data exploration across scientific domains.

The multi-agent harness design pattern — with explicit artifact handoffs, layered validation, and specialized subagents — could influence how agentic systems are designed for other complex, multi-stage creative tasks beyond visualization. The use of SciVis Contests as benchmarks is clever and could establish a standard evaluation paradigm for visualization agents.

However, the practical impact may be limited by several factors: the generated apps lack visual design novelty (acknowledged by the authors), the system struggles with 3D visualization and temporal reasoning, and the token costs (5.5M+ tokens for one run) are substantial. The expert audit scores for visual encoding quality (mean 3.8) and domain insight extraction (as low as 2.4 for some statements) suggest the system is not yet reliable enough for unsupervised scientific use.

Timeliness & Relevance

The paper is highly timely, situated at the intersection of two active trends: AI co-scientist systems (Google's AI co-scientist, AI Scientist v2, etc.) and LLM-based visualization agents. The argument that visualization is an essential but underexplored component of the AI co-scientist vision is well-made and fills a conceptual gap in the literature. The positioning against SciVisAgentBench (from some of the same authors) suggests awareness of the evaluation infrastructure needed.

Strengths

Ambitious scope: End-to-end autonomous generation of interactive, linked-view visualization applications is a meaningful step beyond chart generation or isolated coding tasks.

Architectural clarity: The multi-agent design with explicit artifacts is well-structured and potentially reusable.

Realistic evaluation domain: SciVis Contests provide genuinely complex, multidisciplinary challenges rather than toy benchmarks.

Honest limitations: The paper is forthcoming about weaknesses — lack of creative design, poor 3D/temporal reasoning, unverified memory system.

Practical validation: Browser-based mechanical validation via Playwright is a solid engineering contribution for verification of interactive systems.

Limitations

Evaluation depth: Five self-selected expert reviewers on one primary case study is insufficient for the claims made. The paper would benefit greatly from independent expert evaluation, systematic comparison with human-produced contest entries, and cross-case quantitative metrics.

Novelty vs. engineering: Much of the contribution is systems integration — connecting existing tools (OpenAI Codex, Playwright, MCP connectors) in a structured pipeline. The individual components are not novel; the contribution lies in their composition.

Memory system: Described at length but explicitly unevaluated, making it feel premature for inclusion.

Generalization claims: The system targets single-page web-based VIS Apps. Extension to desktop visualization tools, notebook environments, or domain-specific platforms remains unaddressed.

Cost and scalability: 2+ hours and millions of tokens per run, with no discussion of failure rates across multiple runs or variance in output quality.

Self-evaluation circularity: The Evaluator agent uses the same underlying LLM family to judge outputs, raising questions about whether it can identify failure modes the generation agents share.

Additional Observations

The paper reads more as a short/workshop paper (which it appears to be, given its length and venue positioning) than a full research contribution. Several design choices reference very recent (2026) publications and tools, suggesting rapid development. The plan to submit a VIS co-scientist entry to the 2026 SciVis Contest is an excellent future validation strategy. The conceptual framing of what a VIS co-scientist should be (Figure 2) provides useful scaffolding for the community, even if the current instantiation is preliminary.

Rating: 5/10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 6.5

Generated May 22, 2026

Comparison History (16)

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact due to its broader applicability and stronger real-world utility: an end-to-end agent harness that autonomously builds task-specific visualization apps can benefit many scientific domains and accelerate data-driven discovery. Its validation on IEEE SciVis contests suggests practical relevance and methodological rigor around complex, ambiguous, multi-modal tasks. Paper 1 is timely and novel in EI evaluation with real multi-turn, user-annotated conversations, but its impact is more specialized to LLM assessment and conversational safety/alignment rather than a cross-disciplinary workflow enabler.

vs. AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

claude-opus-4.6 · 5/22/2026

Paper 1 presents a novel, concrete end-to-end agentic system for autonomous data visualization that is validated on real-world benchmarks (IEEE SciVis Contests), demonstrating tangible outputs. It addresses a broadly applicable problem—automating complex visualization tasks—with methodological rigor and clear innovation in multi-agent coordination. Paper 2, while comprehensive, is primarily a review/survey chapter discussing how AI could enhance serious games, offering less novelty and no new system or empirical validation. Paper 1's concrete contribution and broader cross-domain applicability give it higher impact potential.

vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

gemini-3.1 · 5/22/2026

Paper 1 introduces an AI co-scientist framework for autonomous data visualization, addressing a fundamental bottleneck in data analysis across all scientific domains. By automating complex visual analysis and validating on real-world IEEE SciVis contests, it demonstrates broad applicability and high potential to accelerate multidisciplinary research. In contrast, Paper 2 presents a useful software engineering framework for LLM tool deployment, which greatly improves developer efficiency but has limited direct impact on fundamental scientific discovery methodologies. Therefore, Paper 1 exhibits significantly higher scientific innovation and broader potential impact.

vs. Scaling Observation-aware Planning in Uncertain Domains

gemini-3.1 · 5/22/2026

Paper 1 proposes an end-to-end AI agent harness for complex data visualization, directly contributing to the highly relevant 'AI Scientist' paradigm. Its ability to autonomously generate customized visual analysis apps from high-level descriptions offers massive cross-disciplinary applicability across virtually all scientific fields dealing with complex data. While Paper 2 presents impressive methodological improvements in POMDP solving with huge performance gains, its impact is primarily confined to robotics and formal planning. Paper 1's broader real-world utility, cross-domain relevance, and alignment with cutting-edge autonomous AI research give it higher potential scientific impact.

vs. Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

claude-opus-4.6 · 5/22/2026

Paper 2 provides fundamental insights into how LLM agents actually work in optimization tasks, revealing that they rely heavily on pretrained priors rather than feedback or agentic structure. These findings have broad implications across the rapidly growing field of LLM-based agents and optimization systems, challenging common assumptions about agentic AI. Paper 1, while technically impressive in automating visualization pipelines, represents more of an engineering contribution with narrower applicability. Paper 2's controlled experimental methodology and generalizable conclusions about LLM behavior are likely to influence agent design across many domains.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental and broadly applicable problem—safety guarantees for deployed LLM agents—which is timely and critically important as LLM agents proliferate across domains. Its formal, contract-based architecture with compositional probabilistic safety bounds provides a principled theoretical framework that could influence standards across the entire AI safety community. Paper 1, while technically impressive in automating visualization pipelines, addresses a narrower application domain. Paper 2's identification of three open problems and its structural argument for layered safety have broader cross-disciplinary implications for AI deployment at scale.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

claude-opus-4.6 · 5/22/2026

SciCore-Mol addresses a fundamental challenge in integrating molecular/scientific data with LLMs through a modular, generalizable framework with broad implications for drug design, chemical synthesis, and scientific discovery. Its pluggable architecture provides a systematic blueprint applicable beyond chemistry. While Paper 1 presents impressive autonomous visualization capabilities, it targets a narrower problem (visualization generation) with less transformative potential. Paper 2's methodological contribution—bridging discrete text and continuous scientific representations—addresses a deeper scientific computing challenge with wider cross-disciplinary impact and stronger alignment with current AI-for-science trends.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to broader cross-domain applicability and timeliness: an end-to-end agent harness that autonomously builds data-visualization analysis apps can benefit many scientific fields, accelerating exploratory analysis and communication. Its validation on IEEE SciVis contests suggests realistic, task-driven evaluation and generality across modalities. Paper 1 is innovative and potentially high-impact for chemistry/drug discovery, but its primary impact is more domain-constrained to molecular/reaction tasks and depends on integration quality across specialized modules.

vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

claude-opus-4.6 · 5/22/2026

Paper 1 presents a more novel and broadly impactful contribution—an end-to-end agentic system for autonomous data visualization that advances the AI co-scientist vision across multiple scientific domains. It addresses a widely relevant problem (data visualization for scientists), demonstrates generality across diverse fields via SciVis Contest benchmarks, and sits at the intersection of rapidly growing areas (LLM agents, autonomous scientific workflows). Paper 2 applies established DRL techniques (PPO, MLPs) to a well-studied scheduling variant, offering incremental improvements over dispatching rules with narrower applicability to manufacturing/operations research.

vs. S2ED: From Story to Executable Descriptions for Consistency-Aware Story Illustration

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a broader scientific challenge—autonomous data visualization for scientific discovery—with implications across virtually all scientific domains. Its alignment with the 'AI co-scientist' vision and validation on real-world IEEE SciVis contests demonstrates wider applicability. While Paper 1 offers a solid contribution to story illustration consistency, its scope is narrower (creative content generation). Paper 2's end-to-end agentic framework for complex visualization tasks has greater potential to accelerate scientific workflows across multiple fields, giving it higher estimated impact.

vs. Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

gemini-3.1 · 5/22/2026

Paper 1 addresses a fundamental question in AI for science: evaluating when and how coordinated AI agents genuinely improve scientific inference across multiple distinct disciplines. By establishing rigorous cross-domain benchmarks, baselines, and operating regimes, it provides a crucial framework for future AI research. Paper 2 presents a valuable but narrower system focused specifically on automating data visualization tasks. Paper 1's broader methodological contributions and multi-disciplinary scope give it higher potential scientific impact.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gpt-5.2 · 5/22/2026

Paper 2 offers a more novel and broadly applicable systems contribution: an event-sourced, deterministic, reactive-graph substrate for agentic computation with replay, forking, and full lineage. These properties address pressing needs (auditability, reproducibility, governance, debugging) across many domains and agent frameworks, making impact potentially wide and timely. Methodologically, it proposes a clear architectural principle plus a determinism contract, enabling rigorous evaluation via replay/fork tests. Paper 1 is valuable and application-forward but is narrower (VIS-app generation) and more dependent on benchmark-style validation, limiting cross-field substrate impact.

vs. VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

claude-opus-4.6 · 5/22/2026

Paper 1 presents a more broadly impactful contribution: a general-purpose end-to-end agentic framework for autonomous data visualization applicable across virtually all scientific domains. Its validation on diverse IEEE SciVis Contests demonstrates cross-domain generality. The concept of an AI VIS co-scientist aligns with the high-impact trend of autonomous AI research assistants. Paper 2, while practically valuable for EV battery diagnostics, addresses a narrower application domain with more incremental innovation (converting signals to text for LLM reasoning). Paper 1's breadth of applicability and methodological novelty suggest greater scientific impact.

vs. VBFDD-Agent for Electric Vehicle Battery Fault Detection and Diagnosis: Descriptive Text Modeling of Battery Digital Signals

gemini-3.1 · 5/22/2026

Paper 1 presents a generalized AI co-scientist framework for data visualization with cross-disciplinary applicability in almost any scientific field. In contrast, Paper 2, while highly valuable and timely for EV safety, has a much narrower domain focus (battery fault diagnosis). The breadth of impact, combined with the novelty of autonomously generating complex, domain-specific visual analysis applications from high-level tasks, gives Paper 1 a significantly higher potential for widespread scientific impact.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

claude-opus-4.6 · 5/22/2026

LACO addresses a fundamental challenge in collaborative autonomous driving—efficient latent communication between connected vehicles—with a novel training-free paradigm that reduces latency and bandwidth while maintaining performance. It tackles core technical problems (agent identity confusion, communication efficiency) with principled solutions validated in closed-loop simulation. Paper 2, while practical, primarily orchestrates existing LLM agents for visualization generation, representing incremental engineering over rapidly commoditizing AI coding capabilities. LACO's contributions to multi-agent communication and autonomous driving have broader and more lasting scientific impact across robotics, multi-agent systems, and transportation.

vs. Claw AI Lab: An Autonomous Multi-Agent Research Team

gemini-3.1 · 5/22/2026

Paper 1 presents a paradigm-shifting approach by automating the entire research lifecycle (from idea generation to paper writing) through an interactive, multi-agent platform. While Paper 2 offers a strong, domain-agnostic tool for data visualization, Paper 1 has a broader potential impact by fundamentally changing how scientific research is conducted, executed, and managed, offering a general-purpose infrastructure for autonomous scientific discovery.