SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan

May 18, 2026

arXiv:2605.18630v1

cs.AI (primary), physics.comp-ph

#1188 of 2292 · Artificial Intelligence

Tournament Score: 1407±43

Win Rate: 52%

Rating: 7.2/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SCICONVBENCH

1. Core Contribution

SCICONVBENCH addresses a genuinely underexplored gap in LLM evaluation for scientific applications: the upstream task formulation stage where an ill-posed or incomplete user request must be refined into a well-specified scientific problem through conversation. While numerous benchmarks evaluate LLMs on scientific QA, code generation, and tool use *given* a well-posed problem, this work formalizes the prerequisite conversational step. The benchmark spans four computational science domains (fluid mechanics, solid mechanics, materials science, PDEs) and targets two complementary capabilities: disambiguation (eliciting missing information) and inconsistency resolution (detecting and correcting contradictions).

The key conceptual innovation is the decomposition of resolution into Final Resolution Rate (FRR) vs. Conversation-Grounded Resolution Rate (CGRR), which reveals a pervasive "silent resolution" phenomenon—models produce correct final specifications without having explicitly clarified the issue with users. This distinction is both scientifically important (reproducibility risk) and practically actionable.
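To make the FRR/CGRR distinction concrete, here is a minimal sketch of how the two rates and their gap could be computed from per-case verdicts. The `CaseResult` record, its field names, and the rule that a case counts toward CGRR only when it is also resolved in the final specification are assumptions for illustration; the benchmark's actual scoring rules may differ.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical per-case judge verdicts; field names are illustrative only.
    final_resolved: bool          # final specification matches the reference
    conversation_grounded: bool   # issue was explicitly raised and settled in dialogue


def resolution_rates(cases):
    """Return (FRR, CGRR, gap) as fractions of all cases."""
    n = len(cases)
    if n == 0:
        raise ValueError("need at least one case")
    frr = sum(c.final_resolved for c in cases) / n
    # Assumption: a case counts toward CGRR only if it is also resolved
    # in the final specification.
    cgrr = sum(c.final_resolved and c.conversation_grounded for c in cases) / n
    return frr, cgrr, frr - cgrr


cases = [
    CaseResult(True, True),
    CaseResult(True, False),    # correct final spec, never clarified: silent resolution
    CaseResult(False, False),
    CaseResult(True, True),
]
frr, cgrr, gap = resolution_rates(cases)
print(f"FRR={frr:.2f} CGRR={cgrr:.2f} gap={gap:.2f}")
```

Under these assumptions the gap is the share of cases whose final specification is correct without the issue ever having been settled in the conversation.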

2. Methodological Rigor

Strengths:

The structured task ontology (Equation 1) provides systematic coverage of scientific specification components, enabling fine-grained component-level analysis.

The evaluation framework is multi-layered: case-level rates, component-level rates, and diagnostic axes (Capability, Robustness, Usability) via Pareto analysis.

Extensive robustness checks are conducted: judge ablation across three LLM judges with human validation (80-case annotated subset), user-simulator ablation (three different LLMs), and prompt-paraphrase ablation (three variants). Results are stable across all three ablations.

The judge-human agreement is reasonable for headline metrics (κ=0.64-0.70 for FRR, κ=0.47-0.55 for CGRR), though CGRR agreement is only moderate.
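For readers who want to interpret the κ values quoted above, the following is a small sketch of Cohen's kappa for two raters. It assumes that κ here denotes Cohen's kappa computed on binary per-case labels, which the assessment does not state explicitly; the label lists are made up for illustration.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of cases")
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of cases where both raters give the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_e = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so return 1.0 by convention here.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up labels: 1 = "resolved", 0 = "not resolved".
judge = [1, 1, 0, 1, 0, 0, 1, 0, 1, 1]
human = [1, 0, 0, 1, 0, 1, 1, 0, 1, 1]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

Kappa discounts the agreement expected by chance, so 80% raw agreement in this toy example corresponds to a kappa of about 0.58.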

Weaknesses:

The human annotation relies on a single expert annotator rather than multiple independent raters, meaning inter-annotator reliability cannot be assessed. This is acknowledged but is a meaningful limitation for a benchmark paper.

The 80-case human-validated subset is relatively small compared to the 1,142-case benchmark.

Judge agreement on some secondary metrics (DR, IC, MC) is weak (κ as low as 0.05-0.22), which undermines confidence in the diagnostic axes that depend on these metrics.

The LLM-simulated user introduces a confound: the user simulator's instruction to say "make a reasonable assumption" when the reference doesn't specify something may not reflect real scientific user behavior, where users might provide additional context or push back.

3. Potential Impact

Direct impact: The benchmark fills a clear gap. Table 1's comparison with CLAMBER (86.1% resolution rate on CLAMBER vs. 18.2% on SCICONVBENCH for the same model) compellingly demonstrates that general-domain clarification benchmarks vastly overestimate model capability in scientific contexts. This alone justifies a domain-specific benchmark.

Broader implications:

The silent resolution finding has implications beyond benchmarking: it reveals a fundamental auditability problem for scientific AI assistants. Models that silently fill in boundary conditions, solver choices, or constitutive assumptions create reproducibility risks in computational science workflows.

The finding that no single model dominates both disambiguation and inconsistency resolution suggests these are genuinely different capabilities, which could inform model training strategies.

The ontology-based framework is generalizable to other scientific domains (e.g., chemistry simulation, bioinformatics pipeline specification).

Limitations on impact: The benchmark is restricted to four domains, English text, and undergraduate-to-early-graduate difficulty. There is no end-to-end validation showing that improved CGRR actually leads to better downstream simulation outcomes, though Figure 1 provides a compelling motivating example.

4. Timeliness & Relevance

This work is highly timely. The deployment of LLM-based scientific assistants (OpenFOAMGPT, MetaOpenFOAM, FEABench, etc.) is accelerating, and the assumption that users will provide well-posed problems is clearly unrealistic. The paper correctly identifies that evaluation has been focused on downstream capabilities while neglecting the upstream formulation step. The rapid proliferation of agentic scientific workflows (cited extensively in Section 2) makes this evaluation gap increasingly consequential.

5. Strengths & Limitations

Key strengths:

Novel evaluation paradigm: The FRR/CGRR/SRR decomposition is an elegant and practically meaningful framework that captures a failure mode invisible to standard metrics.

Compelling empirical findings: The persistent FRR-CGRR gap (averaging 8.2pp for disambiguation, 14.7pp for inconsistency) across all models, and the dramatic drop from CLAMBER performance, are striking results.

Comprehensive ablations: The systematic variation of judges, simulators, and prompts (Tables 2, 5, 6, 7) substantially strengthens confidence in the findings.

Actionable component-level analysis: Identifying numerics/solver choices and governing physics as the most fragile components provides clear direction for improvement.

Notable weaknesses:

Dataset scale: 1,142 cases is modest, and the per-domain per-task cell sizes are sometimes small (e.g., 61 PDE disambiguation cases).

Manual construction: The prompt-by-prompt transformation approach, while ensuring quality, limits scalability and raises questions about coverage and potential biases.

No real-user validation: The entire benchmark operates with simulated users; how the findings transfer to actual scientist interactions is unknown.

Evaluation metric limitations: CGRR's binary case-level scoring (all-or-nothing across all planted issues) may be overly strict for complex cases with many planted issues, potentially underestimating partial successes.

Missing baselines: No retrieval-augmented or few-shot baselines are reported; it would be informative to see whether domain-specific prompting (e.g., providing a checklist of ontology components) substantially improves performance.
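The concern about all-or-nothing case-level scoring raised in the list above can be illustrated with a short sketch that compares strict scoring against a partial-credit alternative. Both functions and the example data are hypothetical; the partial-credit variant is not something the paper is reported to use.

```python
def strict_case_score(issue_flags):
    """All-or-nothing: 1.0 only if every planted issue in the case is resolved."""
    return 1.0 if issue_flags and all(issue_flags) else 0.0


def partial_case_score(issue_flags):
    """Hypothetical alternative: fraction of planted issues resolved."""
    return sum(issue_flags) / len(issue_flags) if issue_flags else 0.0


def mean(values):
    return sum(values) / len(values)


# Each inner list marks, per planted issue, whether it was resolved (made-up data).
cases = [
    [True, True, True],         # fully resolved
    [True, True, False, True],  # three of four issues resolved
    [False, False],             # nothing resolved
]
strict_rate = mean([strict_case_score(c) for c in cases])
partial_rate = mean([partial_case_score(c) for c in cases])
print(f"strict={strict_rate:.2f} partial={partial_rate:.2f}")
```

In this toy example the second case contributes nothing under strict scoring but 0.75 under partial credit, which is the kind of underestimation the weakness describes.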

Additional Observations

The paper's dataset contribution (structured ontology annotations, conversation transcripts, rubrics) combined with code release should enable meaningful follow-up work. The finding that Claude Sonnet 4.6 achieves 31.5% FRR but 0.0% CGRR on PDE inconsistency is a particularly stark illustration of the benchmark's diagnostic value. The work would benefit from a follow-up demonstrating that CGRR-optimized models produce better downstream simulation outcomes.

Rating: 7.2/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (21)

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical gap in deploying AI for computational science by benchmarking multi-turn clarification for ill-posed problems. This has broad implications for accelerating scientific discovery across multiple disciplines. Paper 2, while demonstrating strong real-world financial impact in digital advertising, focuses on a much narrower commercial application rather than advancing foundational scientific research.

vs. Evaluating the Utility of Personal Health Records in Personalized Health AI

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to greater novelty and broader, field-general relevance: it introduces a new benchmark and rubric to evaluate an under-measured capability (multi-turn clarification for ill-posed scientific tasks), with clear methodological structure (ontology + multi-dimensional scoring) and actionable failure analyses. Benchmarks often become community standards, influencing model design and evaluation across domains beyond computational science. Paper 2 is timely and high-application, but is more incremental (single-model study, limited clinician subset) and more domain-specific, with impact constrained by deployment/regulatory and data-access barriers.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gemini-3.1 · 5/20/2026

Paper 2 (MOCHA) presents a fundamental algorithmic advancement for LLM agent optimization. By framing agent skill generation as a multi-objective problem with real-world constraints (e.g., context windows) and applying Chebyshev scalarization, MOCHA offers a broad, domain-agnostic solution. In contrast, Paper 1 introduces a specialized benchmark for computational science. While valuable, benchmarks typically have a narrower scope and shorter shelf-life compared to foundational optimization methods. MOCHA's potential to improve agent architectures across diverse fields gives it significantly higher breadth of impact and broader real-world applicability.

vs. AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

claude-opus-4.6 · 5/20/2026

SCICONVBENCH addresses a fundamental and underexplored gap in LLM evaluation—the ability to handle ill-posed scientific problems through multi-turn dialogue—which is highly relevant as LLMs are increasingly used as scientific assistants. It introduces a novel benchmark paradigm (clarification and disambiguation) that could influence evaluation practices across many scientific domains. Paper 1, while technically sound, addresses a narrower optimization problem (token reduction for GUI agents) with incremental improvements. Paper 2's broader applicability, novel evaluation framework, and identification of critical LLM limitations (silent assumptions, implicit repairs) give it higher potential to shape future research directions.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) is likely to have higher scientific impact due to its methodological novelty: a probabilistic, multi-trajectory extension of recursive reasoning with variational training, enabling alternative hypotheses and inference-time scaling. This is broadly applicable across reasoning, generative modeling, and scalable inference, with potential downstream influence on model architectures beyond a single domain. Paper 1 is timely and useful but is a benchmark focused on a specific interaction setting in computational science, likely yielding narrower cross-field impact compared to a new general modeling framework.

vs. Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

claude-opus-4.6 · 5/19/2026

Paper 1 provides a comprehensive, first-of-its-kind survey unifying AI approaches for inverse PDE problems across three major categories (inverse problems, inverse design, and control), covering broad applications from medical imaging to aerodynamics. Its scope, systematic organization, and coverage of open challenges position it as a foundational reference for a large and growing research community. Paper 2, while introducing a valuable benchmark for LLM-based scientific assistants, addresses a narrower problem (multi-turn clarification) with more limited immediate applicability and a smaller target audience.

vs. Learning Lifted Action Models from Traces with Minimal Information About Actions and States

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact: it introduces a new benchmark addressing a timely, broadly relevant gap in LLM evaluation—multi-turn clarification and task formulation—critical for deploying LLMs as reliable scientific assistants. The benchmark spans multiple computational science domains, provides an ontology and rubric-based evaluation, and releases code/data, enabling community-wide adoption and follow-on work across NLP, HCI, and scientific computing. Paper 2 advances planning theory with learning guarantees under partial observability, but the impact is more specialized to symbolic planning/action-model learning and may see narrower cross-field uptake.

vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in deploying LLMs for scientific discovery: refining ill-posed requests into actionable tasks. By focusing on computational science domains like fluid mechanics and materials science, it provides a foundation for AI assistants that can accelerate actual research. While Paper 2 offers a valuable framework for commercial e-commerce agents, Paper 1's direct contribution to enhancing scientific methodology and its potential to catalyze cross-disciplinary scientific breakthroughs give it a broader and more profound scientific impact.

vs. EXG: Self-Evolving Agents with Experience Graphs

gemini-3.1 · 5/19/2026

Paper 2 introduces a foundational framework (Experience Graphs) for self-evolving LLM agents, addressing a critical bottleneck in AI: the inability of agents to systematically learn from experience over time. Its plug-and-play nature and broad applicability across general reasoning and coding tasks give it significantly wider cross-disciplinary impact. In contrast, Paper 1, while valuable, is a domain-specific benchmark constrained to computational science. The generalized architectural innovation in Paper 2 promises broader adoption, higher potential for downstream real-world applications, and greater overall scientific impact across the AI community.

vs. Responsible Agentic AI Requires Explicit Provenance

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact: it targets a broad, urgent, cross-domain problem (responsibility and trust in agentic AI) with a unifying proposal—explicit, computable provenance—supported by formal constructs (causal attribution function, responsibility tensor) and lifecycle framing, plus preliminary experiments and an incident analysis. Its applications span many high-stakes deployments (software engineering and beyond) and align with fast-moving governance needs. Paper 1 is novel and rigorous as a benchmark for scientific task-clarification, but its scope is narrower (computational science domains) and impact is primarily within LLM evaluation rather than system-level accountability.

vs. AcuityBench: Evaluating Clinical Acuity Identification and Uncertainty Alignment

gemini-3.1 · 5/19/2026

While both papers introduce valuable benchmarks, Paper 2 addresses clinical acuity and medical triage, a highly safety-critical capability for LLMs. The potential real-world impact of ensuring AI models safely guide patients to the correct level of medical care carries profound societal implications. By evaluating uncertainty alignment in high-stakes healthcare scenarios, Paper 2 is likely to have broader, more immediate relevance across the rapidly expanding field of medical AI and AI safety compared to the more domain-specific focus of computational science in Paper 1.

vs. Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamental and timely AI safety concern—the robustness of concept erasure in diffusion models—with a novel black-box multi-agent framework. Its finding that concept erasure merely suppresses rather than eliminates concepts has broad implications for AI safety policy and practice. The trajectory-based analysis provides deep mechanistic insight into diffusion model behavior. Paper 1, while valuable, targets a narrower niche (LLM benchmarking for scientific task formulation) with incremental contribution to the benchmark landscape. Paper 2's safety implications give it broader cross-field impact and greater urgency.

vs. State Contamination in Memory-Augmented LLM Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical, broadly applicable safety vulnerability in memory-augmented LLM agents ('memory laundering'). Its findings on hidden state contamination impact the fundamental design and security of all long-horizon AI agents. Paper 1 offers a valuable but more niche benchmark for computational science, making Paper 2's potential breadth of impact and conceptual innovation significantly higher across the broader AI safety and capabilities community.

vs. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

claude-opus-4.6 · 5/19/2026

SCICONVBENCH addresses a fundamental and underexplored gap in LLM evaluation: the ability to handle ill-posed scientific problems through multi-turn clarification dialogue. This is highly novel, methodologically rigorous (structured ontology, rubric-based evaluation), and broadly applicable across computational science. It provides a reusable benchmark with open-source code/data. Paper 1, while practically useful, primarily describes a specific framework (buddyMe) with modest empirical findings from limited case studies. Paper 2's identification that even frontier models resolve only ~53% of disambiguation cases reveals an important limitation with wide implications for AI-assisted science.

vs. Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to a concrete, timely benchmark addressing a clear gap in evaluating LLM-based scientific assistants: multi-turn clarification for ill-posed tasks. It provides datasets, an ontology, and rubric-based evaluation, enabling reproducible comparisons and driving measurable progress across models and tool-using agents. Its immediate applicability spans computational science domains and broader human-AI interaction for specification elicitation. Paper 1 is conceptually ambitious and potentially broad, but as a position paper its impact depends more on future adoption and empirical validation beyond a single FL case study.

vs. CatalyticMLLM: A Graph-Text Multimodal Large Language Model for Catalytic Materials

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a unified multimodal LLM framework (QE-Catalytic-V2) that integrates property prediction and inverse structural design for catalytic materials into a single model with shared representations. This addresses a fundamental challenge in computational materials science—the inconsistency between generation and evaluation models in closed-loop optimization. It has direct real-world applications in catalyst discovery and materials design. Paper 1, while addressing an important gap in LLM evaluation for scientific dialogue, is primarily a benchmark contribution with more limited scope (four computational science domains) and identifies problems without solving them. Paper 2's methodological innovation in unifying two traditionally separate tasks has broader transformative potential for accelerating materials discovery.

vs. CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

gpt-5.2 · 5/19/2026

Paper 2 (SCICONVBENCH) has higher potential scientific impact due to stronger real-world applicability and timeliness: it targets an upstream bottleneck for deploying LLMs as reliable scientific assistants—multi-turn clarification and specification of ill-posed computational science tasks—across multiple engineering/science domains with an ontology and rubric-based evaluation. Its scope connects directly to practical workflows (simulation setup, PDE problem formulation) and can influence tool-building, human-AI interaction, and scientific computing broadly. Paper 1 is novel and valuable for affective cognition diagnostics, but its applications are comparatively narrower and less directly tied to high-stakes scientific/engineering outcomes.

vs. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

gpt-5.2 · 5/19/2026

Paper 2 (NeuroAgent) has higher likely scientific impact due to a concrete, end-to-end system with immediate real-world utility in a major applied domain (neuroimaging). It demonstrates methodological rigor via large-scale evaluation (1,470 subjects), ablations across LLM backends, measurable preprocessing correctness, and downstream task gains (AD AUC 0.9518). Its agentic generate-execute-validate design can generalize to other scientific pipelines. Paper 1 is novel and timely as a benchmark for clarification, but its impact is more indirect (evaluation infrastructure) and narrower in near-term application.

vs. New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

claude-opus-4.6 · 5/19/2026

SCICONVBENCH addresses a critical and timely gap in LLM evaluation for scientific applications—the ability to handle ill-posed problems through multi-turn dialogue. This has broad impact across AI-for-science, LLM benchmarking, and human-AI interaction communities. The finding that even frontier models resolve only ~53% of disambiguation cases highlights a fundamental limitation with wide implications. Paper 2 makes a solid but incremental contribution to zeroth-order optimization with sparsity constraints, addressing a niche technical limitation. Paper 1's broader applicability, timeliness given the LLM revolution, and potential to shape how scientific AI assistants are developed give it higher impact potential.

vs. Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

gpt-5.2 · 5/19/2026

Paper 1 is likely higher impact due to its novelty in targeting an under-evaluated but crucial capability for scientific LLM assistants: multi-turn clarification and inconsistency resolution before task execution. It provides a reusable benchmark, ontology, and rubric-based evaluation framework across multiple computational science domains, enabling broad, timely adoption and standardization. Paper 2 presents an incremental PPO architecture tweak (shared actor-critic backbone) validated on a specific multi-UAV coverage setting; it has clear applications but narrower scope and less methodological novelty relative to existing RL literature.