SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan
Abstract
Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.
AI Impact Assessments
Scientific Impact Assessment: SCICONVBENCH
1. Core Contribution
SCICONVBENCH addresses a genuinely underexplored gap in LLM evaluation for scientific applications: the upstream task formulation stage where an ill-posed or incomplete user request must be refined into a well-specified scientific problem through conversation. While numerous benchmarks evaluate LLMs on scientific QA, code generation, and tool use *given* a well-posed problem, this work formalizes the prerequisite conversational step. The benchmark spans four computational science domains (fluid mechanics, solid mechanics, materials science, PDEs) and targets two complementary capabilities: disambiguation (eliciting missing information) and inconsistency resolution (detecting and correcting contradictions).
The key conceptual innovation is the decomposition of resolution into Final Resolution Rate (FRR) vs. Conversation-Grounded Resolution Rate (CGRR), which reveals a pervasive "silent resolution" phenomenon—models produce correct final specifications without having explicitly clarified the issue with users. This distinction is both scientifically important (reproducibility risk) and practically actionable.
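The FRR/CGRR split can be illustrated with a minimal sketch. It assumes, based only on the description above, that FRR counts cases whose final specification resolves the issue, and that CGRR counts the subset where the issue was also explicitly clarified with the user. The record fields and function name below are hypothetical and are not taken from the SCICONVBENCH code; the benchmark's actual rubric may define these rates differently.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    # Hypothetical per-case judgments; field names are illustrative only.
    resolved_in_final_spec: bool      # final specification fixes the issue
    clarified_in_conversation: bool   # issue was explicitly raised with the user


def resolution_rates(results):
    """Return FRR, CGRR, and their difference (the 'silent resolution' share).

    Assumption: a case counts toward CGRR only if it is resolved in the final
    specification AND the resolution was clarified in the dialogue.
    """
    if not results:
        raise ValueError("need at least one case")
    n = len(results)
    frr = sum(r.resolved_in_final_spec for r in results) / n
    cgrr = sum(
        r.resolved_in_final_spec and r.clarified_in_conversation
        for r in results
    ) / n
    return {"FRR": frr, "CGRR": cgrr, "silent_gap": frr - cgrr}


# Toy data: three of four cases end up resolved, but only one was clarified.
toy = [
    CaseResult(True, True),
    CaseResult(True, False),
    CaseResult(False, False),
    CaseResult(True, False),
]
rates = resolution_rates(toy)
```

Under this reading CGRR can never exceed FRR, and the gap between them is the share of cases resolved silently. A model reporting 31.5% FRR and 0.0% CGRR would have a gap of 31.5 percentage points.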
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
Direct impact: The benchmark fills a clear gap—Table 1's comparison with CLAMBER (86.1% vs. 18.2% resolution rate on the same model) compellingly demonstrates that general-domain clarification benchmarks vastly overestimate model capability in scientific contexts. This alone justifies a domain-specific benchmark.
Broader implications:
Limitations on impact: The benchmark is restricted to four domains, English text, and undergraduate-to-early-graduate difficulty. There is no end-to-end validation showing that improved CGRR actually leads to better downstream simulation outcomes, though Figure 1 provides a compelling motivating example.
4. Timeliness & Relevance
This work is highly timely. The deployment of LLM-based scientific assistants (OpenFOAMGPT, MetaOpenFOAM, FEABench, etc.) is accelerating, and the assumption that users will provide well-posed problems is clearly unrealistic. The paper correctly identifies that evaluation has been focused on downstream capabilities while neglecting the upstream formulation step. The rapid proliferation of agentic scientific workflows (cited extensively in Section 2) makes this evaluation gap increasingly consequential.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional Observations
The paper's dataset contribution (structured ontology annotations, conversation transcripts, rubrics) combined with code release should enable meaningful follow-up work. The finding that Claude Sonnet 4.6 achieves 31.5% FRR but 0.0% CGRR on PDE inconsistency is a particularly stark illustration of the benchmark's diagnostic value. The work would benefit from a follow-up demonstrating that CGRR-optimized models produce better downstream simulation outcomes.
Generated May 19, 2026
Comparison History (21)
Paper 1 addresses a critical gap in deploying AI for computational science by benchmarking multi-turn clarification for ill-posed problems. This has broad implications for accelerating scientific discovery across multiple disciplines. Paper 2, while demonstrating strong real-world financial impact in digital advertising, focuses on a much narrower commercial application rather than advancing foundational scientific research.
Paper 1 likely has higher scientific impact due to greater novelty and broader, field-general relevance: it introduces a new benchmark and rubric to evaluate an under-measured capability (multi-turn clarification for ill-posed scientific tasks), with clear methodological structure (ontology + multi-dimensional scoring) and actionable failure analyses. Benchmarks often become community standards, influencing model design and evaluation across domains beyond computational science. Paper 2 is timely and high-application, but is more incremental (single-model study, limited clinician subset) and more domain-specific, with impact constrained by deployment/regulatory and data-access barriers.
Paper 2 (MOCHA) presents a fundamental algorithmic advancement for LLM agent optimization. By framing agent skill generation as a multi-objective problem with real-world constraints (e.g., context windows) and applying Chebyshev scalarization, MOCHA offers a broad, domain-agnostic solution. In contrast, Paper 1 introduces a specialized benchmark for computational science. While valuable, benchmarks typically have a narrower scope and shorter shelf-life compared to foundational optimization methods. MOCHA's potential to improve agent architectures across diverse fields gives it significantly higher breadth of impact and broader real-world applicability.
SCICONVBENCH addresses a fundamental and underexplored gap in LLM evaluation—the ability to handle ill-posed scientific problems through multi-turn dialogue—which is highly relevant as LLMs are increasingly used as scientific assistants. It introduces a novel benchmark paradigm (clarification and disambiguation) that could influence evaluation practices across many scientific domains. Paper 1, while technically sound, addresses a narrower optimization problem (token reduction for GUI agents) with incremental improvements. Paper 2's broader applicability, novel evaluation framework, and identification of critical LLM limitations (silent assumptions, implicit repairs) give it higher potential to shape future research directions.
Paper 2 (GRAM) is likely to have higher scientific impact due to its methodological novelty: a probabilistic, multi-trajectory extension of recursive reasoning with variational training, enabling alternative hypotheses and inference-time scaling. This is broadly applicable across reasoning, generative modeling, and scalable inference, with potential downstream influence on model architectures beyond a single domain. Paper 1 is timely and useful but is a benchmark focused on a specific interaction setting in computational science, likely yielding narrower cross-field impact compared to a new general modeling framework.
Paper 1 provides a comprehensive, first-of-its-kind survey unifying AI approaches for inverse PDE problems across three major categories (inverse problems, inverse design, and control), covering broad applications from medical imaging to aerodynamics. Its scope, systematic organization, and coverage of open challenges position it as a foundational reference for a large and growing research community. Paper 2, while introducing a valuable benchmark for LLM-based scientific assistants, addresses a narrower problem (multi-turn clarification) with more limited immediate applicability and a smaller target audience.
Paper 1 likely has higher scientific impact: it introduces a new benchmark addressing a timely, broadly relevant gap in LLM evaluation—multi-turn clarification and task formulation—critical for deploying LLMs as reliable scientific assistants. The benchmark spans multiple computational science domains, provides an ontology and rubric-based evaluation, and releases code/data, enabling community-wide adoption and follow-on work across NLP, HCI, and scientific computing. Paper 2 advances planning theory with learning guarantees under partial observability, but the impact is more specialized to symbolic planning/action-model learning and may see narrower cross-field uptake.
Paper 1 addresses a critical bottleneck in deploying LLMs for scientific discovery: refining ill-posed requests into actionable tasks. By focusing on computational science domains like fluid mechanics and materials science, it provides a foundation for AI assistants that can accelerate actual research. While Paper 2 offers a valuable framework for commercial e-commerce agents, Paper 1's direct contribution to enhancing scientific methodology and its potential to catalyze cross-disciplinary scientific breakthroughs give it a broader and more profound scientific impact.
Paper 2 introduces a foundational framework (Experience Graphs) for self-evolving LLM agents, addressing a critical bottleneck in AI: the inability of agents to systematically learn from experience over time. Its plug-and-play nature and broad applicability across general reasoning and coding tasks give it significantly wider cross-disciplinary impact. In contrast, Paper 1, while valuable, is a domain-specific benchmark constrained to computational science. The generalized architectural innovation in Paper 2 promises broader adoption, higher potential for downstream real-world applications, and greater overall scientific impact across the AI community.
Paper 2 has higher likely impact: it targets a broad, urgent, cross-domain problem (responsibility and trust in agentic AI) with a unifying proposal—explicit, computable provenance—supported by formal constructs (causal attribution function, responsibility tensor) and lifecycle framing, plus preliminary experiments and an incident analysis. Its applications span many high-stakes deployments (software engineering and beyond) and align with fast-moving governance needs. Paper 1 is novel and rigorous as a benchmark for scientific task-clarification, but its scope is narrower (computational science domains) and impact is primarily within LLM evaluation rather than system-level accountability.
While both papers introduce valuable benchmarks, Paper 2 addresses clinical acuity and medical triage, a highly safety-critical capability for LLMs. The potential real-world impact of ensuring AI models safely guide patients to the correct level of medical care carries profound societal implications. By evaluating uncertainty alignment in high-stakes healthcare scenarios, Paper 2 is likely to have broader, more immediate relevance across the rapidly expanding field of medical AI and AI safety compared to the more domain-specific focus of computational science in Paper 1.
Paper 2 addresses a fundamental and timely AI safety concern—the robustness of concept erasure in diffusion models—with a novel black-box multi-agent framework. Its finding that concept erasure merely suppresses rather than eliminates concepts has broad implications for AI safety policy and practice. The trajectory-based analysis provides deep mechanistic insight into diffusion model behavior. Paper 1, while valuable, targets a narrower niche (LLM benchmarking for scientific task formulation) with incremental contribution to the benchmark landscape. Paper 2's safety implications give it broader cross-field impact and greater urgency.
Paper 2 addresses a critical, broadly applicable safety vulnerability in memory-augmented LLM agents ('memory laundering'). Its findings on hidden state contamination impact the fundamental design and security of all long-horizon AI agents. Paper 1 offers a valuable but more niche benchmark for computational science, making Paper 2's potential breadth of impact and conceptual innovation significantly higher across the broader AI safety and capabilities community.
SCICONVBENCH addresses a fundamental and underexplored gap in LLM evaluation: the ability to handle ill-posed scientific problems through multi-turn clarification dialogue. This is highly novel, methodologically rigorous (structured ontology, rubric-based evaluation), and broadly applicable across computational science. It provides a reusable benchmark with open-source code/data. Paper 1, while practically useful, primarily describes a specific framework (buddyMe) with modest empirical findings from limited case studies. Paper 2's identification that even frontier models resolve only ~53% of disambiguation cases reveals an important limitation with wide implications for AI-assisted science.
Paper 2 likely has higher scientific impact due to a concrete, timely benchmark addressing a clear gap in evaluating LLM-based scientific assistants: multi-turn clarification for ill-posed tasks. It provides datasets, an ontology, and rubric-based evaluation, enabling reproducible comparisons and driving measurable progress across models and tool-using agents. Its immediate applicability spans computational science domains and broader human-AI interaction for specification elicitation. Paper 1 is conceptually ambitious and potentially broad, but as a position paper its impact depends more on future adoption and empirical validation beyond a single FL case study.
Paper 2 introduces a unified multimodal LLM framework (QE-Catalytic-V2) that integrates property prediction and inverse structural design for catalytic materials into a single model with shared representations. This addresses a fundamental challenge in computational materials science—the inconsistency between generation and evaluation models in closed-loop optimization. It has direct real-world applications in catalyst discovery and materials design. Paper 1, while addressing an important gap in LLM evaluation for scientific dialogue, is primarily a benchmark contribution with more limited scope (four computational science domains) and identifies problems without solving them. Paper 2's methodological innovation in unifying two traditionally separate tasks has broader transformative potential for accelerating materials discovery.
Paper 2 (SCICONVBENCH) has higher potential scientific impact due to stronger real-world applicability and timeliness: it targets an upstream bottleneck for deploying LLMs as reliable scientific assistants—multi-turn clarification and specification of ill-posed computational science tasks—across multiple engineering/science domains with an ontology and rubric-based evaluation. Its scope connects directly to practical workflows (simulation setup, PDE problem formulation) and can influence tool-building, human-AI interaction, and scientific computing broadly. Paper 1 is novel and valuable for affective cognition diagnostics, but its applications are comparatively narrower and less directly tied to high-stakes scientific/engineering outcomes.
Paper 2 (NeuroAgent) has higher likely scientific impact due to a concrete, end-to-end system with immediate real-world utility in a major applied domain (neuroimaging). It demonstrates methodological rigor via large-scale evaluation (1,470 subjects), ablations across LLM backends, measurable preprocessing correctness, and downstream task gains (AD AUC 0.9518). Its agentic generate-execute-validate design can generalize to other scientific pipelines. Paper 1 is novel and timely as a benchmark for clarification, but its impact is more indirect (evaluation infrastructure) and narrower in near-term application.
SCICONVBENCH addresses a critical and timely gap in LLM evaluation for scientific applications—the ability to handle ill-posed problems through multi-turn dialogue. This has broad impact across AI-for-science, LLM benchmarking, and human-AI interaction communities. The finding that even frontier models resolve only ~53% of disambiguation cases highlights a fundamental limitation with wide implications. Paper 2 makes a solid but incremental contribution to zeroth-order optimization with sparsity constraints, addressing a niche technical limitation. Paper 1's broader applicability, timeliness given the LLM revolution, and potential to shape how scientific AI assistants are developed give it higher impact potential.
Paper 1 is likely higher impact due to its novelty in targeting an under-evaluated but crucial capability for scientific LLM assistants: multi-turn clarification and inconsistency resolution before task execution. It provides a reusable benchmark, ontology, and rubric-based evaluation framework across multiple computational science domains, enabling broad, timely adoption and standardization. Paper 2 presents an incremental PPO architecture tweak (shared actor-critic backbone) validated on a specific multi-UAV coverage setting; it has clear applications but narrower scope and less methodological novelty relative to existing RL literature.