Om Dobariya, Akhil Kumar
The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.
This paper investigates whether the tone of a prompt, placed on a rudeness-politeness spectrum, affects LLM accuracy on multiple-choice questions. The study uses two datasets (a custom 50-question set with 5 tone variants and a 570-question MMLU subset with 7 tone variants), tested across four LLMs (ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, Gemini 2.5 Flash Lite). The main finding is that tonal effects are systematic but highly model-dependent: some models show negligible sensitivity while others exhibit accuracy swings of up to 12.5 percentage points. The paper also proposes a speculative routing framework suggesting that tones may trigger different internal reasoning modes.
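To make the "accuracy swing" figure concrete, here is a minimal sketch of one way such a number could be computed. It assumes the swing is the gap between the best and worst per-tone accuracy for a single model, which the review does not state explicitly. The function name, the tone labels, and the counts are hypothetical placeholders, not values from the paper; only the 570-question total comes from the text.

```python
def accuracy_swing(correct_by_tone, n_questions):
    """Return (swing in percentage points, best tone, worst tone).

    Assumption: "swing" means max minus min accuracy across tone variants
    for one model. The paper may define it differently.
    """
    accuracy = {
        tone: 100.0 * correct / n_questions
        for tone, correct in correct_by_tone.items()
    }
    best = max(accuracy, key=accuracy.get)
    worst = min(accuracy, key=accuracy.get)
    return accuracy[best] - accuracy[worst], best, worst


# Hypothetical counts of correct answers out of 570 questions per tone.
# These are illustrative placeholders, not results reported in the paper.
hypothetical_counts = {"tone_a": 456, "tone_b": 399, "tone_c": 430}

swing, best, worst = accuracy_swing(hypothetical_counts, n_questions=570)
print(f"swing = {swing:.1f} points ({best} vs {worst})")
```

Reporting the swing alongside the tones that produce the extremes keeps the headline number tied to a specific pair of conditions rather than to the model as a whole.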
The core contribution is primarily empirical—documenting tone sensitivity across a specific set of models and conditions. The routing framework (Figure 1) is presented as a conceptual explanation but remains unvalidated.
The paper demonstrates reasonable statistical care in several areas. The authors employ repeated-measures ANOVA, Friedman tests, paired t-tests with Holm correction, McNemar's exact tests, Cochran's Q, and FDR-controlled subject-level analyses. Effect sizes (Cohen's d_z) and non-parametric robustness checks are reported, which is commendable.
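For readers less familiar with three of the procedures named above, the following is a minimal standard-library sketch of an exact McNemar test, a Holm step-down adjustment, and Cohen's d_z for paired differences. It is not the authors' analysis code; the function names are chosen here for illustration, and the input numbers in the usage lines are made up.

```python
from math import comb
from statistics import mean, stdev


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: questions correct under tone A only; c: correct under tone B only.
    Under the null, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * pvalues[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def cohens_dz(differences):
    """Cohen's d_z: mean paired difference over its sample std deviation."""
    return mean(differences) / stdev(differences)


# Illustrative inputs only; none of these numbers come from the paper.
print(mcnemar_exact(1, 9))
print(holm_adjust([0.01, 0.04, 0.03]))
print(cohens_dz([1.0, 2.0, 3.0]))
```

The exact form of McNemar's test depends only on the discordant pairs, so it stays valid when few questions change outcome between two tones, which is the situation a paired tone comparison on the same questions tends to produce.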
However, there are notable methodological concerns:
The practical relevance is clear: as LLMs are deployed in enterprise settings, understanding that prompt phrasing—even seemingly irrelevant emotional framing—can alter accuracy is important for reliability assurance. The finding that some models lose 11+ percentage points under certain tones is operationally significant.
However, the impact is tempered by several factors:
The topic is timely. Prompt engineering is an active area of interest, and understanding how social/emotional cues in prompts affect LLM behavior is practically relevant. The paper builds on prior work by Yin et al. (2024) and others, extending it to newer models and more tone variants. The inclusion of extreme tones (Sycophantic, Threatening) adds novelty beyond the standard politeness spectrum.
However, given the rapid pace of LLM development, findings tied to specific model versions (GPT-4o, Gemini 2.5 Flash, etc.) have a short shelf life. The paper would benefit from more generalizable theoretical insights.
The paper reads as a solid conference-level empirical study suitable for the AMCIS venue, but its contribution to fundamental understanding is limited. The most interesting thread—that different tones may activate different computational pathways—deserves much deeper investigation. The latency analysis mentioned in passing (40% variation for Gemini 2.5 Flash Lite) could have been a compelling additional result if reported systematically. The disconnect between the empirical findings (which are careful) and the theoretical framework (which is speculative) weakens the overall contribution.
Generated May 29, 2026
Paper 2 has a significantly broader impact, addressing a fundamental issue (prompt robustness and tone sensitivity) in foundational LLMs used across almost all scientific and commercial domains. While Paper 1 presents a valuable, timely framework for AI governance, its scope is heavily focused on public administration and specific regional infrastructure, limiting its generalizability compared to Paper 2's insights, which affect the broader AI/NLP community and general LLM deployment.
Paper 2 addresses a timely, empirical question about LLM behavior with concrete experimental methodology, practical implications for the rapidly growing LLM user base, and actionable findings (tone sensitivity varies by model and subject). While Paper 1 offers a thoughtful theoretical framework for interaction-centered intelligence, it is primarily conceptual and synthesizes existing ideas rather than generating novel empirical evidence. Paper 2's practical relevance to prompt engineering, LLM reliability, and deployment practices gives it broader near-term impact across the large community of LLM researchers and practitioners.
Paper 2 addresses a broadly relevant and timely question about LLM behavior that affects virtually all users and applications of these rapidly proliferating systems. Its findings about tone sensitivity have immediate practical implications for prompt engineering, AI safety, and deployment reliability across many domains. Paper 1, while addressing an important clinical problem, proposes incremental improvements (marginal gains in BLEU-4, METEOR, ROUGE-L) over existing methods in a narrower application domain. Paper 2's broader applicability, timeliness given the LLM explosion, and novel routing framework give it higher potential impact.
Paper 1 addresses a fundamental and ubiquitous issue in prompt engineering: how tone affects objective LLM accuracy. Its findings have broad, immediate implications for LLM reliability, safety, and user interaction across virtually all domains. Paper 2, while presenting an interesting methodology for LLM-driven algorithm development, is primarily a case study focused on a niche area (tensor networks), limiting its immediate breadth of impact compared to Paper 1.
Paper 2 addresses the rapidly growing and broadly relevant field of LLM reliability, investigating how prompt tone affects performance across multiple models and subjects. Its findings have immediate practical implications for the vast and expanding community of LLM users and developers, and the routing framework offers a novel theoretical contribution. Paper 1, while methodologically sound, represents an incremental contribution combining two established techniques (ASP and CARCASS) in a niche area of relational RL with limited evaluation scope. Paper 2's timeliness, broader audience, and practical relevance give it higher impact potential.
Paper 2 has higher likely scientific impact: it presents an empirical, multi-model, multi-dataset study with statistical evidence that prompt tone can systematically shift accuracy, plus subject-level analyses and a proposed explanatory routing framework. This is timely for evaluation, reliability, and deployment of LLMs and is broadly relevant across NLP, HCI, and AI safety/robustness. Paper 1 offers useful conceptual framing and governance guidance for agentic systems, but appears more managerial/definitional with less methodological rigor and fewer falsifiable or generalizable results.
CrystalXRD-Bench introduces a novel, domain-specific benchmark targeting a concrete scientific task (XRD peak indexing) that bridges vision-language AI with crystallography—a gap unaddressed by existing benchmarks. It has clear real-world applications in materials science, provides a public dataset from 10 databases, and identifies specific failure modes of VLMs on quantitative scientific figures. Paper 1, while useful, studies tone sensitivity in LLMs—a topic already partially explored in prompt engineering literature—with relatively incremental findings. Paper 2's interdisciplinary novelty and methodological contribution to scientific AI evaluation give it higher impact potential.
Paper 1 provides rigorous empirical analysis and quantifiable metrics on how prompt tones affect LLM accuracy, offering immediate practical applications for AI deployment and prompt engineering. While Paper 2 presents valuable philosophical concepts regarding AI ethics and guardrails, empirical studies with measurable performance impacts generally see broader citation and direct integration into ongoing technical research and development.
Paper 1 has higher likely scientific impact due to broader, timely relevance to LLM reliability across domains and users, and a clearer empirical contribution: controlled tone variants, multiple models, large multi-subject benchmark (MMLU subset), statistical testing, and subject-level analysis plus an explanatory routing framework. Its findings generalize to any LLM deployment and touch evaluation, alignment, safety, and HCI. Paper 2 is innovative and practically valuable for finance, but is more domain-specific and appears more design/case-study oriented with less generalizable, rigorously validated empirical evidence.
Paper 2 addresses a fundamental challenge in LLM post-training (distribution shift during SFT before RL) with a novel, principled method (entropy-KL token masking). It offers a concrete algorithmic contribution with theoretical motivation, empirical validation on mathematical reasoning benchmarks, and open-source code. Its impact spans the broad and active research area of LLM alignment and training pipelines. Paper 1, while timely, is primarily observational—documenting tone sensitivity in LLMs without proposing a solution—and its contributions are more incremental and application-specific.