Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao

May 28, 2026

arXiv:2605.29430v1

cs.AI (primary), cs.CL

#953 of 2821 · Artificial Intelligence

Tournament Score: 1443 ± 44 (scale 1050 to 1800)

Win Rate: 65%

Rating: 6.2/10

Significance 6.5 · Rigor 5.5 · Novelty 6.8 · Clarity 7.5

Abstract

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such a problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at: https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper makes three interrelated contributions: (1) Interactive ASR task formulation, recasting ASR as a stateful multi-turn refinement process rather than single-pass decoding; (2) Agentic ASR, a closed-loop framework combining a conventional ASR front-end with LLM-based semantic correction, intent routing, and a structured Locate–Reason–Modify correction pipeline; and (3) Sentence-level Semantic Error Rate (S²ER), an LLM-as-a-judge evaluation metric with a bidirectional multi-round voting protocol, accompanied by an Interactive Simulation System (ISS) for automated multi-turn benchmarking.
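To make the S²ER idea concrete, here is a minimal sketch of a binary, sentence-level error rate with voting. It is not the paper's implementation. It assumes that "bidirectional" means querying the judge with the pair in both orders and that rounds are aggregated by strict majority; the `Judge` callable, `toy_judge`, and the `rounds` parameter are hypothetical stand-ins for the LLM judge.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: returns True if the two sentences mean the same.
Judge = Callable[[str, str], bool]


def is_equivalent(ref: str, hyp: str, judge: Judge, rounds: int = 3) -> bool:
    """Majority vote over both presentation orders (assumed protocol)."""
    votes = []
    for _ in range(rounds):
        votes.append(judge(ref, hyp))  # forward direction
        votes.append(judge(hyp, ref))  # reversed direction
    # Strict majority required; a tie counts as "not equivalent".
    return sum(votes) * 2 > len(votes)


def s2er(refs: Sequence[str], hyps: Sequence[str], judge: Judge, rounds: int = 3) -> float:
    """Fraction of sentences judged NOT semantically equivalent to the reference."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and the same length")
    errors = sum(not is_equivalent(r, h, judge, rounds) for r, h in zip(refs, hyps))
    return errors / len(refs)


def toy_judge(a: str, b: str) -> bool:
    """Toy stand-in for an LLM judge: compares case- and punctuation-normalized words."""
    def norm(s):
        return "".join(ch for ch in s.casefold() if ch.isalnum() or ch.isspace()).split()
    return norm(a) == norm(b)


refs = ["Call Dr. Smith", "book a flight to Xi'an"]
hyps = ["call dr smith", "book a flight to Cyan"]
score = s2er(refs, hyps, toy_judge)  # one of two sentences differs in meaning
```

With a real LLM judge the votes can disagree across rounds and orders, which is the situation the voting step is meant to smooth out.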

The central insight—that ASR should mirror human-like iterative repair rather than operate as a one-shot transcription engine—is intuitive and well-motivated. The paper draws explicitly on conversational repair theory (Clark & Brennan, Schegloff et al.) to ground this design. The formulation elegantly separates the problem into semantic correction, intent classification (confirmation/new input/correction), and structured reasoning-based editing.
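The three-way routing described above can be sketched as a small state machine. This is an illustration only: in the paper the router and editor are LLM-based and the input comes from an ASR front-end, whereas here `route` is a toy keyword rule, `locate_reason_modify` only handles feedback shaped like "no, X not Y", and all names (`Session`, `step`, and so on) are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    CONFIRMATION = "confirmation"
    NEW_INPUT = "new_input"
    CORRECTION = "correction"


@dataclass
class Session:
    transcript: str = ""   # current best transcript (the state carried across turns)
    done: bool = False     # set once the user confirms


def route(text: str) -> Intent:
    """Toy keyword router standing in for an LLM intent classifier."""
    lowered = text.lower().strip()
    if lowered in {"yes", "correct", "that's right"}:
        return Intent.CONFIRMATION
    if lowered.startswith("no, ") and " not " in lowered:
        return Intent.CORRECTION
    return Intent.NEW_INPUT


def locate_reason_modify(transcript: str, feedback: str) -> str:
    """Toy edit for feedback shaped like 'no, X not Y': replace Y with X."""
    body = feedback.strip()[4:]             # drop the leading "no, "
    right, wrong = body.split(" not ", 1)   # reason: what is wrong and what replaces it
    start = transcript.find(wrong)          # locate the erroneous span
    if start < 0:
        return transcript                   # span not found: leave the transcript as-is
    return transcript[:start] + right + transcript[start + len(wrong):]  # modify


def step(session: Session, text: str) -> Intent:
    """Apply one recognized user turn to the session and return the routed intent."""
    intent = route(text)
    if intent is Intent.CONFIRMATION:
        session.done = True
    elif intent is Intent.NEW_INPUT:
        session.transcript = text
    else:
        session.transcript = locate_reason_modify(session.transcript, text)
    return intent


session = Session()
turns = ["book a flight to Cyan", "No, Xi'an not Cyan", "yes"]
intents = [step(session, t) for t in turns]
```

After the three turns the session holds the corrected transcript and is marked done; the point of the sketch is only that state persists across turns and each turn is dispatched by intent.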

2. Methodological Rigor

Strengths in experimental design:

The evaluation spans six benchmarks across three challenging categories (multilingual, named-entity-intensive, code-switching), providing reasonable breadth.

Ablation studies systematically examine ASR backbone choice (Whisper, Qwen3-ASR, FireRedASR2), LLM reasoner scale (8B vs. 32B), and judge voting strategy.

The human–AI alignment study (120 samples, 25 non-expert + 5 expert annotators) demonstrates that S²ER correlates with human judgments (r > 0.82 across datasets), with the LLM judge slightly outperforming domain experts.
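For readers who want to reproduce this kind of alignment check, a Pearson correlation between judge decisions and averaged human ratings can be computed as below. This is a generic sketch; how the paper pairs and aggregates annotations is not specified here, and the sample numbers are made up purely for illustration.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up illustration: 1 = judged semantically equivalent, 0 = not,
# paired with the mean human rating for the same sample.
judge_labels = [1, 0, 1, 1, 0]
human_means = [0.9, 0.2, 0.8, 0.6, 0.1]
r = pearson_r(judge_labels, human_means)
```

With only 40 samples per language condition, a confidence interval around r would be worth reporting alongside the point estimate.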

Weaknesses and concerns:

The evaluation is entirely simulation-based. No real human-in-the-loop experiments are conducted. The User Simulator generates corrections from ground-truth transcripts using an LLM + TTS pipeline, which creates an idealized interaction that may not reflect real user behavior (ambiguous corrections, impatience, cascading misunderstandings).

The S²ER metric is validated on only 120 samples total (40 per language condition). While correlation numbers are reasonable, this is a small validation set for establishing a new metric's reliability.

The Interactive Simulation System uses the same LLM family (Qwen3) for the reasoner, user simulator, and semantic judge, raising concerns about circular evaluation. If the same model generates corrections and judges outcomes, inflated performance is possible.

S²ER is binary (semantically equivalent or not), which loses granularity. The paper acknowledges this implicitly but doesn't explore graded semantic similarity.

Token-level metrics sometimes worsen with interaction (e.g., WER on GigaSpeech increases slightly and MER on CS-Dialogue degrades), suggesting the correction process can introduce surface-level artifacts. The paper acknowledges this for the 8B model, but it also appears in some 32B results, which warrants deeper investigation.
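One plausible mechanism for the last point is that an LLM edit fixes the meaning-critical word while rewriting harmless surface forms. The sketch below uses a standard word-level edit distance to show WER rising after such an edit. The sentences are invented for illustration and are not taken from the paper's data.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by reference length."""
    r, h = ref.split(), hyp.split()
    if not r:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(h) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(r, 1):
        cur = [i]  # deleting i reference words to reach an empty hypothesis
        for j, hyp_word in enumerate(h, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(r)


reference = "i am going to meet doctor lee at three"
first_pass = "i am going to meet doctor li at three"  # wrong name, otherwise verbatim
corrected = "i'm going to meet dr lee at 3"           # right name, rewritten surface forms

wer_before = wer(reference, first_pass)  # 1 error over 9 words
wer_after = wer(reference, corrected)    # 4 errors over 9 words
```

Here the corrected transcript carries the intended meaning yet scores a higher WER, which is the pattern a semantic metric is meant to expose. Text normalization before scoring would reduce, but not necessarily remove, this effect.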

3. Potential Impact

The paper addresses a genuine gap in how ASR systems handle errors in practice. As speech interfaces become front-ends for LLM agents, the inability to correct misrecognized named entities or intent-critical content is a real bottleneck. The interactive paradigm could influence:

Voice assistant design: Enabling clarification dialogues when recognition confidence is low.

Accessibility applications: Users with atypical speech patterns could iteratively refine transcriptions.

ASR evaluation methodology: S²ER could complement WER/CER in settings where semantic fidelity matters more than surface accuracy.

However, the practical deployment path is unclear. The framework requires a full LLM (32B preferred) running alongside the ASR system, which adds significant latency and compute cost. The paper doesn't discuss inference latency, which is critical for real-time interactive systems.

4. Timeliness & Relevance

The paper is highly timely. The proliferation of LLM-based agents (ChatGPT, Claude, etc.) that accept speech input makes ASR error correction increasingly consequential. The observation that WER doesn't capture semantic impact is well-established but still underaddressed in practice. Framing ASR within an agentic, multi-turn paradigm aligns with current trends in AI agent research (ReAct, tool use, etc.).

5. Strengths & Limitations

Key strengths:

Novel and well-motivated problem formulation that bridges conversational repair theory with modern ASR.

The Locate–Reason–Modify decomposition is principled and interpretable.

Strong ablation showing the framework works even with weak ASR backbones (Whisper), demonstrating generality.

Code and live demo availability enhance reproducibility.

The finding that S²ER captures gains invisible to token-level metrics is compelling and practically important.

Notable limitations:

No real user studies—the entire evaluation loop is simulated, which is the single biggest gap for a paper claiming to move "towards human-like interactive" ASR.

Potential evaluation circularity from using the same model family across components.

The paper doesn't compare against existing interactive or feedback-based ASR approaches (e.g., N-best reranking with user selection, respeaking methods) in a controlled manner.

Scalability concerns: 10 interaction rounds with LLM calls per round is expensive; no cost analysis is provided.

S²ER's binary nature means it cannot distinguish between "almost correct" and "completely wrong" transcriptions, limiting its diagnostic utility.

The paper doesn't address how the system handles adversarial or contradictory user feedback, or how it degrades under noisy TTS in the simulation loop.

Overall Assessment

This is a well-structured paper that identifies a genuine problem and proposes a coherent solution. The Interactive ASR formulation and Agentic ASR framework are conceptually sound, and the experiments demonstrate consistent semantic improvements. However, the reliance on fully simulated evaluation, the small validation set for S²ER, and the potential circularity in using the same model family for generation and evaluation temper the strength of the empirical claims. The lack of real user studies is particularly notable given the paper's emphasis on human-like interaction. The work opens an interesting research direction but would benefit from real-world deployment validation and comparison with simpler baselines.

Rating: 6.2/10

Significance 6.5 · Rigor 5.5 · Novelty 6.8 · Clarity 7.5

Generated May 29, 2026

Comparison History (20)

vs. Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical bottleneck in LLM development: domain-specific data scarcity. Its shift from a deductive, prompt-heavy paradigm to an inductive, representation-learning approach is highly innovative. By providing both theoretical proofs for data diversity and strong empirical gains, it establishes a foundational method applicable across countless domains. While Paper 2 presents a practical and novel agentic approach to ASR, Paper 1's contribution to data synthesis will likely have a broader, cross-disciplinary impact on the fundamental training pipelines of large language models.

vs. MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental paradigm shift in ASR from single-pass to multi-turn interactive systems, introducing a novel semantic evaluation metric that addresses the critical flaws of traditional token-level metrics like WER. By redefining how speech recognition is evaluated and integrated with LLMs, it offers foundational contributions that could reshape human-computer voice interaction across numerous fields. While Paper 2 provides an effective agentic web-exploration framework, Paper 1's systemic changes to core ASR methodology have a higher potential for broad and lasting scientific impact.

vs. Planning with the Views via Scene Self-Exploration

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact due to stronger novelty (explicitly targeting multi-turn view planning and exposing a clear capability gap), broader cross-field relevance (VLMs, embodied AI, robotics, 3D navigation, RL/distillation), and a compelling methodological contribution (ViewSuite benchmark + self-exploration with view-graph distillation to address sparse-reward planning). The reported gains are large and tied to an important frontier problem—agentic, spatially grounded planning—making it timely and widely applicable. Paper 1 is valuable, but more incremental (interactive refinement + LLM metric) and narrower to ASR.

vs. BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a fundamentally new paradigm for ASR—interactive, multi-turn refinement—along with a novel semantic evaluation metric (S²ER) and a simulation framework. This reframes how ASR systems operate, aligning them with human communication patterns, and has broad implications for human-computer interaction, LLM-based assistants, and multilingual processing. Paper 2 presents a useful engineering contribution (quantizing LLM-based trajectory predictors for edge deployment), but it is more incremental, applying known quantization techniques to a specific application domain with narrower impact.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it introduces a new problem framing (Interactive ASR), a practical closed-loop agentic correction framework, and a new semantic metric (S^2ER) with a scalable simulation benchmark, directly addressing real-world ASR failures in multilingual/NER/code-switching settings. This is timely given LLM-based assistants and could influence ASR evaluation, HCI, and agent pipelines broadly. Paper 1 is novel and rigorous in diagnosing a specific failure mode in masked diffusion LMs, but its applicability is narrower (primarily MDM inference/training and synthetic reasoning tasks) and less immediately deployable.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact due to broader real-world applicability (interactive ASR for assistants), clear system contribution plus a new evaluation metric and benchmarking infrastructure, and strong timeliness as speech+LLM agents grow. Its closed-loop framework can influence ASR, HCI, and agent design, with multilingual/code-switching relevance and public code/demo aiding adoption. Paper 1 is novel and insightful for LLM interpretability/safety and attack efficiency, but its primary application is red-teaming/jailbreak optimization, which is narrower and may face deployment constraints, limiting breadth despite methodological interest.

vs. Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact due to stronger novelty (reframing ASR as interactive, agentic refinement), broad applicability (voice assistants, multilingual and code-switching settings, LLM agent front-ends), and timeliness (interactive LLM-based systems). It contributes a new evaluation metric (S^2ER) and a scalable simulation benchmark plus released code/demo, supporting reproducibility and adoption. Paper 1 is useful and interpretable for education, but evidence is preliminary and limited to a single course context, likely narrowing immediate cross-field impact.

vs. KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

claude-opus-4.6 · 5/29/2026

KairosAgent addresses the fundamental challenge of multimodal time series forecasting across domains by combining LLMs and TSFMs in a novel agentic framework with reinforcement learning from forecasting. Its broader applicability across multiple domains (finance, weather, energy, etc.), novel fusion of semantic reasoning with numerical forecasting, and the introduction of an RL-based training paradigm for time series agents represent a more impactful contribution. Paper 2, while valuable for interactive ASR, addresses a more niche problem with incremental improvements to an established pipeline. Paper 1's cross-domain generality and methodological innovations give it higher potential impact.

vs. Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to broader applicability (any speech-driven HCI/agent pipeline, multilingual and code-switching settings), a more novel framing (interactive, multi-turn ASR with agentic correction), and two reusable research artifacts: an evaluation metric (S^2ER) and a scalable simulation benchmark, plus released code/demo—likely to catalyze follow-on work. Paper 1 is timely and rigorous for structured health text generation, but its design rule (deterministic precompute + bounded LLM) is less field-general and more application-specific, limiting cross-domain breadth.

vs. Accelerating Constrained Decoding with Token Space Compression

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical and widespread bottleneck in LLM deployment—constrained decoding overhead—offering a highly impactful solution with massive speedups (up to 7.5x). Its improvements will broadly benefit structured generation tasks like code generation and data extraction across various fields. While Paper 1 presents a novel paradigm for ASR, Paper 2's methodological contribution has a broader and more immediate potential for real-world impact in the rapidly growing domain of LLM applications.

vs. NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

gpt-5.2 · 5/29/2026

Paper 1 likely has higher impact: it reframes ASR into an interactive, multi-turn paradigm aligned with real human communication, introduces an evaluation metric (S^2ER) and a simulation/benchmarking system, and targets a widely deployed technology with immediate HCI and agent applications. The contribution spans methodology (closed-loop correction), evaluation (semantic metric), and tooling (simulator), broadening uptake across ASR, LLM agents, and human–AI interaction. Paper 2 is solid and timely for diffusion LLMs, but its scope is narrower and dependent on diffusion-LM adoption.

vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental paradigm shift in Automatic Speech Recognition from single-pass to interactive, agentic refinement, addressing a critical bottleneck in HCI. It introduces a novel semantic metric, open-source benchmarks, and rigorous multilingual evaluation. Its impact is highly broad, affecting ubiquitous voice assistants and LLM agents. In contrast, Paper 2 presents an innovative but highly domain-specific system architecture for financial investment research. Paper 1's broader applicability, rigorous quantitative benchmarking, and foundational contributions to a core AI technology give it higher potential scientific impact.

vs. Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

gpt-5.2 · 5/29/2026

Paper 1 has higher impact potential due to clearer novelty and a concrete, reproducible contribution: a defined “Interactive ASR” task, an agentic closed-loop correction framework, a new semantic metric (S^2ER), and a simulation system with multilingual/code-switching evaluations plus demos/code. It targets a large, mature, high-need domain (ASR for LLM assistants), enabling broad adoption and benchmarking. Paper 2 is promising but reads more conceptual/system-level, with harder-to-validate claims and narrower demonstrated scope (specific scientific workflows), making near-term community uptake and measurable impact less certain.

vs. Rubric-Guided Process Reward for Stepwise Model Routing

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel paradigm shift in ASR by formulating it as an interactive multi-turn refinement task, proposes a new semantic evaluation metric (S²ER), and provides a complete benchmarking framework. This addresses a fundamental limitation in a widely-used technology (ASR) with broad real-world applications in human-computer interaction and LLM-based assistants. Paper 2 offers an incremental improvement to model routing with process rewards, which is a narrower contribution within the LRM efficiency space. Paper 1's broader applicability, new evaluation paradigm, and alignment with the growing LLM-agent ecosystem give it higher potential impact.

vs. Intelligence as Managed Autonomy: Failure, Escalation, and Governance for Agentic AI Systems

claude-opus-4.6 · 5/29/2026

Paper 1 presents a concrete, implementable framework (Agentic ASR) with experimental validation on multilingual benchmarks, a new evaluation metric (S²ER), and publicly available code and demo. It addresses a practical gap in ASR systems with measurable results. Paper 2 presents a theoretical governance framework (SMARt) for agentic AI safety, which is timely but remains largely theoretical without empirical validation. While Paper 2 addresses an important problem, Paper 1's combination of novelty, reproducible experiments, practical applicability to the growing LLM-agent ecosystem, and concrete contributions gives it higher near-term scientific impact.

vs. MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel paradigm shift in ASR by formulating interactive, multi-turn speech recognition with agentic correction, a new semantic evaluation metric (S²ER), and a simulation benchmark. This addresses a fundamental limitation in human-computer interaction with broad applications across multilingual and real-world settings. Paper 2, while solid, addresses a more incremental improvement in data selection for LLM mid-training. Paper 1's novelty in redefining the ASR problem, introducing new evaluation methodology, and broader cross-field impact (ASR, HCI, LLM agents) gives it higher potential impact.

vs. Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

claude-opus-4.6 · 5/29/2026

Paper 1 presents a more fundamental and broadly applicable insight—that reasoning traces, not just answers, should be the unit of aggregation in multi-agent LLM systems. The 'aggregation paradox' is a novel theoretical finding with implications across all LLM reasoning tasks. It introduces a principled framework (Self-Consistent MoA) with provable guarantees and demonstrates improvements across diverse domains. Paper 2 addresses a meaningful but more narrowly scoped problem (interactive ASR correction) with an engineering-oriented framework. While valuable, its impact is more domain-specific compared to Paper 1's foundational contribution to LLM aggregation methodology.

vs. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

gpt-5.2 · 5/29/2026

Paper 1 has higher potential impact due to its strong novelty (LLM-generated compiler passes rather than kernels), large-scale open ecosystem (18K graphs + rigorous benchmark), and clear methodology (new metric, integrity defenses, discriminative/unsaturated benchmark). It targets a concrete, high-leverage bottleneck in ML systems (long-tail compiler performance) with broad downstream effects across deep learning frameworks and hardware backends. Paper 2 is timely and useful for HCI/ASR, but agentic correction and LLM-based semantic metrics build on rapidly evolving patterns and may face reproducibility/subjectivity risks versus Paper 1’s more grounded systems benchmark.

vs. PRO-CUA: Process-Reward Optimization for Computer Use Agents

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: process-reward, step-level RL for computer-use agents targets a fast-growing area (GUI/web agents) with clear real-world automation value. Methodologically, decoupling live interaction from optimization via PRM-guided dense feedback and group-relative advantages addresses key RL bottlenecks (sparse rewards, credit assignment, distribution shift) and could generalize across agentic tasks beyond GUIs. Paper 1 is novel for interactive ASR and semantic evaluation, but its impact is more domain-specific to speech recognition and depends on LLM-judge reliability.

vs. Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel paradigm shift in ASR by formulating interactive, multi-turn speech recognition with a complete framework (Agentic ASR), a new semantic evaluation metric (S²ER), and a simulation system for benchmarking. It addresses a fundamental limitation of current ASR systems with broad real-world applications in human-computer interaction and LLM-based assistants. Paper 2, while valuable in evaluating feasibility awareness in tool-using agents, is more narrowly focused on evaluation/benchmarking of an existing problem. Paper 1's contributions span methodology, metrics, and systems, giving it broader potential impact.