Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation
Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao
Abstract
Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect this problem. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate (S²ER), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in S²ER than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and the live demo is available at https://i-asr.sjtuxlance.com/
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
This paper makes three interrelated contributions: (1) Interactive ASR task formulation, recasting ASR as a stateful multi-turn refinement process rather than single-pass decoding; (2) Agentic ASR, a closed-loop framework combining a conventional ASR front-end with LLM-based semantic correction, intent routing, and a structured Locate–Reason–Modify correction pipeline; and (3) Sentence-level Semantic Error Rate (S²ER), an LLM-as-a-judge evaluation metric with a bidirectional multi-round voting protocol, accompanied by an Interactive Simulation System (ISS) for automated multi-turn benchmarking.
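To make the metric concrete, here is a minimal sketch of how a sentence-level semantic error rate with bidirectional, multi-round voting could be computed. It is an illustration under stated assumptions, not the paper's implementation: the `judge` callable, the interpretation of "bidirectional" as checking both directions of a sentence pair, the strict-majority rule, and the punctuation-insensitive `toy_judge` are all hypothetical stand-ins for the LLM judge.

```python
import string
from typing import Callable, Sequence

# Hypothetical judge: returns True when `a` conveys the meaning of `b`.
# In a real system this would be an LLM call; here it is any callable.
Judge = Callable[[str, str], bool]


def sentence_is_error(ref: str, hyp: str, judge: Judge, rounds: int = 3) -> bool:
    """Assumed protocol: a sentence counts as correct only if a strict
    majority of rounds judge the pair equivalent in both directions."""
    equivalent_votes = 0
    for _ in range(rounds):
        # Bidirectional check: hypothesis covers reference and vice versa.
        if judge(hyp, ref) and judge(ref, hyp):
            equivalent_votes += 1
    return equivalent_votes * 2 <= rounds


def s2er(refs: Sequence[str], hyps: Sequence[str], judge: Judge, rounds: int = 3) -> float:
    """Fraction of sentences whose hypothesis is judged semantically wrong."""
    if len(refs) != len(hyps) or not refs:
        raise ValueError("refs and hyps must be non-empty and of equal length")
    errors = sum(sentence_is_error(r, h, judge, rounds) for r, h in zip(refs, hyps))
    return errors / len(refs)


def toy_judge(a: str, b: str) -> bool:
    """Deterministic stand-in for an LLM judge: equality after lowercasing
    and stripping punctuation. Real semantic judging is far richer."""
    table = str.maketrans("", "", string.punctuation)
    norm = lambda s: " ".join(s.lower().translate(table).split())
    return norm(a) == norm(b)


refs = ["call Dr. Nguyen", "book a table for two"]
hyps = ["Call Dr Nguyen", "book a cable for two"]
print(s2er(refs, hyps, toy_judge))  # 0.5: the second sentence changes meaning
```

With a stochastic LLM judge, the repeated rounds are what give the vote its purpose; with the deterministic toy judge above, every round returns the same answer.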
The central insight—that ASR should mirror human-like iterative repair rather than operate as a one-shot transcription engine—is intuitive and well-motivated. The paper draws explicitly on conversational repair theory (Clark & Brennan, Schegloff et al.) to ground this design. The formulation elegantly separates the problem into semantic correction, intent classification (confirmation/new input/correction), and structured reasoning-based editing.
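The three-way routing and the Locate–Reason–Modify step can be sketched as a small state machine. This is a toy under loud assumptions: the paper uses LLM components, whereas the keyword rules, the "change X to Y" phrasing, and the names `Session`, `route_intent`, `locate_reason_modify`, and `step` below are hypothetical and exist only to show the control flow.

```python
import re
from dataclasses import dataclass
from enum import Enum


class Intent(Enum):
    CONFIRMATION = "confirmation"
    NEW_INPUT = "new_input"
    CORRECTION = "correction"


@dataclass
class Session:
    transcript: str = ""
    confirmed: bool = False


# Hypothetical correction phrasing; a real router would be an LLM classifier.
CORRECTION_PATTERN = re.compile(r"^change (?P<old>.+?) to (?P<new>.+)$", re.IGNORECASE)


def route_intent(utterance: str) -> Intent:
    text = utterance.strip()
    if text.lower() in {"yes", "correct", "that is right"}:
        return Intent.CONFIRMATION
    if CORRECTION_PATTERN.match(text):
        return Intent.CORRECTION
    return Intent.NEW_INPUT


def locate_reason_modify(transcript: str, utterance: str) -> str:
    match = CORRECTION_PATTERN.match(utterance.strip())
    if match is None:
        return transcript
    old, new = match.group("old"), match.group("new")
    # Locate: find the span the user is referring to (case-insensitive).
    start = transcript.lower().find(old.lower())
    if start < 0:
        return transcript  # nothing matched, leave the transcript unchanged
    # Reason: trivial here, since the user states the replacement directly.
    # Modify: splice the replacement into the located span.
    return transcript[:start] + new + transcript[start + len(old):]


def step(session: Session, utterance: str) -> Intent:
    """Apply one user turn to the session state and return the routed intent."""
    intent = route_intent(utterance)
    if intent is Intent.CONFIRMATION:
        session.confirmed = True
    elif intent is Intent.CORRECTION:
        session.transcript = locate_reason_modify(session.transcript, utterance)
    else:
        session.transcript = utterance
        session.confirmed = False
    return intent


session = Session()
for turn in ["send the report to John Smith", "change John to Joan", "yes"]:
    print(step(session, turn).value, "->", session.transcript)
```

The point of the separation is that only the correction branch edits the stored transcript in place; a new input replaces it, and a confirmation closes the loop.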
2. Methodological Rigor
Strengths in experimental design:
Weaknesses and concerns:
3. Potential Impact
The paper addresses a genuine gap in how ASR systems handle errors in practice. As speech interfaces become front-ends for LLM agents, the inability to correct misrecognized named entities or intent-critical content is a real bottleneck. The interactive paradigm could influence:
However, the practical deployment path is unclear. The framework requires a full LLM (32B preferred) running alongside the ASR system, which adds significant latency and compute cost. The paper doesn't discuss inference latency, which is critical for real-time interactive systems.
4. Timeliness & Relevance
The paper is highly timely. The proliferation of LLM-based agents (ChatGPT, Claude, etc.) that accept speech input makes ASR error correction increasingly consequential. The observation that WER doesn't capture semantic impact is well-established but still underaddressed in practice. Framing ASR within an agentic, multi-turn paradigm aligns with current trends in AI agent research (ReAct, tool use, etc.).
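A small worked example shows why a token-level metric can understate a meaning-critical error. The sketch below computes WER with a standard word-level edit distance; the sentences are invented for illustration and are not taken from the paper's benchmarks.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


reference = "please transfer five hundred dollars to anna before friday noon"
hypothesis = "please transfer five hundred dollars to hannah before friday noon"

print(word_error_rate(reference, hypothesis))  # 0.1
```

One substituted word out of ten gives a WER of 0.1, yet the transfer now goes to the wrong person, so a sentence-level semantic judgment would mark the whole utterance as an error.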
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
This is a well-structured paper that identifies a genuine problem and proposes a coherent solution. The Interactive ASR formulation and Agentic ASR framework are conceptually sound, and the experiments demonstrate consistent semantic improvements. However, the reliance on fully simulated evaluation, the small validation set for S²ER, and the potential circularity in using the same model family for generation and evaluation temper the strength of the empirical claims. The lack of real user studies is particularly notable given the paper's emphasis on human-like interaction. The work opens an interesting research direction but would benefit from real-world deployment validation and comparison with simpler baselines.
Generated May 29, 2026
Comparison History (20)
Paper 1 addresses a critical bottleneck in LLM development: domain-specific data scarcity. Its shift from a deductive, prompt-heavy paradigm to an inductive, representation-learning approach is highly innovative. By providing both theoretical proofs for data diversity and strong empirical gains, it establishes a foundational method applicable across countless domains. While Paper 2 presents a practical and novel agentic approach to ASR, Paper 1's contribution to data synthesis will likely have a broader, cross-disciplinary impact on the fundamental training pipelines of large language models.
Paper 1 proposes a fundamental paradigm shift in ASR from single-pass to multi-turn interactive systems, introducing a novel semantic evaluation metric that addresses the critical flaws of traditional token-level metrics like WER. By redefining how speech recognition is evaluated and integrated with LLMs, it offers foundational contributions that could reshape human-computer voice interaction across numerous fields. While Paper 2 provides an effective agentic web-exploration framework, Paper 1's systemic changes to core ASR methodology have a higher potential for broad and lasting scientific impact.
Paper 2 likely has higher scientific impact due to stronger novelty (explicitly targeting multi-turn view planning and exposing a clear capability gap), broader cross-field relevance (VLMs, embodied AI, robotics, 3D navigation, RL/distillation), and a compelling methodological contribution (ViewSuite benchmark + self-exploration with view-graph distillation to address sparse-reward planning). The reported gains are large and tied to an important frontier problem—agentic, spatially grounded planning—making it timely and widely applicable. Paper 1 is valuable, but more incremental (interactive refinement + LLM metric) and narrower to ASR.
Paper 1 introduces a fundamentally new paradigm for ASR—interactive, multi-turn refinement—along with a novel semantic evaluation metric (S²ER) and a simulation framework. This reframes how ASR systems operate, aligning them with human communication patterns, and has broad implications for human-computer interaction, LLM-based assistants, and multilingual processing. Paper 2 presents a useful engineering contribution (quantizing LLM-based trajectory predictors for edge deployment), but it is more incremental, applying known quantization techniques to a specific application domain with narrower impact.
Paper 2 likely has higher impact: it introduces a new problem framing (Interactive ASR), a practical closed-loop agentic correction framework, and a new semantic metric (S²ER) with a scalable simulation benchmark, directly addressing real-world ASR failures in multilingual/NER/code-switching settings. This is timely given LLM-based assistants and could influence ASR evaluation, HCI, and agent pipelines broadly. Paper 1 is novel and rigorous in diagnosing a specific failure mode in masked diffusion LMs, but its applicability is narrower (primarily MDM inference/training and synthetic reasoning tasks) and less immediately deployable.
Paper 2 has higher likely impact due to broader real-world applicability (interactive ASR for assistants), clear system contribution plus a new evaluation metric and benchmarking infrastructure, and strong timeliness as speech+LLM agents grow. Its closed-loop framework can influence ASR, HCI, and agent design, with multilingual/code-switching relevance and public code/demo aiding adoption. Paper 1 is novel and insightful for LLM interpretability/safety and attack efficiency, but its primary application is red-teaming/jailbreak optimization, which is narrower and may face deployment constraints, limiting breadth despite methodological interest.
Paper 2 has higher likely impact due to stronger novelty (reframing ASR as interactive, agentic refinement), broad applicability (voice assistants, multilingual and code-switching settings, LLM agent front-ends), and timeliness (interactive LLM-based systems). It contributes a new evaluation metric (S²ER) and a scalable simulation benchmark plus released code/demo, supporting reproducibility and adoption. Paper 1 is useful and interpretable for education, but evidence is preliminary and limited to a single course context, likely narrowing immediate cross-field impact.
KairosAgent addresses the fundamental challenge of multimodal time series forecasting across domains by combining LLMs and TSFMs in a novel agentic framework with reinforcement learning from forecasting. Its broader applicability across multiple domains (finance, weather, energy, etc.), novel fusion of semantic reasoning with numerical forecasting, and the introduction of RL-based training paradigm for time series agents represent a more impactful contribution. Paper 2, while valuable for interactive ASR, addresses a more niche problem with incremental improvements to an established pipeline. Paper 1's cross-domain generality and methodological innovations give it higher potential impact.
Paper 2 has higher potential impact due to broader applicability (any speech-driven HCI/agent pipeline, multilingual and code-switching settings), a more novel framing (interactive, multi-turn ASR with agentic correction), and two reusable research artifacts: an evaluation metric (S²ER) and a scalable simulation benchmark, plus released code/demo, likely to catalyze follow-on work. Paper 1 is timely and rigorous for structured health text generation, but its design rule (deterministic precompute + bounded LLM) is less field-general and more application-specific, limiting cross-domain breadth.
Paper 2 addresses a critical and widespread bottleneck in LLM deployment—constrained decoding overhead—offering a highly impactful solution with massive speedups (up to 7.5x). Its improvements will broadly benefit structured generation tasks like code generation and data extraction across various fields. While Paper 1 presents a novel paradigm for ASR, Paper 2's methodological contribution has a broader and more immediate potential for real-world impact in the rapidly growing domain of LLM applications.
Paper 1 likely has higher impact: it reframes ASR into an interactive, multi-turn paradigm aligned with real human communication, introduces an evaluation metric (S²ER) and a simulation/benchmarking system, and targets a widely deployed technology with immediate HCI and agent applications. The contribution spans methodology (closed-loop correction), evaluation (semantic metric), and tooling (simulator), broadening uptake across ASR, LLM agents, and human–AI interaction. Paper 2 is solid and timely for diffusion LLMs, but its scope is narrower and dependent on diffusion-LM adoption.
Paper 1 proposes a fundamental paradigm shift in Automatic Speech Recognition from single-pass to interactive, agentic refinement, addressing a critical bottleneck in HCI. It introduces a novel semantic metric, open-source benchmarks, and rigorous multilingual evaluation. Its impact is highly broad, affecting ubiquitous voice assistants and LLM agents. In contrast, Paper 2 presents an innovative but highly domain-specific system architecture for financial investment research. Paper 1's broader applicability, rigorous quantitative benchmarking, and foundational contributions to a core AI technology give it higher potential scientific impact.
Paper 1 has higher impact potential due to clearer novelty and a concrete, reproducible contribution: a defined “Interactive ASR” task, an agentic closed-loop correction framework, a new semantic metric (S²ER), and a simulation system with multilingual/code-switching evaluations plus demos/code. It targets a large, mature, high-need domain (ASR for LLM assistants), enabling broad adoption and benchmarking. Paper 2 is promising but reads more conceptual/system-level, with harder-to-validate claims and narrower demonstrated scope (specific scientific workflows), making near-term community uptake and measurable impact less certain.
Paper 1 introduces a novel paradigm shift in ASR by formulating it as an interactive multi-turn refinement task, proposes a new semantic evaluation metric (S²ER), and provides a complete benchmarking framework. This addresses a fundamental limitation in a widely-used technology (ASR) with broad real-world applications in human-computer interaction and LLM-based assistants. Paper 2 offers an incremental improvement to model routing with process rewards, which is a narrower contribution within the LRM efficiency space. Paper 1's broader applicability, new evaluation paradigm, and alignment with the growing LLM-agent ecosystem give it higher potential impact.
Paper 1 presents a concrete, implementable framework (Agentic ASR) with experimental validation on multilingual benchmarks, a new evaluation metric (S²ER), and publicly available code and demo. It addresses a practical gap in ASR systems with measurable results. Paper 2 presents a theoretical governance framework (SMARt) for agentic AI safety, which is timely but remains largely theoretical without empirical validation. While Paper 2 addresses an important problem, Paper 1's combination of novelty, reproducible experiments, practical applicability to the growing LLM-agent ecosystem, and concrete contributions gives it higher near-term scientific impact.
Paper 1 introduces a novel paradigm shift in ASR by formulating interactive, multi-turn speech recognition with agentic correction, a new semantic evaluation metric (S²ER), and a simulation benchmark. This addresses a fundamental limitation in human-computer interaction with broad applications across multilingual and real-world settings. Paper 2, while solid, addresses a more incremental improvement in data selection for LLM mid-training. Paper 1's novelty in redefining the ASR problem, introducing new evaluation methodology, and broader cross-field impact (ASR, HCI, LLM agents) gives it higher potential impact.
Paper 1 presents a more fundamental and broadly applicable insight—that reasoning traces, not just answers, should be the unit of aggregation in multi-agent LLM systems. The 'aggregation paradox' is a novel theoretical finding with implications across all LLM reasoning tasks. It introduces a principled framework (Self-Consistent MoA) with provable guarantees and demonstrates improvements across diverse domains. Paper 2 addresses a meaningful but more narrowly scoped problem (interactive ASR correction) with an engineering-oriented framework. While valuable, its impact is more domain-specific compared to Paper 1's foundational contribution to LLM aggregation methodology.
Paper 1 has higher potential impact due to its strong novelty (LLM-generated compiler passes rather than kernels), large-scale open ecosystem (18K graphs + rigorous benchmark), and clear methodology (new metric, integrity defenses, discriminative/unsaturated benchmark). It targets a concrete, high-leverage bottleneck in ML systems (long-tail compiler performance) with broad downstream effects across deep learning frameworks and hardware backends. Paper 2 is timely and useful for HCI/ASR, but agentic correction and LLM-based semantic metrics build on rapidly evolving patterns and may face reproducibility/subjectivity risks versus Paper 1’s more grounded systems benchmark.
Paper 2 likely has higher impact due to broader applicability and timeliness: process-reward, step-level RL for computer-use agents targets a fast-growing area (GUI/web agents) with clear real-world automation value. Methodologically, decoupling live interaction from optimization via PRM-guided dense feedback and group-relative advantages addresses key RL bottlenecks (sparse rewards, credit assignment, distribution shift) and could generalize across agentic tasks beyond GUIs. Paper 1 is novel for interactive ASR and semantic evaluation, but its impact is more domain-specific to speech recognition and depends on LLM-judge reliability.
Paper 1 introduces a novel paradigm shift in ASR by formulating interactive, multi-turn speech recognition with a complete framework (Agentic ASR), a new semantic evaluation metric (S²ER), and a simulation system for benchmarking. It addresses a fundamental limitation of current ASR systems with broad real-world applications in human-computer interaction and LLM-based assistants. Paper 2, while valuable in evaluating feasibility awareness in tool-using agents, is more narrowly focused on evaluation/benchmarking of an existing problem. Paper 1's contributions span methodology, metrics, and systems, giving it broader potential impact.