Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan
Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose HSCHG, a hierarchical semantic constrained heterogeneous graph framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. In parallel, we employ a gated fusion strategy with dual-threshold filtering, which introduces cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. On top of this graph, we map the multi-level audio-visual representations and text prototypes into a shared hyperbolic space, where a hierarchical entailment regularization loss characterizes the hierarchical relationships between videos and segments. Extensive experiments show that our method outperforms existing methods on the OV-AVEL benchmark, and ablation studies further validate the contribution of each component.
The paper addresses Open-Vocabulary Audio-Visual Event Localization (OV-AVEL), proposing HSCHG — a framework that combines a heterogeneous hierarchical graph network in Euclidean space with hyperbolic entailment constraints. The key novelty lies in three interlocking mechanisms: (1) a heterogeneous graph with multi-directional temporal edges and dual-threshold gated cross-modal fusion for robust audio-visual reasoning; (2) bidirectional semantic constraints between segment- and video-level representations; and (3) a hierarchical entailment regularization loss in hyperbolic space that enforces geometric consistency between video/segment embeddings and text prototypes. The framework addresses two specific gaps: maintaining audio-visual consistency across temporal scales without supervision for unseen categories, and enforcing hierarchical semantic consistency between segment and video levels.
The methodology is technically detailed and well-structured. The graph construction with three types of intra-modal edges (undirected, forward, backward) and the piecewise cross-modal weighting scheme (Eq. 10) are well-motivated by the need to handle asynchronous audio-visual signals. The dual-threshold filtering mechanism is a practical design choice that prevents premature cross-modal fusion when alignment confidence is low.
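To make the graph design discussed above concrete, the following is a minimal NumPy sketch of the two ideas, not the paper's implementation. It builds the three intra-modal edge types (undirected, forward, backward) as boolean adjacency masks and computes a piecewise cross-modal weight from per-segment cosine similarity. The function names, the `window` size, the `low`/`high` thresholds, and the linear ramp between them are all assumptions made for illustration; the actual weighting in Eq. 10 of the paper is not reproduced here and may differ.

```python
import numpy as np


def temporal_edges(num_segments, window=1):
    """Build three boolean adjacency masks over the segments of one modality.

    `window` (hypothetical) limits edges to temporally nearby segments.
    """
    idx = np.arange(num_segments)
    offset = idx[None, :] - idx[:, None]        # offset[i, j] = j - i
    near = (np.abs(offset) <= window) & (offset != 0)
    forward = near & (offset > 0)               # edge from i to a later segment j
    backward = near & (offset < 0)              # edge from i to an earlier segment j
    undirected = near                           # symmetric neighbourhood
    return undirected, forward, backward


def dual_threshold_gate(audio, visual, low=0.2, high=0.6):
    """Piecewise cross-modal weight per segment (assumed form).

    Weight is 0 below `low`, 1 above `high`, and ramps linearly in between,
    so cross-modal information is admitted only when alignment is confident.
    """
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    v = visual / np.linalg.norm(visual, axis=-1, keepdims=True)
    sim = np.sum(a * v, axis=-1)                # cosine similarity per segment
    return np.clip((sim - low) / (high - low), 0.0, 1.0)


# Toy features: 3 segments, 2 dimensions (illustrative values only).
audio = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
visual = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

gate = dual_threshold_gate(audio, visual)
# Gated fusion: audio contributes to the visual node only where the gate opens.
fused_visual = visual + gate[:, None] * audio
undirected, forward, backward = temporal_edges(num_segments=4)
```

In this sketch the second segment, whose audio and visual features are orthogonal, receives a gate of zero and therefore keeps its visual feature unchanged, which mirrors the review's point that fusion is withheld when alignment confidence is low.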
However, several concerns arise:
The combination of Euclidean graph-based reasoning with hyperbolic entailment constraints is an interesting architectural paradigm that could influence broader multimodal learning. Specific impact pathways include:
The impact is somewhat constrained by the narrow task focus and single-dataset evaluation.
OV-AVEL is a timely problem — open-vocabulary recognition has gained significant traction following CLIP and similar foundation models. The paper builds naturally on the recently proposed OV-AVEBench [20] (CVPR 2025), making it relevant to current research trends. The use of hyperbolic geometry for hierarchical multimodal learning is also an emerging direction, with works like HyperAVCA and HyCoCLIP demonstrating its promise. The paper sits at a productive intersection of these trends.
The paper's writing is generally clear, though dense. The mathematical formulation is complete and reproducible. The use of ImageBind as a frozen feature extractor ensures fair comparison but also limits the method's potential to learn task-specific representations. The entailment cone loss (Eq. 26) borrows directly from prior work (HEC, HyCoCLIP), and the novelty lies primarily in its application to the OV-AVEL setting rather than in the geometric formulation itself.
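For readers unfamiliar with entailment cones, the sketch below illustrates the general idea using the Poincaré-ball formulation from the hyperbolic entailment cones (HEC) line of work, with curvature fixed to -1. It is an illustration under stated assumptions and is not the paper's Eq. 26: HyCoCLIP-style methods typically work in the Lorentz model, and the choice of which embedding (text prototype, video, or segment) acts as the parent is an assumption here. The function names and the aperture constant `k=0.1` are hypothetical.

```python
import numpy as np

EPS = 1e-7


def exp_map_origin(v):
    """Map Euclidean vectors onto the unit Poincare ball (curvature -1)."""
    norm = np.maximum(np.linalg.norm(v, axis=-1, keepdims=True), EPS)
    return np.tanh(norm) * v / norm


def half_aperture(x, k=0.1):
    """Half-aperture of the cone rooted at x; narrower for points far from the origin."""
    norm = np.maximum(np.linalg.norm(x, axis=-1), EPS)
    # Clipping keeps arcsin defined for points very close to the origin.
    return np.arcsin(np.clip(k * (1.0 - norm**2) / norm, -1.0, 1.0))


def exterior_angle(x, y):
    """Angle at x between the cone axis (ray from the origin through x) and the geodesic to y."""
    xy = np.sum(x * y, axis=-1)
    nx2 = np.sum(x * x, axis=-1)
    ny2 = np.sum(y * y, axis=-1)
    num = xy * (1.0 + nx2) - nx2 * (1.0 + ny2)
    den = (np.sqrt(nx2) * np.linalg.norm(x - y, axis=-1)
           * np.sqrt(np.maximum(1.0 + nx2 * ny2 - 2.0 * xy, 0.0)))
    return np.arccos(np.clip(num / np.maximum(den, EPS), -1.0, 1.0))


def entailment_loss(parent, child, k=0.1):
    """Zero when the child lies inside the parent's cone, positive otherwise."""
    return np.maximum(0.0, exterior_angle(parent, child) - half_aperture(parent, k))


# Toy example: the more general embedding sits closer to the origin.
general = exp_map_origin(np.array([[0.5, 0.0]]))
specific = exp_map_origin(np.array([[1.5, 0.0]]))

loss_ok = entailment_loss(general, specific)    # child inside the parent's cone
loss_bad = entailment_loss(specific, general)   # hierarchy reversed
```

The loss penalizes a child embedding only by the amount its direction exceeds the parent's cone aperture, which is why a correctly ordered pair on the same ray incurs no penalty while the reversed pair is penalized heavily.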
Generated Jun 8, 2026
Paper 1 is more likely to have higher scientific impact: it targets a broadly relevant, timely problem in AI safety/governance—how autonomous systems should refuse, justify, and allow override of requests—connecting technical design with accountability, security risk, and liability. This scope can influence multiple fields (AI alignment, HCI, policy, security, robotics) and has direct real-world applicability as LLM/agent deployment accelerates. Paper 2 appears methodologically stronger but is a narrower incremental contribution within audio-visual event localization, with more limited cross-domain reach.
Paper 1 presents an innovative integration of LLM-based multi-agent systems with finite element analysis for physical hardware design. This approach bridges AI and traditional engineering, offering profound real-world applications in EV and robotics design. Its automated, uncertainty-aware pipeline represents a significant paradigm shift in AI-for-Engineering, giving it a broader potential scientific and industrial impact compared to Paper 2's methodological improvements in a more narrowly defined multimodal learning task.
Paper 2 addresses a broader, more fundamental challenge—Self-Explainability in complex AI systems—through a systematic literature review that establishes definitions, taxonomy, and levels of SX. This foundational framework has wide cross-disciplinary applicability (AI, autonomous systems, robotics, etc.) and addresses the timely, critical need for trustworthy AI. Paper 1, while technically solid, offers incremental improvements to a narrow task (audio-visual event localization) with limited broader impact beyond its specific benchmark.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable new heuristic class for bidirectional search with clear theoretical motivation and preserved optimality guarantees, and demonstrates substantial computational savings across multiple domains. This kind of algorithmic contribution can transfer across planning, routing, robotics, and AI search more widely than a specialized OV-AVEL architecture. Paper 1 is innovative (heterogeneous graphs + hyperbolic embeddings + semantic constraints) but is more domain-specific to audio-visual event localization and may have narrower cross-field uptake.
Paper 2 has higher likely scientific impact due to broader applicability and timeliness: open-vocabulary audio-visual event localization is a core multimodal ML problem relevant to video understanding, retrieval, surveillance, robotics, and human-computer interaction. Its methodological contributions (heterogeneous hierarchical graph, semantic constraints across temporal scales, gated fusion, and hyperbolic embedding with entailment regularization) are more generalizable beyond a single domain. Paper 1 is practically useful but is domain-specific (TCM), with impact constrained by clinical validation requirements and narrower transferability.
Paper 2 addresses a fundamental challenge in real-world ML deployment: off-policy evaluation when agents strategically alter their behavior. By bridging causal inference, machine learning, and mechanism design, its novel approach to handling information asymmetry offers broad theoretical implications and interdisciplinary impact. In contrast, Paper 1 presents a highly specialized architectural improvement for a specific multimodal task, making its potential impact narrower and more incremental.
Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: long-horizon memory. Its unified memory framework has broad, cross-disciplinary applicability in AI, robotics, and automation, offering significant real-world utility. While Paper 2 presents a rigorous methodology for audio-visual event localization, its focus is much narrower, limiting its overall scientific and practical impact compared to advancing general-purpose autonomous agents.
Paper 1 tackles a highly relevant problem in multimedia processing (open-vocabulary audio-visual event localization) using an innovative combination of heterogeneous graphs and hyperbolic space. Its ability to generalize to unseen categories has vast real-world applications in video search, automated moderation, and robotics. While Paper 2 provides valuable empirical insights for SAT solvers, Paper 1 aligns with the rapidly expanding field of multimodal deep learning, giving it higher immediate applicability and broader citation potential across AI communities.
DyCon addresses a broadly relevant problem (LLM reasoning efficiency) affecting the rapidly growing field of large reasoning models. It offers a training-free, generalizable framework validated across 4 models and 12 benchmarks, with immediate practical applications for reducing computational costs. The insight that difficulty is linearly encoded in step-level embeddings is novel and could inspire further research. Paper 2, while solid, addresses a narrower task (audio-visual event localization) with more incremental contributions combining known techniques (hyperbolic embeddings, graph networks), limiting its broader impact.
Paper 1 has higher potential impact due to stronger novelty and timeliness: an open-world, zero-shot retrieve-and-reason framework for rapidly evolving memes plus a new 2024–2026 benchmark with external knowledge annotations. This targets a high-visibility, real-world problem (content understanding/moderation) where model staleness is acute, and the benchmark could become a community reference. Paper 2 is methodologically rich (graphs + hyperbolic space) but is a more incremental advance within a narrower task/dataset (OV-AVEL), with likely smaller cross-domain adoption.