Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia
Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval.
Xetrieval proposes an embedding-level mechanistic explanation framework for dense retrieval systems. The framework has two key components: (1) a reasoning internalizer — a lightweight MLP that approximates Chain-of-Thought (CoT) reasoning signals directly in embedding space via a single forward pass, avoiding costly autoregressive generation; and (2) a mechanistic explainer — a sparse autoencoder (SAE) that decomposes reasoning-enriched embeddings into sparse, human-interpretable features with natural language descriptions. Explanations for individual retrieval decisions are produced by identifying overlapping sparse features between query and document representations across multiple document "views" (original, summary, purpose, QA).
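To make the two-component pipeline concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a residual one-hidden-layer MLP as the internalizer, a top-k sparse encoder as the SAE, and an element-wise minimum as the overlap measure; the paper may use different choices for all three. All dimensions, weights, and function names are hypothetical, and random weights stand in for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes: embedding dim, MLP hidden dim, SAE features, active features.
D, H, F, K = 16, 32, 64, 4

# Randomly initialised weights stand in for trained parameters.
W1 = rng.normal(size=(D, H)) * 0.1
W2 = rng.normal(size=(H, D)) * 0.1
W_enc = rng.normal(size=(D, F))

def internalize(e):
    """Single forward pass of a residual MLP (assumed form of the internalizer)."""
    h = np.maximum(e @ W1, 0.0)
    out = e + h @ W2
    return out / np.linalg.norm(out)

def sae_encode(e, k=K):
    """Top-k sparse code: keep the k largest non-negative pre-activations."""
    pre = np.maximum(e @ W_enc, 0.0)
    code = np.zeros_like(pre)
    top = np.argsort(pre)[-k:]
    code[top] = pre[top]
    return code

def explain(query_emb, view_embs):
    """Sum the per-feature overlap between the query and each document view."""
    q = sae_encode(internalize(query_emb))
    overlap = np.zeros(F)
    for v in view_embs.values():
        d = sae_encode(internalize(v))
        overlap += np.minimum(q, d)  # assumed overlap measure
    ranked = np.argsort(overlap)[::-1]
    return [(int(i), float(overlap[i])) for i in ranked if overlap[i] > 0]

# Toy usage: four document views built as noisy copies of the query embedding.
query = rng.normal(size=D)
views = {name: query + 0.1 * rng.normal(size=D)
         for name in ("original", "summary", "purpose", "qa")}
features = explain(query, views)
```

Each returned pair is a feature index with its aggregated overlap score. In the framework as described, each index would additionally map to a natural language description; that lookup is omitted here.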
The core novelty lies in the combination of reasoning internalization with SAE-based decomposition applied specifically to dense retrieval explanation. While SAE-based interpretability has been explored in LLMs and, more recently, in embeddings (Park et al., 2025; Kang et al., 2025), the reasoning internalizer component and the multi-view aggregation strategy are distinct contributions.
The paper addresses a genuine need: as dense retrieval becomes ubiquitous in RAG pipelines and search systems, understanding *why* specific documents are retrieved is increasingly important for debugging, auditing, and trust. The framework could benefit:
However, the practical impact is tempered by several factors: the explanations remain at the sentence-embedding level (not probing internal circuits), the framework requires training both an internalizer and an SAE per retriever, and scalability to production-scale corpora with billions of documents is undemonstrated.
The paper is highly timely. Dense retrieval explainability is an emerging concern as RAG systems proliferate, and mechanistic interpretability via SAEs is a hot topic in the LLM interpretability community. Applying SAE-based analysis to retrieval embeddings is a natural extension that several groups have begun exploring concurrently (Park et al., 2025; Kang et al., 2025). The reasoning internalization idea also connects to the growing interest in reasoning-enhanced retrieval (BRIGHT benchmark, ReasonIR).
The case studies (Tables 13-16) are illustrative but cherry-picked. The paper would benefit from systematic failure analysis — when do Xetrieval's explanations fail or mislead? The ethical considerations section appropriately warns against over-interpretation, but concrete failure modes would strengthen the contribution.
The scalability analysis (Fig. 6) is limited to 60K documents, far below real-world retrieval scales. The claimed efficiency advantage of the internalizer over CoT reasoning is clear, but the absolute overhead of SAE encoding at million-document scale deserves attention.
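A rough back-of-envelope estimate shows how the SAE encoding cost grows between the 60K documents evaluated and a million-document corpus. This is only a sketch: it assumes a dense encoder costing two FLOPs per weight per embedding and ignores the internalizer, indexing, and I/O. The embedding dimension and feature count are hypothetical placeholders, not values from the paper; only the four document views and the 60K figure come from the text above.

```python
def sae_encoding_flops(num_docs, views_per_doc, embed_dim, num_features):
    """FLOPs for one dense SAE encoder pass over every document view."""
    per_embedding = 2 * embed_dim * num_features  # one multiply and one add per weight
    return num_docs * views_per_doc * per_embedding

# embed_dim and num_features are assumed values for illustration only.
small = sae_encoding_flops(60_000, 4, 1024, 16_384)
large = sae_encoding_flops(1_000_000, 4, 1024, 16_384)
print(f"60K docs: {small:.2e} FLOPs; 1M docs: {large:.2e} FLOPs")
```

Under these assumptions the cost is linear in corpus size, so the open question is the constant factor on real hardware rather than the growth rate.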
Generated May 29, 2026
Paper 2 addresses a fundamental and pervasive issue in modern AI: the opacity of dense embeddings in retrieval systems. By providing a mechanistic, embedding-level explanation framework, it has broad applicability across information retrieval, NLP, and explainable AI. Paper 1, while innovative in long-horizon agent simulation, focuses on a more specialized application (organizational dynamics) that likely has a narrower immediate scientific impact compared to advancing the interpretability of core retrieval mechanisms.
Paper 2 likely has higher impact due to broader and more timely applicability: mechanistic interpretability for dense retrieval directly targets widely deployed IR/RAG systems across NLP and search. Its framework (embedding-level reasoning internalizer + sparse interpretable feature decomposition + intervention/steering) offers reusable tools for auditing, debugging, and controllability, with potential cross-field influence in interpretability and retrieval. Paper 1 is novel and rigorous for manifold-aware denoising in SSP/neuromorphic SLAM, but despite strong results its impact is narrower, confined largely to the VSA/SSP and spiking robotics communities.
Paper 1 has higher likely impact due to its end-to-end, infrastructure-level contribution: a verifiable simulation + procedural home generation + intent-to-success-condition compilation + search-based trajectory synthesis + iterative RL with environment feedback, plus a benchmark. This creates a scalable data flywheel for embodied/smart-home agents with immediate real-world applicability and broad relevance (LLM agents, robotics/simulation, RL, evaluation). Paper 2 is novel and useful for interpretability of dense retrieval, but its impact is narrower and more incremental, mainly affecting IR/interpretability rather than enabling a new applied training pipeline.
Paper 2 (Xetrieval) is likely to have higher scientific impact due to broader applicability and timeliness: dense retrieval underpins search, RAG, and recommendation, and mechanistic, embedding-level explanations address a widely felt interpretability gap. Its framework (reasoning internalizer + sparse feature decomposition + interventions/steering) offers reusable tools for debugging, safety, and controllability across many models and tasks, potentially influencing both IR and LLM systems. Paper 1 is solid and novel but more narrowly scoped to structured search traces in specific planning-style environments.
Paper 2 is likely to have higher scientific impact: it tackles a timely, widely used ML component (dense retrieval) and offers a concrete mechanistic interpretability framework with demonstrated interventions, steering, benchmarks, and released code—supporting adoption and follow-on work. Its applications span search, RAG systems, auditing, and alignment, giving broad cross-field relevance. Paper 1 is novel conceptually for causal rare-event pathways, but appears more theoretical with narrower immediate applicability and uncertain empirical validation, potentially limiting near-term uptake.
Paper 2 likely has higher scientific impact due to greater methodological novelty and broader applicability: it introduces a mechanistic, embedding-level explanation framework for dense retrieval, with interventions and feature steering—capabilities relevant across IR, NLP, and interpretability research. It appears more technically rigorous (decomposition, multi-view aggregation, benchmarked experiments) and timely given widespread deployment of dense retrievers in RAG systems. Paper 1 is useful and relevant for clinical AI trend surveillance, but is largely descriptive with limited sample-based validation and narrower cross-field methodological innovation.
Paper 1 (Xetrieval) offers a more novel and rigorous contribution to mechanistic interpretability of dense retrieval, a fundamental problem in information retrieval and NLP. Its embedding-level mechanistic framework with reasoning internalization and sparse feature decomposition represents genuine methodological innovation. Paper 2 (AgentDoG 1.5), while addressing the important area of AI agent safety, raises credibility concerns with claims like comparing to 'GPT-5.4' and appears more incremental as an engineering framework. Paper 1's interpretability contributions have broader cross-field impact and stronger methodological foundations.
Paper 1 targets a widely felt pain point in LLM deployment (prompt sensitivity/robustness) and proposes a simple, potentially low-cost fine-tuning “debiasing” method with theoretical characterization and empirical validation, plus a path toward robustness certification. This is timely and broadly applicable across essentially all LLM-based systems, impacting reliability, safety, and evaluation. Paper 2 is innovative and useful for interpretability in dense retrieval, but its impact is narrower, limited mainly to retrieval/explainability pipelines, and may depend more on adoption of its specific framework. Overall breadth and real-world relevance favor Paper 1.
Paper 2 addresses a fundamental and highly timely challenge in AI—mechanistic interpretability of dense retrieval models. By providing a novel framework to decode opaque embeddings into interpretable features, it has broad implications for NLP, information retrieval, and AI safety. In contrast, Paper 1 presents a solid but domain-specific application of LLMs and spatial data for urban planning, which, while valuable, has a narrower scope of scientific influence and cross-disciplinary impact.
SAAS addresses a highly practical and timely problem—over-search in agentic LLM systems—with a well-structured RL framework featuring three novel components. As agentic AI systems scale rapidly, reducing computational costs while maintaining accuracy has broad real-world impact across all LLM-based search applications. Paper 2 (Xetrieval) contributes to interpretability of dense retrieval, which is valuable but more niche. SAAS's direct applicability to reducing inference costs in widely deployed agentic systems, combined with its methodological rigor (boundary modeling, reward design, curriculum learning), gives it higher potential impact.