Learning Quantifiable Visual Explanations Without Ground-Truth
Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, Dimosthenis Karatzas
Abstract
Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.
AI Impact Assessments
(1 model) Scientific Impact Assessment
1. Core Contribution
This paper makes two interrelated contributions to explainable AI (XAI) for visual models. First, it proposes Minimality-Sufficiency Integration (MSI), a perturbation-based evaluation metric for saliency maps that jointly accounts for the sufficiency of highlighted regions (do they contain enough information for prediction?) and their minimality (are they compact?). Second, it introduces Learnable Adapter eXplanation (LAX), a lightweight module trained on top of frozen black-box models to produce saliency maps that optimize a differentiable approximation of MSI, using information bottleneck principles.
The problems addressed — how to evaluate XAI methods without ground truth, and how to generate explanations that are both faithful and parsimonious — are genuine and important. The paper identifies two specific failure modes of existing metrics: insensitivity to overly large masks and poor handling of inputs with multiple valid explanatory regions. MSI attempts to address both via its base score (sufficiency/discriminativeness) and mask penalty (minimality) components.
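The interplay between a sufficiency term and a mask-size penalty can be illustrated with a toy sketch. This is not the paper's MSI formula, which is not reproduced in this assessment. The names `blend`, `toy_score`, `toy_model`, and `penalty_weight` are hypothetical, and the linear form "class probability minus weighted mask area" is an assumption chosen only to show why a size penalty separates a tight mask from an overly large one.

```python
from typing import Callable, List, Sequence


def blend(x: Sequence[float], mask: Sequence[float],
          baseline: Sequence[float]) -> List[float]:
    """Continuous perturbation: keep x where mask is 1, fade to baseline where it is 0."""
    return [m * xi + (1.0 - m) * bi for xi, m, bi in zip(x, mask, baseline)]


def toy_score(model: Callable[[Sequence[float]], float],
              x: Sequence[float], mask: Sequence[float],
              baseline: Sequence[float], penalty_weight: float = 0.5) -> float:
    """Hypothetical score: sufficiency minus a penalty on mask area (not the paper's MSI)."""
    sufficiency = model(blend(x, mask, baseline))  # class probability on the masked input
    mask_area = sum(mask) / len(mask)              # fraction of the input that is kept
    return sufficiency - penalty_weight * mask_area


def toy_model(v: Sequence[float]) -> float:
    """Stand-in classifier (assumption): only the first two features matter."""
    return min(1.0, max(0.0, (v[0] + v[1]) / 2.0))


x = [1.0, 1.0, 0.3, 0.7]
baseline = [0.0, 0.0, 0.0, 0.0]

tight = toy_score(toy_model, x, [1, 1, 0, 0], baseline)  # covers only the relevant features
full = toy_score(toy_model, x, [1, 1, 1, 1], baseline)   # overly large mask
wrong = toy_score(toy_model, x, [0, 0, 1, 1], baseline)  # misses the relevant features
print(tight, full, wrong)
```

In this toy setting the tight and full masks are equally sufficient, so a sufficiency-only score cannot tell them apart; the area penalty is what ranks the tight mask higher, which is the first failure mode described above.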
2. Methodological Rigor
The methodology has several strengths but also notable gaps:
Strengths:
Weaknesses:
3. Potential Impact
The paper addresses a real pain point in XAI — the lack of standardized, reliable evaluation metrics. If MSI proves robust across diverse settings, it could serve as a useful complementary metric. The LAX framework's model-agnostic, adapter-based design is appealing for practical deployment, as it avoids retraining the base model.
However, the impact may be limited by several factors:
4. Timeliness & Relevance
The paper addresses XAI evaluation, a timely and open problem receiving increasing attention (e.g., IDSDS at NeurIPS 2024, F-fidelity at ICLR 2025). The information bottleneck framing connects to a well-established theoretical tradition. However, the paper's scope feels somewhat behind the frontier: it focuses on ResNet18 and ViT-B/16 on CIFAR-10-level datasets, while the field is moving toward explaining much larger and more complex models. The lack of engagement with recent large-scale XAI benchmarks or foundation model explainability limits its relevance to current practice.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations
Generated May 19, 2026
Comparison History (14)
PEEK addresses the timely and high-impact problem of improving LLM agents operating over long contexts, which is central to the rapidly growing field of AI agents. It introduces a novel concept (context maps as orientation caches), demonstrates strong empirical gains across multiple benchmarks and architectures including production systems (OpenAI Codex), and offers practical efficiency improvements (lower cost, fewer iterations). Paper 2 contributes a useful XAI metric and method, but operates in a more mature and narrower subfield with less transformative potential. PEEK's broader applicability to the booming LLM agent ecosystem gives it higher impact potential.
Paper 2 likely has higher impact due to timeliness and broad applicability: agentic LLMs are a rapidly expanding area, and a modular framework for executable, reusable skills can transfer across many tasks (web, math, coding) and deployment settings (inference-time guardrails, post-training supervision, self-improvement). The reported large empirical gains on strong baselines suggest practical relevance. Paper 1 addresses an important XAI evaluation gap and proposes a novel metric plus adapter, but XAI impact is often constrained by domain-specific validation needs and slower adoption compared to agent frameworks.
Paper 2 proposes a novel, actionable framework (metric + method) for evaluating and generating XAI explanations without ground truth, addressing a widely recognized gap in explainable AI. It offers concrete technical contributions—a differentiable metric and a trainable adapter module—with broad applicability across deep learning models. Paper 1 makes a valuable conceptual argument about alignment evaluation levels but is more of a position/audit paper with narrower methodological novelty. Paper 2's technical contributions are more likely to be adopted, cited, and built upon across multiple ML application domains.
Paper 1 addresses a fundamental bottleneck in Explainable AI—the lack of ground truth for evaluation—by introducing a novel, differentiable metric and a universally applicable adapter for black-box models. This methodological innovation has broad implications across all deep learning applications requiring trustworthiness, offering high real-world utility. While Paper 2 provides a valuable diagnostic benchmark for LLMs, Paper 1's foundational contribution to generating and quantifying causal explanations presents a more universally transformative approach to responsible AI deployment.
Paper 2 likely has higher impact due to its large, community-reusable benchmark (52k items from 103 textbooks) and the added dependency graph enabling new evaluation axes. Benchmarks often catalyze broad progress across NLP, theorem proving, and AI-for-math, with clear real-world relevance to verified mathematics and formal methods. Its rigor is supported by scale, careful construction, and strong baseline experiments revealing clear gaps (low accuracy, dependency-depth degradation). Paper 1 is valuable for XAI evaluation and training, but its impact may be narrower and more sensitive to assumptions about perturbation-based causal metrics.
Paper 2 has higher likely impact due to its broad, timely synthesis of dynamic, intervention-aware clinical prediction—directly targeting major real-world deployment barriers (treatment confounding feedback, informative/irregular observation, identifiability). As a unifying framework/review bridging forecasting, counterfactual trajectories, and policy evaluation with concrete evaluation/validation guidance, it can influence multiple subfields (clinical ML, causal inference, time-series modeling, health policy) and shape standards for “decision-grade” evidence. Paper 1 is novel and useful for XAI evaluation, but is narrower in application scope and ecosystem-level influence.
Paper 2 addresses a fundamental challenge in Explainable AI (XAI) by providing a quantifiable evaluation metric and generation method without requiring ground truth. Its contributions offer broad applicability across numerous deep learning domains (e.g., healthcare, autonomous driving), leading to extensive scientific impact. In contrast, Paper 1, while highly rigorous and commercially valuable, is relatively constrained to the specific domain of e-commerce web agents.
Paper 2 integrates Large Language Models with formal verification, addressing major bottlenecks (cost and reliability) in LLM-driven program synthesis. By using concrete counterexamples instead of simple scores, it drastically improves efficiency and correctness. Given the exploding interest in LLMs for code generation and automated planning, this methodology has higher potential for broad, immediate impact across AI and software engineering compared to Paper 1's XAI evaluation metric.
CATO introduces a fundamentally novel architecture for neural PDE operators that addresses key limitations in handling complex geometries through learned chart spaces and derivative-aware physics losses, backed by theoretical approximation guarantees. The 26.76% improvement over baselines with 82% parameter reduction represents a significant advance. While Paper 1 contributes a useful XAI metric and fine-tuning method, the XAI evaluation space is crowded. CATO's contributions span computational physics, geometry processing, and deep learning theory, with broader potential impact on scientific computing applications.
Paper 1 addresses a critical and fundamental bottleneck in modern AI—evaluating explainability without ground truth. By providing both a rigorous, quantifiable metric based on causal sufficiency/necessity and a novel adapter method, it offers broad theoretical and practical utility across all deep learning domains. Paper 2 presents a valuable applied framework for operations research, but Paper 1's contribution to foundational AI methodology gives it higher potential for widespread scientific impact and adoption.
Paper 1 addresses a critical and widely-recognized challenge in XAI—evaluating explanation quality without ground truth—and proposes both a metric and a practical method applicable to any black-box model. This has immediate broad applicability across all domains using deep learning, aligning with growing regulatory demands for AI transparency. Paper 2, while intellectually interesting in studying developmental self-organisation with NCAs and information-theoretic analysis, addresses a more niche intersection of artificial life and computational biology with narrower immediate practical applications. Paper 1's combination of practical utility, broad applicability, and timeliness gives it higher potential impact.
Paper 1 addresses a critical bottleneck in AI safety and interpretability by providing a quantifiable evaluation metric for XAI without requiring ground truth. Furthermore, it introduces a novel adapter method applicable to any black-box model. While Paper 2 provides a valuable LLM benchmark for a specific reasoning domain, Paper 1 offers a broader methodological breakthrough with universal applicability across deep learning, leading to higher potential scientific impact.
Paper 2 offers a foundational contribution to Explainable AI (XAI), a critical bottleneck across all deep learning applications. By introducing a novel, ground-truth-free metric for XAI validation and a method to train explanation adapters for any black-box model, it provides high breadth of impact and strong real-world applicability in regulated domains. While Paper 1 presents a rigorous human-machine teaming study, its impact is confined to the narrower niche of interactive multi-agent reinforcement learning. Paper 2's generalizability and relevance to responsible AI give it a significantly higher potential for widespread scientific impact.
Paper 1 (OCCAM) is more novel by combining open-set concept discovery, text-guided segmentation, causal interventions, and dataset-level ontology induction, moving beyond per-image attributions to global, structured model understanding. This offers broader real-world utility (auditing, bias discovery, model debugging) and wider cross-field impact (vision, interpretability, causal analysis, knowledge/ontology learning). Paper 2 advances evaluation via a perturbation-based metric and a trainable explanation adapter, but metrics for XAI are a crowded area and the approach may be more incremental and narrower in scope.