Learning Quantifiable Visual Explanations Without Ground-Truth

Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, Dimosthenis Karatzas

May 18, 2026

arXiv:2605.18681v1

cs.AI (primary), cs.LG

#741 of 2292 · Artificial Intelligence

Tournament Score: 1449 ± 45 (scale 1050 to 1800)

Win Rate: 50%

Rating: 4.5 / 10

Significance 5 · Rigor 4 · Novelty 5.5 · Clarity 6.5

Abstract

Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper makes two interrelated contributions to explainable AI (XAI) for visual models. First, it proposes Minimality-Sufficiency Integration (MSI), a perturbation-based evaluation metric for saliency maps that jointly accounts for the sufficiency of highlighted regions (do they contain enough information for prediction?) and their minimality (are they compact?). Second, it introduces Learnable Adapter eXplanation (LAX), a lightweight module trained on top of frozen black-box models to produce saliency maps that optimize a differentiable approximation of MSI, using information bottleneck principles.

The problems addressed — how to evaluate XAI methods without ground truth, and how to generate explanations that are both faithful and parsimonious — are genuine and important. The paper identifies two specific failure modes of existing metrics: insensitivity to overly large masks and poor handling of inputs with multiple valid explanatory regions. MSI attempts to address both via its base score (sufficiency/discriminativeness) and mask penalty (minimality) components.
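The first failure mode above (insensitivity to overly large masks) can be illustrated with a toy example. The sketch below is not the paper's MSI definition: `toy_model`, `toy_score`, and the `weight` parameter are hypothetical names chosen for illustration, and the score is simply "confidence on the kept region minus a penalty proportional to mask area".

```python
import math

def toy_model(pixels):
    """Hypothetical stand-in classifier: confidence depends only on pixels 2 and 3."""
    evidence = 4.0 * (pixels[2] + pixels[3]) - 4.0
    return 1.0 / (1.0 + math.exp(-evidence))

def apply_mask(pixels, mask, baseline=0.0):
    """Keep pixels where the mask is 1 and replace the rest with a baseline value."""
    return [p if m else baseline for p, m in zip(pixels, mask)]

def sufficiency(pixels, mask):
    """Model confidence when only the masked-in region is visible."""
    return toy_model(apply_mask(pixels, mask))

def area(mask):
    """Fraction of the input that the mask keeps."""
    return sum(mask) / len(mask)

def toy_score(pixels, mask, weight=0.5):
    """Illustrative score (an assumption, not MSI): sufficiency minus an area penalty."""
    return sufficiency(pixels, mask) - weight * area(mask)

image = [0.2, 0.1, 1.0, 1.0, 0.3, 0.1, 0.2, 0.1]
tight = [0, 0, 1, 1, 0, 0, 0, 0]   # keeps only the informative pixels
full = [1] * len(image)            # keeps everything

for name, mask in [("tight", tight), ("full", full)]:
    print(name, round(sufficiency(image, mask), 3), round(toy_score(image, mask), 3))
```

In this toy setting a sufficiency-only score cannot separate the two masks, since both preserve the informative pixels, while any penalty on area ranks the tight mask higher. How MSI actually combines its base score and mask penalty is defined in the paper and may differ substantially from this sketch.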

2. Methodological Rigor

The methodology has several strengths but also notable gaps:

Strengths:

The formalization through the information bottleneck framework is principled, connecting minimality to I(X,T) minimization and sufficiency to I(T,y) maximization.

The entropy-based regularization for mask sparsity (Eq. 7) is a reasonable alternative to L1 that encourages sharper masks.

The architecture-agnostic design of LAX (adapter on frozen backbone) is practical.
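To make the contrast between entropy-based and L1 regularization concrete, here is a minimal sketch. It assumes a per-pixel binary entropy averaged over the mask; the paper's Eq. 7 may be formulated differently, and the function names are hypothetical.

```python
import math

def binary_entropy(m, eps=1e-12):
    """Entropy of a single mask value m in [0, 1], clamped to avoid log(0)."""
    m = min(max(m, eps), 1.0 - eps)
    return -(m * math.log(m) + (1.0 - m) * math.log(1.0 - m))

def entropy_penalty(mask):
    """Mean binary entropy: highest for values near 0.5, near zero for values at 0 or 1."""
    return sum(binary_entropy(m) for m in mask) / len(mask)

def l1_penalty(mask):
    """Mean absolute value: depends on how much mass the mask has, not on how sharp it is."""
    return sum(abs(m) for m in mask) / len(mask)

soft = [0.5, 0.5, 0.5, 0.5]    # undecided everywhere
sharp = [1.0, 1.0, 0.0, 0.0]   # same total mass, fully binarized

print("L1:     ", l1_penalty(soft), l1_penalty(sharp))
print("entropy:", round(entropy_penalty(soft), 3), round(entropy_penalty(sharp), 3))
```

The two masks receive the same L1 penalty, but only the soft one is penalized by the entropy term. Note that an all-ones mask also has near-zero entropy, so a term of this form rewards sharpness rather than small area on its own.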

Weaknesses:

The MSI metric depends critically on α_min, which must be tuned per dataset. The paper acknowledges this but treats it casually ("simple to optimise... by searching through possible values"). This undermines the metric's objectivity as an evaluation standard — different α_min choices could rank methods differently.

The experimental evaluation is limited in scope: only three datasets (Synthetic MNIST, CUB-200, CIFAR-10), relatively simple architectures (ResNet18, ViT-B/16), and a narrow set of baselines (mostly CAM variants). Missing are comparisons against perturbation-based methods like LIME, RISE, or SHAP, and learning-based explanation methods like DIB-X or other IB-based approaches referenced in the related work.

Statistical significance is not reported for any results. Given small differences on some metrics, it's unclear whether improvements are meaningful.

The claim that MSI "aligns better with human intuitions" is supported only through cherry-picked qualitative examples, not through any user study or systematic human evaluation.

The paper does not evaluate whether the base model's classification accuracy is preserved after LAX training, despite claiming it does not degrade performance. This is a critical omission.
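One way the missing significance analysis could be approached is a paired bootstrap over per-image metric scores. The sketch below is a suggestion under stated assumptions, not a procedure from the paper: the function name, the percentile interval, and the example scores are all hypothetical.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean per-image difference (A minus B)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample images with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return sum(diffs) / n, means[lo_idx], means[hi_idx]

# Made-up per-image scores for two explanation methods on the same images.
method_a = [0.60, 0.70, 0.65, 0.80, 0.75, 0.70, 0.68, 0.72]
method_b = [0.50, 0.45, 0.60, 0.55, 0.50, 0.62, 0.50, 0.60]

mean_diff, lower, upper = paired_bootstrap_ci(method_a, method_b)
print(f"mean difference {mean_diff:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the difference is unlikely to be resampling noise. With real results the per-image scores would come from the evaluation pipeline, and a larger number of resamples would be advisable.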

3. Potential Impact

The paper addresses a real pain point in XAI — the lack of standardized, reliable evaluation metrics. If MSI proves robust across diverse settings, it could serve as a useful complementary metric. The LAX framework's model-agnostic, adapter-based design is appealing for practical deployment, as it avoids retraining the base model.

However, the impact may be limited by several factors:

The evaluation is confined to image classification on relatively small-scale datasets. Modern XAI challenges increasingly involve large-scale models (foundation models, LLMs), multi-modal systems, and more complex tasks (detection, segmentation, VQA).

The metric's dependence on α_min tuning limits its use as a standardized benchmark.

The restriction to pixel-level saliency maps means this work does not address concept-level or textual explanations, which are increasingly prominent in XAI.

4. Timeliness & Relevance

The paper addresses a timely topic — XAI evaluation — which is indeed an open problem receiving increasing attention (e.g., IDSDS at NeurIPS 2024, F-fidelity at ICLR 2025). The information bottleneck framing connects to a well-established theoretical tradition. However, the paper's scope feels somewhat behind the frontier: it focuses on ResNet18 and ViT-B/16 on CIFAR-10-level datasets, while the field is moving toward explaining much larger and more complex models. The lack of engagement with recent large-scale XAI benchmarks or foundation model explainability limits its relevance to current practice.

5. Strengths & Limitations

Key Strengths:

Clearly identifies genuine failure modes of existing metrics (large masks, multiple solutions)

Principled IB-based formulation connecting metric design to explanation generation

Model-agnostic adapter design is practical and lightweight

LAX consistently achieves positive MSI scores where baselines often score negative, suggesting it does produce more focused explanations

Key Limitations:

α_min sensitivity undermines metric objectivity; no theoretical guidance for its selection

No human evaluation to validate the claim of alignment with human intuition

Narrow experimental scope: few datasets, few baselines, no statistical tests

The mask penalty (Eq. 5) uses a hard threshold, creating a disconnect with the continuous heatmap — this is acknowledged implicitly but not well-justified

On CIFAR-10 (the most realistic dataset), MSI scores for LAX are near zero (0.007 for CNN, 0.102 for ViT), making it hard to judge absolute explanation quality

No ablation studies on key design choices (entropy vs. L1 regularization, mask resolution, λ, temperature)

The paper notes LAX has worse deletion scores on CIFAR-10 than Grad-CAM but dismisses this by invoking multiple valid solutions — this reasoning is somewhat circular given MSI is designed to favor LAX's behavior

Additional Observations

Reproducibility: Implementation details are provided in supplementary material, but no code is mentioned.

The connection between the MSI metric and the LAX training objective is indirect — LAX optimizes cross-entropy + entropy regularization, not MSI directly. The claim of "differentiable approximation of the metric" is somewhat overstated.

The paper would benefit from analysis on datasets with ground-truth annotations (e.g., ImageNet with bounding boxes) to validate that MSI correlates with known ground truth when available.
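The suggested check against ground-truth annotations could be sketched with two common localization measures: a pointing-game hit and the IoU between a thresholded saliency map and a bounding box. The code below is an illustrative assumption rather than anything from the paper; the box convention `(row0, col0, row1, col1)` with exclusive upper bounds, the threshold, and the example map are hypothetical.

```python
def pointing_game_hit(saliency, box):
    """True if the saliency maximum lies inside box = (row0, col0, row1, col1).

    Upper bounds are exclusive. Ties on the maximum resolve to the last position.
    """
    _, r, c = max((v, r, c) for r, row in enumerate(saliency) for c, v in enumerate(row))
    r0, c0, r1, c1 = box
    return r0 <= r < r1 and c0 <= c < c1

def mask_box_iou(saliency, box, threshold=0.5):
    """Intersection over union between the thresholded saliency map and the box."""
    r0, c0, r1, c1 = box
    inter = 0
    union = 0
    for r, row in enumerate(saliency):
        for c, v in enumerate(row):
            in_mask = v >= threshold
            in_box = r0 <= r < r1 and c0 <= c < c1
            inter += int(in_mask and in_box)
            union += int(in_mask or in_box)
    return inter / union if union else 0.0

# Made-up 4x4 saliency map with a bright 2x2 block in the centre.
saliency = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.8, 0.1],
    [0.1, 0.7, 0.6, 0.0],
    [0.0, 0.1, 0.1, 0.0],
]

print(pointing_game_hit(saliency, (1, 1, 3, 3)), mask_box_iou(saliency, (1, 1, 3, 3)))
print(pointing_game_hit(saliency, (2, 2, 4, 4)), mask_box_iou(saliency, (0, 0, 2, 2)))
```

Correlating a ground-truth-free metric such as MSI with measures of this kind across many images would give some indication of whether it tracks annotated evidence, with the caveat that a model need not rely on the annotated object.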

Rating: 4.5 / 10

Significance 5 · Rigor 4 · Novelty 5.5 · Clarity 6.5

Generated May 19, 2026

Comparison History (14)

vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

claude-opus-4.6 · 5/20/2026

PEEK addresses the timely and high-impact problem of improving LLM agents operating over long contexts, which is central to the rapidly growing field of AI agents. It introduces a novel concept (context maps as orientation caches), demonstrates strong empirical gains across multiple benchmarks and architectures including production systems (OpenAI Codex), and offers practical efficiency improvements (lower cost, fewer iterations). Paper 2 contributes a useful XAI metric and method, but operates in a more mature and narrower subfield with less transformative potential. PEEK's broader applicability to the booming LLM agent ecosystem gives it higher impact potential.

vs. Harnessing LLM Agents with Skill Programs

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: agentic LLMs are a rapidly expanding area, and a modular framework for executable, reusable skills can transfer across many tasks (web, math, coding) and deployment settings (inference-time guardrails, post-training supervision, self-improvement). The reported large empirical gains on strong baselines suggest practical relevance. Paper 1 addresses an important XAI evaluation gap and proposes a novel metric plus adapter, but XAI impact is often constrained by domain-specific validation needs and slower adoption compared to agent frameworks.

vs. Deployment-Relevant Alignment Cannot Be Inferred from Model-Level Evaluation Alone

claude-opus-4.6 · 5/19/2026

Paper 2 proposes a novel, actionable framework (metric + method) for evaluating and generating XAI explanations without ground truth, addressing a widely recognized gap in explainable AI. It offers concrete technical contributions—a differentiable metric and a trainable adapter module—with broad applicability across deep learning models. Paper 1 makes a valuable conceptual argument about alignment evaluation levels but is more of a position/audit paper with narrower methodological novelty. Paper 2's technical contributions are more likely to be adopted, cited, and built upon across multiple ML application domains.

vs. XDomainBench: Diagnosing Reasoning Collapse in High-Dimensional Scientific Knowledge Composition

gemini-3.1 · 5/19/2026

Paper 1 addresses a fundamental bottleneck in Explainable AI—the lack of ground truth for evaluation—by introducing a novel, differentiable metric and a universally applicable adapter for black-box models. This methodological innovation has broad implications across all deep learning applications requiring trustworthiness, offering high real-world utility. While Paper 2 provides a valuable diagnostic benchmark for LLMs, Paper 1's foundational contribution to generating and quantifying causal explanations presents a more universally transformative approach to responsible AI deployment.

vs. MathAtlas: A Benchmark for Autoformalization in the Wild

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to its large, community-reusable benchmark (52k items from 103 textbooks) and the added dependency graph enabling new evaluation axes. Benchmarks often catalyze broad progress across NLP, theorem proving, and AI-for-math, with clear real-world relevance to verified mathematics and formal methods. Its rigor is supported by scale, careful construction, and strong baseline experiments revealing clear gaps (low accuracy, dependency-depth degradation). Paper 1 is valuable for XAI evaluation and training, but its impact may be narrower and more sensitive to assumptions about perturbation-based causal metrics.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to its broad, timely synthesis of dynamic, intervention-aware clinical prediction—directly targeting major real-world deployment barriers (treatment confounding feedback, informative/irregular observation, identifiability). As a unifying framework/review bridging forecasting, counterfactual trajectories, and policy evaluation with concrete evaluation/validation guidance, it can influence multiple subfields (clinical ML, causal inference, time-series modeling, health policy) and shape standards for “decision-grade” evidence. Paper 1 is novel and useful for XAI evaluation, but is narrower in application scope and ecosystem-level influence.

vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses a fundamental challenge in Explainable AI (XAI) by providing a quantifiable evaluation metric and generation method without requiring ground truth. Its contributions offer broad applicability across numerous deep learning domains (e.g., healthcare, autonomous driving), leading to extensive scientific impact. In contrast, Paper 1, while highly rigorous and commercially valuable, is relatively constrained to the specific domain of e-commerce web agents.

vs. Property-Guided LLM Program Synthesis for Planning

gemini-3.1 · 5/19/2026

Paper 2 integrates Large Language Models with formal verification, addressing major bottlenecks (cost and reliability) in LLM-driven program synthesis. By using concrete counterexamples instead of simple scores, it drastically improves efficiency and correctness. Given the exploding interest in LLMs for code generation and automated planning, this methodology has higher potential for broad, immediate impact across AI and software engineering compared to Paper 1's XAI evaluation metric.

vs. CATO: Charted Attention for Neural PDE Operators

claude-opus-4.6 · 5/19/2026

CATO introduces a fundamentally novel architecture for neural PDE operators that addresses key limitations in handling complex geometries through learned chart spaces and derivative-aware physics losses, backed by theoretical approximation guarantees. The 26.76% improvement over baselines with 82% parameter reduction represents a significant advance. While Paper 1 contributes a useful XAI metric and fine-tuning method, the XAI evaluation space is crowded. CATO's contributions span computational physics, geometry processing, and deep learning theory, with broader potential impact on scientific computing applications.

vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical and fundamental bottleneck in modern AI—evaluating explainability without ground truth. By providing both a rigorous, quantifiable metric based on causal sufficiency/necessity and a novel adapter method, it offers broad theoretical and practical utility across all deep learning domains. Paper 2 presents a valuable applied framework for operations research, but Paper 1's contribution to foundational AI methodology gives it higher potential for widespread scientific impact and adoption.

vs. Learning Developmental Scaffoldings to Guide Self-Organisation

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a critical and widely-recognized challenge in XAI—evaluating explanation quality without ground truth—and proposes both a metric and a practical method applicable to any black-box model. This has immediate broad applicability across all domains using deep learning, aligning with growing regulatory demands for AI transparency. Paper 2, while intellectually interesting in studying developmental self-organisation with NCAs and information-theoretic analysis, addresses a more niche intersection of artificial life and computational biology with narrower immediate practical applications. Paper 1's combination of practical utility, broad applicability, and timeliness gives it higher potential impact.

vs. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in AI safety and interpretability by providing a quantifiable evaluation metric for XAI without requiring ground truth. Furthermore, it introduces a novel adapter method applicable to any black-box model. While Paper 2 provides a valuable LLM benchmark for a specific reasoning domain, Paper 1 offers a broader methodological breakthrough with universal applicability across deep learning, leading to higher potential scientific impact.

vs. Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

gemini-3.1 · 5/19/2026

Paper 2 offers a foundational contribution to Explainable AI (XAI), a critical bottleneck across all deep learning applications. By introducing a novel, ground-truth-free metric for XAI validation and a method to train explanation adapters for any black-box model, it provides high breadth of impact and strong real-world applicability in regulated domains. While Paper 1 presents a rigorous human-machine teaming study, its impact is confined to the narrower niche of interactive multi-agent reinforcement learning. Paper 2's generalizability and relevance to responsible AI give it a significantly higher potential for widespread scientific impact.

vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

gpt-5.2 · 5/19/2026

Paper 1 (OCCAM) is more novel by combining open-set concept discovery, text-guided segmentation, causal interventions, and dataset-level ontology induction, moving beyond per-image attributions to global, structured model understanding. This offers broader real-world utility (auditing, bias discovery, model debugging) and wider cross-field impact (vision, interpretability, causal analysis, knowledge/ontology learning). Paper 2 advances evaluation via a perturbation-based metric and a trainable explanation adapter, but metrics for XAI are a crowded area and the approach may be more incremental and narrower in scope.