Assessing Sample Quality in Conditional Generation under Compositional Shift

Berker Demirel, Valentino Maiorca, Marco Fumero, Theofanis Karaletsos, Francesco Locatello

Jun 8, 2026 · arXiv:2606.09601v1

cs.LG

#1524 of 5669 · cs.LG

Tournament Score: 1451 ± 45

Win Rate: 60%

Rating: 7.5 / 10

Significance 7.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Abstract

Conditional generators provide a natural tool for controllable generation, including settings where the desired condition is a new composition of observed attributes or experimental factors. In many applications, especially in scientific domains, such models are attractive to explore conditions for which real samples are rare, expensive, or not yet observed. However, this creates a circularity for evaluation: standard conditional quality metrics require a reference target distribution, but in the extrapolative regime that distribution is unavailable by definition. We address this problem with a post-hoc, per-sample trust score for assessing conditional samples using only the training distribution. The score combines two estimable quantities: global realism, measuring compatibility with the real data manifold, and attribute-wise faithfulness, measuring whether a sample is closer to the requested attributes than to plausible alternatives. We show that the score can recover meaningful comparisons across extrapolated generations, under a mild coverage condition on the observed attributes. These comparisons enable effective filtering, ranking, and abstention of generations and can be used directly on off-the-shelf pretrained models. In biological imaging, selected samples preserve real morphological structure better and improve downstream predictive performance, while similar gains are observed on controlled vision benchmarks. Finally, we show how the score can be applied during generation, enabling abstention before full decoding. Code is available at https://github.com/berkerdemirel/faithful-cond-gen.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper tackles a genuine circularity problem in conditional generation evaluation: standard metrics (FID, KID) require reference samples from the target distribution, but conditional generators are most valuable precisely when such references don't exist (e.g., novel drug combinations, unseen attribute compositions). The authors propose a per-sample trust score that decomposes into two components: (1) global realism (Mahalanobis distance to the real data manifold in feature space), and (2) attribute-wise faithfulness (whether each requested attribute value is favored over alternatives via shared-covariance Mahalanobis margins). The key conceptual insight is the attribute-level decomposition: rather than evaluating the full unseen joint condition, each constituent attribute is checked independently against observed real-data contexts. This sidesteps the fundamental non-identifiability of the missing joint distribution while still providing actionable quality assessments.
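The two components described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the ridge regularizer, the pooled within-class covariance estimate, and the sign conventions are all assumptions made for illustration, and the way the paper combines the two quantities into a single score is not specified in this assessment, so they are returned separately. The sketch handles a single attribute and assumes features of dimension at least 2.

```python
import numpy as np

def fit_reference(features, labels, ridge=1e-3):
    """Fit reference statistics on REAL data for one attribute (hypothetical helper).

    features: (n, d) array of real-data features from some extractor.
    labels:   (n,) array with the observed value of the attribute per sample.
    """
    d = features.shape[1]
    mu = features.mean(axis=0)
    cov = np.cov(features, rowvar=False) + ridge * np.eye(d)
    values = np.unique(labels)
    # One prototype (mean feature) per observed attribute value.
    protos = np.stack([features[labels == v].mean(axis=0) for v in values])
    # Shared covariance: pooled within-value scatter (an assumed estimator).
    centered = features - protos[np.searchsorted(values, labels)]
    shared = centered.T @ centered / len(features) + ridge * np.eye(d)
    return {"mu": mu, "prec": np.linalg.inv(cov), "values": values,
            "protos": protos, "shared_prec": np.linalg.inv(shared)}

def sq_mahalanobis(x, center, prec):
    """Squared Mahalanobis distance; broadcasts over leading dimensions."""
    diff = x - center
    return np.einsum("...i,ij,...j->...", diff, prec, diff)

def realism(x, ref):
    """Global realism: higher means closer to the real-data Gaussian."""
    return -sq_mahalanobis(x, ref["mu"], ref["prec"])

def faithfulness_margin(x, requested, ref):
    """Margin > 0 when the requested value's prototype is the closest one."""
    dists = sq_mahalanobis(x[None, :], ref["protos"], ref["shared_prec"])
    idx = int(np.searchsorted(ref["values"], requested))
    best_alternative = np.delete(dists, idx).min()
    return best_alternative - dists[idx]
```

For a condition with several attributes, one would fit one reference per attribute and inspect each margin separately, which mirrors the attribute-level decomposition described in the assessment.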

Methodological Rigor

The theoretical framework is carefully constructed. Proposition 1 establishes the negative result (full conditional fidelity is non-identifiable without the target), motivating the weaker but achievable goal. Definition 1 (reference coverage) provides a minimal support condition for unconfounded attribute-level comparisons, and Proposition 3 proves this minimality. Proposition 2 establishes point identification of the reference-anchored comparator under this condition.

The gap between theory and practice is handled transparently: the implemented score uses pooled prototypes (more sample-efficient) rather than reference-anchored ones. The authors provide sufficient conditions for agreement (Appendix B.4-B.5), a perturbation bound (Proposition 6), and extensive empirical verification of pooled/reference agreement (Table 4). The diagnostic for assumption violations (Table 5) is particularly commendable—it reveals exactly where the theory breaks down (RxRx1 siRNA attribute) and correctly predicts degraded empirical performance.

The experimental validation spans controlled (CelebA with designed compositional shift) and realistic scientific (RxRx1 biological imaging) settings. The use of CellProfiler morphology features as an independent validation modality—completely separate from the DINOv3 space used for scoring—strengthens the biological relevance claims. The real-spread-normalized centroid distance metric (Table 2) is well-designed, accounting for within-condition natural variation.

Potential Impact

Direct applications: The score enables practical deployment of conditional generators in scientific settings where wet-lab validation is expensive. The ability to rank, filter, and abstain from generated samples before experimental validation could meaningfully reduce costs in drug discovery, cellular biology, and materials science.
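As a rough illustration of how ranking, filtering, and abstention could be driven by per-sample scores, consider the sketch below. The function name `select_samples`, the `threshold` and `min_keep` parameters, and the abstention rule are hypothetical choices, not taken from the paper; the scores are assumed to be precomputed, with higher meaning more trustworthy.

```python
import numpy as np

def select_samples(scores, threshold, min_keep=1):
    """Rank samples by score, keep those at or above the threshold,
    and abstain when fewer than `min_keep` samples survive."""
    scores = np.asarray(scores, dtype=float)
    # Stable sort so that ties keep their original order.
    order = np.argsort(-scores, kind="stable")
    kept = [int(i) for i in order if scores[i] >= threshold]
    return {
        "ranking": order.tolist(),   # sample indices, best first
        "kept": kept,                # indices passing the filter
        "abstain": len(kept) < min_keep,
    }
```

With a low acceptance rate, such as the 4-6% reported for RxRx1 under support shift, a caller would likely need to generate many candidates per condition before `kept` is non-empty.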

Methodological contributions: The decomposition into realism and faithfulness captures genuinely different failure modes and could become a standard evaluation framework. The "during-generation" scoring via translator networks is computationally attractive, enabling early rejection before full decoding (saving up to 67% of compute while retaining 85% of filtering quality on CelebA).

Broader influence: The identifiability analysis (what can and cannot be assessed without target samples) provides conceptual clarity that should influence how the community thinks about evaluation under distribution shift. The negative result (Proposition 1) is as important as the positive ones.

Timeliness & Relevance

This work arrives at a critical juncture: conditional diffusion models are being rapidly adopted in scientific domains (drug design, cell biology, materials), but evaluation methodology has not kept pace. The gap between "can generate for unseen conditions" and "should trust those generations" is widely recognized but poorly addressed. The paper directly fills this gap with a practical, model-agnostic solution.

Strengths

1. Clean problem formulation: The circularity of evaluation under compositional shift is crisply stated and the proposed resolution is principled.

2. Theory-practice alignment: The identifiability results directly motivate design choices, with honest acknowledgment of when assumptions fail.

3. Multi-modal validation: CellProfiler biological features provide genuinely independent validation, not just different learned representations.

4. Practical deployment: Post-hoc, model-agnostic design works on off-the-shelf pretrained generators; during-generation extension enables compute savings.

5. Comprehensive ablations: The scorer comparison (Table 14) demonstrates that alternatives (linear probes, kNN, CLIP) produce degenerate thresholds, justifying the Mahalanobis approach.

6. Negative results reported honestly: OpenPhenom failure (Section G), REPA geometry failure for trust scoring on RxRx1 (Table 3), and theory-practice gaps for siRNA attributes are all transparent.

Limitations

1. Discrete attributes only: The theory assumes finite attribute-value spaces. Continuous conditioning (dose-response curves, continuous molecular descriptors) is explicitly left to future work—a significant limitation for many scientific applications.

2. Feature extractor dependence: The score quality depends heavily on the choice of Φ. The OpenPhenom failure demonstrates this isn't trivial, and there's no principled way to select Φ a priori.

3. Low acceptance rates on RxRx1: Only 4-6% of generated samples pass the trust threshold under support shift, raising questions about practical utility when generation quality is poor.

4. Reference coverage requirement: While minimal, this condition may not hold in some realistic scientific settings where attributes only appear in specific combinations (e.g., certain cell types only treated with certain drugs).

5. Gaussian assumptions: The Mahalanobis framework assumes ellipsoidal feature geometry; multi-modal or heavy-tailed distributions in feature space could degrade performance.

6. No guidance integration: The score is used only for filtering/abstention, not to steer generation toward higher-trust regions—a natural extension acknowledged but not pursued.

Additional Observations

The graded compositional shift analysis (Figure 7) is particularly insightful—trust scores capture not just binary seen/unseen but the severity of compositional shift. The code availability enhances reproducibility. The work would benefit from evaluation on text-conditioned generation where compositional shift is also prevalent.

Rating: 7.5 / 10

Significance 7.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated Jun 9, 2026

Comparison History (15)

Lost vs. Optimal Post-Training Quantization Scales and Where to Find Them

Paper 2 is likely to have higher scientific impact due to strong timeliness and broad applicability: improved post-training quantization directly affects deployment cost and accessibility of frontier LLMs across many domains. Its core contribution (PiSO) offers an exact, efficient optimization method for quantization scales with clear methodological rigor and measurable gains on widely used models/benchmarks, and should be easy to adopt in existing PTQ pipelines. Paper 1 addresses an important evaluation gap for compositional shift and has strong relevance in scientific imaging, but its impact may be narrower and more dependent on assumptions about attribute coverage and trust-score validity.

gpt-5.2 · Jun 10, 2026

Won vs. When Do Autoregressive Sequence Models Forecast Physical Wavefields? A Controlled Study on Synthetic Seismograms

Paper 2 addresses a fundamental and widespread challenge in generative AI: evaluating the quality of out-of-distribution or extrapolated conditional generations without a ground-truth reference. By proposing a metric that relies only on the training distribution, it unlocks robust evaluation and filtering for scientific discovery applications where novel compositions are generated. Paper 1, while methodologically rigorous, focuses on a much narrower domain (autoregressive forecasting of oscillatory wavefields like seismograms), making Paper 2's potential breadth of impact across machine learning, biology, and other scientific domains significantly higher.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Paper 2 is likely to have higher broad scientific impact: it tackles a widely encountered and under-solved problem—evaluating conditional generation under compositional shift without access to target distributions—relevant across scientific ML, vision, and generative modeling. The proposed trust score is model-agnostic, post-hoc, and directly usable with pretrained generators, enabling practical filtering/abstention in real deployments (e.g., biological imaging). Paper 1 is novel and timely for offline RL with flow policies, but its impact is narrower to RL/control pipelines and depends more on specific policy/value-function assumptions.

gpt-5.2 · Jun 10, 2026

Won vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Paper 1 addresses a fundamental bottleneck in AI for scientific discovery: evaluating out-of-distribution generated samples when ground truth is unavailable. By providing a trust score for compositional shifts, it directly enables more reliable AI-driven exploration in fields like biological imaging. While Paper 2 offers a strong algorithmic improvement for aligning flow models via RL, Paper 1 has broader multidisciplinary scientific applicability, higher novelty in its problem formulation, and greater potential to accelerate concrete scientific discoveries.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Safe-RULE: Safe Reinforcement UnLEarning

Paper 2 likely has higher impact due to broader applicability: a general, post-hoc per-sample trust score for conditional generation under compositional shift applies across many generative modeling domains (science, vision, biology) and to off-the-shelf pretrained models. It addresses a timely evaluation bottleneck in extrapolative conditional generation where reference distributions are unavailable, enabling filtering/ranking/abstention and even early abstention during decoding. Paper 1 is novel and important for offline Safe RL robustness, but its scope is narrower (safe RL + poisoning defense) and may affect a smaller set of practitioners.

gpt-5.2 · Jun 9, 2026

Lost vs. LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

Paper 2 likely has higher impact due to broader applicability and timeliness: monitoring and diagnosing drift in online task-free continual learning is a central, practical problem for deployed agents, and leveraging foundation models for zero-shot detection/semantic diagnosis can generalize across domains. It potentially influences multiple subfields (continual learning, drift detection, MLOps, multimodal/foundation-model tooling) and enables real-world systems to adapt more safely and effectively. Paper 1 is novel and useful for evaluating extrapolative conditional generation, especially in scientific imaging, but its scope is narrower and more evaluation-specific.

gpt-5.2 · Jun 9, 2026

Lost vs. Momentum Streams for Optimizer-Inspired Transformers

Paper 1 proposes a fundamental architectural enhancement to Transformers, the backbone of modern AI, by embedding optimizer-inspired momentum into the layers. This offers broad, transformative potential for improving pretraining efficiency and generalization across all domains using foundation models. Paper 2 is highly valuable for generative AI in scientific discovery, but its scope (evaluation metrics for compositional shift) is narrower compared to the ubiquitous impact of core Transformer improvements.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Reactive Flux Matching: Mechanism Discovery and Adaptive Sampling of Rare Events

Paper 2 addresses a critical bottleneck in the booming field of generative AI for scientific discovery: evaluating out-of-distribution conditional generations without ground truth data. By proposing a generalizable trust score for sample quality, it offers broad applicability across diverse domains like biological imaging and computer vision. Paper 1, while highly innovative in molecular dynamics and rare event sampling, is much more domain-specific. The immense breadth of impact, timeliness, and cross-disciplinary relevance of Paper 2's methodology give it a significantly higher potential for widespread scientific impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Lethe: Principled Dual-Stream Update for Persistent Knowledge Erasure in Federated Unlearning

Paper 1 addresses a fundamental evaluation gap in conditional generation under compositional shift—a broadly applicable problem across scientific domains (biology, chemistry, materials science). Its trust score framework is model-agnostic, works post-hoc on pretrained models, and enables practical filtering/abstention. This has wide applicability as generative models are increasingly used for scientific discovery. Paper 2 tackles an important but narrower problem (knowledge resurfacing in federated unlearning) with a solid contribution, but federated unlearning remains a more specialized subfield with fewer immediate real-world deployments compared to the rapidly growing conditional generation ecosystem.

claude-opus-4-6 · Jun 9, 2026

Lost vs. On the Robustness of Langevin Dynamics to Score Function Error

Paper 2 addresses a fundamental theoretical question about the robustness of Langevin dynamics versus diffusion models for score-based generative modeling. It provides a clear negative result showing Langevin dynamics fails with arbitrarily small L^2 score estimation errors in high dimensions, while diffusion models succeed. This has broad theoretical and practical implications for the generative modeling community, offering rigorous justification for architectural choices used widely in practice. Paper 1 addresses a more niche evaluation problem (conditional generation under compositional shift) with practical but narrower impact. Paper 2's foundational insight is likely to influence theory and practice more broadly.

claude-opus-4-6 · Jun 9, 2026
