Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
Xuehui Yu, Fucheng Cai, Meiyi Wang, Xiaopeng Fan, Harold Soh
Abstract
Inference-time guided sampling steers state-of-the-art diffusion and flow models without fine-tuning by interpreting the generation process as a controllable trajectory. This provides a simple and flexible way to inject external constraints (e.g., cost functions or pre-trained verifiers) for controlled generation. However, existing methods often fail when composing multiple constraints simultaneously, which leads to deviations from the true data manifold. In this work, we identify root causes of this off-manifold drift and find that the approximation error scales severely with gradient misalignment. Building on these findings, we propose Conflict-Aware Additive Guidance (CAR guidance), a lightweight and learnable method, which actively rectifies off-manifold drift by dynamically detecting and resolving gradient conflicts. We validate CAR guidance across diverse domains, ranging from synthetic datasets and image editing to generative decision-making for planning and control. Our results demonstrate that CAR guidance effectively rectifies off-manifold drift, surpassing baselines in generation fidelity while using light compute. Code is available at https://github.com/yuxuehui/CAR-guidance.
AI Impact Assessments
Scientific Impact Assessment: Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
1. Core Contribution
This paper tackles a well-defined and practically important problem: when steering pre-trained flow models at inference time using multiple reward functions simultaneously, approximate guidance methods suffer from off-manifold drift due to gradient conflicts between competing objectives. The authors make two main contributions:
Theoretical: They derive an upper bound on the approximation error of guided sampling (Theorem 4.2), decomposing it into three interpretable terms: coupling shift error, gradient misalignment error, and localized approximation error. The key insight is that error scales with G(G-1) and (1-cos φ), where G is the number of rewards and φ captures angular divergence between guidance gradients.
Methodological: They propose CAR guidance (g^car), which dynamically blends approximate guidance with a learned value-gradient correction, gated by a conflict-aware weight that activates only in regions of significant gradient misalignment. The learned component uses Terminal Value Regression (TVR), which avoids bootstrapping instability by regressing directly against terminal rewards — a clean insight enabled by the deterministic nature of flow ODEs.
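The gating idea described above can be illustrated with a minimal sketch. The review does not give the paper's actual formula, so everything here is an assumption made for illustration: the sigmoid gate, the use of the worst pairwise cosine as the conflict signal, the convex blend, and the names conflict_weight, blended_guidance, tau and sharpness. The learned correction is passed in as a plain vector rather than produced by a trained network.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def conflict_weight(grads, tau=0.0, sharpness=10.0):
    """Hypothetical gate: close to 1 when the worst pairwise cosine is below tau."""
    if len(grads) < 2:
        return 0.0  # a single reward cannot conflict with itself
    worst = min(
        cosine(grads[i], grads[j])
        for i in range(len(grads))
        for j in range(i + 1, len(grads))
    )
    return 1.0 / (1.0 + math.exp(-sharpness * (tau - worst)))

def blended_guidance(grads, learned_correction, tau=0.0):
    """Assumed convex blend of summed approximate guidance and a learned correction."""
    w = conflict_weight(grads, tau)
    approx = [sum(col) for col in zip(*grads)]  # naive additive composition
    return [(1.0 - w) * a + w * c for a, c in zip(approx, learned_correction)]

# Two opposing reward gradients cancel under naive summation, so the gate
# hands control to the (here hand-written) correction vector.
print(blended_guidance([[1.0, 0.0], [-1.0, 0.0]], [0.0, 1.0]))
```

When the per-reward gradients agree, the gate stays near zero and the output is essentially the summed approximate guidance, which matches the review's description of a correction that activates only in regions of significant misalignment.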
2. Methodological Rigor
Theoretical analysis is well-structured. The three-term decomposition provides clear intuition, and the connection between gradient misalignment and "energy traps" (spurious local minima from destructive gradient interference) is geometrically compelling. The energy dissipation framework (Definition B.2) provides a clean characterization of when and why approximate guidance fails. However, some assumptions (e.g., ∥g_j^CI∥ ≈ μ for all j) are strong and may not hold in practice.
Experimental evaluation is thorough and spans diverse domains: 2D synthetic benchmarks (where ground truth is available), Maze2D planning, ManiSkill2 robotic manipulation (3D point clouds), and CelebA-HQ image editing. This breadth is a strength. The synthetic experiments provide clear visualization of the failure mode (energy traps) and the correction mechanism. The inclusion of ablation studies on components (g_ψ vs. w_t), parameterization choices (scalar V vs. vector field), and the conflict threshold τ adds credibility.
Weaknesses in rigor: The comparison with GLASS-FKS may not be entirely fair, since it involves different base models and reward compositions, a difference the paper acknowledges. The paper reports means and standard deviations across 5 seeds, which is reasonable but not extensive. Some experimental settings (e.g., ManiSkill2 with only 100 demonstrations) are relatively small-scale.
3. Potential Impact
The paper addresses a genuine bottleneck in deploying generative models with heterogeneous runtime constraints. The practical impact is potentially significant across several domains.
The insight that gradient conflict detection can be cheaply computed and used to gate corrections is elegant and could influence how future guidance methods are designed. The TVR training procedure is also a clean contribution that others could adopt independently.
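To make the "cheaply computed" claim concrete, here is a small sketch of one possible conflict statistic: the sum of (1 - cos φ_ij) over all ordered pairs of reward gradients, which has G(G-1) terms for G rewards. This particular statistic is an assumption chosen to echo the scaling factors named in the review; it is not taken from the paper, and the name misalignment_score is hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def misalignment_score(grads):
    """Sum of (1 - cos phi_ij) over ordered pairs i != j, i.e. G(G-1) terms."""
    total = 0.0
    for i, gi in enumerate(grads):
        for j, gj in enumerate(grads):
            if i != j:
                total += 1.0 - cosine(gi, gj)
    return total

# Parallel gradients give no conflict; opposing gradients give the maximum
# contribution of 2 per ordered pair.
print(misalignment_score([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]))
print(misalignment_score([[1.0, 0.0], [-1.0, 0.0]]))
```

The cost is a handful of dot products per pair, so for G rewards in d dimensions the statistic takes on the order of G²·d arithmetic operations and needs no extra model evaluations beyond the gradients already computed for guidance.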
4. Timeliness & Relevance
This work is highly timely. Flow matching and rectified flows are rapidly becoming the dominant paradigm for generative modeling, and inference-time steering is increasingly important as models scale. The compositional constraint setting is practically motivated — real-world applications rarely involve a single objective. The paper sits at the intersection of generative modeling, optimal transport theory, and multi-objective optimization, connecting these communities.
The positioning between approximate and exact guidance methods (Figure 7) is clear and fills a genuine gap. The connection to the "deadly triad" in RL and the proposed TVR solution demonstrates thoughtful cross-pollination of ideas.
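The idea behind Terminal Value Regression, as the review describes it, is that a deterministic flow ODE maps each intermediate state to exactly one endpoint, so a value function can be regressed directly on the reward of that endpoint instead of on a bootstrapped estimate. The sketch below illustrates only that idea under toy assumptions: the velocity field, the reward, the Euler integrator and the one-parameter quadratic value model are all hypothetical stand-ins, not the paper's setup.

```python
def velocity(x, t):
    """Toy stand-in for a pre-trained flow's velocity field (assumption)."""
    return -x

def reward(x):
    """Toy terminal reward (assumption): prefer endpoints near zero."""
    return -x * x

def terminal_value_target(x, t, steps=50):
    """Integrate the deterministic ODE from time t to 1 with Euler steps,
    then score the endpoint. No bootstrapped value estimate is involved."""
    dt = (1.0 - t) / steps
    for k in range(steps):
        x = x + dt * velocity(x, t + k * dt)
    return reward(x)

def fit_quadratic_value(states, t):
    """Closed-form least-squares fit of V(x) = a * x^2 to terminal targets."""
    feats = [x * x for x in states]
    targets = [terminal_value_target(x, t) for x in states]
    return sum(f * y for f, y in zip(feats, targets)) / sum(f * f for f in feats)

a = fit_quadratic_value([-2.0, -1.0, 0.5, 1.5], t=0.5)
# The fitted value at an unseen state agrees with its rolled-out terminal reward.
print(a * 3.0 ** 2, terminal_value_target(3.0, 0.5))
```

Because each regression target comes from a full rollout rather than from the value model's own prediction at a later step, the targets do not move as the model is trained, which is the property the review credits with avoiding bootstrapping instability.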
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Minor observations: The paper is well-written with effective visualizations. The "energy trap" concept is intuitive and well-illustrated. The positioning as "between approximate and exact guidance" is honest and accurate.
Generated May 21, 2026
Comparison History (27)
Paper 1 addresses a timely and broadly impactful question about how AI usage affects human skill development, with implications spanning education, workforce training, and AI policy. Its findings—that heavy AI use can substitute for rather than complement learning—have immediate real-world relevance as AI tools become ubiquitous. Paper 2 makes a solid technical contribution to guided sampling in flow/diffusion models, but its impact is more narrowly scoped to the generative modeling community. Paper 1's breadth of societal impact and timeliness give it higher potential scientific impact.
Paper 1 addresses a fundamental and broadly applicable problem in guided generation across diffusion/flow models, proposing a principled solution (conflict-aware gradient correction) validated across diverse domains (images, planning, control). Its practical utility is high given the widespread adoption of diffusion models. Paper 2 introduces a valuable evaluation metric for VLM explainability, but its impact is more niche—focused on XAI evaluation methodology rather than enabling new capabilities. Paper 1's cross-domain applicability, actionable method, and alignment with the rapidly growing generative modeling field give it broader potential impact.
Paper 1 likely has higher scientific impact due to broader methodological relevance: it addresses a fundamental, widely encountered failure mode in diffusion/flow guidance under compositional constraints, offers a principled analysis (gradient misalignment → off-manifold drift), and proposes a lightweight, learnable correction applicable across multiple domains (images, synthetic, planning/control). This gives wide cross-field applicability and timeliness for controllable generative modeling. Paper 2 is impactful for EDA agents, but its scope is more domain-specific and depends on verifier setups, potentially limiting breadth despite strong practical relevance.
Paper 2 exposes a fundamental flaw in current VLM explainability evaluation and introduces a theoretically grounded, scalable metric to resolve it. Because reliable benchmarks and evaluation frameworks often redirect community efforts and are essential for high-stakes AI safety, Paper 2 is likely to have a broader foundational impact across the rapidly growing field of multimodal models compared to the algorithmic improvements in generative guidance proposed in Paper 1.
Paper 2 likely has higher scientific impact: it addresses a broadly encountered limitation of inference-time guidance (composing multiple rewards) in diffusion/flow models, provides a principled diagnosis (gradient misalignment driving off-manifold drift), and proposes a general, lightweight, learnable fix validated across multiple domains (images and decision-making/control). This combination of theoretical insight + wide applicability can influence controlled generation, alignment, and planning. Paper 1 is novel and practical for LLM agents, but is currently demonstrated mainly on deterministic agent benchmarks and depends on environment-specific harness engineering, making impact potentially narrower.
Paper 1 addresses a fundamental problem in compositional guided generation across diffusion/flow models with rigorous theoretical analysis (identifying gradient misalignment as root cause of off-manifold drift) and broad empirical validation across diverse domains (synthetic, image editing, planning/control). The theoretical insights about conflict-aware gradient composition are novel and broadly applicable. Paper 2 presents a practical engineering contribution for modular LLM specialization, but is more incremental—combining existing ideas (delta modules, compression) in a system-level framework. Paper 1's methodological depth and cross-domain generality suggest broader scientific impact.
Paper 2 addresses a fundamental challenge in guided generation across diffusion/flow models—composing multiple constraints without off-manifold drift—which is broadly applicable across images, planning, and other generative domains. Its theoretical insight (gradient misalignment causing approximation error) and lightweight, learnable solution (conflict-aware guidance) have wider cross-domain impact potential. Paper 1, while rigorous and practically valuable for autonomous driving testing, addresses a more domain-specific problem. Paper 2's generality across generative modeling paradigms gives it broader scientific influence.
Paper 2 is likely to have higher scientific impact because it introduces a principled, general-purpose evaluation framework (a family of proper scoring-rule metrics) for uncertainty-augmented systems, directly targeting high-stakes decision-making. Such metrics can be adopted broadly across ML subfields (classification, generation, selective prediction, calibration, human-AI decision support) and influence standard benchmarks and reporting practices. Its theoretical grounding (proper scoring rules) supports methodological rigor and long-term relevance. Paper 1 is a strong, timely contribution to guided generative modeling, but is more specialized and may have narrower cross-domain uptake.
Paper 2 likely has higher scientific impact: it proposes a concrete, generally applicable algorithmic improvement for inference-time control of diffusion/flow models, a highly active area with broad downstream use (image editing, planning/control, constrained generation). It identifies a clear failure mode (off-manifold drift under compositional rewards), analyzes a root cause (gradient misalignment), and introduces a lightweight, learnable fix validated across multiple domains with released code—supporting rigor, reproducibility, and adoption. Paper 1 is novel for negotiation research but is narrower in scope and its impact depends on uptake in social-science experimentation and agent design.
Paper 2 likely has higher impact due to broad relevance to LLM evaluation/training, a timely and widely needed capability (planning under constraints), and a reusable benchmark/data-generation framework with verification and difficulty control that can standardize comparisons across models. Its outputs can catalyze follow-on work in benchmarking, RL training, and agentic systems. Paper 1 is a solid methodological contribution to guided diffusion/flow sampling under compositional rewards, but is narrower in audience and application scope, and may compete with many closely related guidance/constraint-composition methods.
Paper 1 addresses a fundamental bottleneck in state-of-the-art diffusion and flow models—off-manifold drift during compositional generation. Its proposed solution has broad, highly relevant applications across vision, planning, and control, fields that are currently driving significant AI advancements. In contrast, Paper 2 applies relatively standard metric learning techniques (triplet loss, hard negative mining) to the much narrower domain of Horn logic reasoning. Consequently, Paper 1 exhibits greater timeliness, broader applicability, and higher potential for cross-disciplinary impact.
Paper 2 likely has higher scientific impact: a well-designed benchmark can rapidly shape an entire field by standardizing evaluation, exposing failure modes, and driving new methods. DeepWeb-Bench targets a timely, high-stakes capability (web-based deep research), offers auditable provenance and error taxonomies, and is broadly applicable across LLM agents, retrieval, reasoning, and safety/calibration research. Paper 1 is methodologically innovative and useful for controlled generative modeling, but its impact is narrower (guided diffusion/flow sampling under compositional rewards) and may compete within a crowded guidance-method space.
Paper 1 offers a technically novel, generalizable contribution to guided sampling in diffusion/flow models: diagnosing off-manifold drift under compositional rewards and proposing a conflict-aware mechanism to resolve gradient misalignment. It is broadly applicable across generative modeling, controllable generation, and planning/control, with likely reuse by many ML subfields and clear timeliness given rapid adoption of inference-time guidance. Paper 2 targets an important applied domain (autonomous networking) but reads more like a systems/architecture proposal with validation in a specific 5G Core setup, which may limit methodological generality and cross-field scientific uptake.
Paper 2 addresses a fundamental problem in guided sampling for diffusion/flow models—composing multiple constraints while staying on the data manifold. It provides theoretical insight (gradient misalignment causing off-manifold drift) and a principled, lightweight solution (CAR guidance) validated across diverse domains (images, planning, control). This has broad applicability across generative AI. Paper 1, while addressing an important practical problem in multi-agent workflow design with strong genomics applications, is more systems-oriented and narrower in its theoretical contribution. Paper 2's foundational insight into gradient conflicts in compositional guidance is likely to influence a wider range of future work.
Paper 1 presents a concrete, novel method (CAR-guidance) with rigorous theoretical analysis of off-manifold drift in compositional guided generation, validated across multiple domains with code available. It addresses a specific, well-defined technical problem with a practical solution. Paper 2 is a visionary/position paper on AI-native 6G networks that, while timely, lacks concrete methodological contributions and experimental validation. Vision papers can be influential but typically have less immediate scientific impact than papers introducing validated new methods solving identified problems.
While Paper 1 presents a strong, rigorous technical solution to a specific problem in generative modeling (off-manifold drift), Paper 2 addresses a fundamental and highly timely crisis in AI research: the saturation and limitations of standard benchmarks. By introducing 'open-world evaluations' and demonstrating them on real-world, long-horizon tasks, Paper 2 has the potential for significantly broader impact. It is likely to influence how frontier AI capabilities are measured across the entire field, informing not only technical AI development but also AI safety, policy, and deployment strategies.
Paper 2 addresses a fundamental limitation (exploration collapse) in RLVR for LLM reasoning, proposing a cooperative optimization paradigm (GCPO) that shifts from competitive to team-based credit assignment. Given the massive current interest in LLM reasoning improvement (post-DeepSeek R1/GRPO), this work is exceptionally timely and broadly applicable. Paper 1 addresses compositional guidance for flow models, which is valuable but more niche. Paper 2's cooperative framework with determinantal volume-based diversity has broader theoretical novelty and wider potential adoption across the LLM reasoning community.
Paper 2 proposes a methodological advancement for diffusion and flow models, addressing a fundamental issue (off-manifold drift) in multi-constraint generation. Its application spans diverse, high-impact domains like image editing and generative decision-making. In contrast, Paper 1 introduces a specialized simulation environment for Mahjong. While valuable for RL research, its scope and breadth of impact are narrower compared to the broad utility of improving foundational generative models in Paper 2.
Paper 1 addresses a fundamental limitation in controlling diffusion and flow models (off-manifold drift with compositional constraints), offering a methodological innovation with broad applicability across computer vision, planning, and control. In contrast, Paper 2 presents a highly optimized simulator for a specific game (Mahjong). While valuable for reinforcement learning research in imperfect-information settings, Paper 1 has significantly broader potential impact, higher methodological novelty, and aligns closely with the rapid advancements in controllable generative AI.
Paper 2 is likely to have higher scientific impact due to broader applicability and timeliness: conflict-aware guidance for diffusion/flow models under compositional rewards addresses a widely encountered failure mode in controllable generative modeling and applies across images and decision-making/planning. The methodological contribution (analysis of off-manifold drift via gradient misalignment + a learnable corrective mechanism) is general and can plug into many systems at inference time, enabling immediate real-world use without retraining. Paper 1 is novel but narrower (LLM ToM benchmarks/data synthesis) and its impact is more confined to social-reasoning evaluation/training.