Tuc Nguyen, Thai Le
Activation steering provides a lightweight inference-time mechanism for controlling large language models (LLMs) by modifying their internal activation vectors toward desired behaviors. Most existing methods compute a fixed steering direction in the original activation space, typically from pairs of contrastive examples using mean differences, linear probes, or arbitrary separability criteria. While effective to a certain extent, these methods treat behavioral control as a global, linear, additive offset: the same direction is applied across inputs, and behaviors are assumed to be linearly separable. This can be restrictive when behavioral features vary nonlinearly across the activation space or lie on curved and anisotropic manifolds, where the optimal intervention may be input-dependent. To address this limitation, we propose INNSteer, a nonlinear activation steering framework based on invertible latent transformations. Rather than searching for a better steering vector in the original representation space, INNSteer learns a lightweight invertible neural network that maps an LLM's activations into a latent space where behavioral classes are more amenable to linear control. At inference time, activations are mapped through the learned transformation, steered in the latent space, and mapped back through its exact inverse. This turns a simple latent-space translation into a nonlinear, input-dependent intervention in the original activation space. Across experimental settings covering multiple LLM families, scales, behavioral traits, and safety benchmarks, INNSteer consistently improves model control over linear, transport-based, and nonlinear steering baselines while largely preserving generation fluency.
INNSteer introduces a nonlinear activation steering framework that learns an invertible neural network (INN) mapping LLM activations into a latent space where behavioral classes are more linearly separable. The key insight is elegant: rather than finding a better steering vector in the original activation space, learn a coordinate transformation where simple mean-difference steering becomes effective, then map back through the exact inverse. This converts a constant latent-space translation into a nonlinear, input-dependent perturbation in the original activation space, where the effective direction is modulated by the local inverse Jacobian of the learned map.
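The map, shift, and map-back loop described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the learned INN is replaced by a toy elementwise invertible map (`arcsinh` with inverse `sinh`), and the names `forward`, `inverse`, `latent_mean_difference`, and `steer` are hypothetical. The point is only that a constant latent shift yields a different activation-space perturbation for different inputs.

```python
import numpy as np

def forward(h):
    # Toy invertible map (assumption): stands in for the learned INN.
    return np.arcsinh(h)

def inverse(z):
    # Exact inverse of the toy map above.
    return np.sinh(z)

def latent_mean_difference(h_pos, h_neg):
    # Mean-difference steering vector, computed in the latent space.
    return forward(h_pos).mean(axis=0) - forward(h_neg).mean(axis=0)

def steer(h, v, alpha):
    # Map to latent space, translate by alpha * v, map back exactly.
    return inverse(forward(h) + alpha * v)

rng = np.random.default_rng(0)
h_pos = rng.normal(loc=1.0, size=(64, 4))   # synthetic "positive" activations
h_neg = rng.normal(loc=-1.0, size=(64, 4))  # synthetic "negative" activations
v = latent_mean_difference(h_pos, h_neg)

h_a = np.full(4, 0.1)
h_b = np.full(4, 2.0)
delta_a = steer(h_a, v, alpha=0.5) - h_a
delta_b = steer(h_b, v, alpha=0.5) - h_b
print("perturbation at h_a:", delta_a)
print("perturbation at h_b:", delta_b)
```

The same latent vector `v` and strength `alpha` are used for both inputs, yet `delta_a` and `delta_b` differ because the inverse map stretches the shift differently at each point.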
The paper addresses a genuine limitation of existing activation steering methods — the assumption that behavioral changes can be captured by a single global linear direction. When behavioral representations lie on curved manifolds, a fixed steering vector will be appropriate for some inputs but misaligned for others. INNSteer's formulation provides a principled remedy via the first-order expansion: Δ_φ(h̃) ≈ αJ_φ(h̃)^{-1}v_{φ,ℓ}, making the intervention inherently input-dependent.
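The first-order expansion quoted above can be recovered from a Taylor expansion of the inverse map together with the inverse function theorem. The derivation below is a sketch under assumed notation: f_φ denotes the learned invertible map and z its latent output (neither symbol appears in the review text), and f_φ^{-1} is assumed to be twice continuously differentiable near z.

```latex
\begin{aligned}
z &= f_\phi(\tilde h), \qquad
\tilde h' = f_\phi^{-1}\!\left(z + \alpha\, v_{\phi,\ell}\right), \\
\Delta_\phi(\tilde h) &= \tilde h' - \tilde h
  = f_\phi^{-1}\!\left(z + \alpha\, v_{\phi,\ell}\right) - f_\phi^{-1}(z) \\
&= \alpha\, J_{f_\phi^{-1}}(z)\, v_{\phi,\ell} + O(\alpha^2) \\
&= \alpha\, J_\phi(\tilde h)^{-1}\, v_{\phi,\ell} + O(\alpha^2),
\end{aligned}
```

The last step uses J_{f_φ^{-1}}(f_φ(h̃)) = J_φ(h̃)^{-1}. Because J_φ(h̃) depends on h̃, the effective direction changes from input to input even though v_{φ,ℓ} is fixed.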
Architecture and Training: The use of RealNVP-style affine coupling layers is well-motivated — they provide exact invertibility, efficient log-determinant computation, and sufficient nonlinearity. The three-term training objective (Gaussian likelihood, directional separation, log-determinant regularization) is thoughtfully designed. The log-determinant regularization addressing unstable inverse mappings after latent shifts is a particularly careful design choice.
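To make the three properties named above concrete, here is a minimal NumPy sketch of a single RealNVP-style affine coupling layer. It is illustrative only: the class name `AffineCoupling`, the tiny random conditioner, and all sizes are assumptions, and the weights are random rather than trained with the paper's objective.

```python
import numpy as np

class AffineCoupling:
    """One coupling layer: the first half passes through unchanged,
    the second half is scaled and shifted conditioned on the first."""

    def __init__(self, dim, hidden, rng):
        self.d = dim // 2
        out = dim - self.d
        # Small conditioner network; random here, learned in practice.
        self.W1 = rng.normal(scale=0.5, size=(self.d, hidden))
        self.Ws = rng.normal(scale=0.5, size=(hidden, out))
        self.Wt = rng.normal(scale=0.5, size=(hidden, out))

    def _scale_shift(self, h1):
        u = np.tanh(h1 @ self.W1)
        s = np.tanh(u @ self.Ws)  # bounded log-scale, so exp(s) stays in (1/e, e)
        t = u @ self.Wt
        return s, t

    def forward(self, h):
        h1, h2 = h[..., :self.d], h[..., self.d:]
        s, t = self._scale_shift(h1)
        z = np.concatenate([h1, h2 * np.exp(s) + t], axis=-1)
        # The Jacobian is triangular, so log|det J| is just the sum of log-scales.
        log_det = s.sum(axis=-1)
        return z, log_det

    def inverse(self, z):
        z1, z2 = z[..., :self.d], z[..., self.d:]
        s, t = self._scale_shift(z1)  # z1 equals h1, so s and t are recovered exactly
        return np.concatenate([z1, (z2 - t) * np.exp(-s)], axis=-1)

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=6, hidden=8, rng=rng)
h = rng.normal(size=(5, 6))
z, log_det = layer.forward(h)
h_rec = layer.inverse(z)
print("max reconstruction error:", np.abs(h_rec - h).max())
```

A single layer leaves half of the coordinates untouched, so practical flows stack several layers and alternate which half is transformed. Keeping the log-scale bounded is one simple way to keep the inverse well conditioned; it is a stand-in here, not the log-determinant regularizer the review describes.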
Theoretical Support: The paper provides formal proofs for key claims: the local inverse-Jacobian expansion (Theorem F.1), reduction to linear steering in the affine case (Corollary F.2), exact reversibility (Proposition 1), and reconstruction ambiguity of non-invertible alternatives (Proposition 2). These are clean and appropriate, though they are largely formalizations of intuitive properties rather than deep theoretical results.
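The affine-case reduction is easy to check numerically. The sketch below assumes a hypothetical affine map f(h) = A h + b with a randomly chosen, well-conditioned A; it is not taken from the paper. For such a map, latent steering collapses to adding the constant vector α A^{-1} v, regardless of the input.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
# Small perturbation of the identity, chosen so that A is comfortably invertible.
A = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
b = rng.normal(size=dim)
v = rng.normal(size=dim)  # latent steering vector (arbitrary here)
alpha = 0.7

def forward(h):
    return A @ h + b

def inverse(z):
    return np.linalg.solve(A, z - b)

def steer(h):
    # Map to latent space, shift, and map back through the exact inverse.
    return inverse(forward(h) + alpha * v)

expected = alpha * np.linalg.solve(A, v)  # alpha * A^{-1} v
deltas = [steer(h) - h for h in rng.normal(size=(3, dim))]
for d in deltas:
    print(np.allclose(d, expected))
```

Every input receives the same offset, which is exactly fixed-vector linear steering; input dependence appears only once the map is nonlinear.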
Experimental Design: The evaluation spans 5 models across 2 families (LLaMA and Qwen), 6 behavioral tasks, refusal/hallucination benchmarks, vision-language models, and a comparison with PEFT. The breadth is commendable. However, several concerns arise:
Baseline Comparison: The paper includes 9 baselines covering linear, transport-based, and nonlinear methods. The inclusion of ODESteer and TruthFlow as recent nonlinear baselines is appropriate. However, all methods appear to use the same layer selection strategy, which may not be optimal for each individual method.
Practical Applications: The framework has clear practical value for inference-time behavioral control — safety alignment, hallucination mitigation, refusal behavior, and the demonstrated extension to vision-language models. The approach is modular and doesn't require modifying model weights.
Conceptual Contribution: The idea of "learning a coordinate system where steering becomes linear" is an appealing conceptual frame that could influence how the community thinks about representation engineering. This paradigm shift — from optimizing the steering vector to optimizing the space in which steering occurs — could inspire follow-up work.
Scalability Concerns: A separate INN must be trained for each behavior and each intervention layer, which limits scalability to multi-attribute steering scenarios. The authors acknowledge this and suggest multi-attribute INNs as future work, but this remains a significant practical limitation.
Activation steering is a rapidly growing subfield with increasing practical importance as LLMs are deployed more broadly. The limitations of linear steering are becoming well-recognized, and several recent works (ODESteer, TruthFlow) have begun exploring nonlinear alternatives. INNSteer's contribution is timely in this context, offering a cleaner theoretical framework (exact invertibility) and stronger empirical results than concurrent nonlinear approaches. The connection to normalizing flows is natural but previously unexploited in this domain.
Missing Analysis: The paper lacks ablation studies separating the contributions of each loss term (beyond the log-determinant study in J.1), analysis of failure cases, and evaluation on truly adversarial or out-of-distribution prompts.
INNSteer makes a solid contribution to the activation steering literature by introducing a well-motivated nonlinear framework with exact invertibility guarantees and consistently strong empirical results. The conceptual contribution of learning a steerable coordinate system rather than a steering vector is likely to influence future work. The main concerns are the scalability of per-behavior INN training, some questions about metric saturation, and the gap between demonstrated behavioral control and practical deployment requirements.
Generated Jun 9, 2026
Paper 2 (INNSteer) introduces a novel nonlinear activation steering framework using invertible neural networks that addresses a fundamental limitation of existing linear steering methods for LLMs. It demonstrates broad applicability across multiple LLM families, scales, and safety benchmarks. Given the enormous current interest in LLM safety and controllability, this work has high timeliness and broad impact potential. Paper 1 (ERBench) provides a useful benchmarking contribution for symbolic regression/equation discovery, but benchmarks generally have lower transformative impact than novel methodological advances, and equation discovery is a narrower community compared to LLM research.
Paper 2 addresses a fundamental and widespread limitation of RLVR—the loss of gradient signal when sampled traces receive identical rewards—which affects the dominant training paradigm for reasoning LLMs. Its solution (trace tournaments with Bradley-Terry models) is elegant, practical, and yields strong empirical gains (7.6% accuracy improvement, 27-41% training acceleration, ~50% compute savings). These concrete efficiency and performance gains have broad applicability across reasoning tasks. Paper 1, while technically interesting in extending activation steering to nonlinear regimes, addresses a more niche problem with incremental improvements over existing steering methods.
LLM behavior control and safety are critical, high-priority areas in current AI research. Paper 2 introduces a novel nonlinear activation steering approach that overcomes limitations of linear methods. Its potential real-world applications in making LLMs safer and more aligned give it broader and more immediate scientific and societal impact compared to the specialized domain of interacting particle systems in Paper 1.
GPT-Micro presents a more transformative paradigm with broader cross-disciplinary impact, combining LLMs with thermodynamics-compliant constitutive model discovery in manufacturing. It demonstrates dramatic quantitative improvements (70% data reduction, 400X time reduction) on a real-world problem, bridging AI with materials science and manufacturing. Paper 2, while technically solid, offers an incremental improvement to activation steering methods within the narrower LLM interpretability/control community. GPT-Micro's novelty in integrating physics constraints with LLM-driven scientific discovery addresses a more fundamental challenge with wider practical applications.
Paper 1 addresses a fundamental limitation in generative molecular design by allowing dynamic size adaptation, offering profound implications for drug discovery and materials science. Its use of unbalanced optimal transport provides a novel solution to a critical bottleneck, leading to broader interdisciplinary impact. While Paper 2 offers a valuable methodological advancement in LLM control and safety, Paper 1's potential to accelerate real-world scientific discovery in chemistry and medicine gives it a higher overall scientific impact.
Paper 1 addresses a fundamental question in learning theory—bridging local and global self-supervised learning rules—with both theoretical contributions (exact conditions in deep linear networks) and practical advances (matching global BP-SSL performance with local rules, achieving state-of-the-art). This has broad implications for neuroscience-inspired learning, scalable training, and biological plausibility. Paper 2, while technically solid, offers an incremental improvement to activation steering in LLMs—a narrower, more applied contribution. Paper 1's theoretical depth, cross-disciplinary relevance (ML + neuroscience), and potential to reshape training paradigms give it higher long-term impact.
Paper 1 offers a deeper theoretical contribution by formalizing model merging as a noisy linear inverse problem, providing a principled explanation for why iterative methods outperform closed-form solutions (implicit spectral regularization), and delivering dramatic practical improvements (28-72x speedup, 50% memory reduction) with strong empirical results across diverse benchmarks. The insight connects well-established inverse problem theory to a growing practical need in foundation model deployment. Paper 2, while novel in applying invertible networks to activation steering, is more incremental—extending linear methods to nonlinear ones—with narrower scope and less fundamental theoretical insight.
Paper 2 (INNSteer) addresses a practical, broadly applicable problem—controlling LLM behavior at inference time—with a novel nonlinear framework using invertible neural networks. It offers clear methodological innovation over existing linear steering methods, demonstrates consistent improvements across multiple LLM families and safety benchmarks, and has immediate real-world applications in AI safety and alignment. Paper 1, while theoretically rigorous and interesting in its analysis of benchmark coverage limitations, addresses a more niche evaluation/benchmarking concern with less direct practical applicability. Paper 2's approach is more likely to be widely adopted and cited across the LLM research community.
Paper 2 (INNSteer) introduces a fundamentally new paradigm for activation steering using invertible neural networks, addressing a core limitation (linearity assumption) in a rapidly growing research area. It offers broad applicability across LLM families and behavioral traits, with a clean theoretical motivation (nonlinear manifold structure of behaviors). Paper 1 (ProEval) is a solid engineering contribution for efficient evaluation but is more incremental, applying known techniques (GPs, Bayesian quadrature) to evaluation. Paper 2 is more likely to inspire follow-up work and shift how the community approaches LLM controllability.
Paper 1 (INNSteer) addresses a highly timely and practically important problem—controlling LLM behavior at inference time—with a novel nonlinear approach using invertible neural networks. The breadth of experiments across multiple LLM families, scales, and safety benchmarks demonstrates strong practical applicability. Given the intense focus on AI safety and alignment, this work has immediate real-world relevance and broad impact. Paper 2 makes rigorous theoretical contributions to identifiability of neural interaction discovery, but targets a narrower audience (time-series causal discovery) with more specialized applications, limiting its breadth of impact despite strong methodological rigor.