Decomposing how prompting steers behavior

Fan L. Cheng, Nikolaus Kriegeskorte

#131 of 3355 · Artificial Intelligence
Tournament score: 1537±45 · Win rate: 90% (18 wins, 2 losses, 20 matches) · Rating: 7/10

Abstract

Prompting steers large language models (LLMs) and vision-language models (VLMs) without weight updates, but it remains unclear how instruction changes reshape internal representations to produce behavior. We introduce a nested geometric decomposition framework that treats prompting as a transformation of the representational geometry of the content following the prompt. For each prompt pair, we align representations of the same stimuli under two prompts using increasingly expressive stimulus-invariant maps: translation, rigid transformation with uniform scaling, sequential axis scaling, affine transformation, and nonlinear transformation. We then causally test each map by replacing a single layer's prompt-A hidden state for held-out stimuli with its mapped counterpart and measuring recovery of prompt-B representational geometry and behavior. Across three LLMs, three VLMs, and six text or image datasets spanning style, emotion, scene content, and number, prompts consistently reshape representations toward the instructed task structure. Cross-validated variance decomposition shows that much prompt-induced activation change is captured by shape-preserving maps, especially translation and rigid transformation with uniform scaling, while tier profiles reveal model- and task-specific routing strategies across layers. Crucially, although translation and rigid tiers already improve behavioral agreement, affine transformation is the first tier to nearly recover target-prompt task geometry and yields corresponding behavioral gains. This suggests that cross-dimensional linear mixing is a key mechanism by which prompts reorganize representations toward instructed task structure. Our framework decomposes prompt-induced representational change into interpretable geometric components and reveals how models route task-relevant structure to produce prompt-driven behavior.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Decomposing how prompting steers behavior"

1. Core Contribution

This paper introduces a nested geometric decomposition framework that characterizes how instruction prompts transform internal representations in language and vision-language models. The key idea is to fit a hierarchy of increasingly expressive stimulus-invariant maps—translation, rigid transformation with uniform scaling (Procrustes), rigid with axis-wise scaling, affine, and nonlinear—between representations of the same stimuli under two different prompts. The framework then causally validates each map by injecting transformed activations at a single layer and measuring recovery of target-prompt geometry (via RDM correlation and silhouette scores) and behavior (relevance and accuracy).
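To make the lower tiers concrete, here is a minimal NumPy sketch that fits three of the maps (translation, rigid transformation with uniform scaling, and ridge-regularized affine) between paired activations. It is an illustration under stated assumptions, not the authors' implementation: the function names, the R² definition, and the synthetic data in `demo` are hypothetical, and the scores are in-sample, whereas the paper reports cross-validated fits. Only the tier names and the ridge setting λ=1 for the affine tier come from this assessment.

```python
import numpy as np

def r2(Y, Y_hat):
    """Fraction of variance in Y (around its mean) explained by Y_hat.
    Assumed definition; the paper's exact R^2 may differ."""
    ss_res = np.sum((Y - Y_hat) ** 2)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def fit_translation(X, Y):
    """Translation tier: shift X so its centroid matches Y's."""
    t = Y.mean(axis=0) - X.mean(axis=0)
    return lambda Z: Z + t

def fit_procrustes(X, Y):
    """Rigid tier with uniform scaling: Y ~ s * (X - mx) @ R + my."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt                      # orthogonal map (rotation or reflection)
    s = S.sum() / np.sum(Xc ** 2)   # least-squares uniform scale
    return lambda Z: s * (Z - mx) @ R + my

def fit_affine(X, Y, lam=1.0):
    """Affine tier: ridge regression on centered data (lam=1 as in the review)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    d = X.shape[1]
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    return lambda Z: (Z - mx) @ W + my

def demo(seed=0):
    """Hypothetical data: prompt-B activations are a sheared, shifted copy of prompt-A."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(200, 5))                      # stand-in for prompt-A states
    A = np.eye(5) + 0.5 * rng.normal(size=(5, 5))      # cross-dimensional mixing
    Y = X @ A + 2.0 + 0.01 * rng.normal(size=(200, 5)) # stand-in for prompt-B states
    fits = {"translation": fit_translation,
            "rigid+scale": fit_procrustes,
            "affine": fit_affine}
    return {name: r2(Y, fit(X, Y)(X)) for name, fit in fits.items()}
```

Calling `demo()` returns one in-sample R² per tier. Because the synthetic target is built with cross-dimensional mixing, the affine fit scores highest here by construction; this shows how the tiers nest and does not reproduce any result from the paper.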

The central finding is that prompting effects are well-approximated by affine transformations: translation captures the largest share of variance, rigid transformations add meaningful improvement, and affine maps (which introduce cross-dimensional linear mixing) are the first tier to nearly recover target-prompt task geometry and behavior. Nonlinear maps provide only marginal additional gains. This provides a principled geometric characterization of what prompting "does" mechanistically.
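Recovery of target-prompt geometry is scored in part by RDM correlation. A rough sketch of such a score follows, assuming Euclidean distances between stimuli and a Pearson correlation over the upper triangle; the paper's actual distance and correlation choices are not stated in this assessment, and the function names are hypothetical.

```python
import numpy as np

def rdm(X):
    """Representational dissimilarity matrix: pairwise Euclidean
    distances between rows of X (one row per stimulus). Assumed metric."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from rounding

def rdm_correlation(X, Y):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(X.shape[0], k=1)
    return float(np.corrcoef(rdm(X)[iu], rdm(Y)[iu])[0, 1])
```

Under these assumptions the score is unchanged by translating, rotating, or uniformly scaling either representation, so at the layer where a map is applied only tiers that alter relative distances (axis-wise scaling and above) can move it. This says nothing about geometry measured at downstream layers after an intervention.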

2. Methodological Rigor

The methodology is well-constructed in several respects:

  • Nested hierarchy with proper inclusion: The transformation classes form a strict inclusion chain (FT ⊂ FOu ⊂ FOa ⊂ FL ⊂ FN), enabling clean incremental variance decomposition where each tier's contribution is uniquely identifiable.
  • Cross-validation: All fits use 5-fold stratified cross-validation, preventing overfitting artifacts from inflating the variance explained by more expressive tiers.
  • Causal validation: The paper goes beyond correlational analysis by injecting fitted transformations into the forward pass and measuring downstream behavioral consequences—a critical step that distinguishes this work from purely observational representation analysis.
  • Breadth of evaluation: Six models (3 LLMs, 3 VLMs), six datasets, and six prompt-pair groups provide substantial coverage.
However, there are notable methodological concerns:

  • The nonlinear tier uses a single-hidden-layer MLP (H=512), which is relatively limited. The authors acknowledge this but frame ΔR²_N as a "lower bound." This is appropriate but means the nonlinear contribution could be underestimated.
  • Interventions replace only a single layer and single token position, which may underestimate multi-site or distributed prompt effects.
  • The behavioral evaluation relies on keyword matching, which is coarse compared to human evaluation or more sophisticated semantic matching.
  • Ridge regularization (λ=1) for the affine tier is fixed rather than tuned, which could affect the affine-vs-nonlinear comparison.
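The incremental decomposition discussed in this section can be written out as follows. This is a sketch of the bookkeeping implied by the nested chain, assuming each tier is scored by a cross-validated R² and that increments are taken between adjacent tiers; the symbols $R^2_{\mathrm{cv}}$ and the convention $R^2_{\mathrm{cv}}(\mathcal{F}_0)=0$ are notation introduced here, not taken from the paper.

```latex
\mathcal{F}_T \subset \mathcal{F}_{Ou} \subset \mathcal{F}_{Oa} \subset \mathcal{F}_L \subset \mathcal{F}_N

\Delta R^2_k \;=\; R^2_{\mathrm{cv}}(\mathcal{F}_k) \;-\; R^2_{\mathrm{cv}}(\mathcal{F}_{k-1}),
\qquad k \in \{T,\, Ou,\, Oa,\, L,\, N\}, \qquad R^2_{\mathrm{cv}}(\mathcal{F}_0) = 0

\sum_{k} \Delta R^2_k \;=\; R^2_{\mathrm{cv}}(\mathcal{F}_N)
```

The sum telescopes, so the increments add up to the variance explained by the most expressive tier. On training data the inclusion chain guarantees non-negative increments when each tier is fit by unregularized least squares; on held-out folds, or with a fixed ridge penalty, an increment can be small or even negative, which is one reason a capacity-limited nonlinear tier yields only a lower bound on $\Delta R^2_N$.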
3. Potential Impact

    Mechanistic interpretability: The framework provides a general-purpose diagnostic for characterizing how any context manipulation (not just prompting) reshapes representational geometry. This connects to the broader mechanistic interpretability agenda and could become a standard analysis tool.

    Activation steering: The finding that natural prompting is well-approximated by affine maps provides theoretical grounding for the empirical success of translation-based (steering vectors) and affine activation steering methods. This could guide the design of more principled steering techniques—e.g., suggesting that cross-dimensional mixing (shearing) should be incorporated beyond simple additive vectors.

    Neuroscience connections: The use of RSA and Procrustes analysis creates natural bridges to computational neuroscience, where similar tools are used to compare neural representations. The "tier profiles" concept—documenting how different models route task-relevant structure across layers—could inspire analogous analyses in biological neural systems.

    Limitations on broad impact: The practical implications remain somewhat abstract. The paper doesn't demonstrate that the decomposition leads to better steering methods, improved model design, or actionable insights for practitioners. The connection to few-shot learning, chain-of-thought, or other practically important prompting strategies is unexplored.

4. Timeliness & Relevance

    The paper addresses a timely question at the intersection of two active research fronts: representational analysis of LLMs and activation steering. The mechanistic understanding of prompting is increasingly relevant as models are deployed in settings where prompt engineering critically determines behavior. The paper's geometric perspective fills a genuine gap between work showing that prompts alter geometry (scalar metrics) and work showing that simple interventions can steer behavior (engineering tools).

5. Strengths & Limitations

Strengths:

  • Elegant mathematical framework with clear nested structure and well-motivated geometric intuitions
  • Comprehensive experimental coverage across model families, modalities, and task types
  • Causal validation that goes beyond observation to intervention
  • Generalization tests (prompt paraphrases, OOD datasets) add robustness
  • Model-specific "routing strategy" profiles reveal meaningful architectural differences (e.g., OPT-2.7B uses more rotation/scaling; Llama3 concentrates alignment in fewer dimensions)
Limitations:

  • Restricted to zero-shot instruction prompting; no few-shot, chain-of-thought, or soft-prompt analysis
  • Single-layer, single-token interventions are a simplification
  • The behavioral evaluation is rudimentary (keyword matching)
  • The paper documents *that* affine maps approximate prompting but provides limited insight into *why*—e.g., what architectural properties or training procedures lead to affine-like prompt routing
  • Some incremental R² values are very small (0.01-0.05 for axis-scaling, nonlinear), raising questions about practical significance versus statistical significance
  • The framework assumes a global (stimulus-invariant) map, potentially missing important stimulus-conditional aspects of how prompts steer behavior
Overall assessment: This is a methodologically careful and conceptually clear contribution that provides a principled geometric vocabulary for analyzing prompt-induced representational changes. The main finding, that prompting effects are approximately affine, is clean and actionable. The work is more diagnostic than generative: it characterizes existing phenomena rather than enabling new capabilities. Its impact will likely be strongest in the interpretability community and as a foundation for future mechanistic studies of in-context learning.

    Rating: 7/10
    Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 7.5

    Generated Jun 3, 2026

    Comparison History (20)

    vs. The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?
    gemini-3.1 · 6/5/2026

    Paper 2 offers foundational insights into mechanistic interpretability, uncovering the geometric transformations underlying prompting in LLMs and VLMs. While Paper 1 introduces a highly relevant benchmark for evaluating recursive self-improvement, Paper 2 provides deep causal explanations of model internals across multiple architectures and modalities. This fundamental approach to understanding the 'how' and 'why' of neural network behavior is likely to drive broader, more enduring scientific advancements across the field of AI.

    vs. PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models
    gemini-3.1 · 6/5/2026

    Paper 1 addresses fundamental mechanistic interpretability by exploring how prompts alter internal representational geometry in foundation models. This theoretical contribution has broad implications for understanding and steering LLMs and VLMs across various tasks. While Paper 2 offers a valuable and practical memory architecture, Paper 1's insights into the core mechanics of prompting are likely to have a more profound and widespread impact on future AI research.

    vs. Beyond Prompt-Based Planning: MCP-Native Graph Planning-based Biomedical Agent System
    claude-opus-4.6 · 6/5/2026

    Paper 1 introduces a novel, general-purpose geometric decomposition framework for understanding how prompting reshapes internal representations in LLMs/VLMs. It provides fundamental mechanistic insights applicable across all prompted models, with rigorous causal testing methodology. Its breadth—spanning multiple model architectures, modalities, and tasks—gives it wide relevance to the interpretability and alignment communities. Paper 2 is a solid engineering contribution for biomedical agents but is more domain-specific and incremental (combining MCP servers with graph-based planning). Paper 1's foundational insights into prompt mechanisms have broader and longer-lasting scientific impact.

    vs. On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher scientific impact due to broader cross-model and cross-modality relevance and a more fundamental mechanistic contribution: a causal, layered geometric decomposition of how prompts steer internal representations and behavior. This advances interpretability and prompt engineering across LLMs and VLMs, with potential implications for controllability, safety, and neuroscience-style analysis. Paper 1 addresses an important RL-agent failure mode with a practical mitigation, but it is narrower in scope (agentic RL training setups) and the core method (advantage reweighting with critiques) is comparatively incremental.

    vs. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
    gemini-3.1 · 6/3/2026

    Paper 2 tackles a fundamental theoretical question in AI: how prompting mechanically alters internal representations to steer model behavior. Its novel geometric decomposition framework offers deep mechanistic insights applicable across various LLMs and VLMs. While Paper 1 introduces a valuable benchmark for GUI agents, Paper 2 provides foundational knowledge that could broadly influence model interpretability, alignment, and future architecture design across multiple subfields of AI research.

    vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact: it proposes a new autonomous RL training paradigm (co-evolving policy + training harness), targets a timely bottleneck in agentic RL (reward misalignment/hidden failure modes), and reports competitive or better results on high-value benchmarks including long-horizon software engineering—high real-world applicability and broad relevance to RL, agent design, and LLM training pipelines. Paper 1 is methodologically careful and insightful for mechanistic interpretability, but its applications are more indirect and its influence may be narrower than a training framework that can materially improve autonomous agents.

    vs. AUDITFLOW: Executable Symbolic Environments for Structured Financial Reporting Verification
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a fundamental, domain-agnostic problem in AI interpretability by explaining the underlying mechanisms of how prompts steer foundation models. Its rigorous, causally tested geometric decomposition framework offers broad implications for understanding and improving LLMs and VLMs across various fields. In contrast, Paper 1 presents a highly specialized, domain-specific application for financial auditing, which, while valuable practically, has a narrower scope of scientific impact compared to the foundational insights provided by Paper 2.

    vs. Adversarial Feeds Steer LLM Agent Decisions Against Their Defaults
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a fundamental 'black box' problem by introducing a novel geometric framework to mechanistically explain how prompting alters internal model representations. Its cross-modal applicability (LLMs and VLMs) and causal testing offer deep insights into model workings, promising broader long-term scientific impact across representational learning and AI interpretability compared to the specific, though important, behavioral safety vulnerability explored in Paper 1.

    vs. GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
    gemini-3.1 · 6/3/2026

    Paper 1 offers a novel, mechanistic framework for understanding how prompting alters internal representations in foundation models. This fundamental research in mechanistic interpretability has broad implications for AI alignment, capability steering, and general LLM architecture. In contrast, Paper 2 proposes a domain-specific benchmark for graph theory which, while useful for evaluation, provides less foundational innovation and has a narrower scope of impact.

    vs. CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher scientific impact due to its general, mechanistic contribution to understanding how prompting alters internal representations across multiple LLMs/VLMs, tasks, and modalities. The nested geometric decomposition plus causal layerwise state-mapping tests provide strong methodological rigor and a broadly applicable analysis toolkit for interpretability, controllability, and model design. Its relevance is high given widespread reliance on prompting. Paper 1 is impactful for drug discovery, but it is more domain-specific and its “agentic” component is less fundamentally novel than the cross-model causal geometry framework in Paper 2.

    vs. Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight
    gemini-3.1 · 6/3/2026

    Paper 1 provides a foundational, causally-tested framework for understanding the internal mechanisms of prompting across both LLMs and VLMs. While Paper 2 offers a valuable method for scalable oversight, Paper 1's geometric decomposition addresses a fundamental black-box problem in AI interpretability. Its rigorous approach and broad applicability to multiple modalities and models offer deeper theoretical insights that could drive widespread innovations in model steering and design.

    vs. Handoff Debt: The Rediscovery Cost When Coding Agents Take Over Interrupted Tasks
    claude-opus-4.6 · 6/3/2026

    Paper 1 introduces a novel geometric decomposition framework for understanding how prompting reshapes internal representations in LLMs/VLMs, offering fundamental mechanistic insights with broad applicability across model architectures and tasks. Its rigorous methodology (nested geometric tiers with causal interventions) and interpretability contributions address a core question in AI alignment and interpretability research. Paper 2 addresses a practical but narrower concern—handoff costs in coding agents—with useful but incremental findings about efficiency gains from context-bearing handoffs. Paper 1's theoretical depth, methodological innovation, and breadth of impact across the rapidly growing interpretability field give it substantially higher scientific impact potential.

    vs. PyraMathBench: Evaluating and Improving Mathematical Capability in Large Language Models
    claude-opus-4.6 · 6/3/2026

    Paper 1 introduces a novel geometric decomposition framework that provides fundamental mechanistic insights into how prompting reshapes internal representations in LLMs and VLMs. Its contribution is broadly applicable across models, modalities, and tasks, offering interpretable causal analysis of prompt mechanisms. Paper 2 contributes a useful benchmark and incremental training improvements but is more narrowly focused on mathematical reasoning evaluation. Paper 1's framework has greater potential to influence interpretability, mechanistic understanding, and prompt engineering research across the field.

    vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
    gpt-5.2 · 6/3/2026

    Paper 1 is more novel and broadly impactful: it introduces an interpretable, layered geometric decomposition with causal hidden-state mapping tests to explain how prompts transform internal representations across multiple LLMs/VLMs and diverse tasks. This offers a general mechanistic framework relevant to interpretability, controllability, alignment, and multimodal modeling. Paper 2 addresses an important, timely applied problem (tool overuse) with pragmatic RL refinements and solid benchmark gains, but its methodological novelty is more incremental and its impact is narrower (agent-tool policy optimization).

    vs. The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
    gpt-5.2 · 6/3/2026

    Paper 2 offers a novel, broadly applicable mechanistic framework for how prompting changes internal representations, with causal tests across multiple LLMs/VLMs and tasks. Its decomposition (from translation to nonlinear maps) yields interpretable, general insights (e.g., affine mixing as a key mechanism) that can influence prompting theory, representation learning, interpretability, and model steering—high breadth and timeliness. Paper 1 is important and timely for evaluation integrity, but is primarily a diagnostic/limitations study of existing contamination detectors; its impact is more specialized and less conceptually generative than Paper 2’s mechanistic, cross-domain methodology.

    vs. From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a fundamental mechanism of how prompting reshapes internal representations in LLMs and VLMs. This provides foundational insights into AI behavior with broad applicability across all domains. While Paper 1 presents a rigorous and valuable domain-specific evaluation tool for chemistry, Paper 2's theoretical contributions to interpretability and model steering offer a significantly wider scientific impact.

    vs. SkillSmith: Co-Evolving Skills and Tools for Self-Improving Agent Systems
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a fundamental scientific question about how prompting mathematically alters internal representations in LLMs and VLMs. This contribution to mechanistic interpretability provides foundational insights that apply broadly across AI research. In contrast, Paper 1 offers an innovative but more specialized engineering framework for self-improving agent systems.

    vs. From Noise to Control: Parameterized Diffusion Policies
    claude-opus-4.6 · 6/3/2026

    Paper 2 introduces a novel, interpretable geometric decomposition framework for understanding how prompting steers LLM/VLM behavior—a fundamental question in AI interpretability. Its breadth (multiple models, modalities, and datasets) and mechanistic insights (identifying affine transformation as a key tier) have wide implications for understanding and improving foundation models. Paper 1 addresses an important but narrower robotics problem (parameterized diffusion policies). While valuable, Paper 2's contribution to mechanistic interpretability of LLMs/VLMs is more timely and broadly impactful across the field.

    vs. Tracking the Behavioral Trajectories of Adapting Agents
    claude-opus-4.6 · 6/3/2026

    Paper 2 introduces a principled geometric decomposition framework for understanding how prompting steers LLM/VLM behavior internally, with broad applicability across multiple models, modalities, and tasks. It provides mechanistic interpretability insights (e.g., affine transformations as key mechanisms) with causal validation. Paper 1 addresses a narrower problem—tracking behavioral trajectories of agents via skill file diffs—with a simpler methodology (linear projection in embedding space) evaluated on a single trait with limited data (68 pairs). Paper 2's broader scope, deeper mechanistic insights, and relevance to fundamental LLM understanding give it substantially higher potential impact.

    vs. LAP: An Agent-to-Instrument Protocol for Autonomous Science
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact: it proposes a concrete, interoperable protocol layer for autonomous science that could become infrastructure across many labs, instruments, and vendors. The primitives (capability signing, reservations, safety handshakes, physically typed/uncertainty-aware results) address real-world constraints (safety, exclusivity, metrology, reproducibility) and align with emerging standards (MCP/A2A), increasing adoption potential and timeliness. Paper 1 is methodologically rigorous and novel for mechanistic interpretability, but its applications are primarily analytical within ML, whereas LAP could reshape cross-disciplinary experimental automation.