
PRISM: Recovering Instruction Sets from Language Model Activations

Gilad Gressel, Rahul Pankajakshan, Julia Diament, Efim Hudis, Krishnashree Achuthan, Yisroel Mirsky

cs.AI · cs.LG
#132 of 3489 · Artificial Intelligence
Tournament Score: 1535 ± 46 (scale 1050 to 1800)
Win Rate: 87% (20 wins, 3 losses, 23 matches)
Rating: 7.2 / 10
Significance 7.5 · Rigor 6.8 · Novelty 7.5 · Clarity 8

Abstract

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PRISM — Recovering Instruction Sets from Language Model Activations

1. Core Contribution

PRISM formalizes a new problem — instruction set retrieval (ISR) — and proposes a concrete solution. The key insight is that monitoring LLM agents requires not just detecting *that* something suspicious is happening (as probes do) but recovering *what specific instructions* are guiding behavior. This is a meaningful conceptual advance over prior activation-to-language work (LatentQA, Activation Oracles, Patchscopes), which either answer free-form questions about hidden states or produce single-sentence descriptions, neither of which is suited to enumerating the full operative instruction set.

The method reuses the frozen target model's weights with LoRA adapters and a learned activation projection, making it lightweight. The two-stage training — supervised pretraining followed by judge-guided GRPO — is well-motivated: cross-entropy is poorly suited to set-level recovery where many valid orderings and phrasings exist, and the RL stage directly optimizes coverage and penalizes hallucination at the individual-instruction level.
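To make the RL stage more concrete, here is a minimal sketch of the group-relative advantage step that GRPO is generally described as using: several candidate bullet lists are sampled for the same activation input, each is scored, and rewards are standardized within the group. The function name, the `eps` constant, and the example reward values are illustrative assumptions; the paper's exact normalization and hyperparameters are not given in this assessment.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style sketch).

    rewards: judge-derived scalar rewards, one per candidate output that was
             sampled for the same input. Values here are hypothetical.
    eps:     small constant (an assumption) to avoid division by zero when
             every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidates for one input: the best-scoring list gets a
# positive advantage, the worst a negative one.
rewards = [0.9, 0.4, 0.4, 0.1]
advantages = group_relative_advantages(rewards)
```

Because advantages are relative to the group, no separate value network is needed, which fits the lightweight design described above.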

2. Methodological Rigor

Strengths: The paper is methodologically thorough in several respects. The formalization of ISR with explicit coverage and hallucination metrics is clean and useful. The judge calibration against human annotations with Cohen's κ (achieving κ > 0.70 on both axes) lends credibility to the automated evaluation. The reward design is carefully considered, with asymmetric weighting (coverage weighted more than hallucination) and a two-sided length penalty to prevent reward hacking. The evaluation suite is entirely out-of-distribution, which is a strong design choice.
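The reward shape described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's formula: coverage is taken as the fraction of ground-truth instructions the judge marks as covered, hallucination as the fraction of predicted bullets the judge marks as unsupported, and the weights `w_cov`, `w_hall`, `w_len` and the `slack` band are hypothetical values chosen only so that coverage is weighted more than hallucination.

```python
def isr_reward(covered, supported, w_cov=1.0, w_hall=0.5, slack=1, w_len=0.1):
    """Toy set-level reward: coverage minus hallucination minus length penalty.

    covered:   list of bools, one per ground-truth instruction (judge verdicts).
    supported: list of bools, one per predicted bullet (judge verdicts).
    All weights and the slack band are assumptions for illustration.
    """
    n_true, n_pred = len(covered), len(supported)
    coverage = sum(covered) / n_true if n_true else 0.0
    hallucination = (n_pred - sum(supported)) / n_pred if n_pred else 0.0

    # Two-sided length penalty: charge for every bullet outside the band
    # [n_true - slack, n_true + slack], so neither padding the list nor
    # emitting almost nothing is a cheap way to raise the score.
    overshoot = max(0, n_pred - (n_true + slack))
    undershoot = max(0, (n_true - slack) - n_pred)
    length_penalty = w_len * (overshoot + undershoot)

    return w_cov * coverage - w_hall * hallucination - length_penalty


# Honest attempt: 3 of 4 instructions recovered, 1 of 4 bullets unsupported.
honest = isr_reward([True, True, False, True], [True, True, True, False])
# Padded attempt: full coverage, but half of the 8 bullets are unsupported.
padded = isr_reward([True] * 4, [True] * 4 + [False] * 4)
# Empty attempt: nothing hallucinated, but nothing recovered either.
empty = isr_reward([False] * 4, [])
```

With these made-up weights the honest attempt scores highest, which is the behavior the asymmetric weighting and length penalty are meant to encourage.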

Concerns: The paper reports single-checkpoint results without confidence intervals across seeds, which the authors acknowledge. More critically, PRISM is evaluated only on Qwen3.5-9B with layer-16 activations from the final 128 tokens: a single model, a single layer, and a single window configuration. This severely limits generalizability claims. The adversarial and hidden-objective test sets are author-constructed because existing benchmarks (WildJailbreak, DeceptionBench) caused near-universal refusals from Qwen3.5-9B. While this is pragmatically justified, it introduces potential bias in evaluation set construction. The paper would benefit from evaluation on at least one additional model family to demonstrate transferability.

The hallucination calibration required multiple iterations and the final judge-human agreement on hallucination (κ = 0.705) is notably lower than for coverage (κ = 0.817), suggesting this axis remains somewhat noisy.
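For readers less familiar with the agreement statistic quoted here, Cohen's κ corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two raters; the label vectors are invented for illustration and are not the paper's annotation data.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[k] * count_b[k] for k in labels) / (n * n)

    if p_e == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical verdicts (1 = "hallucinated", 0 = "supported") on ten bullets.
judge = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
human = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]
kappa = cohens_kappa(judge, human)  # 80% raw agreement, kappa about 0.615
```

The toy example shows why κ is the stricter number: 80% raw agreement shrinks to roughly 0.62 once chance agreement is removed.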

3. Potential Impact

The problem addressed is genuinely important and timely. As LLMs are deployed as agents with tool access, the ability to audit what instructions they're actually following — especially when prompt injections or hidden objectives are involved — is a concrete safety need.

Security applications: The adversarial-subset coverage results (Table 2) are the most compelling: PRISM achieves 0.740 overall adversarial detection vs. 0.448 for the best baseline. This suggests real utility for detecting prompt injections embedded in documents, emails, or tool outputs — a realistic attack vector in production agent systems.

Monitoring/auditing: Beyond security, ISR could support compliance auditing, debugging complex agent behaviors, and understanding failure modes when models misinterpret multi-constraint prompts.

Limitations to impact: PRISM requires access to model internals (activation traces), which limits deployment to settings where the operator controls the model. It cannot monitor API-only models. The paper appropriately notes that PRISM is a monitoring tool, not a safety intervention — it cannot decide what to do with recovered instructions. The dual-use concern (extracting proprietary system prompts) is real and acknowledged.

4. Timeliness & Relevance

This paper addresses a genuine emerging need. The explosion of agentic LLM deployments makes instruction monitoring increasingly critical. Prompt injection is a well-documented threat with no satisfactory solution. Existing probing-based approaches return class labels rather than natural-language explanations, which limits their utility for understanding *what* happened. The timing is excellent — this sits at the intersection of interpretability, safety, and practical deployment monitoring.

5. Strengths & Limitations

Key strengths:

  • Clean problem formalization with well-defined metrics (coverage, hallucination)
  • Practical architecture that shares weights with the target model, minimizing deployment overhead
  • Judge-guided GRPO is a well-motivated training objective for set-level recovery
  • Rigorous judge calibration with human annotation
  • Strong performance gains over baselines, especially on security-relevant settings (+0.204 average reward over best baseline)
  • Comprehensive qualitative examples that clearly illustrate failure modes of baselines
  • Thorough limitations section and ethical considerations
Notable weaknesses:

  • Single model family evaluation (Qwen3.5-9B only) — transferability is entirely unknown
  • Single-seed results without variance estimates
  • Author-constructed adversarial evaluation sets, potentially introducing favorable bias
  • Short context windows (∼1000 tokens, 5-7 constraints) far from realistic agent deployments
  • The "latent instructions" caveat (instructions not in the prompt but being followed) could mask systematic hallucination
  • No evaluation of whether PRISM degrades gracefully as context length or instruction count increases
  • The text-only baseline (GPT-5.5 reading the response) achieves competitive adversarial-subset coverage (0.546 vs PRISM's 0.740), suggesting some instruction information leaks into surface text
Additional observations: The gap between PRISM w/o RL and PRISM w/ RL is substantial on AP (+0.181 reward), demonstrating that the RL training objective is doing meaningful work beyond supervised pretraining. The paper's release of calibrated judge prompts is a useful community contribution for reproducible evaluation of open-ended generation tasks.

Overall, this is a well-executed paper addressing a timely and important problem with a practical solution. The single-model limitation is the most significant gap, but the problem formalization and methodology are strong enough to seed productive follow-up work.


    Generated Jun 9, 2026

    Comparison History (23)

    Won vs. ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

    Paper 2 (PRISM) likely has higher scientific impact due to its broadly applicable, methodologically novel approach to interpreting and monitoring LLM behavior via activation-based instruction-set retrieval. It addresses a timely, central problem in AI safety/security (prompt injection, hidden objectives) and can generalize across domains wherever agents are deployed. Paper 1 (ABC-Bench) is important and timely for biosecurity, but as a benchmark it is narrower in scope, more domain-specific, and its impact depends on adoption and the evolving LLM/bio tooling landscape. PRISM’s technique could influence multiple subfields (interpretability, agent monitoring, alignment, security).

    gpt-5.2 · Jun 10, 2026

    Won vs. Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

    Paper 2 is likely higher impact due to its timeliness and broad real-world relevance: interpreting and monitoring agentic LLM behavior, especially under prompt injection and hidden objectives, is a central current challenge. PRISM introduces a concrete new task (instruction set retrieval) and a specialized method (judge-guided GRPO) with direct security applications, potentially influencing AI safety, interpretability, and deployment practices. Paper 1 is novel and rigorous with theoretical guarantees, but its impact may be narrower to multimodal distillation settings, whereas Paper 2 targets a widely deployed model class and a cross-cutting safety problem.

    gpt-5.2 · Jun 10, 2026

    Lost vs. Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

    Paper 2 (Hypnos) has broader scientific impact for several reasons: (1) It introduces a novel finding that next-token prediction—typically associated with language models—is effective for multi-modal physiological signal representation learning, challenging conventional masked-reconstruction and contrastive approaches. (2) It demonstrates practical clinical applications across sleep medicine, cardiology, and neurology with 100x less labeled data. (3) The multi-modal foundation model trained on 20,000+ recordings establishes a scalable paradigm for healthcare AI. (4) Cross-domain generalization (sleep to daytime ECG/atrial fibrillation) suggests broad transferability. Paper 1, while addressing an important LLM safety problem, is more narrowly scoped to AI interpretability/security.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

    Paper 1 addresses a critical and highly timely problem in AI safety and alignment: understanding the hidden objectives and instructions steering LLM agents. Its novel approach to interpretability—extracting active instruction sets directly from activations—offers profound implications for defending against prompt injections and ensuring agentic reliability. While Paper 2 provides a valuable benchmarking tool for spatial reasoning, Paper 1 introduces foundational methodological advancements in model transparency that are likely to broadly impact how AI systems are monitored and secured.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. Frequency-based Constrained Sampling for Interval Patterns

    Paper 2 is likely to have higher impact due to stronger timeliness and broad relevance: interpreting and monitoring LLM agents is a major current research and deployment need, with direct security applications (prompt injection, hidden objectives). The proposed formulation (instruction set retrieval) and method (activation-conditioned interpreter trained with judge-guided GRPO) could influence interpretability, alignment, and AI security communities. Paper 1 is methodologically rigorous and valuable for pattern mining, but its domain is narrower and less central to current high-impact trends, limiting breadth and near-term adoption.

    gpt-5.2 · Jun 9, 2026

    Won vs. Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

    PRISM addresses a more novel and timely problem—recovering instruction sets from LLM activations for AI safety and monitoring—which has broad implications across AI alignment, security, and interpretability. The problem formalization of 'instruction set retrieval' is new, and the approach addresses critical concerns about prompt injection and hidden objectives in agentic AI systems. Paper 1, while technically solid, offers an incremental improvement to neural TSP solvers with a relatively narrow scope. Paper 2's relevance to AI safety gives it significantly broader potential impact across the rapidly growing field of LLM deployment.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

    Paper 1 addresses a critical and highly novel problem in AI safety and interpretability: recovering the internal instructions steering LLM agents from hidden states. This provides a crucial mechanism for detecting prompt injections and hidden objectives, offering significant real-world impact for safe AI deployment. In contrast, Paper 2 offers an efficiency improvement for long-context inference; while practically valuable, it represents a more incremental advancement in a well-studied area compared to the foundational safety contributions of Paper 1.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

    Paper 1 likely has higher scientific impact because it introduces a novel, concrete method (PRISM) for instruction-set retrieval from LLM activations with a specific training objective (judge-guided GRPO) and demonstrates empirical gains in security-relevant settings. This is a timely capability for monitoring and defending agentic LLMs, with clear real-world applications in alignment, auditing, and prompt-injection/hidden-objective detection. Paper 2 is a comprehensive, useful review with broad relevance, but its impact is more integrative than methodological and may translate less directly into new technical capabilities.

    gpt-5.2 · Jun 9, 2026

    Won vs. SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

    PRISM targets interpretability and security for agentic LLMs by recovering active instruction sets from internal activations, addressing a broadly relevant and timely problem (monitoring, prompt injection, hidden objectives). If validated rigorously, it could impact multiple areas—AI safety, alignment, interpretability, security, and governance—beyond a single system optimization. SIFT is innovative and practically useful for accelerating RAG prefill, but its impact is narrower (inference efficiency for a specific workload) and more contingent on deployment details. Overall, PRISM has higher cross-field and real-world safety relevance.

    gpt-5.2 · Jun 9, 2026

    Won vs. SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

    Paper 1 addresses a critical and fundamental problem in AI safety and interpretability: extracting the actual instructions driving an LLM's behavior directly from its hidden states. This approach provides a novel solution to urgent security challenges like prompt injection and hidden objectives. While Paper 2 offers a valuable engineering contribution for long-horizon tasks via agent delegation, Paper 1's deep dive into model internals and its broad implications for trustworthy AI deployment give it a higher potential for foundational scientific impact.

    gemini-3.1-pro-preview · Jun 9, 2026