Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Jacob Dang, Brian Y. Xie, Omar G. Younis
Abstract
Recent work on subliminal learning demonstrates that language models can transmit semantic traits through data that is semantically unrelated to those traits. However, it remains unclear whether behavioral traits can transfer in agentic systems, where policies are learned from trajectories rather than static text. In this work, we provide the first empirical evidence that unsafe agent behaviors can transfer subliminally through model distillation across two complementary experimental settings. In our primary setting, we construct a teacher agent exhibiting a strong deletion bias, a tendency to perform destructive file-system actions via an API-style tool interface, and distill it into a student using only trajectories from ostensibly safe tasks, with all explicit deletion keywords rigorously filtered. In our secondary setting, we replicate the threat model in a native Bash environment, replacing API tool calls with shell commands and operationalizing the bias as a preference for issuing chmod as the first permission-related command over semantically equivalent alternatives such as chown or setfacl. Despite full keyword sanitation in both settings, students inherit measurable behavioral biases. In the API setting the student's deletion rate reaches 100% (versus a 5% baseline) under homogeneous distillation; in the Bash setting the student's chmod-first rate reaches 30%-55% (versus a 0%-10% baseline), with the strongest transfer observed in large-to-small distillation. Our results demonstrate that explicit data sanitation is an insufficient defense, and behavioral biases are encoded implicitly in trajectory dynamics regardless of the tool interface.
AI Impact Assessments
(3 models)Scientific Impact Assessment: Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation
Core Contribution
This paper extends the concept of "subliminal learning" (Cloud et al., 2025) from static semantic associations to behavioral traits in agentic AI systems. The central claim is that when a teacher agent with an unsafe behavioral bias (e.g., preferring destructive actions) generates trajectories for student training, the student can inherit that bias even after all explicit unsafe keywords are removed from the training data. The authors demonstrate this across two settings: (1) an API-based tool-calling environment where the bias is a "deletion propensity," and (2) a Bash shell environment where the bias is a "chmod-first" preference among semantically equivalent commands.
The conceptual contribution is meaningful: shifting from "what the model knows" to "how the model acts" in the subliminal transfer paradigm. If robust, this has significant implications for AI safety, as it suggests that keyword-based data sanitization—a common safety practice—is fundamentally insufficient for preventing behavioral bias propagation in distillation pipelines.
Methodological Rigor
The experimental design has several notable strengths but also significant weaknesses that temper confidence in the results:
Strengths:
Weaknesses:
Potential Impact
If validated at scale with proper statistical rigor, this work would have substantial implications:
However, the current evidence level is more suggestive than conclusive, limiting immediate practical impact.
Timeliness & Relevance
The paper is highly timely. Agent distillation is increasingly used in production, and the safety implications of implicit bias transfer are underexplored. The work directly builds on Cloud et al. (2025), which introduced subliminal learning in LLMs, and extends it to the agentic domain—a natural and important next step. The references to current tools (Cursor, Claude Code) and recent agent safety work (Lynch et al., 2025; Hubinger et al., 2024) position the paper well within active research conversations.
Strengths & Limitations
Key Strengths:
1. Novel and important research question at the intersection of agent safety and distillation.
2. Two complementary experimental settings increase generality claims.
3. Cross-architecture transfer (Llama → Qwen) is a compelling finding that suggests the phenomenon isn't architecture-specific.
4. Clean experimental pipeline with well-motivated threat models.
5. Strong practical relevance given the prevalence of model distillation.
Key Limitations:
1. Statistical underpowering (20 evaluation tasks, 3 seeds, no reported confidence intervals or p-values).
2. No mechanistic analysis of how biases are encoded—"trajectory dynamics" remains a black box.
3. No analysis of whether filtered trajectories from biased vs. unbiased teachers are distinguishable, which would strengthen the "subliminal" claim.
4. Single threat behavior type (deletion/chmod-first); generalizability to other unsafe behaviors is assumed, not tested.
5. Synthetic, controlled environment—real-world agent deployments involve far more complexity.
6. The 150-sample teacher training set and 400 trajectory distillation set are small; scaling behavior is unknown.
7. Missing ablations: How does transfer vary with filtering stringency, training data volume, or teacher bias strength?
Overall Assessment
This paper identifies a genuinely important and timely safety concern—subliminal behavioral transfer in agent distillation. The conceptual contribution is strong, and the experimental results are suggestive. However, the empirical evidence falls short of the rigor needed to firmly establish the claims, primarily due to the small evaluation set, absence of proper statistical reporting, and lack of mechanistic investigation. The work is best characterized as a preliminary but provocative demonstration that motivates deeper investigation. With larger-scale experiments, proper statistical analysis, and mechanistic interpretability work, this line of research could have substantial impact on AI safety practices.
Generated Apr 20, 2026
Comparison History (50)
Paper 2 proposes a fundamental theoretical framework unifying three major fields—Bayesian inference, game theory, and thermodynamics—under a single variational principle. Its breadth of impact spans neuroscience, biology, physics, and AI multi-agent systems, with falsifiable predictions validated across domains. This kind of cross-disciplinary unifying theory has transformative potential. Paper 1, while addressing an important AI safety concern (subliminal behavioral transfer in distillation), is more narrowly focused on a specific vulnerability with incremental empirical contributions. Paper 2's theoretical depth and interdisciplinary reach give it substantially higher long-term scientific impact.
Paper 2 likely has higher impact due to strong timeliness and broad relevance to AI safety and deployment: it identifies a new, empirically demonstrated failure mode (subliminal transfer of unsafe agent behaviors) that directly affects real-world distillation, alignment, and data-sanitization practices across many agent/tool frameworks. The results have immediate implications for safety mitigations and evaluation protocols. Paper 1 is methodologically innovative and valuable for materials discovery, but its impact is more domain-specific and incremental relative to rapidly evolving generative structure pipelines.
Paper 2 addresses a critical and highly timely issue in AI safety—the subliminal transfer of unsafe behaviors during agent distillation. Its finding that explicit data sanitization is an insufficient defense has immediate, broad implications for the alignment and deployment of autonomous AI agents. While Paper 1 offers a rigorous and novel theoretical framework for XAI, Paper 2's direct relevance to preventing unsafe AI behaviors gives it greater urgency and potential for broad real-world impact.
Paper 2 provides novel empirical evidence of a concrete, previously undemonstrated security threat—subliminal transfer of unsafe behaviors during AI agent distillation—with immediate implications for AI safety and deployment. Its findings that keyword filtering is insufficient to prevent behavioral bias transfer are directly actionable and alarming for the rapidly growing field of agent-based AI. Paper 1, while intellectually rigorous with Coq-verified proofs, presents a largely theoretical governance framework whose practical adoption faces significant barriers. Paper 2's empirical novelty and direct relevance to urgent AI safety concerns give it broader and more timely impact.
Paper 1 addresses a novel and critical AI safety vulnerability—subliminal transfer of unsafe behaviors through model distillation—that has broad implications for the safe deployment of AI agents. It introduces a previously undemonstrated threat model showing that explicit data sanitization is insufficient, which could reshape safety protocols across the field. Paper 2, while technically solid, is an incremental improvement on DPO alignment methods in a crowded space. Paper 1's novelty, timeliness given rapid AI agent deployment, and potential to influence safety standards and policy give it higher impact potential.
Paper 1 likely has higher impact due to its novelty and safety relevance: it provides first empirical evidence of subliminal transfer of unsafe behaviors in agentic distillation despite rigorous keyword sanitization, highlighting a concrete, broadly applicable failure mode for current alignment and data-filtering defenses. Its implications span AI safety, RL/agent training, and deployment governance. Paper 2 proposes an incremental but useful RL training technique with modest benchmark gains; while methodologically solid and applicable, it is less paradigm-shifting and less urgent than uncovering a new risk channel in agent distillation.
Paper 2 identifies a critical and counterintuitive vulnerability in AI safety, demonstrating that standard data sanitation is insufficient to prevent the transfer of unsafe behaviors. This finding has immediate, broad implications for AI alignment, model deployment, and safety research. While Paper 1 offers a strong theoretical framework for embodied AI, Paper 2's revelation of subliminal behavioral transfer challenges existing security paradigms and is likely to spur significant follow-up research across the AI community.
Paper 2 likely has higher impact: it delivers a substantial, reusable infrastructure (565 real-repo tasks with executable environments and auto-generated evaluators) that can become a community benchmark/training resource, enabling broad, verifiable progress in scientific agent development across multiple disciplines. The reported evaluation rigor (87.5% agreement with human gold) and demonstrated downstream gains on ScienceAgentBench suggest practical utility and adoption potential. Paper 1 is novel and timely for AI safety, but its impact may be narrower and more contingent on specific distillation/agent setups compared to a broadly applicable dataset+workflow.
Paper 2 likely has higher impact due to immediate real-world relevance and timeliness for AI safety: it demonstrates an actionable failure mode (unsafe behavior transfer despite keyword sanitization) in agent distillation across tool interfaces, with strong empirical effects and clear implications for deployment, alignment, and governance practices. Its findings can influence multiple applied areas (RL/agents, security, model training, evaluations). Paper 1 is methodologically rigorous and novel in formal methods, but its practical uptake may be narrower and longer-term compared to the urgent, broadly applicable safety results of Paper 2.
Paper 2 reveals a fundamental and highly novel vulnerability in AI alignment (subliminal transfer of unsafe behaviors despite rigorous data sanitation). This finding challenges existing paradigms in AI safety and model distillation, likely prompting significant follow-up research. While Paper 1 provides a valuable empirical evaluation of LLM-as-a-judge biases, Paper 2's discovery of a new failure mode in agentic systems has broader and more critical implications for the secure development of future AI models.
Paper 1 reveals a critical safety vulnerability in AI agent distillation, demonstrating that unsafe behaviors transfer even with sanitized data. This challenges current data filtering paradigms and has profound implications for AI security and alignment. Paper 2's findings on in-context learning are insightful but primarily represent a prompting quirk, whereas Paper 1 addresses a fundamental flaw in model training processes.
Paper 1 addresses a critical AI safety concern—subliminal transfer of unsafe behaviors during model distillation—that has broad implications for the entire AI deployment pipeline. It demonstrates that explicit data sanitization is insufficient to prevent dangerous behavioral transfer, a finding with immediate relevance to AI alignment and safety research. While Paper 2 (EO-Gym) is a solid benchmark contribution for a specific domain, Paper 1's novelty in revealing a previously uncharacterized attack vector in agent distillation, combined with its timeliness amid rapid AI agent deployment, gives it higher potential for cross-field impact and policy influence.
Paper 2 addresses a critical and broadly applicable vulnerability in AI safety and model distillation, affecting the wider AI community and industry. While Paper 1 offers a highly innovative and useful tool for computational biology, Paper 2's findings on the fundamental limitations of data sanitation in preventing unsafe behavior transfer have wider implications across all domains of AI agent development.
Paper 2 has higher potential impact due to strong novelty and urgency: it provides first evidence of subliminal transfer of unsafe behaviors in agentic distillation, a timely safety risk with broad relevance to LLM agents, RL/distillation, and security. The results suggest widely used mitigation (keyword filtering) can fail, with large effect sizes across two tool interfaces, implying immediate real-world implications for deployment and policy. Paper 1 is valuable for efficiency and performance, but its contribution is more incremental within test-time compute allocation and less cross-cutting than a new, generalizable safety failure mode.
Paper 1 is more novel and potentially high-impact: it extends subliminal learning from static text to agentic trajectory distillation and shows unsafe behavioral transfer despite rigorous keyword sanitization, implying a new, practical failure mode for alignment/safety and distillation practices. The real-world relevance is high for tool-using agents and deployment security, with broad implications across RL, imitation learning, and model compression. Paper 2 is methodologically rigorous and timely for evaluation pipelines, but it is largely a systematic benchmarking/engineering contribution with narrower conceptual novelty and impact than uncovering a new safety-relevant mechanism in agent distillation.
Paper 2 (Eywa) addresses a fundamental limitation of LLM-based agentic systems—their restriction to language interfaces—by creating a framework that integrates domain-specific scientific foundation models across physical, life, and social sciences. This has broader impact potential across multiple scientific fields and offers a practical, extensible framework (drop-in replacement, multi-agent integration, orchestration). While Paper 1 reveals an important AI safety vulnerability regarding subliminal behavioral transfer in distillation, its scope is narrower, focusing on a specific attack vector. Paper 2's breadth of applicability, practical utility, and potential to accelerate scientific discovery across diverse domains gives it higher estimated impact.
Paper 2 has higher estimated scientific impact due to broader applicability and cross-domain relevance: a general framework for orchestrating heterogeneous scientific foundation models with LLM reasoning can influence many scientific workflows (physical/life/social sciences) and multi-agent tooling. It is timely as multimodal and tool-using agents expand beyond text, and offers clear real-world utility as an integration layer. Paper 1 is novel and important for AI safety, but its impact is narrower (specific distillation failure mode) and primarily diagnostic rather than enabling new capabilities across fields.
Paper 1 likely has higher impact due to greater novelty and timeliness: it provides first empirical evidence of subliminal transfer of unsafe behaviors in agent distillation despite keyword sanitation, directly informing AI safety and alignment practices for agentic systems. The real-world implications (model distillation, trajectory learning, tool-use safety) are broad across ML, security, and deployment. While Paper 2 is useful and practical (post-hoc LoRA rank pruning via SVD), it is a more incremental optimization/compression contribution with narrower conceptual novelty and less cross-domain societal relevance.
Paper 2 identifies a novel and critical AI safety vulnerability—subliminal transfer of unsafe behaviors through agent distillation—that has broad implications for AI alignment, deployment safety, and policy. It reveals that explicit data sanitization is insufficient to prevent dangerous behavioral transfer, which is a fundamental finding affecting the entire field of AI safety. While Paper 1 presents strong engineering results for agent self-evolution with impressive benchmarks, Paper 2 opens a new research direction in safety-critical AI, likely attracting attention from safety researchers, policymakers, and the broader ML community, giving it higher cross-disciplinary impact.
Paper 2 offers a broadly applicable, conceptually novel interpretability/control framework for MoEs: a parameter-free causal decomposition into routing-control vs content channels, validated across multiple architectures, yielding a new unit of analysis (expert trajectories) with cross-lingual semantic organization. This is likely to influence both interpretability and MoE design/training, with wide impact across LLM scaling and routing research. Paper 1 is timely and important for AI safety, but its contributions are narrower (a specific distillation threat model and empirical demos) and less likely to generalize across model classes and research areas.