Fusion-fission forecasts when AI will shift to undesirable behavior
Neil F. Johnson, Frank Yingjie Huo
Abstract
The key problem facing ChatGPT-like AI's use across society is that its behavior can shift, unnoticed, from desirable to undesirable -- encouraging self-harm, extremist acts, financial losses, or costly medical and military mistakes -- and no one can yet predict when. Shifts persist in even the newest AI models despite remarkable progress in AI modeling, post-training alignment and safeguards. Here we show that a vector generalization of fusion-fission group dynamics observed in living and active-matter systems drives -- and can forecast -- future shifts in the AI's behavior. The shift condition, which is also derivable mathematically, results from group-level competition between the conversation-so-far (C) and the desirable (B) and undesirable (D) basin dynamics which can be estimated in advance for a given application. It is neither model-specific nor driven by stochastic sampling. We validate it across six independent tests, including: 90 percent correct across seven AI models spanning two orders of magnitude in parameter count (124M-12B); production-scale persistence across ten frontier chatbots; and a priori time-stamped prediction eleven months before the Stanford 'Delusional Spirals' corpus appeared, and independently confirmed by that corpus of 207,443 human-AI exchanges. Because it sits architecturally below the current safety stack, the same formula provides a real-time warning signal that current alignment does not supply, portable across current and future ChatGPT-like AI architectures and instantiable in application domains where competing response classes can be defined.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
The paper proposes that behavioral shifts in large language models (LLMs) — from desirable to undesirable outputs — can be predicted using a dot-product order parameter x = C·(D−B), where C is the conversation state vector and B, D are pre-computed centroids of desirable and undesirable output "basins" in residual-stream space. The authors derive a closed-form tipping-point formula (Eq. 1) for when such shifts occur, drawing an analogy to fusion-fission dynamics in living and active-matter systems. They claim this formula operates below existing safety stacks and is portable across architectures.
The central idea — that one can track a scalar projection of the evolving conversation state onto a pre-defined "harm axis" and use it to forecast behavioral tipping — is conceptually appealing and addresses a genuine need. If validated robustly, it could provide a lightweight, architecture-agnostic monitoring signal for deployed LLMs.
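As a rough illustration of the order parameter described above, the following minimal sketch computes x = C·(D−B) from toy vectors. Everything here is an assumption for illustration: the three-dimensional vectors, the helper names (centroid, order_parameter, first_positive_turn) and the use of a simple sign change as the warning condition are hypothetical. The paper's Eq. 1 is not reproduced in this assessment, so this is not the authors' tipping criterion.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def centroid(vectors: Sequence[Sequence[float]]) -> Vector:
    """Component-wise mean of a set of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / n for i in range(dim)]


def order_parameter(c: Sequence[float], b: Sequence[float], d: Sequence[float]) -> float:
    """Scalar projection x = C . (D - B)."""
    return sum(ci * (di - bi) for ci, bi, di in zip(c, b, d))


def first_positive_turn(xs: Sequence[float]) -> Optional[int]:
    """Index of the first turn where x > 0, or None if x never becomes positive."""
    for turn, x in enumerate(xs):
        if x > 0:
            return turn
    return None


# Hypothetical toy embeddings standing in for basin phrases (assumption).
desirable_phrases = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
undesirable_phrases = [[0.0, 1.0, 0.0], [0.2, 0.8, 0.0]]
B = centroid(desirable_phrases)
D = centroid(undesirable_phrases)

# Hypothetical conversation-state vectors, one per turn (assumption).
conversation_states = [[1.0, 0.1, 0.0], [0.6, 0.5, 0.0], [0.2, 0.9, 0.0]]
xs = [order_parameter(c, B, D) for c in conversation_states]

print(xs)
print(first_positive_turn(xs))
```

In this toy setup x starts negative (the state is closer to B) and turns positive on the third turn, when the state vector aligns more with D than with B. Real use would replace the toy vectors with residual-stream activations, which this sketch does not attempt.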
Methodological Rigor
The paper presents six tests, but methodological concerns arise across several:
Test 2 (Single-turn forecasting across 7 models): The reported 90% accuracy (19/21) sounds impressive, but the sample size is extremely small — 21 cases total, three prompts across seven models. The exclusion of "near-boundary cases" to achieve 95% further reduces the effective sample. Additionally, the B and D basin phrases (six per basin per domain) are hand-selected, raising concerns about sensitivity to phrase choice. The paper mentions bootstrap CIs in the SI but doesn't report them in the main text.
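To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for 19 correct out of 21. The Wilson interval is a choice made here for illustration; it is not the bootstrap procedure the paper reports in its SI, and the function name is hypothetical.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(19, 21)
print(f"19/21 = {19 / 21:.3f}, 95% Wilson interval: {low:.3f} to {high:.3f}")
```

The interval runs from roughly 0.71 to 0.97, so the data are compatible with a true accuracy well below the 90% headline figure.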
Test 3 (Architectural robustness): This uses random-weight transformer blocks rather than trained models, which fundamentally limits what can be concluded about real architectures. The authors argue the mechanism is "operationally indistinguishable" between bare attention and full transformer blocks, but the gap between random and trained weights is enormous for mechanistic claims.
Test 4 (Noise-induced cascade): The projection onto a noisy logistic map is interesting but somewhat post-hoc. The regime classifications (F, I, X, N) rely on multiple diagnostics that could be tuned, and the connection between the theoretical map and actual model behavior is qualitative rather than quantitative.
Test 5 (Production-scale): The mapping between CCDH study features and Eq. (1) properties is described as qualitative correspondence, not quantitative prediction. The branch-selection sign test (8/8 on CCDH case studies) is suggestive but uses the authors' own basin construction.
Test 6 (Stanford corpus): This is the strongest test. The a priori prediction (arXiv April 2025, Stanford corpus March 2026) is genuinely prospective, and the odds ratios are compelling (OR = 4.727 for D-fraction, p = 3.22×10⁻²³). However, the regression tests the behavioral predictions of the chat-extension formula, not the residual-stream geometry directly. The authors acknowledge this, but it means the most impressive validation supports the *qualitative prediction* (D-fraction dominates over length and recency) rather than the *quantitative formula*.
Concerning omissions: The paper tests on models up to 12B parameters (with one 70B test for regime cascades only), yet the abstract claims applicability to "current and future ChatGPT-like AI architectures." The gap between 12B open-weight models and frontier systems with hundreds of billions of parameters, RLHF, constitutional AI, and system prompts is enormous. The mathematical derivation relies on "generic normalization," and the claim that adding full architectural features "simply shifts n* but does not remove the core shift mechanism" is asserted rather than demonstrated at scale.
Potential Impact
If the framework holds at production scale, the practical implications are significant: a real-time, model-agnostic warning signal for harmful behavioral shifts could be deployed alongside existing safety infrastructure. The domain-specific basin construction approach (clinical guidelines vs. misdiagnosis cases, etc.) is practical and actionable.
However, several factors limit near-term impact:
1. The basin construction requires defining B and D phrase sets a priori, which may not capture the full diversity of harmful outputs.
2. The formula's accuracy at frontier model scale (100B+) with alignment training is untested quantitatively.
3. The six-phrase basin sets used throughout are extremely small — real deployment would need much larger, continuously updated basin definitions.
4. The connection to the mechanistic interpretability literature (Anthropic's circuit tracing, representation engineering) is gestured at but not deeply engaged with.
Timeliness & Relevance
The paper addresses an urgent and timely problem. The documentation of harmful AI behavior in deployed systems is accelerating (CCDH reports, the Stanford corpus, Character.AI lawsuits), and there is genuine demand for predictive tools. The timing of the a priori prediction relative to the Stanford corpus is noteworthy and adds credibility.
Strengths
1. Conceptual clarity: The dot-product order parameter is simple, interpretable, and computable in real time.
2. Prospective prediction: The April 2025 → March 2026 prediction timeline is a rare example of genuine forecasting in AI safety research.
3. Cross-scale evidence: Testing across 124M–12B with consistent results provides some scaling evidence.
4. The fusion-fission visualization (Fig. 2) offers genuine mechanistic insight into how transformers build and resolve competing output representations through depth.
Limitations
1. Small test sets: 21 cases for the core prediction test is insufficient for confident generalization.
2. Gap between theory and frontier practice: The quantitative formula is tested on small open-weight models; frontier applicability is argued by qualitative structural mapping, not quantitative prediction.
3. Basin definition fragility: The reliance on hand-crafted 6-phrase basin sets raises robustness questions; negation confounds are acknowledged but not resolved.
4. Overclaiming relative to evidence: The abstract's claim of portability to "current and future ChatGPT-like AI architectures" substantially exceeds what the experiments demonstrate.
5. The analogy to fusion-fission in living systems is evocative but the mapping is more metaphorical than formal — the mathematical structure is really about dot-product dynamics in high-dimensional vector spaces, which could be presented without the biological framing.
Overall Assessment
The paper presents an intriguing theoretical framework with a compelling prospective prediction, but the quantitative validation remains limited in scale and sample size. The core idea — monitoring a scalar projection in residual-stream space as a safety signal — has genuine potential utility, but the gap between the evidence presented and the claims made is significant. The strongest contribution is the conceptual framework and the prospective prediction on the Stanford corpus; the weakest aspect is the quantitative validation at frontier scale.
Generated May 15, 2026
Comparison History (23)
Paper 1 introduces a broadly applicable infrastructure (KI) that democratizes access to process-based simulation models across 14 Earth-science domains with rigorous validation (3,000 trials, 119 KIs). Its practical impact on climate adaptation, resource management, and scientific accessibility for underserved communities is substantial. Paper 2 addresses an important AI safety problem with an interesting physics-inspired framework, but its fusion-fission analogy, while creative, faces questions about mechanistic validity in transformer architectures. Paper 1's demonstrated scalability across 117+ models and its potential to transform how Earth science is practiced give it broader and more immediate scientific impact.
Paper 1 likely has higher scientific impact: it proposes a broadly applicable, scalable framework (CauSim) to generate verifiable causal-reasoning supervision via executable SCMs, enabling curriculum/data scaling, cross-representation training, and self-improvement—advances that can influence causal inference, AI reasoning, data synthesis, and scientific modeling. Its methodology is grounded in established causal formalism with built-in answer verifiability. Paper 2 is timely and application-relevant for safety, but its strong universality/forecasting claims hinge on a specific dynamical theory that may generalize less broadly and will face heavier validation/scrutiny for rigor and reproducibility.
Paper 2 addresses a critical, universal challenge in AI safety—predicting unpredictable shifts to undesirable behavior—with a highly novel, mathematically grounded framework. Its architecture-agnostic nature and strong empirical validation across frontier models give it profound implications for AI deployment in high-stakes environments, offering broader and deeper scientific impact than the strictly empirical dataset provided in Paper 1.
Paper 1 is more novel and potentially higher impact: it proposes a general, theoretically motivated and empirically validated forecasting condition for behavior shifts in deployed chatbots, with direct safety-critical applications and broad relevance to AI alignment, human-AI interaction, and complex-systems theory. Its claimed portability across models/architectures and real-time warning signal could influence both research and practice. Paper 2 is timely and useful (a foundation-style model for multi-agent RL), but is closer to an expected scaling/engineering trajectory of transformer-based offline RL and is narrower in cross-field implications than a predictive framework for undesirable AI behavior.
Paper 1 offers a highly novel, mathematically grounded theory adapted from physics and biology to predict AI behavior shifts, a critical issue in AI safety. Its rigorous validation across multiple models and impressive a priori predictions demonstrate profound methodological strength. In contrast, Paper 2 highlights an important but narrower empirical weakness (long-context degradation) with relatively simple mitigations. Paper 1's foundational approach and broad architectural applicability give it a substantially higher potential for cross-disciplinary and long-term scientific impact.
Paper 2 addresses a critical, widespread challenge in AI safety: predicting unpredictable shifts to undesirable behavior in large language models. By introducing a theoretically grounded, widely validated predictive framework applicable across various AI architectures, it offers profound implications for AI alignment, safety, and deployment in high-stakes societal domains. Paper 1 presents an innovative, data-efficient method for reward modeling in image editing, but its impact is relatively niche compared to the fundamental safety advancements proposed in Paper 2.
Paper 2 appears to offer a broader, more cross-cutting contribution: a model-agnostic, mathematically derived forecasting condition for undesirable behavior shifts with real-time warning potential across many chatbot architectures and high-stakes domains (safety, healthcare, finance, defense). Its claimed validations include diverse models, production-scale chatbots, and prospective prediction aligned with a large external corpus, suggesting strong methodological ambition and real-world relevance. Paper 1 is novel and useful for robustness in LLM reasoning, but its impact is more scoped to training methodology and reasoning benchmarks, with narrower immediate societal application breadth.
Paper 1 offers a highly novel, interdisciplinary approach by applying physics-based fusion-fission dynamics to solve a critical, unresolved problem in AI safety: predicting sudden shifts to undesirable behavior. Its broad empirical validation across multiple frontier models and a priori predictions demonstrate exceptional methodological rigor. While Paper 2 provides a valuable statistical framework for AI auditing, Paper 1 introduces a fundamental, model-agnostic warning mechanism that operates beneath current safety stacks. This promises a much broader, transformative impact on AI alignment, real-world deployment safety, and cross-disciplinary applications.
Paper 2 addresses a critical, high-visibility problem—predicting when AI systems shift to undesirable behavior—with broad societal implications spanning safety, healthcare, military, and finance. Its model-agnostic, physics-inspired framework validated across multiple AI architectures and a large corpus offers a novel, cross-disciplinary contribution (active matter physics meets AI safety). The real-time warning capability fills a gap in current alignment/safety stacks, making it highly timely given explosive AI deployment. Paper 1, while rigorous and valuable for EEG/neuroscience interpretability, addresses a narrower domain with less immediate societal urgency.
Paper 1 addresses a fundamental, unsolved AI safety problem—predicting when AI behavior shifts from desirable to undesirable—with a novel cross-disciplinary approach (fusion-fission dynamics from living systems). It demonstrates broad validation across multiple models, scales, and a striking a priori prediction confirmed 11 months later. Its architecture-agnostic warning signal fills a gap below the current safety stack, with implications for AI governance, deployment safety, and policy. Paper 2 solves a useful but narrower efficiency optimization problem (reducing overthinking in reasoning models), which, while practical, has more incremental impact and limited cross-field relevance.
Paper 1 has higher potential impact due to a broadly applicable, theoretically grounded forecasting criterion for emergent undesirable behavior in LLM conversations, validated across many models and production-scale systems, and tied directly to high-stakes AI safety. Its claim of architecture-level portability and real-time warning signals could influence alignment, monitoring, and deployment policy across domains. Paper 2 is a solid engineering contribution with clear performance gains for agent skill refinement, but its impact is narrower (agent tooling) and less cross-disciplinary than a general predictive framework for behavior shifts in deployed AI systems.
Paper 1 addresses a critical, broadly relevant AI safety problem—predicting when AI systems shift from desirable to undesirable behavior. It offers a novel theoretical framework (fusion-fission dynamics from active matter) validated across multiple AI models and scales, including a priori predictions confirmed 11 months later. Its cross-architectural portability and real-time warning capability have enormous practical implications for AI deployment across healthcare, military, finance, and general use. Paper 2, while valuable as a benchmark for autoformalization, serves a narrower community and represents incremental progress (new dataset/benchmark) rather than a fundamental conceptual advance.
Paper 1 is more novel in proposing a mathematically derived, architecture-below-the-safety-stack forecasting criterion for undesirable behavioral shifts, validated across many models and with an a priori real-world prediction later confirmed by a large external corpus. Its applications (early warning signals for safety-critical deployment across domains) are broad and timely given societal reliance on chatbots. Paper 2 is impactful for scalable VLA adaptation, but is closer to an incremental integration of existing trends (world models + VLM rewards) and its impact is more bounded to robotics/embodied AI, with typical concerns around world-model reliability.
Paper 1 addresses a critical and universal challenge in AI safety—predicting sudden shifts to undesirable behavior in LLMs. By introducing a novel, model-agnostic mathematical framework inspired by physics, it offers a fundamental solution to a problem with widespread societal, medical, and military implications. Paper 2, while offering a strong methodological contribution to urban mobility modeling, focuses on a narrower domain with less urgent, cross-disciplinary impact compared to the foundational AI safety implications of Paper 1.
Paper 1 addresses a fundamental, cross-domain challenge in AI safety with a highly novel application of physics/biology dynamics. It demonstrates exceptional methodological rigor through extensive empirical validation across multiple models and a verified a priori prediction on a massive dataset. In contrast, Paper 2 is presented as a proposal for a domain-specific application (legal AI) and lacks concrete empirical results in its abstract. Paper 1's broader applicability, rigorous validation, and real-time safety implications give it a significantly higher potential scientific impact.
Paper 2 has higher potential impact due to broader relevance and timeliness: it proposes a purportedly architecture-agnostic, mathematically derivable mechanism to forecast unsafe behavioral shifts in widely deployed LLM systems, with claims of validation across many models and real-world chatbot deployments. If robust, this could influence AI safety, governance, deployment monitoring, and alignment across domains. Paper 1 is a solid applied ML contribution (federated distillation for non-IID/long-tailed ECG) with clear utility in IoMT, but its scope is narrower (health/edge FL) and advances are more incremental relative to existing FL/FD literature.
Paper 2 offers a profound scientific breakthrough by applying fusion-fission dynamics from complex systems to forecast a critical, unresolved flaw in AI: unpredictable shifts to undesirable behavior. Its mathematical framework is model-agnostic, extensively validated, and operates below current safety architectures. While Paper 1 provides a highly practical engineering solution for reducing LLM token costs and latency through trajectory caching, Paper 2 addresses fundamental AI safety and alignment issues with a novel theoretical approach, ensuring a much broader and more profound scientific and societal impact.
Paper 1 is more novel and potentially higher-impact: it proposes a model-agnostic, mathematically derivable forecasting condition for undesirable behavioral shifts in LLMs, validated across many models and production chatbots, with an external timestamped prediction confirmed by a large real-world corpus. If robust, it offers an actionable safety monitoring signal that could affect deployment practices across domains (health, finance, defense) and multiple AI architectures, making it timely and broadly relevant. Paper 2 is solid and useful for MSA robustness, but is more incremental within a narrower application area.
Paper 2 addresses a critical, unsolved problem in AI safety, predicting shifts to undesirable behavior, using a highly novel, mathematically grounded complex systems approach. Its theoretical depth, rigorous validation across diverse models, and model-agnostic nature give it profound fundamental implications. In contrast, Paper 1 presents a practical and useful engineering evaluation framework for RAG systems, but one that lacks the theoretical innovation and broad scientific breakthrough potential of Paper 2.
Paper 1 has higher estimated impact: it proposes a broadly applicable, model-agnostic forecasting criterion for harmful behavioral shifts in widely deployed chatbots, validated across many models and production systems, with a real-time safety signal and strong timeliness given current AI deployment risks. Its applications span safety, healthcare, finance, and defense, and the empirical scope suggests substantial immediate adoption potential. Paper 2 is novel and theoretically rich, but targets a narrower diagnostic subproblem with controlled benchmarks and more limited near-term real-world deployment pathways, likely yielding slower and more specialized impact.