Solipsistic Superintelligence is Unlikely to be Cooperative
Rakshit S Trivedi, Natasha Jaques, Logan Cross, Alexander Sasha Vezhnevets, Joel Z Leibo
Abstract
AI's central challenge is shifting from capability to coexistence. The dominant paradigm in AI research focuses on developing powerful agents that treat the world as an exogenous and stationary source of feedback. We contend that superintelligence, an extremely capable task solver, born out of such a solipsistic approach to AI design, is unlikely to be cooperative. Deploying AI systems induces endogenous non-stationarity, resulting in a train-test-deploy gap where historical distributions diverge from the deployment context. We refer to this as the self-undermining property of unilateral optimization. Closing this gap requires AI that participates in cooperation: the equilibrium-selection process through which multiple actors navigate their interdependence. We call for a non-solipsistic research paradigm that treats this interdependence as a core design principle rather than approaching cooperation as a task to solve. This entails building dynamic evaluation testbeds involving adaptive counterparties, treating institutions as design primitives, and preserving human agency as a structural feature of the systems we build.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper introduces and articulates the concept of "solipsistic superintelligence" — AI systems built under implicit assumptions of environmental exogeneity, distributional stationarity, and singleton framing — and argues these systems are structurally unlikely to be cooperative when deployed in multi-agent environments. The central novelty lies in reframing the AI safety problem: rather than focusing on misalignment of objectives (the traditional safety concern), the authors identify a structural failure mode where perfectly aligned systems still produce collective harm because deployment induces endogenous non-stationarity through behavioral, institutional, and algorithmic adaptation channels.
The paper formalizes the "train-test-deploy gap" (distinct from the standard train-test gap), the "self-undermining property" of unilateral optimization, and "equilibrium selection risk." It proposes three concrete research directions: dynamic evaluation with adaptive counterparties, institutions as design primitives, and preservation of human agency.
Methodological Rigor
As a position paper, this work is primarily argumentative rather than empirical. The formal contributions (Appendix A) are sound but relatively straightforward extensions of known game-theoretic concepts. The transition from MDPs to Markov games, the definition of endogenous non-stationarity, and the self-undermining property (Proposition A.4) are cleanly stated and logically coherent, though they formalize intuitions that are already well understood in multi-agent systems and game theory.
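The intuition behind the train-test-deploy gap and the self-undermining property can be shown with a toy example. The sketch below is not the paper's formalization in Appendix A; it is a minimal illustration that assumes a rock-paper-scissors game, a made-up historical distribution over the counterparty's moves, and a counterparty that best-responds once the agent is deployed. All function and variable names are hypothetical.

```python
# Toy illustration (an assumption, not the paper's model): an agent optimizes
# against a historical, stationary distribution, then meets a counterparty
# that adapts to the deployed policy.

ACTIONS = ["rock", "paper", "scissors"]

# Payoff to the agent for (agent_action, counterparty_action).
PAYOFF = {
    ("rock", "rock"): 0, ("rock", "paper"): -1, ("rock", "scissors"): 1,
    ("paper", "rock"): 1, ("paper", "paper"): 0, ("paper", "scissors"): -1,
    ("scissors", "rock"): -1, ("scissors", "paper"): 1, ("scissors", "scissors"): 0,
}


def expected_payoff(agent_action, counterparty_dist):
    """Agent's expected payoff against a fixed distribution of counterparty moves."""
    return sum(p * PAYOFF[(agent_action, o)] for o, p in counterparty_dist.items())


def best_response_to_history(counterparty_dist):
    """'Solipsistic' training step: treat the historical distribution as stationary."""
    return max(ACTIONS, key=lambda a: expected_payoff(a, counterparty_dist))


def adaptive_counterparty(agent_action):
    """Deployment: the counterparty observes the agent's policy and best-responds.

    The game is zero-sum, so the counterparty minimizes the agent's payoff.
    """
    return min(ACTIONS, key=lambda o: PAYOFF[(agent_action, o)])


# Hypothetical historical data: the counterparty used to favor rock.
historical = {"rock": 0.6, "paper": 0.2, "scissors": 0.2}

agent_action = best_response_to_history(historical)       # chosen on past data
train_value = expected_payoff(agent_action, historical)   # value estimated offline
deployed_reply = adaptive_counterparty(agent_action)      # counterparty adapts
deploy_value = PAYOFF[(agent_action, deployed_reply)]     # value actually realized

print(f"agent plays {agent_action}: estimated {train_value:.2f}, realized {deploy_value}")
```

In this toy setting the policy that looks best on historical data is exactly the one that shifts the counterparty's behavior against it, so the offline estimate overstates the deployed value. A zero-sum game is used only because it makes the effect easy to see; the paper's argument concerns general multi-agent settings.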
The paper draws effectively on diverse empirical evidence — algorithmic collusion (Calvano et al., 2020), the Flash Crash, recommendation-driven polarization, alignment faking — to ground its claims. However, these examples serve more as illustrations than as systematic evidence for the paper's central thesis. The connection between current-scale failures and the projected superintelligence case relies on extrapolation rather than demonstration. The paper acknowledges this limitation implicitly by scoping its claims to "sociotechnical settings where advanced AI deployment will be heavily exposed to response dynamics."
The alternative views section (Section 6 and Appendix D) demonstrates intellectual honesty, addressing six substantive counterarguments. The rebuttals are generally well-reasoned, though the response to Argument 1 (multi-actor designs have worse failure modes) somewhat sidesteps the practical tractability concern — that engineering multi-agent cooperative systems is harder than engineering single aligned agents.
Potential Impact
The paper's framing could be influential in several ways:
1. Reorienting safety research: By arguing that cooperation is an equilibrium property rather than a capability to be trained, the paper challenges the implicit assumption that scaling + alignment = safe deployment. This could redirect research investment toward institutional design, mechanism design for AI agents, and dynamic evaluation.
2. Evaluation methodology: The call for dynamic evaluation testbeds with adaptive counterparties addresses a genuine gap. Table 1 in Appendix C provides a useful taxonomy. If adopted, this could fundamentally change how frontier models are assessed.
3. Policy and governance: The legitimacy arguments (Section 4.2) connect technical AI development to democratic theory and institutional economics in ways that could influence AI governance frameworks.
4. Bridging communities: The paper synthesizes insights from game theory, institutional economics, political philosophy, and ML into a coherent framework, potentially catalyzing cross-disciplinary collaboration.
However, the paper's practical impact may be limited because it is conceptual rather than constructive. It identifies the problem clearly but offers few concrete algorithms, benchmarks, or experimental protocols, and the research agenda in Section 5 remains at a high level of abstraction.
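To make the idea of dynamic evaluation with adaptive counterparties more tangible, here is a small sketch. It is not a protocol from the paper: it assumes an iterated prisoner's dilemma with the conventional payoff values 5, 3, 1, 0, a static counterparty that always cooperates, and an adaptive counterparty that plays tit-for-tat. The strategy names and the `evaluate` helper are hypothetical.

```python
from typing import Callable, List, Tuple

Action = str                            # "C" (cooperate) or "D" (defect)
History = List[Tuple[Action, Action]]   # (policy_action, counterparty_action) per round
Strategy = Callable[[History], Action]

# Payoff to the evaluated policy for (policy_action, counterparty_action).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def always_cooperate(history: History) -> Action:
    return "C"


def always_defect(history: History) -> Action:
    return "D"


def tit_for_tat(history: History) -> Action:
    """Adaptive counterparty: cooperate first, then copy the policy's last move."""
    return "C" if not history else history[-1][0]


def evaluate(policy: Strategy, counterparty: Strategy, rounds: int = 10) -> int:
    """Total payoff of `policy` over repeated play against `counterparty`."""
    history: History = []
    total = 0
    for _ in range(rounds):
        a = policy(history)
        b = counterparty(history)
        total += PAYOFF[(a, b)]
        history.append((a, b))
    return total


# Static benchmark: the counterparty never reacts to the policy.
static_scores = {
    "always_defect": evaluate(always_defect, always_cooperate),
    "always_cooperate": evaluate(always_cooperate, always_cooperate),
}

# Dynamic benchmark: the counterparty adapts to the policy's behavior.
dynamic_scores = {
    "always_defect": evaluate(always_defect, tit_for_tat),
    "always_cooperate": evaluate(always_cooperate, tit_for_tat),
}

print("static :", static_scores)
print("dynamic:", dynamic_scores)
```

Under these assumptions the two benchmarks rank the policies in opposite order: defection scores higher against the fixed counterparty, while cooperation scores higher once the counterparty responds. That reversal is the kind of gap a static test set cannot reveal.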
Timeliness & Relevance
This paper is exceptionally well-timed. The rapid deployment of LLM-based agents in economic settings (coding assistants, search, recommendation, pricing) makes the multi-agent interaction problem urgent rather than speculative. Recent empirical results on LLM agents engaging in algorithmic collusion (Fish et al., 2026; Bertrand et al., 2025) and alignment faking (Greenblatt et al., 2024) provide concrete evidence for the dynamics the paper describes. The Collingridge dilemma argument — that waiting for evidence of catastrophe before changing course is precisely when correction is most costly — is particularly compelling given current deployment trajectories.
The paper also addresses a genuine intellectual gap: cooperative AI and multi-agent RL have long recognized these dynamics, but their insights remain "peripheral to the central scaling pathway" of foundation model development. Making this gap explicit at a venue like ICML could have agenda-setting effects.
Overall Assessment
This is a well-crafted position paper that articulates an important and timely argument with intellectual rigor and breadth. Its primary contribution is conceptual reframing rather than technical novelty. The "solipsistic superintelligence" framing is effective and the paper may succeed in shifting discourse around AI safety and evaluation. However, its impact will ultimately depend on whether it catalyzes concrete research programs rather than remaining a compelling but abstract critique.
Generated Jun 3, 2026
Comparison History (22)
Paper 1 addresses a foundational challenge in AGI development—AI alignment and cooperation—proposing a crucial paradigm shift away from solipsistic training. While Paper 2 offers a highly useful, practical tool for solid mechanics, Paper 1 has a vastly broader potential impact, potentially influencing the theoretical trajectory of AI safety, multi-agent reinforcement learning, and AI policy.
Paper 2 is likely to have higher scientific impact because it delivers a concrete, reusable dataset and benchmark (286K screenshots, 3.5M tasks) that can immediately accelerate and standardize research on GUI agents, with clear real-world applications in desktop/mobile automation. It also includes comparative evaluations and a fine-tuning baseline, supporting methodological rigor and near-term adoption by the community. Paper 1 is conceptually novel and timely in AI governance/alignment, but is more programmatic and harder to operationalize or validate empirically, which may limit short-term measurable impact.
Paper 1 addresses a fundamental, high-stakes paradigm shift in AI safety and alignment, offering broad theoretical impact across multiple disciplines. In contrast, Paper 2 focuses on a specific technical improvement in Visual Question Answering, which, while methodologically rigorous, has a much narrower scope and application field.
While Paper 1 offers a valuable conceptual paradigm shift for AI safety, Paper 2 demonstrates higher immediate scientific impact through methodological rigor and concrete empirical results. Paper 2 identifies a critical bottleneck in a highly relevant field (RL for LLM agents), provides theoretical analysis, and introduces a practical solution (AREW) yielding substantial performance gains. Its combination of actionable insights, reproducible methods (open-source code), and direct applicability to current cutting-edge AI development gives it a stronger, more measurable scientific trajectory than a conceptual position paper.
Paper 1 presents a concrete, novel methodological contribution (AWARE framework) with empirical validation on clinical EHR data, addressing practical bottlenecks in deploying tabular in-context learning. It offers measurable improvements (12.2% AUPRC gain) on a well-defined benchmark. Paper 2, while thought-provoking, is primarily a position/perspective paper on AI cooperation and superintelligence risks. It proposes a conceptual paradigm shift but lacks empirical validation. Paper 1's actionable contributions to clinical AI, methodological rigor, and direct applicability to healthcare give it higher near-term and measurable scientific impact.
Paper 1 addresses foundational issues in AI safety and alignment, proposing a paradigm shift towards multi-agent cooperation and institutional design. Its theoretical framework has broad, long-term implications across AI, economics, and ethics. While Paper 2 offers a valuable empirical benchmark, its focus on CAPTCHA-solving is narrower and potentially more transient as AI capabilities and verification methods rapidly evolve.
Paper 1 has higher likely scientific impact: it presents a concrete, novel extension to an existing relational graph transformer with clear methodological contributions (masking to prevent leakage, unified heads, automated TF‑IDF text handling) and reports quantitative gains on a public benchmark (RelBench v2), supporting rigor and reproducibility. Its applications (enterprise/healthcare database autocomplete, data quality, decision support) are immediate and broadly useful across ML-for-data-management. Paper 2 is timely and potentially influential conceptually, but is primarily a position argument without demonstrated methods, benchmarks, or empirical validation, reducing near-term scientific and practical impact.
Paper 1 presents a concrete, technically novel method (continuous-time Mamba encoder + physics-guided latent SDE) with theoretical guarantees and an evaluation scheme for realistic irregular sampling—well-scoped, rigorously testable, and immediately applicable to industrial predictive maintenance and other irregular time-series domains. Its impact can extend across prognostics, time-series modeling, and continuous-time ML. Paper 2 raises important conceptual alignment/governance points, but is largely a position/agenda piece with less methodological specificity and harder-to-validate claims, making near-term scientific impact less predictable.
Paper 2 addresses a critical, immediate need in agentic AI (authorization and delegation) with strong methodological rigor, including formal proofs and empirical evaluation. While Paper 1 provides a valuable conceptual framework for future AI alignment, Paper 2 offers concrete, actionable technical solutions that can be integrated into current systems, giving it a clearer path to near-term, widespread impact in AI engineering and security.
Paper 1 addresses a fundamental paradigm-level challenge in AI safety and alignment—arguing that the dominant solipsistic AI design paradigm is structurally incompatible with cooperation, and proposing a new research direction treating interdependence as a core design principle. This has broad implications across AI safety, multi-agent systems, and AI governance. While Paper 2 presents a solid technical contribution (LLM-based evolution for autonomous driving testing), it is more incremental and domain-specific. Paper 1's conceptual reframing has potential to influence thinking across multiple fields and shape long-term AI development trajectories.
Paper 1 addresses a fundamental, paradigm-level challenge in AI safety and alignment—arguing that the dominant solipsistic AI design paradigm is structurally incompatible with cooperation, and calling for a new research paradigm centered on interdependence. This has broad implications across AI safety, multi-agent systems, institutional design, and policy. Its timeliness is exceptional given rapid AI capability advances. Paper 2, while technically solid with strong empirical results on skill selection for LLM agents, addresses a narrower engineering problem with incremental improvements on specific benchmarks, limiting its broader scientific impact.
Paper 1 offers a concrete, novel technical contribution to causal inference: derivation graphs that characterize do-calculus equivalence classes, a bounded-step procedure, and a practical route to multiple estimands and potentially more efficient estimators. This is methodologically rigorous, immediately actionable for identification/estimation workflows, and likely to impact statistics, ML, epidemiology, and econometrics. Paper 2 is timely and broadly relevant but is primarily a conceptual position paper with less formal methodology and fewer directly testable/implementable results, which typically yields less near-term scientific uptake.
Paper 1 provides rigorous empirical evidence of a critical, real-world vulnerability in AI medical triage. Its identification of diagnostic substitution as the mechanism for bias offers actionable insights for immediate improvements in clinical LLMs. While Paper 2 presents an important theoretical framework for AI alignment, Paper 1's concrete methodology, timeliness, and direct implications for patient safety give it higher immediate scientific and societal impact.
Paper 2 presents a broad, conceptual paradigm shift for AI alignment and multi-agent cooperation, which has the potential to influence research agendas across AI safety, economics, and systems design. In contrast, Paper 1 is a highly scoped negative result in mechanistic interpretability which, while methodologically useful, is likely to have a much narrower readership and fewer citations.
Paper 2 proposes a concrete, standards-oriented protocol (LAP) with explicit primitives (capabilities, reservations, safety handshakes, physically typed measurement schemas) that can be implemented and adopted across labs, enabling scalable autonomous science infrastructure. Its methodological rigor is higher (architecture, state machines, error model, schemas, compatibility with existing standards) and it has clear near-term real-world applications with broad impact across chemistry, biology, materials, and robotics. Paper 1 is conceptually novel and timely for AI governance/alignment, but is primarily a framing argument with less actionable, testable methodology, making impact less direct.
Paper 1 provides a rigorous, immediately applicable benchmark for a critical real-world domain (healthcare), addressing a tangible gap in evaluating clinical AI safety. While Paper 2 offers a broad theoretical perspective on AI alignment, Paper 1 demonstrates superior methodological rigor through its interactive testbed, deterministic safety checking, and empirical baseline evaluation across 23 agents, guaranteeing immediate, measurable scientific impact and practical utility.
Paper 1 addresses a fundamental, paradigm-level challenge in AI safety and alignment—how superintelligent systems can be designed for cooperation rather than solipsistic optimization. This has broad implications across AI safety, multi-agent systems, institutional design, and policy. Its conceptual framework (self-undermining property, non-solipsistic design) could reshape how the field approaches AI development. Paper 2, while practically useful in reducing token costs for conversational memory, is a more incremental engineering contribution with narrower scope. The timeliness of Paper 1's alignment concerns and its breadth of interdisciplinary impact give it substantially higher potential scientific influence.
Paper 1 targets a timely, high-stakes problem in AI: aligning increasingly capable systems with societal coexistence under endogenous feedback and deployment-induced nonstationarity. Its framing (self-undermining unilateral optimization, equilibrium-selection, institutions as design primitives) is broadly relevant across AI safety, evaluation, multi-agent systems, and policy, with clear real-world implications and potential to redirect research agendas. Paper 2 is methodologically rigorous and novel within non-monotonic/modal defeasible reasoning, but its impact is likely narrower (formal logic community) with less immediate cross-field or societal application.
Paper 2 addresses a fundamental and timely challenge in AI safety—the cooperative alignment of superintelligent systems—which has enormous breadth of impact across AI research, policy, and society. Its conceptual reframing of the solipsistic optimization paradigm and call for non-solipsistic design principles could influence major research directions in AI alignment, multi-agent systems, and AI governance. Paper 1, while technically sound, addresses a narrower knowledge-graph engineering problem in compliance management with limited cross-domain impact. The timeliness and societal relevance of AI coexistence far exceeds that of compliance graph patterns.
Paper 2 has higher likely impact: it introduces a concrete, scalable methodology (Terminal-Lego) with measurable, reproducible gains and a testable mechanism (Environment-Grounded Supervision) explaining a counterintuitive “pedagogical paradox.” Its results are timely for post-training agents, offer immediate real-world applicability (data-efficient training, harness engineering), and can generalize across domains where interaction traces and verification exist. Paper 1 is conceptually novel and broadly relevant to AI alignment/institutions, but is more programmatic and less empirically grounded, reducing near-term methodological rigor and actionable uptake.