The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

Dongxin Guo

May 21, 2026

arXiv:2605.23024v1

cs.AI (primary) · cs.CC · cs.CL · cs.LG

#512 of 2682 · Artificial Intelligence

Tournament Score: 1478 ± 43

Win Rate: 75%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.8 · Clarity 7

Abstract

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems"

1. Core Contribution

This doctoral thesis proposes a unifying methodology that reframes impossibility results across four AI subfields (computation, adaptation, grounding, trust) as actionable design specifications. The flagship result is the "Deterministic Horizon" — a computable critical reasoning depth d* ∈ [19, 31] for transformer architectures beyond which chain-of-thought accuracy decays super-exponentially, independent of fine-tuning. The thesis produces 16 impossibility specifications, each with a computable boundary, quantified violation cost, and constructive design rule.

The intellectual ambition is extraordinary: the thesis claims that fundamental limits from circuit complexity, PAC-Bayes theory, measurement theory, mechanism design, and cryptographic verification all conform to a single tripartite template (Definition 1.1). Two cross-domain compositions are proved — mechanism design × cryptographic verification (Theorem 5.18) and computation × grounding (Theorem 6.3) — while one composition (adaptation × grounding) is honestly reported as blocked by three named obstructions.

2. Methodological Rigor

The rigor varies considerably across the thesis's vast scope, which is both inevitable and problematic.

Strongest results: The FOC[Attn] characterization (Theorem 2.4) is a clean logical contribution extending formal transformer expressivity theory. The Welfare Composition Theorem (Theorem 5.18) is technically sophisticated, proving joint necessity of mechanism design and verification with an additive O(ε + e^{-κ}) bound under the Random Oracle Model. The AC⁰[p] lower bound for modular exponentiation (Theorem A.4) via Razborov-Smolensky is unconditional and technically sound.

Weaker areas: The Deterministic Horizon scaling law (Theorem 2.13) rests on Assumption 2.11, whose conditions (approximate independence of query/value errors, negligible higher-order terms, layer-uniform amplification) are justified empirically rather than proved. The banded upper bound d* = O(L·φ(d)) with φ ∈ [√log d, log d] has a conditional lower edge that depends on an unproved sparse-task hypothesis. The empirical fit ĉ = 2.74 shows a log L dependence milder than the O(L) theoretical bound, an acknowledged gap. The 12-architecture validation with r = 0.81–0.91 is suggestive, but n = 12 yields wide Fisher z-transform confidence intervals.
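To make the point about n = 12 concrete, the following is a minimal sketch of the standard Fisher z-transform interval. It assumes the reported r values are Pearson correlations computed over the 12 architectures and uses a 95% level; both are assumptions of this sketch, and the function name `fisher_ci` is hypothetical.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation.

    Uses the Fisher z-transform: atanh(r) is treated as roughly normal
    with standard error 1 / sqrt(n - 3). z_crit = 1.96 gives a 95% level
    (an assumption of this sketch, not a value taken from the thesis).
    """
    if n <= 3:
        raise ValueError("the Fisher interval needs n > 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# The two ends of the correlation range quoted in the review, at n = 12.
for r in (0.81, 0.91):
    lo, hi = fisher_ci(r, n=12)
    print(f"r = {r:.2f}, n = 12: 95% CI approx [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the intervals come out at roughly [0.44, 0.94] for r = 0.81 and [0.70, 0.97] for r = 0.91, so a moderate correlation cannot be ruled out at the lower end of the reported range.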

The preference phase transition (Theorem 3.4) leaves a log n gap between the Ω(n²/γ²) lower bound and the O(n² log n/γ²) upper bound, which remains open. The Construct Conflation Impossibility (Theorem 4.2) is mathematically clean but conceptually straightforward: it is essentially invariance of domain applied to pipeline evaluation.
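As a rough illustration of the size of that gap, the sketch below evaluates both rates with all hidden constants set to 1 and the natural logarithm; both choices are assumptions, since the review gives only the asymptotic forms. The function names are hypothetical, and n and γ are used purely as the symbols that appear in the bounds.

```python
import math


def lower_rate(n: int, gamma: float) -> float:
    """Omega(n^2 / gamma^2) with the hidden constant assumed to be 1."""
    return n**2 / gamma**2


def upper_rate(n: int, gamma: float) -> float:
    """O(n^2 log n / gamma^2) with the hidden constant assumed to be 1."""
    return n**2 * math.log(n) / gamma**2


gamma = 0.1  # illustrative value only
for n in (10, 100, 1000):
    ratio = upper_rate(n, gamma) / lower_rate(n, gamma)
    # The ratio equals log n regardless of gamma, which is the open gap.
    print(f"n = {n}: upper / lower = {ratio:.2f}")
```

The ratio depends only on n, so under these assumptions the open question is whether the extra log n factor in the upper bound is necessary.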

3. Potential Impact

High impact areas:

The impossibility-specification methodology itself could influence how the AI safety community frames negative results. Converting limits into engineering rules with computable boundaries is a genuinely useful paradigm shift.

The Fine-Tuning Impossibility (Theorem 2.20) has immediate practical implications: it provides principled tool-delegation thresholds rather than empirical heuristics.

The 147× non-linearity tax explanation via the Algebraic-Boolean Bridge gives the zkML community its first formal lower bound for the empirically observed overhead.

The Welfare Composition Theorem could influence AI marketplace design by proving that mechanism design and verification are jointly necessary.
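The tool-delegation idea mentioned above can be sketched as a simple routing rule. This is a hypothetical illustration, not the thesis's procedure: it assumes a horizon d* has already been computed for the deployed architecture (the abstract reports measured values between 19 and 31) and that some estimate of a task's reasoning depth is available. The class name, the `safety_margin` parameter, and the depth estimate are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HorizonPolicy:
    """Route a task to the model or to an external tool.

    horizon: assumed precomputed d* for the deployed architecture.
    safety_margin: hypothetical buffer, in reasoning steps, kept below d*.
    """

    horizon: int
    safety_margin: int = 0

    def route(self, estimated_depth: int) -> str:
        if estimated_depth < 0:
            raise ValueError("estimated_depth must be non-negative")
        # Delegate once the estimated depth, plus the buffer, exceeds d*.
        if estimated_depth + self.safety_margin > self.horizon:
            return "tool"
        return "model"


# Example with the lowest horizon in the reported range and a 2-step buffer.
policy = HorizonPolicy(horizon=19, safety_margin=2)
for depth in (10, 17, 18, 25):
    print(depth, policy.route(depth))
```

The threshold here is a fixed number known before deployment rather than a tuned heuristic, which is the property the review highlights; how reasoning depth would be estimated for a real task is left open in this sketch.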

Limited impact areas:

Many individual results (CoT error propagation, PAC-Bayes for LoRA, model collapse bounds) are incremental refinements of known results rather than fundamentally new insights.

The compliance assistant running example, while pedagogically useful, is benchmarked on n=300 instances and not deployed to production, limiting the practical validation claims.

4. Timeliness & Relevance

The thesis is extremely timely. It addresses the central tension in 2024-2026 AI deployment: systems are increasingly relied upon for high-stakes decisions while their failure modes remain poorly understood. The impossibility-specification framework directly responds to regulatory demands (EU AI Act, NIST frameworks) for computable safety guarantees. The focus on composition — proving that individual guarantees must be combined — addresses a genuine gap in the trustworthy AI literature.

5. Strengths & Limitations

Key strengths:

Intellectual coherence: 16 specifications across four subfields under one methodology is architecturally impressive

Honest obstruction reporting (§6.3) demonstrates intellectual maturity rarely seen in theses

The three emergent principles (impossibility-as-specification, theory-practice gaps as diagnostics, reliability as composition) are genuinely insightful

Comprehensive proofs in appendices with explicit assumption tracking

Notable weaknesses:

Breadth-over-depth tradeoff: covering four subfields means no single result achieves the depth of a focused thesis

Several key results are conditional on unproved hypotheses (sparse-task hypothesis, Softmax circuit complexity conjecture)

The 50-115× planning theory-practice gap (§2.3.4) undermines the "computable specification" claim for planning capacity

The EvoPref evolutionary alignment section feels tangential — it's a demonstration rather than a deep contribution

Cross-model validation at n=12 architectures is modest for the generality claimed

The thesis's central open problem (full four-way composition) is acknowledged as potentially requiring the unification of fundamentally incompatible mathematical frameworks

Scalability and reproducibility: The empirical methodology is reproducible (specific models, datasets, GPU configurations reported). However, the fitted constant ĉ = 2.74 is evaluation-set-dependent, and the thesis appropriately flags this.

Overall Assessment

This is an ambitious, intellectually creative thesis that proposes a genuinely novel methodological framework. Its strongest contribution is conceptual — the systematic conversion of impossibility results into engineering specifications — rather than any single technical result. The individual theorems range from strong (FOC[Attn], Welfare Composition) to incremental (CoT error propagation, model collapse bounds). The breadth is both the thesis's greatest strength and its primary limitation.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.8 · Clarity 7

Generated May 25, 2026

Comparison History (24)

vs. Proper Scoring Rules for Agentic Uncertainty Quantification

gemini-3.1 · 5/26/2026

Paper 1 establishes fundamental, computable limits on transformer reasoning depth and generalizes impossibility results across multiple AI subfields. Its broad implications for AI scaling, architecture design, and safety give it significantly higher potential scientific impact than Paper 2, which offers a narrower, albeit rigorous, methodological improvement for uncertainty quantification.

vs. A Sober Look at Agentic Misalignment in Automated Workflows

gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact due to its broad, theory-driven contributions: a pre-deployment, architecture-computable “Deterministic Horizon” accuracy ceiling; multiple cross-domain impossibility-to-specification translations; and quantified design rules spanning reasoning limits, preference learning, retrieval, mechanism design, and verifiable inference. If validated, such results would influence model architecture choices, evaluation protocols, and safety/assurance practices across fields. Paper 1 is timely and practically relevant for multi-agent workflows, but its impact is narrower and more empirical/paradigm-specific, with fewer generalizable, field-wide constraints.

vs. HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

gemini-3.1 · 5/26/2026

Paper 1 proposes foundational theoretical limits for AI architectures, establishing computable accuracy ceilings and translating fundamental impossibility theorems into concrete design specifications across multiple subfields. Its potential to establish universal laws for transformer capacity and reasoning depth gives it a much broader and more paradigm-shifting scientific impact compared to Paper 2's methodological, albeit clever, improvement to multi-step reasoning efficiency.

vs. Beyond Query Memorization: Large Language Model Routing with Query Decomposition and Historical Matching

claude-opus-4.6 · 5/26/2026

Paper 2 presents a broader theoretical framework with potentially transformative impact across multiple subfields of AI. It introduces the 'Deterministic Horizon' concept—a provable accuracy ceiling for transformers based on architecture alone—and systematically converts impossibility results into actionable design specifications. Its breadth (spanning preference learning, retrieval pipelines, auction theory, and zero-knowledge verification) and its foundational, theory-driven methodology could influence how the entire field approaches trustworthy AI system design. Paper 1, while practically useful for LLM routing, addresses a narrower optimization problem with more incremental contributions.

vs. TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

gpt-5.2 · 5/26/2026

Paper 1 has higher likely impact: it proposes a concrete, implementable method for enzyme–reaction retrieval with demonstrated empirical gains, robustness, and transfer across distributions—immediately useful for enzyme annotation, pathway design, and biocatalysis. Its novelty (text-informed enzyme representations with dynamic gating and shared projection) is plausible and actionable, and the application domain is large in biotech. Paper 2 is ambitious and wide-ranging, but the sweeping “architecture-only accuracy ceiling” and many cross-domain impossibility claims risk being overly strong or hard to validate; impact depends on exceptionally rigorous proofs and community acceptance.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

claude-opus-4.6 · 5/25/2026

Paper 1 presents a specific, well-documented empirical finding (inverse scaling in LLM forecasting) with clear methodology, reproducible benchmarks, and actionable recommendations. It addresses a timely problem with direct implications for high-stakes domains (finance, epidemiology). Paper 2 is ambitious in scope but reads as a thesis-level collection of loosely connected theoretical results; its breadth sacrifices depth, and several claims (e.g., the 'Deterministic Horizon') require extraordinary validation. Paper 1's focused, falsifiable contribution with released benchmarks is more likely to influence evaluation practices and downstream research.

vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

gemini-3.1 · 5/25/2026

Paper 1 presents fundamental theoretical bounds on AI architectures, transforming impossibility results into concrete, computable design specifications. By proving an architectural accuracy ceiling (the Deterministic Horizon) that cannot be overcome by scaling or fine-tuning, it challenges current scaling paradigms and offers profound implications for AI safety, capability prediction, and system design. Paper 2 provides a valuable, practical workflow for multi-agent planning, but its scope is narrower and primarily empirical. The foundational nature, methodological rigor, and broad applicability across multiple AI subfields give Paper 1 a significantly higher potential scientific impact.

vs. DART: Semantic Recoverability for Structured Tool Agents

claude-opus-4.6 · 5/25/2026

Paper 1 presents a sweeping theoretical framework that transforms impossibility results into actionable design specifications for AI systems, spanning multiple subfields (preference learning, retrieval pipelines, auctions, zero-knowledge proofs). Its flagship 'Deterministic Horizon' result—an architecture-determined accuracy ceiling for transformers—is a fundamental contribution with broad implications. The breadth of impact across fields (complexity theory, mechanism design, information theory, trustworthy AI) and the novelty of the impossibility-as-specification methodology give it substantially higher potential impact than Paper 2, which addresses a narrower (though practically useful) problem of runtime recovery semantics for tool agents.

vs. Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

claude-opus-4.6 · 5/25/2026

Paper 2 presents novel formal/mathematical results (the Deterministic Horizon, accuracy ceiling proofs, circuit-complexity lower bounds) with broad implications across multiple subfields (preference learning, retrieval, auctions, zero-knowledge verification). Its methodology of converting impossibility results into constructive design specifications is highly innovative and broadly applicable. While Paper 1 offers a valuable conceptual framework for accountability in agentic AI ecosystems, it is primarily a theory-building contribution in IS/management. Paper 2's formal results, if validated, would have deeper and wider scientific impact across CS theory, ML, and trustworthy AI system design.

vs. Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions

gemini-3.1 · 5/25/2026

Paper 1 addresses fundamental architectural limits of Large Language Models (LLMs), currently the most impactful and heavily researched area in computer science. By establishing a 'Deterministic Horizon' that dictates an accuracy ceiling based on transformer architecture regardless of training, it offers profound, immediate implications for AI design, RAG systems, and AI safety. Paper 2, while mathematically rigorous, focuses on fuzzy logic extensions—a mature and comparatively niche field. Paper 1's timeliness, direct applicability to state-of-the-art AI paradigms, and breadth of impact across machine learning subfields make it significantly more likely to achieve high scientific impact.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

gemini-3.1 · 5/25/2026

Paper 1 offers foundational mathematical proofs establishing hard boundaries on transformer reasoning capabilities (a 'Deterministic Horizon'). By proving that reasoning depth is strictly bounded by architecture regardless of training scale, it directly challenges current AI scaling paradigms. Its broad scope spans information theory, circuit complexity, and multi-stage pipelines, offering computable limits prior to deployment. In contrast, Paper 2 is a valuable but narrower empirical study on language agent skill reuse. Paper 1's theoretical rigor and potential to fundamentally alter AI architecture design give it significantly higher potential scientific impact.

vs. Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

claude-opus-4.6 · 5/25/2026

Paper 1 addresses a highly practical problem with immediate industry relevance—compiling agentic workflows into LLM weights to reduce cost by 100x while maintaining quality. It targets a massive developer ecosystem (290K+ GitHub stars across frameworks) and provides empirical evidence across multiple real-world domains. Its practical applicability and clear cost-benefit proposition give it high near-term adoption potential. Paper 2, while intellectually ambitious in cataloguing impossibility results as design specifications, reads as a broad thesis with many claims spanning diverse subfields, making each individual result less deeply validated. Its impact is more speculative and theoretical.

vs. Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

gemini-3.1 · 5/25/2026

Paper 2 establishes fundamental theoretical limits for AI architectures and translates impossibility theorems into broad design specifications across multiple subfields. This foundational work on the mathematical and structural limits of LLMs has a vastly wider scope and potential paradigm-shifting impact on AI research compared to Paper 1, which, while highly valuable, focuses on a specific clinical application.

vs. AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

claude-opus-4.6 · 5/25/2026

Paper 1 presents novel theoretical results—a provable accuracy ceiling (Deterministic Horizon) for transformer architectures, along with 16 concrete impossibility-turned-design-specification results spanning multiple subfields. These are fundamental contributions with lasting impact: they provide computable, architecture-dependent bounds that constrain what AI systems can achieve, offering actionable design rules. Paper 2 is a survey of AI-powered research automation that organizes existing work and proposes evaluation dimensions but lacks original theoretical or empirical contributions. Surveys can be impactful but rarely match the long-term influence of foundational theoretical results that reshape how systems are designed and evaluated.

vs. PathCal: State-Aware Reflection-Marker Calibration for Efficient Reasoning

gpt-5.2 · 5/25/2026

Paper 1 presents a concrete, training-free decoding method (PathCal) with direct empirical validation across multiple benchmarks, offering immediate practical benefits (efficiency/accuracy trade-off) and easy adoption in LRM inference pipelines—high likelihood of near-term uptake and follow-on work. Paper 2 is ambitious and broad, but its sweeping, architecture-only “accuracy ceiling,” cross-domain impossibility catalog, and strong lower-bound claims are atypical and would require extraordinary proof and community verification; absent that, impact is more speculative. Thus Paper 1 has higher estimated scientific impact.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

gemini-3.1 · 5/25/2026

Paper 1 addresses fundamental limits and architectures of Large Language Models, offering highly timely, broad, and rigorously quantified theoretical bounds ('Deterministic Horizon') for Trustworthy AI. In contrast, Paper 2 presents a hybrid optimization approach for a specific scheduling problem that admittedly does not outperform existing state-of-the-art solvers, giving it a much narrower and less transformative scientific impact.

vs. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

gemini-3.1 · 5/25/2026

Paper 1 addresses fundamental, theoretical limits of AI architectures and translates them into broadly applicable design rules across multiple AI subfields (transformers, preference learning, retrieval). Its findings on architectural accuracy ceilings have profound implications for the design and scaling of foundational models. In contrast, Paper 2 focuses on a more specific, applied problem of multimodal knowledge editing, making Paper 1's potential scientific impact significantly broader and more foundational.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

gemini-3.1 · 5/25/2026

Paper 2 establishes fundamental theoretical limits (impossibility results) for AI architectures, defining a computable 'Deterministic Horizon' for reasoning depth. Foundational theoretical bounds that span multiple AI subfields offer broader, paradigm-shifting scientific impact compared to the algorithmic improvements for search agents presented in Paper 1.

vs. Parallel Context Compaction for Long-Horizon LLM Agent Serving

gemini-3.1 · 5/25/2026

Paper 2 presents foundational, theoretical impossibility results establishing strict architectural limits on transformer reasoning depth and other AI subfields. This represents a paradigm-shifting contribution to AI theory, scaling laws, and system design. In contrast, Paper 1 offers a highly practical but transient systems-level optimization for LLM context management. The theoretical bounds in Paper 2 have a much broader and longer-lasting scientific impact, potentially shaping future architectures just as Turing's or Arrow's limits shaped their respective fields.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

gemini-3.1 · 5/25/2026

Paper 1 addresses fundamental, theoretical limitations of AI models (like LLMs) and introduces a predictable 'Deterministic Horizon' for model accuracy based on architecture. This offers profound implications and broad applicability across all AI subfields, fundamentally altering how models are designed and evaluated. In contrast, Paper 2 presents a specialized, applied solution for collaborative driving, which, while highly practical and useful for autonomous vehicles, has a narrower scope and less potential for widespread paradigm-shifting scientific impact.