When Can Human-AI Teams Outperform Individuals? Tight Bounds with Impossibility Guarantees

Dongxin Guo, Jikun Wu, Siu-Ming Yiu

May 9, 2026

arXiv:2605.08710v1

cs.AI (primary)

#167 of 2292 · Artificial Intelligence

Tournament Score: 1526 ± 47 (axis range 1050 to 1800)

Win Rate: 95%

Rating: 7/10

Significance: 7.5 · Rigor: 6.5 · Novelty: 7.5 · Clarity: 8

Abstract

Human-AI teams fail to outperform their best member in 70% of studies, yet no theory specifies when complementarity is achievable. We derive tight bounds for the broad class of confidence-based aggregation rules by integrating signal detection theory with information-theoretic analysis, yielding four results: (1) a complementarity theorem (teams outperform individuals iff error correlation $\rho_{HM} < \rho^*$, with $\rho^* \approx a$ in the symmetric near-chance regime); (2) minimax bounds showing gains scale as $\Theta(\sqrt{\Delta d})$ with metacognitive sensitivity difference; (3) an impossibility result proving no confidence-based aggregation rule achieves complementarity when $\rho_{HM} \geq \rho^*$; and (4) multi-class generalization $\rho^*_K \approx \rho^*/\sqrt{K-1}$. Predictions match observed team accuracy ($R = 0.94$ on ImageNet-16H, $R = 0.91$ on CIFAR-10H) and the multi-class threshold scaling holds on human data ($R = 0.93$, $K = 16$), with robustness under non-Gaussian distributions. The framework explains why complementarity is rare and provides actionable design formulas; results apply to aggregation, not to interactive deliberation that generates novel answers.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a well-documented empirical puzzle: human-AI teams fail to outperform their best member in ~70% of studies. The authors develop a theoretical framework that provides necessary and sufficient conditions for when confidence-based aggregation can yield complementarity. The four main results — a complementarity theorem, minimax bounds, an impossibility result, and multi-class generalization — are unified through the integration of signal detection theory (SDT) with information-theoretic analysis.

The central insight is elegant: complementarity is achievable if and only if error correlation ρ_HM falls below a critical threshold ρ*, which approximately equals the accuracy level *a* in the symmetric equal-accuracy regime. This provides a clean, interpretable phase diagram separating achievable from impossible regions.
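As a rough illustration of how the threshold rule could be checked on data, the sketch below estimates the error correlation from per-item correctness and compares it with the approximate threshold ρ* ≈ a. It rests on assumptions not confirmed by the text above: ρ_HM is taken to be the Pearson (phi) correlation between binary error indicators, the function names are hypothetical, and the toy correctness vectors and the accuracy value 0.7 are invented for illustration.

```python
import numpy as np

def error_correlation(human_correct, model_correct):
    """Pearson (phi) correlation between human and model error indicators.

    Assumption: this is one plausible reading of rho_HM, not necessarily
    the paper's exact definition.
    """
    h_err = 1.0 - np.asarray(human_correct, dtype=float)
    m_err = 1.0 - np.asarray(model_correct, dtype=float)
    if h_err.std() == 0 or m_err.std() == 0:
        raise ValueError("correlation is undefined if an agent always or never errs")
    return float(np.corrcoef(h_err, m_err)[0, 1])

def complementarity_predicted(rho_hm, accuracy):
    """Apply the approximate rule rho* ~= a from the symmetric regime."""
    rho_star = accuracy  # approximation quoted in the review, not an exact formula
    return rho_hm < rho_star  # rho_hm >= rho_star falls in the impossibility region

# Toy data: 1 = correct, 0 = error (illustrative only).
human = [1, 1, 0, 1, 0, 1, 1, 0]
model = [1, 0, 1, 1, 0, 1, 1, 1]
rho = error_correlation(human, model)
print(round(rho, 4), complementarity_predicted(rho, accuracy=0.7))
```

Because the approximation is stated only for the symmetric regime, a check like this is best read as a screening heuristic, not a guarantee.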

Methodological Rigor

The theoretical framework is well-constructed, building naturally from SDT confidence generation models to derive closed-form expressions. However, several concerns arise:

Proof completeness: The paper provides proof sketches rather than full proofs, deferring to an "extended version" multiple times. For a paper claiming "tight bounds with impossibility guarantees," the absence of complete proofs in the main text is a significant gap. Key steps — particularly Steps 3-4 of Theorem 1's proof and the tightness arguments of Theorem 2 — are insufficiently detailed.

Model assumptions: The symmetric SDT model (equal variance, Gaussian signals) is a strong assumption. While the robustness analysis (Table 3) shows predictions degrade only ~5% under alternative distributions, this testing is limited. The authors acknowledge that severely miscalibrated systems (e.g., RLHF-tuned LLMs) violate their assumptions, which is precisely the setting of greatest practical interest today.

Empirical validation: The correlations (R = 0.91-0.94) between predicted and observed team accuracy are impressive. However, several methodological issues deserve scrutiny:

The statistical dependence structure (participants contributing to multiple pairs) is acknowledged and addressed with mixed-effects models and cluster bootstrap, which is appropriate.

The model comparison (Table 5) shows decisive BIC advantages, but only against relatively weak baselines (linear confidence, logistic, accuracy-only).

Parameter recovery (Table 4) demonstrates the identifiability of the SDT model, which is a good practice.

The "simulation-before-fitting" approach strengthens the claim that the model captures generating processes rather than overfitting.
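For readers unfamiliar with the cluster bootstrap mentioned above, a minimal sketch follows. It is not the authors' analysis code: the function name, the data layout (one row per human-AI pair, tagged with a single participant ID), and the synthetic data are all assumptions made for illustration. The idea is to resample whole participants with replacement, so rows from the same person stay together, and to recompute the predicted-versus-observed correlation on each resample.

```python
import numpy as np

def cluster_bootstrap_r(predicted, observed, participant_ids, n_boot=2000, seed=0):
    """Percentile 95% interval for Pearson R, resampling whole participants."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    ids = np.asarray(participant_ids)
    unique_ids = np.unique(ids)
    # Row indices belonging to each participant (one cluster per participant).
    rows_by_id = {pid: np.flatnonzero(ids == pid) for pid in unique_ids}
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        # Draw participants with replacement; keep all of their rows together.
        drawn = rng.choice(unique_ids, size=unique_ids.size, replace=True)
        rows = np.concatenate([rows_by_id[pid] for pid in drawn])
        stats.append(np.corrcoef(predicted[rows], observed[rows])[0, 1])
    low, high = np.percentile(stats, [2.5, 97.5])
    return float(low), float(high)

# Synthetic example: 20 hypothetical participants with 5 pairs each.
data_rng = np.random.default_rng(1)
ids = np.repeat(np.arange(20), 5)
pred = data_rng.uniform(0.5, 0.95, size=ids.size)
obs = pred + data_rng.normal(0.0, 0.03, size=ids.size)
low, high = cluster_bootstrap_r(pred, obs, ids, n_boot=500)
print(f"95% cluster-bootstrap interval for R: [{low:.3f}, {high:.3f}]")
```

Resampling at the participant level typically widens the interval compared with resampling rows independently, which is the point: it avoids overstating precision when rows share a participant.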

Scope limitations: The restriction to confidence-based aggregation is clearly stated but significantly limits the framework's applicability. Modern human-AI interaction increasingly involves dialogue, explanation, and iterative refinement — all explicitly excluded. The paper's title ("When Can Human-AI Teams Outperform Individuals?") is somewhat broader than what the results actually address.

Potential Impact

Theoretical impact: The framework provides the first tight bounds connecting error correlation, metacognitive sensitivity, and complementarity. The impossibility result is particularly valuable — it tells practitioners when to stop trying to improve aggregation and instead focus on reducing error correlation or improving metacognitive calibration. The connection to Condorcet's Jury Theorem and wisdom-of-crowds literature is natural and extends those classical results.

Practical impact: Three actionable design principles emerge: (1) estimate ρ_HM pre-deployment, (2) diversify training/information sources to reduce error correlation, (3) optimize metacognitive sensitivity rather than raw accuracy. The formula ρ*_K ≈ ρ*/√(K-1) for multi-class problems is directly useful for system designers.
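The multi-class formula is simple enough to evaluate directly. The snippet below is a small sketch of ρ*_K ≈ ρ*/√(K-1); the function name and the binary threshold value of 0.8 are illustrative assumptions, not figures from the paper.

```python
import math

def multiclass_threshold(rho_star, num_classes):
    """Approximate K-class threshold: rho*_K ~= rho* / sqrt(K - 1)."""
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    return rho_star / math.sqrt(num_classes - 1)

rho_star_binary = 0.8  # hypothetical binary threshold, for illustration only
for k in (2, 4, 16):
    print(k, round(multiclass_threshold(rho_star_binary, k), 3))
# prints:
# 2 0.8
# 4 0.462
# 16 0.207
```

Under this scaling the tolerable error correlation shrinks quickly as the number of classes grows, so a correlation that permits complementarity in a binary task may rule it out in a 16-class task.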

Cross-disciplinary reach: The framework applies equally to human-human collaboration (as the authors note), connecting to organizational psychology, medical second-opinion systems, and collective intelligence research. The neural grounding through prefrontal confidence representations adds potential connections to cognitive neuroscience.

Timeliness & Relevance

This paper is highly timely. The Vaccaro et al. (2024) meta-analysis quantifying the 70% failure rate was published recently, creating both urgency and an empirical foundation. As AI systems are deployed in high-stakes domains (medical diagnosis, judicial decisions), understanding when collaboration adds value versus when it's futile is critical. The framework addresses a genuine bottleneck in the field — moving from empirical observation of complementarity failures to principled prediction of when they occur.

Strengths

1. Clean theoretical framework with interpretable parameters (ρ_HM, d, ρ*) that map onto measurable quantities

2. Both achievability and impossibility results, providing a complete characterization within the model class

3. Strong empirical validation across two datasets with high correlations and appropriate statistical methods

4. Multi-class generalization validated on human behavioral data (K=16, R=0.93)

5. Honest scope delimitation — clearly stating what the framework does and does not cover

6. Practical actionability — the formulas can be directly applied by system designers

Limitations & Weaknesses

1. Incomplete proofs: Deferral to an "extended version" weakens the mathematical contribution

2. Narrow scope relative to title: Excludes interactive/deliberative collaboration, which is increasingly the dominant paradigm

3. SDT model rigidity: The equal-variance Gaussian assumption, while tested for robustness, may not hold for modern LLM confidence scores

4. Limited dataset diversity: Only two image classification datasets; no validation on text, medical, or other high-stakes domains

5. Static framework: Does not account for learning, trust dynamics, or fatigue — the authors acknowledge this but it limits applicability

6. The ρ* ≈ a approximation is derived in the "symmetric near-chance regime" — its accuracy for high-performing or asymmetric agents is unclear from the main text

7. WEIRD (Western, Educated, Industrialized, Rich, Democratic) sampling is acknowledged but not addressed

Overall Assessment

This paper makes a genuine theoretical contribution by formalizing conditions for human-AI complementarity with both achievability and impossibility guarantees. The framework is elegant, empirically validated, and practically useful within its stated scope. The main limitations are the incomplete proofs, restricted scope (confidence-based aggregation only), and narrow empirical validation. The work represents a meaningful advance in understanding collaborative decision-making, though its impact may be constrained by the rapid shift toward interactive, generative AI systems that fall outside the framework's scope.

Rating: 7/10

Significance: 7.5 · Rigor: 6.5 · Novelty: 7.5 · Clarity: 8

Generated May 12, 2026

Comparison History (19)

vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

claude-opus-4.6 · 5/18/2026

Paper 1 provides fundamental theoretical contributions—tight bounds, impossibility results, and actionable design formulas—for the widely studied problem of human-AI complementarity, with strong empirical validation (R>0.91). It explains a persistent empirical puzzle (why 70% of teams fail) with a unifying mathematical framework. Paper 2 makes solid engineering contributions applying formal methods to LLM monitoring, but is more incremental, combining existing techniques (LTL, runtime monitoring) in a new application domain. Paper 1's theoretical generality and explanatory power give it broader and more lasting impact across AI, cognitive science, and decision-making research.

vs. Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

claude-opus-4.6 · 5/16/2026

Paper 2 provides fundamental theoretical bounds with impossibility guarantees for human-AI collaboration, a timely and broadly relevant topic. Its tight mathematical framework (complementarity theorem, minimax bounds, impossibility results) offers actionable design principles applicable across many domains. The strong empirical validation (R=0.91-0.94) and the elegance of explaining why 70% of human-AI teams fail gives it high citation potential. Paper 1, while technically sophisticated in combining generative models with SDoH for disease modeling, is more domain-specific and incremental in its contributions to healthcare AI, limiting its breadth of impact.

vs. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

claude-opus-4.6 · 5/16/2026

Paper 1 provides a rigorous theoretical framework with tight bounds, impossibility results, and strong empirical validation (R=0.91-0.94) addressing a fundamental question in human-AI collaboration. Its mathematical foundations (signal detection theory + information theory) offer lasting, generalizable insights explaining why complementarity fails in 70% of studies and providing actionable design formulas. Paper 2 identifies an important but narrower problem (LLM judge reliability) with practical diagnostics. While timely, it addresses a more transient issue tied to current LLM evaluation practices, whereas Paper 1's theoretical contributions have broader, longer-lasting impact across decision-making, AI deployment, and team science.

vs. [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

claude-opus-4.6 · 5/16/2026

Paper 1 provides rigorous theoretical foundations (tight bounds, impossibility results) for a widely studied problem—human-AI complementarity—with strong empirical validation (R=0.91-0.94). It directly explains a puzzling empirical regularity (70% failure rate) and offers actionable design criteria. Its mathematical framework has broad applicability across any domain using confidence-based human-AI teaming. Paper 2 presents an interesting bio-inspired architecture for physical AI but is an 'Emerging Ideas' paper with a single prototype demonstration, narrower scope, and less theoretical depth, limiting its immediate scientific impact.

vs. Predicting Power-System Dynamic Trajectories with Foundation Models

gpt-5.2 · 5/12/2026

Paper 1 offers a broadly applicable theoretical framework with tight bounds and impossibility guarantees for when human–AI aggregation can exceed individual performance, explaining a widespread empirical failure mode and yielding actionable design conditions (correlation thresholds, scaling laws, multi-class generalization). Its rigor (information-theoretic + SDT analysis) and generality across domains using confidence-based aggregation suggest wide cross-field impact (HCI, ML, decision theory, psychology). Paper 2 is timely and practically valuable for power systems, but is more domain-specific and depends on large-scale pretraining/engineering choices whose novelty and general scientific reach are comparatively narrower.

vs. AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks

claude-opus-4.6 · 5/12/2026

Paper 1 provides fundamental theoretical contributions—tight bounds, impossibility results, and complementarity conditions—for human-AI teaming, a critical and growing field. Its theoretical framework explains a widely observed empirical puzzle (why 70% of teams fail to outperform individuals) with strong predictive validation (R>0.91). The results are broadly applicable across domains and provide actionable design principles. Paper 2 is a solid benchmarking study of LLM agents in clinical settings but is more incremental, offering empirical observations (single agents outperform naive multi-agent systems) without deep theoretical insight or novel methodology.

vs. Playing games with knowledge: AI-Induced delusions need game theoretic interventions

gemini-3.1 · 5/12/2026

Paper 1 resolves a major open empirical puzzle in human-AI collaboration (why teams often fail to outperform individuals) by providing a rigorous mathematical framework with theoretical bounds and impossibility guarantees. Its high methodological rigor, combining information-theoretic analysis with strong empirical validation on benchmark human datasets (ImageNet-16H, CIFAR-10H), makes it highly foundational. While Paper 2 offers a novel game-theoretic perspective on AI safety, its reliance on simulation makes it less empirically grounded than Paper 1's robust validation.

vs. Teacher-Aware Evolution of Heuristic Programs from Learned Optimization Policies

gpt-5.2 · 5/12/2026

Paper 1 offers a broadly applicable theoretical framework with tight bounds and impossibility guarantees for when human–AI complementarity is achievable, directly addressing a widely observed empirical failure mode. Its combination of signal detection theory and information-theoretic analysis yields general design principles (e.g., correlation thresholds, scaling laws, multi-class extension) validated on multiple datasets, suggesting strong methodological rigor and cross-domain relevance (HCI, ML, decision theory). Paper 2 is practically useful for heuristic discovery, but is more incremental and narrower in scope, with impact largely confined to combinatorial optimization tooling.

vs. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

claude-opus-4.6 · 5/12/2026

Paper 2 provides fundamental theoretical bounds with impossibility guarantees for Human-AI complementarity—a widely studied but poorly understood phenomenon. Its tight mathematical framework (complementarity theorem, minimax bounds, impossibility results) with strong empirical validation (R=0.91-0.94) addresses a critical gap: why 70% of Human-AI teams fail. The results are broadly applicable across AI deployment contexts, offering actionable design formulas. Paper 1, while interesting, addresses a narrower problem (visual semantic arithmetic) with incremental methodology (applying GRPO to VLLMs) and more limited cross-field impact.

vs. Do Linear Probes Generalize Better in Persona Coordinates?

claude-opus-4.6 · 5/12/2026

Paper 2 provides fundamental theoretical bounds with impossibility guarantees for human-AI collaboration, a widely studied problem. Its tight mathematical framework (complementarity theorem, minimax bounds, impossibility results) with strong empirical validation (R=0.91-0.94) addresses the critical question of why human-AI teams often underperform. This has broad impact across HCI, AI deployment, and decision science. Paper 1 makes a useful contribution to AI safety monitoring via persona-derived probes, but is more narrowly focused on mechanistic interpretability of language models with incremental improvements over existing probe methods.

vs. SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

gemini-3.1 · 5/12/2026

Paper 2 provides a fundamental, mathematically rigorous theoretical framework to solve a pervasive problem across Human-AI teaming, offering impossibility guarantees and actionable design formulas. Its broad applicability across HCI, AI, and cognitive science, combined with strong empirical validation, gives it a much higher potential for foundational scientific impact compared to Paper 1's domain-specific engineering solution for LLM context management.

vs. PRISM: Generation-Time Detection and Mitigation of Secret Leakage in Multi-Agent LLM Pipelines

gemini-3.1 · 5/12/2026

Paper 2 provides a foundational theoretical framework with tight bounds and impossibility guarantees to explain a widespread empirical failure in Human-AI collaboration. Its rigorous mathematical approach and strong empirical validation offer broad, paradigm-shifting implications across HCI, AI, and cognitive science. While Paper 1 addresses an important and timely security issue in LLM pipelines, Paper 2's theoretical depth and cross-disciplinary relevance give it a higher potential for fundamental scientific impact.

vs. A-MBER: Affective Memory Benchmark for Emotion Recognition

gpt-5.2 · 5/12/2026

Paper 1 offers a general theoretical framework with tight bounds and impossibility guarantees for when human–AI aggregation can beat individuals, connecting signal detection theory and information theory and validating predictions across datasets. This combination of novelty, rigor, and actionable design rules is likely to influence multiple areas (HCI, ML evaluation, decision theory, collective intelligence) and provides durable, broadly applicable insights. Paper 2 is timely and practically useful as a benchmark, but benchmarks are often narrower in scope and can be superseded; its impact is likely more confined to affective computing/memory evaluation.

vs. QuarkMedSearch: A Long-Horizon Deep Search Agent for Exploring Medical Intelligence

gemini-3.1 · 5/12/2026

Paper 2 addresses a fundamental, cross-disciplinary challenge in AI (human-AI collaboration) by providing rigorous theoretical bounds and impossibility guarantees that explain widespread empirical failures. Its insights and mathematical framework apply across numerous domains. In contrast, Paper 1 is a domain-specific, applied engineering effort. Paper 2's theoretical breakthroughs and broad applicability give it significantly higher potential for widespread scientific impact.

vs. The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

gemini-3.1 · 5/12/2026

Paper 1 establishes foundational theoretical bounds and impossibility guarantees for a major empirical paradox in human-AI teaming. Its rigorous mathematical framework, heavily validated against human data, offers broad, paradigm-shifting implications across human-computer interaction, cognitive science, and AI deployment, giving it higher potential impact than Paper 2's specific algorithmic improvement to LLM red teaming.

vs. Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge

gpt-5.2 · 5/12/2026

Paper 2 has higher potential impact: it offers novel, general theoretical results (tight bounds plus an impossibility theorem) with clear, testable conditions for when human–AI complementarity is achievable. Its methodological rigor is strong (information-theoretic + signal detection derivations, multiclass extension, robustness checks) and it generalizes across tasks and domains where confidence-based aggregation is used (HCI, ML evaluation, decision science, crowdsourcing). It is timely given widespread deployment of human-AI decision pipelines. Paper 1 is valuable but primarily retrospective and competition-specific, with narrower cross-field reach.

vs. BDI-Kit Demo: A Toolkit for Programmable and Conversational Data Harmonization

gemini-3.1 · 5/12/2026

Paper 2 establishes foundational mathematical theory and impossibility bounds for a critical open problem in Human-AI collaboration, validated extensively on empirical datasets. Its methodological rigor, theoretical novelty, and broad applicability across HCI and AI give it significantly higher scientific impact than Paper 1, which is primarily a practical toolkit demonstration.

vs. Evaluating Explainability in Safety-Critical ATR Systems: Limitations of Post-Hoc Methods and Paths Toward Robust XAI

gpt-5.2 · 5/12/2026

Paper 2 offers a novel, general theoretical framework with tight bounds and impossibility guarantees for when human-AI complementarity is achievable, validated quantitatively on multiple datasets and generalized to multi-class settings. Its results are broadly applicable across human-computer interaction, ML evaluation, decision theory, and system design, with actionable criteria for building effective human-AI teams—high timeliness given widespread deployment. Paper 1 is important and timely for safety-critical XAI, but is primarily a structured assessment/taxonomy with less methodological novelty and narrower domain focus (ATR), which may limit cross-field impact relative to Paper 2’s formal, widely reusable theory.

vs. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?

gemini-3.1 · 5/12/2026

Paper 1 addresses a fundamental question in AI about the nature of representation learning, proposing a profound hypothesis that language acts as a universal attractor. Its introduction of asymmetric alignment measures and connections to information theory give it broad implications across deep learning, cognitive science, and multimodal AI. While Paper 2 offers rigorous mathematical bounds for human-AI teaming, Paper 1's insights into the underlying structure of foundation models have the potential to fundamentally reshape our understanding of neural representations.