Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

Andrea Ferrario

Jun 3, 2026

arXiv:2606.04779v1

cs.AI (primary), math.CO

#2217 of 3355 · Artificial Intelligence

Tournament Score: 1365 ± 44

Win Rate: 40%

Rating: 6.2 / 10

Significance 6.5 · Rigor 7.5 · Novelty 7 · Clarity 7


Abstract

Complementarity is the case in which a human-AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N = 2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N = 4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a formal mathematical framework for studying complementarity in multi-agent human-AI interactions (HAI) using rooted planar binary trees. The key innovation is representing HAI protocols as trees whose leaves carry prediction vectors and whose internal nodes apply local binary composition rules recursively. This yields a "tree-relative complementarity functional" that measures whether the protocol output beats a pointwise-min oracle benchmark. The framework produces four main results: (1) selector-based rules (self/AI-reliance) cannot achieve complementarity; (2) regression complementarity under squared loss reduces to Euclidean distance minimization with closed-form solutions for N=2; (3) under linear pooling, protocol trees define barycentric coordinates on the simplex, with Tamari-cover reparameterizations preserving complementarity and satisfying the pentagon identity for N=4; (4) in binary classification, internal local rules cannot achieve complementarity under endpoint-monotone losses.
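The recursive construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's notation: the names Leaf, Node, evaluate, and complementarity_gap are hypothetical, and the sign convention (oracle loss minus protocol loss, so a positive value means complementarity) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Vector = List[float]

@dataclass
class Leaf:
    preds: Vector  # one agent's predictions, one entry per instance

@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    rule: Callable[[float, float], float]  # local binary composition rule

Tree = Union[Leaf, Node]

def evaluate(tree: Tree) -> Vector:
    """Compose leaf predictions recursively up to the root."""
    if isinstance(tree, Leaf):
        return list(tree.preds)
    left, right = evaluate(tree.left), evaluate(tree.right)
    return [tree.rule(a, b) for a, b in zip(left, right)]

def leaves(tree: Tree) -> List[Vector]:
    """Collect leaf prediction vectors in left-to-right (planar) order."""
    if isinstance(tree, Leaf):
        return [tree.preds]
    return leaves(tree.left) + leaves(tree.right)

def squared_loss(pred: float, target: float) -> float:
    return (pred - target) ** 2

def complementarity_gap(tree: Tree, targets: Vector, loss=squared_loss) -> float:
    """Mean pointwise-min oracle loss minus mean protocol loss (assumed convention)."""
    output, agents, n = evaluate(tree), leaves(tree), len(targets)
    protocol = sum(loss(p, y) for p, y in zip(output, targets)) / n
    oracle = sum(min(loss(a[i], targets[i]) for a in agents) for i in range(n)) / n
    return oracle - protocol

# Two agents that err in opposite directions, combined by equal-weight averaging.
tree = Node(Leaf([1.0, 3.0]), Leaf([3.0, 1.0]), rule=lambda a, b: 0.5 * a + 0.5 * b)
gap = complementarity_gap(tree, targets=[2.0, 2.0])
print(gap)  # positive: the protocol beats the pointwise-min oracle here
```

In this toy case each agent is off by one unit on every instance, while their average hits the target exactly, so the gap is positive.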

Methodological Rigor

The mathematical development is rigorous and self-contained. The proofs are clean and follow naturally from the definitions. The impossibility results (Theorems 1 and 4) are particularly well-crafted—Theorem 1 follows elegantly from the observation that selectors can never beat the pointwise minimum, while Theorem 4 combines endpoint monotonicity with the internality property in a tight argument. The Tamari reparameterization machinery (Theorem 2) is technically sound, with the pentagon identity (Theorem 3) verified by explicit coordinate computation.
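The observation behind the selector result can be checked numerically. The sketch below is an illustration under assumptions rather than the paper's proof: it uses squared loss, random synthetic predictions, and a random per-instance selector, and the function name selector_gap is hypothetical. Because a selector always outputs one of the two agents' predictions, its per-instance loss can never fall below the smaller of the two losses.

```python
import random

def squared_loss(pred, target):
    return (pred - target) ** 2

def selector_gap(human, ai, targets, choose_human):
    """Mean pointwise-min oracle loss minus mean loss of a per-instance selector."""
    n = len(targets)
    selected = [h if c else a for h, a, c in zip(human, ai, choose_human)]
    protocol = sum(squared_loss(p, y) for p, y in zip(selected, targets)) / n
    oracle = sum(
        min(squared_loss(h, y), squared_loss(a, y))
        for h, a, y in zip(human, ai, targets)
    ) / n
    return oracle - protocol

rng = random.Random(0)  # fixed seed so the run is reproducible
gaps = []
for _ in range(1000):
    n = 20
    targets = [rng.gauss(0.0, 1.0) for _ in range(n)]
    human = [y + rng.gauss(0.0, 1.0) for y in targets]
    ai = [y + rng.gauss(0.0, 0.5) for y in targets]
    choose_human = [rng.random() < 0.5 for _ in range(n)]  # arbitrary selector
    gaps.append(selector_gap(human, ai, targets, choose_human))

max_gap = max(gaps)
print(max_gap <= 1e-12)  # the gap never becomes positive, up to rounding
```

The same inequality holds for any loss and any selection policy, since the argument is made per instance.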

However, there are methodological concerns. The numerical illustrations are limited to synthetic human predictions on the California Housing dataset, which constrains the empirical validation. The framework assumes all agents predict the same target on the same dataset, excluding settings where agents have access to different information—a scenario Rastogi et al. (2023) identified as a primary source of real-world complementarity. The restriction to binary tree composition, while mathematically motivated, excludes higher-arity interactions that may be natural in practice.

Potential Impact

The paper's most impactful contribution is the classification impossibility result (Theorem 4), which has direct implications for empirical HAI research. If accepted, it suggests that many common probability-calibrated human-AI classification workflows cannot achieve complementarity under standard losses when using interpolation-based aggregation. This could redirect empirical work toward non-internal aggregation mechanisms or alternative benchmark definitions.
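A small numerical check illustrates why interpolation-based aggregation runs into this obstruction. The sketch below assumes binary cross-entropy as a representative endpoint-monotone loss and a convex combination as the internal rule; both choices, and the helper names, are illustrative assumptions. With a label in {0, 1}, the target sits at an endpoint of the probability interval, so no point between two predictions can be closer to it than the better of the two.

```python
import math
import random

def log_loss(p, y):
    """Binary cross-entropy for a probability p and a label y in {0, 1}."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

def interpolate(p, q, t):
    """An internal rule: the output always lies between the two inputs."""
    r = t * p + (1.0 - t) * q
    return min(max(r, min(p, q)), max(p, q))  # guard against rounding drift

rng = random.Random(0)  # fixed seed so the run is reproducible
violations = 0
for _ in range(10_000):
    p, q = rng.uniform(0.01, 0.99), rng.uniform(0.01, 0.99)
    t, y = rng.random(), rng.choice([0, 1])
    best_single = min(log_loss(p, y), log_loss(q, y))
    # A violation would be an instance where interpolation beats both agents.
    if log_loss(interpolate(p, q, t), y) < best_single - 1e-12:
        violations += 1

print(violations)
```

No violations occur: on every instance the interpolated probability does no better than the better agent, so the averaged loss cannot beat the pointwise-min oracle.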

The regression results provide useful geometric intuition—the "AI residual correction" interpretation of N=2 complementarity is intuitive and actionable. However, the practical gap is significant: real HAI settings involve noisy, sequential, context-dependent interactions where the clean mathematical structure may not apply directly.
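The residual-correction reading for N=2 can be made concrete with the standard least-squares projection. The sketch below is not taken from the paper: it assumes the pooled prediction is written as ai + w * (human - ai) with w clipped to [0, 1], and the function name optimal_pool_weight and the toy data are hypothetical. The weight is then the projection of the AI residual (targets - ai) onto the disagreement vector (human - ai).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def optimal_pool_weight(human, ai, targets):
    """Weight w in [0, 1] minimizing ||ai + w * (human - ai) - targets||^2."""
    disagreement = [h - a for h, a in zip(human, ai)]
    ai_residual = [y - a for y, a in zip(targets, ai)]
    denom = dot(disagreement, disagreement)
    if denom == 0.0:
        return 0.0  # identical predictions: any weight gives the same output
    w = dot(ai_residual, disagreement) / denom
    return min(1.0, max(0.0, w))  # clipping is exact for a convex quadratic

def squared_distance(preds, targets):
    return sum((p - y) ** 2 for p, y in zip(preds, targets))

# Toy data (illustrative only): the two agents err in opposite directions.
human = [1.0, 3.0, 2.0]
ai = [3.0, 1.0, 2.5]
targets = [2.0, 2.0, 2.0]

w = optimal_pool_weight(human, ai, targets)
pooled = [a + w * (h - a) for h, a in zip(human, ai)]
pooled_dist = squared_distance(pooled, targets)
oracle_dist = sum(min((h - y) ** 2, (a - y) ** 2) for h, a, y in zip(human, ai, targets))
print(w, pooled_dist, oracle_dist)
```

In this example the pooled prediction is closer to the targets than either agent alone and also closer than the pointwise-min oracle, which is the kind of productive disagreement the geometric interpretation describes.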

The algebraic connections to associahedra and Tamari lattices are mathematically elegant but their practical relevance is unclear. The pentagon identity for N=4 is a coherence result that ensures consistency of reparameterizations, but it is not obvious how this guides HAI system design beyond confirming that the framework is internally consistent.
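What a reparameterization between neighboring trees amounts to can be seen for three leaves under linear pooling. The sketch below uses one possible parameter convention (an inner weight s and a root weight t), which is an assumption and may differ from the paper's; the function names are hypothetical. Rotating the tree ((x1, x2), x3) into (x1, (x2, x3)) changes the local parameters but leaves the induced leaf weights, and hence the protocol output and its loss, unchanged.

```python
def left_comb_weights(s, t):
    """Leaf weights of ((x1, x2), x3): inner weight s on x1, root weight t on the pair."""
    return (t * s, t * (1.0 - s), 1.0 - t)

def right_comb_weights(s, t):
    """Leaf weights of (x1, (x2, x3)): root weight t on x1, inner weight s on x2."""
    return (t, (1.0 - t) * s, (1.0 - t) * (1.0 - s))

def rotate(s, t):
    """Map left-comb parameters to right-comb parameters with equal leaf weights.

    Requires t * s < 1 so that the inner weight of the rotated tree is defined.
    """
    t_new = t * s
    s_new = t * (1.0 - s) / (1.0 - t_new)
    return s_new, t_new

s, t = 0.25, 0.5
before = left_comb_weights(s, t)
after = right_comb_weights(*rotate(s, t))
print(before, after)  # both triples sum to 1 and agree up to rounding
```

Each tree therefore acts as a different coordinate chart on the same simplex of leaf weights, which is why moving between trees in this way does not change whether complementarity is achieved.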

Timeliness & Relevance

The paper addresses a genuine gap. Complementarity is widely discussed in HAI literature but lacks formal multi-agent treatment. The timing is relevant given the proliferation of multi-agent AI systems and increasing deployment of AI-assisted decision-making in high-stakes domains. The distinction between aggregate and pointwise-min benchmarks is a valuable conceptual contribution that clarifies what "complementarity" should mean in different contexts.

Strengths

1. Clean formalization: The tree-based framework provides a principled way to model workflow-sensitive multi-agent interactions, filling a genuine gap in HAI theory.

2. Sharp impossibility results: Both the selector impossibility (Theorem 1) and the classification impossibility (Theorem 4) are clean, general, and have clear implications for practice.

3. Benchmark discussion: The careful treatment of pointwise-min vs. aggregate benchmarks (Proposition 1, Section 3.2.3) with concrete examples is clarifying for the field.

4. Geometric interpretation: The regression results provide actionable geometric intuition about when and why human-AI disagreement is productive.

5. Mathematical coherence: The connection between Tamari lattices, barycentric coordinates, and the pentagon identity demonstrates internal consistency of the framework.

Limitations

1. Limited empirical grounding: All numerical illustrations use synthetic human predictions on a single dataset. No real human-AI interaction data is used, making it unclear whether the framework captures phenomena observed in practice.

2. Strong assumptions: All agents predict the same target on the same dataset; this excludes information asymmetry, a key driver of real complementarity. The binary tree restriction, while mathematically clean, is acknowledged as a simplification.

3. Benchmark sensitivity: The classification impossibility depends critically on the pointwise-min benchmark. Under the aggregate benchmark (which the paper acknowledges is sometimes appropriate), the impossibility does not hold, substantially limiting the scope of the negative result.

4. Practical actionability gap: The framework identifies when complementarity is theoretically possible but offers limited guidance on achieving it in practice. The amplified logit pooling escape route (Section 8.1.1) is only briefly explored.

5. Scalability concerns: The combinatorial explosion of trees (Catalan numbers) and the joint optimization over trees and parameters (Equation 10) are not addressed computationally for realistic N.

6. Missing connections to related optimization: The relationship to existing work on forecast aggregation, expert combination, and ensemble methods could be more thoroughly explored. The Tamari/associahedron machinery, while elegant, may be over-engineered for the practical problems at hand.
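The scalability concern in item 5 can be quantified with a short calculation. The number of rooted planar binary trees with N leaves is the Catalan number C(N-1); the sketch below counts tree shapes only and does not account for the ordered agent-role configurations, which would multiply the search space further. The function name count_protocol_trees is hypothetical.

```python
import math

def catalan(n: int) -> int:
    """n-th Catalan number via the closed form binom(2n, n) / (n + 1)."""
    return math.comb(2 * n, n) // (n + 1)

def count_protocol_trees(num_leaves: int) -> int:
    """Number of rooted planar binary tree shapes with the given number of leaves."""
    if num_leaves < 1:
        raise ValueError("a tree needs at least one leaf")
    return catalan(num_leaves - 1)

for n in (2, 3, 4, 5, 10, 20):
    print(n, count_protocol_trees(n))
```

For N = 4 there are five trees, matching the five vertices of the pentagon, while for N = 20 the count already exceeds one billion.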

Overall Assessment

This is a mathematically sophisticated paper that provides a novel formal framework for an important concept in HAI research. The impossibility results are the strongest contributions, potentially redirecting empirical research. However, the gap between the mathematical idealization and real-world HAI practice is substantial, and the paper would benefit from empirical validation with actual human-AI interaction data. The algebraic machinery, while internally beautiful, may exceed what the application domain requires at this stage.

Rating: 6.2 / 10

Significance 6.5 · Rigor 7.5 · Novelty 7 · Clarity 7

Generated Jun 5, 2026

Comparison History (20)

vs. EpiEvolve: Self-Evolving Agents for Streaming Pandemic Forecasting under Regime Shifts

gpt-5.2 · 6/6/2026

Paper 2 has higher potential impact: it introduces a general, mathematically rigorous framework for multi-agent complementarity that applies broadly across human-AI interaction protocols, aggregation theory, and learning theory. Its formalism yields multiple theorems (impossibility results, equivalences, invariances) with cross-domain relevance and likely to influence how workflows and evaluation baselines are defined. Paper 1 is timely and applied with strong empirical gains for streaming epidemiological forecasting, but its scope is narrower (one domain/task) and depends on specific agent/memory design choices, making its broader theoretical spillover smaller.

vs. Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

gemini-3.1 · 6/6/2026

Paper 1 introduces a novel, practical benchmark for evaluating interactive reasoning in LLMs, a highly active and critical area of AI research. Its empirical focus and executable framework ensure broad applicability and immediate adoption by the AI community. While Paper 2 offers strong methodological rigor and theoretical insights into Human-AI interaction, its abstract mathematical nature likely restricts its immediate practical impact and breadth of application compared to a highly relevant LLM evaluation benchmark.

vs. DPBench: Structural Determinants of Multi-Agent LLM Coordination Under Simultaneous Resource Contention

gpt-5.2 · 6/6/2026

Paper 2 has higher likely impact: it introduces a concrete, extensible benchmark (DPBench) for a timely, fast-moving problem—multi-agent LLM coordination—yielding actionable, protocol-level findings (deadlock determinants, effects of pre-commitment and concurrency primitives) with immediate relevance to deployed agent systems. Its applications span AI safety, distributed systems, HCI, and benchmarking, and results are empirically testable and reproducible. Paper 1 is mathematically novel and rigorous, but its impact may be narrower and more theoretical, with less direct near-term applicability.

vs. Harnessing Generalist Agents for Contextualized Time Series

gemini-3.1 · 6/5/2026

Paper 1 offers a rigorous, foundational mathematical framework for human-AI complementarity, proving fundamental limits and possibilities in multi-agent workflows. While Paper 2 provides a highly practical LLM-based tool for time series, Paper 1's theoretical insights into when complementarity is mathematically obstructed or attainable will likely have a deeper, longer-lasting impact on the design of human-AI collaboration systems across various disciplines.

vs. No Need to Train Your RDB Foundation Model

claude-opus-4.6 · 6/5/2026

Paper 1 addresses the highly practical and timely problem of foundation models for relational databases, proposing a training-free approach with theoretical grounding and an open-source tool (RDBLearn). Its immediate applicability to enterprise data, scalability via SQL primitives, and connection to the booming foundation model ecosystem give it broad real-world impact. Paper 2 provides elegant theoretical formalization of complementarity in human-AI interaction with interesting mathematical results (Tamari lattice connections, impossibility theorems), but its impact is more niche—primarily theoretical, with narrower practical implications and a smaller target audience.

vs. TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

gpt-5.2 · 6/5/2026

Paper 2 offers a broadly applicable, mathematically rigorous formal framework for multi-agent human–AI complementarity, with multiple theorems (impossibility results, equivalences, invariances) that can reshape how HAI protocols and aggregation are designed across tasks and fields. Its insights (e.g., when complementarity is impossible in classification under natural losses) are likely to influence theory and practice in HAI, ML aggregation, and decision sciences. Paper 1 is timely and useful engineering for LLM context management, but is more incremental/system-specific with narrower cross-field impact despite clear applications.

vs. InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning

gemini-3.1 · 6/5/2026

Paper 1 addresses a highly timely and critical bottleneck in modern AI: the computational cost and verbosity of LLM reasoning traces. By introducing a principled, entropy-based reward framework (InfoDensity) to optimize reasoning efficiency without sacrificing quality, it offers immediate, practical applications for training state-of-the-art models. While Paper 2 provides excellent theoretical foundations for Human-AI complementarity, Paper 1's empirical success in the rapidly moving field of LLM reasoning gives it significantly higher potential for immediate widespread adoption, citations, and real-world impact across the AI community.

vs. Agents' Last Exam

gpt-5.2 · 6/5/2026

Paper 1 (Agents’ Last Exam) likely has higher scientific impact due to its timely, broadly useful benchmark targeting economically meaningful, long-horizon agent performance—an urgent bottleneck for real-world deployment. Its scale (1K+ tasks, 250+ experts), living design, and clear current performance gap make it a potential standard evaluation instrument across academia and industry, influencing multiple subfields (agents, evaluation, alignment, applied ML). Paper 2 is theoretically novel and rigorous but narrower in immediate applicability and community uptake.

vs. Ontology-Constrained Neural Reasoning in Enterprise Agentic Systems: A Neurosymbolic Architecture for Domain-Grounded AI Agents

claude-opus-4.6 · 6/5/2026

Paper 1 provides a rigorous mathematical framework formalizing complementarity in human-AI interactions, proving fundamental impossibility and possibility results (e.g., complementarity is attainable in regression but obstructed in classification). These theoretical contributions—connecting to combinatorial structures like Tamari lattices and the pentagon identity—have broad implications for the HAI field and could reshape how researchers design collaborative AI systems. Paper 2, while practically valuable with its enterprise architecture and empirical results, is more narrowly applied to a specific platform (FAOS) and its contributions are incremental engineering advances rather than foundational scientific insights.

vs. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to strong real-world applicability (hardware design automation), timeliness (LLM training with PRM/MCTS/RAFT for long-horizon code generation), and clear empirical gains with ablations, suggesting methodological rigor and immediate adoption potential. Its approach can generalize to other program-synthesis and structured-generation domains, broadening cross-field impact (ML, EDA, software engineering). Paper 1 is conceptually novel and rigorous, but its impact may be more theoretical and narrower, with key results being obstructions in common classification settings, which can limit near-term practical influence.

vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

gemini-3.1 · 6/5/2026

Paper 1 offers a foundational theoretical framework for Human-AI interaction, establishing mathematical proofs for when complementarity is achievable or obstructed. This deep methodological rigor provides long-lasting scientific value. In contrast, Paper 2 addresses a practical systems engineering and benchmarking flaw (Python GIL bottlenecks). While highly relevant for current LLM deployments, Paper 1's contributions represent a more significant and enduring advancement in AI theory and cognitive modeling.

vs. Adaptive Human-AI Coordination via Hierarchical Action Disentanglement

claude-opus-4.6 · 6/5/2026

Paper 2 introduces a novel theoretical framework with formal mathematical proofs establishing fundamental limits and possibilities of human-AI complementarity. Its four proven results—including impossibility theorems for classification and connections to algebraic structures like Tamari lattices and the pentagon identity—provide deep foundational insights applicable across all HAI research. While Paper 1 offers a solid engineering contribution in a specific domain (Overcooked-AI), Paper 2's theoretical contributions are more broadly impactful, addressing a central open question in HAI with rigorous formalization that will likely influence how researchers design and analyze human-AI systems across multiple fields.

vs. Can LLM Agents Sustain Long-Horizon Organizational Dynamics?

gemini-3.1 · 6/5/2026

Paper 1 establishes foundational theoretical limits and possibilities for Human-AI complementarity. Its rigorous mathematical proofs, particularly the impossibility results in classification, will fundamentally shape future HAI system design across disciplines. Paper 2 is highly timely and practical, but offers an application-specific framework that may have a shorter scientific shelf-life compared to the fundamental theorems presented in Paper 1.

vs. On Information Self-Locking in Reinforcement Learning for Active Reasoning of LLM agents

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical bottleneck in the highly active field of LLM agents and reinforcement learning. By identifying a fundamental failure mode (information self-locking) and providing a highly effective mitigation strategy with substantial empirical gains, it offers immediate and widespread utility for AI developers. Paper 2, while offering rigorous mathematical formalization for Human-AI complementarity, is highly theoretical and likely to have a narrower, more specialized impact compared to the broad real-world applicability and timeliness of Paper 1's contributions to agentic reasoning.

vs. CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

gemini-3.1 · 6/5/2026

Paper 2 addresses a highly critical and timely challenge in modern AI: improving reasoning in Large Language Models through Reinforcement Learning with Verifiable Rewards (RLVR/GRPO). Given the immense current interest in advancing LLM reasoning capabilities, its practical algorithmic improvements are likely to see rapid adoption and high citation counts. While Paper 1 offers strong theoretical foundations for Human-AI interaction, Paper 2's direct applicability to state-of-the-art LLM training pipelines gives it a broader and more immediate scientific impact.

vs. Visual Graph Scaffolds for Structural Reasoning in Large Language Models

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it introduces a novel, general formalization of multi-agent human–AI complementarity with rigorous theoretical results (impossibility theorems, equivalences, invariances) that can reshape how HAI protocols and aggregation are designed across tasks. Its conclusions (complementarity achievable in regression but obstructed in classification under broad losses) are broadly applicable and timely for multi-agent systems and evaluation. Paper 1 is timely and useful for LLM reasoning, but appears more empirical/system-specific and narrower in cross-field theoretical reach.

vs. PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

gpt-5.2 · 6/5/2026

Paper 1 likely has higher impact due to its timely, large-scale benchmark for LLM negotiation—an economically important, high-visibility capability—and its combination of human-anchored evaluation, multi-regime agent interactions, and improved payoff-ranking methodology. The benchmark and behavioral profiling are immediately reusable by many groups and can influence model development, safety, and product deployment. Paper 2 is mathematically novel and rigorous, but is more specialized (protocol-tree formalism and impossibility results) and may have narrower near-term adoption outside HAI theory and aggregation research.

vs. FALSIFYBENCH: Evaluating Inductive Reasoning in LLMs with Rule Discovery Games

claude-opus-4.6 · 6/5/2026

FALSIFYBENCH addresses the timely and high-impact question of whether LLMs can perform scientific inductive reasoning, directly relevant to the rapidly growing field of AI agents for science. Its benchmark methodology is practical, reproducible, and applicable across model families. The finding that negative testing (falsification) is the primary driver of success provides actionable insights for improving LLM reasoning. Paper 2, while mathematically rigorous and theoretically interesting, addresses a narrower audience with its formal complementarity framework for HAI. Its impossibility results for classification are notable but may limit practical uptake. Paper 1's broader relevance to LLM evaluation and scientific discovery gives it higher potential impact.

vs. InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain

gpt-5.2 · 6/5/2026

Paper 2 likely has higher near-term scientific impact: it introduces an implementable reward signal (answer-conditioned information gain) for training long-context memory agents, a timely and widely relevant problem in LLM systems. It demonstrates empirical gains under a standard RL framework and provides code, facilitating adoption and follow-up work across NLP, RLHF, and agentic retrieval/memory. Paper 1 is mathematically novel and rigorous, but its impact may be more niche (formal HAI theory) and its negative results for classification may limit immediate practical uptake compared to Paper 2’s directly deployable method.

vs. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to immediate real-world relevance and adoption potential: it introduces a practical, reproducible benchmark targeting a high-stakes domain (clinical GUIs) with safety-oriented evaluation, enabling standardized comparison across many agents and catalyzing progress. Its methodology (interactive tasks, deterministic checker, intent vs step goals, real-system validation on OpenEMR) supports rigorous empirical research. Paper 1 is more theoretically novel, but its impact may be narrower and slower to translate, especially given negative/impossibility results for common classification settings.