Attributing Emergence in Million-Agent Systems

Ling Tang, Jilin Mei, Qian Chen, Qihan Ren, Linfeng Zhang, Quanshi Zhang, Jing Shao, Xia Hu

May 12, 2026

arXiv:2605.11404v1

cs.AI (primary)

#100 of 2292 · Artificial Intelligence

Tournament Score: 1542 ± 46 (scale shown: 1050 to 1800)

Win Rate: 90%

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 5.5 · Clarity 7.5

Abstract

Large language models (LLMs) can simulate human-like reasoning and decision-making in individual agents. LLM-powered multi-agent systems (MAS) combine such agents to simulate population-scale social phenomena such as polarization, information cascades, and market panics. Such studies require attributing macro emergence to individual agents, but existing axiomatic methods scale combinatorially in $N$ and have been confined to $N \lesssim 10^3$, while the phenomena they explain occur at $N \geq 10^6$. We address this gap by adapting Aumann–Shapley path-integral attribution to LLM-powered MAS at million-agent scale; the resulting method satisfies all four axioms and runs four to five orders of magnitude faster than sampled Shapley on the same hardware. We use this method to test the scale gap empirically: across 14 days of public Bluesky data ($1,671,587$ active users), we compute the attribution at both full scale and the visibility-biased $N = 10^{2}$ convenience sample used by small-scale studies, and the two disagree structurally. At full scale the long tail and middle tier jointly carry the majority of the attribution; the biased small panel attributes almost everything to a few high-follower accounts. We then prove that under any nonlinear macro indicator the disagreement cannot be reduced by post-hoc rescaling: an Attribution Scaling Bias theorem shows that no global rescaling factor can reconcile small-scale and full-scale attribution. Full-scale attribution is therefore not a methodological choice but a theoretical requirement for any nonlinear macro indicator.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Attributing Emergence in Million-Agent Systems"

1. Core Contribution

The paper tackles a genuine and important gap: existing axiomatic attribution methods (Shapley-based) for multi-agent systems scale combinatorially in agent count N and have been limited to N ≲ 10³, while the social phenomena they aim to explain (polarization, cascades, market panics) occur at N ≥ 10⁶. The authors adapt the Aumann–Shapley path-integral attribution to LLM-powered MAS at million-agent scale, achieving 4–5 orders of magnitude speedup over sampled Shapley while preserving all four axioms (efficiency, symmetry, dummy, linearity).
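For readers unfamiliar with the technique, the standard Aumann–Shapley (path-integral) attribution can be written as below. This is the textbook form, not a transcription of the paper's equations: the notation is an assumption made here, with $f$ the macro indicator, $x_i$ a scalar feature for agent $i$, $x'$ a baseline configuration, and a straight-line path from $x'$ to $x$. The second identity is the efficiency axiom, which follows from the fundamental theorem of calculus applied along the path.

```latex
\phi_i(x) \;=\; (x_i - x'_i) \int_0^1
  \frac{\partial f}{\partial x_i}\bigl(x' + t\,(x - x')\bigr)\, dt,
\qquad
\sum_{i=1}^{N} \phi_i(x) \;=\; f(x) - f(x').
```

Each agent's share needs only the gradient of $f$ along a single path, which is why the cost grows with the number of integration steps times one gradient evaluation rather than with the number of coalitions.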

The key finding is an empirical demonstration that attribution conclusions flip structurally between small biased panels and full-scale analysis. At full scale on 1.67M Bluesky users, the long tail (bottom 90%) carries the majority of attribution; visibility-biased convenience samples of N=100 concentrate attribution on high-follower accounts. This is accompanied by a theoretical result (Attribution Scaling Bias theorem) showing that for nonlinear macro indicators, no global rescaling can reconcile small-scale and full-scale attribution.
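The rescaling claim can be illustrated with a toy example. The sketch below is not the paper's counterexample (which is not reproduced on this page); it assumes a zero baseline, a straight-line path, one scalar feature per agent, and a panel that is treated as if it were the whole population. Under those assumptions the variance attribution has the closed form `x_i * (x_i - mean) / n`, and the linear-mean attribution is `x_i / n`. All function names and numbers are hypothetical.

```python
from typing import List


def mean_attribution(x: List[float]) -> List[float]:
    """Attribution of the linear mean from a zero baseline: phi_i = x_i / n."""
    n = len(x)
    return [v / n for v in x]


def variance_attribution(x: List[float]) -> List[float]:
    """Aumann-Shapley attribution of the population variance from a zero
    baseline along the straight-line path: phi_i = x_i * (x_i - mean) / n."""
    n = len(x)
    mean = sum(x) / n
    return [v * (v - mean) / n for v in x]


def full_to_panel_ratios(full_phi: List[float], panel_phi: List[float]) -> List[float]:
    """Per-agent ratio full/panel for the agents present in both
    (here the panel is the last len(panel_phi) agents of the full set)."""
    shared = full_phi[-len(panel_phi):]
    return [f / p for f, p in zip(shared, panel_phi)]


# Hypothetical feature values for three agents; the panel keeps agents 2 and 3.
full = [1.0, 2.0, 6.0]
panel = full[1:]

linear_ratios = full_to_panel_ratios(mean_attribution(full), mean_attribution(panel))
variance_ratios = full_to_panel_ratios(
    variance_attribution(full), variance_attribution(panel)
)

print("linear mean ratios:", linear_ratios)    # one shared factor
print("variance ratios:   ", variance_ratios)  # factors differ per agent
```

For the linear mean, both shared agents have the same full-to-panel ratio, so a single factor maps panel attribution onto full-scale attribution. For variance the two ratios differ, so no single global factor can do so in this toy case.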

2. Methodological Rigor

Strengths in rigor:

The mathematical framework is clean. The Aumann–Shapley path integral is well-established in cooperative game theory, and the adaptation to continuous agent features is natural. The four analytic closed-form derivations (Equations 11–14) are verified against numerical integration to machine precision.

Theorem 1 is formally stated and proved with a minimal counterexample (N=3). The proof strategy via Hessian characterization of linearity (Lemma 1) is elegant.

The empirical methodology is thorough: 5 topics, 4 value functions, 4 sampling protocols, 10 seeds per cell, with extensive appendices covering robustness (baseline choice, path choice, integration steps, alternative top-tier anchors).
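The kind of closed-form check described in the first strength above can be sketched for one indicator. The paper's Equations 11–14 are not reproduced on this page, so the closed form used here is an assumption derived for this sketch only: population variance, zero baseline, straight-line path, one scalar feature per agent. Function names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def variance(x: Vector) -> float:
    n = len(x)
    mean = sum(x) / n
    return sum((v - mean) ** 2 for v in x) / n


def variance_gradient(x: Vector) -> Vector:
    # d/dx_i of the population variance is 2 * (x_i - mean) / n.
    n = len(x)
    mean = sum(x) / n
    return [2.0 * (v - mean) / n for v in x]


def aumann_shapley_numeric(
    grad: Callable[[Vector], Vector], x: Vector, baseline: Vector, steps: int = 64
) -> Vector:
    """Midpoint-rule approximation of the straight-line path integral."""
    phi = [0.0] * len(x)
    for k in range(steps):
        t = (k + 0.5) / steps
        point = [b + t * (v - b) for v, b in zip(x, baseline)]
        g = grad(point)
        for i in range(len(x)):
            phi[i] += g[i] * (x[i] - baseline[i]) / steps
    return phi


def variance_closed_form(x: Vector) -> Vector:
    # Assumed closed form for a zero baseline: phi_i = x_i * (x_i - mean) / n.
    n = len(x)
    mean = sum(x) / n
    return [v * (v - mean) / n for v in x]


x = [1.0, 2.0, 6.0]          # hypothetical feature values
zero = [0.0] * len(x)

numeric = aumann_shapley_numeric(variance_gradient, x, zero)
analytic = variance_closed_form(x)

max_gap = max(abs(a - b) for a, b in zip(numeric, analytic))
efficiency_gap = abs(sum(numeric) - (variance(x) - variance(zero)))
print("max |numeric - analytic|:", max_gap)
print("efficiency gap:", efficiency_gap)
```

Because the variance gradient is linear in the path parameter, the midpoint rule is exact up to floating-point rounding here; indicators with more curvature would need more integration steps to reach comparable agreement.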

Weaknesses in rigor:

The Aumann–Shapley attribution here operates on *static features* (follower count, post count, reply count) rather than on the dynamic simulation itself. The "attribution" doesn't actually trace causal influence through agent interactions—it attributes a macro *statistic* to individual feature vectors. This is a significant conceptual gap: the method doesn't require running any simulation at all.

The LLM-powered MAS validation (Appendix P) is limited: EconAgent has only 10 agents, SocialLLM has 20, and MidScale-Social has 1000. The claimed million-agent capability is demonstrated only on Bluesky data with analytic functions, not on actual LLM-powered simulations.

Theorem 1 is described by the authors themselves as "qualitative, not quantitative." It shows existence of configurations where rescaling fails but doesn't bound the magnitude of disagreement. The empirical flip is dominated by sampling bias (already ~20pp under linear f), not by the nonlinearity that the theorem addresses.

The connection between Bluesky observational data and LLM-powered MAS is assumed rather than demonstrated. The paper states the method "applies unchanged to LLM-powered MAS pipelines" but the combined "LLM-driven macro indicator at N=10⁶" is never tested.
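The third weakness above, that the flip already appears under a linear indicator because of how the panel is sampled, can be illustrated on synthetic data. The sketch assumes a Zipf-like feature distribution (`1 / rank`), a population of 100,000, and a panel made of the 100 largest accounts. None of these values come from the Bluesky dataset, and the function name is hypothetical.

```python
from typing import List


def tier_share(features: List[float], tier_size: int) -> float:
    """Share of a linear-mean attribution carried by the `tier_size`
    largest agents. Under f(x) = mean(x) with a zero baseline, agent i
    receives x_i / n, so the share reduces to a ratio of feature sums."""
    ranked = sorted(features, reverse=True)
    return sum(ranked[:tier_size]) / sum(ranked)


# Synthetic, assumed population: feature of the k-th ranked agent is 1 / k.
N = 100_000
population = [1.0 / rank for rank in range(1, N + 1)]

# Visibility-biased convenience panel: only the 100 most visible agents.
panel = sorted(population, reverse=True)[:100]

full_share = tier_share(population, 100)   # top 100 measured against everyone
panel_share = tier_share(panel, 100)       # top 100 measured against themselves

print(f"top-100 share at full scale: {full_share:.3f}")
print(f"top-100 share in the panel:  {panel_share:.3f}")
```

In the panel the top accounts carry everything by construction, while at full scale in this synthetic population they carry less than half. No nonlinearity is involved, which is the reviewer's point about where most of the empirical gap comes from.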

3. Potential Impact

The paper makes a valid methodological point: convenience sampling in MAS studies can produce structurally misleading attribution. This is important for the growing LLM-powered MAS community. The computational framework enabling million-agent attribution could influence how large-scale social simulations are analyzed.

However, the practical impact depends on whether researchers adopt nonlinear macro indicators where the theorem applies, and whether the analytic value functions used here capture meaningful social phenomena. The four chosen functions (linear mean, multiplicative heat, variance, Gini) are reasonable but generic statistical summaries rather than domain-specific emergence indicators.
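To make the distinction between linear and nonlinear indicators concrete, the sketch below implements three of the four named value functions using standard textbook definitions. These definitions are assumptions, since the paper's exact formulas are not given on this page, and the "multiplicative heat" function is omitted because its form is not stated here. The additivity check is a simple necessary condition for linearity.

```python
from typing import Callable, List

Vector = List[float]


def linear_mean(x: Vector) -> float:
    return sum(x) / len(x)


def variance(x: Vector) -> float:
    mean = linear_mean(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


def gini(x: Vector) -> float:
    """Gini coefficient via the sorted-rank formula (non-negative inputs,
    positive total assumed)."""
    n = len(x)
    ordered = sorted(x)
    weighted = sum((i + 1) * v for i, v in enumerate(ordered))
    return 2.0 * weighted / (n * sum(ordered)) - (n + 1) / n


def is_additive(f: Callable[[Vector], float], a: Vector, b: Vector,
                tol: float = 1e-9) -> bool:
    """True if f(a + b) == f(a) + f(b) for this pair of inputs."""
    combined = [u + v for u, v in zip(a, b)]
    return abs(f(combined) - (f(a) + f(b))) < tol


# Hypothetical feature vectors for three agents.
a = [1.0, 2.0, 6.0]
b = [3.0, 1.0, 1.0]

for name, f in [("mean", linear_mean), ("variance", variance), ("gini", gini)]:
    print(f"{name:8s} additive on (a, b): {is_additive(f, a, b)}")
```

Only the mean passes the check, so only the mean falls outside the scope of the scaling-bias theorem as described in this review; variance and Gini are the kind of indicator it covers.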

The result that "the long tail matters more than elites" at scale is consistent with well-known findings in social network analysis (e.g., Cha et al. 2010's "million follower fallacy"), reducing the novelty of the empirical finding itself.

4. Timeliness & Relevance

The paper is well-timed. LLM-powered MAS are proliferating rapidly, with systems like OASIS reaching million-agent scale. The question of how to attribute emergent phenomena is pressing, and the scalability limitation of existing methods is real. The use of Bluesky data is contemporary and the platform's open protocol makes the work reproducible.

The paper addresses a genuine bottleneck: the community needs attribution tools that work at scale. However, the gap between "attribution of a macro statistic over static features" and "attribution of emergent dynamics in an interactive simulation" remains substantial.

5. Strengths & Limitations

Key Strengths:

Clean mathematical framework with provable axiomatic guarantees

Massive computational speedup (10⁴–10⁵×) enabling new scales

Comprehensive empirical evaluation across multiple dimensions

Important cautionary message about convenience sampling

Strong reproducibility commitment

Notable Limitations:

The method attributes static feature aggregation, not dynamic interaction—it doesn't capture how agent A's post influenced agent B's response, which is the core of emergence attribution

The "million-agent" claim is validated only on analytic functions over observational data, not on actual LLM-powered simulations at that scale

Theorem 1's practical bite is limited: the empirical flip is primarily a sampling artifact visible even under linearity

The four value functions are somewhat arbitrary choices; no guidance on which captures real emergence

The paper's framing as "attributing emergence" overstates what is actually computed—it's closer to "attributing a summary statistic"

Overall Assessment

This is a technically competent paper that solves a real computational scaling problem and delivers a useful cautionary finding about biased sampling. The mathematical framework is sound and the experiments are thorough. However, there is a meaningful gap between the ambitious framing ("attributing emergence in million-agent systems") and what is actually demonstrated (attributing analytic statistics over static features). The theoretical result, while correct, is more a sanity check than a deep insight. The paper's greatest contribution may be the empirical demonstration that small biased panels produce structurally different attribution than full-scale analysis—a finding that, while not surprising to social network researchers, needed to be documented formally for the MAS community.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 5.5 · Clarity 7.5

Generated May 13, 2026

Comparison History (21)

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental methodological gap in multi-agent systems research by providing scalable attribution methods for million-agent LLM simulations, backed by a formal impossibility theorem (Attribution Scaling Bias) and validated on real-world Bluesky data. It bridges computational social science, game theory, and LLM-based simulation with broad interdisciplinary impact. Paper 1 contributes a solid incremental advance in latent reasoning models, but Paper 2's combination of theoretical contribution, practical scalability (4-5 orders of magnitude speedup), and the demonstration that small-scale studies are fundamentally inadequate for nonlinear indicators has broader and more transformative implications.

vs. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

gemini-3.1 · 5/16/2026

Paper 1 provides a foundational theoretical breakthrough (Attribution Scaling Bias theorem) and a highly scalable methodological solution to a critical problem in complex systems. Its combination of mathematical rigor, massive empirical validation, and broad applicability across computational social science and AI gives it higher potential for deep scientific impact compared to the system-level engineering contribution of Paper 2.

vs. Reinforcing VLAs in Task-Agnostic World Models

claude-opus-4.6 · 5/16/2026

Paper 1 addresses a fundamental methodological gap in multi-agent systems by providing a scalable attribution method with rigorous theoretical guarantees (the Attribution Scaling Bias theorem). It proves that small-scale studies cannot substitute for full-scale analysis under nonlinear indicators—a result with broad implications across computational social science, economics, and complex systems research. The combination of theoretical contribution, empirical validation on real-world data (1.67M Bluesky users), and the demonstration that common small-sample practices are fundamentally flawed gives it transformative potential. Paper 2, while valuable for robotics, represents a more incremental advance within VLA post-training.

vs. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

gemini-3.1 · 5/16/2026

Paper 1 offers a fundamental theoretical breakthrough by proving the Attribution Scaling Bias theorem, which invalidates small-scale approximations for complex multi-agent simulations. By scaling attribution to millions of agents, it fundamentally changes how macro-emergence must be studied across fields like computational sociology and AI. Paper 2 is a highly valuable applied tool that automates existing neuroimaging workflows, but it lacks the foundational theoretical implications and cross-disciplinary methodological paradigm shift presented by Paper 1.

vs. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

gpt-5.2 · 5/16/2026

Paper 2 has higher estimated impact: it introduces a broadly applicable, principled rate–distortion formulation of agent memory tied directly to decision quality, plus an online algorithm (DeMem) with near-minimax regret guarantees and demonstrated benchmark gains. This combines novelty with methodological rigor and clear real-world relevance to long-horizon LLM agents across many domains. Paper 1 is strong and timely for large-scale multi-agent social simulation and attribution, but its impact is narrower (specific to MAS attribution and nonlinear macro indicators) and depends more on access to million-agent settings/data, limiting breadth of adoption.

vs. Constant-Target Energy Matching: A Unified Framework for Continuous and Discrete Density Estimation

gpt-5.2 · 5/16/2026

Paper 1 is more novel and timely in addressing a core scalability bottleneck for LLM-powered million-agent social simulations, combining an axiomatic attribution framework, major computational speedups, large real-world validation (1.67M users), and a theoretical impossibility result (scaling-bias theorem) that could reshape methodology in computational social science and agent-based modeling. Its real-world applicability and cross-field impact (ML, economics, sociology, network science) are broad. Paper 2 is a solid unification for density estimation, but similar energy/score matching directions are crowded; impact is likely narrower and more incremental relative to Paper 1’s methodological and theoretical shift.

vs. BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models

gpt-5.2 · 5/16/2026

Paper 2 is more novel and broadly impactful: it introduces a scalable, axiom-satisfying attribution method for million-agent LLM-based simulations, enabling analyses previously computationally infeasible, and validates it on real-world population-scale social data. Theoretical contributions (axiom preservation, speedup claims, and an impossibility theorem about rescaling small-sample attributions) increase rigor and generality across computational social science, economics, and AI safety. Paper 1 is practical and timely for quantized LLM deployment, but its contribution is narrower and evidence is limited to small GSM8K shards.

vs. IG-Search: Step-Level Information Gain Rewards for Search-Augmented Reasoning

gpt-5.2 · 5/16/2026

Paper 2 likely has higher scientific impact due to a more novel, broadly applicable methodological advance (million-agent, axiom-satisfying attribution via Aumann–Shapley path integrals) that enables previously infeasible analyses. It demonstrates real-world relevance on 1.67M-user Bluesky data and provides a theoretical result (Attribution Scaling Bias theorem) establishing when small-scale studies fundamentally fail, affecting computational social science, agent-based modeling, interpretability, and ML evaluation. Paper 1 is a solid incremental RL improvement for search-augmented QA with modest gains and narrower scope.

vs. Agentic Discovery of Exchange-Correlation Density Functionals

gemini-3.1 · 5/16/2026

Paper 1 addresses a longstanding foundational challenge in density functional theory, a heavily relied-upon method in chemistry and materials science. Achieving a 9% improvement over a gold-standard baseline has immediate, widespread implications for physical science simulations. Additionally, its insights into AI benchmark gaming provide crucial methodological guidance for AI-assisted science. While Paper 2 offers a significant theoretical and scalable advance for computational social science, Paper 1's direct impact on fundamental physical modeling gives it a broader and more profound scientific footprint.

vs. Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

gpt-5.2 · 5/16/2026

Paper 1 combines strong novelty (adapting Aumann–Shapley path-integral attribution to million-agent LLM MAS), clear methodological rigor (axioms satisfied, large speedup, real-world Bluesky evaluation, and a formal impossibility theorem), and broad cross-field relevance (attribution, causal/credit assignment, computational social science, agent-based modeling, LLM systems). Its result that small-scale attribution is structurally unrecoverable for nonlinear indicators is a timely, high-leverage finding for empirical MAS research. Paper 2 is promising for alignment optimization but appears more incremental within multi-objective training.

vs. Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

gemini-3.1 · 5/16/2026

Paper 2 addresses a fundamental scalability bottleneck in multi-agent systems, extending attribution from thousands to millions of agents. Its theoretical proof of Attribution Scaling Bias invalidates common small-scale proxy methods, promising broad, paradigm-shifting impact across AI, complex systems, and computational social science. Paper 1 offers a valuable but more narrow algorithmic improvement for LLM reasoning traces.

vs. Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics

gpt-5.2 · 5/13/2026

Paper 2 has higher potential impact due to a more broadly applicable methodological advance: a scalable, axiomatic attribution technique enabling million-agent analyses, validated on real-world social data and accompanied by a theoretical impossibility result (Scaling Bias theorem). This directly addresses a major scalability bottleneck in computational social science and LLM-based multi-agent simulation, with clear cross-field relevance (economics, network science, interpretability, policy). Paper 1 is novel for enterprise AI deployment-shift robustness and introduces a benchmark, but its scope and applications are more domain-specific and likely narrower in scientific reach.

vs. OOM-Free Alpamayo via CPU-GPU Memory Swapping for Vision-Language-Action Models

gemini-3.1 · 5/13/2026

Paper 2 offers a fundamental methodological breakthrough by scaling attribution to million-agent systems and mathematically proving the invalidity of small-scale sampling for nonlinear macro indicators. While Paper 1 provides highly practical systems optimizations for running large VLA models on commodity GPUs, Paper 2's impact is broader and paradigm-shifting. It fundamentally dictates how computational social science and multi-agent simulations must be conducted in the future, offering rigorous proofs and massive empirical validation that bridge AI, economics, and sociology.

vs. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

claude-opus-4.6 · 5/13/2026

Paper 2 addresses a fundamental methodological gap in multi-agent systems research by providing a scalable attribution method for million-agent systems, backed by both theoretical proofs (Attribution Scaling Bias theorem) and empirical validation on real-world Bluesky data. Its contributions—adapting Aumann-Shapley attribution to LLM-powered MAS, demonstrating structural disagreement between small and full-scale analyses, and proving this cannot be corrected by rescaling—have broad implications across computational social science, economics, and AI safety. Paper 1, while practical, presents an incremental improvement (gradient-free memory-based learning) in a crowded agent framework space with less fundamental novelty.

vs. Reward Design for Physical Reasoning in Vision-Language Models

gpt-5.2 · 5/13/2026

Paper 1 is more novel and broadly impactful: it adapts Aumann–Shapley path-integral attribution to million-agent LLM-based MAS, overcoming a key scalability barrier (orders-of-magnitude speedup) and demonstrating that small-sample attribution can be structurally wrong, backed by a formal impossibility theorem. This combination of methodological innovation, real-world validation at 1.67M users, and a general theoretical result is likely to influence computational social science, interpretability/attribution, and large-scale simulation methodology. Paper 2 is valuable but more incremental (reward ablations) and narrower in scope.

vs. Semantic Reward Collapse and the Preservation of Epistemic Integrity in Adaptive AI Systems

gpt-5.2 · 5/13/2026

Paper 1 offers a concrete, scalable methodological advance (Aumann–Shapley path-integral attribution) with clear axiomatic grounding, large empirical validation at million-agent scale, and a formal impossibility theorem about rescaling bias—strong rigor and immediate utility for computational social science and multi-agent LLM research. Its results directly challenge common small-sample practices, likely shifting methodology broadly. Paper 2 introduces compelling conceptual framing (SRC) and a proposed framework (CRS) but appears largely argumentative/programmatic with limited formalization or empirical demonstration, making near-term impact and validation less certain.

vs. Lightweight LLM Agent Memory with Small Language Models

claude-opus-4.6 · 5/13/2026

Paper 1 introduces a fundamentally novel theoretical and methodological contribution—adapting Aumann-Shapley attribution to million-agent LLM systems—backed by a formal impossibility theorem (Attribution Scaling Bias) proving that small-scale studies cannot substitute for full-scale attribution under nonlinear indicators. This challenges widespread methodological assumptions in computational social science and multi-agent simulation. It combines mathematical rigor, large-scale empirical validation on real Bluesky data, and broad implications across social science, economics, and AI. Paper 2 is a solid engineering contribution but is more incremental, offering a modular memory architecture with modest F1 improvements and limited theoretical novelty.

vs. Adaptive Teacher Exposure for Self-Distillation in LLM Reasoning

claude-opus-4.6 · 5/13/2026

Paper 1 addresses a fundamental methodological gap in multi-agent systems research by enabling attribution at million-agent scale, proving theoretically that small-scale attribution cannot substitute for full-scale analysis (Attribution Scaling Bias theorem). This has broad implications across computational social science, economics, and complex systems research. The combination of theoretical contribution (axiomatic guarantees, impossibility theorem), massive empirical validation on real Bluesky data, and orders-of-magnitude computational speedup represents a more foundational advance. Paper 2, while solid, offers an incremental improvement to self-distillation training with narrower scope limited to LLM reasoning benchmarks.

vs. Adaptive Multi-Round Allocation with Stochastic Arrivals

gemini-3.1 · 5/13/2026

Paper 1 addresses a critical scalability gap in the rapidly growing field of LLM-powered multi-agent systems. By enabling million-agent analysis and proving a fundamental scaling bias in existing small-scale studies, it significantly advances computational social science and complex systems. While Paper 2 offers rigorous mathematical contributions to sequential resource allocation, Paper 1 has broader interdisciplinary implications, higher timeliness, and a larger potential paradigm-shifting impact on how macro-emergence is studied.

vs. WebUncertainty: Dual-Level Uncertainty Driven Planning and Reasoning For Autonomous Web Agent

gemini-3.1 · 5/13/2026

Paper 1 offers foundational advancements by proving that small-scale multi-agent simulations cannot accurately model population-scale phenomena due to attribution scaling bias. Its highly scalable methodology, theoretical rigor, and massive empirical validation (1.67M agents) have profound, cross-disciplinary implications for computational social science and complex systems, significantly outweighing the incremental algorithmic improvements for web agents presented in Paper 2.