SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

Linyue Pan, Yaoming Zhu, Lin Qiu, Xuezhi Cao, Xunliang Cai

Jun 2, 2026

arXiv:2606.03544v1

cs.AI (primary), cs.CL

#1051 of 3404 · Artificial Intelligence

Tournament Score: 1443±43

Win Rate: 68%

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Abstract

Self-improving language agents are typically evaluated in isolation: an agent attempts a task, receives feedback, and iteratively refines its own behavior. Yet agents increasingly operate alongside peers whose strategies and outcomes are publicly visible. This raises an under-studied question: when does shared experience produce improvements that self-improvement alone cannot achieve? We introduce SAGE (Social Agent Group Evolution), an evaluation framework that compares two compute-matched conditions: SocialEvo, where agents from five distinct model families co-evolve with access to all peers' histories; and SelfEvo, where each agent receives the same number of task attempts but sees only its own past, which is conventional in self-improving agent studies. We instantiate SAGE in three arenas: open-ended ML research, long-horizon economic planning, and strategic multiplayer play, evaluated across multiple evolutionary rounds. We find that group history is not a universal amplifier: the strongest agent does not exceed its self-evolution ceiling. However, agents that plateau under self-improvement can achieve significant breakthroughs when peer experience is available. In competitive settings, counterfactual controls reveal that agents improve generally rather than developing opponent-specific strategies. Across different forms of shared history, filtered peer traces and reflective summaries often outperform raw logs, indicating that social gains depend on abstraction rather than exposure volume. These findings reveal that peer-history gains are agent-specific, arena-dependent, and contingent on the capacity to abstract transferable knowledge from public traces.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SAGE — A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

1. Core Contribution

SAGE introduces an evaluation framework to answer a specific and timely question: does access to peer histories improve LLM agent performance beyond what self-improvement alone achieves? The key design choice is a compute-matched counterfactual — SocialEvo (agents see all peers' histories) versus SelfEvo (a focal agent receives the same total rollout budget but sees only its own past). This controls for the confound that social conditions might simply benefit from more total attempts. The framework is instantiated across three arenas (ML research via MLR-Bench, economic planning via DrugWars, and competitive board game play via Splendor) using five frontier model families.

The paper's central findings are nuanced rather than headline-grabbing: (1) social evolution is not a universal amplifier — the strongest agents don't exceed their self-evolution ceiling; (2) agents that plateau under self-improvement can break through with peer experience; (3) in competitive settings, gains appear to be general rather than opponent-specific (with one notable exception); and (4) filtered/abstracted history representations outperform raw logs. These findings are useful precisely because they resist simple narratives.

2. Methodological Rigor

Strengths in design: The compute-matched comparison is the paper's strongest methodological contribution. Much prior work on multi-agent interaction confounds social benefits with additional compute. SAGE explicitly matches rollout budgets, making the comparison principled.
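The budget matching described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and the values K=5, m=1, N=16 are taken from the figures cited elsewhere in this assessment (five model families, one rollout per agent per round, 16 rounds). It is not the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutBudget:
    """Hypothetical container for the compute-matching arithmetic."""
    num_agents: int          # K: agents in the SocialEvo population
    rollouts_per_agent: int  # m: rollouts each agent runs per round
    num_rounds: int          # N: evolutionary rounds

    def visible_per_round(self) -> int:
        # Both conditions expose K*m rollouts per round to the focal agent.
        return self.num_agents * self.rollouts_per_agent

    def own_per_round(self, condition: str) -> int:
        # SocialEvo: only m of the visible rollouts are the focal agent's own.
        # SelfEvo: all K*m rollouts are the focal agent's own attempts.
        if condition == "social":
            return self.rollouts_per_agent
        if condition == "self":
            return self.visible_per_round()
        raise ValueError(f"unknown condition: {condition}")

    def total_visible(self) -> int:
        return self.num_rounds * self.visible_per_round()


budget = RolloutBudget(num_agents=5, rollouts_per_agent=1, num_rounds=16)
print(budget.visible_per_round())     # 5
print(budget.own_per_round("social")) # 1
print(budget.own_per_round("self"))   # 5
print(budget.total_visible())         # 80
```

The point of the sketch is that the two conditions match on how many rollouts the focal agent can learn from, but differ by a factor of K in how many of those are its own attempts.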

Weaknesses and concerns:

Sample size and statistical power: The experiments use N=16 rounds with m=1 rollout per agent per round. Many confidence intervals in the forest plots (Figure 3) cross zero, making it difficult to draw firm conclusions. The MLR-Bench experiments use only a single randomly sampled task (task 142), severely limiting generalizability claims for that arena.

SelfEvo baseline design: In SelfEvo, the focal agent receives K×m rollouts per round (5, given K=5 agents and m=1), so it gets more iterations of self-improvement rather than exposure to diverse strategies. The comparison is "fair" in compute terms but arguably advantages SelfEvo in some arenas (more self-practice) while disadvantaging it in others (no diversity of approach). The paper acknowledges this but doesn't deeply probe the implications.

Splendor analysis: The TEG analysis and model-swap shadow test are creative, but the SelfEvo baseline for Splendor uses homogeneous same-model leagues rather than direct head-to-head comparisons, which introduces estimation noise. The "opponent-specific adaptation" finding rests on a single preselected pair (DeepSeek-to-Doubao), making it anecdotal rather than systematic.

Limited replication: Five model families is reasonable but fixed. The paper acknowledges this limitation but it constrains how generalizable the "agent-specific" findings truly are.
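To illustrate the statistical-power concern raised above, here is a minimal sketch of how an interval on the SocialEvo minus SelfEvo difference might be checked against zero. It assumes a percentile bootstrap over per-round paired differences, which may not be the procedure the paper uses, and the input values are synthetic placeholders rather than results from the paper. Function names are hypothetical.

```python
import random
import statistics
from typing import List, Tuple


def bootstrap_ci(diffs: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the differences with replacement and record each resample mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def crosses_zero(ci: Tuple[float, float]) -> bool:
    """True when the interval cannot rule out a zero effect."""
    lower, upper = ci
    return lower <= 0.0 <= upper


# Synthetic placeholders: 16 per-round differences (SocialEvo - SelfEvo).
noisy_diffs = [-1.0, 1.0] * 8        # no consistent direction
consistent_diffs = [0.5, 1.0] * 8    # always positive

print(crosses_zero(bootstrap_ci(noisy_diffs)))       # True
print(crosses_zero(bootstrap_ci(consistent_diffs)))  # False
```

With only 16 paired observations, a moderately noisy effect produces an interval that straddles zero, which is the pattern the assessment describes in the forest plots.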

3. Potential Impact

The paper opens a meaningful evaluation dimension. As LLM agents increasingly operate in shared environments (coding platforms, research assistants, collaborative tools), understanding when and how peer experience helps is practically important. Specific impact vectors:

Benchmark design: SAGE's compute-matched framework could become a template for evaluating multi-agent learning dynamics, moving beyond static benchmarks.

System design: The history-mode ablation (RQ4) has direct implications for how shared agent platforms should surface information — top-1 traces and leaderboard signals outperforming full history is an actionable design insight.

Agent architecture: The finding that social gains depend on abstraction capacity could motivate new memory and filtering mechanisms in agent scaffolds.
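The history-mode distinction mentioned above (full history versus top-1 traces versus leaderboard signals) can be sketched as a filter over a shared trace log. Everything here is hypothetical: the `Trace` fields, the mode names, and the placeholder data are assumptions for illustration, not SAGE's actual interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Trace:
    """Hypothetical record of one rollout in the shared history."""
    agent: str
    round_index: int
    score: float
    log: str


def build_history_view(history: List[Trace], mode: str) -> List[str]:
    """Return the text an agent would see under a given history mode."""
    if mode == "full":
        # Raw exposure: every log from every agent.
        return [t.log for t in history]
    if mode == "top1":
        # Filtered exposure: only the single best-scoring trace.
        if not history:
            return []
        best = max(history, key=lambda t: t.score)
        return [best.log]
    if mode == "leaderboard":
        # Abstracted exposure: best score per agent, no logs at all.
        best_by_agent: Dict[str, float] = {}
        for t in history:
            prev = best_by_agent.get(t.agent, t.score)
            best_by_agent[t.agent] = max(prev, t.score)
        ranked = sorted(best_by_agent.items(), key=lambda kv: kv[1], reverse=True)
        return [f"{agent}: {score:.2f}" for agent, score in ranked]
    raise ValueError(f"unknown history mode: {mode}")


# Placeholder data, not results from the paper.
history = [
    Trace("agent_a", 0, 0.40, "a tried plan X"),
    Trace("agent_b", 0, 0.70, "b tried plan Y"),
    Trace("agent_a", 1, 0.55, "a tried plan Z"),
]
print(build_history_view(history, "top1"))         # ['b tried plan Y']
print(build_history_view(history, "leaderboard"))  # ['agent_b: 0.70', 'agent_a: 0.55']
```

The design choice the sketch highlights is that the filtered and abstracted modes shrink what reaches the agent's context, which is one plausible reading of why they might help agents that struggle to extract signal from raw logs.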

However, the impact is somewhat limited by the framework's complexity and resource requirements (five frontier models over 16 rounds across three arenas), making adoption expensive. The findings, while interesting, are also somewhat expected — that information filtering matters more than raw volume is not surprising.

4. Timeliness & Relevance

The paper is well-timed. The explosion of self-improving agent systems (Reflexion, Voyager, Darwin-Gödel Machine, etc.) has created a natural next question about social learning. The paper correctly identifies that most agent evaluation occurs in isolation, and multi-agent evaluation typically focuses on collaboration or competition mechanics rather than evolutionary dynamics. The use of very recent models (GPT-5.4, Kimi-K2.5, DeepSeek-V3.2) positions it at the frontier.

The connection to emerging "agent ecosystems" — where multiple AI systems learn and operate in shared digital environments — makes this increasingly relevant for safety and alignment research as well.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framework with the compute-matched counterfactual design

Diverse arenas covering cooperative and competitive dynamics

Heterogeneous model population avoiding mono-model confounds

The history-mode ablation (RQ4) provides perhaps the most actionable findings

Honest reporting of mixed/null results rather than cherry-picking positive outcomes

Notable Limitations:

Single-task MLR-Bench evaluation severely weakens one of three pillars

Small sample sizes lead to wide confidence intervals and limited statistical power

The paper is primarily descriptive/evaluative — it identifies phenomena but offers limited mechanistic explanation for why certain agents absorb peer experience and others don't

No ablation of population size or composition effects

The three arenas, while diverse, are all relatively artificial; no evaluation in more naturalistic agent deployment settings

The paper doesn't explore what happens when agents can selectively query or interact with peers rather than passively observe histories

Additional Observations

The paper's contribution is more framework than finding. The specific results — mixed effects, arena dependence, agent specificity — are useful but not deeply surprising. The greater value lies in establishing a methodology for asking these questions rigorously. Whether SAGE becomes influential depends on whether the community adopts compute-matched social evaluation as a standard practice.

The reproducibility commitment (MIT-licensed code) is positive, though the reliance on proprietary frontier models limits true reproducibility.

Rating: 5.8 / 10

Significance 6 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Generated Jun 3, 2026

Comparison History (28)

vs. Stumbling Into AI Emotional Dependence: How Routine AI Interactions Reshape Human Connection

gpt-5.2 · 6/6/2026

Paper 2 likely has higher scientific impact: it introduces a general, compute-matched evaluation framework (SAGE) for a timely problem—social/peer effects in self-improving agent ecosystems—tested across multiple arenas with controlled comparisons and ablations on forms of shared history. The methodological design (SelfEvo vs SocialEvo, counterfactual controls) supports broader, reusable insights for AI research, multi-agent systems, and evaluation practice. Paper 1 is timely and societally important but appears more review/argument-driven and narrower in technical generalizability, limiting cross-field methodological impact.

vs. TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

claude-opus-4.6 · 6/6/2026

TRACE addresses a fundamental and practical challenge in multimodal time series foundation models—temporal misalignment and modality missingness—which is pervasive across healthcare, affective computing, and many other domains. Its methodological contribution (conditional estimation for cross-modal inference) is broadly applicable and integrates with the rapidly growing foundation model ecosystem. Paper 2 (SAGE) provides interesting insights about social vs. self-improvement in LLM agents, but its findings are more incremental (social learning helps weaker agents but not the strongest) and the evaluation framework, while novel, addresses a narrower and more transient research question. TRACE's real-world applicability and alignment with the foundation model trend give it higher impact potential.

vs. AgentProcessBench: Diagnosing Step-Level Process Quality in Tool-Using Agents

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it delivers a concrete, reusable benchmark with sizable human annotations and strong inter-annotator agreement, directly addressing a timely bottleneck (step-level verification for tool-using agents with irreversible side effects). Such benchmarks typically catalyze broad follow-on work across evaluation, RLHF/RLAIF, reward modeling, and test-time scaling. Paper 1 is novel and insightful for multi-agent/social learning evaluation, but its main finding is nuanced (not universally beneficial) and may be more niche and harder to standardize as a community reference artifact than a widely adoptable benchmark.

vs. Consensus is Strategically Insufficient: Reasoning-Trace Disagreement as a Knowledge-Representation Signal

gpt-5.2 · 6/5/2026

Paper 2 is likely higher impact due to its stronger methodological contribution: a compute-matched evaluation framework (SAGE) with controlled comparisons (SocialEvo vs SelfEvo), multiple arenas, multiple model families, and counterfactual controls, yielding nuanced, generalizable findings about when social information helps. It has broad applicability to agent learning, multi-agent systems, evaluation methodology, and alignment/safety, and is timely given rapid growth of agent ecosystems. Paper 1 is novel conceptually but is more of a representational proposal with a narrower demonstrated instantiation and less empirical validation.

vs. Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

gpt-5.2 · 6/5/2026

Paper 1 has higher likely impact: it introduces a compute-matched evaluation framework (SAGE) that isolates the causal effect of shared peer history vs self-improvement across multiple arenas, yielding broadly relevant findings for LLM agents, multi-agent learning, and evaluation methodology. Its insights (when/for whom social traces help; importance of abstraction vs raw logs) generalize beyond any single domain and are timely given rapid deployment of agent ecosystems. Paper 2 targets an important application, but appears more incremental (standard RL variants in a small-scale simulation) and its conclusions may be less generalizable and rigorous.

vs. MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

gpt-5.2 · 6/5/2026

Paper 1 has higher potential scientific impact due to its broader, more general contribution: a compute-matched evaluation framework isolating the causal effect of shared peer history on agent improvement across multiple arenas (research, planning, games). It yields nuanced, mechanistic findings (who benefits, when, and why abstractions beat raw logs) that can influence how multi-agent learning, self-improvement, and evaluation are done across fields. Paper 2 is highly applied and impactful industrially, but its core ideas are more domain-specific to lane-level mapping and system engineering.

vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

Paper 1 (EAPO) addresses a practical and timely problem—tool abuse in agentic RL—with a concrete, well-evaluated framework showing consistent improvements across multiple models and benchmarks. Its direct applicability to improving LLM agent efficiency gives it strong real-world impact. Paper 2 (SAGE) explores an interesting but more niche question about social evolution in agent ecosystems, with nuanced but less definitive findings (gains are agent-specific and arena-dependent). Paper 1's clear methodology, reproducible results, and immediate practical relevance give it higher estimated impact.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

claude-opus-4.6 · 6/3/2026

EvoBrain pioneers cross-task continual learning for EEG foundation models, addressing a critical bottleneck in BCI scalability with concrete technical contributions (NSN, RAD) and demonstrated improvements across six tasks. It opens a new research direction in EEG decoding with clear real-world applications in healthcare and neurotechnology. While SAGE provides interesting empirical insights about social vs. self-evolution in LLM agents, its findings are largely observational and incremental—showing peer history helps weaker agents but doesn't surpass self-evolution ceilings. EvoBrain's methodological novelty and practical impact in the growing EEG/BCI field give it broader and more lasting significance.

vs. SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents

claude-opus-4.6 · 6/3/2026

SAGE addresses a more fundamental and novel research question—whether social/shared experience among agents provides benefits beyond self-improvement—establishing an evaluation framework with rigorous compute-matched controls across diverse domains. Its findings (peer-history gains are agent-specific, arena-dependent, and abstraction-dependent) offer nuanced insights for the growing multi-agent ecosystem field. While SkillPyramid shows strong empirical gains in skill reuse, it represents a more incremental contribution to the well-studied area of skill/experience management. SAGE's broader conceptual framing and implications for multi-agent co-evolution give it wider cross-field impact potential.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 1 (SAGE) has higher likely impact: it offers a compute-matched, multi-arena evaluation framework for a timely question in agentic AI—when social learning beats isolated self-improvement—yielding nuanced, broadly relevant findings (agent-specific, arena-dependent, abstraction-dependent). Its methodology (controlled SocialEvo vs SelfEvo, counterfactual controls, multiple model families/arenas) is comparatively rigorous and generalizable to many agent research settings and benchmarks. Paper 2 (TBS) is novel for social simulation interpretability, but is narrower in scope (single domain evaluation) and likely impacts a smaller set of fields.

vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

gemini-3.1 · 6/3/2026

Paper 1 represents a significant breakthrough in formal mathematical reasoning, a major frontier for AI. By successfully solving the 2025 Putnam Competition and autonomously formalizing proofs for open mathematical challenges, it demonstrates immediate, high-impact real-world utility in automated theorem proving. Paper 2 provides valuable insights into agent ecosystems, but Paper 1's demonstrable state-of-the-art results on extremely difficult, objective benchmarks suggest a more profound and immediate impact on AI capabilities and mathematical research.

vs. Calibrating Urban Traffic Simulation from Sparse Road Observations via Genetic Optimization

gemini-3.1 · 6/3/2026

Paper 1 explores a highly timely and foundational problem in artificial intelligence: how multi-agent ecosystems learn and evolve compared to isolated agents. Its findings on socialized evolution have broad implications for the future of AI design across multiple domains. In contrast, Paper 2 presents a valuable but narrower application of established optimization techniques (genetic algorithms) to a specific civil engineering problem (traffic simulation calibration). Paper 1's generalizable insights into AI behavior give it a significantly higher potential for broad scientific impact.

vs. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

claude-opus-4.6 · 6/3/2026

scTranslation addresses a critical need in the rapidly growing single-cell multi-omics field by providing a systematic benchmark with standardized datasets, metrics, and evaluation scenarios. This infrastructure contribution will directly accelerate method development in computational biology, serving a large and active research community. While SAGE (Paper 1) introduces an interesting evaluation framework for social agent evolution with novel findings, its scope is narrower—focused on a niche aspect of LLM agent self-improvement. Paper 2's practical utility as an open-source benchmark, combined with the broader biomedical impact of single-cell genomics, gives it higher potential impact.

vs. An Exploration of Collision-based Enemy Morphology Generation

claude-opus-4.6 · 6/3/2026

SAGE addresses a timely and broadly impactful question about multi-agent LLM systems and social learning dynamics, which is highly relevant given the rapid proliferation of language agents. Its findings about when shared experience helps vs. self-improvement, abstraction over raw exposure, and agent-specific gains provide foundational insights for the growing field of autonomous AI agents. Paper 1, while novel in PCG for enemy morphology, addresses a narrower domain with more limited cross-disciplinary impact. Paper 2's multi-arena evaluation and implications for AI system design give it substantially broader reach.

vs. From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

gpt-5.2 · 6/3/2026

Paper 1 has higher scientific impact due to a more novel and generalizable experimental framework (compute-matched SocialEvo vs SelfEvo) and multi-domain empirical evaluation with controlled comparisons. Its findings inform core questions in multi-agent learning, evaluation methodology, and scaling of agentic systems, with broad relevance across AI research and deployment. Paper 2 is timely and practically important for governance/insurance, but is primarily a conceptual/diagnostic framework with less methodological rigor and narrower scientific reach, likely impacting policy/practice more than foundational research.

vs. When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

claude-opus-4.6 · 6/3/2026

SAGE addresses a timely and broadly impactful question about multi-agent social evolution in LLM ecosystems, with practical implications for how AI agents collaborate and learn. It spans multiple domains (ML research, economics, games), involves diverse model families, and provides nuanced findings about when social learning helps vs. self-improvement. Paper 2, while technically sound, addresses a narrower architectural design question (subgoal persistence in latent reasoning) with experiments limited to ARC benchmarks. SAGE's breadth, timeliness given the rise of agentic AI, and cross-disciplinary relevance give it higher potential impact.

vs. Forget Attention: Importance-Aware Attention Is All You Need

gemini-3.1 · 6/3/2026

Paper 1 proposes a fundamental architectural advancement by seamlessly integrating state space models (SSMs) into the attention mechanism. Core improvements to foundational language model architectures typically yield massive, cross-disciplinary impact, influencing how future models are trained and deployed. While Paper 2 offers a valuable evaluation framework for multi-agent systems, Paper 1 addresses a more central bottleneck in AI with a highly timely and widely applicable solution.

vs. Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a principled information-theoretic framework (PID) for understanding modality interactions in multimodal LLMs, addressing a fundamental gap in how we analyze and improve these increasingly ubiquitous models. Its novel extension to tri-modal systems (Sensory PID), generalizable findings across model families, and actionable insights (PID-guided reweighting) give it broad methodological impact. Paper 1 provides useful empirical findings about social vs. self-evolution in agents, but its conclusions are largely conditional ('agent-specific, arena-dependent'), limiting generalizability. Paper 2's framework is more likely to become a standard analytical tool across the multimodal AI community.

vs. Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

gemini-3.1 · 6/3/2026

Paper 2 addresses the highly timely and rapidly expanding field of multi-agent LLM ecosystems and self-improvement. It introduces a comprehensive evaluation framework applied across complex, relevant domains, offering nuanced insights into social versus self-evolution. In contrast, while Paper 1 presents an interesting application of causal inference, it explicitly notes that it is a preliminary exploration with limited scope, making its immediate breadth and magnitude of impact likely lower than Paper 2.

vs. The DeepSpeak-Agentic Dataset

claude-opus-4.6 · 6/3/2026

SAGE introduces a novel evaluation framework addressing an under-studied question about social vs. self-improvement in language agents, with rigorous methodology (compute-matched conditions, counterfactual controls) across diverse domains. Its findings about when and how peer experience helps agents have broad implications for multi-agent system design, a rapidly growing field. Paper 2 contributes a useful dataset for deepfake detection and human-agent interaction, but is more incremental—primarily a resource contribution rather than generating new scientific insights or methodology with broad theoretical implications.