Human-like in-group bias in instruction-tuned language model agents

Messi H. J. Lee

May 27, 2026

arXiv:2605.28114v1

cs.AI (primary)

#172 of 2682 · Artificial Intelligence

Tournament Score: 1529 ± 49 (scale 1050 to 1800)

Win Rate: 85%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8

Abstract

As autonomous AI agents are deployed in persistent, interacting networks -- coordinating tasks, routing resources, and accumulating reputational histories -- the social dynamics that emerge will determine who receives opportunity and who does not, at scales no human institution can supervise. We ran a controlled multi-agent simulation in which instruction-tuned language model agents interacted across 500 turns under three conditions manipulating group label salience and resource scarcity, across six model families with 20 seeds each. When group labels were visible, we observed in-group trust bias, action homophily, and network assortativity -- all absent when labels were hidden -- a pattern structurally consistent with salience-dependence in human social psychology. This discrimination was invisible to standard action-log audits: bias operated entirely through who received each action, not what actions were chosen, with action-type distributions showing no increase in negative actions across conditions. Per-turn in-group versus out-group differentials of 5 to 16 percentage points were statistically significant for all six models (Wilcoxon signed-rank, all Benjamini-Hochberg-corrected p < 0.001), establishing group-contingent targeting as a robust property of instruction-tuned language models across architectures and training regimes. Compounded through 500 turns of reciprocation, these differentials accumulated into in-group trust biases of +0.014 to +0.100 (d = 0.84-4.52) -- illustrating how modest per-interaction targeting propagates into structural inequality in persistent networks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper demonstrates that instruction-tuned language models, when deployed as agents in multi-agent social simulations, exhibit human-like in-group favoritism when arbitrary group labels are made visible. The key novelty lies in three interlocking findings: (1) the bias is salience-dependent — disappearing entirely when labels are hidden, mirroring Self-Categorisation Theory predictions; (2) the bias is covert — operating through differential targeting (who receives positive actions) rather than through increased negative actions, making it invisible to standard action-log audits; and (3) the bias is universal across six model families, suggesting it is a structural property of instruction-tuned LLMs rather than a model-specific artifact.

The paper bridges two previously separate literatures — static representational bias in LLMs (Caliskan et al., 2017) and multi-agent simulation (Park et al., 2023) — by asking whether static biases translate into dynamic behavioral discrimination when models act as persistent social agents. This is a genuinely important question that, to my knowledge, has not been rigorously addressed in this form.

Methodological Rigor

The experimental design is impressively systematic. The three-condition structure (hidden labels, visible labels, visible labels + scarcity) provides clean causal inference about the role of label salience. The inclusion of six model families across 20 seeds each (360 total simulations of 500 turns) provides substantial statistical power and generalizability. The use of invented labels (Kappa/Tilon) appropriately mirrors the minimal group paradigm, isolating label salience from semantic content.
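The run count quoted above follows directly from the design grid. A minimal sketch of that arithmetic, assuming placeholder model names (the review does not list the six families) and descriptive condition labels chosen here for illustration:

```python
from itertools import product

# Placeholder identifiers: the six model families are not named in this review.
models = [f"model_{i}" for i in range(1, 7)]
# Descriptive labels for the three conditions (names are assumptions).
conditions = ["hidden_labels", "visible_labels", "visible_labels_scarcity"]
seeds = range(20)
TURNS_PER_RUN = 500

# One simulation per (model, condition, seed) combination.
runs = list(product(models, conditions, seeds))
total_runs = len(runs)
total_turns = total_runs * TURNS_PER_RUN

print(total_runs, total_turns)
```

Each (model, seed) pair appears under all three conditions, which is what makes within-seed comparisons between conditions possible.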

Several design choices strengthen the work: uniform random partner selection eliminates frequency-based homophily as a confound; the Condition D ablation (header-only labels) demonstrates that even minimal exposure suffices; the explicit-prompt ablation rules out prompt compliance as the primary mechanism; and the reasoning-trace analysis provides mechanistic evidence of active group-category encoding.

However, there are methodological concerns. The trust-update mechanics are deterministic and symmetric, so the simulation itself amplifies small targeting differentials through bilateral reciprocation. The paper acknowledges this (the amplification calibration in Table S1 is helpful), but the sevenfold variation in accumulated trust bias (+0.014 to +0.100) partly reflects simulation mechanics rather than pure model behavior. The primary model-output quantity, action homophily, shows a more modest fivefold range (+0.011 to +0.054). The paper is generally careful about this distinction, but the abstract and discussion sometimes conflate simulation outcomes with model properties.
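The two fold ranges can be checked from the endpoints quoted above. This is only a rough arithmetic comparison of cross-model spread, not a decomposition of how much the trust-update rules contribute; the function name is chosen here for illustration.

```python
# Endpoints quoted in the review (lowest and highest model).
trust_bias = (0.014, 0.100)        # accumulated in-group trust bias
action_homophily = (0.011, 0.054)  # primary model-output quantity

def fold_range(low, high):
    """Ratio of the largest to the smallest value in a range."""
    return high / low

trust_fold = fold_range(*trust_bias)
homophily_fold = fold_range(*action_homophily)

# How much wider the cross-model spread is for the simulation outcome
# than for the model-output quantity.
extra_spread = trust_fold / homophily_fold

print(f"trust bias: {trust_fold:.1f}x, homophily: {homophily_fold:.1f}x")
```

On these endpoints the accumulated metric spreads models apart somewhat more than the per-action metric does, which is consistent with the concern that part of the variation comes from the simulation rather than the models.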

The use of paired Cohen's d computed on within-seed differences inflates effect sizes relative to conventional benchmarks (acknowledged in the methods), which could mislead readers who compare to standard d = 0.2/0.5/0.8 thresholds. The reported d values of 0.84–4.52 sound extraordinary but are not directly comparable.
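To illustrate why paired effect sizes are not comparable to the conventional thresholds, the sketch below computes both versions of d on the same synthetic data. The numbers are invented for illustration and are not taken from the paper: 20 seeds with large seed-to-seed variation and a small, consistent within-seed shift.

```python
from statistics import mean, stdev

n_seeds = 20
# Synthetic per-seed outcomes (illustrative only): baselines vary widely
# across seeds, while the within-seed shift is small and consistent.
hidden = [0.30 + 0.02 * i for i in range(n_seeds)]
visible = [h + 0.03 + (0.01 if i % 2 == 0 else -0.01)
           for i, h in enumerate(hidden)]

# Paired d: mean of within-seed differences over their standard deviation.
diffs = [v - h for v, h in zip(visible, hidden)]
d_paired = mean(diffs) / stdev(diffs)

# Conventional d: mean difference over the pooled between-seed SD.
pooled_sd = ((stdev(hidden) ** 2 + stdev(visible) ** 2) / 2) ** 0.5
d_between = (mean(visible) - mean(hidden)) / pooled_sd

print(f"paired d = {d_paired:.2f}, between-group d = {d_between:.2f}")
```

The same mean shift yields a paired d roughly an order of magnitude larger here, because pairing removes the between-seed variance from the denominator. Values such as 0.84 to 4.52 should be read with that in mind.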

The 20-agent setup with a balanced 10/10 group split is quite specific. Whether these dynamics scale to larger populations, unbalanced groups, or more than two groups remains untested. The action space, while thoughtfully designed, is constrained — real agentic systems would have far more behavioral degrees of freedom.

Potential Impact

The paper's most consequential contribution may be the audit invisibility finding. The demonstration that discrimination operates entirely through targeting rather than action-type selection has immediate implications for AI governance. Current fairness auditing frameworks, designed for single-model outputs, are structurally blind to this class of emergent discrimination. This finding alone could reshape how regulators approach multi-agent system evaluation.
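The audit gap can be made concrete with a toy log. In the sketch below, the action names ("help", "ignore"), the counts, and the helper functions are all hypothetical. It shows only that two logs can have identical action-type distributions while differing in who receives the positive actions.

```python
from collections import Counter

def make_log(pos_in, pos_out, n_per_cell=100):
    """Toy log of (recipient_is_in_group, action) records.

    pos_in / pos_out are the numbers of positive actions directed at
    in-group and out-group recipients, out of n_per_cell encounters each.
    """
    log = []
    for same_group, n_pos in ((True, pos_in), (False, pos_out)):
        log += [(same_group, "help")] * n_pos
        log += [(same_group, "ignore")] * (n_per_cell - n_pos)
    return log

def action_type_share(log):
    """What an action-log audit sees: the share of each action type."""
    counts = Counter(action for _, action in log)
    return {a: c / len(log) for a, c in counts.items()}

def targeting_gap(log):
    """What an outcome-level audit sees: in-group minus out-group help rate."""
    def help_rate(same_group):
        actions = [a for s, a in log if s == same_group]
        return actions.count("help") / len(actions)
    return help_rate(True) - help_rate(False)

hidden = make_log(pos_in=60, pos_out=60)   # no group-contingent targeting
visible = make_log(pos_in=65, pos_out=55)  # 10-point targeting differential

print(action_type_share(hidden) == action_type_share(visible))
print(round(targeting_gap(hidden), 2), round(targeting_gap(visible), 2))
```

An audit that inspects only the action-type shares cannot tell the two logs apart; conditioning on the recipient's group exposes the differential.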

The work is highly relevant to the rapidly growing field of agentic AI deployment. As companies build systems where multiple LLM agents interact persistently — routing tasks, sharing information, accumulating trust — the paper provides concrete evidence that group-contingent dynamics will emerge spontaneously when group membership is visible. The practical recommendation is clear: any multi-agent system surfacing group identity should undergo outcome-level auditing.

The paper also contributes to AI safety more broadly by demonstrating that alignment training (instruction tuning, RLHF) does not eliminate these dynamics — and may even make them harder to detect by suppressing overtly negative actions while leaving the targeting channel open.

Timeliness & Relevance

This paper arrives at a critical moment. Multi-agent LLM systems are transitioning from research prototypes to production deployments. The paper's framing — that individual model fairness is insufficient when agents form persistent networks — addresses a genuine blind spot in current AI safety discourse. The finding that even two label exposures per interaction suffice to trigger discrimination (Condition D) has immediate practical relevance for system designers.

Strengths

Comprehensive experimental design: Six models × three conditions × 20 seeds, with multiple ablations (Condition D, explicit prompt, reasoning traces), provides unusually thorough coverage.

The covert discrimination finding is the paper's most novel and important contribution — it identifies a class of bias that existing audit methods cannot detect.

Cross-model universality: All six models show significant effects, establishing this as a property of the model class, not individual models.

Careful statistical methodology: BH correction, pre-specified directional hypotheses, appropriate effect-size reporting with caveats.

Strong theoretical grounding in social psychology (SCT, Realistic Conflict Theory, minimal group paradigm).
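For readers unfamiliar with the BH correction mentioned above, a minimal pure-Python sketch of Benjamini-Hochberg adjusted p-values follows. The six input p-values are hypothetical, one per model, and are not the paper's results.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for six models (illustrative only).
raw = [0.01, 0.04, 0.03, 0.005, 0.02, 0.5]
adj = benjamini_hochberg(raw)
print([round(p, 3) for p in adj])
```

A test is declared significant at false discovery rate q when its adjusted p-value is at most q. In practice an established implementation would normally be preferred over a hand-rolled one.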

Limitations & Weaknesses

Ecological validity: The simulation is highly stylized. Real multi-agent deployments involve richer action spaces, asymmetric roles, and more complex interaction structures. The paper's claims about "structural inequality" in future agent networks are extrapolations from a constrained simulation.

No frontier models tested: The 7-12B parameter range excludes GPT-4-class models that undergo substantially more intensive alignment, limiting generalizability to the systems most likely to be deployed at scale.

Amplification mechanics: The simulation's trust-update rules do substantial amplification work. The paper does not always distinguish clearly between what models do (5-16 pp targeting differentials) and what the simulation produces (trust biases, network assortativity).

Single authorship from an independent researcher with no institutional affiliation may raise reproducibility and peer review concerns, though the methodological transparency is high.

The scarcity manipulation (Condition C) yielded largely null results, undermining the Realistic Conflict Theory framing that features prominently in the introduction.

Causal mechanism remains unclear: Why do instruction-tuned models exhibit this behavior? The reasoning-trace analysis shows active category encoding but doesn't explain why models trained with alignment procedures still discriminate.

Overall Assessment

This is a well-executed study that identifies a genuine and important phenomenon — covert, salience-dependent in-group bias in LLM agents — with clear implications for AI governance. The cross-model universality and audit-invisibility findings are its strongest contributions. The work would benefit from testing frontier models and more ecologically valid simulation environments, but as an initial demonstration of the phenomenon, it sets a useful research agenda.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8

Generated May 28, 2026

Comparison History (20)

vs. Calibrating Conservatism for Scalable Oversight

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact: it proposes a new, general oversight framework (CCO) with online calibration via conformal methods, offering finite-time statistical guarantees without distributional assumptions—strong methodological rigor and broad relevance to agent safety/control. It demonstrates effectiveness on two salient agentic benchmarks and directly targets a timely, foundational problem (scalable oversight for autonomous agents). Paper 1 is important and timely for AI fairness in multi-agent settings, but is primarily empirical characterization of a failure mode with narrower methodological novelty and fewer cross-domain guarantees.

vs. On the Origin of Synthetic Information by Means of Steganographic Inheritance

gemini-3.1 · 5/28/2026

Paper 1 presents rigorous, large-scale empirical evidence of emergent social biases in AI agent networks. Its discovery that standard auditing fails to detect this bias, which compounds into structural inequality, has profound and immediate implications for AI safety, ethics, and multi-agent systems. While Paper 2 offers a creative evolutionary framing for AI watermarking, Paper 1's findings address a critical, unmonitored vulnerability in near-future autonomous AI deployments with high methodological rigor.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it identifies a robust, previously under-measured failure mode (in-group bias via targeting rather than action type) across multiple model families, with clear statistical evidence and strong implications for deploying LLM agents in multi-agent, persistent settings. Its findings generalize across architectures and connect to social psychology, fairness, auditing, and AI governance—broad cross-field relevance and timeliness as agentic systems proliferate. Paper 1 is valuable for safety/reliability in reasoning models, but its contribution is more narrowly scoped to abstention control and may be closer to incremental training/control refinements.

vs. Geometry over Density: Few-Shot Cross-Domain OOD Detection

gemini-3.1 · 5/28/2026

While Paper 1 presents a highly efficient technical solution for OOD detection, Paper 2 has broader potential scientific and societal impact. As the deployment of autonomous LLM agents scales, understanding their emergent social dynamics is critical. Paper 2's finding that in-group biases compound into structural inequality—and crucially, that these biases are invisible to standard action-log audits—fundamentally challenges current AI safety and auditing paradigms. This interdisciplinary impact across AI alignment, computational social science, and technology policy gives Paper 2 a more far-reaching scientific footprint.

vs. A Policy-Driven Runtime Layer for Agentic LLM Serving

gemini-3.1 · 5/28/2026

Paper 1 identifies and quantifies emergent sociological behaviors (in-group bias) in autonomous AI agents, a critical discovery for AI safety, alignment, and ethics. While Paper 2 offers significant practical improvements for multi-agent system serving infrastructure, Paper 1 has a broader interdisciplinary impact, addressing foundational questions about the societal implications, structural inequalities, and unmonitored discrimination in future AI ecosystems.

vs. Picid: A Modular Evaluation Infrastructure for Reproducible PHM Across Tasks and Domains

gemini-3.1 · 5/28/2026

Paper 1 addresses a highly timely and critical issue (AI agent bias and structural inequality) with broad implications across AI safety, ethics, and sociology. Its findings on emergent social dynamics in LLM networks have wider societal relevance and cross-disciplinary appeal compared to Paper 2, which provides a valuable but niche methodological infrastructure for Prognostics and Health Management.

vs. Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

claude-opus-4.6 · 5/28/2026

Paper 1 identifies a fundamental and broadly applicable failure mode in reward model optimization—that single-axis bias mitigations can redirect rather than eliminate optimization pressure. It provides formal theoretical grounding (regime taxonomy, impossibility results for audit-distribution evaluation), empirical demonstrations across multiple settings, and actionable prescriptions. This has immediate implications for the entire RLHF/alignment field and any iterative optimization pipeline. Paper 2 documents an interesting but more narrowly scoped finding about in-group bias in multi-agent LLM simulations. While relevant, Paper 1's theoretical depth and breadth of methodological impact across alignment research gives it higher potential impact.

vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

gpt-5.2 · 5/28/2026

Paper 2 has higher likely scientific impact due to broad, timely relevance to AI deployment and governance: it identifies a robust, previously under-audited bias mechanism (targeting/recipient effects) in instruction-tuned LMs across multiple model families with strong statistical evidence, with implications for multi-agent systems, safety, fairness, and policy. Its findings generalize across domains where LMs coordinate resources and reputations. Paper 1 is innovative and clinically valuable, but its impact is narrower (ECG/intervention simulation) and likely constrained by validation, data, and regulatory barriers compared with the cross-field implications of Paper 2.

vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

gemini-3.1 · 5/28/2026

Paper 1 presents a framework that accelerates the scientific method itself. By successfully automating parallel hypothesis generation and experimentation, it demonstrates state-of-the-art improvements across highly impactful and diverse fields like protein engineering and ML optimization. Its potential to act as a force multiplier for broad scientific discovery gives it a higher overall scientific impact compared to the crucial, yet more narrowly focused, AI safety and sociological findings of Paper 2.

vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

gemini-3.1 · 5/28/2026

Paper 2 investigates emergent social biases in interacting AI agents, addressing critical AI safety, ethics, and societal impact concerns. While Paper 1 offers valuable efficiency optimizations for VLMs, Paper 2's findings on how microscopic biases compound into structural inequality in autonomous networks have broader interdisciplinary implications across computer science, sociology, and policy, likely resulting in higher scientific impact.

vs. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

gpt-5.2 · 5/28/2026

Paper 1 has higher scientific impact due to stronger novelty and clearer, rigorously quantified findings about emergent discrimination in instruction-tuned LLM agents—an urgent, timely issue for multi-agent deployments and AI governance. It uses controlled experimental manipulations, multiple model families, many seeds, and statistical testing, yielding generalizable behavioral insights that standard audits miss. Its implications span AI safety, fairness, social science, and multi-agent systems. Paper 2 is promising and applicable but reads more like a systems/architecture proposal with case studies; novelty is more incremental (integration + engineering patterns) and methodological rigor/generalizability is less evident from the abstract.

vs. Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a timely and broadly impactful issue: emergent social biases in autonomous AI agent networks. It demonstrates that instruction-tuned LLMs exhibit human-like in-group discrimination that is invisible to standard audits, with rigorous methodology (6 model families, 20 seeds, 500 turns, statistical testing). This has significant implications for AI safety, fairness, and governance as multi-agent systems scale. Paper 1, while practically useful, is a straightforward engineering contribution combining known techniques (JL projections, scalar quantization) into a compact codec, with limited novelty and narrower impact.

vs. OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a fundamental and timely concern about emergent social biases in AI agent networks, with broad implications across AI safety, fairness, policy, and social science. Its finding that instruction-tuned LLMs exhibit human-like in-group bias—invisible to standard audits—is highly novel and relevant as autonomous AI agents are increasingly deployed. The rigorous multi-model, multi-seed experimental design strengthens its contribution. Paper 2, while technically solid as an open implementation of Huawei's UB protocol, is more incremental and narrowly scoped to datacenter networking hardware, with impact limited primarily to that community.

vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

gpt-5.2 · 5/28/2026

Paper 1 has higher potential scientific impact due to stronger novelty and broader relevance: it identifies emergent, human-like in-group bias in instruction-tuned LLM agents within controlled multi-agent simulations, highlights audit evasion via action-targeting (not action type), and quantifies compounding inequality over time across multiple model families and seeds with statistical tests. The findings generalize to many deployed agent-network settings (platform governance, safety, fairness, institutional policy). Paper 2 is a useful systems/design contribution for finance workflows, but is narrower in domain impact and appears less empirically rigorous/generalizable beyond the application context.

vs. Verifiable Benchmarking of Long-Horizon Spatial Biology

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a broadly impactful and timely problem — emergent discrimination in AI agent networks — with rigorous methodology (6 model families, 20 seeds, 500 turns, multiple conditions). Its finding that in-group bias is invisible to standard audits yet compounds into structural inequality has immediate implications for AI safety, policy, fairness, and multi-agent system design across many fields. Paper 1, while technically rigorous and valuable for spatial biology benchmarking, serves a narrower community and primarily reports low agent performance on a new benchmark rather than uncovering a fundamental behavioral phenomenon with broad societal relevance.

vs. A Query Engine for the Agents

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact: it identifies a novel, socially consequential emergent behavior (in-group bias) in instruction-tuned LLM agents, demonstrates robustness across six model families with controlled manipulations, and shows audit-evasion via targeting rather than action-type—insights directly relevant to safety, fairness, and multi-agent deployment. Its implications span AI alignment, computational social science, and governance, and are timely as persistent agent networks proliferate. Paper 2 is strong engineering with clear practical value, but its contribution is more incremental/tooling-focused and narrower scientifically.

vs. FedMPT: Federated Multi-label Prompt Tuning of Vision-Language Models

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to stronger cross-field relevance and timeliness: it identifies robust, architecture-agnostic emergent discrimination dynamics in instruction-tuned LLM agents, with clear implications for AI safety, governance, auditing, and deployment of multi-agent systems. Its controlled multi-agent design, multi-model/seed evaluation, and rigorous statistics support methodological strength, and the finding that standard action-log audits miss bias is practically important. Paper 2 is technically innovative for federated multi-label VLM adaptation, but its impact is narrower (MLR/federated CV) and appears more incremental relative to rapid prior work in prompt tuning and federated learning.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

claude-opus-4.6 · 5/28/2026

Paper 1 addresses a novel and critically important problem—emergent social biases in multi-agent LLM systems—with rigorous methodology (6 model families, 20 seeds, 500 turns, multiple conditions). Its finding that in-group bias is invisible to standard audits has profound implications for AI safety, governance, and fairness at scale. The breadth of impact spans AI safety, social science, and policy. Paper 2 makes solid technical contributions to dialogue RL with theoretical grounding, but addresses a more incremental problem (distribution shift mitigation) within a narrower scope. Paper 1's timeliness, given rapid autonomous agent deployment, amplifies its impact.

vs. Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical, timely issue in AI safety: emergent social biases in autonomous LLM agents. Its findings on how in-group biases compound into structural inequality have broad, interdisciplinary implications across computer science, sociology, and tech policy, impacting how multi-agent networks are audited and regulated. While Paper 2 offers a valuable methodological advancement for biomedical discovery, Paper 1's insights into fundamental AI behavior and alignment give it a wider breadth of impact and higher potential to shape the rapidly growing field of AI agent deployment.

vs. Advancing Graph Few-Shot Learning via In-Context Learning

gemini-3.1 · 5/28/2026

Paper 1 addresses the critical, highly timely issue of emergent social biases in autonomous AI agents. Its findings have broad implications across AI safety, ethics, sociology, and network science, offering high societal relevance. In contrast, Paper 2 presents a strong methodological advancement in a narrower subfield (graph few-shot learning), making its potential impact more specialized.