Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

Asaf Yehudai, Naama Rozen, Ariel Gera

May 28, 2026

arXiv:2605.30036v1

cs.AI (primary), cs.CL

#1417 of 2821 · Artificial Intelligence

Tournament Score: 1408 ± 48 (scale 1050 to 1800)

Win Rate: 62%

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Abstract

Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments -- over 5 million questions -- to evaluate value structures and value-behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper investigates whether LLMs can be systematically steered to exhibit coherent, human-like value structures using Schwartz's theory of basic human values, and whether these induced values translate into human-aligned behavioral patterns. The core novelty lies in the comprehensive, three-level analysis: (1) demonstrating that value-prompting induces internally coherent value structures in LLMs, (2) showing that value-behavior relationships in value-prompted LLMs correlate significantly with those observed in human psychological studies, and (3) exploring population-level simulation strategies that incorporate human value distributions. The paper claims to be the first comprehensive study of value–behavior relationships in LLMs, which is a meaningful distinction from prior work that examined either LLM values in isolation or behavioral steering without grounding in formal psychological theory.

2. Methodological Rigor

The experimental design is ambitious in scale (5M+ questions, 7 LLMs, 7 psychological instruments) and draws on well-validated psychological tools (PVQ, BFI-2, Prosocialness Scale, etc.). The use of established correlation-based comparison methods (MDS, Procrustes analysis, Pearson correlations of vectorized correlation matrices) is appropriate and well-grounded in the psychometric literature.
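As a rough illustration of the "Pearson correlations of vectorized correlation matrices" comparison named above, the sketch below flattens the strictly upper triangle of two correlation matrices and correlates the resulting vectors. It is a minimal sketch, not the paper's implementation: the function names (`pearson`, `upper_triangle`, `matrix_alignment`) and the 4x4 matrices are hypothetical and the numbers are illustrative only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def upper_triangle(matrix):
    """Flatten the strictly upper triangle of a square matrix, row by row."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def matrix_alignment(a, b):
    """Correlate the off-diagonal entries of two correlation matrices."""
    return pearson(upper_triangle(a), upper_triangle(b))


# Hypothetical correlation matrices (illustrative numbers, not paper data).
human = [
    [1.0, 0.6, -0.2, -0.5],
    [0.6, 1.0, 0.1, -0.3],
    [-0.2, 0.1, 1.0, 0.4],
    [-0.5, -0.3, 0.4, 1.0],
]
model = [
    [1.0, 0.7, -0.1, -0.6],
    [0.7, 1.0, 0.0, -0.2],
    [-0.1, 0.0, 1.0, 0.5],
    [-0.6, -0.2, 0.5, 1.0],
]

alignment = matrix_alignment(human, model)
print(round(alignment, 3))
```

The diagonal is excluded because it is always 1 and would inflate the agreement; only one triangle is used because the matrices are symmetric.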

However, several methodological concerns deserve attention:

Statistical testing: The bootstrap procedure (100 iterations × 500 samples) with one-sample t-tests against zero correlation is a reasonable but somewhat generous baseline. Testing against zero correlation does not test whether the alignment is *meaningfully high*—it only tests whether it's non-zero. A more informative approach would compare against random or shuffled baselines.

Multiple comparisons: With 7 models × 5 behavioral tests × multiple distribution strategies, there are many comparisons being made. While most results reach p < 0.01, formal correction for multiple testing is not discussed.

Human baselines: The human comparison data comes from diverse studies with varying sample characteristics (Australian donors, Israeli students, Italian young adults, etc.). Cross-cultural variability is acknowledged as a limitation but not controlled for. The paper does not report test-retest reliability of human correlation matrices, making it hard to know how close the LLM-human ceiling should be.

Temperature and repetition: Running each prompt 100 times at temperature 0.7 creates artificial variance that substitutes for genuine population variability. Whether this parametric choice biases the correlation structures is not explored.
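The shuffled baseline suggested under "Statistical testing" could look roughly like the sketch below. It is one possible reading of that suggestion, not a procedure from the paper: it randomly re-pairs the entries of two vectorized correlation matrices and counts how often the shuffled correlation matches or beats the observed one. The names `shuffled_baseline_p`, `human_vec`, and `model_vec` and all numbers are hypothetical. A stricter variant for correlation matrices would permute row and column labels jointly (a Mantel-style test) rather than individual entries.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def shuffled_baseline_p(x, y, n_shuffles=2000, seed=0):
    """One-sided permutation p-value against a shuffled-pairing baseline.

    Returns the observed correlation and the share of shuffles whose
    correlation is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = pearson(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if pearson(x, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical vectorized correlation entries (not paper data).
human_vec = [0.6, -0.2, -0.5, 0.1, -0.3, 0.4, 0.2, -0.4]
model_vec = [0.7, -0.1, -0.6, 0.0, -0.2, 0.5, 0.3, -0.3]

observed, p_value = shuffled_baseline_p(human_vec, model_vec)
print(round(observed, 3), round(p_value, 4))
```

Unlike a t-test against zero, this baseline asks how unusual the observed alignment is relative to pairings that destroy the structure while keeping the same set of values.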

The value-prompting technique itself is straightforward (a single descriptive paragraph per value), which is both a strength (simplicity, reproducibility) and a limitation (no exploration of prompt sensitivity beyond the brief comparison with "value-name only" prompting in the appendix).
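The multiple-comparisons concern raised above can be made concrete with a standard correction. With 7 models × 5 behavioral tests there are at least 35 comparisons, so a plain Bonferroni threshold at alpha = 0.05 would be 0.05 / 35 ≈ 0.0014. The sketch below applies the less conservative Holm step-down procedure; the function name `holm_bonferroni` and the p-values are hypothetical, and nothing here is taken from the paper.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans in the original order: True where the
    null hypothesis is rejected while controlling the family-wise error
    rate at `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank k, zero-based) is compared
        # against alpha / (m - k).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # every larger p-value is also retained
    return rejected


# Hypothetical p-values for five comparisons (not paper data).
p_values = [0.001, 0.04, 0.012, 0.3, 0.008]
decisions = holm_bonferroni(p_values)
print(decisions)
```

In this toy input, 0.04 would pass an uncorrected 0.05 threshold but is retained after correction, which is the kind of shift a reader would want to rule out for the reported results.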

3. Potential Impact

The paper has potential impact across multiple domains:

Computational social science: If LLMs can reliably simulate value-behavior relationships, they could serve as "computational sandboxes" for pre-testing psychological hypotheses before costly human studies. This is a compelling use case, though the paper's own results show enough variability across models and conditions to warrant caution.

AI alignment and safety: Understanding how values can be induced and how they coherently propagate to behavior is relevant to value alignment research. The finding that simple prompts can induce coherent value structures is practically useful.

Agent-based simulation: The population simulation strategies (H-Norm, H-Even, H-NP) represent a practical contribution for multi-agent simulation work, though these are relatively straightforward weighting schemes.
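To make the idea of a weighting scheme concrete, the sketch below allocates a fixed number of simulated agents across value prompts in proportion to a set of weights, using largest-remainder rounding. This is a generic illustration only: it does not reproduce H-Norm, H-Even, or H-NP, whose exact definitions are not given here, and the function name `allocate_agents` and the weights are hypothetical. The value names are taken from Schwartz's theory.

```python
def allocate_agents(weights, n_agents):
    """Split n_agents across keys in proportion to their weights.

    Uses largest-remainder rounding so the counts always sum to n_agents.
    """
    total = sum(weights.values())
    quotas = {k: n_agents * w / total for k, w in weights.items()}
    counts = {k: int(q) for k, q in quotas.items()}
    leftover = n_agents - sum(counts.values())
    # Hand the remaining agents to the keys with the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical share of people whose dominant value is each one (not paper data).
weights = {"benevolence": 40, "universalism": 30, "achievement": 20, "power": 10}
counts = allocate_agents(weights, 50)
print(counts)
```

Each count would then determine how many simulated respondents receive the corresponding value prompt.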

The dual-use concern raised in the ethics section—that the same techniques could be used to create convincing anti-social personas—is valid and important.

4. Timeliness & Relevance

This work is highly timely. The use of LLMs as proxies for human subjects is an active and growing research area, and the question of whether LLMs can faithfully reproduce known psychological structures is directly relevant. The paper addresses a genuine gap: prior work has shown LLMs can adopt personas, but the systematic evaluation of whether these personas exhibit coherent internal value structures with corresponding behavioral implications has been lacking.

The inclusion of recent models (Qwen3, GPT-OSS series) makes the results current, though the rapid pace of model development means findings may not generalize to future architectures.

5. Strengths & Limitations

Key Strengths:

Impressive scale and breadth of evaluation across models, questionnaires, and population strategies

Strong theoretical grounding in Schwartz's well-validated value framework

Clear three-level research question structure (coherence → alignment → simulation)

The finding that H-NP (using unprimed LLMs for the non-dominant value group) consistently outperforms other population strategies is an actionable insight

Good reproducibility potential given the straightforward prompting approach

The comparison between "Priming Only," "Test Only," and "Priming & Test" conditions provides useful ablation information

Notable Limitations:

The correlation values, while mostly statistically significant, are weak to moderate for some behavioral tests (e.g., the Prosocial scale shows values as low as -4.1 for Llama-3-8B, and Donation values hover around 45-50 for most models; these figures appear to be correlations scaled by 100)

No causal claims can be made—the paper acknowledges this regarding internal psychological states but could be more explicit about the implications

The population simulation strategies, while creative, remain simplistic. Real human populations have correlated value profiles, not just dominant single values

Limited exploration of prompt sensitivity; the value descriptions come directly from Schwartz and Sagiv (1995), and small wording changes could potentially alter results substantially

The paper does not address how RLHF/instruction tuning may have already instilled certain value biases in these models, potentially confounding the "induction" narrative

Missing comparison with non-value-based persona prompting (e.g., demographic prompting) as a baseline

Additional Observations:

The finding that model size does not consistently predict better value alignment is interesting and somewhat counterintuitive, suggesting that instruction tuning procedures may matter more than raw capability

The paper would benefit from a discussion of effect sizes in the human psychological literature to contextualize the LLM-human correlation magnitudes

The circular MDS structure appearing across models is visually compelling but largely expected, given that the prompting design mirrors the theory's structure

Summary

This is a solid, well-executed study that makes a meaningful contribution to the growing literature on LLMs as behavioral simulators. Its primary value lies in the comprehensive, psychologically-grounded evaluation framework rather than in any single surprising finding. The results are encouraging but not conclusive—the moderate correlation strengths for some behavioral tests suggest that value-prompted LLMs are useful approximations rather than faithful replicas of human value-behavior dynamics.

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Generated May 29, 2026

Comparison History (13)

vs. AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

gpt-5.2 · 5/29/2026

Paper 2 has higher estimated scientific impact due to stronger novelty in bridging validated psychological value theory with LLM behavior, exceptional methodological scale (5M questionnaire items) and comparability to human studies, and broad cross-field relevance (AI alignment, computational social science, psychology, HCI, policy). Its applications to population-level simulation and behavior modeling are widely useful and timely. Paper 1 is highly relevant and practical for agent safety, but key claims (e.g., parity with “GPT-5.4”) are hard to assess from the abstract and impact may be narrower to agent security/guardrails despite open release.

vs. OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical, timely bottleneck in autonomous AI: the 'Outcome-Process Gap' where task success masks dangerous or erroneous agent behaviors. By providing a large-scale dataset (OpenClawBench) and demonstrating that nearly 10% of 'successful' executions contain anomalies, it directly challenges current evaluation paradigms. While Paper 2 is an impressive interdisciplinary study on LLM personas, Paper 1 provides foundational infrastructure essential for the safe, reliable real-world deployment of agentic systems, giving it broader immediate utility and impact in AI safety and engineering.

vs. You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

gemini-3.1 · 5/29/2026

Paper 2 bridges AI and psychology by systematically studying and inducing human values in LLMs. Its massive experimental scale and interdisciplinary approach offer broad applications in AI safety, alignment, and simulating human populations for social sciences. Paper 1 offers a strong technical contribution to agentic skill evolution, but Paper 2's focus on value alignment addresses a more critical, universally relevant challenge with broader societal and cross-disciplinary impact.

vs. MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to its broader cross-disciplinary relevance (AI alignment, computational social science, psychology), strong timeliness around value alignment and human behavior simulation, and large-scale empirical methodology (5M+ questionnaire items) grounded in validated psychological instruments. Its results could influence evaluation standards, agent design, and policy-facing simulations. Paper 1 is novel and practically useful for low-data professional domains via web-interaction memory, but its impact is likely narrower (agentic retrieval/automation) and may be more incremental relative to rapidly evolving web-agent frameworks.

vs. RAISE: RAG Design as an Architecture Search Problem

gemini-3.1 · 5/29/2026

While Paper 1 offers a valuable practical benchmark for optimizing RAG systems, Paper 2 demonstrates higher potential scientific impact due to its profound interdisciplinary reach. By successfully bridging established psychological value theory with large-scale LLM behavior, it advances AI alignment, cognitive modeling, and computational social science. The ability to simulate psychologically grounded human populations opens up transformative applications across sociology, economics, and human-computer interaction, offering broader foundational scientific implications than an architectural search framework.

vs. From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

claude-opus-4.6 · 5/29/2026

Paper 2 has broader scientific impact due to its interdisciplinary relevance spanning AI, psychology, and social simulation. It addresses the fundamental question of whether LLMs can embody coherent human-like value structures, with implications for AI alignment, computational social science, and agent-based modeling. The massive experimental scale (5M+ questions) demonstrates methodological rigor. Paper 1, while valuable for PHM reproducibility, addresses a more niche engineering problem with narrower audience. Paper 2's findings about value-behavior relationships in LLMs are timely and relevant to the rapidly growing field of LLM alignment and human simulation.

vs. Governing Technical Debt in Agentic AI Systems

claude-opus-4.6 · 5/29/2026

Paper 1 presents a rigorous, large-scale empirical study (5M+ questions) grounded in established psychological theory, demonstrating novel findings about inducing human-like value structures in LLMs. It has broad interdisciplinary impact spanning AI, psychology, and computational social science, with clear applications in human behavior simulation. Paper 2 introduces useful conceptual frameworks (Agentic Technical Debt, Stochastic Tax) but is more of a position/governance paper without empirical validation, targeting a narrower audience of AI system managers rather than advancing fundamental scientific understanding.

vs. Do Agents Know What They Can't Do? Evaluating Feasibility Awareness in Tool-Using Agents

claude-opus-4.6 · 5/29/2026

Paper 1 addresses a practical and novel problem in AI agent systems—detecting infeasible tasks to reduce computational waste. It introduces a concrete pipeline (FeasiGen), validated benchmarks with 94% accuracy, new evaluation metrics, and reveals significant findings (73.9% false continue rate). This has direct implications for efficient deployment of tool-using agents, a rapidly growing field. Paper 2 contributes interesting findings on value alignment in LLMs but builds more incrementally on existing persona/role-playing research with less immediate practical impact and narrower methodological contribution.

vs. Laguna M.1/XS.2 Technical Report

gemini-3.1 · 5/29/2026

Paper 1 bridges AI and psychology by systematically inducing and evaluating human-like values in LLMs at a massive scale. This interdisciplinary approach offers deep insights into AI alignment and human behavioral simulation, promising broader theoretical and practical impacts across multiple fields. In contrast, Paper 2 is primarily an engineering-focused technical report for new coding models; while practically useful, it lacks the profound scientific novelty and methodological innovation of Paper 1.

vs. Demystifying Data Organization for Enhanced LLM Training

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact: it proposes broadly applicable, low-overhead guidelines and concrete methods for data ordering that can improve stability/efficiency across pre-training and SFT, affecting many LLM pipelines and reducing compute/data costs. This is timely and widely relevant to both academia and industry, with clear real-world adoption potential and reproducibility signals (code link, multi-scale experiments). Paper 1 is novel and large-scale, but its impact is narrower (value simulation/alignment) and more sensitive to prompt-based methodology and construct validity when mapping human value theory to LLM behavior.

vs. GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

gemini-3.1 · 5/29/2026

Paper 2 offers broader and more timely scientific impact by addressing the critical challenge of AI alignment and human behavior simulation in LLMs. By integrating psychological value theories with large-scale LLM experiments, it bridges AI and cognitive sciences, opening avenues for sociological simulations and safer AI agents. While Paper 1 presents a rigorous and practical application of graph neural networks to clinical disease prediction, Paper 2's foundational contribution to the rapidly expanding field of LLM behavior provides wider multi-disciplinary applicability and addresses an immediate, global AI research priority.

vs. GTA: Generating Long-Horizon Tasks for Web Agents at Scale

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact due to a clearer enabling contribution: a scalable, validated pipeline and dynamic benchmark with executable trajectories for long-horizon web agents—directly addressing a major bottleneck (process-level supervision). It offers broad real-world applicability (web assistants, automation), strong methodological emphasis (deterministic replays, systematic validation), and high timeliness as agentic LLMs are a fast-moving area where benchmarks drive progress. Paper 1 is novel and large-scale, but its impact may be narrower (value simulation/alignment) and more dependent on prompt-based induction rather than a reusable infrastructure artifact.

vs. ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

claude-opus-4.6 · 5/29/2026

ProjectionBench introduces a novel benchmark framework for evaluating LLMs' scientific hypothesis generation and reasoning capabilities under progressive information disclosure — a unique and timely contribution as AI-for-science accelerates. It addresses a critical gap (evaluating innovative reasoning vs. mere retrieval), tests cutting-edge models (GPT-5, Gemini 3.1), and has broad applicability across scientific domains. Paper 1, while methodologically rigorous, builds incrementally on existing work in value alignment and LLM persona simulation, with narrower applicability primarily in social science simulation. Paper 2's framework has greater potential to shape AI-driven scientific discovery.