BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces
Liangwei Yang, Jielin Qiu, Zixiang Chen, Ming Zhu, Juntao Tan, Zhiwei Liu, Wenting Zhao, Zhujun Lan
Abstract
Many decision-support settings require systems that adapt to individual users, but evaluation data for this problem remain limited. Existing benchmarks for user understanding often rely on simulated users or model-generated behavior, even though recent work cautions that model-based simulations can diverge systematically from human behavior. We introduce BehaviorBench, a benchmark for evaluating personalized decision modeling from real-world behavioral traces. BehaviorBench reconstructs wallet-level decision histories from observed public prediction-market and on-chain records, and organizes them into two complementary task layers: Belief prediction, which predicts a user's final revealed stance and confidence in a market, and Trade prediction, which predicts the direction and amount of individual transactions. Across 2,000 evaluation wallets, the benchmark contains 141,445 Belief instances and 1,485,972 Trade instances, with disjoint support pools for retrieval-based evaluation. We evaluate frontier and open-weight generative models under four history interfaces: no personalization, direct recent history, generated user profiles, and retrieved support-wallet evidence. Personalization improves Belief prediction more consistently than Trade prediction, model rankings change across task layers and metrics, and different history interfaces expose different failure modes. BehaviorBench provides an evaluation setting for studying whether personalized methods can use real-world behavioral evidence rather than simulated users alone.
AI Impact Assessments
Scientific Impact Assessment: BehaviorBench
1. Core Contribution
BehaviorBench introduces a benchmark for personalized decision modeling grounded in real-world behavioral traces rather than synthetic or simulated user data. The key novelty lies in (a) reconstructing longitudinal decision histories from public on-chain prediction market records, (b) organizing evaluation into two complementary task layers—Belief prediction (final stance and confidence) and Trade prediction (direction and amount of individual transactions), and (c) systematically comparing four history interfaces (no personalization, direct history, profile-based, retrieval-based) across multiple generative models.
The central empirical finding—that different behavioral targets require different history representations (profiles for beliefs, raw sequential context for trades)—is a meaningful conceptual contribution. This decomposition of "personalization" into distinct capabilities tied to abstraction level is well-motivated and has practical implications for system design.
2. Methodological Rigor
Strengths in construction: The cohort construction pipeline is thoroughly documented (Appendix F), with explicit quality-control filters for removing inconsistent trajectories, bursty activity, and extreme outliers. The temporal cutoff at block 80M (December 2025) to avoid regime shifts from automated activity is a thoughtful design choice. The use of disjoint support pools for retrieval prevents identity leakage—a common pitfall in personalization benchmarks.
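The disjointness property mentioned above is straightforward to verify mechanically. The following is a minimal sketch, not the paper's pipeline: the function name and the wallet identifiers are hypothetical, and it assumes each pool is available as a collection of wallet IDs.

```python
def overlapping_wallets(eval_wallets, support_wallets):
    """Return the sorted wallet IDs that appear in both pools.

    An empty result means the evaluation and support pools are disjoint,
    so retrieved support evidence cannot come from the evaluated wallet itself.
    Both arguments are hypothetical collections of wallet ID strings.
    """
    return sorted(set(eval_wallets) & set(support_wallets))


# Hypothetical wallet IDs, for illustration only.
eval_pool = ["0xaaa", "0xbbb", "0xccc"]
support_pool = ["0xddd", "0xeee"]
leaky_pool = ["0xccc", "0xddd"]

print(overlapping_wallets(eval_pool, support_pool))  # no overlap expected
print(overlapping_wallets(eval_pool, leaky_pool))    # one shared wallet
```

A check like this only rules out literal ID overlap; it would not detect one actor operating several wallets across both pools.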
Concerns: The paper evaluates only zero-shot prompting of generative models. No fine-tuned models, classical sequential prediction baselines (e.g., RNNs, transformers trained on the train split), or even simple statistical baselines (e.g., majority-class-per-wallet, rolling average) are included. This significantly limits interpretability of the results—we cannot tell whether the observed performance levels represent genuine reasoning about behavioral patterns or shallow heuristics. The 1.3M+ training instances go entirely unused.
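To make the missing reference points concrete, the two simple statistical baselines named above could look roughly like the sketch below. This is an illustration under assumptions, not something the paper provides: the record layout (wallet ID paired with a direction label, and a per-wallet list of past amounts) and the label values are hypothetical.

```python
from collections import Counter, defaultdict


def majority_class_per_wallet(history):
    """Map each wallet to its most frequent past label.

    `history` is an iterable of (wallet_id, label) pairs taken from each
    wallet's earlier trades (a hypothetical layout). Ties are broken in
    favor of the label seen first.
    """
    counts = defaultdict(Counter)
    for wallet, label in history:
        counts[wallet][label] += 1
    return {wallet: c.most_common(1)[0][0] for wallet, c in counts.items()}


def rolling_average(amounts, window=5):
    """Predict the next amount as the mean of the last `window` amounts.

    Returns None when the wallet has no history.
    """
    if not amounts:
        return None
    recent = amounts[-window:]
    return sum(recent) / len(recent)


# Hypothetical history, for illustration only.
past = [("w1", "BUY"), ("w1", "BUY"), ("w1", "SELL"), ("w2", "SELL")]
print(majority_class_per_wallet(past))
print(rolling_average([10, 20, 30], window=2))
```

Reporting these alongside the zero-shot results would show how much of the measured accuracy a wallet's own base rate already explains.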
The wallet-as-user assumption is acknowledged but not empirically investigated. Given that prediction markets attract bots and multi-wallet actors, the degree to which these "users" represent coherent decision-making entities is unclear. The "realness" filters (burst ratios, block concentration) are heuristic and their sufficiency is unvalidated.
Evaluation metrics are reasonable but limited. For Trade prediction, Median AE for amounts is sensible given skewed distributions, but the paper does not report distributional analysis of amount ranges, making it difficult to interpret whether a Median AE of ~10 is good or poor in context.
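One way to supply the missing context is to report Median AE next to a summary of the amount distribution. The sketch below uses only the Python standard library; the choice of median and interquartile range as the summary is an assumption made here for illustration, and the sample values are invented.

```python
import statistics


def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between true and predicted amounts."""
    return statistics.median(abs(t - p) for t, p in zip(y_true, y_pred))


def amount_summary(amounts):
    """Median and interquartile range of observed amounts.

    Uses statistics.quantiles with its default 'exclusive' method.
    """
    q1, q2, q3 = statistics.quantiles(amounts, n=4)
    return {"median": q2, "iqr": q3 - q1}


# Invented amounts, for illustration only.
true_amounts = [10, 20, 30]
predicted = [12, 15, 30]
print(median_absolute_error(true_amounts, predicted))
print(amount_summary([1, 2, 3, 4, 5, 6, 7]))
```

A Median AE of about 10 reads very differently when the median trade is 15 than when it is 500, which is why the scale needs to be reported with it.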
3. Potential Impact
Benchmark utility: The benchmark fills a genuine gap—most personalization benchmarks rely on explicit personas, dialogue histories, or synthetic users. Having a large-scale (~1.5M Trade instances, ~141K Belief instances) dataset of real sequential decision-making is valuable for the personalization and user modeling communities.
Cross-field relevance: The benchmark could interest researchers in computational social science, behavioral economics, recommendation systems, and agent memory design. The finding that compressed profiles work for stable beliefs but not for local actions has implications for memory architecture design in LLM agents.
Limitations on impact: The domain is narrow—prediction market trading by crypto-native users—raising questions about generalizability to broader decision-support settings (healthcare, education, financial planning). The paper acknowledges this but positions itself as if the findings are broadly relevant. The connection between on-chain trading behavior and "user understanding" as typically conceived in HCI/personalization research is somewhat tenuous.
4. Timeliness & Relevance
The paper addresses a timely concern: the gap between simulated and real user behavior in LLM evaluation. Multiple recent papers (cited appropriately) have documented sim-to-real gaps. However, the specific domain choice—crypto prediction markets—is niche and potentially transient. The benchmark's long-term value depends on whether Polymarket or similar platforms remain active and whether the behavioral patterns generalize.
The paper is also timely in the context of LLM agents and personalization, both active research areas. The four-interface evaluation framework provides a useful template for future personalization benchmarks.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations:
The paper's broad "decision support" framing overclaims relative to the actual domain; the benchmark is more accurately described as a sequential prediction task on trading data with personalization interfaces. The writing is clear and well organized, though the appendices are verbose. The artifact release appears thorough, with appropriate documentation.
The error complementarity analysis (upset plots) is a nice addition showing that ~61% of Belief errors are shared across all interfaces—useful for understanding benchmark difficulty but also suggesting that much of the challenge may stem from inherent unpredictability rather than inadequate personalization.
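The shared-error figure can be computed from per-interface error sets. The sketch below is illustrative only: it assumes the denominator is the union of all errors made by any interface, which may differ from the definition used in the paper, and the interface names and instance IDs are hypothetical.

```python
def shared_error_fraction(errors_by_interface):
    """Fraction of erroneous instances that every interface gets wrong.

    `errors_by_interface` maps an interface name to the set of instance IDs
    it mispredicted. The denominator is the union of all error sets, which
    is an assumption of this sketch.
    """
    sets = list(errors_by_interface.values())
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Hypothetical error sets, for illustration only.
errors = {
    "no_personalization": {1, 2, 3},
    "retrieval": {2, 3, 4},
}
print(shared_error_fraction(errors))
```

A high value under this definition indicates that switching interfaces rarely rescues an instance, which is consistent with the reading that much of the difficulty is inherent.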
Generated Jun 3, 2026
Comparison History (19)
Paper 1 targets molecular optimization, a critical bottleneck in drug discovery and materials science. Its novel multi-agent tree-search framework addresses the complex, long-horizon multi-objective trade-offs inherent in chemical design (e.g., ADMET, synthesizability). This promises profound, real-world scientific breakthroughs in healthcare and chemistry. While Paper 2 introduces a valuable AI benchmark for personalized decision modeling using prediction markets, its broader scientific impact is narrower, primarily advancing RecSys and Web3 behavioral modeling rather than fundamental scientific discovery.
Paper 1 has higher potential impact due to introducing a large-scale, real-world benchmark (BehaviorBench) built from behavioral traces, addressing a key evaluation gap in personalized decision modeling and human-vs-simulated behavior. Benchmarks often become shared infrastructure, enabling broad, reproducible comparisons across models and methods and influencing multiple fields (LLMs, personalization, behavioral modeling, computational social science, finance/markets). Paper 2 is timely and application-oriented, but appears more like a system/engineering contribution with domain-specific impact and less clearly generalizable evaluation assets.
Paper 1 proposes a highly novel architectural paradigm for biomedical AI agents, directly addressing the critical bottleneck of tool scalability in biological workflows. Its potential to accelerate real-world scientific discovery gives it profound cross-disciplinary impact. While Paper 2 provides a valuable benchmark for personalized decision modeling, its focus on prediction markets is narrower in scope compared to advancing automated biomedical research.
Paper 1 offers higher likely scientific impact because it introduces a broadly useful, real-world benchmark for personalized decision modeling using behavioral traces, addressing a core evaluation gap (human vs simulated behavior) with large-scale, reproducible data and clear task/metric structure. This can catalyze method development across ML, personalization, HCI, and computational social science. Paper 2 is timely and practically valuable for legal agents, but appears more application/engineering-specific and depends on a proprietary evaluation setting (Harvey LAB), which may limit reproducibility and broader cross-field uptake.
Paper 1 addresses the critical and widely relevant problem of hallucination detection in LLMs by connecting it to the well-studied OOD detection framework—a novel geometric perspective that is training-free and scalable. This bridges two major research areas (OOD detection and LLM safety), offering broad applicability across reasoning tasks and high relevance to AI safety. Paper 2, while introducing a useful benchmark for personalized decision modeling, targets a narrower domain (prediction markets/on-chain data) with more limited cross-field impact and less conceptual novelty.
SHARP addresses a fundamental challenge in sequence modeling—learning long-range non-stationary temporal patterns in streaming settings—with a novel biologically-inspired framework. Its hierarchical memory replay mechanism offers exponentially increasing temporal context at linear computational cost, which is a significant theoretical and practical contribution. The approach has broad applicability across sequence modeling tasks (NLP, time series, etc.) and introduces a genuinely novel architectural paradigm. BehaviorBench, while a useful benchmark contribution for personalized decision modeling, is more narrowly scoped to prediction markets and primarily evaluates existing models rather than proposing new methodology.
Paper 1 introduces a large-scale, novel benchmark using real-world behavioral data to address a significant gap in personalized AI systems, which heavily rely on flawed simulations. This sets a foundation for broad future research in user modeling and economics. In contrast, Paper 2 proposes an incremental algorithmic improvement (RelGT-AC) for a specific database task on an existing benchmark, offering narrower methodological contributions and more restricted potential impact across fields.
Paper 1 provides a foundational framework for analyzing and optimizing AI-Driven Research Systems (ADRS), a rapidly growing field with profound implications for automated scientific discovery and algorithmic design. Its methodological rigor and potential to accelerate AI-driven research across multiple domains give it a broader and more transformative scientific impact compared to Paper 2, which offers a valuable but more narrowly focused benchmark for user decision modeling in prediction markets.
Paper 2 likely has higher impact: it introduces a large-scale, real-world benchmark (millions of trade instances) enabling evaluation of personalized decision modeling grounded in behavioral traces, directly addressing a timely gap where simulated users can mislead. Its applications span decision support, economics/finance, HCI, personalization, and trustworthy AI, giving broad cross-field relevance. The methodology leverages objective public records and provides multiple evaluation interfaces exposing failure modes, supporting rigorous, reproducible comparisons. Paper 1 is novel and valuable for spatial reasoning in VLMs, but its impact may be narrower and dataset scale/results appear more incremental.
Paper 1 addresses a fundamental problem in LLM architecture by proposing a novel fusion of Transformers and State Space Models. Improvements to foundational AI architectures have a profound, cross-disciplinary impact on efficiency and performance. In contrast, Paper 2 introduces a benchmark for a more specific niche (predicting user decisions from prediction markets/on-chain data), which, while valuable, has a narrower scope and less potential for widespread foundational impact across fields.
Paper 2 likely has higher scientific impact due to introducing a large-scale, real-world benchmark for personalized decision modeling—an area with broad relevance across ML, HCI, economics/fintech, and alignment. Its use of observed behavioral traces (vs. simulations) is timely and addresses a known validity gap, enabling many follow-on methods and evaluations. The dataset scale and layered tasks support methodological rigor and community adoption. Paper 1 is practical and novel for cost reduction in coding agents, but its impact is narrower and more engineering/optimization-focused.
Paper 2 likely has higher impact because it introduces a large-scale, real-world benchmark (BehaviorBench) that can become shared infrastructure for evaluating personalized decision modeling across many methods and communities (LLMs, recsys, HCI, behavioral econ, fintech). Its dataset scale, real behavioral traces (vs simulations), and multiple task layers/metrics make it broadly reusable and timely given interest in personalization and agent evaluation. Paper 1 is novel and useful for VLA control, but is more specialized to embodied steering and depends on a specific model/task setup, limiting breadth.
Paper 2 is more novel methodologically: it uses LLM-guided evolutionary program synthesis to generate admissible, domain-dependent abstractions for optimal planning—addressing a key gap (learning while preserving A* optimality). Its rigor is strengthened by formal admissibility via abstractions and saturated cost partitioning, and its potential impact spans automated planning, program synthesis, and LLM-based tool generation with clear real-world relevance (robotics, logistics). Paper 1 is a valuable benchmark, but its impact is narrower (evaluation dataset) and may be more incremental relative to existing personalization/behavior benchmarks.
Paper 2 introduces a large-scale, real-world benchmark for personalized decision modeling, addressing a critical gap where simulated data falls short. Benchmarks fundamentally shape research directions and typically garner higher scientific impact and citations by providing standard evaluation frameworks. While Paper 1 offers an innovative methodology for time series forecasting, Paper 2's broad applicability to AI personalization, user modeling, and behavioral science gives it a wider and more foundational impact across multiple disciplines.
POIROT addresses a more fundamental and broadly applicable problem—safety oversight in multi-agent LLM systems—which is critical for deployment across many domains. Its novel approach of using agents as their own diagnostic layer is conceptually innovative and has immediate implications for AI safety regulation. The release of both a library and benchmark (BLAME) increases practical impact. BehaviorBench, while solid, targets a narrower niche (personalized decision modeling from blockchain/prediction-market data) with more incremental contributions. POIROT's relevance to AI safety and regulation gives it stronger timeliness and broader cross-field impact.
Paper 2 addresses a critical safety and equity issue in medical AI, demonstrating a dangerous mechanism (diagnostic substitution) where epidemiological priors cause LLMs to recommend lower triage urgency for young women compared to men with identical symptoms. This has profound real-world implications for healthcare equity, AI alignment, and clinical deployments, giving it broader societal and scientific urgency. While Paper 1 provides a valuable benchmark for behavioral modeling, its focus on prediction markets and on-chain records is more niche, whereas Paper 2's findings on algorithmic bias have immediate, life-critical consequences.
scTranslation addresses a critical need in single-cell genomics—a rapidly growing field with broad biomedical impact. It benchmarks computational methods for multi-omics modality translation, which has direct applications in understanding cellular regulation and disease mechanisms. The systematic evaluation of factors like feature selection and few-shot settings provides actionable insights for method developers. BehaviorBench, while novel in using real-world prediction market data for personalized decision modeling, targets a narrower domain (crypto/prediction markets) with less immediate scientific breadth. scTranslation's open-source framework and relevance to biology give it broader and more lasting impact.
Paper 2 likely has higher impact due to a large-scale, real-world benchmark (millions of transactions) enabling rigorous, reproducible evaluation of personalized decision modeling—an area with immediate applications in recommender systems, decision support, fintech, and human-AI interaction. Its use of observed behavioral traces addresses a timely gap (simulation vs. human behavior divergence) and offers breadth across ML, economics/markets, and behavioral modeling. Paper 1 is innovative for reasoning-structure evaluation, but its scope is narrower (logic puzzles/trace graphs) and may see slower real-world adoption compared to a widely usable dataset and evaluation framework.
Paper 1 likely has higher scientific impact because it introduces a large, real-world benchmark (BehaviorBench) built from behavioral traces rather than simulated users, addressing a widely recognized evaluation gap in personalization/user modeling. Benchmarks tend to catalyze broad follow-on work across ML, HCI, personalization, and decision-support, and the dataset scale plus multiple task layers/interfaces enable systematic study of failure modes. Paper 2 is timely and useful for deployment economics, but its contribution is a more incremental optimization/policy method with narrower cross-field spillover than a new real-world evaluation substrate.