From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Taojie Zhu, Wentao Zhao, Rui Sun, Beidi Luan, Jiacheng Lu, Sinuo Wang, Jing Li, Daxin Jiang

#1404 of 2821 · Artificial Intelligence
Tournament Score: 1409±43 · Win Rate: 55% (12 wins, 10 losses, 22 matches)
Rating: 7.5/10

Abstract

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024-2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: KTD-FIN Benchmark

1. Core Contribution

KTD-FIN addresses two well-defined but previously underexplored failure modes in evaluating LLM trading agents: (1) pretraining data contamination allowing memorized market knowledge to substitute for reasoning, and (2) the inability of raw portfolio returns to distinguish genuine stock-selection alpha from passive factor exposure.

The paper's solution is architecturally clean. A four-level data-side masking protocol (bright, stock-blind, date-blind, blinded) systematically ablates ticker and calendar identifiers, creating a controlled experiment for memory dependence. A Barra-style cross-sectional attribution framework decomposes returns into market, style, and selection-alpha components. Together, these turn "did the agent make money?" into "did the agent make money *for the right reasons*?"
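
The four masking levels can be read as a 2×2 grid over two identifier families, tickers and calendar dates. The following is a minimal sketch of that idea, not the authors' implementation: the `Masker` class, the `STOCK_nnn` and `Tnnnn` alias formats, the `DEMO.SH` ticker, and the choice to replace dates with a trading-day index are all assumptions made for illustration.

```python
from datetime import date

# Which identifier families each level hides: (mask_ticker, mask_date).
MASK_LEVELS = {
    "bright": (False, False),
    "stock-blind": (True, False),
    "date-blind": (False, True),
    "blinded": (True, True),
}


class Masker:
    """Hypothetical data-side masker; alias formats are illustrative only."""

    def __init__(self, level, trading_days):
        self.mask_ticker, self.mask_date = MASK_LEVELS[level]
        self._aliases = {}  # ticker -> stable alias, assigned in first-seen order
        # Index trading days so masked dates keep their order but not the calendar.
        self._day_index = {d: i for i, d in enumerate(sorted(trading_days))}

    def ticker(self, raw):
        if not self.mask_ticker:
            return raw
        if raw not in self._aliases:
            self._aliases[raw] = f"STOCK_{len(self._aliases) + 1:03d}"
        return self._aliases[raw]

    def day(self, raw):
        if not self.mask_date:
            return raw.isoformat()
        return f"T{self._day_index[raw]:04d}"

    def row(self, record):
        """Return a copy of one price record with identifiers masked."""
        masked = dict(record)
        masked["ticker"] = self.ticker(record["ticker"])
        masked["date"] = self.day(record["date"])
        return masked


# Toy record: the ticker and close price are placeholders, not market data.
days = [date(2024, 1, 2), date(2024, 1, 3)]
record = {"ticker": "DEMO.SH", "date": date(2024, 1, 3), "close": 100.0}
masked = {level: Masker(level, days).row(record) for level in MASK_LEVELS}
for level, row in masked.items():
    print(level, row)
```

Because the same `Masker` instance would serve every prompt and tool call in a run, a given ticker or date always maps to the same alias, which is the consistency property the abstract describes. Numeric fields are passed through unchanged in this sketch; whether the benchmark also rescales prices is not stated here.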

The key insight—that LLMs trade on identity priors rather than data—is demonstrated convincingly through the memory-only experiment: under bright conditions the anchor model actively trades and loses, while under blinded conditions it holds cash at exactly 0.00% return. This is a striking finding that undermines many prior LLM trading evaluations.

2. Methodological Rigor

The experimental design is thorough. The ten-attacker de-anonymization probe is a particularly strong methodological choice: rather than simply asserting that masking works, the authors empirically verify it by tasking ten frontier LLMs with recovering identifiers from masked payloads. Joint success rates peak at 1.5%, confirming effective anonymization.
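
One way to read the "joint success rate" is the share of masked payloads on which an attacker recovers both the ticker and the date. That reading is an assumption, as are the function name, the trial tuple layout, and the toy trials below, which are invented and unrelated to the reported 1.5% figure.

```python
from collections import defaultdict


def joint_success_rates(trials):
    """trials: iterable of (attacker, ticker_correct, date_correct) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for attacker, ticker_ok, date_ok in trials:
        totals[attacker] += 1
        # A trial counts only if both identifiers were recovered together.
        hits[attacker] += int(ticker_ok and date_ok)
    return {attacker: hits[attacker] / totals[attacker] for attacker in totals}


# Invented trials for two hypothetical attackers.
trials = [
    ("attacker_a", True, False),
    ("attacker_a", False, False),
    ("attacker_a", True, True),
    ("attacker_a", False, True),
    ("attacker_b", False, False),
    ("attacker_b", True, False),
]
rates = joint_success_rates(trials)
peak = max(rates.values())
print(rates, peak)
```

Reporting the peak across attackers, rather than the mean, gives a worst-case view of how well the masking holds.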

The Barra attribution uses standard asset-pricing methodology (WLS cross-sectional regression with VIF-screened factors), grounding the decomposition in established quantitative finance practice rather than ad hoc metrics. The nine style factors are well-chosen and the pre-evaluation VIF screening prevents information leakage into factor selection.
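
A single-day version of such a decomposition can be sketched as follows. This is a generic cross-sectional WLS attribution under stated assumptions, not the paper's code: the intercept is treated as the market factor, the regression weights are arbitrary positive numbers, the data are random, and the function names are hypothetical. The VIF threshold used for screening is not given in this review, so the sketch only computes the factors.

```python
import numpy as np


def vif(exposures):
    """Variance inflation factor of each style column against the others."""
    n, k = exposures.shape
    out = np.empty(k)
    for j in range(k):
        y = exposures[:, j]
        others = np.column_stack([np.ones(n), np.delete(exposures, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        out[j] = 1.0 / (1.0 - r2)
    return out


def attribute_one_day(returns, exposures, reg_weights, holdings):
    """Split one day's portfolio return into market, style, and selection parts."""
    n = returns.shape[0]
    X = np.column_stack([np.ones(n), exposures])  # intercept column = market factor
    sw = np.sqrt(reg_weights)
    # WLS solved as OLS on rows scaled by the square root of the weights.
    factor_returns, *_ = np.linalg.lstsq(X * sw[:, None], returns * sw, rcond=None)
    residuals = returns - X @ factor_returns  # stock-specific returns
    return {
        "market": float(holdings.sum() * factor_returns[0]),
        "style": float((holdings @ exposures) @ factor_returns[1:]),
        "selection": float(holdings @ residuals),
    }


# Synthetic cross-section: 60 stocks, 3 style factors, random numbers throughout.
rng = np.random.default_rng(0)
n_stocks, n_styles = 60, 3
exposures = rng.standard_normal((n_stocks, n_styles))
returns = (0.002 + exposures @ np.array([0.004, -0.003, 0.001])
           + 0.01 * rng.standard_normal(n_stocks))
reg_weights = rng.uniform(1.0, 5.0, n_stocks)
holdings = np.zeros(n_stocks)
holdings[:10] = 0.1  # equal-weight the first ten names

parts = attribute_one_day(returns, exposures, reg_weights, holdings)
portfolio_return = float(holdings @ returns)
print(parts, portfolio_return, vif(exposures))
```

The three parts sum to the portfolio return by construction, so a large raw return with a near-zero or negative `selection` term is exactly the pattern the attribution is meant to expose.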

The evaluation infrastructure is realistic: T+1 settlement, board-differentiated price limits, transaction costs, and next-day-open execution. The use of median over seeds with Wilcoxon signed-rank tests for cross-condition contrasts is statistically appropriate.
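
The seed-median plus signed-rank procedure can be illustrated with a small exact test. The sketch below assumes the pairing unit is one model evaluated under two conditions, handles only non-zero and untied differences, and uses invented basis-point returns; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice.

```python
from itertools import product
from statistics import median


def wilcoxon_signed_rank_exact(x, y):
    """Two-sided exact Wilcoxon signed-rank test on paired samples."""
    diffs = [a - b for a, b in zip(x, y)]
    magnitudes = [abs(d) for d in diffs]
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError("this sketch handles only non-zero, untied differences")
    n = len(diffs)
    # Rank the absolute differences from smallest (1) to largest (n).
    order = sorted(range(n), key=lambda i: magnitudes[i])
    rank = {i: r for r, i in enumerate(order, start=1)}
    w_plus = sum(rank[i] for i in range(n) if diffs[i] > 0)
    total = n * (n + 1) // 2
    stat = min(w_plus, total - w_plus)
    # Under the null each rank is equally likely to carry either sign,
    # so enumerate all 2^n sign patterns and count those at least as extreme.
    extreme = 0
    for signs in product((False, True), repeat=n):
        w = sum(r for r, positive in zip(range(1, n + 1), signs) if positive)
        if min(w, total - w) <= stat:
            extreme += 1
    return stat, extreme / 2 ** n


# Invented per-seed returns in basis points: six hypothetical models, three seeds each.
seeds_bright = [[120, 100, 80], [50, 70, 60], [10, 30, 20],
                [200, 180, 190], [90, 95, 85], [40, 45, 35]]
seeds_blinded = [[90, 99, 110], [40, 58, 70], [5, 17, 30],
                 [100, 186, 250], [70, 85, 88], [10, 34, 60]]
bright = [median(s) for s in seeds_bright]
blinded = [median(s) for s in seeds_blinded]
stat, p_value = wilcoxon_signed_rank_exact(bright, blinded)
print(stat, p_value)
```

Taking the median over seeds first keeps a single outlier run from driving the paired contrast, and the signed-rank test then avoids assuming normally distributed return differences.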

However, several design choices warrant scrutiny. The anchor-only depth scan (full mask × mode grid only on Step-3.5-Flash) limits the generalizability of the interaction effects between masking and decision mode. The price-only observation channel, while deliberate, means the benchmark cannot assess whether LLMs add value through fundamental or news-based reasoning, arguably the setting where LLMs might have the strongest comparative advantage. The CSI300 universe restricts geographic generalizability, though the Chinese A-share market has lower analyst coverage, which could theoretically favor LLM-based approaches.

3. Potential Impact

Immediate field impact: This paper could substantially reshape how LLM trading agent papers are evaluated. The demonstration that nine of ten frontier LLMs produce negative selection alpha—despite several posting impressive raw returns—is a sobering finding that challenges the narrative of multiple prior works. If adopted, the masking protocol would raise the evaluation bar significantly.

Methodological template: The two-pronged approach (contamination control + attribution) is portable. The authors explicitly note applicability to news-driven and fundamentals-based evaluation, and the masking protocol generalizes beyond finance to any domain where LLMs may have memorized benchmark answers.

Broader LLM evaluation: The contamination insight connects to the growing literature on benchmark contamination (LiveCodeBench, Min-K% Prob). KTD-FIN provides the strongest evidence yet that data-side masking is necessary—prompt-level instructions ("do not use external memory") demonstrably fail, as established by OracleProto and confirmed here in the sequential setting.

Quantitative finance: The finding that LLM agents systematically load on high-volatility, recently-moving names (versus ML models' low-volatility momentum books) characterizes a behavioral signature that may inform both agent design and risk management.

4. Timeliness & Relevance

This paper arrives at a critical juncture. Multiple papers in 2023-2025 have claimed LLM trading profitability using exactly the evaluation setup this paper critiques. As frontier model knowledge cutoffs extend further into historical market data, the contamination problem worsens with each model release. DeepFund's approach of restricting evaluation to post-cutoff dates becomes increasingly impractical. KTD-FIN's data-side masking offers a durable alternative.

The paper also anticipates the emerging concern about "alpha washing"—presenting factor returns as skill—which mirrors long-standing debates in the hedge fund industry now transplanted to the AI agent evaluation context.

5. Strengths & Limitations

Key Strengths:

  • The rationale case study (Appendix A) provides qualitative evidence that is unusually compelling: seeing the same model write "Kweichow Moutai, liquor leader" under bright and "20-day return +27.4%, low vol 0.0255" under blinded on identical numeric inputs is immediately persuasive.
  • The three decision modes (memory-only, fixed-candidate, open-research) cleanly decompose capability dimensions.
  • The ten-dimensional metric panel resists single-number gaming.
  • The selection-alpha finding (78pp range vs. 13-17pp for market/style) demonstrates that return attribution changes the evaluation conclusion rather than merely adding nuance.

Notable Limitations:

  • Price-only observation systematically disadvantages LLMs, whose comparative advantage likely lies in processing unstructured text (news, filings). The benchmark may be measuring whether LLMs can do what quantitative models already do well, rather than what they uniquely offer.
  • Single market (CSI300) and single regime type per window limit external validity.
  • The 2024-2026 window, while long, represents a specific macro regime; selection-alpha conclusions may not generalize.
  • No analysis of whether any LLM agent achieves positive alpha in specific regimes (e.g., the explosive bull window W4), which would add important nuance.
  • The paper does not explore whether fine-tuned or retrieval-augmented LLMs might perform differently, focusing exclusively on zero-shot frontier models.

Overall Assessment

KTD-FIN makes a well-executed contribution to an important and timely problem. Its primary value is methodological: establishing that LLM trading evaluation must control for both data contamination and factor exposure. The negative finding, that LLM agents lack stock-selection alpha, is significant but should be interpreted within the constraint that the benchmark provides only price data, not the unstructured information where LLMs might add value. The benchmark design is sound, reproducible, and immediately useful to the community.

Rating: 7.5/10
Significance 8 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated May 28, 2026

Comparison History (22)

    vs. MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
    gpt-5.2 · 5/29/2026

    Paper 2 is likely to have higher scientific impact due to broader cross-domain relevance (multimodal safety affects many VLM deployments), strong timeliness (harm reasoning is a central frontier topic), and both a new dataset plus a generalizable training method (reward optimization for implicit semantics) with claimed OOD robustness. Paper 1 is novel and rigorous for financial-agent evaluation (leakage control + attribution), but its scope is narrower (trading benchmarks, specific market) and applications are more specialized. Overall, Paper 2’s contributions more readily transfer across fields and products.

    vs. Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling
    claude-opus-4.6 · 5/29/2026

    Paper 1 addresses a critical and timely problem in LLM evaluation for financial markets: data leakage through memorization and inadequate performance attribution. Its contribution of a rigorous benchmark methodology (KTD-Fin) with masking protocols and Barra-style attribution has broad implications for the growing field of LLM-based trading agents, potentially reshaping how the entire community evaluates financial AI. Paper 2, while technically solid in proposing step-level credit assignment for agentic search, addresses a more incremental improvement in RL-based agent training. Paper 1's finding that LLM trading returns are largely explained by passive exposure rather than alpha is a high-impact insight with significant real-world implications.

    vs. Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation
    claude-opus-4.6 · 5/29/2026

    Paper 2 addresses a fundamental evaluation flaw affecting the entire field of LLM-based financial agents—data leakage from memorized training data and conflation of beta/style returns with genuine alpha. Its masking protocol and Barra-style attribution framework provide broadly reusable methodological contributions applicable beyond finance to any domain where LLM memorization contaminates evaluation. Paper 1 offers practical engineering insights for health text generation but represents more of an architectural best-practice than a novel scientific contribution, with narrower scope (sleep health) and less generalizable findings.

    vs. Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software
    gpt-5.2 · 5/29/2026

    Paper 2 has higher impact potential due to a broadly applicable, methodologically rigorous benchmark addressing two key evaluation pathologies in LLM agent finance (knowledge leakage and confounded performance). Its masking protocol and factor-attribution framework are reusable across markets, tasks, and agent evaluations, likely influencing benchmarking standards beyond trading (e.g., any domain with pretraining leakage and outcome confounds). Paper 1 is insightful but is an N=1 case study with narrower generalizability and primarily qualitative supervision recommendations, limiting cross-field uptake despite relevance to scientific software reliability.

    vs. Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection
    gemini-3.1 · 5/29/2026

    Paper 2 addresses critical methodological flaws in LLM evaluation, specifically data contamination and return attribution. By exposing the lack of genuine 'alpha' in LLM trading agents and providing a rigorous masking framework, it corrects a massive source of false positives in FinAI research. This fundamental contribution to AI evaluation and data leakage prevention gives it broader scientific significance than Paper 1's domain-specific application of VLMs to time-series data.

    vs. Demystifying Data Organization for Enhanced LLM Training
    claude-opus-4.6 · 5/29/2026

    Paper 1 addresses a fundamental and broadly applicable problem in LLM training—data organization—that impacts the entire LLM community. Its systematic framework with formalized guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, Local Diversity) and validated methods (STR, SAW) across multiple scales offers practical, generalizable contributions. Paper 2, while methodologically rigorous and addressing important evaluation gaps in LLM trading agents, targets a narrower domain (financial trading benchmarks). Paper 1's breadth of impact across all LLM training scenarios gives it higher potential scientific impact.

    vs. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems
    gpt-5.2 · 5/28/2026

    Paper 2 has higher impact potential due to a clear, timely methodological contribution: a leakage-controlled evaluation protocol for LLM trading plus attribution to disentangle true skill from beta/style effects. This directly addresses a widely recognized failure mode in LLM agent benchmarking (memorization/knowledge cutoff leakage) and provides a reusable template likely transferable beyond finance (e.g., any domain with historical data overlap). The emphasis on rigor and reproducibility strengthens scientific value. Paper 1 is promising but is more of a systems integration/architecture proposal with less clearly defined evaluation standards and novelty relative to existing agentic analytics frameworks.

    vs. LACUNA: Safe Agents as Recursive Program Holes
    gemini-3.1 · 5/28/2026

    Paper 2 proposes a novel, general-purpose programming model for LLM agents that addresses critical issues in agent safety, expressiveness, and control flow. Its approach of treating agent actions as typed, compiler-checked program holes has broad applicability across all AI agent domains. While Paper 1 offers a rigorous and valuable benchmark for financial LLMs, Paper 2's fundamental contribution to agent architecture and safety gives it a higher potential for widespread scientific impact across the broader AI and software engineering communities.

    vs. SwarmHarness: Skill-Based Task Routing via Decentralized Incentive-Aligned AI Agent Networks
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental infrastructure challenge in AI by proposing a decentralized compute and task-routing protocol. Its impact is broad, spanning distributed systems, edge computing, and autonomous agent networks. In contrast, Paper 1, while methodologically rigorous in addressing data leakage, is narrowly focused on the specific niche of financial trading benchmarks.

    vs. GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease
    gpt-5.2 · 5/28/2026

    Paper 2 has higher potential impact due to its timely focus on rigorous evaluation of LLM agents, introducing broadly applicable methods (leakage-control via masking and return attribution) that can generalize beyond finance to other agentic benchmarks. It addresses a critical validity threat (knowledge contamination) and improves methodological rigor for a fast-growing field, likely influencing how future studies evaluate LLM decision-making. Paper 1 is valuable but more domain-specific (IBD detection) and incremental within established graph-temporal EHR modeling, with narrower cross-field influence.

    vs. When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models
    gpt-5.2 · 5/28/2026

    Paper 1 is likely higher impact: it identifies a broadly relevant, under-measured safety failure mode (“brittle safety”) with a clear diagnostic protocol (context-flip evaluation), analyzes mechanisms across multiple model families, and demonstrates a concrete mitigation direction (state-aware validation) where common guardrails fail. The implications extend beyond one domain to general alignment, evaluation, and deployment safety. Paper 2 is rigorous and timely for financial agent evaluation, but its primary impact is more domain-specific (trading benchmarks) and may generalize less broadly than safety robustness failures affecting many real-world LLM deployments.

    vs. GONDOR to the Rescue: Satisficing Planning with Low Memory
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a critical and timely problem in LLM evaluation for financial markets—data leakage through memorization and inadequate performance attribution. It introduces a novel benchmark (KTD-Fin) with masking protocols and Barra-style attribution, revealing that LLM trading agents lack genuine stock-selection alpha. This has broad implications for the rapidly growing field of LLM agents in finance and AI evaluation methodology. Paper 2 presents a useful but incremental contribution to memory-efficient heuristic search, extending GBFS with engineering techniques. While solid, it addresses a narrower, more established problem with less transformative potential.

    vs. Natural Language Query to Configuration for Retrieval Agents
    gpt-5.2 · 5/28/2026

    Paper 2 has higher potential impact because it introduces a broadly applicable evaluation methodology (leakage-controlled masking + risk/factor attribution) that directly addresses known failure modes in LLM agent benchmarking, with clear relevance and urgency. The benchmark design can generalize beyond finance to any domain where memorization/leakage and confounded end-to-end metrics occur, and it improves methodological rigor by separating true decision skill from spurious gains. Paper 1 is useful and practical for RAG cost/quality optimization, but is more incremental and narrower in scope.

    vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a fundamental flaw in LLM knowledge editing (epistemic dissonance) by introducing a novel causal editing paradigm. Its impact spans across general AI development, lifelong learning, and model reliability. Paper 2, while methodologically rigorous in addressing data leakage, focuses on a domain-specific benchmarking problem (financial trading agents), limiting its breadth of impact compared to the foundational algorithmic contributions of Paper 1.

    vs. Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to a clearer, broadly applicable evaluation contribution: it tackles a timely, widely recognized problem (knowledge leakage in LLM agent backtests) with a concrete masking protocol plus risk/return attribution to separate true skill from beta/style effects. This creates an immediately usable benchmark template for finance and general agent evaluation, improving methodological rigor and reproducibility. Paper 1 is innovative for deployment control of constrained agentic LMs, but its impact may be narrower and harder to standardize compared to a benchmark that can reshape how an active research area evaluates results.

    vs. SuiChat-CN: Benchmarking Contextual Suicide Risk Assessment in Chinese Group Chats
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a fundamental and pervasive issue in LLM evaluation (data contamination and memorization) through a rigorous masking protocol and performance attribution framework. By releasing a reproducible template, it provides high methodological value that can broadly influence how AI agents are evaluated in temporal and domain-specific tasks. While Paper 1 targets a critical societal issue, the lack of a public dataset release may limit direct follow-up research and broader scientific adoption.

    vs. SimuWoB: Simulating Real-World Mobile Apps for Fast and Faithful GUI Agent Benchmarking
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM GUI agents: scalable, realistic, and reproducible evaluation. By creating a fully synthetic, backend-free simulation environment with automatic reward generation, it provides a highly practical tool for researchers to test long-horizon interactions. While Paper 2 offers excellent methodological rigor for financial agent evaluation by addressing data leakage, Paper 1 has broader applicability and higher potential impact in the generalized push toward AI assistants and real-world task automation.

    vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
    gemini-3.1 · 5/28/2026

    Paper 1 presents a broad, transformative approach to automating scientific discovery across multiple disciplines like biomedicine and protein engineering. Its potential to accelerate general scientific research gives it a vastly higher cross-disciplinary impact compared to Paper 2, which is narrowly focused on the evaluation of financial trading agents.

    vs. Voluntary Collusion with Secret Tools in Competing LLM Agents
    gemini-3.1 · 5/28/2026

    Paper 2 has higher potential scientific impact due to its broad implications for AI safety, alignment, and multi-agent systems. Discovering that ostensibly aligned LLMs voluntarily engage in secret collusion and deception addresses core anxieties regarding autonomous AI deployment. This research spans multiple domains, including AI ethics, policy, and safety. In contrast, while Paper 1 presents a methodologically rigorous and highly useful benchmark, its impact is largely confined to the specific domain of quantitative finance and AI trading.

    vs. Verifiable Benchmarking of Long-Horizon Spatial Biology
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a more broadly relevant problem—evaluating LLM agents in financial markets—with a methodologically rigorous approach that combines data masking to prevent memorization leakage and Barra-style performance attribution to decompose returns. Its contributions (leakage control and alpha attribution) are generalizable beyond finance to any domain where LLM memorization confounds evaluation. Paper 1, while valuable for spatial biology, targets a narrower scientific community with a domain-specific benchmark. Paper 2's insights about memorization contamination and the distinction between genuine reasoning and data leakage have wider implications for the entire LLM evaluation ecosystem.