BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows

Elaine Lau, Markus Dücker, Ronak Chaudhary, Hui Wen Goh, Rosemary Wei, Vaibhav Kumar, Saed Qunbar, Guram Gogia

Apr 13, 2026 · arXiv:2604.11304v1

cs.AI

#720 of 3875 · Artificial Intelligence

Tournament Score: 1470 ± 23 (scale 1050 to 1800)

Win Rate: 55%

Rating: 7.8 / 10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Abstract

Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: BankerToolBench

Core Contribution

BankerToolBench (BTB) introduces a benchmark of 100 end-to-end investment banking tasks designed to evaluate whether frontier AI agents can execute complete professional workflows, not just answer questions or perform isolated subtasks. The key novelty lies in the combination of: (1) tasks requiring multi-file deliverables (Excel models, PowerPoint decks, PDF reports) rather than text responses; (2) expert-crafted rubrics averaging 150 binary criteria per task across six evaluation dimensions; (3) an agentic verifier ("Gandalf") capable of inspecting spreadsheet formulas, cross-artifact consistency, and domain-specific correctness; and (4) grounding in a systematic Job Task Analysis (JTA) survey, part of a collaboration with 502 investment bankers from elite firms. The benchmark fills a genuine gap: existing benchmarks like GDPVal and APEX-Agents sacrifice domain depth for occupational breadth, with IB tasks in APEX-Agents averaging only 1.36 hours and ~3 rubric criteria, versus BTB's 5-hour average and 150 criteria.
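To make the binary-rubric idea concrete, here is a minimal sketch of how per-task scores could be aggregated from pass/fail criteria grouped by dimension. It assumes unweighted aggregation, which this review does not confirm for BTB; the `Criterion` class, the `score_task` function, and the dimension labels are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    dimension: str  # hypothetical dimension label, not BTB's actual taxonomy
    passed: bool    # binary verdict assigned by a verifier


def score_task(criteria):
    """Return (overall pass rate, per-dimension pass rates) for one task.

    Assumption: every criterion carries equal weight.
    """
    if not criteria:
        raise ValueError("a task needs at least one rubric criterion")
    totals = defaultdict(int)
    passes = defaultdict(int)
    for c in criteria:
        totals[c.dimension] += 1
        passes[c.dimension] += int(c.passed)
    overall = sum(passes.values()) / len(criteria)
    by_dimension = {d: passes[d] / totals[d] for d in totals}
    return overall, by_dimension


# Toy rubric with four criteria across three made-up dimensions.
example = [
    Criterion("accuracy", True),
    Criterion("accuracy", False),
    Criterion("formatting", True),
    Criterion("consistency", False),
]
overall, by_dimension = score_task(example)
print(overall, by_dimension)
```

Reporting the per-dimension breakdown alongside the overall rate keeps a strong formatting score from masking failures in, say, cross-artifact consistency.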

Methodological Rigor

The benchmark construction pipeline is impressively rigorous. The psychometric grounding—including stratified sampling for the JTA survey (n=193), AI value survey (n=129), and a formal blueprint derivation—provides principled justification for the task distribution. The multi-stage quality control (4+ bankers per task, iterative review cycles, automated pre-screening) addresses common concerns about benchmark contamination and unrealistic tasks.

The verifier validation is particularly strong: 88.2% accuracy with κ=0.76 against human consensus, compared to 84.6% raw agreement and κ=0.69-0.82 between human raters. The stability analysis (0.4pp standard deviation across runs) and Pareto-optimal model selection (Gemini 3 Flash Preview) demonstrate careful engineering. The comparison against alternative judge models (Table A10) adds confidence.
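The distinction between raw agreement and κ matters here, so a short sketch may help: Cohen's kappa discounts the agreement two raters would reach by chance given how often each says "pass". The function below implements the standard two-rater, binary-label formula; the verdict lists are made-up toy data, not BTB results.

```python
def cohens_kappa(rater_a, rater_b):
    """Return (raw agreement, Cohen's kappa) for two lists of binary verdicts."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty verdict lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of criteria where both raters gave the same verdict.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal pass rate.
    pass_a = sum(rater_a) / n
    pass_b = sum(rater_b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return observed, (observed - expected) / (1 - expected)


# Toy data: 1 = criterion passed, 0 = criterion failed.
verifier = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
human = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]
agreement, kappa = cohens_kappa(verifier, human)
print(f"raw agreement={agreement:.2f}, kappa={kappa:.2f}")
```

In this toy case the raters agree on 8 of 10 criteria, yet kappa is noticeably lower than 0.8 because both raters pass most criteria and would often agree by chance. That is why the reported κ values are the more conservative basis for comparing the verifier with human raters.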

However, several methodological concerns warrant attention. First, the rubric construction involved LLM-generated draft criteria that bankers then refined—this could introduce systematic biases toward criteria that LLMs can articulate well, potentially underrepresenting tacit professional knowledge. Second, the "extra context" added to prompts (Section 3.1) acknowledges that tasks aren't truly ecologically valid; real banker requests are more ambiguous. Third, with only 100 tasks, statistical power for fine-grained subcategory analyses is limited (e.g., n=1 for Word deliverables, n=3 for several subcategories). The post-training experiments (Section 6.6), while demonstrating hill-climbability, use very small models (4B, 32B) and limited compute, making it difficult to draw strong conclusions about BTB's utility as a training signal for frontier-scale models.
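The small-subcategory concern can be illustrated with a rough calculation. The sketch below computes a Wilson score interval for a pass rate, under the simplifying assumption that each task is a single pass/fail trial (BTB scores are rubric fractions, so this is only an analogy). The counts are invented for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Invented counts: a 3-task subcategory versus the full 100-task benchmark.
small = wilson_interval(2, 3)
large = wilson_interval(67, 100)
print(f"n=3   interval width: {small[1] - small[0]:.2f}")
print(f"n=100 interval width: {large[1] - large[0]:.2f}")
```

With only three tasks the interval spans most of the unit range, so differences between models within such a subcategory are hard to distinguish from noise.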

Potential Impact

BTB's impact potential spans multiple dimensions:

Benchmarking methodology: The paper establishes a template for domain-specific professional benchmarks—JTA surveys, expert rubrics, agentic verification of non-text artifacts. This could inspire analogous benchmarks in law, accounting, consulting, and medicine where deliverables are complex, multi-format documents rather than text answers.

AI deployment in finance: The finding that 0% of GPT-5.4 outputs are rated "sendable as-is" by bankers provides a concrete, credible reality check against hype about imminent AI-driven labor displacement in finance. The failure taxonomy (code/formula generation 41%, reasoning 27%, retrieval 18%, fabrication 13%) gives actionable improvement directions.

RL training infrastructure: The Harbor-based RLE design, combined with the demonstrated post-training improvements (up to 13× for Qwen 3 4B), positions BTB as a potential training environment, not just an evaluation benchmark—though the authors appropriately caution against training on the public data.

Economic measurement: By linking benchmark tasks to actual workflow hours (up to 21 hours per task) and banker willingness-to-pay data, BTB creates a more direct connection between AI capability metrics and economic value than most benchmarks achieve.

Timeliness & Relevance

The paper arrives at a critical inflection point. Major banks (Goldman Sachs, JPMorgan) are actively deploying AI tools, GPT-5.4 and Claude Opus 4.6 are reportedly optimized for financial tasks, and there is enormous investor and policy interest in AI's labor market impact. The "benchmaxxing crisis," in which models ace traditional benchmarks but disappoint in practice, makes profession-specific evaluation urgent. BTB directly addresses this by measuring what matters: can AI actually do the work?

Strengths

1. Ecological validity: The collaboration with 502 bankers, systematic JTA methodology, and realistic multi-file deliverables set a new standard for professional AI evaluation.

2. Verifier innovation: Gandalf's ability to inspect Excel formulas, cross-reference artifacts, and verify domain-specific criteria goes well beyond serialize-then-grade approaches.

3. Comprehensive failure analysis: The trajectory-level failure taxonomy with concrete examples (API hallucination, hard-coding instead of formulas, data fabrication) provides genuinely actionable insights.

4. Reproducibility: Open-sourcing the benchmark, verifier, and Harbor environment enables community adoption.

5. Domain knowledge experiment (Figure 9): Demonstrating that additional IB context significantly improves scores validates the hypothesis that domain knowledge—not just reasoning—is the bottleneck.
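Strengths 2 and 3 mention formula inspection and the "hard-coding instead of formulas" failure mode. As a rough illustration of what such a check involves, the sketch below flags cells that should contain a formula but hold a literal value. It is not Gandalf's implementation, which this review does not describe in detail: the `find_hardcoded_cells` function and the dictionary representation of a sheet are assumptions, and a real check would read the workbook with a spreadsheet library.

```python
def find_hardcoded_cells(cells, expected_formula_cells):
    """Return cell references that should be formulas but are not.

    `cells` maps a cell reference to its raw content; a formula is assumed
    to be a string starting with "=". A missing cell is also flagged.
    """
    hardcoded = []
    for ref in expected_formula_cells:
        value = cells.get(ref)
        is_formula = isinstance(value, str) and value.startswith("=")
        if not is_formula:
            hardcoded.append(ref)
    return hardcoded


# Toy sheet: B4 is derived by formula, B5 is a pasted number.
sheet = {"B2": 120.0, "B3": 80.0, "B4": "=B2-B3", "B5": 0.333}
flagged = find_hardcoded_cells(sheet, ["B4", "B5"])
print(flagged)
```

A check like this only works on the spreadsheet's underlying cell contents, which is one reason serialize-then-grade approaches that see rendered values alone cannot catch the problem.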

Limitations

1. US-centric scope: IB practices differ significantly across jurisdictions; generalizability is limited.

2. Static data: Real banking involves evolving information, team collaboration, and iterative feedback—BTB captures none of this social/temporal complexity.

3. Potential ceiling effects: As models improve, the rubric-based binary scoring may lack sensitivity to distinguish near-expert-level outputs.

4. Company affiliation: Handshake AI, which developed BTB, operates in the AI-for-banking space, creating potential conflicts of interest in benchmark design choices.

5. Model vintage: Testing only 9 models at one point in time limits longitudinal insights; the benchmark's discriminative power as models rapidly improve remains to be seen.

Overall Assessment

BTB represents a significant methodological advance in AI evaluation, shifting from capability-testing to work-completion assessment. Its primary contribution is demonstrating—rigorously—the gap between benchmark performance and professional utility, while providing infrastructure to close that gap. The ecological validity, verifier innovation, and systematic construction methodology make it a likely template for future domain-specific benchmarks.

Rating: 7.8 / 10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated Apr 14, 2026

Comparison History (76)

Won vs. SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems

BankerToolBench has higher potential impact due to its unprecedented scale (502 domain experts), comprehensive evaluation framework for professional AI workflows, and direct economic relevance. It addresses the critical gap in evaluating AI agents on complex, multi-step real-world tasks with multi-artifact deliverables, which has broad implications for AI deployment across professions. While SciIntegrity-Bench addresses the important issue of AI integrity in research, its smaller scale (33 scenarios, 7 models) and narrower scope limit its breadth of impact. BTB's findings on frontier model failures in high-stakes professional settings will likely drive significant agent architecture research.

claude-opus-4-6 · May 16, 2026

Won vs. Selective Off-Policy Reference Tuning with Plan Guidance

BankerToolBench introduces a novel, comprehensive benchmark for evaluating AI agents on complex, real-world professional workflows, developed with 502 domain experts. It addresses a significant gap in AI evaluation—ecologically valid, end-to-end professional task benchmarks—with broad implications for AI deployment in high-stakes domains beyond banking. The finding that frontier models fail ~50% of criteria and produce 0% client-ready outputs provides a clear and impactful challenge for the field. Paper 2 offers a useful incremental improvement to RL training (SORT), but its scope and potential influence are narrower, focused on a specific training methodology refinement.

claude-opus-4-6 · May 16, 2026

Lost vs. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Paper 1 offers fundamental insights into the mechanistic interpretability of LLMs, uncovering the specific circuits responsible for persuasion and factual alteration. This contributes deeply to the broader fields of AI safety, alignment, and model robustness. While Paper 2 presents a valuable and practical domain-specific benchmark for AI agents in finance, Paper 1 addresses a core, generalizable vulnerability in foundation models with more profound theoretical and scientific implications.

gemini-3.1-pro-preview · May 16, 2026

Lost vs. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

Paper 1 offers a mechanistic, geometry-based theory unifying conflict and hallucination in transformer hidden states, proposes a measurable diagnostic (geometric margin), provides causal isolation via controlled LoRA experiments, and reports scaling behavior—advances likely to influence core LM interpretability, reliability, and safety across many domains. Paper 2 is highly valuable and timely as an ecologically valid benchmark with strong stakeholder grounding, but its primary contribution is evaluative infrastructure for one profession; broader scientific generalization and conceptual novelty are more limited compared to Paper 1’s potential to reshape understanding and mitigation of hallucinations system-wide.

gpt-5.2 · May 16, 2026

Lost vs. Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training

Paper 2 likely has higher scientific impact: it proposes a generally applicable RL post-training regularizer (ICR) addressing overthinking with a principled signal derived from on-policy shortest-correct responses, and demonstrates improved accuracy–length Pareto tradeoffs across multiple backbones and benchmarks. This is methodologically and theoretically oriented, broadly relevant to LLM training, inference cost, and deployment, and timely given widespread RLVR use. Paper 1 is valuable and novel as an ecologically valid benchmark, but its impact is narrower (investment-banking workflow evaluation) and primarily evaluative rather than advancing core modeling/training methods.

gpt-5.2 · May 11, 2026

Lost vs. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

Paper 1 introduces fundamental mechanistic innovations for AI alignment and inference-time steering, addressing the critical challenge of controlling moral reasoning in LLMs. Its contributions to interpretability and ethical control offer broad applicability across AI safety research. While Paper 2 presents a highly valuable and economically significant benchmark, Paper 1's algorithmic advancements tackle core architectural control problems, likely driving wider foundational follow-up studies in the broader machine learning and AI alignment communities.