Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth
Abstract
AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
AI Impact Assessments
Scientific Impact Assessment: Agentick
Core Contribution
Agentick addresses a genuine gap in AI agent evaluation: the lack of a unified benchmark that enables fair comparison across reinforcement learning agents, large language model agents, vision-language model agents, and hybrid approaches. The benchmark provides 37 procedurally generated gridworld tasks organized into six capability categories (navigation, planning, reasoning, memory, generalization, multi-agent), exposed through five simultaneous observation modalities (ASCII, natural language, structured dictionary, isometric pixels, numpy arrays) via a standard Gymnasium interface. The key design contribution is "paradigm universality"—the idea that the same task can be presented in modalities native to each agent type, removing architectural bias from comparison.
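To make paradigm universality concrete, the sketch below renders one toy gridworld state in three of the five modalities listed above (ASCII, natural language, structured dictionary). It is an illustration only: `GridState`, `to_ascii`, `to_language`, and `to_dict` are hypothetical names invented here, not part of Agentick's API, and the benchmark's real observation formats may differ.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GridState:
    """Hypothetical toy state: one agent and one goal on a grid."""
    width: int
    height: int
    agent: Tuple[int, int]  # (row, col)
    goal: Tuple[int, int]   # (row, col)


def to_ascii(s: GridState) -> str:
    """Render the state as a character grid: A = agent, G = goal, . = empty."""
    rows = []
    for r in range(s.height):
        cells = []
        for c in range(s.width):
            if (r, c) == s.agent:
                cells.append("A")
            elif (r, c) == s.goal:
                cells.append("G")
            else:
                cells.append(".")
        rows.append("".join(cells))
    return "\n".join(rows)


def to_language(s: GridState) -> str:
    """Render the same state as a natural-language description."""
    return (
        f"You are at row {s.agent[0]}, column {s.agent[1]}. "
        f"The goal is at row {s.goal[0]}, column {s.goal[1]}, "
        f"on a {s.height}x{s.width} grid."
    )


def to_dict(s: GridState) -> Dict[str, Tuple[int, int]]:
    """Render the same state as a structured dictionary."""
    return {"agent": s.agent, "goal": s.goal, "grid_size": (s.height, s.width)}


if __name__ == "__main__":
    state = GridState(width=3, height=2, agent=(0, 0), goal=(1, 2))
    print(to_ascii(state))
    print(to_language(state))
    print(to_dict(state))
```

All three renderings carry the same information, so any performance gap between them reflects how an agent consumes the observation rather than what the task requires.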
The benchmark also provides practical infrastructure: oracle reference policies via a Coding API, pre-built SFT datasets (120K–500K episodes), a composable agent harness system, and a public leaderboard. The Oracle-Normalized Score (ONS) metric enables cross-task comparison analogous to human-normalized scores in ALE.
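The exact ONS formula is not given here. As a minimal sketch, assume it mirrors the human-normalized score used for ALE, with the oracle's return in place of the human's and a random policy as the floor. The function name and the numbers below are illustrative assumptions, not values from the paper.

```python
def oracle_normalized_score(agent_return: float,
                            random_return: float,
                            oracle_return: float) -> float:
    """Assumed form: (agent - random) / (oracle - random).

    0.0 means random-level performance, 1.0 means oracle-level performance.
    This follows the human-normalized score convention; Agentick's own
    definition may differ.
    """
    span = oracle_return - random_return
    if span <= 0:
        raise ValueError("oracle must outperform the random baseline")
    return (agent_return - random_return) / span


if __name__ == "__main__":
    # Illustrative numbers only.
    print(oracle_normalized_score(3.0, 1.0, 11.0))  # 0.2
    # A weaker reference policy lowers the ceiling and inflates the score.
    print(oracle_normalized_score(3.0, 1.0, 6.0))   # 0.4
```

Under this assumed form, the score depends on the quality of the reference policy as much as on the agent: the same agent return maps to a higher score whenever the oracle is weaker.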
Methodological Rigor
The experimental evaluation covers 27 configurations across 90,000+ episodes, which is substantial. The use of deterministic evaluation seeds (SHA-256 hashed) and 95% bootstrap confidence intervals follows best practices from Agarwal et al. (2021). The paradigm-spanning evaluation—three frontier LLMs, PPO from scratch, and four open-weight models—is well-chosen to demonstrate discriminative power.
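The two statistical ingredients mentioned above can be sketched as follows. This is a hedged illustration, not the authors' code: the key format passed to SHA-256, the 32-bit truncation, and the function names are assumptions, and the interval is a plain percentile bootstrap of the mean rather than the stratified bootstrap that Agarwal et al. (2021) recommend.

```python
import hashlib
import random
import statistics
from typing import List, Tuple


def episode_seed(task: str, difficulty: str, episode: int) -> int:
    """Derive a reproducible 32-bit seed from a SHA-256 hash.

    The "task:difficulty:episode" key layout is an assumption for illustration.
    """
    key = f"{task}:{difficulty}:{episode}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")


def bootstrap_ci(scores: List[float],
                 n_resamples: int = 2000,
                 alpha: float = 0.05,
                 seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * (n_resamples - 1))
    hi_idx = round((1 - alpha / 2) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    seeds = [episode_seed("maze", "hard", i) for i in range(3)]
    print(seeds)
    print(bootstrap_ci([0.1, 0.2, 0.3, 0.4, 0.5] * 4))
```

Hashing a descriptive key gives every agent the same episode seeds without storing a seed table, and fixing the resampling seed makes the reported interval itself reproducible.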
However, several methodological concerns arise:
1. Oracle calibration: The authors acknowledge that some "oracle" policies are not truly optimal, particularly on stochastic tasks. This undermines the reliability of ONS as a metric, since the ceiling varies in quality across tasks.
2. RL training budget: PPO was trained for only 2M steps, a modest budget by modern standards. The comparison between a PPO agent with limited training and frontier LLMs that cost hundreds of millions of dollars to pretrain is inherently asymmetric. The paper acknowledges this but doesn't adequately address how it affects conclusions.
3. Frontier model coverage: Only economical model variants (mini, Flash Lite, Haiku) were evaluated due to budget constraints. The absence of flagship models weakens claims about paradigm comparison.
4. Harness confound: The striking 3-10× improvement from the Reasoner harness is interesting but raises questions about whether the benchmark is measuring agent capability or prompt engineering quality. The paper treats this as a feature rather than a confound, which is debatable.
5. Statistical reporting: Confidence intervals are mentioned but only partially reported in Table 4, with "–" entries for GPT-5 mini and Qwen3-4B, the top and bottom performers.
Potential Impact
The benchmark could serve several research communities.
The positioning toward "RL post-training of foundation models" is timely, as this is an active frontier. The benchmark could serve as a training ground for extending RLVR beyond single-turn mathematical reasoning to truly sequential settings.
However, the gridworld abstraction is a significant limitation for broader impact. While the authors argue it's a deliberate design choice (the common denominator across agent types), it means findings may not transfer to continuous control, embodied AI, or real-world robotics—domains where agent capabilities matter most practically.
Timeliness & Relevance
The paper is well-timed. The convergence of RL and foundation model approaches is a major current theme, and the absence of fair cross-paradigm benchmarks is a real bottleneck. The RLVR connection is particularly relevant given DeepSeek-R1 and similar work. ARC-AGI-3 addresses a similar niche but with different design choices; Agentick's Gymnasium compatibility and open infrastructure may make it more accessible to the RL community.
Strengths
1. Clear design principles: The four principles (paradigm universality, capability decomposition, training-first design, controlled difficulty) are well-motivated and consistently implemented.
2. Comprehensive infrastructure: The Coding API, oracle policies, SFT datasets, harness system, and leaderboard together form a complete research ecosystem, not just a benchmark.
3. Actionable findings: The result that no paradigm dominates, with PPO excelling at planning/multi-agent while LLMs lead navigation/generalization, is genuinely informative for the field.
4. Observation modality analysis: The finding that ASCII outperforms natural language for LLM spatial reasoning is a useful practical insight.
5. Reproducibility: Deterministic seeds, public code, and standardized evaluation protocol support reproducibility.
Limitations & Weaknesses
1. Gridworld scope: All tasks are discrete, 2D, turn-based. Despite claims of difficulty (best agent at 0.309 ONS), the abstraction level limits ecological validity and transferability of findings.
2. Task design validation: There's no formal analysis of whether the six categories are orthogonal or whether tasks within categories measure the intended capability. Some tasks could arguably belong to multiple categories.
3. Incomplete evaluation: Missing VLM evaluation, no fine-tuned models, no flagship frontier models. The paper essentially presents preliminary results for a benchmark that hasn't yet been fully utilized.
4. Modest task count: Compared to MiniHack (100+) or ALE (57), the 37-task suite is relatively small, though the multi-modal, multi-difficulty design multiplies the effective evaluation space.
5. Limited novelty in individual tasks: Many tasks (Sokoban, maze navigation, key-door puzzles) are standard in the RL literature. The novelty lies in the unified framework rather than the tasks themselves.
6. Model availability: The paper references GPT-5 and Qwen3.5, which as of mid-2025 appear to be future/unreleased models, raising questions about reproducibility and verification of results.
Overall Assessment
Agentick makes a solid engineering and systems contribution to AI agent evaluation. It fills a genuine gap in cross-paradigm comparison and provides useful infrastructure. The experimental findings, while preliminary, are informative. The main concerns are the limited scope of gridworld environments, incomplete evaluation coverage, and questions about whether the benchmark's abstractions capture the capabilities that matter most for real-world agent deployment. The paper is well-written and clearly positioned, though the contribution is more infrastructural than scientifically novel.
Generated May 11, 2026
Comparison History (17)
Paper 2 likely has higher impact because a unified, Gymnasium-compatible benchmark with diverse tasks, modalities, reference policies, datasets, and a leaderboard can rapidly shape community evaluation norms, enable reproducible comparisons across RL/LLM/VLM/hybrid agents, and directly support both research and engineering workflows. Its applications are immediate and broad across agent learning, alignment, and foundation-model post-training. Paper 1 is theoretically novel and valuable for understanding limits of model-based planning and reward hacking analogies, but its impact may be narrower and slower to propagate than a widely adopted benchmark infrastructure.
Paper 2 (Agentick) likely has higher scientific impact due to broader, field-spanning utility: a unified, Gymnasium-compatible benchmark with tasks, modalities, oracle policies, datasets, harness, and leaderboard can become shared infrastructure for comparing RL/LLM/VLM/hybrid/human agents, enabling reproducibility and accelerating progress across many labs. Its methodological rigor (procedural tasks, standardized interface, large evaluation) and timeliness (need for fair cross-paradigm evaluation) strengthen impact. Paper 1 is novel and useful for embodied learning, but is narrower (UE5 environment generation) and heavier to adopt, limiting breadth.
Paper 2 likely has higher impact due to broader applicability and community utility: a unified, Gymnasium-compatible benchmark spanning RL/LLM/VLM/hybrids with procedural tasks, multiple modalities, oracle policies, datasets, harness, and leaderboard can become shared infrastructure for many subfields and directly supports training and evaluation workflows. This increases real-world adoption and timeliness given current interest in general agents and sequential decision-making. Paper 1 is novel and valuable (implicit memory constructs for LLMs) but is narrower in scope and may influence a more specialized slice of evaluation research.
Paper 2 (Agentick) has higher potential scientific impact because it addresses a critical infrastructure gap in AI agent research by providing a unified benchmark spanning RL, LLM, VLM, and hybrid agents. Benchmarks historically drive entire research communities (e.g., ImageNet, GLUE). Its broad scope across 37 tasks, multiple modalities, and agent paradigms positions it to influence multiple subfields simultaneously. Paper 1 provides valuable mechanistic insights into instruction-following but has narrower scope—it's primarily an interpretability study with findings that, while interesting, are less likely to reshape research agendas.
Paper 1 introduces a unified benchmark bridging RL and foundation model agents, filling a critical gap in AI evaluation. Benchmarks traditionally drive widespread progress and adoption across the AI community, offering foundational infrastructure for future research. While Paper 2 presents a valuable tool for domain scientists, Paper 1's broader scope, methodological rigor, and potential to unify disparate subfields of AI give it a higher estimated scientific impact.
Paper 2 introduces a comprehensive, unified benchmark for a rapidly growing and broadly applicable field (sequential decision-making AI agents). Benchmarks typically generate high citation counts and drive progress across multiple subfields of AI (RL, LLMs, VLMs). While Paper 1 is highly valuable for clinical AI, its impact is constrained to the healthcare domain, making Paper 2's potential breadth of impact across the wider AI community significantly larger.
Paper 1 introduces a comprehensive, unified benchmark bridging RL and foundation model agents, addressing a critical gap in evaluating general sequential decision-making. Standardized benchmarks often have massive, foundational impact as they drive community progress and serve as necessary infrastructure. While Paper 2 presents an innovative and rigorous approach to multi-hop RAG using code execution, it represents a specific methodological advancement in a narrower subfield, whereas Paper 1's broad applicability and timely infrastructure will likely yield a wider scientific footprint and higher citation count.
Paper 2 likely has higher scientific impact due to a more novel algorithmic contribution (a new game-theoretic, optimization-driven alignment framework with theoretical guarantees) directly targeting a timely, high-stakes problem: multi-preference/value alignment of LLMs for deployment. If validated, PLC could be adopted broadly across alignment, RLHF/RLAIF, and multi-objective optimization. Paper 1 is valuable infrastructure (a unified benchmark) with clear applications, but benchmarks often have more incremental impact and risk being superseded unless they become a dominant standard; Paper 2’s methodological innovation is more transferable across tasks and fields.
Paper 1 introduces a highly anticipated, unified benchmark bridging RL and foundation model agents. Benchmarks historically drive significant empirical progress in AI. Its broad applicability across LLMs, VLMs, and RL, combined with extensive empirical validation (90,000 episodes), gives it a wider and more immediate scientific impact compared to Paper 2, which presents a domain-specific, theoretical framework for clinical AI without concrete empirical scaffolding.
Agentick addresses a more fundamental and broadly applicable problem—unified benchmarking across diverse agent paradigms (RL, LLM, VLM, hybrid)—with substantial infrastructure (37 tasks, 6 categories, Gymnasium API, SFT datasets, leaderboard, 90K+ episodes). Its findings that no single approach dominates and that reasoning harnesses multiply performance 3-10x provide actionable insights for the large agent research community. MirrorBench is creative and novel in bridging psychology with MLLMs, but targets a narrower capability (self-recognition) with less immediate practical utility and a smaller research audience.
Agentick addresses a fundamental gap in AI agent research by providing a unified benchmark that enables fair comparison across RL, LLM, VLM, and hybrid agents—something no existing benchmark offers. Its breadth (37 tasks, 6 capability categories, 5 observation modalities, 27 configurations, 90K+ episodes) and infrastructure (leaderboard, SFT datasets, oracle policies) position it as community-wide research infrastructure. Key findings like no single paradigm dominating and ASCII outperforming natural language are broadly impactful. While STEP-HRL offers a solid methodological contribution to hierarchical RL for LLM agents, Agentick's potential to shape evaluation standards across the entire agent research community gives it greater breadth of impact.
Paper 1 presents a novel, comprehensive benchmark that unifies the evaluation of diverse agent paradigms (RL, LLMs, VLMs) in sequential decision-making. Given the explosive growth in AI agent research, a standardized, multi-modal evaluation framework like Agentick addresses a critical bottleneck, enabling fair comparisons and driving future development. Paper 2 provides a valuable survey and taxonomy of abductive reasoning, but surveys typically have less transformative methodological impact than foundational benchmarks that provide new empirical infrastructure and training grounds for the broader AI community.
Paper 2 likely has higher scientific impact: it introduces a unified, extensible benchmark with broad applicability across RL, LLM/VLM, hybrid, and human agents, enabling standardized comparison and accelerating progress via shared infrastructure (tasks, APIs, oracle policies, datasets, leaderboard). This can influence many subfields (agent evaluation, RL, foundation model post-training, multimodal learning) and is timely given rapid agent development. Paper 1 is rigorous and societally relevant, but its contribution is narrower (fraud-advice robustness) and less likely to become a widely reused community resource.
Paper 1 likely has higher impact: a unified, Gymnasium-compatible benchmark with diverse tasks, modalities, datasets, reference policies, and a leaderboard can become shared infrastructure, shaping evaluation norms and enabling broad, reproducible progress across RL and foundation-model agent research. Its applications span training, post-training, and comparative assessment, with immediate relevance to autonomous agents. Paper 2 is novel and rigorous, offering a clean measurement theory for commitment timing, but is narrower in scope and nearer-term impact (diagnostics/interpretability) than a widely adoptable benchmark platform.
Agentick addresses a fundamental gap in AI agent research by providing a unified benchmark spanning RL, LLM, VLM, and hybrid agents across multiple modalities and capability categories. Its breadth of impact across the entire agent research community, combined with its multi-paradigm evaluation revealing that no single approach dominates, makes it more broadly impactful. SREGym, while valuable, targets a narrower domain (SRE). Agentick's infrastructure for RL post-training of foundation models and cross-paradigm comparison positions it to influence multiple research directions simultaneously.
Paper 1 introduces a comprehensive, unified benchmark for evaluating diverse AI agent paradigms (RL, LLMs, VLMs). In AI research, well-designed benchmarks often drive the direction of future work and accumulate high citations due to their broad utility. Paper 2 presents a novel multi-agent approach for combinatorial optimization, which is valuable but addresses a significantly narrower subfield. The foundational nature and broader applicability of Paper 1 give it higher potential scientific impact.
Paper 2 (Agentick) addresses a broader, more fundamental challenge—unified benchmarking across RL, LLM, VLM, and hybrid agents for sequential decision-making—with potential impact across multiple AI subfields. Its comprehensive infrastructure (37 tasks, multiple modalities, leaderboard, training datasets) positions it as a community resource that could drive sustained research. Paper 1, while important for healthcare robotics safety, is more narrowly scoped to one application domain and primarily identifies problems rather than advancing methodology. Paper 2's cross-paradigm findings (e.g., reasoning harness multiplying performance 3-10x, ASCII outperforming NL) offer more actionable insights for the broader AI community.