TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

Zabir Al Nazi, Shubhashis Roy Dipta

#148 of 2292 · Artificial Intelligence
Tournament Score: 1530±47
Win Rate: 84% · Wins: 16 · Losses: 3 · Matches: 19
Rating: 7.3/10 (Significance, Rigor, Novelty, Clarity)

Abstract

Deploying language models as autonomous agents requires more than per-task accuracy: when an agent faces a queue of problems under a finite token budget, it must decide which to attempt, in what order, and how much compute to commit to each, all before any execution feedback is available. This is the prospective form of metacognitive control studied for decades in human cognition, yet whether language models possess it remains untested. We introduce TRIAGE, an evaluation framework in which a model receives a task pool and a token budget calibrated to its own baseline cost, and commits to a single ordered plan that jointly encodes selection, sequencing, and per-problem allocation. Plans are scored against an oracle with full knowledge of the model's solvability and cost on each problem, yielding a triage efficiency ratio on a common scale. We evaluate frontier and open-source models, with and without reasoning enabled, across competition mathematics, graduate-level science, code generation, and expert multidisciplinary knowledge, and find that current language models exhibit substantial gaps in prospective metacognitive control, revealing a previously unmeasured capability dimension with direct implications for resource-efficient agent deployment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: TRIAGE

Core Contribution

TRIAGE introduces a novel evaluation framework that tests whether LLMs can perform portfolio-level resource planning: given a set of problems and a finite token budget, can a model decide which problems to attempt, in what order, and how many tokens to allocate to each — all before execution begins? The framework is grounded in decades of human metacognition research (Nelson & Narens, 1990) and operationalizes three control functions (selection, allocation, termination) in a prospective, commitment-based protocol. The key design insight is measuring the *joint* optimization across tasks under a shared constraint, rather than per-task confidence or per-task budget estimation in isolation. The triage efficiency ratio η normalizes model performance between a random baseline and a knapsack-optimal oracle, enabling cross-model and cross-domain comparison.
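
The normalization described above can be written compactly. This is a sketch under assumed notation, not the paper's stated definition: S denotes the number of problems solved within budget B, and the subscripts (model, oracle, random) are labels introduced here for illustration.

```latex
\eta \;=\; \frac{S_{\text{model}} - \mathbb{E}\!\left[S_{\text{random}}\right]}
               {S_{\text{oracle}} - \mathbb{E}\!\left[S_{\text{random}}\right]}
```

Under this reading, η = 1 means the plan matches the knapsack-optimal oracle, η = 0 means it does no better than random selection, and negative values mean it does worse than random. The ratio is undefined when the oracle and random scores coincide.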

Methodological Rigor

The framework design is carefully thought through. Several methodological strengths stand out:

Budget calibration: The budget B is calibrated to each model's own baseline cost, ensuring selection pressure is model-appropriate rather than arbitrary. This prevents confounding model capability differences with budget fairness.

Two execution regimes: The unconstrained (advisory) vs. constrained (enforced) regimes cleanly separate monitoring quality from control quality. This distinction reveals that models can sometimes select well but cannot honor their own allocations — a finding that would be invisible under a single regime.
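
The difference between the two regimes can be illustrated with a small scoring sketch. The exact protocol is not spelled out in this review, so the rules below are assumptions: in the advisory regime each attempt consumes its true baseline cost c_i and allocations are only hints, while in the enforced regime generation is cut off at the declared allocation. The names `score_advisory`, `score_enforced`, and `Plan` are hypothetical.

```python
from typing import List, Sequence, Tuple

# An ordered plan: (problem index, tokens the model allocated to it).
Plan = List[Tuple[int, int]]


def score_advisory(plan: Plan, solved: Sequence[bool],
                   cost: Sequence[int], budget: int) -> int:
    """Assumed advisory rule: allocations are ignored and each attempt
    consumes its true cost until the pool budget no longer covers the next item."""
    remaining, score = budget, 0
    for i, _alloc in plan:
        if cost[i] > remaining:
            break  # pool budget exhausted before this item could finish
        remaining -= cost[i]
        score += int(solved[i])
    return score


def score_enforced(plan: Plan, solved: Sequence[bool],
                   cost: Sequence[int], budget: int) -> int:
    """Assumed enforced rule: each attempt is truncated at its allocation,
    so an item counts only if its true cost fits inside the cap."""
    remaining, score = budget, 0
    for i, alloc in plan:
        cap = min(alloc, remaining)
        remaining -= min(cost[i], cap)  # tokens actually spent on this item
        if solved[i] and cost[i] <= cap:
            score += 1
    return score


# Toy pool (illustrative numbers only): both problems are solvable,
# but the plan under-allocates the second one.
solved = [True, True]
cost = [400, 900]
plan = [(0, 500), (1, 600)]
print(score_advisory(plan, solved, cost, budget=1500))  # 2
print(score_enforced(plan, solved, cost, budget=1500))  # 1
```

In this toy case the selection is fine (both items are solvable and affordable), yet the enforced score drops because the second allocation is too small. That is the kind of gap between selecting well and honoring allocations that the two-regime design is meant to expose.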

Oracle and random references: The normalization between oracle (0-1 knapsack optimal) and random (uniform-without-replacement) baselines makes η interpretable across different pool difficulties and budget levels.
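
A minimal sketch of the two reference points and the resulting ratio follows. It assumes unit values (v_i = 1), integer token costs, and a random baseline that attempts items in a uniformly shuffled order until the next item no longer fits. These rules and the function names are illustrative assumptions, not the paper's implementation.

```python
import random
from typing import Sequence


def oracle_score(solved: Sequence[bool], cost: Sequence[int], budget: int) -> int:
    """0-1 knapsack with unit values: max number of solvable items within budget."""
    # best[b] = most problems solvable using at most b tokens
    best = [0] * (budget + 1)
    for y, c in zip(solved, cost):
        if not y or c > budget:
            continue  # unsolvable or unaffordable items never help the oracle
        for b in range(budget, c - 1, -1):  # descending: each item used at most once
            best[b] = max(best[b], best[b - c] + 1)
    return best[budget]


def random_score(solved: Sequence[bool], cost: Sequence[int], budget: int,
                 trials: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the expected score of a uniformly random order."""
    rng = random.Random(seed)
    idx = list(range(len(cost)))
    total = 0
    for _ in range(trials):
        rng.shuffle(idx)
        remaining = budget
        for i in idx:
            if cost[i] > remaining:
                break  # assumed rule: stop at the first item that does not fit
            remaining -= cost[i]
            total += int(solved[i])
    return total / trials


def triage_efficiency(model: float, oracle: float, rand: float) -> float:
    """Position of the model score between random (0) and oracle (1)."""
    if oracle == rand:
        raise ValueError("oracle and random scores coincide; ratio is undefined")
    return (model - rand) / (oracle - rand)


# Toy pool (illustrative numbers only).
solved = [True, True, False, True]
cost = [3, 5, 2, 4]
best = oracle_score(solved, cost, budget=7)
base = random_score(solved, cost, budget=7)
print(best, base, triage_efficiency(1, best, base))
```

The dynamic program runs in O(n · B) time, which is cheap for 30-item pools. With unit values a greedy pass over solvable items in ascending cost order gives the same oracle score; the knapsack form is kept here because it extends to heterogeneous task values.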

Robustness checks: The prompt sensitivity ablation (three framings, Kendall's τ ≥ 2/3 across all pairwise comparisons) and the complementary regret metric R̃ address potential concerns about fragility. The budget-aware re-solve experiment (Appendix C.1) provides behavioral evidence that models genuinely cannot honor self-set budgets.
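
For readers unfamiliar with the statistic, Kendall's τ compares two rankings by counting concordant and discordant pairs. The sketch below implements the tau-a variant in plain Python. What exactly is ranked in the ablation is not stated in this review, so the example uses neutral rank vectors rather than any reported numbers.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs.
    Tied pairs count as neither concordant nor discordant."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        sign = (a[i] - a[j]) * (b[i] - b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Four items ranked under two hypothetical framings; one adjacent pair is swapped.
framing_a = [1, 2, 3, 4]
framing_b = [1, 3, 2, 4]
print(kendall_tau(framing_a, framing_b))
```

With four items and a single swapped adjacent pair, five of the six pairs are concordant and one is discordant, so τ = (5 − 1)/6 = 2/3.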

However, there are notable methodological limitations:

  • Ground truth (y_i, c_i) is measured from a single run at temperature 0. While this makes outcomes approximately deterministic, items near the capability frontier could flip, and no bootstrap confidence intervals are reported for η across pools.
  • The 30-problem pool size is pragmatic but small; portfolio optimization over 30 items may not stress-test the combinatorial aspects heavily.
  • The paper uses uniform point values (v_i = 1), which is well-motivated to prevent difficulty leakage, but real deployment often involves heterogeneous task values — an untested dimension.
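
The missing uncertainty estimate noted in the first bullet could be obtained with a percentile bootstrap over pools. The sketch below assumes one η value per pool is available and resamples pools with replacement. The function name, the default of 5000 resamples, and the input values are illustrative assumptions.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(values: Sequence[float], n_boot: int = 5000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of per-pool values."""
    if not values:
        raise ValueError("need at least one value")
    rng = random.Random(seed)
    n = len(values)
    # Resample pools with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Hypothetical per-pool eta values, for illustration only.
etas = [0.40, 0.50, 0.60, 0.55, 0.45]
print(bootstrap_ci(etas))
```

This captures only variation across pools. It does not address the single-run ground truth, which would need repeated baseline runs per (model, problem) pair.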

Potential Impact

    Direct practical relevance: The paper addresses a concrete deployment problem. Agentic LLM systems (coding agents, research assistants, multi-step planners) routinely face portfolio-level allocation decisions. The finding that extended reasoning improves accuracy without improving triage efficiency has immediate implications: investing in chain-of-thought doesn't solve the resource allocation problem and may worsen it by producing longer traces that exceed self-declared budgets.

    Benchmark contribution: TRIAGE fills a genuine measurement gap. Existing work tests per-task confidence calibration (Kadavath et al., 2022), per-task abstention (Kirichenko et al., 2025), or per-task budget estimation (Han et al., 2025; Li et al., 2025). None tests the joint portfolio optimization. This construct is both novel and well-grounded in cognitive science.

    Training signal potential: While the authors wisely caution against Goodhart's-law-style optimization, the η metric could inform training of better meta-level controllers, either as a reward signal or as an evaluation target for routing/scheduling modules.

    Cross-disciplinary bridge: The formal connection to Nelson & Narens' metacognitive framework brings cognitive science constructs into LLM evaluation in a falsifiable, operational way, potentially stimulating bidirectional research.

    Timeliness & Relevance

    This work is highly timely. The scaling of test-time compute (Snell et al., 2025), the documented overthinking/underthinking problems in reasoning models (Chen et al., 2024; Wang et al., 2025b), and the analysis paralysis in agentic systems (Cuadron et al., 2025) all point to resource allocation as a critical unsolved problem. The paper arrives at a moment when the field is transitioning from "can the model solve this?" to "can the model manage its own compute wisely?" — a question with increasing economic importance as inference costs scale.

    Strengths

    1. Well-defined construct: The separation of prospective metacognitive control from monitoring and knowledge is clean and operationally grounded.

    2. Comprehensive evaluation: 20 model architectures, 4 domains, 4 budget levels, 2 execution regimes, with/without reasoning — this is a thorough initial characterization.

    3. Key dissociation findings: The empirical dissociation between object-level capability and metacognitive control, the advisory-vs-enforced gap, the failure of extended reasoning to improve triage, and the reduced unsolvable detection in reasoning-trained models are all novel and practically important findings.

    4. Unsolvable injection probe: Testing whether models recognize infeasible items adds ecological validity and connects to abstention literature.

    5. Theoretical grounding: The cognitive science framing is not decorative — it structures the evaluation design in principled ways.

    Limitations

    1. Single-shot ground truth: One baseline run per (model, problem) pair introduces noise for borderline items.

    2. No dynamic/interactive regime: The exclusively prospective design is well-motivated for construct isolation but limits conclusions about real deployment where some replanning is possible.

    3. No training data contamination control: For closed-source models, benchmark contamination cannot be ruled out.

    4. Pool size constraints: 30-problem pools may not capture scaling behavior of the combinatorial optimization.

    5. Missing analysis: No investigation of *what makes models fail* at the planning level — e.g., do they systematically overestimate their abilities on hard problems, or misestimate token costs? Decomposing the error source would strengthen diagnostic value.

    6. Reproducibility concerns: Closed-model APIs may change, and some model identifiers (GPT-5 Mini, GPT-OSS) suggest very recent or pre-release systems whose behavior may not be stable.

    Overall Assessment

    TRIAGE makes a well-motivated, carefully designed contribution to LLM evaluation that identifies a genuine capability gap with practical consequences. The construct is novel, the evaluation is comprehensive, and the findings — particularly the dissociation between accuracy and triage quality, and the advisory-enforced gap — are both surprising and actionable. The framework is likely to influence how the community thinks about resource-efficient agent deployment and could become a standard evaluation axis alongside accuracy-focused benchmarks.

    Rating: 7.3/10
    Significance 7.5 · Rigor 7 · Novelty 8 · Clarity 8

    Generated May 14, 2026

    Comparison History (19)

    vs. How Far Are We From True Auto-Research?
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses the highly timely and broadly impactful question of whether AI agents can autonomously conduct research, providing a large-scale systematic evaluation (117 papers, multiple agents, multiple review lenses). Its findings—that manuscript-only review overestimates quality and that critical failure modes like fabricated results persist—have immediate implications for the AI research community, funding agencies, and scientific integrity. While Paper 1 introduces a novel metacognitive evaluation framework with solid methodological rigor, its scope is narrower (resource allocation under token budgets). Paper 2's breadth of impact, timeliness given the auto-research hype cycle, and actionable taxonomy of failure modes give it higher potential scientific impact.

    vs. Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation
    gemini-3.1 · 5/16/2026

    Paper 2 bridges AI, mathematics, and cognitive psychology with rigorous impossibility theorems proving cognitive biases are inevitable in sequential processing. It combines deep theoretical proofs with extensive empirical validation across frontier LLMs and pre-registered human experiments. While Paper 1 offers a highly practical evaluation framework for LLM agents, Paper 2 provides a fundamental scientific contribution that reshapes our understanding of cognition in both humans and machines, giving it a profound interdisciplinary impact.

    vs. Primal-Dual Guided Decoding for Constrained Discrete Diffusion
    claude-opus-4.6 · 5/16/2026

    TRIAGE introduces an entirely new evaluation dimension—prospective metacognitive control in LLMs—that has not been previously measured, opening a novel research direction with broad implications for autonomous agent deployment. It bridges cognitive science and AI evaluation in a timely way as LLM agents proliferate. Paper 2, while technically sound and practically useful, addresses a more incremental problem (constrained decoding for diffusion models) with a standard optimization approach (primal-dual/Lagrangian methods). TRIAGE's novelty in defining and benchmarking a previously unmeasured capability gives it higher potential to influence multiple research communities.

    vs. JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
    gemini-3.1 · 5/16/2026

    Paper 2 introduces a novel, label-free reinforcement learning framework that directly improves LLM reasoning capabilities using formal verifiers. Advancements in RL-based training methodologies for reasoning currently drive significant progress in the field, giving this work broader applicability and higher potential impact than the evaluation framework presented in Paper 1.

    vs. Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
    claude-opus-4.6 · 5/16/2026

    TRIAGE introduces a genuinely novel evaluation dimension—prospective metacognitive control under resource constraints—that has no prior systematic measurement for LLMs, despite being well-studied in human cognition. This addresses a critical gap for real-world agent deployment where budget allocation across tasks is essential. Paper 2 (DBE) proposes adaptive evaluation methodology, which is valuable but more incremental, building on established ideas like item response theory and adaptive testing. TRIAGE's concept of measuring planning and resource allocation metacognition opens a new research direction with broader implications for autonomous AI agents.

    vs. $δ$-mem: Efficient Online Memory for Large Language Models
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher impact because it introduces a broadly applicable evaluation framework and metric for a previously under-measured capability (prospective metacognitive control) directly tied to real-world agent deployment under budgets. Its scope spans multiple domains and model families, creating a common benchmark that can shape future research and system design. Paper 1 is a clever, practical memory mechanism with clear utility, but its impact is narrower (architecture/efficiency for LLM memory) and depends on adoption within specific model pipelines, whereas TRIAGE can influence evaluation standards across the field.

    vs. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge
    gpt-5.2 · 5/16/2026

    Paper 2 introduces a new, broadly applicable evaluation paradigm (prospective metacognitive control under token/compute constraints) with a principled metric (oracle-scored efficiency ratio) and direct relevance to autonomous agent deployment and cost-aware inference—timely as models are increasingly used as agents. Its impact spans ML evaluation, agent systems, HCI/decision-making, and optimization. Paper 1 provides important evidence of label-induced bias in LLM-as-judge and humans, but the core phenomenon (source/label bias) is more incremental and narrower in downstream methodological scope than a new capability dimension and benchmark framework.

    vs. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
    gpt-5.2 · 5/16/2026

    Paper 2 introduces a novel capability dimension—prospective metacognitive control under explicit resource constraints—with a principled oracle-based metric (triage efficiency ratio) and broad, timely relevance to real-world agent deployment where budgets and task queues are ubiquitous. Its evaluation paradigm is likely to generalize across model families and domains and could reshape how the community measures and optimizes agentic behavior. Paper 1 is strong engineering work (unification, large dataset, benchmark conversion) with clear utility, but it is more incremental and likely to become one of several competing tool-use pipelines rather than redefining evaluation.

    vs. MMSkills: Towards Multimodal Skills for General Visual Agents
    gemini-3.1 · 5/16/2026

    Paper 2 introduces a novel evaluation framework for 'prospective metacognitive control,' addressing a critical and previously unmeasured capability in LLM agents: task prioritization under resource constraints. This has broad, field-wide implications for the efficient real-world deployment of autonomous agents, likely sparking extensive follow-up research across various LLM domains.

    vs. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break
    claude-opus-4.6 · 5/16/2026

    TRIAGE introduces a genuinely novel capability dimension—prospective metacognitive control—that has not been previously measured in LLMs, bridging decades of cognitive science research with practical agent deployment concerns. It defines a clean, principled evaluation framework with a well-motivated metric (triage efficiency ratio). Paper 1 (HORIZON) addresses a known problem (long-horizon failures) with a diagnostic benchmark, which is valuable but more incremental. TRIAGE opens a new research direction with clear theoretical grounding and practical implications for resource-constrained deployment, giving it broader and more lasting impact potential.

    vs. RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
    gemini-3.1 · 5/14/2026

    Paper 2 introduces a novel, fundamental evaluation framework for 'prospective metacognitive control' in LLMs, addressing how agents manage finite resources across multiple tasks. This tackles a broad, foundational challenge in autonomous AI agents with implications across all domains. Paper 1 offers a solid technical improvement for tool selection, but its primary focus is narrower (remote sensing agents), making Paper 2's conceptual innovation and broader applicability likely to yield higher overall scientific impact.

    vs. Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
    claude-opus-4.6 · 5/14/2026

    Paper 2 introduces a novel evaluation framework (TRIAGE) that measures a previously untested capability dimension—prospective metacognitive control—in LLMs, with direct implications for autonomous agent deployment under resource constraints. This addresses a timely and practical problem as LLMs are increasingly deployed as agents. Paper 1, while methodologically sound, provides a relatively incremental diagnostic insight (representation vs. reasoning bottleneck) on a specific benchmark. Paper 2 has broader impact potential across multiple fields (AI safety, agent systems, cognitive science), introduces a reusable evaluation paradigm, and addresses a more pressing real-world need for resource-efficient LLM deployment.

    vs. Beyond Stochastic Exploration: What Makes Training Data Valuable for Agentic Search
    gpt-5.2 · 5/14/2026

    Paper 1 is likely higher impact because it introduces a broadly applicable, clearly defined evaluation paradigm (prospective metacognitive control under budgets) with an oracle-normalized metric, revealing an important and under-measured capability dimension for deploying LLM agents. Its applicability spans many domains (math, science, code, knowledge) and directly targets a timely practical constraint (compute/token budgeting). Paper 2 proposes a training method for RL/search agents; while useful, it appears more incremental, less broadly general as a conceptual contribution, and its impact depends on adoption and reproducibility of a specific training pipeline.

    vs. What properties of reasoning supervision are associated with improved downstream model quality?
    gemini-3.1 · 5/14/2026

    Paper 1 introduces a highly novel, conceptually rich framework linking human metacognition to LLM resource allocation. By defining and measuring 'prospective metacognitive control,' it opens a new research direction for autonomous agents, likely influencing both AI evaluation and agentic system design. Paper 2 is practically valuable for data curation but focuses on more established fine-tuning paradigms, giving Paper 1 a broader and more innovative scientific impact.

    vs. MMSkills: Towards Multimodal Skills for General Visual Agents
    gemini-3.1 · 5/14/2026

    Paper 1 pioneers the evaluation of prospective metacognitive control in LLMs, bridging human cognitive theory with practical resource-constrained AI deployment. By formalizing and measuring how agents allocate compute across tasks before execution, it addresses a fundamental bottleneck in autonomous agent design. While Paper 2 presents a valuable framework for multimodal visual skills, Paper 1 introduces a conceptually novel capability dimension with broader theoretical implications and wide-ranging impact across all LLM-based autonomous systems.

    vs. IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
    gpt-5.2 · 5/14/2026

    Paper 1 introduces a broadly applicable, methodologically clear evaluation framework (TRIAGE) for a previously under-measured capability: prospective metacognitive control under resource constraints. It defines an oracle-based metric, calibrates budgets to model costs, and evaluates across diverse, high-impact domains, making results comparable and timely for agentic LLM deployment. This can influence benchmarking, model training, and systems design across AI/ML. Paper 2 is application-driven (innovation/patents) and potentially useful, but appears narrower in scope and more systems-engineering oriented, with less generalizable scientific measurement impact.

    vs. Diversity of Extensions in Abstract Argumentation
    claude-opus-4.6 · 5/14/2026

    Paper 1 introduces TRIAGE, a novel evaluation framework addressing a critical and timely gap in LLM deployment: metacognitive control under resource constraints. It bridges cognitive science with AI, evaluates multiple frontier models across diverse domains, and has direct practical implications for autonomous agent design. Paper 2 contributes a useful theoretical notion of diversity in abstract argumentation with complexity analysis, but addresses a narrower, more established subfield with less immediate broad impact. Paper 1's relevance to the rapidly growing LLM agent ecosystem gives it substantially higher potential impact.

    vs. Cognifold: Always-On Proactive Memory via Cognitive Folding
    gpt-5.2 · 5/14/2026

    Paper 1 is more likely to have higher scientific impact because it introduces a clearly operationalized, broadly applicable evaluation framework for prospective metacognitive control under resource constraints, with an oracle-based metric enabling standardized comparison across models and domains. This targets an immediate, high-relevance gap for real-world agent deployment (token/compute budgeting, task triage) and can become a widely adopted benchmark. Paper 2 is ambitious but relies on less-established constructs (e.g., “cognitive folding,” intent thresholds) and evaluation claims that may be harder to validate and generalize, reducing near-term methodological rigor and adoption.

    vs. Actionable Real-Time Modeling of Surgical Team Dynamics via Time-Expanded Interaction Graphs
    claude-opus-4.6 · 5/14/2026

    TRIAGE introduces a novel evaluation framework for a previously unmeasured capability dimension of LLMs—prospective metacognitive control under resource constraints. This addresses a fundamental and timely challenge in AI agent deployment, with broad implications across all LLM applications. Its breadth of impact spans multiple fields (mathematics, science, coding, knowledge tasks) and establishes a new benchmark paradigm. Paper 2 is valuable but addresses a narrower domain (surgical AI) with incremental advances in graph-based team modeling. Paper 1's contribution to understanding and benchmarking LLM capabilities gives it wider scientific reach and timeliness.