GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel, Alexandre Rezende, Steven McClain

May 18, 2026

arXiv:2605.18663v1

cs.AI (primary) · cs.CL · cs.LG

#482 of 2292 · Artificial Intelligence

Tournament Score: 1474 ± 43 (scale 1050 to 1800)

Win Rate: 63%

Rating: 6.8/10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, the majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public-private split provides a built-in contamination diagnostic. We calibrate a continuous-response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GIM — Evaluating Models via Tasks that Integrate Multiple Cognitive Domains

1. Core Contribution

GIM introduces a benchmark of 820 expert-authored problems where difficulty derives from *integration* — the need to coordinate multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge. This positions GIM in a deliberate middle ground between knowledge-heavy benchmarks (GPQA, HLE) and abstract reasoning benchmarks (ARC-AGI). The paper makes four concrete contributions: the dataset itself, a calibrated 2PL IRT model over >200k prompt-response pairs, a 47-configuration leaderboard, and what the authors claim is the most extensive published study of test-time compute vs. model capability on a fixed benchmark.

The conceptual framing is compelling: the authors argue that difficulty should come from *what you do with information* rather than *how obscure the information is*. The example problems are genuinely creative — the modified wolf-goat-cabbage puzzle that invalidates the textbook solution, the anachronistic ZIP code letter, and the audience calibration problem where models patronize a peer with a chemistry PhD — each illustrate failure modes that standard benchmarks don't capture.

2. Methodological Rigor

IRT Framework: The adoption of a continuous-response 2PL IRT model is well-motivated and carefully executed. The closed-form WLS scorer (Equation 2) elegantly handles missing data, a real and underappreciated problem when frontier models at high thinking budgets time out more frequently. The LOMO (leave-one-model-out) validation showing recovery within 0.087 logits (median 0.030) is convincing evidence of calibration stability. The released item bank enables future models to be scored without recalibration.
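To make the missing-data point concrete, here is a minimal sketch of how a closed-form weighted least squares ability estimate can be derived for a continuous-response 2PL model. It is not the paper's Equation 2, which is not reproduced in this assessment. It assumes the expected score on item i is sigmoid(a_i(θ - b_i)), logit-transforms each observed score, and solves the resulting linear least-squares problem for θ. The function name `wls_theta`, the clipping constant `eps`, and the uniform default weights are illustrative assumptions.

```python
import math


def logit(p, eps=1e-3):
    """Log-odds of a score in [0, 1], clipped away from 0 and 1 (eps is an assumption)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def wls_theta(scores, a, b, weights=None):
    """Closed-form WLS ability estimate under an assumed continuous-response 2PL.

    scores  : observed scores in [0, 1]; None marks a missing response
    a, b    : item discrimination and difficulty parameters
    weights : optional per-item weights (uniform if omitted)

    Minimizes sum_i w_i * (logit(y_i) - a_i * (theta - b_i))**2 over theta.
    """
    num = 0.0
    den = 0.0
    for i, y in enumerate(scores):
        if y is None:  # missing items drop out of both sums
            continue
        w = 1.0 if weights is None else weights[i]
        num += w * a[i] * (logit(y) + a[i] * b[i])
        den += w * a[i] ** 2
    if den == 0.0:
        raise ValueError("no usable responses")
    return num / den
```

Because missing responses drop out of both sums, the estimate is computed from whichever items completed rather than being pulled down by items that never returned an answer.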

Scoring Design: The rubric-decomposed scoring (median 6 criteria, 528/820 problems) with confidence-weighted aggregation is a significant improvement over binary grading. The cross-judge consistency check (Pearson r=0.922 per prompt, Cohen's κ=0.815 per criterion between Gemini Flash and GPT 5.4) is reassuring but limited — it covers only 5 of 22 models and uses proprietary judges. The authors acknowledge this forthrightly.
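For readers who want to run the same kind of cross-judge check, the two agreement statistics quoted above can be computed as in the following sketch. It uses the standard definitions of Pearson's r and Cohen's κ; the function names and the toy judge labels in the usage note are illustrative and are not data from the paper.

```python
import math
from collections import Counter


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_o = sum(1 for p, q in zip(labels_a, labels_b) if p == q) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```

As a toy usage example, `cohen_kappa([1, 1, 0, 0, 1, 0, 1, 1], [1, 1, 0, 1, 1, 0, 0, 1])` has observed agreement 6/8 and chance agreement 34/64, giving κ = 7/15 ≈ 0.467.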

Weaknesses in rigor: The reliance on a single proprietary LLM judge (gemini-3-flash-preview) is the most significant methodological concern. While the spot-check with GPT 5.4 shows rank preservation, absolute scores shift by ~4-5pp, meaning the θ scale is judge-dependent. The 292 exact-answer problems (36%) lack rubrics, and the authors themselves note this as a retrospective regret. The centaur study is underpowered (15 top operators, 267 pairs for the "Top" group) and conflates picking among LLM outputs with supplying independent reasoning.

3. Potential Impact

Benchmarking methodology: GIM's most transferable contribution may be its IRT framework rather than the problems themselves. Demonstrating that IRT correctly handles missing data from infrastructure failures — where raw means would rank GPT 5.4 X-High below High due to 2.3% failure rates — addresses a practical problem that every evaluation group faces but few discuss publicly.
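The arithmetic behind this distortion is simple when failed runs are scored as zero. The sketch below uses made-up mean scores (0.70 and 0.71 on completed items), not values from the paper; only the 2.3% failure rate comes from the text above, and treating failures as zeros is an assumption of this sketch.

```python
def raw_mean_with_failures(mean_on_completed, failure_rate):
    """Raw mean over all items when failed items are scored as zero (assumption)."""
    return mean_on_completed * (1.0 - failure_rate)


# Hypothetical configurations: the stronger one fails on 2.3% of items.
high = raw_mean_with_failures(0.70, 0.0)
x_high = raw_mean_with_failures(0.71, 0.023)

# Despite the better score on completed items, x_high ranks below high on raw mean.
ranks_lower_on_raw_mean = x_high < high
```

A scorer that estimates ability only from completed items, as IRT-based scoring does, avoids this inversion.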

Test-time compute analysis: The finding that within-family configuration choices (thinking budget, quantization) matter as much as model selection is practically important for deployment decisions. The 35-configuration sweep across 11 models with diminishing returns at high thinking budgets provides actionable guidance.

Contamination diagnostics: The public-private split with statistical monitoring (flagging |θ_public - θ_private| > 2×SE_Δ) is a clean, reproducible approach to contamination detection.
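The flagging rule can be sketched in a few lines. The version below assumes the public and private ability estimates are independent, so that SE_Δ is the root sum of squares of the two standard errors; the paper may define SE_Δ differently, and the function name and return shape are illustrative.

```python
import math


def contamination_flag(theta_public, se_public, theta_private, se_private, k=2.0):
    """Flag a model when |theta_public - theta_private| exceeds k * SE_delta.

    SE_delta is computed as sqrt(se_public**2 + se_private**2), which assumes
    the two estimates are independent (an assumption of this sketch).
    Returns (flagged, delta, se_delta).
    """
    delta = theta_public - theta_private
    se_delta = math.hypot(se_public, se_private)
    return abs(delta) > k * se_delta, delta, se_delta
```

A positive flagged delta (public ability above private) is the direction one would expect from training-set leakage of public items; a negative one more likely indicates noise or a mismatch between the splits.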

Limitations to impact: At 820 problems, the benchmark is modest in scale. The English-only, single-turn, model-only evaluation scope limits generalizability. The problems, while creative, are still static — they will eventually saturate, and the authors' maintenance plan is aspirational rather than concrete.

4. Timeliness & Relevance

The paper directly addresses benchmark saturation, which is arguably the central methodological crisis in LLM evaluation. The timing is good — with frontier models approaching ceiling on MMLU, GPQA showing signs of contamination, and ARC-AGI-3 measuring something most practitioners don't care about, there is genuine demand for benchmarks that test practical reasoning without gating on expertise. The test-time compute analysis is particularly timely as providers increasingly expose thinking-budget controls.

5. Strengths & Limitations

Key Strengths:

Problem design philosophy is genuinely novel — the example problems (epistemic vigilance, audience calibration, anti-memorization constraint modification) test failure modes that existing benchmarks systematically miss

IRT framework is well-calibrated and practically useful, with released item parameters enabling zero-cost future scoring

~9,000 person-hours of expert authoring effort produces high-quality, original problems

Transparent compute accounting (~3.6B tokens, ~10²¹ FLOPs) and thorough AI usage disclosure

The observation that configuration choices rival model selection in importance is empirically novel at this scale

Notable Weaknesses:

Single proprietary judge dependency undermines reproducibility claims

The "integration" framing, while appealing, lacks formal operationalization — how much integration density does each problem actually require? No metric quantifies this

Category-level θ stability (within 0.2-0.3 logits of Overall, median Spearman ρ=0.96) somewhat undermines the claim that GIM tests separable cognitive dimensions — it may instead measure a single general factor

The 28% multimodal coverage is thin (no audio/video), and PDF problems may test document parsing more than reasoning

No comparison against existing benchmarks on the same models to validate that GIM captures something GPQA/ARC-AGI misses

Missing Analysis: The paper would benefit from a factor analysis of the response matrix to determine whether the seven categories are empirically distinguishable or collapse to a single dimension, and from ablation studies showing that "integration" problems are harder than their decomposed components.
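A first-pass version of the suggested dimensionality check is to ask how much variance the leading principal component of the category-score correlation matrix explains. The sketch below is a crude stand-in for a full factor analysis: it uses plain power iteration, with one list of per-model scores per category as input. The function names and input layout are assumptions, and it presumes every column has nonzero variance.

```python
import math


def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length score columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    k = len(columns)
    return [
        [sum(x * y for x, y in zip(centered[i], centered[j])) / (norms[i] * norms[j])
         for j in range(k)]
        for i in range(k)
    ]


def first_factor_share(columns, iters=500):
    """Share of total variance captured by the leading eigenvector (power iteration)."""
    corr = correlation_matrix(columns)
    k = len(corr)
    # Non-uniform start vector, so the common case of a sign-balanced
    # leading eigenvector is not missed.
    v = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iters):
        w = [sum(corr[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the leading eigenvalue; the trace of a correlation matrix is k.
    lam = sum(v[i] * sum(corr[i][j] * v[j] for j in range(k)) for i in range(k))
    return lam / k
```

A share close to 1 would support the single-general-factor reading, while a share near 1/k would indicate that the categories carry largely independent information.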

Overall Assessment

GIM is a well-executed, practically useful benchmark contribution with a compelling design philosophy and strong psychometric infrastructure. Its primary impact will likely be methodological (IRT scoring, test-time compute analysis, contamination diagnostics) rather than as a lasting evaluation standard, given the inherent impermanence of static benchmarks. The single-judge dependency and lack of formal integration metrics are real limitations, but the overall quality of execution — from problem authoring to statistical analysis — is high.

Rating: 6.8/10

Significance 6.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Generated May 19, 2026

Comparison History (24)

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in AI development: the saturation of LLM benchmarks. By introducing a novel evaluation paradigm based on cognitive integration and providing a rigorous IRT-calibrated framework, it directly tackles current evaluation flaws. Furthermore, its extensive study on test-time compute scaling is highly timely and relevant to recent advancements in reasoning models. While Paper 2 offers a valuable methodological improvement for RL, Paper 1's comprehensive benchmark and insights into test-time compute are likely to have a broader and more immediate impact across the AI research community.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact due to broad, timely relevance to AI evaluation, with immediate applicability across many labs and model developers. Its contributions (a new multi-domain benchmark, public/private contamination check, large-scale dataset of >200k runs, and IRT-based ability calibration robust to missing/noisy results) can reshape evaluation practice and be reused across domains. Paper 1 is methodologically strong and practically valuable for wastewater control, but its impact is narrower (specific industrial process) and may see slower diffusion outside process systems/controls.

vs. Latent Action Reparameterization for Efficient Agent Inference

gemini-3.1 · 5/20/2026

While Paper 1 provides a rigorous and much-needed evaluation framework for LLMs, Paper 2 tackles a critical bottleneck in the deployment of LLM agents: inference cost and long decision horizons. By introducing a learned latent action space, LAR offers a fundamental algorithmic improvement that significantly reduces compute requirements and wall-clock time. This approach has broad, immediate real-world applicability for scaling autonomous agents, giving it a higher potential for transformative scientific and practical impact.

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/20/2026

Paper 1 introduces GRAM, a novel framework that fundamentally extends recursive reasoning models with probabilistic multi-trajectory computation, combining generative modeling with iterative latent reasoning. This represents a meaningful architectural innovation with broad implications for neural reasoning systems, inference-time scaling, and multi-solution problems. Paper 2, while methodologically rigorous with its IRT-based evaluation framework, is primarily a benchmark contribution. Benchmarks have shorter shelf lives and narrower impact compared to new computational frameworks. GRAM's theoretical contributions (variational inference for recursive reasoning, latent-space branching) are more likely to inspire follow-on research across multiple subfields.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

gpt-5.2 · 5/20/2026

Paper 1 likely has higher impact: it introduces a novel, rigorously constructed benchmark (820 expert-authored items, rubric-based scoring, public/private contamination check) plus a calibrated IRT evaluation framework over >200k responses, addressing a widely shared bottleneck as LLM benchmarks saturate. Its methodology is stronger and more generalizable, with broad relevance to ML evaluation, psychometrics, and model development. Paper 2 is timely and application-oriented, but evidence is narrower (25-topic benchmark) and impact may be more system-specific and sensitive to implementation details.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical and fundamental bottleneck in clinical AI—transitioning from static, confounded risk prediction to causal, intervention-aware dynamic trajectories. Its unified framework bridges deep learning, causal inference, and policy evaluation, offering profound implications for safe, real-world healthcare applications. While Paper 2 provides a valuable and timely LLM benchmark, Paper 1's potential to fundamentally change clinical decision-making and patient outcomes represents a deeper scientific and societal impact.

vs. Voices in the Loop: Mapping Participatory AI

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical bottleneck in current AI research—LLM evaluation and benchmark saturation—by introducing a novel integration-based benchmark and rigorous Item Response Theory modeling. Its extensive study on test-time compute scaling is highly relevant to the latest trends in foundation models. While Paper 1 provides a valuable resource for AI governance, Paper 2's methodological rigor, timeliness, and immediate applicability to core AI model development give it a significantly higher potential for broad scientific impact.

vs. Budget-Efficient Automatic Algorithm Design via Code Graph

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical and universal bottleneck in AI research: the saturation of LLM benchmarks and the evaluation of test-time compute. By introducing a rigorously calibrated benchmark (GIM) that measures integrated cognitive reasoning rather than mere memorization, it provides a foundational tool for the broader AI community. While Paper 1 offers a novel approach to automated algorithm design, Paper 2 has much broader applicability, higher timeliness, and will likely serve as a standard evaluation metric for future reasoning models, leading to a significantly higher citation count and scientific impact.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a novel, broadly applicable evaluation paradigm (integration across cognitive domains) with strong methodological rigor (rubric-based scoring, public/private contamination checks, and calibrated 2PL IRT over >200k responses). Its framework can influence benchmarking practice across many tasks/models and is timely given benchmark saturation and contamination concerns. Paper 1 is valuable and reproducible, but is narrower (chess-domain critique + verifier-in-the-loop for a well-defined domain) with more limited cross-field reach compared to a general evaluation methodology.

vs. Finite-Time Analysis of MCTS in Continuous POMDP Planning

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: it introduces a new evaluation paradigm for LLMs with a sizable, expert-authored benchmark, contamination diagnostics, rubric-based scoring, and calibrated IRT modeling over large-scale data. This can directly influence how the field measures progress, enable standardized comparisons, and inform deployment choices across many domains. Paper 1 is methodologically rigorous and novel for finite-time guarantees in continuous-observation POMDP MCTS, but its immediate impact is narrower to theoretical RL/planning and may diffuse more slowly into practice.

vs. From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

claude-opus-4.6 · 5/19/2026

Paper 1 (GIM) addresses a fundamental and timely challenge in LLM evaluation—benchmark saturation and contamination—with a novel integration-based difficulty paradigm, rigorous IRT methodology, and a comprehensive study of test-time compute tradeoffs across 28 models. Its breadth of impact spans the entire LLM evaluation community. Paper 2, while methodologically sound, addresses a narrow domain (Schnapsen card game) with incremental contributions (shallow RL beating search-based bots), limited novelty beyond established RL techniques, and minimal breadth of impact across fields.

vs. CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

claude-opus-4.6 · 5/19/2026

Paper 1 (GIM) addresses a fundamental challenge in LLM evaluation—benchmark saturation and contamination—with a novel integration-based difficulty approach, rigorous IRT methodology, and the most extensive published study of test-time compute tradeoffs. Its 28-model evaluation, contamination diagnostics, and released framework have broad utility across the entire LLM research community. Paper 2 (CyberCorrect) offers a neat cybernetic formalization of self-correction but is narrower in scope, tested on a smaller custom benchmark, and the practical gains (6.2pp improvement) are incremental. GIM's methodological contributions and community-wide relevance give it higher impact potential.

vs. Reasoning Compression with Mixed-Policy Distillation

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a new benchmark emphasizing integrated, grounded cognition with substantial scale, expert-authored items, rubric-based scoring, and a public/private split for contamination checks. Methodologically, it adds calibrated 2PL IRT ability estimation over large multi-model response data, improving robustness and comparability, and provides a broad leaderboard plus a systematic test-time compute study. Its applications span evaluation, model development, and deployment decisions across the field. Paper 1 is useful and timely for efficiency, but is narrower in scope and likely affects fewer subfields.

vs. EXG: Self-Evolving Agents with Experience Graphs

claude-opus-4.6 · 5/19/2026

EXG introduces a novel, principled framework (experience graphs) for self-evolving LLM agents that addresses a fundamental limitation—structuring agent experience for reuse across tasks. It offers both online and offline modes, plug-and-play compatibility, and demonstrates broad applicability across code generation and reasoning tasks. While GIM contributes a well-designed benchmark with IRT modeling and insightful test-time compute analysis, benchmarks tend to have shorter-lived impact as they saturate. EXG's architectural contribution to agent learning has broader potential to influence future agent design and enable cumulative improvement, a key challenge in the field.

vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to a novel, mechanism-explaining security failure mode in multi-agent LLM systems (semantic hijacking + “capability paradox”), supported by large-scale experiments and mediation analysis, and a concrete, high-leverage defense (heterogeneous ensemble verification) with dramatic ASR reduction. It is timely and broadly relevant to deploying agentic systems safely, influencing both security practice and research on alignment/agent architectures. Paper 2 is rigorous and useful infrastructure (benchmark + IRT calibration), but benchmarks are a crowded space and typically yield more incremental, narrower impact than uncovering and mitigating a new systemic vulnerability.

vs. KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact: it proposes a reusable “knowledge infrastructure” that measurably improves agentic operation of process-based Earth system simulators, with strong real-world relevance (climate, water, hazards) and clear application pathways. The scale of evaluation (3,000 trials) plus broad generalization claims (119 models across 14 domains) suggests wide cross-disciplinary utility and timeliness for agentic scientific computing. Paper 1 is novel and methodologically rigorous for LLM evaluation, but its impact is more confined to benchmarking/AI assessment rather than enabling new scientific workflows.

vs. Evaluating Cognitive Age Alignment in Interactive AI Agents

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact: it introduces a broadly applicable, methodologically rigorous evaluation framework (integration-focused tasks, rubric-based scoring, public/private contamination checks) and calibrates a 2PL IRT model on large-scale data to yield robust ability estimates and compute–capability tradeoff analyses. This is timely given benchmark saturation and has wide relevance across LLM evaluation, benchmarking, and capability measurement. Paper 1 is novel in aligning agent performance to child developmental stages, but its impact may be narrower (psychometric/agent-specific) and less immediately generalizable than GIM’s measurement and benchmarking contributions.

vs. Imperfect World Models are Exploitable

gpt-5.2 · 5/19/2026

Paper 2 has higher likely scientific impact: it introduces a formal, general theory connecting reward hacking and model exploitation, proves near-unavoidability results, and derives limits/safe horizons for planning—foundational insights that can influence RL theory, model-based RL, AI safety, and robust decision-making broadly. Paper 1 is timely and useful (a rigorous benchmark + IRT calibration for LLM evaluation) with clear practical applications, but benchmarks tend to have narrower and shorter-lived impact as models and evaluation norms evolve. Paper 2’s theoretical contributions are more field-general and enduring.

vs. Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact: it introduces a broadly applicable evaluation paradigm (integration-focused benchmark) with substantial methodological rigor (expert-authored items, rubric-based scoring, public/private contamination check, and IRT calibration over 200k+ responses across 28 models). Its applications span most LLM research and deployment, affecting benchmarking, model selection, and test-time compute practices across many fields. Paper 2 is timely and practically relevant for clinical imaging, but its contribution appears more incremental (architecture tweaks for missing modalities) and its impact is narrower to a specific task/datasets and domain.

vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

claude-opus-4.6 · 5/19/2026

Paper 1 identifies a novel and important failure mode—temporal memory contamination—in memory-equipped LLM agents, introducing a rigorous evaluation protocol with counterfactual baselines and demonstrating systematic longitudinal safety risks. This addresses a critical gap as memory-augmented agents become widely deployed, with direct implications for AI safety practices and policy. Paper 2 contributes a well-designed benchmark (GIM) with strong psychometric methodology, but benchmarks face saturation and obsolescence. Paper 1's conceptual contribution (treating memory safety as longitudinal) and its practical detection framework have broader and more lasting impact on AI safety research.