Measuring Progress Toward AGI: A Cognitive Framework

Ryan Burnell, Yumeya Yamamori, Orhan Firat, Kate Olszewska, Steph Hughes-Fitt, Oran Kelly, Isaac R. Galatzer-Levy, Meredith Ringel Morris

#1102 of 2682 · Artificial Intelligence
Tournament Score: 1430±48
Win Rate: 54% (7 wins, 6 losses, 13 matches)
Rating: 5/10

Abstract

Despite widespread discussion of AGI, there is no clear framework for measuring progress toward it. This ambiguity fuels subjective claims, makes it difficult to track progress, and risks hindering responsible governance. As a starting point to address this gap, we present a framework for understanding system capabilities in relation to human cognitive abilities. Drawing from decades of research in psychology, neuroscience, and cognitive science, we introduce a Cognitive Taxonomy that deconstructs general intelligence into 10 key cognitive faculties. We then propose a rigorous evaluation protocol in which a system's performance is measured across a suite of targeted, held-out cognitive tasks, generating a 'cognitive profile' that can be used to understand a system's strengths and weaknesses. We hope this framework will provide a practical roadmap and an initial step toward more rigorous, empirical evaluation of AGI.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Measuring Progress Toward AGI: A Cognitive Framework"

1. Core Contribution

This paper proposes a Cognitive Taxonomy comprising 10 cognitive faculties (Perception, Generation, Attention, Learning, Memory, Reasoning, Metacognition, Executive Functions, Problem Solving, and Social Cognition) as a framework for measuring AI system capabilities relative to human cognition. The taxonomy is grounded in established cognitive science literature, and the paper outlines a three-stage evaluation protocol: (1) cognitive assessment across targeted tasks, (2) collection of human baselines, and (3) construction of "cognitive profiles" mapping system strengths and weaknesses against human population distributions.
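Stage (3) can be made concrete with a small sketch. The Python below places a system's score for each faculty at a percentile of a human baseline sample. The faculty names come from the taxonomy, but the scores, the function names, and the midrank rule for ties are assumptions made for illustration. The paper reports no data and this is not presented as its procedure.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(human_scores, system_score):
    """Percent of the human sample scoring below the system (ties count as half)."""
    ordered = sorted(human_scores)
    below = bisect_left(ordered, system_score)
    ties = bisect_right(ordered, system_score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)


def cognitive_profile(human_baselines, system_scores):
    """Map each faculty to the system's percentile within the human sample."""
    return {
        faculty: percentile_rank(human_baselines[faculty], score)
        for faculty, score in system_scores.items()
    }


# Hypothetical numbers for illustration only, not results from the paper.
human_baselines = {
    "Reasoning": [40, 50, 60, 70, 80],
    "Metacognition": [30, 45, 55, 65, 90],
}
system_scores = {"Reasoning": 75, "Metacognition": 45}

profile = cognitive_profile(human_baselines, system_scores)
print(profile)
```

A profile of this kind shows relative strengths and weaknesses at a glance, but it leaves open how, or whether, the per-faculty percentiles should be combined.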

The core problem addressed is legitimate and important: the field lacks a standardized, comprehensive framework for measuring how AI capabilities compare to human cognition across its full breadth. The proposed solution is essentially a detailed organizational schema paired with methodological guidelines for evaluation.

2. Methodological Rigor

This paper is primarily a position/framework paper rather than an empirical contribution. As such, it should be evaluated on the coherence and completeness of its framework rather than experimental rigor.

Strengths in rigor:

  • The taxonomy is well-grounded in established cognitive science, with extensive citations to foundational literature (e.g., Baddeley on working memory, Tulving on episodic memory, Diamond on executive functions).
  • The evaluation protocol includes thoughtful methodological prescriptions: held-out test sets, independent verification, varied difficulty and format, and human baselines with demographically representative adult samples.
  • The paper acknowledges three important sources of uncertainty (task quality, construct validity, stochasticity) and suggests statistical approaches like Item Response Theory.
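For readers unfamiliar with Item Response Theory, the following is a minimal sketch of a standard two-parameter logistic (2PL) model, in which each task has a discrimination a and a difficulty b, and a test taker has an ability theta. The item parameters, responses, and grid-search estimator are hypothetical choices for illustration. The review does not say which IRT variant or fitting method the paper has in mind.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability of answering an item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of the observed right/wrong responses at ability theta."""
    total = 0.0
    for (a, b), correct in zip(items, responses):
        p = p_correct(theta, a, b)
        total += math.log(p if correct else 1.0 - p)
    return total


def estimate_ability(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate by grid search over [lo, hi]."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical (discrimination, difficulty) pairs and one response pattern.
items = [(1.0, -1.0), (1.2, 0.0), (0.8, 1.0), (1.5, 2.0)]
responses = [True, True, False, False]

theta_hat = estimate_ability(items, responses)
print(round(theta_hat, 2))
```

Because ability is estimated jointly with item difficulty, a model of this kind can in principle place humans and AI systems on a common scale even when they attempt different subsets of tasks, provided the items measure the same construct for both.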
Weaknesses in rigor:

  • The paper provides no empirical validation whatsoever. There are no actual evaluations of any AI system, no pilot studies, and no cognitive profiles generated for existing systems. The illustrative figures show only hypothetical data.
  • The selection of exactly 10 faculties and the specific decomposition (e.g., why "Problem Solving" and "Social Cognition" as composite faculties but not, say, "Communication" or "Navigation") is not rigorously justified beyond appeal to existing literature. The taxonomy is presented as a "starting point," which is appropriate but also deflects criticism.
  • The distinction between the 8 "basic" and 2 "composite" faculties is somewhat arbitrary—many of the basic faculties also involve complex interactions (e.g., working memory is acknowledged to involve memory, attention, and reasoning).
  • The paper does not address how to weight or aggregate across faculties, nor how to handle the deep interdependencies between them in practice.
3. Potential Impact

Positive impact potential:

  • If adopted, this framework could provide a common language for discussing AI capabilities, reducing miscommunication between researchers, policymakers, and the public.
  • The cognitive profiling approach could be genuinely useful for identifying capability gaps in AI systems, informing both development priorities and deployment decisions.
  • The emphasis on human baselines with representative samples addresses a real weakness in current AI evaluation practice, where human comparisons are often ad hoc or non-representative.
  • The framework could facilitate more nuanced governance discussions by replacing binary "is it AGI?" questions with multidimensional capability assessments.
Limitations on impact:

  • Without actual benchmarks, tasks, or empirical validation, this remains aspirational. The paper acknowledges large coverage gaps in areas like metacognition, attention, and social cognition, and states they are "working with the academic community" to build evaluations—but this work is future, not present.
  • The framework may face adoption challenges: it does not clearly differentiate itself from other cognitive architectures or intelligence taxonomies that have been proposed over decades in cognitive science and AI.
  • The practical challenge of creating held-out, contamination-free benchmarks that remain valid over time is enormous and largely unaddressed beyond stating the requirement.
4. Timeliness & Relevance

The paper is highly timely. The rapid advancement of frontier AI systems (GPT-4, Gemini, Claude, etc.) has created an urgent need for rigorous capability assessment frameworks. Policymakers worldwide are developing AI governance frameworks and need grounding for terms like "AGI." The paper directly addresses a current bottleneck in responsible AI development.

    However, the paper arrives in a field already populated with related efforts: the Levels of AGI framework (Morris et al., 2024, from the same group), Chollet's ARC benchmark, Humanity's Last Exam, and various cognitive evaluation proposals. The incremental advance over the group's own prior work is a more detailed cognitive decomposition, but the lack of empirical implementation limits its distinctiveness.

5. Strengths & Limitations

Key Strengths:

  • Comprehensive and well-organized taxonomy with extensive grounding in cognitive science literature
  • Practical evaluation protocol with sensible methodological guidelines
  • Addresses the important issue of system-level (vs. model-level) evaluation
  • Thoughtful discussion of creativity, processing speed, and system propensities as complementary dimensions
  • Honest about limitations and iterative nature of the endeavor
Key Limitations:

  • No empirical content—entirely theoretical/aspirational
  • The taxonomy, while comprehensive, is not novel in cognitive science; the contribution is primarily the application to AI evaluation
  • No concrete operationalization: how would one actually test "sustained attention" or "metacognitive monitoring" in an LLM? The gap between the taxonomy and actual test design is vast and unaddressed
  • The framework is human-centric by design, which the authors acknowledge may miss AI-specific capabilities, but this creates a fundamental tension: are we measuring intelligence or human-likeness?
  • The model vs. system evaluation discussion raises important questions (calculator analogy) but provides no resolution
  • The percentile-based cognitive profiling assumes commensurability between human and AI performance distributions that may not hold
Additional Observations

    The paper reads more as a research agenda or roadmap than a completed scientific contribution. Its value lies primarily in organizing thinking and establishing shared vocabulary. The author list from Google DeepMind gives it institutional weight that may drive adoption regardless of scientific novelty. The extensive appendix taxonomy, while thorough, is essentially a literature review of cognitive science organized into a hierarchy—useful but not groundbreaking.

    Rating: 5/10
    Significance 6.5 · Rigor 4 · Novelty 4 · Clarity 7.5

    Generated May 28, 2026

    Comparison History (13)

    vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical, highly timely challenge with broad implications across AI development, cognitive science, and governance: measuring AGI progress. While Paper 1 offers a strong, empirically validated technical contribution to multimodal reasoning, Paper 2 provides a foundational taxonomy and evaluation protocol that could shape future benchmarking standards, policy-making, and interdisciplinary research, giving it a higher potential for widespread, paradigm-shifting scientific impact.

    vs. SkillGrad: Optimizing Agent Skills Like Gradient Descent
    gpt-5.2 · 5/28/2026

    Paper 1 offers a concrete, novel optimization framework for improving LLM agent skills with demonstrated empirical gains on established benchmarks, suggesting near-term applicability and methodological testability. Its gradient-descent-inspired formulation (diagnostic “text gradients,” momentum memory, structured patching) is an actionable contribution likely to be reused in agent/tooling research and practice. Paper 2 is timely and potentially broad, but as described it is more conceptual and depends heavily on adoption and the eventual quality/validity of proposed tasks and protocols; immediate rigor and measurable impact are less certain from the abstract alone.

    vs. Actionable World Representation
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact because it targets a widely recognized, cross-cutting bottleneck: standardized measurement of progress toward AGI. A cognitive taxonomy plus evaluation protocol is broadly applicable across model classes, benchmarks, and governance contexts, and is timely amid rapid capability gains and policy needs. Its potential real-world applications (evaluation, auditing, safety, regulation, research comparability) span many fields. Paper 1 is technically promising for robotics/physical world modeling, but appears narrower in scope and its impact depends more on empirical rigor and adoption within a specific subcommunity.

    vs. Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental gap in AI research—the lack of a rigorous framework for measuring progress toward AGI. By proposing a cognitive taxonomy grounded in decades of cognitive science research, it offers a novel interdisciplinary contribution with broad impact across AI, cognitive science, and AI governance/policy. Its potential to standardize AGI evaluation could influence the entire field. Paper 2, while practically useful, presents an incremental improvement (applying offline RL to code generation) with narrower scope and less transformative potential.

    vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical and immediate flaw in LLM evaluation: agents relying on intrinsic knowledge rather than actual search. By providing rigorous empirical diagnostics and introducing a dynamic, actionable benchmark (LiveBrowseComp), it offers immediate utility to the highly active field of AI agents. While Paper 1 presents an interesting conceptual framework for AGI, it is primarily theoretical. Paper 2's concrete methodology, dataset, and relevance to current evaluation bottlenecks make it highly likely to see rapid adoption and citations.

    vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a specific, well-defined failure mode in reasoning models with a concrete, testable solution (JTS framework) and demonstrates empirical results. It has immediate practical implications for AI safety in high-risk domains like medical AI. While Paper 1 tackles the important topic of AGI measurement, it is more of a conceptual/taxonomic framework without demonstrated empirical validation. Paper 2's methodological rigor, actionable contributions (reinforcement learning approach, measurable metrics like A@D), and direct relevance to current deployed systems give it higher near-term scientific impact and citability.

    vs. Inference Time Causal Probing in LLMs
    claude-opus-4.6 · 5/28/2026

    Paper 2 presents a concrete, novel technical method (HDMI) with rigorous evaluation, clear improvements over prior methods, and immediate applicability to understanding and controlling LLM internals—a highly active research area. Paper 1 proposes a conceptual framework for measuring AGI progress, which, while timely and important, is more of a position/roadmap paper lacking empirical validation of the framework itself. Paper 2's methodological contribution (probe-free causal interventions with completeness/selectivity metrics) is more likely to be adopted and built upon by the research community.

    vs. Let Relations Speak: An End-to-End LLM-GNN Soft Prompt Framework for Fraud Detection
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental gap in AI research—how to measure progress toward AGI—by proposing a comprehensive cognitive taxonomy and evaluation framework grounded in decades of cognitive science. This has broad, cross-disciplinary impact spanning AI, cognitive science, policy, and governance. Its timeliness is high given the current AGI discourse. Paper 2, while solid applied work combining LLMs and GNNs for fraud detection, is more incremental and domain-specific, with narrower impact limited to fraud detection and graph-based ML methods.

    vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental, widely-discussed challenge—measuring progress toward AGI—by proposing a comprehensive cognitive taxonomy grounded in decades of cognitive science research. Its breadth of impact spans AI, cognitive science, neuroscience, and AI governance/policy, making it relevant to a very large audience. Paper 2, while methodologically rigorous and practically useful, addresses a narrower problem (artifact drift in agent benchmarks for enterprise tasks). Paper 1's framework has greater potential to shape discourse, standardize evaluation across the field, and influence policy, giving it higher estimated scientific impact.

    vs. Position: AI Safety Requires Effective Controllability
    gemini-3.1 · 5/28/2026

    Paper 2 offers a more immediate and actionable scientific impact by shifting the AI safety paradigm from theoretical alignment to practical runtime controllability. It introduces a concrete benchmark (ControlBench) and architectural framework, which are highly likely to drive experimental follow-up work and citations. While Paper 1 provides a valuable conceptual framework for AGI evaluation, Paper 2 addresses urgent, real-world deployment risks of agentic AI with empirical tools and testable methodologies.

    vs. JobBench: Aligning Agent Work With Human Will
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a more fundamental and broadly impactful problem—how to measure progress toward AGI—by proposing a cognitive taxonomy grounded in decades of cognitive science research. This framework has potential to shape how the entire AI field tracks progress, informs governance, and structures evaluation. While Paper 2 (JobBench) makes a valuable contribution with its human-centered benchmark for occupational AI agents, it is more narrowly scoped as one benchmark among many. Paper 1's interdisciplinary foundation and its potential to become a standard framework for AGI evaluation give it greater breadth of impact.

    vs. Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a timely, concrete, and empirically grounded problem—privacy leakage in multi-agent LLM systems—with novel experimental methodology (multi-agent social simulation) and striking quantitative findings (e.g., 8x contagion effect, amplified leakage rates). It directly challenges current safety evaluation paradigms with actionable results relevant to real-world deployments. Paper 1 proposes a conceptual framework for AGI measurement, which is valuable but more speculative, harder to validate, and less immediately actionable. Paper 2's concrete findings and methodological innovation give it broader and more immediate impact across AI safety, policy, and deployment practices.

    vs. Natural Language Query to Configuration for Retrieval Agents
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a fundamental gap in AI research—how to measure progress toward AGI—which has broad implications across AI research, policy, and governance. Its cognitive taxonomy framework could become a widely adopted evaluation standard, impacting multiple fields (AI, cognitive science, policy). Paper 1, while technically sound and practically useful, addresses a narrower optimization problem (retrieval pipeline configuration) with more limited scope. Paper 2's timeliness given current AGI debates and its potential to shape evaluation norms and responsible governance give it substantially broader impact potential.